Ll Oefentherapie

Senior Principal Software Engineer - GPU Cluster Performance and Benchmark Engineer

Ll Oefentherapie, Seattle, WA, United States


We are seeking a highly skilled and experienced Large GPU Cluster Performance and Benchmark Engineer to join our advanced technology team as a Senior Principal. In this role, you will be responsible for designing, optimizing, and benchmarking large-scale GPU clusters, with a specific focus on running MLPerf benchmarks from MLCommons across thousands of NVIDIA and AMD GPUs. You will play a critical role in optimizing performance for both AI/ML and compute workloads and in ensuring efficient storage solutions.

Why Join Us?

  • Be at the forefront of GPU performance benchmarking and large-scale infrastructure design.
  • Opportunity to work with a highly skilled team of engineers, architects, and thought leaders in the AI/ML and HPC space.
  • Competitive salary and benefits package, with opportunities for growth and development.

If you are a highly motivated and experienced professional with a passion for pushing the boundaries of GPU cluster performance, we encourage you to apply and join our dynamic team!

Career Level - IC5


Benchmarking and Performance Optimization:

  • Execute and lead the performance benchmarking of large-scale GPU clusters using MLPerf from MLCommons, ensuring optimal performance across thousands of NVIDIA and AMD GPUs.
  • Conduct end-to-end STAC benchmarks, including STAC-M3, STAC-A2, and STAC-AI, for both compute and storage performance, with a focus on low-latency, high-throughput solutions.

Solution Architecture and Design:

  • Design and architect complex solutions that leverage OCI (Oracle Cloud Infrastructure) services for GPU cluster deployments, high-performance computing (HPC), and large-scale AI/ML workloads.
  • Collaborate with cross-functional teams to develop and implement cutting-edge, high-performance GPU cluster architectures that meet rigorous benchmarks and performance requirements.

Leadership and Collaboration:

  • Serve as a thought leader and subject matter expert in GPU cluster performance optimization, benchmarking standards, and cloud-native AI/ML infrastructure.
  • Mentor and guide junior engineers and provide technical leadership in areas related to GPU performance, benchmarking, and solution architecture.

Continuous Improvement and Innovation:

  • Stay abreast of the latest industry trends, research, and developments in GPU performance and benchmarking tools, HPC, and cloud-native AI/ML infrastructure.
  • Drive innovation by recommending and implementing new approaches, technologies, and tools to enhance the performance of large-scale GPU clusters and benchmarking methodologies.

Required Qualifications:

Experience:

  • Proven experience running MLPerf benchmarks from MLCommons across large-scale environments with thousands of NVIDIA and AMD GPUs.
  • Demonstrated expertise in conducting STAC benchmarks, including STAC-M3, STAC-A2, and STAC-AI, for both compute and storage performance optimization.
  • Extensive experience in GPU cluster architecture, performance tuning, and benchmarking in cloud environments.

Technical Skills:

  • Strong knowledge of GPU architectures (NVIDIA, AMD) and parallel computing.
  • Proficiency with Oracle Cloud Infrastructure (OCI) services and solutions architecture.
  • Expertise in containerization and container orchestration (e.g., Docker, Kubernetes), HPC frameworks, and distributed computing.
  • Strong programming skills in Python, C++, or CUDA, and experience with performance profiling tools.

Soft Skills:

  • Exceptional analytical and problem-solving abilities.
  • Strong communication and collaboration skills with a proven ability to work effectively across diverse teams.
  • High attention to detail, with a strong focus on quality and performance optimization.

Preferred Qualifications:

  • Experience with AI/ML frameworks (TensorFlow, PyTorch, etc.) and their deployment in large-scale cloud environments.
  • Familiarity with other cloud platforms (AWS, Azure, GCP) and hybrid cloud architectures.
  • Active participation in MLCommons, STAC, or other industry consortia or standards groups.