Senior Principal Software Engineer - GPU Cluster Performance and Benchmark Engin
Ll Oefentherapie, Seattle, WA, United States
We are seeking a highly skilled and experienced Large GPU Cluster Performance and Benchmark Engineer to join our advanced technology team as a Senior Principal. In this role, you will design, optimize, and benchmark large-scale GPU clusters, with a specific focus on running MLPerf benchmarks from MLCommons across thousands of NVIDIA and AMD GPUs. You will play a critical role in optimizing performance for both AI/ML and compute workloads and in ensuring efficient storage solutions.
Why Join Us?
- Be at the forefront of GPU performance benchmarking and large-scale infrastructure design.
- Opportunity to work with a highly skilled team of engineers, architects, and thought leaders in the AI/ML and HPC space.
- Competitive salary and benefits package, with opportunities for growth and development.
If you are a highly motivated and experienced professional with a passion for pushing the boundaries of GPU cluster performance, we encourage you to apply and join our dynamic team!
Career Level - IC5
Benchmarking and Performance Optimization:
- Lead and execute performance benchmarking of large-scale GPU clusters using MLPerf from MLCommons, ensuring optimal performance across thousands of NVIDIA and AMD GPUs.
- Conduct end-to-end STAC benchmarks, including STAC-M3, STAC-A2, and STAC-AI, for both compute and storage performance, with a focus on low-latency, high-throughput solutions.
Solution Architecture and Design:
- Design and architect complex solutions that leverage OCI (Oracle Cloud Infrastructure) services for GPU cluster deployments, high-performance computing (HPC), and large-scale AI/ML workloads.
- Collaborate with cross-functional teams to develop and implement cutting-edge, high-performance GPU cluster architectures that meet rigorous benchmarks and performance requirements.
Leadership and Collaboration:
- Serve as a thought leader and subject matter expert in GPU cluster performance optimization, benchmarking standards, and cloud-native AI/ML infrastructure.
- Mentor and guide junior engineers and provide technical leadership in areas related to GPU performance, benchmarking, and solution architecture.
Continuous Improvement and Innovation:
- Stay abreast of the latest industry trends, research, and developments in GPU performance and benchmarking tools, HPC, and cloud-native AI/ML infrastructure.
- Drive innovation by recommending and implementing new approaches, technologies, and tools to enhance the performance of large-scale GPU clusters and benchmarking methodologies.
Required Qualifications:
Experience:
- Proven experience running MLPerf benchmarks from MLCommons across large-scale environments with thousands of NVIDIA and AMD GPUs.
- Demonstrated expertise in conducting STAC benchmarks, including STAC-M3, STAC-A2, and STAC-AI, for both compute and storage performance optimization.
- Extensive experience in GPU cluster architecture, performance tuning, and benchmarking in cloud environments.
Technical Skills:
- Strong knowledge of GPU architectures (NVIDIA, AMD) and parallel computing.
- Proficiency with Oracle Cloud Infrastructure (OCI) services and solutions architecture.
- Expertise in containers and container orchestration (e.g., Docker, Kubernetes), HPC frameworks, and distributed computing.
- Strong programming skills in Python, C++, or CUDA, and experience with performance profiling tools.
Soft Skills:
- Exceptional analytical and problem-solving abilities.
- Strong communication and collaboration skills with a proven ability to work effectively across diverse teams.
- High attention to detail, with a strong focus on quality and performance optimization.
Preferred Qualifications:
- Experience with AI/ML frameworks (TensorFlow, PyTorch, etc.) and their deployment in large-scale cloud environments.
- Familiarity with other cloud platforms (AWS, Azure, GCP) and hybrid cloud architectures.
- Active participation in MLCommons, STAC, or other industry consortiums or standards groups.