Oracle

Senior Principal Software Engineer - GPU Cluster Performance and Benchmark Engin

Oracle, Seattle, Washington, US, 98127


We are seeking a highly skilled and experienced Large GPU Cluster Performance and Benchmark Engineer to join our advanced technology team as a Senior Principal. In this role, you will be responsible for designing, optimizing, and benchmarking large-scale GPU clusters, with a specific focus on running MLPerf benchmarks from MLCommons across thousands of NVIDIA and AMD GPUs. You will play a critical role in optimizing performance for both AI/ML and compute workloads and in ensuring efficient storage solutions.

Why Join Us?

Be at the forefront of GPU performance benchmarking and large-scale infrastructure design.

Opportunity to work with a highly skilled team of engineers, architects, and thought leaders in the AI/ML and HPC space.

Competitive salary and benefits package, with opportunities for growth and development.

If you are a highly motivated and experienced professional with a passion for pushing the boundaries of GPU cluster performance, we encourage you to apply and join our dynamic team!

Career Level - IC5

Benchmarking and Performance Optimization:

Lead and execute the performance benchmarking of large-scale GPU clusters using MLPerf from MLCommons, ensuring optimal performance across thousands of NVIDIA and AMD GPUs.

Conduct end-to-end STAC benchmarks, including STAC-M3, STAC-A2, and STAC-AI, for both compute and storage performance, with a focus on low-latency, high-throughput solutions.

Solution Architecture and Design:

Design and architect complex solutions that leverage OCI (Oracle Cloud Infrastructure) services for GPU cluster deployments, high-performance computing (HPC), and large-scale AI/ML workloads.

Collaborate with cross-functional teams to develop and implement cutting-edge, high-performance GPU cluster architectures that meet rigorous benchmarks and performance requirements.

Leadership and Collaboration:

Serve as a thought leader and subject matter expert in GPU cluster performance optimization, benchmarking standards, and cloud-native AI/ML infrastructure.

Mentor and guide junior engineers and provide technical leadership in areas related to GPU performance, benchmarking, and solution architecture.

Continuous Improvement and Innovation:

Stay abreast of the latest industry trends, research, and developments in GPU performance and benchmarking tools, HPC, and cloud-native AI/ML infrastructure.

Drive innovation by recommending and implementing new approaches, technologies, and tools to enhance the performance of large-scale GPU clusters and benchmarking methodologies.

Required Qualifications:

Experience:

Proven experience running MLPerf benchmarks from MLCommons across large-scale environments with thousands of NVIDIA and AMD GPUs.

Demonstrated expertise in conducting STAC benchmarks, including STAC-M3, STAC-A2, and STAC-AI, for both compute and storage performance optimization.

Extensive experience in GPU cluster architecture, performance tuning, and benchmarking in cloud environments.

Technical Skills:

Strong knowledge of GPU architectures (NVIDIA, AMD) and parallel computing.

Proficiency with Oracle Cloud Infrastructure (OCI) services and solutions architecture.

Expertise in containers and container orchestration (e.g., Docker, Kubernetes), HPC frameworks, and distributed computing.

Strong programming skills in Python, C++, or CUDA, and experience with performance profiling tools.

Soft Skills:

Exceptional analytical and problem-solving abilities.

Strong communication and collaboration skills with a proven ability to work effectively across diverse teams.

High attention to detail, with a strong focus on quality and performance optimization.

Preferred Qualifications:

Experience with AI/ML frameworks (TensorFlow, PyTorch, etc.) and their deployment in large-scale cloud environments.

Familiarity with other cloud platforms (AWS, Azure, GCP) and hybrid cloud architectures.

Active participation in MLCommons, STAC, or other industry consortiums or standards groups.
