Senior Principal Software Engineer - GPU Cluster Performance and ...
Ll Oefentherapie - Seattle, Washington, us, 98127 9 days ago
We are seeking a highly skilled and experienced Large GPU Cluster Performance and Benchmark Engineer to join our advanced technology team as a Senior...
Principal Observability Architect, AI and HPC
NVIDIA - Santa Clara, California, us, 95053 15 days ago
NVIDIA’s Hardware Infrastructure organization is seeking a Senior or Principal Data and Observability Architect. We serve and collaborate directly wi...
Sr. System Engineer
Support Revolution - San Jose, California, United States, 95199 18 days ago
Location: San Jose, California, United States. About Supermicro:
Manager, Solution Engineering
Support Revolution - San Jose, California, United States, 95199 2 months ago
Location: San Jose, California, United States. About Supermicro:
ML engineer | Large Scale AI Infrastructure
GLO Comms - Palo Alto, California, United States, 94306 13 days ago
This position will sit within a company that is pioneering a new era of Biomedicine! Role Overview: GPU Cluster Management:Architect, deploy...
Experienced C++ Developer, HPC Storage
Hudson River Trading - New York, New York, us, 10261 3 days ago
Hudson River Trading (HRT) is a leading quantitative trading and investment firm specializing in multi-asset class strategies. At the core of our succ...
Senior Principal Software Engineer - GPU Cluster Performance and ...
Oracle - Washington, District of Columbia, us, 20022 25 days ago
Job Description: We are seeking a highly skilled and experienced Large GPU Cluster Performance and Benchmark Engineer to join our advanced tec...
Senior Principal Software Engineer - GPU Cluster Performance and ...
Oracle - Boston, Massachusetts, us, 02298 25 days ago
Job Description: We are seeking a highly skilled and experienced Large GPU Cluster Performance and Benchmark Engineer to join our advanced tec...
Solutions Architect, Cloud Providers and Hyperscale
NVIDIA Corporation - Santa Clara, California, us, 95053 1 month ago
Solutions Architect, Cloud Providers and Hyperscale. Locations: US, CA, Santa Clara; US, WA, Remote; US, CA, Remote. Time type: Full time
HPC Software Architect
Cymertek Corporation - Honolulu, Hawaii, United States, 96814 4 days ago
HPC Software Architect. Key Summary: We are seeking an experienced and visionary HPC (High-Performance Computing) Software Architect t...

Ll Oefentherapie

Senior Principal Software Engineer - GPU Cluster Performance and ...

Ll Oefentherapie - Seattle, Washington, us, 98127

Overview

We are seeking a highly skilled and experienced Large GPU Cluster Performance and Benchmark Engineer to join our advanced technology team as a Senior Principal. In this role, you will be responsible for designing, optimizing, and benchmarking large-scale GPU clusters, with a specific focus on running MLPerf benchmarks from MLCommons across thousands of NVIDIA and AMD GPUs. You will play a critical role in optimizing performance for both AI/ML and compute workloads, and in ensuring efficient storage solutions.

Why Join Us?

Be at the forefront of GPU performance benchmarking and large-scale infrastructure design.

Opportunity to work with a highly skilled team of engineers, architects, and thought leaders in the AI/ML and HPC space.

Competitive salary and benefits package, with opportunities for growth and development.

If you are a highly motivated and experienced professional with a passion for pushing the boundaries of GPU cluster performance, we encourage you to apply and join our dynamic team!

Career Level - IC5

Benchmarking and Performance Optimization:

Execute and lead the performance benchmarking of large-scale GPU clusters using MLPerf from MLCommons, ensuring optimal performance across thousands of NVIDIA and AMD GPUs.

Conduct end-to-end STAC benchmarks, including STAC-M3, STAC-A2, and STAC-AI, for both compute and storage performance, with a focus on low-latency, high-throughput solutions.

Solution Architecture and Design:

Design and architect complex solutions that leverage OCI (Oracle Cloud Infrastructure) services for GPU cluster deployments, high-performance computing (HPC), and large-scale AI/ML workloads.

Collaborate with cross-functional teams to develop and implement cutting-edge, high-performance GPU cluster architectures that meet rigorous benchmarks and performance requirements.

Leadership and Collaboration:

Serve as a thought leader and subject matter expert in GPU cluster performance optimization, benchmarking standards, and cloud-native AI/ML infrastructure.

Mentor and guide junior engineers and provide technical leadership in areas related to GPU performance, benchmarking, and solution architecture.

Continuous Improvement and Innovation:

Stay abreast of the latest industry trends, research, and developments in GPU performance and benchmarking tools, HPC, and cloud-native AI/ML infrastructures.

Drive innovation by recommending and implementing new approaches, technologies, and tools to enhance the performance of large-scale GPU clusters and benchmarking methodologies.

Required Qualifications:

Experience:

Proven experience running MLPerf benchmarks from MLCommons across large-scale environments with thousands of NVIDIA and AMD GPUs.

Demonstrated expertise in conducting STAC benchmarks, including STAC-M3, STAC-A2, and STAC-AI, for both compute and storage performance optimization.

Extensive experience in GPU cluster architecture, performance tuning, and benchmarking in cloud environments.

Technical Skills:

Strong knowledge of GPU architectures (NVIDIA, AMD) and parallel computing.

Proficiency with Oracle Cloud Infrastructure (OCI) services and solutions architecture.

Expertise in containerization and container orchestration (e.g., Docker, Kubernetes), HPC frameworks, and distributed computing.

Strong programming skills in Python, C++, or CUDA, and experience with performance profiling tools.

Soft Skills:

Exceptional analytical and problem-solving abilities.

Strong communication and collaboration skills with a proven ability to work effectively across diverse teams.

High attention to detail, with a strong focus on quality and performance optimization.

Preferred Qualifications:

Experience with AI/ML frameworks (TensorFlow, PyTorch, etc.) and their deployment in large-scale cloud environments.

Familiarity with other cloud platforms (AWS, Azure, GCP) and hybrid cloud architectures.

Active participation in MLCommons, STAC, or other industry consortiums or standards groups.
