Glocomms

Staff Machine Learning Engineer | Large Scale AI Infrastructure

Glocomms, Stanford, California, United States, 94305


This position sits within a company that is pioneering a new era of biomedicine.

Role Overview:

GPU Cluster Management: Architect, deploy, and sustain high-performance GPU clusters, ensuring they are stable, reliable, and scalable. Oversee and manage cluster resources to maximize efficiency and utilization.

Distributed/Parallel Training: Apply distributed computing techniques to facilitate parallel training of large deep learning models across multiple GPUs and nodes. Optimize data distribution and synchronization for faster convergence and reduced training times.

Performance Optimization: Tune GPU clusters and deep learning frameworks to achieve peak performance for specific workloads. Identify and resolve performance bottlenecks through profiling and system analysis.

Deep Learning Framework Integration: Work closely with data scientists and machine learning engineers to incorporate distributed training capabilities into the company's model development and deployment frameworks.

Scalability and Resource Management: Ensure GPU clusters can scale effectively to meet growing computational demands. Develop resource management strategies to prioritize and allocate computing resources based on project needs.

Troubleshooting and Support: Diagnose and resolve issues related to GPU clusters, distributed training, and performance anomalies. Provide technical support to users and resolve technical challenges efficiently.

Documentation: Develop and maintain documentation on GPU cluster configuration, distributed training workflows, and best practices to facilitate knowledge sharing and smooth onboarding of new team members.

Qualifications:

Master's or Ph.D. in computer science or a related field, with a focus on High-Performance Computing, Distributed Systems, or Deep Learning.
Over 2 years of proven experience in managing GPU clusters, including installation, configuration, and optimization.
Strong expertise in distributed deep learning and parallel training techniques.
Proficiency in popular deep learning frameworks such as PyTorch, Megatron-LM, and DeepSpeed.
Programming skills in Python and experience with GPU-accelerated libraries (e.g., CUDA, cuDNN).
Knowledge of performance profiling and optimization tools for HPC and deep learning.
Familiarity with resource management and scheduling systems (e.g., SLURM, Kubernetes).
Solid background in distributed systems, cloud computing (AWS, GCP), and containerization (Docker, Kubernetes).
A current or previous Staff or equivalent title, or 3 years in a current Senior-level title.

The company will provide a relocation package for candidates open to relocating.