Scale AI

Software Engineer, ML Infrastructure - Training Platform

Scale AI, San Francisco, California, United States, 94199

Scale is looking for an AI Infrastructure Engineer to join our Machine Learning Infrastructure team to build out our Training Platform. You will partner closely with Machine Learning researchers to understand their requirements and apply your own domain expertise and our compute resources to accelerate experimentation throughput.

The ideal candidate has strong fundamentals in machine learning and backend system design, along with prior ML infrastructure experience. You should also be comfortable with infrastructure and large-scale system design, as well as diagnosing both model performance issues and system failures.

You will:

- Build highly available, observable, performant, and cost-effective APIs for model training.
- Participate in our team’s on-call process to ensure the availability of our services.
- Own projects end-to-end, from requirements and scoping through design and implementation, in a highly collaborative and cross-functional environment.
- Exercise good taste in building systems and tools and know when to make build vs. buy tradeoffs, with an eye for cost efficiency.

Ideally you'd have:

- 4+ years of experience building machine learning training pipelines or inference services in a production setting.
- Experience with distributed training techniques such as DeepSpeed and FSDP.
- Experience building, deploying, and monitoring complex microservice architectures.
- Experience with Python, Docker, Kubernetes, and infrastructure as code (e.g., Terraform).

Nice to haves:

- Experience with LLM inference latency optimization techniques such as kernel fusion, quantization, and dynamic batching.
- Experience working with a cloud technology stack (e.g., AWS or GCP).
