ZipRecruiter
Professional Services Engineer (Cloud & AI Infra)
ZipRecruiter, Boston, Massachusetts, us, 02298
Job Description
About the Company
Our client is at the forefront of the AI revolution, providing cutting-edge infrastructure that is reshaping the landscape of artificial intelligence. They offer an AI-centric cloud platform that empowers Fortune 500 companies, top-tier innovative startups, and AI researchers to drive breakthroughs in AI. This publicly traded company is committed to building full-stack infrastructure to serve the explosive growth of the global AI industry, including large-scale GPU clusters, cloud platforms, tools, and services for developers.
Company Type: Publicly traded
Product: AI-centric GPU cloud platform & infrastructure for training AI models
Candidate Location: Remote anywhere in the US
Their mission is to democratize access to world-class AI infrastructure, enabling organizations of all sizes to turn bold AI ambitions into reality. At the core of their success is a culture that celebrates creativity, embraces challenges, and thrives on collaboration.
The Opportunity
As a Professional Services Engineer (Remote), you'll play a key role in designing, implementing, and maintaining large-scale machine learning (ML) training and inference workflows for clients. Working closely with a Solutions Architect and support teams, you'll provide expert, hands-on guidance to help clients achieve optimal ML pipeline performance and efficiency.
What You'll Do
Design and implement scalable ML training and inference workflows using Kubernetes and Slurm, focusing on containerization (e.g., Docker) and orchestration.
Optimize ML model training and inference performance with data scientists and engineers.
Develop and expand a library of ready-to-deploy, standardized training and inference solutions by designing, deploying, and managing Kubernetes and Slurm clusters for large-scale ML training.
Integrate Kubernetes and Slurm with popular ML frameworks such as TensorFlow, PyTorch, or MXNet, ensuring seamless execution of distributed ML training workloads.
Develop monitoring and logging tools to track distributed training performance, identify bottlenecks, and troubleshoot issues.
Create automation scripts and tools to streamline ML training workflows, leveraging technologies like Ansible, Terraform, or Python.
Participate in industry conferences, meetups, and online forums to stay up to date with the latest developments in MLOps, Kubernetes, Slurm, and ML.
What You Bring
At least 3 years of experience in MLOps, DevOps, or a related field.
Strong experience with Kubernetes and containerization (e.g., Docker).
Experience with cloud providers like AWS, GCP, or Azure.
Familiarity with Slurm or other distributed computing frameworks.
Proficiency in Python, with experience in ML frameworks such as TensorFlow, PyTorch, or MXNet.
Knowledge of ML model serving and deployment.
Familiarity with CI/CD pipelines and tools like Jenkins, GitLab CI/CD, or CircleCI.
Experience with monitoring and logging tools like Prometheus, Grafana, or the ELK Stack.
Solid understanding of distributed computing principles, parallel processing, and job scheduling.
Experience with automation tools such as Ansible and Terraform.
Key Attributes for Success
Passion for AI and transformative technologies.
A genuine interest in optimizing and scaling ML solutions for high-impact results.
Results-driven mindset and problem-solver mentality.
Adaptability and ability to thrive in a fast-paced startup environment.
Comfortable working with an international team and diverse client base.
Communication and collaboration skills, with experience working in cross-functional teams.
Why Join?
Competitive compensation: $130,000-$175,000 (negotiable based on experience and skills).
Full medical benefits and life insurance: 100% coverage for health, vision, and dental insurance for employees and their families.
401(k) match program with up to a 4% company match.
PTO and paid holidays.
Flexible remote work environment.
Reimbursement of up to $85/month for mobile and internet.
Work with state-of-the-art AI and cloud technologies, including the latest NVIDIA GPUs (H100, L40S, with H200 and Blackwell chips coming soon).
Be part of a team that operates one of the most powerful commercially available supercomputers.
Contribute to sustainable AI infrastructure with energy-efficient data centers that recover waste heat to warm nearby residential buildings.
Interviewing Process
Level 1: Virtual interview with the Talent Acquisition Lead (general fit, Q&A).
Level 2: Virtual interview with the Hiring Manager (skills assessment).
Level 3: Interview with C-level leadership (final round).
Reference and Background Checks: Conducted after the interviews.
Offer: Extended to the selected candidate.
We are proud to be an equal opportunity workplace and are committed to equal employment opportunity regardless of marital status, ancestry, physical or mental disability, genetic information, veteran status, sexual orientation, gender identity or expression, or any other characteristic protected by applicable federal, state, or local law.
Compensation Range: $130K - $175K