Optum
Principal AI/ML Infrastructure and Ops Engineer - Remote
Optum, San Francisco, California, United States, 94199
Optum is a global organization that delivers care, aided by technology, to help millions of people live healthier lives. The work you do with our team will directly improve health outcomes by connecting people with the care, pharmacy benefits, data and resources they need to feel their best. Here, you will find a culture guided by diversity and inclusion, talented peers, comprehensive benefits and career development opportunities. Come make an impact on the communities we serve as you help us advance health equity on a global scale. Join us to start Caring. Connecting. Growing together.
Optum AI is chartered to drive value on high-impact enterprise AI problems, democratize AI through the enterprise ML platform, accelerate the adoption of Generative Artificial Intelligence (Gen AI), and drive Responsible AI. These efforts are projected to deliver $8.4B of benefit value over the next 5 years and to reduce risk through safe, accurate, and unbiased AI, making this a key focus of the enterprise.
As the Principal AI/ML Infrastructure and Ops Engineer, you will be responsible for the overall operations of United AI Studio (the enterprise AI/ML platform). This individual contributor (IC) role requires deep expertise in building and managing large-scale AI/ML platforms, along with the ability to provide strategic guidance and hands-on technical leadership. You will play a critical role in ensuring the stability, reliability, scalability, and performance of United AI Studio in compliance with enterprise standards, working with other engineering teams, customers, and our leadership. Experience with modern infrastructure and DevOps tools and paradigms, as well as hands-on knowledge of major cloud services such as Azure, AWS, and GCP, is a must.
Primary Responsibilities:
Infrastructure Strategy & Planning: Lead the design and implementation of scalable infrastructure solutions that align with the company’s strategic goals and operational needs
Cloud & Hybrid Environment Management: Oversee the management of multi-cloud (Azure, AWS, GCP) and hybrid infrastructure environments, enabling secure and scalable solution hosting while balancing optimal performance against cost and budgetary constraints
Automation & DevOps: Drive automation across the infrastructure lifecycle, leveraging Infrastructure as Code (IaC) and DevOps principles to streamline deployment and management processes
Systems Monitoring & Performance Tuning: Develop and implement monitoring frameworks for infrastructure, identify areas for performance improvement and optimization, and ensure high availability
Disaster Recovery & Business Continuity: Design, test, and implement disaster recovery and business continuity plans to ensure minimal downtime and data integrity
Security & Compliance: Collaborate with cybersecurity teams to ensure all systems and operations comply with industry standards and are secure against evolving threats
Capacity Planning & Cost Optimization: Forecast and manage capacity requirements for the AI/ML infrastructure while identifying opportunities to reduce costs without compromising performance
Thought Leadership: Stay updated with the latest in cloud technologies, AI/ML infrastructure advancements, and DevOps practices, providing leadership within the organization on best practices
Mentorship & Leadership: Act as a technical mentor for junior team members, fostering a culture of continuous learning and professional development within the team
Cross-Departmental Collaboration: Work closely with software engineering, cybersecurity, and AI/ML teams to ensure infrastructure supports the broader technical ecosystem
Required Qualifications:
Bachelor’s degree in computer science, information technology, or a related field
10+ years of infrastructure experience: Proven experience managing large-scale, cloud-based, enterprise-level software platforms and deep understanding of multi-cloud architectures, specifically Azure, AWS, and GCP, with hands-on experience in cloud management
6+ years of practical experience in Infrastructure-as-Code and CI/CD tools such as Terraform, GitHub Actions, and the like
5+ years of practical experience in containerization technologies (Kubernetes, Docker) and orchestration for large-scale workloads
5+ years of practical experience in scripting and automation: advanced proficiency in scripting languages such as Python and Bash to support automation and system integration efforts
Preferred Qualifications:
Master’s degree in computer science, information technology, or a related field
Experience in monitoring and optimizing performance of distributed systems, particularly AI/ML pipelines and data processing workflows
High-availability systems experience: demonstrated success in building and maintaining highly available, fault-tolerant infrastructure
Proven security & compliance knowledge: solid understanding of security best practices and experience ensuring compliance with relevant regulatory frameworks
Machine Learning and LLM Operations experience: exposure to modern tools and techniques in MLOps and LLMOps fields
Experience with AI/ML-specific infrastructure tools (e.g., MLflow, Kubeflow) for managing and deploying models at scale
Proven leadership in a Healthcare environment: experience working within a healthcare or regulated industry, with a deep understanding of the unique challenges and compliance requirements
Proven disaster recovery expertise: hands-on experience designing and implementing business continuity and disaster recovery solutions
Demonstrated familiarity with GPU-accelerated computing and the management of AI/ML hardware infrastructure, including AI-specific cloud services and GPU clusters
Ability to work independently, manage multiple projects simultaneously, and adapt to changing priorities in a fast-paced environment
Demonstrated innovative problem solving: track record of introducing innovative infrastructure solutions that improve efficiency, reduce costs, or enhance performance across the enterprise
Application Deadline: This job will be posted for a minimum of 2 business days or until a sufficient candidate pool has been collected. The posting may come down early due to the volume of applicants.
At UnitedHealth Group, our mission is to help people live healthier lives and make the health system work better for everyone. We believe everyone, of every race, gender, sexuality, age, location and income, deserves the opportunity to live their healthiest life. Today, however, there are still far too many barriers to good health which are disproportionately experienced by people of color, historically marginalized groups and those with lower incomes. We are committed to mitigating our impact on the environment and enabling and delivering equitable care that addresses health disparities and improves health outcomes, an enterprise priority reflected in our mission.
Diversity creates a healthier atmosphere: UnitedHealth Group is an Equal Employment Opportunity/Affirmative Action employer and all qualified applicants will receive consideration for employment without regard to race, color, religion, sex, age, national origin, protected veteran status, disability status, sexual orientation, gender identity or expression, marital status, genetic information, or any other characteristic protected by law.
UnitedHealth Group is a drug-free workplace. Candidates are required to pass a drug test before beginning employment.