UnitedHealth Group

Principal AI/ML Infrastructure and Ops Engineer - Remote

UnitedHealth Group, Seattle, WA

Optum is a global organization that delivers care, aided by technology, to help millions of people live healthier lives. The work you do with our team will directly improve health outcomes by connecting people with the care, pharmacy benefits, data and resources they need to feel their best. Here, you will find a culture guided by diversity and inclusion, talented peers, comprehensive benefits and career development opportunities. Come make an impact on the communities we serve as you help us advance health equity on a global scale. Join us to start Caring. Connecting. Growing together.

Optum AI is UnitedHealth Group’s enterprise AI team. We are AI/ML scientists and engineers with deep expertise in AI/ML engineering for health care. We develop AI/ML solutions for the highest impact opportunities across UnitedHealth Group businesses including UnitedHealthcare, Optum Financial, Optum Health, Optum Insight, and Optum Rx. In addition to transforming the health care journey through responsible AI/ML innovation, our charter also includes developing and supporting an enterprise AI/ML development platform.

Optum AI team members:

Have impact at scale: We have the data and resources to make an impact at scale. When our solutions are deployed, they have the potential to make the health care system work better for everyone.

Do ground-breaking work: Many of our current projects involve cutting-edge ML, NLP and LLM techniques. Generative AI methods for working with structured and unstructured health care data are continuously being developed and improved. We are working in one of the most important frontiers of AI/ML research and development.

Partner with world-class experts on innovative solutions: Our team members are developing novel AI/ML solutions to business challenges. In some cases, this includes the opportunity to file patents and publish papers about the methods we develop. We also collaborate with AI/ML researchers at some of the world’s top universities.

As the Principal AI/ML Infrastructure and Ops Engineer, you will be responsible for the overall operations related to United AI Studio (enterprise AI/ML platform). This individual contributor (IC) role requires deep expertise in building and managing large-scale AI/ML platforms, providing strategic guidance, and hands-on technical leadership. You will play a critical role in ensuring the stability, reliability, scalability, and performance of United AI Studio in compliance with enterprise standards, working with other engineering teams, customers, and our leadership. Experience with modern infrastructure and DevOps tools and paradigms, as well as hands-on knowledge of major cloud-based services like Azure, AWS and GCP, is a must.

You’ll enjoy the flexibility to work remotely* from anywhere within the U.S. as you take on some tough challenges.

Primary Responsibilities:

Infrastructure Strategy & Planning: Lead the design and implementation of scalable infrastructure solutions that align with the company’s strategic goals and operational needs
Cloud & Hybrid Environment Management: Oversee the management of multi-cloud (Azure, AWS, GCP) and hybrid infrastructure environments, enabling secure and scalable solution hosting and balancing performance against budgetary constraints
Automation & DevOps: Drive automation across the infrastructure lifecycle, leveraging Infrastructure as Code (IaC) and DevOps principles to streamline deployment and management processes
Systems Monitoring & Performance Tuning: Develop and implement monitoring frameworks for infrastructure, identifying areas for performance improvement and optimization, and ensuring high availability
Disaster Recovery & Business Continuity: Design, test, and implement disaster recovery and business continuity plans to ensure minimal downtime and data integrity
Security & Compliance: Collaborate with cybersecurity teams to ensure all systems and operations comply with industry standards and are secure against evolving threats
Capacity Planning & Cost Optimization: Forecast and manage capacity requirements for the AI/ML infrastructure while identifying opportunities to reduce costs without compromising performance
Thought Leadership: Stay updated with the latest in cloud technologies, AI/ML infrastructure advancements, and DevOps practices, providing leadership within the organization on best practices
Mentorship & Leadership: Act as a technical mentor for junior team members, fostering a culture of continuous learning and professional development within the team
Cross-Departmental Collaboration: Work closely with software engineering, cybersecurity, and AI/ML teams to ensure infrastructure supports the broader technical ecosystem

You’ll be rewarded and recognized for your performance in an environment that will challenge you and give you clear direction on what it takes to succeed in your role as well as provide development for other roles you may be interested in.

Required Qualifications:

Bachelor’s degree in computer science, information technology, or a related field
10+ years of infrastructure experience: proven experience managing large-scale, cloud-based, enterprise-level software platforms and deep understanding of multi-cloud architectures, specifically Azure, AWS, and GCP, with hands-on experience in cloud management
6+ years of practical experience in Infrastructure-as-Code and CI/CD tools like Terraform, Git Actions, and the like
5+ years of practical experience in containerization technologies (Kubernetes, Docker) and orchestration for large-scale workloads
5+ years of practical experience in scripting and automation: advanced proficiency in scripting languages such as Python and Bash to support automation and system integration efforts

Preferred Qualifications:

Master’s degree in computer science, information technology, or a related field
Experience in monitoring and optimizing performance of distributed systems, particularly AI/ML pipelines and data processing workflows
High-availability systems experience: demonstrated success in building and maintaining highly available, fault-tolerant infrastructure
Proven security & compliance knowledge: solid understanding of security best practices and experience ensuring compliance with relevant regulatory frameworks
Machine Learning and LLM Operations experience: exposure to modern tools and techniques in the MLOps and LLMOps fields
Experience with AI/ML-specific infrastructure tools (e.g., MLflow, Kubeflow) for managing and deploying models at scale
Proven leadership in a healthcare environment: experience working within a healthcare or regulated industry, with a deep understanding of the unique challenges and compliance requirements
Proven disaster recovery expertise: hands-on experience designing and implementing business continuity and disaster recovery solutions
Demonstrated familiarity with GPU-accelerated computing and the management of AI/ML hardware infrastructure, including AI-specific cloud services and GPU clusters
Ability to work independently, manage multiple projects simultaneously, and adapt to changing priorities in a fast-paced environment
Demonstrated innovative problem solving: track record of introducing innovative infrastructure solutions that improve efficiency, reduce costs, or enhance performance across the enterprise

*All employees working remotely will be required to adhere to UnitedHealth Group’s Telecommuter Policy.

California, Colorado, Connecticut, Hawaii, Nevada, New Jersey, New York, Rhode Island, Washington, Washington, D.C. Residents Only: The salary range for this role is $122,100 to $234,700 annually. Pay is based on several factors including but not limited to local labor markets, education, work experience, certifications, etc. UnitedHealth Group complies with all minimum wage laws as applicable. In addition to your salary, UnitedHealth Group offers benefits such as a comprehensive benefits package, incentive and recognition programs, equity stock purchase and 401k contribution (all benefits are subject to eligibility requirements). No matter where or when you begin a career with UnitedHealth Group, you’ll find a far-reaching choice of benefits and incentives.

Application Deadline: This will be posted for a minimum of 2 business days or until a sufficient candidate pool has been collected. Job posting may come down early due to volume of applicants.

At UnitedHealth Group, our mission is to help people live healthier lives and make the health system work better for everyone. We believe everyone, of every race, gender, sexuality, age, location and income, deserves the opportunity to live their healthiest life. Today, however, there are still far too many barriers to good health which are disproportionately experienced by people of color, historically marginalized groups and those with lower incomes. We are committed to mitigating our impact on the environment and enabling and delivering equitable care that addresses health disparities and improves health outcomes, an enterprise priority reflected in our mission.

Diversity creates a healthier atmosphere: UnitedHealth Group is an Equal Employment Opportunity/Affirmative Action employer and all qualified applicants will receive consideration for employment without regard to race, color, religion, sex, age, national origin, protected veteran status, disability status, sexual orientation, gender identity or expression, marital status, genetic information, or any other characteristic protected by law.

UnitedHealth Group is a drug-free workplace. Candidates are required to pass a drug test before beginning employment.

Brand: Optum Technology
Job ID: 2249635
Employment Type: Full-time
Job Area: Technology
Function: Information Systems Manager
Industry: Direct Health/Medical Insurance Carrier