Insulet Corporation

Senior Principal Site Reliability Engineer (SRE) (Remote/Flexible)

Insulet Corporation, Oklahoma City, Oklahoma, United States

Job Profile Title:

Engineering Advisor

Business Title:

Senior Principal Site Reliability Engineer (SRE)

Department:

8140 - G&A - Global Technology Ops & Security

FLSA Status:

Exempt

Position Overview:

Insulet is a mission-driven company that develops extraordinary, innovative products that directly impact people’s lives and health. We are developing connected consumer medical solutions with a combination of hardware, software, mobile, cloud, and wearables for people living with diabetes and the people who support them. Our mission is to simplify people’s lives while improving their outcomes.

We are seeking a Senior Principal Site Reliability Engineer (SRE) with a strong software engineering background. This role is pivotal in ensuring the reliability, scalability, and performance of our critical systems and services. The ideal candidate will have a deep understanding of SRE principles, a passion for automation, and a proven track record of leading technical teams.

Responsibilities:

Lead the adoption and implementation of SRE practices across the organization, promoting a culture of reliability and continuous improvement.

Develop and implement automation tools and frameworks to enhance system reliability and operational efficiency.

Design and maintain comprehensive monitoring and alerting systems to ensure the health and performance of applications and infrastructure.

Lead the response to high-severity incidents, conduct root cause analysis, and implement corrective actions to prevent recurrence.

Analyze system performance and reliability data to identify areas for improvement and implement optimization strategies.

Work closely with development, operations, and product teams to ensure seamless integration of SRE practices and to drive reliability improvements.

Mentor and train junior engineers in SRE best practices, and foster a culture of knowledge sharing and continuous learning.

Conduct capacity planning and demand forecasting to ensure systems can handle future growth and demand spikes.

Maintain detailed documentation of SRE processes, tools, and best practices to ensure knowledge continuity and operational excellence.

Key Decision Rights:

Authority to define and implement the technical strategy for SRE practices, including tooling, automation, and monitoring solutions.

Lead and make final decisions during high-severity incident responses, including root cause analysis and remediation actions.

Decide on the allocation of resources, including team assignments and budget for SRE initiatives.

Set and enforce performance standards and service level objectives (SLOs) for systems and applications.

Identify and implement process improvements to enhance system reliability and operational efficiency.

Evaluate and select third-party tools and services that support SRE practices and objectives.

Develop and approve training programs for the SRE team to ensure continuous skill development and knowledge sharing.

Required Leadership/Interpersonal Skills & Behaviors:

Ability to set a clear vision and inspire the team to achieve long-term goals.

Making informed, timely decisions, especially during high-pressure situations.

Guiding and developing junior engineers, fostering a culture of continuous learning and improvement.

Planning and executing strategies that align with organizational goals and drive reliability improvements.

Clearly articulating ideas, actively listening, and translating complex technical problems into understandable terms for non-technical stakeholders.

Working seamlessly with cross-functional teams, understanding different personalities and leveraging diverse skills to achieve common goals.

Navigating and resolving conflicts constructively, maintaining a positive team dynamic.

Proactively identifying issues and developing innovative solutions to complex problems.

Taking responsibility for the team’s performance and outcomes, ensuring high standards are maintained.

Required Skills and Competencies:

Experience with observability tools such as Datadog, Prometheus, Dynatrace, Grafana, ELK Stack, or similar.

Proficiency in programming languages such as Python, Go, or Java.

Strong understanding of cloud computing platforms (e.g., AWS, Azure, GCP) and container and orchestration technologies (e.g., Docker, Kubernetes).

In-depth knowledge of AWS services including VPC, Lambda, IAM, ELB, EC2, ECS, CloudWatch, API Gateway, S3, SQS, SNS, WAF, and Route 53.

Experience with infrastructure as code tools such as Terraform, Ansible, or similar.

Excellent troubleshooting and problem-solving skills.

Strong communication and leadership skills, with the ability to collaborate effectively with cross-functional teams.

Experience leading and mentoring engineering teams is highly desirable.

Knowledge of security best practices and experience implementing security controls and measures.

Experience with chaos engineering and resilience testing.

Familiarity with AI/ML applications in operational processes.

Knowledge of compliance requirements.

Education and Experience:

Bachelor’s degree in Computer Science, Engineering, or a related field.

14 years of experience in the field, including 6+ years in Site Reliability Engineering, DevOps, or a similar role.

Proven experience architecting and managing highly available, scalable, and fault-tolerant systems.

Additional Information:

This position is eligible for 100% remote working arrangements (may work from home/virtually 100%; may also work hybrid on-site/virtual as desired).

Travel is estimated at 10% but will flex depending on business need.

At Insulet Corporation, all qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, disability, or status as a protected veteran.
