Zscaler

Staff Site Reliability Engineer - Incident Response

Zscaler, Washington, District of Columbia, US, 20022

Our Engineering team built the world's largest cloud security platform from the ground up, and we keep building. With more than 100 patents and big plans for enhancing services and increasing our global footprint, the team has made us and our multitenant architecture today's cloud security leader, with more than 15 million users in 185 countries. Bring your vision and passion to our team of cloud architects, software engineers, security experts, and more who are enabling organizations worldwide to harness speed and agility with a cloud-first strategy.

NOTE: U.S. citizenship is required for this position due to the nature of the customers assigned to this role.

We're looking for an experienced Staff Site Reliability Engineer - Incident Response to join our Shared Platform Engineer team. Reporting to the Director of Cloud Operations and Incident Management, you will:

Lead and advocate for the transformation to a world-leading SRE organization, promoting SRE principles within the Engineering Department.

Provide expert leadership during critical outages, coordinating multiple teams to ensure streamlined decision-making and quick resolution.

Promote a customer-focused approach by addressing and mitigating issues in global customer environments, and foster a culture of continuous learning and technical excellence within the SRE team.

Develop and implement scalable process frameworks and observability strategies to ensure rapid problem diagnosis, response, and service reliability.

Collaborate with product teams to thoroughly analyze failures and integrate insights to improve service reliability, scalability, and operational efficiency.

What We're Looking for (Minimum Qualifications)

5+ years of experience as a Site Reliability Engineer, with relevant experience in an Operations or Engineering environment.

Hands-on experience troubleshooting Linux-based systems.

Networking knowledge and the ability to troubleshoot TCP/IP, SSL/TLS, DNSSEC, IPsec, and BGP issues.

Coding experience (preferably Python) building tools, scripts, or automation.

Bachelor's degree in Computer Science, a related technical field involving computer systems engineering, or equivalent practical experience.

What Will Make You Stand Out (Preferred Qualifications)

Experience supporting High/Moderate FedRAMP environments.

Understanding of observability practices and tools such as Grafana, Datadog, and Splunk.

Experience leading major incidents in large-scale, high-uptime environments.

This role offers a remote work option.
