Plume Design, Inc

Staff Site Reliability Engineer

Plume Design, Inc - Palo Alto

Overview

We’re looking for a seasoned Site Reliability Engineer, experienced with customer-facing environments, to provide technical leadership for our Site Reliability Engineering team. This team is focused on deployments, production infrastructure, availability, and reliability. The right candidate has held several infrastructure-oriented roles and has strong technical knowledge of the DevOps/SRE technology stack, with a focus on customer satisfaction.

What You’ll Do

Supervise a team of Site Reliability Engineers who provide first-line support to Customer Clouds. Routine tasks include deployments, on-call, and application provisioning.
Run stand-ups for the team and manage tickets
Participate in sprints and close tickets with the team
Attend and conduct customer meetings for project and roadmap specification
Be able to step in and execute or triage issues. Some examples are as follows:
Provision and scale Kubernetes Infrastructure and Applications (EKS)
Deploy Software in multiple Production Environments
Own monitoring and alerting for production systems, including improvements and changes
Contribute improvements to the current automation
Contribute improvements to our on-call process and alerting

What You’ll Bring

4+ years of Kubernetes Knowledge (operate)
2+ years of Terraform Knowledge
Experience both setting up and utilizing Monitoring and observability tools (e.g. New Relic, Nagios/Icinga, Grafana, Prometheus)
2+ years of experience Programming/Scripting - one of the following (e.g. Perl, Python, PHP, GoLang, Java, etc.)
8+ years of experience with modern Linux Operating systems
6+ years of experience with modern cloud infrastructure, preferably AWS
Availability to join the on-call rotation for Production issues
Availability to work with a distributed team across different time zones
Advanced communication skills
Experience leading efforts and reporting up

Desired Skill Set

10+ years of experience with Production Troubleshooting
4+ years of experience leading teams
Executive communication skills
Bachelor’s degree in a related field or equivalent experience; advanced degree preferred.
This is a leadership role, but you must have Technical knowledge and working experience with:
Kubernetes (operate)
Basic Terraform Knowledge
Experience Programming/Scripting - one of the following (e.g. Perl, Python, PHP, GoLang, Java, etc.)
Experience with modern cloud infrastructure, preferably AWS
Experience with modern Linux Operating systems (Enterprise Linux or Debian based)
Experience both setting up and utilizing self-managed Monitoring and observability tools (e.g. Nagios/Icinga, Grafana, Prometheus)

Differentiators

Troubleshooting production performance/service degradation or outage issues at scale
Experience with Infrastructure Troubleshooting in VMs and/or Bare Metal (ssh/Linux)
Advanced Kubernetes knowledge
Advanced Terraform knowledge
Customer Facing experience in previous roles
Experience operating Kafka in Production
Experience operating NoSQL Databases in Production
Experience operating Relational Databases in Production
Configuration Management experience

Kindly note that this is a HYBRID position, with a requirement to work in the office 3 days a week. We’re looking for candidates who are within a commutable distance. At this time, we are unable to provide relocation assistance.

The total compensation package includes an anticipated compensation range of $177,000 - $208,000, plus bonus, equity, and benefits. Benefits include a 401(k) plan with a company match, basic life insurance, and unparalleled health, dental, vision and other benefits and perks. For more details please see:

An employee’s base salary and its position within the range may depend on a number of factors, including job-related knowledge, education, skills, experience, and other business-related considerations. Published ranges are provided in good faith at the time of posting.
