Plume Design, Inc.

Manager, Site Reliability Engineering

Plume Design, Inc., Palo Alto, California, United States, 94306

We’re looking for a seasoned technical manager with experience in customer-facing environments to captain our Site Reliability Engineering team. The team focuses on deployments, fixes, and sustainability. The right candidate combines strong technical knowledge in key areas with a focus on customer satisfaction.

What You’ll Do:

Supervise a team of Site Reliability Engineers who provide first-line support for customer clouds. Routine tasks include deployments, on-call duty, and application provisioning.

Attend and conduct customer meetings to specify projects and roadmaps.

Manage growth and performance of SRE team members.

Be able to step in and execute or triage issues alongside the engineers. Past hands-on experience is beneficial. Some examples:

Provision and scale multi-datacenter Kubernetes Infrastructure and Applications (EKS)

Deploy Software in multiple Production Environments

Own monitoring and alerting for production systems, including improvements and changes

Contribute improvements to the current automation

Contribute improvements to our on-call process and alerting

Play a key role in the recruitment and retention of top talent.

What You’ll Bring:

Availability to be in on-call rotation for Production issues

Availability to work with a distributed team in different timezones

Advanced communication skills

Experience managing people

Desired Skill Set:

10+ years of experience with production troubleshooting

At least 3 years of experience leading or managing teams

Bachelor’s degree in a related field or equivalent experience; advanced degree preferred.

This is a leadership role, but you must have technical knowledge of and working experience with:

Kubernetes (operations)

Terraform (basic knowledge)

Programming or scripting experience in at least one language (e.g. Perl, Python, PHP, Go, Java)

Experience with modern cloud infrastructure, preferably AWS

Experience with modern Linux operating systems (Enterprise Linux or Debian-based)

Experience both setting up and using self-managed monitoring and observability tools (e.g. Nagios/Icinga, Grafana, Prometheus)

Differentiators:

Troubleshooting production performance/service degradation or outage issues at scale

Experience with infrastructure troubleshooting on VMs and/or bare metal (SSH/Linux)

Advanced Kubernetes knowledge

Advanced Terraform knowledge

Customer-facing experience in previous roles

Experience operating Kafka in Production

Experience operating NoSQL Databases in Production

Experience operating Relational Databases in Production

Configuration Management experience
