Principal/Architect - Availability Engineering & SRE
Salesforce, Inc., San Francisco, CA, United States
Site Reliability Engineering (SRE) combines software and systems engineering to build and run large-scale, massively distributed, fault-tolerant systems. SRE ensures that Salesforce services have the reliability, capacity, performance and availability to meet our customers' needs, and that they improve at the rate our customers expect.
Our software development focuses on enabling service owners to operate their services safely at scale, whether through paved path integrations onto observability frameworks, optimizing existing systems, designing infrastructure or eliminating work through AI/ML investments or traditional automation. On the SRE team, you’ll have the opportunity to manage the complex challenges of scale which are unique to Salesforce, while using your expertise in coding, algorithms, complexity analysis and large-scale system design. SRE's culture of diversity, intellectual curiosity, problem solving and openness is key to its success. Our organization brings together people with a wide variety of backgrounds, experiences and perspectives. We encourage them to collaborate, think big and take risks in a blame-free environment. We promote self-direction to work on meaningful projects, while we also strive to create an environment that provides the support and mentorship needed to learn and grow.
The SRE practice at Salesforce is evolving, and this role will shape the technical strategy for SRE and influence the strategy for the Availability Cloud as a whole. You will embed with product-owning teams, define the availability roadmap and deliver directly against it. Most importantly, you will mature the SRE practice, mentoring and actively developing the engineers around you. Your success is measured by how far you scale the impact and delivery of your community.
Responsibilities:
Spearhead a culture of Service Ownership and enable it to thrive. Define healthy service ownership practices and work with embedded teams to develop their knowledge and ownership practice.
Engage in and improve the whole lifecycle of services—from inception and design, through to deployment, operation and refinement.
Support services before they go live through activities such as system design consulting, developing software platforms and frameworks, capacity planning and launch reviews.
Develop full paved-path observability platform integrations and the automation necessary to maintain service, system and product health.
Scale systems sustainably through mechanisms like automation, and evolve systems by pushing for and delivering changes that improve reliability and velocity.
Practice sustainable incident response and blameless postmortems. Uphold the quality and high standards of postmortems as part of the Architect community at Salesforce.
Be comfortable with hands-on coding at least 25% of the time.
Develop and grow the engineering talent around you.
Minimum Requirements:
15+ years of software development and engineering experience, 5+ years in a technical leadership role
Hands-on experience designing, building and operating large-scale distributed systems, identifying shortcomings and optimization opportunities, and making data-driven cost-performance tradeoffs to influence design decisions
Demonstrated experience leading initiatives that span multiple teams and leveraging deep domain expertise to influence tech roadmap planning and execution
Demonstrated ability to effectively collaborate across multiple teams and stakeholders to drive business outcomes
Experience mentoring and investing in the development of engineers and peers
Ability to reverse engineer solutions through independent code and architecture review, and to envision, define and contribute to the delivery of refactoring projects that improve availability
Mastery of object-oriented software delivery in one or more languages such as Java, Golang, Python, C++ or C
Experience with Kubernetes, Istio and public cloud (AWS or other)
Deep experience working with core web technologies: HTTP, JSON, REST, XML
Experience owning and operating multiple instances of a critical service
Experience running critical infrastructure services: monitoring, alerting, logging, tracing and reporting
Subject matter expertise in service ownership best practices, SLO/SLI/SLA definition and driving proactive operational awareness, with experience in incident and problem management
Thorough knowledge of Agile development methodology, with experience in both Test-Driven and Behavior-Driven Development practices
Experience in fault modeling and tolerance, chaos engineering, performance and load testing.