CoreWeave

Site Reliability Engineer - Cloud Operations

CoreWeave, Roseland, New Jersey, US, 07068


CoreWeave is a specialized cloud provider, delivering a massive scale of GPU compute resources on top of the industry's fastest and most flexible infrastructure. CoreWeave builds cloud solutions for compute-intensive use cases (VFX and rendering, machine learning and AI, batch processing, and Pixel Streaming) that are up to 35 times faster and 80% less expensive than the large, generalized public clouds. Learn more at www.coreweave.com.

About the role:

The Cloud Operations Team is the heart of CoreWeave's operational practice. In this role, you'll help define and shape how Site Reliability Engineering (SRE) is implemented at CoreWeave. The Cloud Operations Team defines and implements tooling and processes that enable operational best practices and continual improvement across all engineering teams.

As an 'SRE of SREs,' you'll define and implement system and workflow automation so that service owners can rapidly identify and mitigate availability and performance regressions. Collaborating across engineering, you'll support service-owning SREs with the 'picks and shovels' they need to excel at running their services.

You will work with a team of 8-10 mixed-specialization engineers and have the opportunity to work on the full gamut of rewarding challenges that come with building the AI Cloud in a communicative, supportive, and high-performing environment.

As a member of the Cloud Operations Team, you have the opportunity to:

- With a customer-first mindset, establish reliability and quality assessment patterns for all CoreWeave systems.
- Improve the performance, security, reliability, and scalability of internal and external-facing services.
- Develop dashboards, alerts, automated remediation, and insights into the customer experience using observability tools.
- Create and maintain Kubernetes operators, custom controllers, and other tools to intelligently scale our operational capability.
- Establish and integrate incident and change management tools and workflows.
- Act as Incident Commander for priority incidents and lead postmortems.
- Participate in the on-call rotation as needed as we establish and operationalize this new team.
- Enable and evangelize reliability engineering across CoreWeave's engineering teams.
- Grow, change, invest in your teammates, be invested in, share your ideas, listen to others, be curious, have fun, and, above all, be yourself.

Wondering if you're a good fit?

We believe in investing in our people, and value candidates who can bring their own diversified experiences to our teams, even if you aren't a 100% skill or experience match.

Here are some qualities we've found compatible with our team. If a portion of this resonates with you, we'd love to talk.

- You have experience operating services in production and are interested in driving engineering practices such as reliability at scale, testing (load, recovery, system, etc.), progressive deployments, error budgets, observability, and fault-tolerant design.
- You have experience automating manual processes and integrating various operations and productivity tools.
- You've done some Linux shell scripting and/or can navigate a *nix-based operating system (with the right cheat sheet, if required).
- You are familiar with debugging and administration of Linux and Kubernetes environments.
- You're comfortable with the idea of codifying practices into Kubernetes controllers, operators, and other applications using a modern programming language.
- You have experience with incident management for your team or an organization.
- You're comfortable in open source environments.
- You're excited to join a team with diverse perspectives and backgrounds that believes in tackling challenges, growing hand in hand, and winning together.

Why CoreWeave?

At CoreWeave, we work hard, have fun, and move fast! We're in an exciting stage of hyper-growth that you will not want to miss out on. We're not afraid of a little chaos, and we're constantly learning. Our team cares deeply about how we build our product and how we work together, which is represented through our core values:

- Be Curious at your Core
- Act like an Owner
- Empower Employees
- Deliver Best-in-Class Client Experience
- Achieve More Together

We support and encourage an entrepreneurial outlook and independent thinking. We foster an environment that encourages collaboration and provides the opportunity to develop innovative solutions to complex problems. As we get set for takeoff, the growth opportunities within the organization are constantly expanding. You will be surrounded by some of the best talent in the industry, who will want to learn from you, too. Come join us!

Benefits

We offer a competitive salary and benefits, including:

- Medical, dental and vision insurance - 100% paid for the employee
- Life insurance
- Short and long-term disability insurance
- Flexible Spending Account
- Flexible, full-service childcare support with Kinside
- 401(k) with a generous employer match
- Flexible PTO
- Catered lunch each day in our offices
- Weekly massages in NJ office
- A casual work environment
- Work culture focused on innovative disruption

CoreWeave is an equal opportunity employer, committed to diversity and inclusiveness. We will consider all qualified applicants without regard to race, color, nationality, gender, gender identity or expression, sexual orientation, religion, disability or age.
