Karkidi

Software Engineer, Infrastructure

Karkidi, San Francisco, California, United States, 94199


Anthropic is seeking talented and experienced Infrastructure Engineers to join our team and support the development, scaling, and maintenance of our cutting-edge AI systems. By joining our Infrastructure team, you will have the opportunity to work on groundbreaking AI technologies and contribute to the development of frontier models, supporting Anthropic's mission to create safe and reliable AI systems that benefit humanity.

We currently have openings in the following areas:

Data Infrastructure:

Responsible for designing, building, and maintaining the data infrastructure that powers our AI research and products. Collaborate with cross-functional teams to understand data requirements, deliver efficient and reliable data solutions, optimize data pipelines, implement data governance best practices, and set technical strategies for high-scale, reliable data infrastructure.

Research Infrastructure:

Focused on developing and scaling systems that enable researchers to iterate quickly, and on bringing key systems and components used during development up to production scale as our model footprint grows.

Site Reliability Engineering:

Design and implement scalable solutions, collaborate with development teams to improve infrastructure reliability, establish monitoring systems, and implement fault-tolerant design patterns. Participate in an on-call rotation and ensure reliability and scalability in new features and services.

Systems:

Support some of the largest, most sophisticated clusters in the industry used to train, research, and serve AI models. Responsible for building systems and running large Kubernetes clusters with GPU/TPU/Trainium workloads.

Observability:

Design, build, and maintain the observability infrastructure that ensures the reliability, performance, and efficiency of our AI systems and services. Collaborate with cross-functional teams to understand observability requirements and deliver solutions using technologies such as Prometheus, Splunk, and Grafana.

Responsibilities:

- Lead build-out of industry-leading AI clusters (thousands to hundreds of thousands of machines), partnering closely with cloud service providers.
- Consult with stakeholders to understand infrastructure, data, and compute needs, identifying potential solutions to support research and product development.
- Set technical strategy and oversee the development of high-scale, reliable infrastructure systems.
- Mentor top technical talent.
- Design processes (e.g., postmortem review, incident response, on-call rotations) that help the team operate effectively.

You may be a good fit if you:

- Have 4+ years of relevant industry experience, with 1+ years leading large-scale, complex projects or teams.
- Are passionate about distributed systems at scale, infrastructure reliability, and continuous improvement.
- Have strong proficiency in at least one programming language (e.g., Python, Rust, Go, Java).
- Possess strong problem-solving skills and the ability to work independently.
- Have excellent communication skills to build consensus with stakeholders.
- Possess deep knowledge of modern cloud infrastructure including Kubernetes, Infrastructure as Code, AWS, and GCP.

Strong candidates may also:

- Have expertise in security and privacy best practices.
- Have experience with machine learning infrastructure like GPUs, TPUs, or Trainium.
- Have low-level systems experience, such as Linux kernel tuning and eBPF.
- Have technical expertise in understanding systems design tradeoffs.

Deadline to apply:

None. Applications will be reviewed on a rolling basis.

The expected salary range for this position is:

Annual Salary: $280,000 to $485,000 USD

Logistics:

- Location-based hybrid policy: Currently, we expect all staff to be in one of our offices at least 25% of the time.
- US visa sponsorship: We do sponsor visas! However, we aren't able to sponsor visas for every role and every candidate.

We encourage you to apply even if you do not believe you meet every single qualification. Not all strong candidates will meet every single qualification, so we urge you not to exclude yourself prematurely.

Compensation and Benefits:

Anthropic’s compensation package consists of salary, equity, and benefits. We are committed to pay fairness and aim for these elements to be competitive with market rates.

US Benefits:

- Comprehensive health, dental, and vision insurance.
- 401(k) plan with 4% matching.
- 22 weeks of paid parental leave.
- Unlimited PTO.

UK Benefits:

- Private health, dental, and vision insurance.
- Pension contribution (matching 4% of your salary).
- 21 weeks of paid parental leave.
- Unlimited PTO.
