Tbwa Chiat/Day Inc
ML/Dev Ops Engineer New York
Tbwa Chiat/Day Inc, Mission, Kansas, United States
Radical AI, Inc. is an artificial intelligence company that is accelerating scientific research and development. We are at the forefront of innovation in materials R&D, a critical driver for advancing our most cutting-edge industries. Breaking away from the traditionally slow and costly R&D process, Radical AI uses artificial intelligence and machine learning to pioneer generative materials science, a field that blends AI, engineering, and materials science to change how materials are created and discovered. Radical AI's approach speeds up R&D and addresses global challenges, setting new benchmarks in technology and sustainability.

The opportunity

As an ML Ops Engineer, you’ll be joining our AI Research and Development team. You will play a key role in developing our ML and data platform by helping to design, implement, and maintain scalable ML and DevOps infrastructure. You will help stand up and maintain the clusters and pipelines that support the development, training, and deployment of machine learning models for materials research, and you will build, scale, and automate general infrastructure for use across our software stack.

Mission

Deploy and manage GPU and CPU clusters for machine learning models and quantum chemistry using Slurm and Kubernetes.
Enable seamless replication of clusters across various cloud services, including Lambda Labs, IBM, and hyperscalers.
Implement and maintain monitoring, logging, and alerting systems using Zabbix, Prometheus, or a similar tool.
Develop and implement CI/CD pipelines to enable safe and reproducible software.
Work with widely used DevOps tools, such as Terraform and Ansible.
Optimize computing infrastructure to maximize system performance, focusing on GPU utilization, distributed training, bandwidth efficiency between machines, and VPC connections.
Work closely with the AI research team and cross-functional teams, including engineering, to ensure effective model deployment and integration into production systems.
Stay abreast of the latest developments in machine learning and data infrastructure, applying new techniques and methodologies to ongoing projects.
Conduct rigorous testing and validation of machine learning models and data pipelines to ensure accuracy, efficiency, and scalability.
Maintain comprehensive documentation of models, pipelines, algorithms, and experiments.
Troubleshoot and optimize machine learning models and data infrastructure, addressing technical challenges and improving overall performance.
Promote engineering best practices throughout the team.
Ensure adherence to ethical AI standards and best practices in all aspects of work.

About you

3+ years of experience in a DevOps role, preferably in an AI/ML-focused environment.
Strong knowledge of Slurm and Kubernetes.
Experience leveraging cloud (AWS/GCP/Azure) and reserved (Lambda Labs) computing platforms for scalable AI model deployment.
Experience with CI/CD tools such as GitHub Actions, CircleCI, and Argo CD.
Experience working with and scaling model training across GPU clusters.
Experience in building data pipelines and managing data infrastructure.
The ability to navigate complex challenges, strategically manage resources, and improve system efficiency.
Excellent written and verbal communication skills, with the ability to clearly convey complex technical information.
Ability to work effectively in a collaborative team environment.

Pluses

Master’s or PhD in Computer Science, AI, Data Science, or a related field.
Familiarity with infrastructure-as-code tools like Terraform or CloudFormation.
Experience deploying and scaling quantum chemistry workloads with VASP.
Basic ML knowledge.

Compensation

$175K – $275K + Equity + Benefits; base pay offered may vary depending on job-related knowledge, skills, and experience.

What we offer

Our competitive compensation package also includes the best in benefits:
Medical, dental, and vision insurance for you and your family.
Mental health and wellness support.
Unlimited PTO and 14+ company holidays per year.
401(k).
Work closely with a team on the cutting edge of AI research.
A mission: an opportunity to fundamentally change the way humanity makes progress through materials science discovery.

Radical AI is committed to equal employment opportunity regardless of race, color, ancestry, national origin, religion, sex, age, sexual orientation, gender identity and expression, marital status, disability, or veteran status.