Accenture

Full-Stack MLOps Systems Engineering Lead

Accenture, Charlotte, North Carolina, United States, 28245

This is a Senior Manager/Associate Director-level position.

We Are:

Nextira, now part of Accenture, builds cloud-based solutions and services with cutting-edge engineering skills, artificial intelligence (AI), machine learning (ML), and data analytics that enable clients to design, build, launch, and optimize high-performance computing environments. Nextira joined the Accenture AWS Business Group and our AWS North America delivery practice in June 2023.

You Are:

An experienced, highly motivated MLOps Engineering Lead looking to join our team in supporting our large-scale GPU-based AI training and research cluster hosted on various cloud providers such as AWS, Azure, and GCP. The ideal candidate should have expert knowledge of Linux at the kernel level and be able to configure and troubleshoot NVIDIA drivers and utilities, particularly on virtual machines running in the cloud.

The Work:

You (Full-Stack MLOps Engineering Lead) will lead the design, development, and operational management of cloud-native computing clusters to perform ML training and inference. You will lead the development and delivery of the tooling that cluster users need to optimize performance and troubleshoot issues with their training workloads.

You will manage HPC (High Performance Computing) clusters, including schedulers such as Slurm, and compute nodes accelerated with NVIDIA GPUs. You’ll help experienced ML engineers configure and manage their Conda environments, optimizing them for their specific AI training and research needs.

Our Senior MLOps Engineers engage in clear and effective communication with highly technical users, providing support and guidance on a wide range of technical topics related to the cluster while utilizing their strong Linux skills to troubleshoot and resolve issues, optimize system performance, and ensure a stable and reliable environment for AI training and research.

Travel may be required for this role. The amount of travel will vary from 0 to 100% depending on business need and client requirements.

Here’s What You Need (Basic Qualifications):

Bachelor's degree or equivalent work experience (minimum 12 years). If you hold an Associate's degree, you must have a minimum of 6 years of work experience.

Minimum of 6 years of professional experience working in a software engineering or DevOps role

Minimum of 4 years of experience in Linux Systems Administration, including kernel tuning, networking, and storage

Minimum of 3 years of experience with at least three of the following: Python, Docker / Kubernetes, C++, GPU stack (e.g. CUDA, SMI, ROCm)

Minimum of 4 years of experience in a platform engineering or developer role in a cloud environment

Minimum of 12 months of experience with IaC tools, e.g. Terraform

Excellent problem-solving and analytical skills

Bonus points if you have (Preferred Qualifications):

Experience in MLOps, Artificial Intelligence (AI), Large Language Models (LLMs), or High Performance Computing (HPC)

Experience with full-stack application development, particularly using cloud provider APIs

Experience with parallel file systems, e.g. Lustre, GPFS, Weka

Experience with managing virtualized Python environments, e.g. conda, pyenv.

Experience working in a consulting environment, engaging with client stakeholders at a senior level.

Experience leading a team of cloud platform engineers.

Strong written and verbal communication skills, with the ability to explain complex technical concepts to both technical and non-technical audiences.
