AI Infrastructure Engineer (The AI Backbone Builder)
Unreal Gigs, San Francisco, CA, United States
Are you passionate about designing and building the robust infrastructure that powers cutting-edge AI solutions? Do you thrive on creating scalable, high-performance systems that support AI workloads, from training machine learning models to deploying real-time inference? If you're excited about building the backbone for the future of AI, then our client has the perfect opportunity for you. We're looking for an AI Infrastructure Engineer (aka The AI Backbone Builder) to design, deploy, and maintain the infrastructure that powers AI innovation.
As an AI Infrastructure Engineer at our client, you'll play a critical role in building the platforms that support machine learning and AI development across the organization. You'll work closely with data scientists, software engineers, and DevOps teams to ensure that AI systems run efficiently, securely, and at scale. Your work will enable fast experimentation, seamless deployments, and the continuous delivery of AI models into production.
Key Responsibilities:
- Design and Build AI Infrastructure: Architect and implement scalable infrastructure that supports AI workloads, including machine learning model training, large-scale data processing, and real-time inference. You'll design solutions that ensure high availability, fault tolerance, and performance optimization.
- Collaborate with data scientists and engineers to build pipelines that automate the end-to-end machine learning lifecycle, from data ingestion to model training, deployment, and monitoring. You'll ensure smooth integration of AI models into production environments.
- Implement strategies to optimize compute resources for AI workloads, including GPU/TPU provisioning, memory management, and parallel processing. You'll ensure that infrastructure is optimized for the unique demands of AI and machine learning tasks.
- Manage cloud-based AI platforms (AWS, GCP, Azure) as well as on-premises infrastructure for AI development. You'll handle everything from infrastructure as code (IaC) to containerization and orchestration (Docker, Kubernetes), ensuring seamless scalability and automation.
- Implement and maintain CI/CD pipelines for machine learning models to enable rapid experimentation, testing, and deployment. You'll automate workflows and model updates, and monitor the performance of AI systems in production.
- Ensure that the AI infrastructure complies with security best practices and regulatory requirements. You'll implement robust access controls, encryption, and other security measures to protect sensitive data and AI models.
- Continuously monitor the health and performance of AI infrastructure, identifying bottlenecks, reducing latency, and troubleshooting issues. You'll ensure the reliability of systems, optimizing them as AI demands grow.
Required Skills:
- AI Infrastructure Expertise: Deep experience in designing and building infrastructure that supports AI and machine learning workloads. You're familiar with both cloud and on-premises infrastructure solutions and know how to optimize them for AI.
- Cloud Platforms and Tools: Strong experience with cloud platforms like AWS, GCP, or Azure, particularly with AI services and infrastructure management. You're comfortable with tools like SageMaker, Vertex AI (formerly AI Platform), or Azure ML, as well as container orchestration with Kubernetes.
- Automation and DevOps: Expertise in automating infrastructure provisioning and model deployment using tools such as Terraform, Ansible, Jenkins, or GitLab CI. You're skilled at managing CI/CD pipelines for AI model deployment.
- GPU/TPU Optimization: Hands-on experience with GPU/TPU optimization for machine learning and deep learning tasks. You understand how to manage compute resources to maximize efficiency for AI workloads.
- Security and Compliance: Strong understanding of security best practices, including data encryption, access management, and compliance with regulations like GDPR and HIPAA.
- Bachelor's or Master's degree in Computer Science, Engineering, Data Science, or a related field. Equivalent experience in AI infrastructure or DevOps is highly valued.
- Certifications in cloud platforms (AWS, GCP, Azure) or DevOps tools are a plus.
- 3+ years of experience in infrastructure engineering, with a focus on building and maintaining AI or machine learning infrastructure in production environments.
- Proven experience with cloud services, containerization, orchestration tools, and optimizing infrastructure for AI workloads.
- Experience working with data scientists and machine learning engineers to support model development, testing, and deployment.
Benefits:
- Health and Wellness: Comprehensive medical, dental, and vision insurance plans with low co-pays and premiums.
- Paid Time Off: Competitive vacation, sick leave, and 20 paid holidays per year.
- Work-Life Balance: Flexible work schedules and telecommuting options.
- Professional Development: Opportunities for training, certification reimbursement, and career advancement programs.
- Wellness Programs: Access to wellness programs, including gym memberships, health screenings, and mental health resources.
- Life and Disability Insurance: Life insurance and short-term/long-term disability coverage.
- Employee Assistance Program (EAP): Confidential counseling and support services for personal and professional challenges.
- Tuition Reimbursement: Financial assistance for continuing education and professional development.
- Community Engagement: Opportunities to participate in community service and volunteer activities.
- Recognition Programs: Employee recognition programs to celebrate achievements and milestones.