Money Fit by DRS

Lead/Sr Machine Learning Engineer - AWS (with LLM Focus)

Money Fit by DRS, Myrtle Point, Oregon, United States, 97458

Remote, but candidates must be based in the MST or PST time zones

Responsibilities:

LLM-Optimized MLOps Infrastructure:

Design and implement MLOps infrastructure on AWS tailored for LLMs, leveraging services like SageMaker, EC2 (with GPU instances), S3, ECS/EKS, Lambda, and more.

LLM Deployment Pipelines:

Build and manage CI/CD pipelines specifically for LLM deployment, addressing unique challenges like model size, inference optimization, and versioning.

LLMOps Practices:

Implement LLMOps best practices for monitoring model performance, drift detection, prompt management, and feedback loops for continuous improvement.

RESTful API Development:

Design and develop RESTful APIs to expose LLM capabilities to other applications and services, ensuring scalability, security, and optimal performance.

Model Optimization:

Apply techniques like quantization, distillation, and pruning to optimize LLMs for efficient inference on AWS infrastructure.

Monitoring and Observability:

Establish comprehensive monitoring and alerting mechanisms to track LLM performance, latency, resource utilization, and potential biases.

Prompt Engineering and Management:

Develop strategies for prompt engineering and management to enhance LLM outputs and ensure consistency and safety.

Collaboration:

Work closely with data scientists, researchers, and software engineers to integrate LLMs into production systems effectively.

Cost Optimization:

Continuously optimize LLMOps processes and infrastructure for cost-efficiency while maintaining high performance and reliability.

Qualifications:

Experience:

3+ years of experience in MLOps or a related field, with hands-on experience in deploying and managing LLMs.

AWS Expertise:

Strong proficiency in AWS services relevant to MLOps and LLMs, including SageMaker, EC2 (with GPU instances), S3, ECS/EKS, Lambda, and API Gateway.

LLM Knowledge:

Deep understanding of LLM architectures (e.g., Transformers), training techniques, and inference optimization strategies.

Programming Skills:

Proficiency in Python and experience with infrastructure-as-code tools (e.g., Terraform, CloudFormation), REST API frameworks (e.g., Flask, FastAPI), and LLM libraries (e.g., Hugging Face Transformers).

Monitoring:

Familiarity with monitoring and logging tools for LLMs, such as Prometheus, Grafana, and CloudWatch.

Containerization:

Experience with Docker and container orchestration (e.g., Kubernetes, ECS) for LLM deployment.

Problem Solving:

Excellent problem-solving and troubleshooting skills in the context of LLMs and MLOps.

Communication:

Strong communication and collaboration skills to effectively work with cross-functional teams.
