Karkidi
Software Development Engineer 5 - DevOps
Karkidi, San Jose, California, United States, 95199
We are seeking an experienced Senior MLOps Engineer with a Kubernetes background to join our dynamic team. The Engineer will be responsible for bridging the gap between development, operations, and data science teams to ensure smooth deployment and operation of machine learning models in production environments. The ideal candidate will have a strong background in managing Kubernetes clusters at scale, deploying, maintaining, and optimizing infrastructure for performance and reliability.
Responsibilities
- Architect, deploy, and maintain Kubernetes clusters according to best practices and organizational requirements.
- Manage containerized applications using Kubernetes, including pod scheduling, scaling, updating, and rolling deployments.
- Develop and maintain automation scripts and tools for provisioning, configuring, and managing Kubernetes infrastructure, leveraging infrastructure-as-code principles.
- Collaborate with data scientists and software engineers to design, develop, and deploy machine learning models in production environments.
- Collaborate with cross-functional teams, including developers, DevOps engineers, and system administrators, to support application deployment and integration with Kubernetes.
- Document processes, configurations, and troubleshooting procedures.
- Build and maintain scalable, reliable, and efficient machine learning pipelines for data ingestion, model training, evaluation, and deployment.
- Implement monitoring, logging, and alerting systems to ensure the health and performance of deployed models.
- Develop and maintain infrastructure as code (IaC) using tools like Terraform or CloudFormation to automate the provisioning and configuration of cloud resources.
- Stay current with the latest Kubernetes developments, best practices, and emerging technologies to continuously improve the organization's Kubernetes infrastructure and practices.
- Implement disaster recovery strategies and high availability configurations to ensure business continuity and resilience of Kubernetes environments.
- Implement security best practices and compliance standards for handling sensitive data in production environments.
- Provide guidance, training, and support to junior team members and stakeholders on Kubernetes concepts, best practices, and usage.
Requirements
- B.Tech / M.Tech degree in Computer Science from a premier institute.
- 9-14 years of experience in software development and operations.
- Excellent computer science fundamentals and a good understanding of architecture, design, and performance.
- Hands-on experience with cloud platforms such as AWS, Azure, or Google Cloud, including services like EC2, S3, Lambda, Kubernetes, and managed AI/ML services.
- Experience with containerization technologies (e.g., Docker) and container orchestration platforms (e.g., Kubernetes).
- Familiarity with version control systems (e.g., Git), CI/CD pipelines (e.g., Jenkins, GitLab CI/CD), and configuration management tools (e.g., Ansible, Puppet).
- Good knowledge of the cloud security domain.
- Proficient in Java/Python and Shell.
- Hands-on experience writing reliable, maintainable code.
- Ability to work independently with strong problem-solving skills.
- Good understanding of Kubernetes and knowledge of product life cycles and associated issues.
- Technical depth in operating systems, computer architecture, and OS internals.
- Technical depth in cloud computing, cloud platforms, and services architecture and design.
Good To Have
- Foundational knowledge of Machine Learning and Artificial Intelligence.
- Experience with the ML lifecycle, AI ethics, and ML frameworks such as TensorFlow, Caffe, Torch, and similar frameworks.