Foundation Inc.

Lead DevOps Engineer

Foundation Inc., Irvine, California, United States, 92713


Foundation AI is looking for a highly skilled Lead DevOps Engineer to architect and develop scalable data systems and pipelines. This position demands a strong technical background in version control, coding, and automation, alongside proven leadership in implementing and optimizing CI/CD pipelines to ensure efficient and reliable software delivery.

Job Summary:

As a Lead DevOps Engineer at Foundation AI, you will play a pivotal role that combines technical expertise in version control and coding with leadership in CI/CD implementation. Your troubleshooting proficiency and strong communication skills are essential to maintaining seamless operations in a dynamic, cloud-native environment, adhering to best practices and fostering efficient collaboration.

Responsibilities:

As a Lead DevOps Engineer at Foundation AI, you will bring extensive expertise in managing code deployment, configuration, and monitoring, ensuring high service availability, latency management, and effective change and capacity management.
Utilizing SLAs, SLIs, and SLOs, you will define and maintain system reliability standards, while actively collecting and sharing data with development teams to enhance code quality.
Your role will focus on proactive issue resolution through robust monitoring and logging practices, implementing automation to streamline operations, and maintaining a balance between operational tasks and development work.
Experience with Infrastructure as Code (e.g., Terraform), proficiency in shell scripting and Linux OS, and database configuration skills are essential.
Familiarity with the ELK stack, expertise in managing CI/CD pipelines with tools such as Jenkins, and proficiency in Git repository management are also required.
Knowledge of AWS services, access provisioning, and optimizing product scalability and availability are critical, along with deploying and maintaining monitoring tools and utilizing cloud services efficiently.
Additional qualifications include acting as a configuration manager, optimizing cloud resources, and reducing spending, with AWS Certification considered advantageous.
Experience with Airflow, Helm Charts, AWS SageMaker, and MLOps is a plus. The role calls for 7-10 years of experience (a Master's degree in Computer Science is preferred), demonstrated leadership in project development, and adherence to system security best practices.
Proficiency in coding and scripting, detailed knowledge of web and application servers, familiarity with Linux and Windows operating systems, and experience with containerized platforms and orchestration round out the essential qualifications for this role.

Rich Industry Experience:

You should have 7-10 years of experience in DevOps and Site Reliability Engineering (SRE) and should have worked for product-based companies (startup or scale-up). This experience underscores your ability to navigate complex DevOps challenges effectively.

Mastery of Version Control:

A critical aspect of your role involves demonstrating an in-depth mastery of version control systems. Your proficiency in this area ensures the proper management of code repositories and versioning.

Operating System Expertise:

Your command of operating systems is particularly vital, with a strong emphasis on Linux. This expertise ensures a solid foundation for managing and optimizing system-level operations.

DevOps Methodology:

Your role will require you to not only apply DevOps concepts but also effectively implement best practices. This includes streamlining processes and fostering a culture of collaboration and continuous improvement.

CI/CD Leadership:

You will be at the forefront of CI/CD (Continuous Integration and Continuous Deployment) efforts. This leadership position involves overseeing the automation of software delivery pipelines, enabling rapid and reliable releases.

Efficient Troubleshooting:

Troubleshooting is a core aspect of your responsibilities. You'll need to swiftly diagnose and resolve issues that arise in the development and production environments, minimizing downtime.

Effective Communication and Collaboration:

Exceptional communication and collaboration skills are essential. You'll work closely with cross-functional teams, bridging the gap between development and operations, and ensuring smooth coordination.

Cloud-Native Proficiency:

Proficiency in cloud-native applications is crucial. You'll be tasked with architecting, deploying, and managing applications in cloud environments, harnessing the benefits of scalability and resilience.

Understanding Distributed Computing:

A solid grasp of distributed computing principles is fundamental. It enables you to design and implement systems that can handle complex, distributed workloads effectively.

Coding Prowess:

Your coding skills, particularly in Bash shell scripting and Python, will play a pivotal role. These skills empower you to automate tasks and develop tools to enhance system reliability and efficiency.

Technical Guidance and Support:

You will provide technical guidance to the team, helping to resolve complex technical issues and production problems.

Skills/Qualifications:

You will leverage extensive expertise across various domains, including operating systems (CentOS/Ubuntu, Windows), cloud platforms (AWS, Azure), and container technologies (Docker, Kubernetes with Helm).
Your proficiency extends to planning and design using Jira and Confluence, source code versioning with Git (GitHub), and management of web servers like Apache HTTP Server and Nginx.
You will demonstrate adeptness in programming languages such as Bash shell scripting and Python, coupled with robust configuration management skills in Ansible and infrastructure coding using Terraform and CloudFormation.
Additionally, you'll utilize service mesh technologies like Envoy, Istio, and AWS App Mesh, alongside network configuration tools such as Consul.
Your role will also encompass implementing CI/CD pipelines with Jenkins, ensuring secure credentials management with HashiCorp Vault and SSL, and monitoring infrastructure and applications using tools like Datadog, Prometheus integrated with Grafana, and the ELK Stack for log management.
Furthermore, you'll optimize performance and manage emergency responses through SMTP, SES, SNS, and PagerDuty.

Foundation AI is dedicated to fostering an inclusive and diverse workplace, valuing the principles of equal opportunity and affirmative action. We strive to provide equal employment opportunities to all individuals, irrespective of their race, colour, ancestry, religion, sex, national origin, sexual orientation, age, citizenship, marital status, disability, gender identity, or veteran status. We believe in upholding these values and complying with all applicable laws.

Please send your CV to

careers@foundationai.com

19800 MacArthur Boulevard, Suite 300, Irvine, CA 92612