Cognizant

Lead Site Reliability Engineer

Cognizant, Plano, Texas, US, 75086

About Cognizant’s Digital Engineering Practice:

At Cognizant Digital Engineering, a small cross-functional team composed of a Product Manager, an Architect, Full-Stack Developers, UI/UX designers, and Big Data analysts builds higher-quality software faster than siloed individuals working independently. Small, nimble engineering teams generate collective empathy and camaraderie, increasing their ability to anticipate development scope changes and maintain high-quality deliverables. Our Digital Engineering teams ideate and develop innovative cloud-based solutions following a Lean-Agile process with a DevOps culture.

The Role:

Cognizant is looking for an experienced and innovative

Lead SRE Engineer

to serve our diverse base of global clients. As a member of our team, you will build cutting-edge, cloud-based software that powers modern business. The ideal candidate enjoys working in a diverse, collaborative, geographically distributed team, and is an expert engineer who values the “team,” drives continuous improvement, and is unafraid to challenge the legacy status quo with creative cloud-based solutions.

Location: Plano, Texas

Responsibilities:

- Strong SRE experience with Java, AWS, DevOps, deployment strategy, and monitoring tools. Hands-on experience with Dynatrace, Splunk, CI/CD, Grafana, etc.
- Application troubleshooting experience, focusing on core SRE concepts such as uptime vs. availability and monitoring vs. observability.
- Familiarity with SLO, SLA, SLI, and other SRE terminology.
- Experience deploying using CI/CD pipelines and debugging/troubleshooting issues in coordination with application teams (Java, Spring Boot, Python, .NET, etc.).
- Ability to perform API performance testing using tools such as JMeter or BlazeMeter.
- Perform root cause analysis for production issues in AWS environments with multiple microservices.
- Expertise in Terraform to manage infrastructure as code; troubleshoot and resolve technical issues to ensure smooth operation of applications.
- Champion site reliability culture and practices, exerting technical influence throughout the team.
- Lead initiatives to improve reliability and stability of applications and platforms using data-driven analytics.
- Collaborate with team members to identify service level indicators and work with stakeholders to establish service level objectives and error budgets.
- Demonstrate high technical expertise in one or more domains, proactively identifying and solving technology-related bottlenecks.
- Act as the main point of contact during major incidents, demonstrating skills to quickly identify and resolve issues.
- Document and share knowledge within the organization via internal forums and communities.

Required Skills:

- 8+ years of relevant work experience.
- Deep proficiency in reliability, scalability, performance, security, enterprise system architecture, toil reduction, and other site reliability best practices.
- Fluency in Java programming.
- Proficiency in observability tools (e.g., Splunk, Grafana, Dynatrace, Prometheus, Datadog).
- Experience with continuous integration and continuous delivery tools (e.g., Jenkins, GitLab, Terraform).
- Experience with containers and container orchestration (e.g., ECS, Kubernetes, Docker).
- Experience with infrastructure as code tools such as Terraform and managing/supporting cloud-based applications (AWS preferred).
- Excellent communication skills.

Benefits:

Cognizant offers the following benefits for this position, subject to applicable eligibility requirements:

- Medical/Dental/Vision/Life Insurance
- Paid holidays plus Paid Time Off
- 401(k) plan and contributions
- Long-term/Short-term Disability
- Paid Parental Leave
- Employee Stock Purchase Plan

Disclaimer: The salary, other compensation, and benefits information is accurate as of the date of this posting. Cognizant reserves the right to modify this information at any time, subject to applicable law.
