Cognizant

Lead Site Reliability Engineer

Cognizant, Plano, Texas, US, 75086


About Cognizant’s Digital Engineering Practice:
At Cognizant Digital Engineering, a small cross-functional team composed of a Product Manager, an Architect, Full-Stack Developers, UI/UX designers, and Big Data analysts builds higher-quality software faster than siloed individuals working independently. Small, nimble engineering teams generate collective empathy and camaraderie, thus increasing their ability to anticipate unforeseen development scope changes and maintain high-quality deliverables. Our Digital Engineering teams ideate and develop innovative cloud-based solutions following a Lean-Agile process with a DevOps culture.

The Role:
Cognizant is looking for an experienced and innovative Lead Site Reliability Engineer to serve our diverse base of global clients. As a member of our team, you will build cutting-edge, cloud-based software that powers modern business. An ideal candidate is someone who enjoys working in a diverse, collaborative, geographically distributed team and is an expert engineer who values the team, drives continuous improvement, and is unafraid to challenge the legacy status quo with creative cloud-based solutions.

Location: Plano, Texas

Responsibilities:
· Strong SRE experience with Java, AWS, DevOps, deployment strategy, and monitoring tools (e.g., Dynatrace, Splunk, CI/CD, Grafana).
· Application troubleshooting experience, focusing on core SRE metrics before production deployment (uptime vs. availability, monitoring vs. observability, incidents, and outages).
· Familiarity with SLO, SLA, SLI, and other SRE-related terms.
· Experience deploying using CI/CD pipelines and debugging/troubleshooting issues in coordination with application teams (Java, Spring Boot, Python, .NET).
· Ability to perform API performance testing using tools such as JMeter or BlazeMeter.
· Identifying root causes for production issues in AWS environments with multiple microservices.
· Expertise in Terraform for managing infrastructure as code and troubleshooting technical issues.
· Championing site reliability culture and practices, exerting technical influence throughout the team.
· Leading initiatives to improve the reliability and stability of applications and platforms using data-driven analytics.
· Collaborating with team members to identify comprehensive service level indicators and establishing reasonable service level objectives and error budgets with customers.
· Demonstrating a high level of technical expertise and proactively identifying and solving technology-related bottlenecks.
· Acting as the main point of contact during major incidents and solving issues quickly to avoid financial losses.
· Documenting and sharing knowledge within the organization via internal forums and communities.

Required Skills:
· 8+ years of relevant work experience.
· Deep proficiency in reliability, scalability, performance, security, enterprise system architecture, toil reduction, and site reliability best practices.
· Fluency in Java programming.
· Experience in observability, including white-box and black-box monitoring, SLO alerting, and telemetry collection using tools like Splunk, Grafana, Dynatrace, Prometheus, and Datadog.
· Proficiency in continuous integration and continuous delivery tools (e.g., Jenkins, GitLab, Terraform).
· Experience with containers and container orchestration (e.g., ECS, Kubernetes, Docker).
· Experience with infrastructure as code tools such as Terraform and managing/supporting cloud-based applications, preferably on AWS.
· Excellent communication skills.

Benefits:

Cognizant offers the following benefits for this position, subject to applicable eligibility requirements:
· Medical/Dental/Vision/Life Insurance
· Paid holidays plus Paid Time Off
· 401(k) plan and contributions
· Long-term/Short-term Disability
· Paid Parental Leave
· Employee Stock Purchase Plan

Disclaimer: The salary, other compensation, and benefits information is accurate as of the date of this posting. Cognizant reserves the right to modify this information at any time, subject to applicable law.
