SRE L3 Engineer
Inficare, Austin, TX, United States
Job title: SRE L3 Engineer
Job location: Austin, TX and Jacksonville, FL
Job description:
Qualifications:
• The ideal candidate will have a strong background in production monitoring, a deep understanding of development and operations, and a proven track record in managing and scaling distributed systems in public, private, or hybrid cloud environments.
• Understanding of SRE principles, including monitoring, alerting, fault analysis, and other common reliability engineering concepts, with a keen eye for opportunities to eliminate toil through code and process improvements.
• Expertise in infrastructure as code (IaC), configuration management, build automation, source control, and CI/CD tools (e.g., Terraform, CloudFormation, Ansible, GitHub, Artifactory, Jenkins).
• Deep understanding of containerization and orchestration technologies (e.g., Docker, Kubernetes).
• Experience with monitoring and logging tools (e.g., Prometheus, Grafana, Dynatrace, Splunk) and incident response processes.
• Proficiency in Java, .NET, web UI/JavaScript frameworks, and scripting languages such as Python, Bash, and PowerShell.
• High-level understanding of the different layers of the technology stack and how they come together to provide a service (e.g., network, compute, storage, OS (Linux, Windows), supporting services, application layer).
Responsibilities:
• Key measures of success will include platform stability, effective integration and delivery, instrumentation, release quality, technical debt (toil) reduction, development of automation, risk/security compliance, and sustained advancement of the SRE practice.
• Design and implement scalable, automated, monitored, and well-documented systems to accelerate the development of services running in the AWS and Azure clouds.
• Configure, tune, and fix multi-tiered systems to achieve optimal application performance, stability, and availability.
• Be part of an on-call rotation providing hands-on technical expertise during service-impacting events.
• Apply troubleshooting skills and debugging tools, and examine logs, telemetry, and other data sources to verify assumptions and assess customer impact. Lead blameless postmortems to identify root causes and improve production resiliency.
Mandatory skills: Infrastructure as code (IaC), configuration management, build automation, source control, and CI/CD tools (e.g., Terraform, CloudFormation, Ansible, GitHub, Artifactory, Jenkins). Deep understanding of Docker and Kubernetes.