ECS Limited

Site Reliability Engineering Manager - Hybrid

ECS Limited, Virginia Beach, Virginia, US, 23450

ECS is seeking a Site Reliability Engineering Manager to work in our Fairfax, VA office.

ECS is seeking talented professionals to join our successful and growing team in building the next-generation Continuous Diagnostics and Mitigation (CDM) cyber data solution. The CDM Program is the Cybersecurity and Infrastructure Security Agency's (CISA) dynamic approach to strengthening the cybersecurity of Federal networks and systems through better awareness of, and visibility into, their security posture and cyber threats. ECS is responsible for designing, building, deploying, operating, and maintaining a complete 'Data Services' solution, which includes the collection, normalization, visualization, and sharing of cyber data from more than 100 Federal agencies. The CDM Data Services product is a suite of multiple Commercial Off-the-Shelf (COTS) products, software configuration packages, and custom code that work together as an integrated solution tailored to meet Department of Homeland Security (DHS) requirements.

We are seeking professionals who thrive in a dynamic, fast-paced, and highly collaborative environment where problem-solving, critical thinking, and a holistic approach to serving the mission are key. Our program operates within the Scaled Agile Framework (SAFe). An aptitude and enthusiasm for continuous learning, improvement, and cybersecurity are a must!

ECS is seeking a talented Site Reliability Engineering (SRE) Manager to play a key role in defining, implementing, and growing our SRE practice to ensure the reliability, availability, and performance of our critical production environments.

Technical leadership and team management are core aspects of this role. The SRE Manager will manage, lead, and mentor a team of SREs, fostering a culture of continuous improvement, identifying areas for enhancement, and driving initiatives to improve system reliability, scalability, and efficiency. The SRE Manager will also be responsible for the team's professional growth, which includes regular performance evaluations, constructive feedback, and career development support for team members.

The successful candidate will have demonstrated hands-on experience designing, implementing, and maintaining systems that are resilient, highly available, and performant. The SRE Manager will also play a critical role in defining and measuring the Service Level Objectives (SLOs) and Service Level Indicators (SLIs) for our solution.

The SRE team will be responsible for setting up comprehensive logging, monitoring, and alerting solutions using the Elastic stack and other tools as necessary to ensure the continuous performance of services. Additionally, the team will respond to incidents, perform root cause analyses, and implement solutions to prevent recurrences. The SRE team will work in close collaboration with developers, testers, infrastructure engineers, DevOps engineers, and other stakeholders to integrate reliability and observability into the software development lifecycle.

- US citizenship with ability to obtain Public Trust Suitability
- 8+ years of experience as a Site Reliability Engineer (SRE)
- 8+ years of demonstrated experience designing, implementing, and maintaining observability solutions to include logging, monitoring, and alerting
- 8+ years of hands-on experience with SRE tools (e.g., Elastic, Prometheus, Grafana, Splunk, etc.)
- 4+ years defining and measuring SLOs and SLIs
- 4+ years of relevant experience using cloud platforms (AWS, Azure, Google Cloud)
- 4+ years of hands-on programming or scripting (e.g., Python, Bash, etc.)
- 2+ years of recent experience building and managing a team of 5 or more engineers with differing levels of experience and expertise
- Strong knowledge of microservices, containerization, and orchestration tools (Docker, Kubernetes)
- Proven ability to collaborate with cross-functional teams (development, testing, and product) to integrate reliability and observability into the software development lifecycle
- Strong problem-solving and analytical skills
- Proactive, detail-oriented approach to identifying inefficiencies and implementing improvements