Credence Management Solutions, LLC

Site Reliability Engineer

Credence Management Solutions, LLC, McLean, VA

Overview

Credence Management Solutions, LLC (Credence) is seeking a Site Reliability Engineer to support a task order within GSA COMET II.

Responsibilities include, but are not limited to, the duties listed below.

Education, Requirements and Qualifications

Bachelor's/Master's degree in computer science or another highly technical, scientific discipline
Ability to program (structured and OO) in one or more high-level languages, such as Python, Java, C/C++, Ruby, or JavaScript
Experience with cloud storage technologies as well as dynamic resource management frameworks (Mesos, Kubernetes, YARN)
A proactive approach to spotting problems, areas for improvement, and performance bottlenecks
5+ years of experience with Cloud Architecture, preferably AWS
10+ years of experience with operations of enterprise systems with over a million users
10+ years of experience with application development
5+ years of experience in DevSecOps
3+ years of experience with microservices
5+ years of experience leading teams
3+ years of experience with agile

Role & Responsibilities

Run the production environment by monitoring availability and taking a holistic view of system health
Build software and systems to manage/operate platform infrastructure and applications
Improve reliability, quality, and time-to-market of our suite of software solutions
Measure and optimize system performance, with an eye toward pushing our capabilities forward, getting ahead of customer needs, and innovating to continually improve
Provide primary operational support and engineering for multiple large distributed software applications
Ensure production readiness for releases, including performance and usability testing
Gather and analyze metrics from both operating systems and applications to assist in performance tuning and fault finding
Partner with development teams to improve services through rigorous testing and release procedures
Participate in system design consulting, platform management, and capacity planning
Create sustainable systems and services through automation and uplifts
Balance feature development speed and reliability with well-defined service level objectives
Perform root cause analyses (RCAs) of production incidents and conduct post-incident reviews
Optimize on-call rotations and processes
Keep documentation and runbooks up to date