Credence Management Solutions, LLC

Site Reliability Engineer

Credence Management Solutions, LLC, McLean, VA


Overview

Credence Management Solutions, LLC (Credence) is seeking a Site Reliability Engineer to support a task order within GSA COMET II.

Responsibilities include, but are not limited to, the duties listed below.

Education, Requirements and Qualifications

  • Bachelor's/Master's degree in computer science or another highly technical, scientific discipline
  • Ability to program (structured and OO) in one or more high-level languages, such as Python, Java, C/C++, Ruby, or JavaScript
  • Experience with cloud storage technologies as well as dynamic resource management frameworks (Mesos, Kubernetes, YARN)
  • A proactive approach to spotting problems, areas for improvement, and performance bottlenecks
  • 5+ years of experience with Cloud Architecture, preferably AWS
  • 10+ years of experience with operations of enterprise systems with over a million users
  • 10+ years of experience with application development
  • 5+ years of experience in DevSecOps
  • 3+ years of experience with microservices
  • 5+ years of experience leading teams
  • 3+ years of experience with agile

Role & Responsibilities

  • Run the production environment by monitoring availability and taking a holistic view of system health
  • Build software and systems to manage/operate platform infrastructure and applications
  • Improve reliability, quality, and time-to-market of our suite of software solutions
  • Measure and optimize system performance, with an eye toward pushing our capabilities forward, getting ahead of customer needs, and innovating to continually improve
  • Provide primary operational support and engineering for multiple large distributed software applications
  • Ensure production readiness for releases, including performance and usability testing
  • Gather and analyze metrics from both operating systems and applications to assist in performance tuning and fault finding
  • Partner with development teams to improve services through rigorous testing and release procedures
  • Participate in system design consulting, platform management, and capacity planning
  • Create sustainable systems and services through automation and uplifts
  • Balance feature development speed and reliability with well-defined service level objectives
  • Perform root cause analyses (RCAs) for production incidents and conduct post-incident reviews
  • Optimize on-call rotations and processes
  • Keep documentation and runbooks up to date