Diversity Resource Staffing Inc
Manager Site Reliability Engineering
Diversity Resource Staffing Inc, Sandy Springs, Georgia, United States
This is an exciting opportunity for a Manager in the Consumer Site Reliability Engineering (SRE) Team at IMT. IMT is a division of our client, which operates numerous financial and commodity marketplaces and exchanges, including the New York Stock Exchange (NYSE).
This position is for a hands-on technical manager to lead a team of site reliability engineers focused on providing resilient, secure, scalable and supportable services for mortgage borrowers and lenders. You will contribute to the team's strategy and delivery and manage its day-to-day workload. This role requires building close relationships with our customer support, operations, engineering, database and product organizations.
You will be involved in designing resilient systems, defining and monitoring SLIs/SLOs, creating proactive, actionable alerts, and driving production incidents. We operate in a hybrid multi-cloud environment supporting Windows, Linux and container-based applications.
Responsibilities
Provide thought leadership and set the technical direction for the SRE Team.
Define and manage projects to meet Team objectives.
Set individual goals and manage the personal growth of team members.
Manage and troubleshoot a diverse set of SaaS applications and internal services.
Serve as the face of a team responsible for the overall health, performance, and capacity of our business applications.
Develop sustainable SRE practices around simplification and standardization.
Drive the cultural standard for SRE, including defining ways of working, runbooks and accountability across people, processes and technology.
Lead Incident Response and Root Cause Analysis.
Partner with other SRE teams and lead by example.

Knowledge and Experience
3+ years of managing high-performance teams.
10+ years of Application/Systems engineering in 24x7 Production Services environments.
BS in Computer Science, Computer Engineering, Math, or equivalent professional experience.
Experience in designing, deploying and operating SaaS applications and cloud infrastructure (AWS or equivalent, and on-premise virtualized environments).
Excellent troubleshooter spanning systems, networks and code, using a systematic problem-solving approach.
Proven track record of decreasing MTTR (Mean-Time-To-Recovery), increasing MTTF (Mean-Time-To-Failure), and improving overall service quality.
Demonstrated ability to lead Incident Response and root cause analysis (RCA).
Fluency with one or more current-generation scripting languages used by SRE/DevOps professionals (PowerShell, Python, Perl, PHP, Ruby), plus Java/.NET development.
Strong communication skills.