HAN Staffing
SRE Lead
HAN Staffing, Iselin, NJ
Experience working as an IT Operations Automation Solution Architect for a minimum of 2 years.
Experience implementing AIOps solutions
Strong experience with one AIOps platform (ServiceNow ITOM, Splunk ITSI, Moogsoft)
Strong experience with one orchestration and automation platform (ServiceNow Orchestrator, Ansible Tower, IPCenter)
Exposure to one APM/AIOps tool (Dynatrace, AppDynamics, Datadog, New Relic)
Exposure to at least a couple of infrastructure monitoring tools (for example, SolarWinds, ScienceLogic, Zabbix)
Exposure to one or more RPA platforms (Blue Prism, UiPath, Automation Anywhere, etc.)
5+ years' experience designing, implementing and managing one or more of the following monitoring platforms: AppDynamics / Dynatrace; Datadog / Splunk / Moogsoft; Sensu; New Relic
3-5 years designing, deploying and managing one or more of the following: Graphite, Prometheus, TICK stack
3-5 years designing, deploying and managing log aggregation solutions with either Elastic or Splunk
Proficiency in at least one high-level programming language used for automation (Ansible, Python, Ruby, Go)
Experience developing monitoring integrations into ServiceNow, PagerDuty, Slack, Microsoft Teams
3-5 years' experience as a system administrator in a predominantly Red Hat Linux environment
Proficiency with at least one of the following configuration management tools: Chef, Puppet, Ansible
Understanding of application development and deployment practices, primarily in Java
Experience with monitoring large-scale containerized applications
3+ years developing and designing dashboards in Grafana, Kibana, Tableau or equivalent
Identify operations to automate, and design and implement automation frameworks for provisioning, configuration, and deployment of infrastructure and applications.
Develop and implement effective incident management processes and procedures to minimize service disruptions and mitigate the impact of incidents when they occur.
Coordinate and lead incident response teams to quickly and effectively address incidents.
Analyze incidents to identify root causes and implement measures to prevent similar incidents in the future.
Develop and implement disaster recovery and business continuity plans to minimize service disruptions and data loss.
Develop and implement security best practices and standards to ensure the security and privacy of the data and platform.
Design and implement effective monitoring, logging, and alerting systems to ensure the stability and performance of the platform.
Collaborate with developers, product managers, and operations teams to ensure the platform meets the needs of the business.
Develop and implement continuous integration and delivery pipelines to ensure faster and more reliable software delivery.
Maintain a deep understanding of emerging technologies, best practices, and industry trends, and make recommendations for improvements to the platform.
Establish processes for deploying, managing, and troubleshooting production systems across multiple cloud providers and on-premise infrastructure.
Bachelor's or Master's degree in Computer Science or a related field.
7+ years of experience in designing and managing large-scale, geo-distributed systems, with expertise in cloud computing, networking, and security.
Experience in automating operations tasks and deploying applications using continuous integration and delivery pipelines.
Strong problem-solving skills and ability to analyze incidents and identify root causes.
Passion for learning and staying up-to-date with emerging technologies and best practices.
Excellent communication and collaboration skills
SRE (Site Reliability Engineer) - Monitoring and Observability Engineer