IBM

Infrastructure & Technology Site Reliability Engineer - Apptio Professional Mult

IBM, Lowell, Massachusetts, United States, 01856


Introduction

At IBM, work is more than a job - it's a calling: To build. To design. To code. To consult. To think along with clients and sell. To make markets. To invent. To collaborate. Not just to do something better, but to attempt things you've never thought possible. Are you ready to lead in this new era of technology and solve some of the world's most challenging problems? If so, let's talk.

Your Role and Responsibilities

You:
You are passionate about observability, automation, and reliability. Your team can count on you to deliver creative and inventive solutions to hard problems. You are comfortable working with developers, senior leadership, and non-technical individuals to help provide value to the broader organization. You take opportunities to fix problems, mentor your peers, and step outside your comfort zone to develop your skill set.

Us:
Apptio Targetprocess empowers businesses to adopt and scale agile across the enterprise. We develop Agile tools that connect teams, products, and portfolios to business objectives using SAFe, LeSS, and other Agile frameworks. In the 2021 Gartner Magic Quadrant for Enterprise Agile Planning Tools report, Apptio’s recently acquired Targetprocess was recognized as a “Leader”.

SRE Team:
The Apptio Targetprocess SRE team’s main responsibility is to ensure that the company’s infrastructure and applications run smoothly and stably. We count on our site reliability engineers (SREs) to give our users a rich feature set, high availability, and stellar performance so they can pursue their missions. This mostly means working proactively on system reliability, preventing outages, observing key metrics, taking urgent mitigation measures when needed, and assisting other teams on infrastructure-related topics.

On a typical day in this role, you will interact with Kubernetes, Docker, Helm, Elasticsearch, DataDog, Grafana, Sensu, Puppet, Ansible/AWX, AWS, Azure, Python/Bash/PowerShell, and Terraform/Terragrunt.
If you don't know all these tools, don't worry; we do not expect you to know them all, as we understand that technology evolves quickly.

Major Responsibilities:
- Scale systems sustainably through mechanisms like automation.
- Take ownership of monitoring systems.
- Maintain services in production by measuring and monitoring availability, latency, and overall system health.
- Support application expansion and horizontal scaling.
- Work closely with developers, support, and QA teams on maintaining and improving the whole lifecycle of services.
- Practice sustainable incident response and blameless post-mortems.
- Provide primary operational support and engineering for multiple large distributed software applications.

Required Technical and Professional Expertise
- Knowledge of configuration management tools (e.g., Ansible or Puppet)
- Experience with any scripting language (Bash, Python, PowerShell, etc.)
- Experience with containerization (e.g., Docker, Podman)
- Experience with container orchestration tools (e.g., Kubernetes, OpenShift, Docker Swarm)
- Experience with database administration and management (MS SQL Server, PostgreSQL, MongoDB)
- Familiarity with public cloud providers such as AWS, Azure, or IBM Cloud
- Experience with monitoring, observability, and logging (e.g., DataDog, Prometheus, Grafana, ELK stack, Loki)
- Familiarity with RESTful systems and their APIs
- Experience with any high-level programming language (Golang, .NET, Java, etc.) is a plus
- Fluent English language skills

Preferred Technical and Professional Expertise
- Ability to thrive when working autonomously
- Experience in a large-scale, distributed Linux/Unix or Windows environment is a plus
- Mentoring peers and sharing skills
- Great communication skills
