Grab

Lead Software Reliability Engineer - Business Ecosystem

Grab, Houston, Texas, United States, 77246


Life at Grab

At Grab, every Grabber is guided by The Grab Way, which spells out our mission, how we believe we can achieve it, and our operating principles - the 4Hs: Heart, Hunger, Honour, and Humility. These principles guide and help us make decisions as we work to create economic empowerment for the people of Southeast Asia.

Get to know the Team

The Business & Transaction Platform, SNP, and DNA SRE team is a longstanding team responsible for the stable operation of the core Grab systems. We make an impact by contributing to Business & Transaction Platform, Search & Personalization, Demand, and Ads systems, as well as the company's stability and operational excellence. Our team is made up of a group of passionate Site Reliability Engineers. If you are looking for an opportunity to work in a large-scale cloud environment and utilize your sharp ideas to make engineers’ lives better, then you should join our team!

Get to know the Role

We are looking for a Lead Software Reliability Engineer to improve stability and operational excellence for the Business & Transaction Platform, SNP, and DNA tech families at Grab. A successful candidate typically has professional sysops/infrastructure knowledge and the ability to build comprehensive systems, but if you believe you have what it takes, we’d love to hear from you either way. The role exists because stability and operational excellence are critical to our services. In return, you will get the opportunity to make an impact on Grab’s core systems.

The Day-to-Day Activities

Engage in and improve the whole lifecycle of services - from design, through deployment, operation, and refinement.

Work with engineering teams to design and write code to create systems that are highly available and able to scale seamlessly.

Work with engineering teams to solve reliability, stability, and scalability challenges.

Get involved in deep diagnosis of incidents, and engage with multiple highly skilled engineering teams on resolutions.

Maintain services once they are live by measuring and monitoring availability, latency, and overall system health.

Contribute to a culture of learning and responsibility by guiding teams to write detailed postmortem reports.

Identify and resolve problems relating to critical service operations and prevent their recurrence using automation.

Be part of a cool team responsible for one of the largest cloud-based services in Southeast Asia.

Mentor other engineers, define our technical culture, set high engineering bars, and help build a fast-growing team.

Lead other engineers in delivering challenging projects to a high standard of quality.

Contribute initiatives that improve the tech family’s stability and operational excellence.

The Must-Haves

Bachelor's or Master's degree in Computer Science, Software Engineering, Information Technology, or related technical field involving coding.

At least 5 years of relevant experience in a similar role is preferred.

Strong experience with algorithms, data structures, complexity analysis, and software design.

Strong experience in one or more of the following: Go, Python, C, C++, Java, Perl, or Ruby.

Strong experience in using service monitoring, logging, and alarm-related environments and tools.

Strong experience in system troubleshooting in a Linux environment.

Solid experience in using Linux commands and shell scripting, coupled with the ability to automate routine tasks.

Solid experience with automation & provisioning tools (e.g. Jenkins, Ansible, Chef, SaltStack, Puppet).

Possess analytical skills, mental resilience, and the ability to think systematically under stressful conditions.

Highly accountable and takes ownership. Outstanding work ethic, high integrity, a team player, and a lifelong learner.

Proficiency in verbal and written English.

The Nice-to-Haves

Experience in Go.

Experience with cloud-based large-scale infrastructure from vendors such as Amazon Web Services, Azure, or Google Cloud Platform.

Experience with containerization technologies (e.g. Docker) and container orchestration platforms (e.g. Kubernetes).

Experience building high-throughput streaming services, and knowledge of stream processing frameworks such as Flink.

Contributions to open source projects, and experience with performance analysis and debugging tools.

Our Commitment

We recognize that with these individual attributes come different workplace challenges, and we will work with Grabbers to address them in our journey towards creating inclusion at Grab for all Grabbers.
