Tractor Supply

Engineer, IT Cloud Site Reliability

Tractor Supply, Brentwood, Tennessee, United States, 37027


Overall Job Summary

The Cloud Site Reliability Engineer (SRE) role is multifaceted, combining elements of software engineering, system administration, and IT operations. Cloud SREs are responsible for ensuring the reliability, performance, and scalability of systems by focusing on system design, automation, monitoring, incident management, performance tuning, collaboration, and security. Their efforts directly impact the stability and efficiency of critical systems, enabling organizations to deliver reliable and efficient services at scale. This role requires a blend of technical expertise, problem-solving skills, and effective communication, making it essential for the success of modern, complex infrastructures.

Essential Duties and Responsibilities

Vendor Management

Negotiate with vendors, build strong vendor relationships, network effectively, manage multiple vendors, identify financial risks, and evaluate new vendors.

Maintain industry awareness, apply strong people skills, and make effective decisions.

Manage vendors effectively by monitoring performance, managing risks, tracking key performance indicators, and ensuring compliance with regulations.

Coordinating Team Efforts

Coordinate efforts with teams located onsite, offshore, nearshore, and across multiple vendors, providing clear direction, setting expectations, and motivating team members to achieve common goals.

Ensure that tasks are assigned, schedules are aligned, and resources are allocated effectively across teams and vendors.

Establish regular communication channels and protocols to ensure that information is shared, feedback is provided, and issues are addressed in a timely manner.

Understand and respect cultural differences and work styles of team members and vendors from different regions.

System Design and Architecture:

Collaborate with software engineers to identify and mitigate risks to system availability and reliability.

Automation and Tooling:

Develop and maintain automation tools to streamline operations and reduce manual interventions.

Monitoring and Incident Management:

Help improve monitoring and alerting systems.

Respond to incidents, perform root cause analysis, and implement permanent fixes to prevent recurrence, maintaining detailed documentation.

Performance and Scalability:

Conduct performance tuning, optimization, and capacity management of systems to handle increasing loads and demand.

Collaboration and Communication:

Communicate effectively with stakeholders about system performance, incidents, and improvements.

Foster a culture of reliability and continuous improvement across the organization.

Security and Compliance:

Ensure that systems and infrastructure comply with security best practices and regulatory requirements.

Required Qualifications

Experience: 4+ years of related work experience. Experience in the retail industry preferred.

Education: Bachelor’s degree in Computer Science, Engineering, or a related field, or equivalent practical experience. Any combination of education and experience will be considered.

Professional Certifications: None

High-Demand Specialized IT Skills:

Platform knowledge (UNIX, Linux, Windows); public and private cloud technologies (AWS, Google Cloud, Azure); containerization technologies (Docker, Kubernetes); hyper-converged platforms (Nutanix, SimpliVity); VMware vSphere 6; Microsoft applications (Active Directory, Exchange, O365, and server OS); AHV; SaltStack.

Preferred Knowledge, Skills, or Abilities

Knowledge of ITIL Foundation concepts, practices, and procedures preferred.

Knowledge of continuous improvement concepts preferred.

Experience with programming and scripting languages (Python, Go, Java, Bash).

Experience with monitoring and logging tools (Prometheus, Grafana, ELK stack).

Excellent problem-solving skills and the ability to work under pressure.

Strong communication and collaboration skills, with a focus on teamwork and knowledge sharing.

Strong enterprise application support experience.

Strong process management skills.

Ability to manage ITSM tools and enterprise support tools.

Understanding of data integration concepts.

Knowledge of SDLC Waterfall and Agile methodologies preferred.

Working Conditions

Normal office working conditions

Physical Requirements

Sitting

Standing (not walking)

Walking

Lifting up to 20 pounds

Disclaimer

This job description represents an overview of the responsibilities for the above-referenced position. It is not intended to represent a comprehensive list of responsibilities. A team member should perform all duties as assigned by his/her supervisor.

ALREADY A TEAM MEMBER?

You must apply or refer a friend through our internal portal (https://performancemanager4.successfactors.com/sf/home?company=tractorsup).

CONNECTION

Our Mission and Values are more than just words on the wall - they’re the one constant in an ever-changing environment and the bedrock on which we build our culture. They're the core of who we are and the foundation of every decision we make. It’s not just what we do that sets us apart, but how we do it.

EMPOWERMENT

We believe in managing your time for business and personal success, which is why we empower our Team Members to lead balanced lives through our benefits and total rewards offerings for full-time and eligible part-time TSC and Petsense Team Members. We care about what you care about!

OPPORTUNITY

A lot of care goes into providing legendary service at Tractor Supply Company, which is why our Team Members are our top priority. Want a career with a clear path for growth? Your Opportunity is Out Here at Tractor Supply and Petsense.

Nearest Major Market: Nashville