Matlen Silver

Site Reliability Engineer

Matlen Silver, Boca Raton, FL, United States

Compensation: $70 - $75/Hour

Hybrid: 2 Days Onsite Boca Raton, Florida

Domain: Retail/Supply Chain

Job Title: Site Reliability Engineer

Position Summary

As a Site Reliability Engineer/DevOps Engineer, you will be responsible for ensuring the availability, performance, and reliability of Fulfillment Technology solutions that support our client’s omni-channel strategy. You will work closely with the development, testing, and operations teams to design, implement, and maintain scalable, reliable, and efficient solutions for the production environment. You will troubleshoot and resolve issues in the production systems using monitoring, logging, alerting, automation, and incident management, and you will contribute to the continuous improvement of DevOps practices and processes such as CI/CD, configuration management, infrastructure as code, and cloud computing. The role calls for a strong background in software engineering, system administration, networking, and cloud technologies, along with excellent communication and collaboration skills and a passion for learning new technologies and solving complex problems.

Minimum Position Qualifications

Bachelor’s Degree in Computer Science/Engineering or related field
4+ years of experience in cloud SRE, DevOps, infrastructure, or a related field
4+ years of experience working with databases, web applications, microservices, event-driven applications, messaging systems, REST APIs and integrations, cloud, support tools, observability, and containerization technologies
Knowledge of Java, Spring Boot, microservices, Kafka, Cassandra, and SQL Server
Proficiency in scripting languages such as Python or shell
1 year of experience managing system observability tools (Dynatrace, ELK, PagerDuty, Datadog, Azure Monitor, Grafana, etc.)
Hands-on experience with GitHub Actions for CI/CD automation
Knowledge of Linux architecture, security, administration, performance monitoring/tuning, troubleshooting, and production operations
Demonstrated skill in working in an Agile environment
Demonstrated skill in working with multi-location global teams
Proven ability to think and contribute at the strategic level
Demonstrated knowledge of eCommerce, Fulfillment, or Retail Technology solutions
Demonstrated written, oral, and presentation/public speaking skills

Desired Previous Experience/Education

Master’s Degree or PhD in computer science, information systems, or related field
4+ years of experience designing or working on high-volume eCommerce applications
2+ years of experience configuring and managing cloud infrastructure (Azure, AWS, GCP)
1 year of experience with technologies such as Apache Kafka, Azure Cosmos DB, Apache Cassandra, Ansible, Terraform, Docker and Kubernetes
Experience with Nginx, HAProxy, Squid
Experience with CI/CD pipelines using tools such as Jenkins, Spinnaker, Azure DevOps, TeamCity, etc.
Proficiency in implementing and managing RoyalTS or similar cross-platform remote management solutions, ensuring secure and efficient remote access and system administration across diverse environments

Key Responsibilities

Essential Job Functions

Partner and collaborate with application engineering, observability, and other support teams within our client’s organization, as well as our business operation partners and third parties (as appropriate), to prioritize, address, and drive the resolution of issues and incidents that impact customer pickup or delivery domains
Drive root-cause analysis of critical business and production issues to prevent future occurrences, and review and approve potential solutions
Lead Major Incident calls impacting the Pickup Fulfillment domain and provide clear, timely updates on the status of service restoration to key stakeholders
Work with the engineering teams to continuously implement and improve fast, reliable build environments
Increase automation to improve efficiency and quality
Ensure traceability, observability, and retrievability of system behavior
Build logging, monitoring, and alerting systems to identify bottlenecks and assist with debugging, analysis, and optimization in cloud, on-prem, and store environments
Craft solid and clearly explained designs, playbooks, and documentation
Participate in an off-hours on-call rotation, and perform periodic off-hours work during maintenance windows