Principal Engineer, Cloud SRE/DevOps (SD/DC/Remote) (R3138)
The Rundown AI, Inc. - Washington, DC, US, 20022
Overview
What You'll Do:
- Optimize cloud deployments of Forge to ensure scalability, reliability, and cost efficiency.
- Design and document processes for external customers to deploy Forge instances using the SDK in on-premises or hybrid environments.
- Manage and maintain internal Hivemind instances, ensuring their ability to handle large-scale simulation and testing workloads.
- Collaborate with the software operations team to enhance Forge’s ability to scale dynamically, accommodate bursts of use, and support continuous upgrades with minimal disruption.
- Develop tools and processes for canary deployments, ensuring smooth rollouts of new features and updates.
- Serve as the primary technical consultant for the customer engagement team, providing expertise on deploying and managing Forge in external environments.
- Create and maintain detailed, user-friendly documentation and tutorials for deployment processes, catering to both internal teams and external customers.
- Monitor, troubleshoot, and resolve issues related to Forge deployments, ensuring high availability and performance.
Required Qualifications:
- Typically requires a minimum of 15 years of related experience with a Bachelor’s degree; or 14 years with a Master’s degree; or 12 years with a PhD; or equivalent experience.
- 10+ years of experience in DevOps, Site Reliability Engineering, or cloud infrastructure roles.
- Expertise in cloud platforms such as AWS, Azure, or GCP, including deploying and managing scalable, distributed systems.
- Strong experience with Kubernetes and containerization.
- Experience creating Helm charts.
- Solid understanding of infrastructure-as-code tools like Terraform, CloudFormation, or similar.
- Proficiency in scripting and programming languages such as Python, Golang, or Bash.
- Demonstrated experience optimizing CI/CD pipelines, implementing canary deployments, or working with tools like ArgoCD and FluxCD.
- Familiarity with networking concepts and protocols, as well as system monitoring tools (e.g., Prometheus, Grafana).
- Experience deploying and configuring databases such as Postgres.
- Excellent technical writing skills, with a proven ability to create clear, comprehensive documentation and tutorials.
- BS/MS in Computer Science, Engineering, or equivalent practical experience.
- Ability to work cross-functionally and communicate effectively with engineering, operations, and customer-facing teams.
Preferred Qualifications:
- Experience with secure software deployments in regulated industries such as aerospace, defense, or finance.
- Systems software development experience using programming languages like C++, Rust, or Golang.
- Experience building software development kits or productized tools for deploying cloud systems.
- Knowledge of hybrid and on-premises deployment strategies and challenges.
- Hands-on experience with database performance optimization and scaling strategies.
- Familiarity with configuration management tools like Ansible, Chef, or Puppet.
- Experience building robust monitoring and alerting systems for mission-critical applications.
- Background in managing high-throughput simulation or testing environments.