Amazon

Principal Technical Program Manager, Enterprise Engineering Availability

Amazon, Austin, Texas, US, 78716

Description

Amazon strives to be the world's most customer-centric company. To succeed, our products and services must be available at all times to our customers. The Enterprise Engineering Availability (EEA) team is responsible for improving the availability of internal systems (software, hardware, network) used by millions of Amazonians around the world.

A Principal Technical Program Manager on the EEA team will create multi-year programs that drive engineering best practices (testing, deployments, resilience, incident management) across multiple orgs, resulting in a best-in-class IT experience that boosts employee productivity. Your programs will improve the performance of software teams and bolster the resilience of the software built by those teams. You will create a closed loop between incident response and incident prevention by analyzing the top root causes of problems and then designing programs to eradicate those classes of problems going forward. Your analyses will identify opportunities for continual reduction of MTTR by improving automatic detection, diagnosis and mitigation recommendations. You will drive efforts to improve system telemetry and observability, which will result in better prediction, detection and triage of customer-impacting outages.

This role is a perfect fit for an experienced technologist who is passionate about availability (alerting, metrics, monitoring, observability), incident management and machine learning. You thrive in a fast-paced, startup-like environment, are comfortable with full-stack applications, communicate effectively with all types of stakeholders (tech, non-tech), enjoy learning new technology and lead through others to ship complex software at scale in fast iterations.

A day in the life

* Deliver high-impact, high-visibility projects that improve the productivity of millions of Amazonians around the world.
* Invent processes, tools, and technology to force-multiply the effect of your contributions across many organizations.
* Be responsible for owning, scoping, leading and delivering projects and experiments end-to-end, leveraging statistical evaluation, pattern recognition, and machine learning.

Basic Qualifications

- 7+ years of technical product or program management experience
- 10+ years of experience working directly with engineering teams
- 5+ years of software development experience
- Experience in hands-on work managing complex technology projects
- Experience managing technical programs across cross-functional teams, building processes and coordinating release schedules
- Experience owning/driving roadmap strategy and definition
- Experience building and maintaining large-scale, high-availability distributed systems
- Excellent oral and written communication skills with both technical and non-technical stakeholders
- Experience using data to make priority decisions and taking those initiatives from scoping through production launch into daily operation
- Ability to identify and solve ambiguous problems in short timeframes by using experiments and research spikes with limited oversight/direction
- Experience influencing engineering team members on best practices (full SDLC inclusive of coding standards, code reviews, source control management, build processes, testing, and operations)
- Understanding of CI/CD, test automation and robust system health monitoring (metrics, monitors, alarms)
- Experience developing software in quick iterations (Agile, Kanban, Scrum, etc.)
- Bachelor's degree in Computer Science or a related technical field, OR equivalent training and industry experience

Preferred Qualifications

- 8+ years of hands-on experience managing complex technology projects
- Experience managing projects across cross-functional teams, building sustainable processes and coordinating release schedules
- Experience with incident management
- Experience with telemetry and observability systems
- Experience building large-scale, end-to-end distributed systems
- Production experience with AWS services & tools (IaaS, PaaS, APIs, tools)
- Experience with anomaly detection, time-series data and storage, data streaming
- Experience with Site Reliability Engineering (SRE) concepts and practices
- Experience with statistical analysis and machine learning

Amazon is committed to a diverse and inclusive workplace. Amazon is an equal opportunity employer and does not discriminate on the basis of race, national origin, gender, gender identity, sexual orientation, protected veteran status, disability, age, or other legally protected status. For individuals with disabilities who would like to request an accommodation, please visit https://www.amazon.jobs/en/disability/us.