Epsilon Data Interactive, Inc.

Director Incident Management

Epsilon Data Interactive, Inc., Indiana, Pennsylvania, US, 15705


When you’re one of us, you get to run with the best. For decades, we’ve been helping marketers from the world’s top brands personalize experiences for millions of people with our cutting-edge technology, solutions, and services. Epsilon’s best-in-class identity gives brands a clear, privacy-safe view of their customers, which they can use across our suite of digital media, messaging, and loyalty solutions. We process 400+ billion consumer actions each day and hold many patents of proprietary technology, including real-time modeling languages and consumer privacy advancements. Thanks to the work of every employee, Epsilon India is now Great Place to Work-Certified. Epsilon has also been consistently recognized as industry-leading by Forrester, Adweek, and the MRC. Positioned at the core of Publicis Groupe, Epsilon is a global company with more than 8,000 employees around the world. For more information, visit epsilon.com/apac or our LinkedIn page.

Job Description

About the Role

We seek a seasoned, strategic leader with exceptional product engineering, operational, and technical acumen to spearhead our incident management and SRE function across the Product Engineering organization. This high-pressure, 24/7 role demands a relentless focus on driving operational excellence, minimizing system downtime, and ensuring rapid incident resolution. As a key member of the leadership team, you will be instrumental in defining and executing our incident management strategy, fostering a culture of reliability, and mitigating risks across our product portfolio.

What you’ll do:

Primary Responsibilities

Incident Management Strategy and Leadership:
Develop, implement, and continuously refine incident management strategies, policies, and procedures aligned with business objectives and regulatory requirements.
Serve as the primary escalation point for all critical incidents, providing strategic direction and coordinating cross-functional response efforts.
Lead and manage the incident management team, providing direction, guidance, and mentorship.
Develop and implement incident management strategies, policies, and procedures to ensure rapid and effective incident resolution.
Serve as the primary point of contact for all major incidents, coordinating response efforts and ensuring timely resolution.

Incident Response and Resolution:
Drive rapid incident resolution by orchestrating cross-functional teams, leveraging data-driven decision-making, and ensuring effective communication.
Conduct post-incident reviews to identify root causes, implement corrective actions, and prevent recurrence.
Oversee the incident response process, ensuring incidents are accurately identified, categorized, prioritized, and resolved.
Coordinate cross-functional teams to resolve incidents, including IT, security, operations, and business stakeholders.
Ensure detailed incident reports are created and communicated to relevant stakeholders.

Risk Management and Prevention:
Proactively identify, assess, and mitigate risks to system availability and performance.
Collaborate with engineering teams to implement preventive measures and improve system resilience.
Develop and implement risk management strategies to proactively identify, assess, and mitigate potential risks that could impact business operations and system reliability.
Conduct regular risk assessments and vulnerability analyses to identify potential threats and weaknesses in IT infrastructure and processes.
Establish and enforce preventative measures and controls to minimize the likelihood of incidents occurring, including implementing robust security protocols, conducting regular system audits, and ensuring compliance with industry standards and best practices.
Collaborate with IT, security, and operations teams to design and implement effective preventative maintenance plans and system updates.

Continuous Improvement and Operational Excellence:
Champion a culture of operational excellence and reliability.
Drive initiatives to improve incident response time, mean time to repair (MTTR), and overall system uptime.
Analyze incident data to identify trends, root causes, and areas for improvement.
Drive continuous improvement initiatives to enhance incident management processes, reduce incident recurrence, and improve overall system reliability.
Implement best practices and industry standards in incident management.

Stakeholder Communication and Management:
Build and maintain strong relationships with senior leadership, product management, engineering, operations, and other key stakeholders.
Communicate incident status, impact, and recovery plans effectively and transparently.
Maintain clear and effective communication with senior management, providing regular updates on incident status and impact.
Ensure all stakeholders are informed and engaged throughout the incident lifecycle.
Develop and deliver incident management training and awareness programs for staff.

Performance Monitoring and Reporting:
Establish and monitor key performance indicators (KPIs) for incident management.
Prepare and present incident management performance reports to senior leadership.
Ensure compliance with regulatory requirements and internal policies.

Crisis Management:
Develop and maintain a comprehensive crisis management plan.
Lead crisis response efforts during major incidents, ensuring business continuity and minimal disruption.

SRE Implementation and Governance:
Lead a team of site reliability engineers responsible for tracking SLOs, solutioning recurring production incidents, and implementing SRE principles of observability, alerting, error budgeting, and continual improvement for cloud-native, distributed SaaS systems.

Team Development and Performance:
Build, lead, and mentor high-performing, globally distributed incident management and SRE teams.
Foster a culture of ownership, accountability, and continuous improvement.
Develop and implement performance metrics and reporting to measure team effectiveness and identify areas for enhancement.

Additional Information

Epsilon is committed to promoting diversity, inclusion, and equal employment opportunities by using reasonable efforts to attract, recruit, engage, and retain qualified individuals of all ethnicities and backgrounds, including, but not limited to, women, people of color, LGBTQ individuals, people with disabilities, and any other underrepresented groups, traits or characteristics.
