National Grid plc
Platform Owner AIOps SRE
National Grid plc, Waltham, Massachusetts, United States, 02254
National Grid is hiring a Platform Owner AI Ops SRE. This position offers remote flexibility, with the requirement that candidates reside in one of the following states: New York (NY), New Jersey (NJ), Massachusetts (MA), Connecticut (CT), Vermont (VT), Rhode Island (RI), Maine (ME), or New Hampshire (NH).
Job Purpose
As a Platform Owner of AI Ops and SRE, your primary objective is to design and oversee the implementation of complex systems that meet functional and non-functional requirements. You will play a key role in developing system design policies, standards, and innovation processes specific to AI Ops and SRE. Additionally, you will actively monitor emerging technologies and assess their potential impact on the organization. Your responsibilities will include driving the strategic vision for AI Ops and SRE within the platform, ensuring alignment among stakeholders, and promoting a cohesive approach to AI Ops and SRE implementation.
Key Accountabilities
Your key responsibilities as a Platform Owner of AI Ops and SRE include:
Developing AI Ops and Site Reliability Engineering (SRE) Strategies:
You will be responsible for developing strategies that incorporate AI Ops and SRE practices within the data center and cloud domain.
Designing Cloud Architecture Solutions:
You will design cloud and on-premises architecture solutions that integrate AI technologies and SRE principles.
Collaborating with Development and Operations Teams:
You will work closely with development and operations teams to provide technical guidance and ensure the successful implementation of AI Ops and SRE practices.
Implementing AI-Driven Monitoring and Analytics:
You will implement AI-driven monitoring and analytics solutions within the cloud domain.
Establishing Incident Response and Resolution Processes:
You will define and establish incident response and resolution processes aligned with SRE practices.
Driving Continuous Improvement and Optimization:
You will drive continuous improvement and optimization efforts within the cloud domain.
Staying Current with Industry Trends:
You will stay up to date with the latest industry trends, technologies, and best practices related to AI Ops, SRE, cloud, and on-premises computing.
• Creating and delivering traceable and auditable customer success metrics for the platform services/products.
• Monitoring and analyzing platform performance metrics and reporting on the overall health of the platform to senior leadership.
• Managing the infrastructure platform within budget guardrails to ensure alignment with company priorities and goals.
• Collaborating with Transversal Teams to align Non-Functional Requirements (NFRs) and prioritize them jointly.
Requirements
• Bachelor's degree in a relevant discipline, or an equivalent combination of education, training, and experience.
• 7-10 years of related experience.
• Foster a one-team culture with ownership, collaboration, and empathy across functions.
• 5 or more years of people management experience with relevant industry and professional certifications.
• Manage risks and communicate project status, issues, and risks to stakeholders clearly and in a timely manner.
• Collaborate with colleagues and suppliers in different time zones and communicate effectively with both technical and business people.
• Cloud Platforms: 3-5 years of experience with cloud platforms such as Azure (preferred), Amazon Web Services (AWS), or Google Cloud Platform (GCP) is essential for managing and optimizing cloud-based infrastructure.
• Containerization and Orchestration: Proficiency in containerization technologies like Docker and container orchestration platforms like Kubernetes is important for deploying and managing containerized applications at scale.
• Infrastructure-as-Code (IaC): Knowledge of infrastructure-as-code tools such as Terraform or AWS CloudFormation is valuable for automating the provisioning and management of infrastructure resources.
• Monitoring and Observability: Familiarity with monitoring and observability tools like Prometheus, Grafana, ServiceNow, ELK Stack (Elasticsearch, Logstash, Kibana), or Splunk is crucial for monitoring system performance, analyzing logs, and troubleshooting issues.
• Continuous Integration and Continuous Deployment (CI/CD): Experience with CI/CD pipelines and related tools such as GitHub or GitLab CI/CD.
• Configuration Management: Knowledge of configuration management tools like Ansible, Puppet, or Chef is valuable for managing and automating configuration changes across infrastructure and application environments.
• Incident Management: Proficiency in incident management tools like ServiceNow, PagerDuty, or VictorOps, as well as collaboration platforms like Slack or Microsoft Teams, is essential for effective incident response and coordination.
• Networking and Security: Understanding of networking concepts, protocols, and security best practices is important for managing network infrastructure, implementing secure access controls, and ensuring system and data protection.
• Scripting and Programming Languages: Familiarity with scripting languages like Python, Bash, or PowerShell, as well as programming languages like Java, Go, or Ruby, enables automation and customization of various tasks and workflows.
• Database Technologies: Knowledge of database technologies such as MySQL, PostgreSQL, MongoDB, or Redis is valuable for managing and optimizing database systems and ensuring data integrity and availability.
Your Rewards
Rewarding work and a collaborative, team-oriented culture are just the beginning.
Review our digital benefit guide at ngbenefitslivebrighter.com for full details and descriptions.
More Information
This position has a career path that provides advancement opportunities within and across bands as you develop and evolve in the position, gaining experience and expertise and acquiring and applying technical skills.
National Grid is an equal opportunity employer that values a broad diversity of talent, knowledge, experience and expertise. We foster a culture of inclusion that drives employee engagement to deliver superior performance to the communities we serve.