Microsoft

Principal Software Engineer - AIOps

Microsoft, Redmond, Washington, United States, 98052


Overview

With the rise of artificial intelligence (AI), cloud computing, microservices, and containerization, traditional monitoring tools and manual processes are no longer sufficient to ensure optimal performance, availability, and security. AIOps leverages machine learning, analytics, and automation to enable faster detection, diagnosis, and resolution of issues, as well as proactive prevention of problems before they occur. By analyzing large volumes of data from multiple sources, AIOps can identify patterns, anomalies, and correlations that would be difficult or impossible for humans to detect. This helps cloud engineers improve efficiency, reduce downtime, and enhance the overall user experience. In this AI and cloud era, AIOps and AI-driven monitoring systems are becoming increasingly important for today's complex and dynamic cloud service systems.

Brain is an intelligent AIOps health and monitoring platform designed to detect and prevent customer-impacting health issues and to automatically triage, diagnose, and mitigate them. It was first created to solve health and monitoring problems for Azure and the Microsoft cloud. As we continue to make rapid progress for our internal services, we are also starting to make it available to Azure customers. In addition to its intelligent capabilities around anomaly detection, auto-triage, and issue prevention, we are introducing an LLM-based copilot experience that provides on-call engineers with a natural language user interface and a unified intelligence engine to drive AIOps scenarios.

Responsibilities

As a Principal Software Engineer - AIOps, you will be responsible for the following:

Leads the development of architecture and design documents and determines which technologies will be leveraged and how they will interact.
Leads design discussions with the team and shares findings and learnings from investigations, holding ownership for design decisions.
Creates, implements, optimizes, debugs, refactors, and reuses code to establish and improve performance, resilience, maintainability, effectiveness, and return on investment (ROI).
Creates and applies metrics to drive the quality and stability of code, as well as appropriate coding patterns and best practices.
Holds accountability as a Designated Responsible Individual (DRI), working on call to monitor the system, product, or service for degradation, downtime, or interruptions.
Leads efforts to reduce incident volume, looking globally at incidents and providing broad resolutions.
Escalates issues to appropriate owners.
Remains current by investing time and effort into staying abreast of current developments.
Proactively seeks new knowledge and adapts to new trends, technical solutions, and patterns that will improve the availability, reliability, efficiency, observability, and performance of products, while also driving consistency in monitoring and operations at scale, and shares knowledge with other engineers.
Leads efforts to ensure the correct processes are followed to achieve a high degree of security, privacy, safety, and accessibility across solutions and teams.
Creates and assures the presence of visible evidence to demonstrate compliance for products.
Develops and maintains a deep understanding of the implications of onboarding new technologies, following expectations of compliance at Microsoft.
Defines and develops standardized, repeatable, scalable solutions to guarantee quality.
Identifies best practices and coding patterns and provides deep expertise in the coding and validation strategy.
Leads by example and mentors others to produce extensible and maintainable code used across products.
