University of California - San Francisco Campus and Health

Observability Engineer

University of California - San Francisco Campus and Health, San Francisco, California, United States, 94199



F_IT COMMAND CENTER
Full Time
80620BR

Job Summary

An Observability Engineer within the Incident Command team plays a critical role in monitoring, evaluating, and optimizing the performance and health of IT systems and applications. This position is pivotal in ensuring that the IT infrastructure operates efficiently and is capable of handling emerging issues swiftly and effectively.

The primary duties of an Observability Engineer include the development and maintenance of monitoring tools and dashboards that provide real-time insights into the operational status of IT systems. This role involves the collection and evaluation of metrics, logs, and traces to proactively detect, diagnose, and resolve performance bottlenecks or anomalies before they escalate into more significant incidents.

Furthermore, the Observability Engineer partners closely with other IT and incident management teams to enhance incident response strategies. They are tasked with improving the observability framework by integrating advanced analytics and machine learning techniques to predict potential system failures and automate response processes.

The Observability Engineer will positively impact UCSF's operations and culture by ensuring UCSF's IT infrastructure is operable, secure, efficient, and effective in service of the University's mission. The Observability Engineer will advance the University's mission by delivering exceptional information technology services comprehensively and consistently across customers and stakeholders. This role will execute UCSF's vision while modeling UCSF's culture and values.

The salary range for this position is $120,300 - $194,600 (Annual Rate).

Department Description

University of California, San Francisco (UCSF) is distinguished as a leading academic healthcare organization, home to groundbreaking discoveries, world-class education, and exceptional healthcare services.
Infrastructure Services (IS) is the backbone of the technological infrastructure, assuring the technical services that enable the academic, medical, and research missions of the organization. Beyond a focus on maintaining systems and resolving issues, we are committed to nurturing the potential of our team members and empowering them to excel. UCSF Infrastructure Services provides 24x7 support to the University community, always upholding the highest level of responsiveness and reliability for our customers.

The Incident Command team within Infrastructure Services operates as a critical support system for the community of medical and health researchers. This team is dedicated to ensuring seamless access to essential IT resources, thereby enabling continuous and vital research work that has a profound impact on human health and well-being. Incident Command's mission is to manage any major IT incidents, such as data breaches or network failures, effectively and swiftly. These incidents could pose potential disruptions to the ongoing research. Operating around the clock, the team's primary objective is to restore standard operations promptly, minimizing any possible disruption to the researchers' work.
The Incident Command team collaborates to diagnose the issue, evaluate its potential impact on research activities, strategize an appropriate solution, and oversee the resolution process.

Required Qualifications

Bachelor's degree, or equivalent combination of experience/training, in one or more of the following fields: computer science, engineering, computer information systems, etc.
5 to 7 years of experience in information technology or Information Technology (IT) Service Management/Customer.
Expertise in using advanced monitoring and observability tools such as Datadog, Spectrum, Prometheus, Grafana, Splunk, or New Relic to track system performance and health.
Advanced ability to analyze and interpret complex data from various sources to diagnose issues and understand system behaviors.
Skilled in responding to and managing incidents efficiently, minimizing downtime and ensuring quick resolution of issues.
Proficiency in automating monitoring tasks using scripting languages such as Python, Bash, PowerShell, Java, YAML, and XML to enhance system efficiency and reliability.
Demonstrated experience using PagerDuty, OpsGenie, or comparable applications.
Excellent communication skills for effectively articulating incident details and collaborating with cross-functional teams to resolve issues.
Advanced problem-solving skills with an ability to think critically and strategically under pressure to address and resolve unforeseen issues swiftly.
Deep understanding of Information Technology (IT) infrastructure including networks, servers, databases, logging, and cloud services to identify and address potential points of failure.
Ability to document incidents, create detailed reports, and maintain clear records of system performance and issues for future reference.
Ability to lead and collaborate with team members in high-stress situations, ensuring effective teamwork and optimal incident handling.
Proficiency in risk management, including risk assessment and mitigation strategies related to IT.
Understanding of compliance requirements relevant to IT operations within the specific sector, such as educational institutions, which may include data protection laws and standards.

Preferred Qualifications

Information Technology Infrastructure Library (ITIL)

About UCSF

The University of California, San Francisco (UCSF) is a leading university dedicated to promoting health worldwide through advanced biomedical research, graduate-level education in the life sciences and health professions, and excellence in patient care. It is the only campus in the 10-campus UC system dedicated exclusively to the health sciences.

Pride Values

UCSF is a diverse community made up of people with many skills and talents. We seek candidates whose work experience or community service has prepared them to contribute to our commitment to professionalism, respect, integrity, diversity and excellence - also known as our PRIDE values.

In addition to our PRIDE values, UCSF is committed to equity, both in how we deliver care and in our workforce. We are committed to building a broadly diverse community, nurturing a culture that is welcoming and supportive, and engaging diverse ideas for the provision of culturally competent education, discovery, and patient care.

Equal Employment Opportunity

The University of California San Francisco is an Equal Opportunity/Affirmative Action Employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, age, protected veteran or disabled status, or genetic information.

Organization : Campus
Job Code and Payroll Title : 000499 INFO SYS ANL 4
Job Category : Clinical Systems / IT Professionals
Bargaining Unit : 99 - Policy-Covered (No Bargaining Unit)
Employee Class : Career
Location : Mission Center Building (SF), San Francisco, CA
Shift : Days
Shift Length : 8 Hours
Additional Shift Details : M-F, 8am-5pm with on-call rotation
