The University of Chicago
HPC Systems & Operations Manager
The University of Chicago, Chicago, Illinois, United States, 60290
This job was posted by https://illinoisjoblink.illinois.gov : For more information, please see: https://illinoisjoblink.illinois.gov/jobs/12331192
Department
Provost Research Computing Center
About the Department
The University of Chicago Research Computing Center (RCC), a unit in the Office of Research, provides high-end research computing resources to researchers at the University of Chicago. It is dedicated to enabling research by providing access to centrally managed High-Performance Computing (HPC), storage, and visualization resources. These resources include hardware, software, high-level scientific and technical user support, and the education and training required to help researchers make full use of modern HPC technology and local and national supercomputing resources. The Office of Research oversees the conduct of sponsored research, research program development, and contract management functions.
Job Summary
The job manages a team of professional staff responsible for designing automated, scalable, and rapidly deployable solutions to infrastructure development and server configuration. Manages the provision of hands-on maintenance for production servers as well as Windows and Linux servers.
The University of Chicago is seeking a highly qualified HPC Systems & Operations Manager to oversee the systems and operations team responsible for designing, configuring, deploying, and maintaining the Research Computing Center (RCC) High-Performance Computing (HPC) systems, as well as managing facility operations. This hands-on role will involve active participation in day-to-day systems operations. The individual in this position will also be involved in the procurement and management of HPC hardware and software.
This is a hybrid position requiring 3 days onsite.
Responsibilities
Lead the design, configuration, deployment, and management of RCC HPC systems.
Ensure the stability, integrity, and efficient operation of RCC HPC systems that support core organizational functions.
Monitor, maintain, and optimize HPC systems and software to improve performance and resource utilization.
Manage a growing team of HPC system administrators and systems programmers to ensure reliable service delivery.
Oversee the project management of the team's initiatives, ensuring that all projects receive the necessary management oversight and resources for successful completion.
Serve as the primary point of contact for other university units regarding systems and operations-related matters.
Diagnose and resolve system operational problems promptly and effectively, coordinating with vendors to address hardware and software issues.
Foster automation within HPC systems.
Troubleshoot and identify failed hardware, implement parts replacement, and resolve system failures.
Develop and implement strategies for HPC data management, backup, disaster recovery, and security.
Create standard operating procedures for routine and complex system tasks.
Maintain and monitor the security of HPC systems and servers, implementing robust security measures as necessary.
Provide technical leadership, guidance, and support to the HPC systems and operations team.
Manage a single team's progress by maintaining accurate and up-to-date logs, and ensure that all projects have the necessary management oversight and approvals for successful completion.
Ensure the implementation of approved best practices and information technology policies that result in the highest quality systems administration.
Perform other related work as needed.
Minimum Qualifications
Education:
Minimum requirements include a college or university degree in a related field.
---
Work Experience:
Minimum requirements include knowledge and skills developed through 7+ years of work experience in a related job discipline.
---
Certifications:
---
Preferred Qualifications
Education:
Advanced degree strongly preferred.
Experience:
A minimum of seven years of Linux system administration experience in a large, distributed computing environment.
At least three years' experience in providing support for a Linux HPC cluster used for scientific research strongly preferred.
Technical Skills or Knowledge:
Experience with Linux system administration (e.g., RHEL, Rocky, CentOS).
Proficiency in the installation, maintenance, operation, tuning, and troubleshooting of Linux and related systems and software.
Experience in installing, configuring, and maintaining a job scheduler/workload manager (such as SLURM, TORQUE, or PBS).
Experience in configuring, installing, and troubleshooting MPI and OpenMP.
Experience with at least one HPC cluster management tool (e.g., xCAT, Confluent, Warewulf, or Bright).