The University of Chicago
Principal HPC System Administrator
The University of Chicago, Chicago, Illinois, United States, 60290
Department: Provost Research Computing Center
About the Department
The University of Chicago Research Computing Center (RCC), a unit in the Office of Research, provides high-end research computing resources to researchers at the University of Chicago. It is dedicated to enabling research by providing access to centrally managed High-Performance Computing (HPC), storage, and visualization resources. These resources include hardware, software, high-level scientific and technical user support, and the education and training required to help researchers make full use of modern HPC technology and local and national supercomputing resources. The Office of Research oversees the conduct of sponsored research, research program development, and contract management functions.
Job Summary
The University of Chicago is seeking a highly qualified Principal HPC System Administrator to join the systems and operations team responsible for designing, configuring, deploying, and maintaining the Research Computing Center (RCC) High-Performance Computing (HPC) systems, as well as managing facility operations. This is a hands-on position in which the incumbent will apply years of expertise and technical skills to complex assignments. The individual in this position will also be involved in the procurement and management of HPC hardware and software. This is a hybrid position requiring 3 days onsite.
Responsibilities
Design, configure, deploy, and maintain large computer clusters, servers and software.
Provide day-to-day operations leadership, including systems administration, monitoring, and storage performance, up to and including network components. Manage the system’s network switch, parallel file system, and HPC software stack and tools.
Monitor, maintain, and optimize HPC systems and software to improve performance and resource utilization.
Serve as the technical lead on complex projects and system related tasks, as needed.
Configure, install, and maintain the job scheduler/workload manager.
Diagnose and resolve system operational problems promptly and effectively. Coordinate with vendors to address hardware and software issues.
Use scripting/programming to enable system-level automation, monitoring, and problem detection.
Build and deploy open-source software as well as software from vendors/partners.
Develop and implement strategies for HPC data management, backup, disaster recovery, and security, ensuring reliable and efficient backup and restores for all managed systems.
Create standard operating procedures for routine and complex system tasks.
Maintain and monitor the security of HPC systems and servers, implementing robust security measures, as applicable.
Troubleshoot and identify failed hardware, implement parts replacement, and resolve system failures.
Stay updated with the latest developments in HPC technologies and apply this knowledge to improve RCC systems.
Solve complex problems to configure, install, upgrade, and maintain server applications and hardware. Work to safeguard the integrity of computer software. Implement operating system enhancements to improve the reliability and performance of the system.
Provide expertise in planning and installing necessary patches and upgrades for servers and their associated storage, network, communications, and peripheral sub-systems. Install and maintain an appropriate level of intrusion detection, monitoring, and auditing software as required.
Perform other related work as needed.
Minimum Qualifications
Education:
Minimum requirements include a college or university degree in a related field.
Work Experience:
Minimum requirements include knowledge and skills developed through 7+ years of work experience in a related job discipline.
Preferred Qualifications
Education:
Bachelor’s degree in Computer Science or closely related field.
Experience:
A minimum of seven years of full-time Linux system administration experience in a large distributed computing environment.
Technical Skills or Knowledge:
Experience with Linux system administration (e.g., RHEL, Rocky, CentOS).
Proficiency in the installation, maintenance, operation, tuning and troubleshooting of Linux and related systems and software.
Experience in installing, configuring, and maintaining a job scheduler/workload manager (such as SLURM, TORQUE, or PBS).
Experience configuring, installing and troubleshooting MPI and OpenMP.
Experience with at least one HPC cluster management tool (e.g., xCAT, Confluent, Warewulf, or Bright).
Experience in configuring, administering, and supporting network storage subsystems.
Hands-on experience with at least one parallel file system (e.g., Spectrum Scale-GPFS, Lustre, BeeGFS, or Ceph).
Direct experience working with InfiniBand, including a working knowledge of InfiniBand concepts, OFED layers, and subnet managers, as well as Gigabit Ethernet.
Experience with networking and security.
Experience with systems automation tools such as Ansible or Puppet.
Experience with versioning tools such as Git or Subversion.
Experience configuring, installing, maintaining and using monitoring and optimization tools.
Strong knowledge of scripting languages such as Python or Bash.
Preferred Competencies:
Ability to work well with faculty and researchers.
Ability to identify and gain expertise in appropriate new technologies and/or software tools.
Ability to function as part of an interactive team while demonstrating self-initiative to achieve project goals and the Research Computing Center's mission.
Strong analytical skills and problem-solving ability.
Application Documents
Cover letter (preferred)
Resume (required)
When applying, the document(s) MUST be uploaded via the My Experience page, in the section titled Application Documents of the application.
Job Family: Information Technology
Role Impact: Individual Contributor
FLSA Status: Exempt
Pay Frequency: Monthly
Scheduled Weekly Hours: 37.5
Benefits Eligible: Yes
Drug Test Required: No
Health Screen Required: No
Motor Vehicle Record Inquiry Required: No
Posting Statement
The University of Chicago is an Affirmative Action/Equal Opportunity/Disabled/Veterans Employer and does not discriminate on the basis of race, color, religion, sex, sexual orientation, gender, gender identity, national or ethnic origin, age, status as an individual with a disability, military or veteran status, genetic information, or other protected classes under the law. For additional information, please see the University's Notice of Nondiscrimination.
Staff job seekers in need of a reasonable accommodation to complete the application process should call 773-702-5800 or submit a request via the Applicant Inquiry Form.
We seek a diverse pool of applicants who wish to join an academic community that places the highest value on rigorous inquiry and encourages a diversity of perspectives, experiences, groups of individuals, and ideas to inform and stimulate intellectual challenge, engagement, and exchange.
All offers of employment are contingent upon a background check that includes a review of conviction history. A conviction does not automatically preclude University employment. Rather, the University considers conviction information on a case-by-case basis and assesses the nature of the offense, the circumstances surrounding it, the proximity in time of the conviction, and its relevance to the position.
The University of Chicago's Annual Security & Fire Safety Report (Report) provides information about University offices and programs that provide safety support, crime and fire statistics, emergency response and communications plans, and other policies and information. The Report can be accessed online at: http://securityreport.uchicago.edu. Paper copies of the Report are available, upon request, from the University of Chicago Police Department, 850 E. 61st Street, Chicago, IL 60637.