Santa Clara University

High Performance Computing (HPC) Linux Systems Administrator

Santa Clara University, Santa Clara, California, US, 95053


Position Title: High Performance Computing (HPC) Linux Systems Administrator

Position Type: Regular

Hiring Range:

$112,100 - $131,800 / annual; Compensation will be based on education, experience, skills relevant to the role, and internal equity.

Pay Frequency: Annual

Company Overview:

Santa Clara University is a prestigious academic institution dedicated to advancing research, innovation, and education. We are seeking a skilled and experienced High Performance Computing (HPC) Systems Administrator to join our dynamic team and contribute to the optimization and management of our HPC infrastructure, supporting groundbreaking research across various disciplines.

Job Description:

As a High Performance Computing (HPC) Systems Administrator on the team supporting the SCU High Performance Computing environment, you will play a pivotal role in the administration, maintenance, and optimization of our HPC systems within the academic environment. Working closely with researchers, faculty, and IT professionals, you will ensure the smooth operation of our computational infrastructure, enabling cutting-edge research and academic excellence. You will work closely with the SCU architectural leaders and systems administrators to provide systems automation, DevOps, and user support for research computing. You will collaborate with SCU researchers and scientists to advance their research projects by enabling and optimizing their application pipelines for AI, data science, and graphics GPU processing. You will maintain and expand the existing HPC cluster and parallel storage systems as necessary.

The ideal candidate for this position is curious, creative, tenacious, and self-directed, and demonstrates a strong work ethic; is productive working independently as well as collaboratively; is analytical and can identify, define, interpret, and resolve both technical and human issues.

This position requires on-site support on a regular basis. On-campus vs. remote schedule will be hybrid on an as-needed basis depending on current tasks.

Key Responsibilities:

- Install, configure, and maintain HPC hardware and software components, networking architectures, including InfiniBand fabric, and parallel file systems.
- Oversee and monitor system performance, troubleshoot issues, and optimize system configurations to ensure maximum efficiency and reliability.
- Take responsibility for the HPC facility and incorporate industry best practices into the facility.
- Provide advocacy and outreach across the university; train and teach researchers and teams as needed.
- Develop and implement security measures to protect HPC systems and data from unauthorized access and cyber threats.
- Contribute to the development of the HPC center's strategic vision, and use this vision to create a common focus.
- Manage user accounts, access permissions, and job scheduling in accordance with university policies and best practices.
- Plan and execute system upgrades, patches, and maintenance activities to minimize downtime and ensure system stability.
- Manage and document system configurations, procedures, and troubleshooting guidelines to facilitate knowledge sharing and maintain system integrity.
- Stay current with emerging technologies and industry trends in HPC to recommend and implement innovative solutions that enhance system performance and capabilities.
- Consult regularly with faculty and staff on their complex computational requirements.
- Monitor feedback from researchers to identify and address gaps in services, constantly striving for quality and excellence.
- Collaborate with vendors and external partners to evaluate and procure HPC hardware and software components as needed.
- Analyze user workflows to identify opportunities for parallelism or efficiency improvements.
- Interact and collaborate with researchers and faculty to understand their computational requirements and provide support and guidance on utilizing HPC resources effectively.
- Design and execute innovative and high-quality programs and services that meet the current and future needs of SCU researchers.
- Provide expert computational and data analytic technical assistance, including complex problem solving and programming support across different departments.
- Provide training and support to users on HPC system usage, optimization techniques, data organization, storage, and sharing best practices.
- Provide training in code-management best practices (such as using Git, GitHub).
- Work with senior leadership to develop strategies and implement tactics that will successfully ensure the fulfillment of SCU's research-computing goals, and to enable and amplify the work of SCU researchers across campus.
- Develop long-term, strategic relationships and partnerships with providers of national resources (such as ACCESS, Globus, NRP, OSG) to assist researchers in finding, getting access to, and optimizing their use of those resources; define and maintain gateways to those resources.
- Facilitate the growth of corporate and foundation giving.
- Support the development, execution, and reporting on externally supported research.

Required Qualifications:

- Bachelor's degree in computer science, engineering, or a related field; advanced degree required.
- 8-10 years of experience required, including management experience.
- Experience providing direct user support and customer service with demonstrated success.
- Experience installing, monitoring, and optimizing the performance of scientific applications in an HPC cluster.
- Five years of experience with systems automation scripting in at least one of the following: Bash, Perl, Python, Puppet, Ansible.
- Demonstrated experience writing and editing complex scripts used to perform system maintenance and administration.
- Linux systems administration experience including: automated OS provisioning, software updates and package management, user accounts management, filesystems and access management, compiling software and kernel modules, versioning, environment modules.
- Hands-on experience with networking architectures, including InfiniBand fabric.
- Experience with containers (e.g., Docker, Singularity).
- Ability to elicit and communicate technical and non-technical information in a clear and concise manner.
- Self-motivated and works independently and as part of a team. Demonstrates problem-solving skills. Able to learn effectively and meet deadlines.
- Ability to write technical documentation in a clear and concise manner.
- Understanding of system performance monitoring and actions that can be taken to improve or correct performance.
- General knowledge of other areas of IT. Thorough understanding of and experience with systems-related issues and actions that can be taken to improve or correct performance.
- Strong analytical and problem-solving skills with a proactive approach to identifying and resolving technical issues.
- Excellent communication and interpersonal skills with the ability to collaborate effectively with researchers, faculty, and IT professionals.
- Knowledge of cybersecurity principles and best practices for securing HPC environments.

Preferred Qualifications:

- Five years of experience in administering and supporting HPC systems in an academic or research environment.
- Familiarity with open-source HPC technologies such as OpenHPC.
- Experience with configuring, deploying, and managing batch queueing systems for HPC clusters such as SGE, LSF, or Slurm.
- Experience with distributed file systems for HPC clusters (such as BeeGFS, Lustre).
- Experience with installation and integration of tools and software (such as compilers, scientific applications) in a shared cluster environment (e.g., modules).
- Proficiency with using source code version control systems for continuous integration and testing methods (e.g., git, svn).
- Experience with MySQL/MariaDB: installation, data extracts and loads.
- Experience with developing systems monitoring dashboards (e.g., Grafana, Prometheus, Tableau) and using monitoring tools (e.g., Nagios, Ganglia).
- MS or PhD with adequate understanding of the challenges associated with the data analytics needed to answer scientific questions and also the capabilities and limitations of an HPC cluster and distributed file systems.

EEO Statement

Equal Opportunity/Notice of Nondiscrimination

Santa Clara University is an equal opportunity/equal access/affirmative action employer fully committed to achieving a diverse workforce and complies with all Federal and California State laws, regulations, and executive orders regarding non-discrimination and affirmative action. Applications from members of historically underrepresented groups are especially encouraged. For a complete copy of Santa Clara University's equal opportunity and nondiscrimination policies, see https://www.scu.edu/title-ix/policies-reports/

COVID-19 Statement

The health and safety of the University community is a top priority. The University strongly recommends that all employees be fully vaccinated for COVID-19, as the vaccination and boosters are safe, effective tools that significantly minimize the chances of serious illness and hospitalization. Please contact Human Resources if you have any questions.

Telecommute

Santa Clara University is registered to do business in the following states: California, Nevada, Oregon, Washington, Arizona, and Illinois. Employees approved to telecommute are required to perform their work within one of these states.

Title IX of the Education Amendments of 1972

Santa Clara University does not discriminate in its employment practices or in its educational programs or activities on the basis of sex/gender, and prohibits retaliation against any person opposing discrimination or participating in any discrimination investigation or complaint process internally or externally. Information about Title IX can be found at www.scu.edu/title-ix. Information about Section 504 and the ADA Coordinator can be found at https://www.scu.edu/oae/, (408) 554-4109, oae@scu.edu. Inquiries can also be made to the Assistant Secretary of Education within the Office for Civil Rights (OCR).

Clery Notice of Availability

Santa Clara University annually collects information about campus crimes and other reportable incidents in accordance with the federal Jeanne Clery Disclosure of Campus Security Policy and Campus Crime Statistics Act. To view the Santa Clara University report, please go to the Campus Safety Services website. To request a paper copy, please call Campus Safety at (408) 554-4441. The report includes the type of crime, venue, and number of occurrences.

Americans with Disabilities Act

Santa Clara University affirms its commitment to employ qualified individuals with disabilities within the workplace and to comply with the Americans with Disabilities Act. All applicants desiring an accommodation should contact the Department of Human Resources and request to speak to Indu Ahluwalia by phone at 408-554-5750 or by email at iahluwalia@scu.edu.