Southern Methodist University

Senior HPC Systems Administrator (HR Title: Systems Administrator III)

Southern Methodist University, Dallas, Texas, United States, 75215


Salary Range:

Salary commensurate with experience and qualifications

About SMU

SMU’s more than 12,000 diverse, high-achieving students come from all 50 states and over 80 countries to take advantage of the University’s small classes, meaningful research opportunities, leadership development, community service, international study and innovative programs.

SMU serves approximately 7,000 undergraduates and 5,000 graduate students through eight degree-granting schools: Dedman College of Humanities and Sciences, Cox School of Business, Lyle School of Engineering, Meadows School of the Arts, Simmons School of Education and Human Development, Dedman School of Law, Perkins School of Theology and Moody School of Graduate and Advanced Studies.

SMU is data driven, and its powerful supercomputing ecosystem – paired with entrepreneurial drive – creates an unrivaled environment for the University to deliver research excellence.

Now in its second century of achievement, SMU is recognized for the ways it supports students, faculty and alumni as they become ethical, enterprising leaders in their professions and communities. SMU’s relationship with Dallas – the dynamic center of one of the nation’s fastest-growing regions – offers unique learning, research, social and career opportunities that provide a launch pad for global impact.

SMU is nonsectarian in its teaching and committed to academic freedom and open inquiry.

About the Department:

SMU supports some of the state’s leading high-performance computing (HPC) clusters. The M3 cluster boasts 1,077 TFLOPS, 181 nodes, 22,892 CPU cores, 122,880 accelerator cores, and 200Gb/s bandwidth. Meanwhile, the NVIDIA DGX SuperPOD offers 1,644 TFLOPS, 20 nodes, 2,560 CPU cores, 1,392,640 accelerator cores, and 200Gb/s bandwidth. Both clusters feature cutting-edge CPUs, accelerators, and networking technologies, high memory capacity per node, and provide advanced interactive experiences through the Open OnDemand Portal.

About the Position:

This role is an on-campus, in-person position.

Dedicated to supporting SMU's research community, the Senior Systems Administrator for High Performance Computing (HPC) works exclusively to design, build, maintain, operate and manage HPC systems at SMU.

This position shares responsibility for university HPC technical support as a member of a two-person HPC systems infrastructure team. This position also assists with Enterprise Linux support.

This position provides hardware, software and end-user support for SMU's growing number of research faculty and center compute resources dedicated to the advancement of SMU research activities.

Demonstrates advanced knowledge of all the technical tools required to perform the job.

Subject matter expert in primary areas of support.

Able to solve complex problems crossing multiple research disciplines with little or no escalation support.

Effective technical resource for others in resolving problems and implementing projects.

Essential Functions:

Design, plan, deploy, and administer HPC services for research at SMU, and troubleshoot related issues.

Install and maintain cluster environments and provision systems using automated installation methods. Manage and maintain the Lustre parallel file system and NFS storage. Manage and maintain the InfiniBand high-performance interconnect fabric. Configure, manage, and monitor the Slurm scheduling and queuing system.

Develop and maintain programs and scripts that aid in the operation and automation of administrative tasks, using the shell and scripting languages (bash, Perl, Python) required by systems dedicated to research. Compile, install, and port software needed by SMU researchers. Build and deploy open source and vendor/commercial software required by researchers.

Plan projects, communicate with end users and management, provide updates, and manage expectations.

Document all configurations, procedures, and changes. Document system administration procedures for routine and complex tasks.

Diagnose and resolve system and operational problems with research systems. Work with researchers and constituents to diagnose and optimize workloads. Participate in on-call support of research infrastructure.

Coordinate with vendors to resolve hardware and software problems.

Ensure hardware firmware and software revisions are maintained at the appropriate levels on HPC research systems.

Keep current with research computing and HPC technology trends and best practices.