St. Jude Children's Research Hospital

Senior HPC Infrastructure Engineer

St. Jude Children's Research Hospital, Millington, Tennessee, US, 38083


Join a cutting-edge team dedicated to pushing the boundaries of high-performance computing (HPC) and artificial intelligence (AI) infrastructure. As a Senior HPC Infrastructure Engineer, you'll play a pivotal role in designing, implementing, and optimizing our state-of-the-art HPC clusters and servers. Your expertise will ensure that our research computing environment excels in scalability, redundancy, and performance.

Key Responsibilities:

- Lead the architecture, design, and implementation of advanced HPC/AI systems to support groundbreaking research.
- Oversee the ongoing monitoring, support, and maintenance of our HPC/AI clusters, ensuring peak performance and reliability.
- Drive system upgrades, customization, and seamless integration with database administrators, software developers, network operations, and data center teams.
- Manage and maintain a diverse range of computer systems and application software, ensuring they meet the highest standards of functionality and efficiency.
- Ensure continuous support and monitoring of our research computing infrastructure, delivering exceptional service 24/7.

What We Offer:

- An opportunity to work with cutting-edge technology in a dynamic, collaborative environment.
- A role that directly impacts the success of groundbreaking research projects.
- A chance to collaborate with top-tier professionals across various disciplines.

If you're passionate about HPC technology and thrive in a fast-paced, innovative setting, we want to hear from you.

This position may be eligible for remote work.

Job Responsibilities:

- Oversee configuration and management of the IT infrastructure to support requirements (e.g., data retention, security, business continuity, disaster recovery, information risk management).
- Monitor and evaluate the efficiency and effectiveness of infrastructure service delivery methods and procedures.
- Lead and manage internal infrastructure through established regulations and standards.
- Implement and monitor incident/problem and disaster recovery for infrastructure support.
- Manage and provide current systems usage statistics, and provide projected growth estimates based on customer demand.
- Partner with internal teams to develop prioritization, metrics, and processes around capacity planning and infrastructure availability.
- Periodically present capacity planning and performance reports to senior leaders during presentations and meetings.
- Benchmark, analyze, and make recommendations for improvement of IT infrastructure.
- Perform other duties as assigned to meet the goals and objectives of the department and institution.
- Maintain regular and predictable attendance.

Minimum Education and/or Training:

- Bachelor's degree in Computer Science, Engineering, Business or related field of study required.
- Master's degree preferred.

Minimum Experience:

- Four (4) years of IT experience, including experience in infrastructure operations and engineering environments.
- Experience with Red Hat Enterprise Linux (RHEL) is highly preferred.
- Experience with using and supporting Linux in a high-performance computing (HPC) cluster and research computing environment is highly preferred.
- Must have experience managing an HPC cluster.
- Experience with Slurm and/or LSF is highly preferred.
- Experience with Kubernetes (e.g., Rancher, OpenShift, etc.) is a plus.
- Experience with Base Command Manager, Bright Cluster Manager, or another HPC cluster manager (e.g., HPCM, xCAT, Warewulf, Scyld) is highly preferred.
- Experience with IBM Spectrum Scale (GPFS) is required; experience with Lustre is a plus.
- Experience with Message Passing Interface (MPI) is highly preferred.
- Experience with InfiniBand, Ethernet, and TCP/IP networking and topology is highly preferred.
- Experience with HPE Aruba Ethernet switches is preferred.
- Experience with NVIDIA GPUs is required; experience with AMD GPUs is a plus.
- Experience with NVIDIA GPUDirect Storage is a plus.
- Advanced, in-depth knowledge and strong understanding of HPC technologies and principles.
- Must have strong knowledge of Linux security and Linux shell scripting.
- Proven performance in an earlier or comparable role.

Compensation

In recognition of certain U.S. state and municipal pay transparency laws, St. Jude is including a reasonable estimate of the compensation range for this role. This is an estimate offered in good faith and a specific salary offer takes into account factors that are considered in making compensation decisions including but not limited to skill sets, experience and training, licensure and certifications, and other business and organizational needs. It is not typical for an individual to be hired at or near the top of the salary range and compensation decisions are dependent on the facts and circumstances of each case. A reasonable estimate of the current salary range is $94,640 - $169,520 per year for the role of Senior HPC Infrastructure Engineer.

Explore our exceptional benefits.

Diversity, Equity and Inclusion

St. Jude Children's Research Hospital has a diverse, global patient population and workforce, built on the principles of diversity, equity and inclusion. Our founder Danny Thomas envisioned a hospital that would treat children of the world, regardless of race, religion or a family's ability to pay. Learn more about our history and commitment.

Today, we continue the mission to advance cures and means of prevention for pediatric catastrophic diseases through research and treatment. As we accelerate this progress globally, we believe our legacy of diversity, equity and inclusion is foundational to success. With the commitment of leaders at all levels of the organization, we strive to ensure the St. Jude culture, leadership approaches and talent processes are equitable and culturally responsive.
View our Diversity, Equity and Inclusion Report to learn about the hospital's roots in diversity, equity and inclusion, where we are today and our aspirations for an even better future.

St. Jude is an Equal Opportunity Employer

No Search Firms

St. Jude Children's Research Hospital does not accept unsolicited assistance from search firms for employment opportunities. Please do not call or email. All resumes submitted by search firms to any employee or other representative at St. Jude via email, the internet or in any form and/or method without a valid written search agreement in place and approved by HR will result in no fee being paid in the event the candidate is hired by St. Jude.