XCEL Engineering Inc
Linux HPC Systems Engineer (Hybrid)
XCEL Engineering Inc, Oak Ridge, Tennessee, United States, 37830
COMPANY OVERVIEW
XCEL Engineering, Inc. is an award-winning small business that provides trusted information technology, engineering, consulting and project management solutions and services to federal agencies and organizations. Originally founded in 1971 by professional engineers at the University of Tennessee, XCEL was acquired in 2003 by U.S. Army and Navy veterans and in 2023 became a MartinFed company.
XCEL Engineering is a part of IT Lab Partners (ITLP) which was created to support a leading research facility in the East Tennessee region in recruiting the best and the brightest technical talent. Consider joining our impressive team today!
JOB OVERVIEW
Xcel Engineering is seeking qualified applicants for a Linux HPC Systems Engineer position in the Research and Development Systems Engineering group. The R&D Systems Engineering group exists to facilitate lab goals through systems engineering, integration, and support for a top research lab. We run clusters, servers, workstations, and other services where science happens at the lab. The Linux Systems Engineer is responsible for facilitating R&D projects. This position will work with a team to provide Linux Systems deployment, automation, monitoring, and management for researchers. Our group optimizes our workflows and monitoring solutions to take advantage of our 24/7 operations staff, which significantly reduces the need for off-hours support. We also offer a flexible work schedule and utilize Email, Jira, Confluence, Teams, Slack, and other collaboration solutions to stay in contact.
ESSENTIAL FUNCTIONS
Our primary goal is to partner with research organizations to enable research excellence and delivery. We work with other clustered computing and HPC groups to help research programs identify the best solutions for their needs. When we build our customers' environments, our team collaborates to design, implement, and maintain the systems from inception to retirement.
Advocate and promote HPC and clustered computing services to researchers who process large data sets and/or develop code as a part of their project.
Ensure the availability, performance, scalability, and security of production systems.
Leverage automation and monitoring solutions that minimize our day-to-day maintenance and scout opportunities to optimize system management practices or system performance.
Collaborate with technical POCs for the programs that we support to install and help tune the performance of various scientific toolsets.
The Emerging Technologies & Computing Group optimizes workflows and monitoring solutions to take advantage of our 24/7 operations staff, which significantly reduces the need for off-hours support. We use Email, Jira, Confluence, Teams, and other collaboration solutions to stay in contact.
Deliver the mission by aligning behaviors, priorities, and interactions with our core values of Impact, Integrity, Teamwork, Safety, and Service. Promote diversity, equity, inclusion, and accessibility by fostering a respectful workplace - in how we treat one another, work together, and measure success.
BASIC QUALIFICATIONS
United States Citizen with the ability to obtain and maintain a DOE Security Clearance.
A BS in computer science, computer engineering, information technology, science, engineering, business or a related field of study and two (2) to four (4) years of aligned experience is required for consideration.
1+ year managing UNIX/Linux Systems.
1+ year utilizing configuration management and automation tools such as Git, Jenkins, Ansible, or Puppet.
Some proficiency in at least one scripting language such as Bash, Python, or equivalent.
Experience performing troubleshooting and system administration with Linux Servers.
Experience supporting large data systems.
A desire to push the envelope and identify new technologies and opportunities and be able to communicate the potential benefits of those choices to others within the team and our research partners.
A collaborative and energetic mindset to thrive on the opportunity to build trust and credibility, and ultimately become a trusted advisor to our research teams.
DESIRED QUALIFICATIONS
Understanding of multiple operating systems and cluster technologies.
Experience with CentOS/RHEL, Ubuntu, VMware.
Understanding of HPC platforms to support users with SLURM job submissions and troubleshooting.
Experience building and running containerized applications in an HPC environment.
Experience with multiple deployment mechanisms like Diskless, Warewulf, and traditional deployment (Cobbler, PXE boot, and/or Bright).
Experience managing systems utilizing GPU (NVIDIA and AMD) clusters for AI/ML and/or image processing.
Knowledge of networking fundamentals including TCP/IP, traffic analysis, common protocols, and network diagnostics.
Experience with InfiniBand networks and diagnostics.
Extensive experience with High Performance Parallel File Systems (Lustre, WEKA, GPFS, etc.).
Experience with performance and diagnostic tools for benchmarking, analysis and tuning of systems, networking, and storage.
Experience with Grafana, CheckMK, Nagios, Zabbix, SolarWinds, Ganglia, or other network and device monitoring systems.
Previous experience working in a government, scientific or other highly technical environment.
Good documentation skills, including ability to prepare simple documentation web pages.
PHYSICAL REQUIREMENTS & ENVIRONMENTAL CONDITIONS
Inside office environment.
Working on a computer for long periods of time.
May involve long periods of sitting at a desk.
The work environment is fast-paced and sometimes involves extreme deadline pressures.
OTHER DUTIES
This job description is not designed to cover or contain a comprehensive listing of activities, duties or responsibilities that are required of the employee for this job. Duties, responsibilities and activities may change at any time with or without notice.
Xcel Engineering is an Equal Opportunity/Affirmative Action Employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, religious creed, gender, sexual orientation, gender identity, gender expression, transgender status, pregnancy, marital status, national origin, ancestry, citizenship status, age, disability, protected Veteran Status, genetics or any other characteristics protected by applicable federal, state or local law.
If you are a qualified individual with a disability or a disabled veteran, you have the right to request a reasonable accommodation if you are unable or limited in your ability to use or access Xcel Engineering's current openings as a result of your disability. You can request reasonable accommodations by calling 855.212.1810. Thank you for your interest in Xcel Engineering.
All positions at Xcel Engineering, Inc. are contingent upon passing both a background check and drug screening prior to a start date and are subject to random drug screenings during the employment period. In addition, Xcel Engineering is an E-Verify employer.