Compunnel

Senior / Lead Hadoop Engineer

Compunnel, Charlotte, North Carolina, United States, 28245

Job Summary

We are seeking a highly experienced Senior / Lead Hadoop Engineer to lead complex technology initiatives focused on maintaining platform stability, automating monitoring, and providing 24/7 support for Hadoop clusters. The ideal candidate will have extensive experience in Hadoop administration, Unix scripting, and incident management. This role involves managing and optimizing the Hadoop ecosystem, coordinating with platform tenants and vendors, and ensuring platform resilience through backup, recovery, and patch management.

Key Responsibilities

Platform Stability and Incident Management:

• Lead complex technology initiatives focused on Hadoop platform stability and automation.

• Manage and troubleshoot platform incidents, serving as part of a 24/7 platform support team.

• Assess the impact of platform issues, prioritize triaging efforts, and partner with tenants, developers, and vendors.

• Perform root cause analysis of incidents and implement permanent solutions.

• Communicate progress and status updates to stakeholders and leadership.

Hadoop Cluster Monitoring and Troubleshooting:

• Monitor Hadoop cluster health and performance using tools such as Prometheus, Splunk, and ThousandEyes.

• Identify opportunities to automate monitoring, and fine-tune alerting for optimal performance.

• Use Autosys or similar scheduling tools to manage platform activities.

• Troubleshoot tenant-reported failures by analyzing logs to identify root causes.

Hadoop Configuration and Tuning:

• Periodically configure and tune Hadoop components (HDFS, YARN, Hive, HBase, Spark, HPOS).

• Schedule and track change requests for implementation.

• Collaborate with tenants and operating system administrators to manage patches and upgrades.

Backup, Recovery, and Patching Management:

• Ensure platform resilience through robust data backup and recovery strategies.

• Execute failover to disaster recovery sites and manage fallback processes.

• Plan and perform upgrades and patch management of Hadoop components.

• Validate cluster health post-patching and communicate with stakeholders.

• Identify automation opportunities to enhance patching processes.

Documentation and Reporting:

• Maintain detailed documentation of cluster configurations, troubleshooting processes, and procedures.

• Generate reports on cluster usage, performance metrics, and capacity utilization.

Collaboration and Support:

• Collaborate with platform tenants, development teams, and Hadoop vendors to troubleshoot complex issues.

• Work independently and in coordination with key technical experts to ensure production stability.

Required Qualifications:

• 8+ years of experience in technology incident and problem management.

• 4+ years of experience using monitoring tools (Prometheus, Splunk, SiteScope, ThousandEyes).

• 6+ years of experience with Unix scripting.

• 2+ years of experience in Hadoop administration or a similar role.

• Strong knowledge of the Hadoop ecosystem and its components (Hive, HBase, Spark, HPOS).

• Experience with Autosys or a similar job scheduling tool.

Preferred Qualifications:

• Excellent problem-solving and troubleshooting skills.

• Strong communication and collaboration skills for working with cross-functional teams.

• Ability to work both independently and as part of a collaborative team.

Education:

Bachelor's degree