JobRialto

Data Engineer

JobRialto, McLean, Virginia, US, 22107


Job Description:

Top Skills' Details

Experience building/standing up an on-prem big data solution (Hadoop, Cloudera, Hortonworks), data lake, data warehouse, or similar solution.

All roles require 100% hands-on experience and strong data foundations.

These skills MUST come from an on-prem environment, NOT the cloud; having both is fine, but the experience must be from an on-prem big data environment.

Big Data Platform (data lake) and data warehouse engineering experience.

Preferably with the Hadoop stack: HDFS, Hive, SQL, Spark, Spark Streaming, Spark SQL, HBase, Kafka, Sqoop, Atlas, Flink, Cloudera Manager, Airflow, Impala, Tez, Hue, and a variety of source data connectors.

Solid hands-on software engineer who can design and code big data pipeline frameworks as a software product (ideally on Cloudera), not just a data engineer implementing Spark jobs or a team lead for data engineers. This includes building self-service data pipelines that automate controls, ingest data into the ecosystem (data lake), and transform it for different consumption patterns, supporting both GCP and on-prem Hadoop, bringing in massive volumes of cybersecurity data, and validating data and data quality.

Solid PySpark developer: experience working with Spark Core, Spark Streaming, the PySpark API, and Spark optimizations, i.e. knowing how to optimize the code.

Experience writing PySpark code.

PySpark with solid Hadoop data lake foundations
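As a rough illustration of the kind of hands-on PySpark work described above, the sketch below reads from a hypothetical data lake path, applies a broadcast join as one common Spark optimization, and writes partitioned output. The paths, table, and column names are placeholders, not details from this posting.

```python
# Minimal PySpark sketch (illustrative only; paths and columns are hypothetical).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("example-pipeline").getOrCreate()

# Read raw events from the data lake (HDFS path is a placeholder).
events = spark.read.parquet("hdfs:///data/raw/security_events")

# Small dimension table; broadcasting it avoids a shuffle-heavy join.
assets = spark.read.parquet("hdfs:///data/reference/assets")
enriched = events.join(F.broadcast(assets), on="asset_id", how="left")

# Basic aggregation, then write partitioned output for downstream consumers.
daily_counts = (
    enriched
    .withColumn("event_date", F.to_date("event_ts"))
    .groupBy("event_date", "severity")
    .count()
)
daily_counts.write.mode("overwrite").partitionBy("event_date").parquet(
    "hdfs:///data/curated/daily_event_counts"
)
```

Broadcasting the small reference table is shown here because it avoids shuffling the large fact table, which is the kind of optimization decision the role calls for.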

Airflow experience: hands-on use and developing workflows in Airflow.
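For context, a minimal Airflow workflow of the kind this requirement implies might look like the sketch below, assuming Airflow 2.x; the DAG id, schedule, and task callables are hypothetical.

```python
# Minimal Airflow 2.x DAG sketch (illustrative only; ids and schedule are hypothetical).
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def ingest():
    print("pull data from source")


def transform():
    print("clean and transform data")


with DAG(
    dag_id="example_ingest_dag",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    ingest_task = PythonOperator(task_id="ingest", python_callable=ingest)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)

    # Run ingestion before transformation.
    ingest_task >> transform_task
```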

Job Description

The Big Data Lead Software Engineer is responsible for owning and driving technical innovation with big data technologies.

The individual is a subject matter expert technologist with strong Python experience and deep hands-on experience building data pipelines for the Hadoop platform as well as Google Cloud.

This person will be part of successful Big Data implementations for large data integration initiatives.

The candidates for this role must be willing to push the limits of traditional development paradigms typically found in a data-centric organization while embracing the opportunity to gain subject matter expertise in the cyber security domain.

In this role you will:

Lead the design and development of sophisticated, resilient, and secure engineering solutions for modernizing our data ecosystem that typically involve multiple disciplines, including big data architecture, data pipelines, data management, and data modeling specific to consumer use cases.

Provide technical expertise for the design, implementation, maintenance, and control of data management services - especially end-to-end, scale-out data pipelines.

Develop self-service, multitenant capabilities on the cybersecurity data lake, including custom/off-the-shelf services integrated with the Hadoop platform and Google Cloud; use APIs and messaging to communicate across services; integrate with distributed data processing frameworks and data access engines built on the cluster; integrate with enterprise services for security, data governance, and automated data controls; and implement policies to enforce fine-grained data access.

Build, certify and deploy highly automated services and features for data management (registering, classifying, collecting, loading, formatting, cleansing, structuring, transforming, reformatting, distributing, and archiving/purging) through Data Ingestion, Processing, and Consumption stages of the analytical data lifecycle.

Provide the highest technical leadership in terms of design, engineering, deployment and maintenance of solutions through collaborative efforts with the team and third-party vendors.

Design, code, test, debug, and document programs using Agile development practices.

Review and analyze complex data management technologies that require in-depth evaluation of multiple factors, including intangibles or unprecedented factors.

Assist in production deployments, including troubleshooting and problem resolution.

Collaborate with enterprise, data platform, data delivery, and other product teams to provide strategic solutions, influencing long range internal and enterprise level data architecture and change management strategies.

Provide technical leadership and recommendation into the future direction of data management technology and custom engineering designs.

Collaborate and consult with peers, colleagues, and managers to resolve issues and achieve goals.

10+ years of Big Data Platform (data lake) and data warehouse engineering experience.

Preferably with the Hadoop stack: HDFS, Hive, SQL, Spark, Spark Streaming, Spark SQL, HBase, Kafka, Sqoop, Atlas, Flink, Cloudera Manager, Airflow, Impala, Tez, Hue, and a variety of source data connectors. Solid hands-on software engineer who can design and code big data pipeline frameworks as a software product (ideally on Cloudera), not just a data engineer implementing Spark jobs or a team lead for data engineers. This includes building self-service data pipelines that automate controls, ingest data into the ecosystem (data lake), and transform it for different consumption patterns, supporting both GCP and on-prem Hadoop, bringing in massive volumes of cybersecurity data, and validating data and data quality. Reporting consumption includes advanced analytics, data science, and ML.

3+ years of hands-on experience designing and building modern, resilient, and secure data pipelines, including movement, collection, integration, and transformation of structured/unstructured data with built-in automated data controls, built-in logging/monitoring/alerting, and pipeline orchestration managed to operational SLAs. Preferably using Airflow custom operators (at least 1 year of experience customizing them), DAGs, and connector plugins.
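To illustrate the custom operator requirement, the sketch below shows a minimal Airflow custom operator, assuming Airflow 2.x; the class name, table parameter, and stubbed row count are hypothetical.

```python
# Minimal Airflow custom operator sketch (illustrative only).
from airflow.models.baseoperator import BaseOperator


class DataQualityCheckOperator(BaseOperator):
    """Fails the task if a hypothetical record count falls below a threshold."""

    def __init__(self, *, table: str, min_rows: int = 1, **kwargs):
        super().__init__(**kwargs)
        self.table = table
        self.min_rows = min_rows

    def execute(self, context):
        # A real operator would query Hive/Impala/BigQuery via a hook;
        # the count is stubbed out here to keep the sketch self-contained.
        row_count = 100  # placeholder value
        if row_count < self.min_rows:
            raise ValueError(f"{self.table} has only {row_count} rows")
        self.log.info("%s passed with %d rows", self.table, row_count)
        return row_count
```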

Python, Spark, and PySpark; working with APIs to integrate different services; Google big data services such as Cloud Dataproc, Datastore, BigQuery, and Cloud Composer.
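As one example of working with Google data services through Python APIs, a minimal BigQuery query using the official client library could look like the sketch below; the project and table names are placeholders.

```python
# Minimal BigQuery sketch (illustrative only; project and table are placeholders).
from google.cloud import bigquery

client = bigquery.Client(project="example-project")

query = """
    SELECT DATE(event_ts) AS event_date, COUNT(*) AS events
    FROM `example-project.security.raw_events`
    GROUP BY event_date
    ORDER BY event_date
"""

# Submit the query and iterate over the result rows.
for row in client.query(query).result():
    print(row.event_date, row.events)
```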

On prem: Apache Airflow as the core orchestrator of the streaming tooling.

Kafka for streaming services: data is sourced from Kafka and then processed with Spark Streaming.
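A minimal sketch of this Kafka-to-Spark pattern, using Spark Structured Streaming, is shown below; the broker address, topic, and output paths are placeholders, and the Kafka connector package is assumed to be available on the cluster.

```python
# Minimal Spark Structured Streaming sketch reading from Kafka (illustrative only;
# brokers, topic, and paths are placeholders; requires the spark-sql-kafka connector).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("example-kafka-stream").getOrCreate()

raw = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")
    .option("subscribe", "security-events")
    .option("startingOffsets", "latest")
    .load()
)

# Kafka delivers the payload as bytes; cast to string before parsing downstream.
events = raw.select(
    F.col("key").cast("string"),
    F.col("value").cast("string"),
    F.col("timestamp"),
)

query = (
    events.writeStream
    .format("parquet")
    .option("path", "hdfs:///data/streaming/security_events")
    .option("checkpointLocation", "hdfs:///checkpoints/security_events")
    .trigger(processingTime="1 minute")
    .start()
)
query.awaitTermination()
```

The checkpoint location is what lets the stream restart from where it left off, which matters for pipelines managed to operational SLAs.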

Python, Spark, and APIs to integrate different services, including GCP services.

Building self-service data pipelines that support GCP and on-prem Hadoop, bring in massive volumes of cybersecurity data, and validate data and data quality. Reporting consumption includes advanced analytics, data science, and ML.


Additional Skills & Qualifications

Additional skills that are a plus for any/all of the above candidates: GCP, Kafka/Kafka Connect, Hive DB development.

Experience with Google Cloud data services such as Cloud Storage, Dataproc, Dataflow, and BigQuery.

Google Cloud big data specialty: ideally hands-on experience, not just a certification.

Hands-on experience developing and managing technical and business metadata

Experience creating/managing Time-Series data from full data snapshots or incremental data changes
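As a rough illustration of deriving time-series change records from full data snapshots, consider the PySpark sketch below; the snapshot layout, column names, and paths are hypothetical.

```python
# Sketch: deriving a time-series of state changes from daily full snapshots
# (illustrative only; table layout and column names are hypothetical).
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("example-timeseries").getOrCreate()

# Each snapshot carries every record as of snapshot_date.
snapshots = spark.read.parquet("hdfs:///data/raw/asset_snapshots")

w = Window.partitionBy("asset_id").orderBy("snapshot_date")

changes = (
    snapshots
    .withColumn("prev_state", F.lag("state").over(w))
    # Keep only rows where the tracked attribute actually changed
    # (the first snapshot of each asset has no previous state).
    .filter(F.col("prev_state").isNull() | (F.col("state") != F.col("prev_state")))
    .select("asset_id", "snapshot_date", "state")
)

changes.write.mode("overwrite").parquet("hdfs:///data/curated/asset_state_changes")
```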

Hands-on experience implementing fine-grained access controls, such as attribute-based access control (ABAC), using Apache Ranger

Experience automating data quality (DQ) validation in the data pipelines

Experience implementing automated data change management, including code and schema versioning, QA, CI/CD, and rollback processing

Experience automating the end-to-end data lifecycle on the big data ecosystem

Experience with managing automated schema evolution within data pipelines

Experience implementing masking and/or other forms of data obfuscation (see the sketch after this list)

Experience designing and building microservices, APIs, and MySQL

Advanced understanding of SQL and NoSQL DB schemas

Advanced understanding of partitioned Parquet, ORC, Avro, and various compression formats (see the sketch after this list)

Developing containerized microservices and APIs

Familiarity with key concepts implemented by Apache Hudi or Iceberg, or Databricks Delta Lake (bonus)
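As a small combined illustration of the masking and partitioned-Parquet items above, the sketch below hashes one identifier, partially masks another, and writes the result as date-partitioned Parquet; the columns, paths, and masking rules are placeholders, and hashing is only one of several possible obfuscation approaches.

```python
# Minimal masking + partitioned Parquet sketch (illustrative only).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("example-masking").getOrCreate()

users = spark.read.parquet("hdfs:///data/raw/users")  # placeholder path

masked = (
    users
    # Irreversibly hash an identifier so downstream consumers never see raw values.
    .withColumn("email_hash", F.sha2(F.col("email"), 256))
    .drop("email")
    # Keep only the last 4 digits of a phone number as a partially masked field.
    .withColumn("phone_masked", F.regexp_replace("phone", r"\d(?=\d{4})", "*"))
    .drop("phone")
)

(
    masked.write
    .mode("overwrite")
    .partitionBy("signup_date")  # partition column drives the directory layout
    .parquet("hdfs:///data/curated/users_masked")
)
```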

Job expectations:

Ability to occasionally work nights and/or weekends as needed for on-call/production issue resolution

Ability to occasionally work nights and/or weekends for off-hours system maintenance

Employee Value Proposition (EVP)

Strategy:

The more tactical need (2-3 years) is to implement a robust on-premises big data platform to meet the Client's cybersecurity BI/analytics/reporting and data science/ML needs.

This includes building a custom data pipeline solution in Python, using Spark and Airflow on top of the Hadoop platform.

In parallel, we would like to start onboarding select early-adopter use cases to our target state Google Cloud Platform (GCP) starting Q1'2023.

Portability of our on-premise solutions to GCP is critical.

As we learn and gain momentum on GCP, we will start to accelerate our journey to the public cloud - expect that to be around Q3/Q4 of 2023.

Education:

Bachelor's Degree