EON Systems, Inc.
Senior Data Engineer - Acquisition & Infrastructure
EON Systems, Inc., Little Ferry, New Jersey, US, 07643
This role
As a data engineer, you will be responsible for the acquisition, processing, and handling of large amounts of complex neuroscientific data. You will build and maintain an end-to-end, cloud-based data pipeline, from data capture through to delivering processed data to our ML models. You will collaborate closely with the human/animal brain data acquisition and AI engineering teams, building the interface between data acquisition and our machine learning models.
Representative projects
- Download neuro datasets from 10+ repositories, format and preprocess them, and store them in an infrastructure accessible to training pipelines.
- Build creative validation and quality assurance steps into this pipeline that allow SMEs to judge dataset quality, and later automate this process. Visualize key metrics in dashboards. One potential example: run our smallest neuro foundation model on each dataset, rank the datasets by reconstruction loss, and flag any dataset that was used to train the model and will therefore have artificially low loss.
- Work with ML engineers to build an API to feed (tokenized) brain data to training runs.
- Download or scrape metadata from the above repositories, extract additional metadata from fields like Description, and impute missing metadata via LLMs.
- Proactively work to determine what other projects would provide value to the ML team and the company.
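To make the quality-assurance example above concrete, here is a minimal sketch in Python. It assumes the per-dataset reconstruction losses have already been computed elsewhere. The names `DatasetScore`, `rank_by_loss`, and the dataset identifiers are hypothetical and are not part of any existing EON Systems codebase.

```python
from dataclasses import dataclass


@dataclass
class DatasetScore:
    """One row of a hypothetical quality report."""
    name: str
    reconstruction_loss: float
    seen_in_training: bool  # True means the loss is likely artificially low


def rank_by_loss(losses: dict[str, float], training_set: set[str]) -> list[DatasetScore]:
    """Rank datasets from lowest to highest loss and flag training-set members."""
    scores = [
        DatasetScore(name, loss, name in training_set)
        for name, loss in losses.items()
    ]
    return sorted(scores, key=lambda s: s.reconstruction_loss)


# Illustrative values only; real losses would come from a model evaluation run.
losses = {"ds_a": 0.12, "ds_b": 0.45, "ds_c": 0.30}
ranked = rank_by_loss(losses, training_set={"ds_a"})

for score in ranked:
    flag = " (used in training, loss not comparable)" if score.seen_in_training else ""
    print(f"{score.name}: {score.reconstruction_loss:.2f}{flag}")
```

Keeping the flag next to the loss, rather than filtering flagged datasets out, lets an SME see the full ranking while discounting the entries whose low loss reflects training exposure rather than data quality.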
Responsibilities
- Manage the acquisition of petabytes of online datasets of different types and modalities.
- Assess and process unstructured and noisy datasets that require intensive cleanup and organization.
- Build a cloud-based data pipeline to stream massive amounts of data to our ML model applications.
- Host and maintain our large cloud-based datasets, ensuring scalability, accessibility, and end-to-end functionality at all levels.
- Collaborate closely with our Machine Learning (ML) team to facilitate and optimize data pipeline projects.
- Document the data pipeline with clear and comprehensive guides, facilitating easy access and understanding for the ML team and other stakeholders.
Requirements
- Strong demonstrated experience in handling and preprocessing messy, unstructured datasets, ideally within scientific research environments.
- Demonstrated experience in building software around cloud-based data pipeline infrastructures.
- Demonstrated experience in building large data infrastructure for ML applications.
- Proficiency in cloud computing platforms, at a minimum AWS, and ideally others.
- Good understanding of machine learning concepts and how data preprocessing affects ML model performance.
- Strong background and experience in implementing data validation and cleaning techniques.
- Experience in managing complex projects with a focus on timely delivery of technical solutions.
- Excellent communication skills for effective collaboration with technical and non-technical teams.
Nice-to-haves (we’ll prioritize your application if you have the skills below)
- Experience in the following: Kafka, Hadoop, EMR, GCP, Glue, Spark, CloudStack, HDFS, Databricks, Sagemaker, etc.
- Experience with database management, ETL processes, and SQL/NoSQL databases.
- Thoughtfulness about policy and epistemics related to the rapidly changing future of technology.
This role may not be the best fit for you if…
- You have predominantly developed data pipelines for business contexts, where data needs less serial and experimental processing compared to the complexities of scientific datasets.
- Your experience does not include hands-on work with design choices around dataset acquisition.
- You lack familiarity with fundamental scientific computing techniques, for instance, normalizing by z-score or resampling.
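For reference, the two techniques named in the last point can be sketched in a few lines of standard-library Python. This is an illustrative sketch only: the function names `zscore` and `resample_linear` are hypothetical, linear interpolation is just one of several resampling methods, and production pipelines would more likely use a numerical library.

```python
from statistics import fmean, pstdev


def zscore(values: list[float]) -> list[float]:
    """Normalize a signal to zero mean and unit (population) standard deviation."""
    mu = fmean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant signal has no spread; return zeros instead of dividing by zero.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def resample_linear(values: list[float], n_out: int) -> list[float]:
    """Resample a signal to n_out evenly spaced points by linear interpolation."""
    if len(values) < 2 or n_out < 2:
        raise ValueError("need at least two input and two output samples")
    step = (len(values) - 1) / (n_out - 1)
    out = []
    for i in range(n_out):
        pos = i * step
        lo = int(pos)
        hi = min(lo + 1, len(values) - 1)
        frac = pos - lo
        out.append(values[lo] * (1 - frac) + values[hi] * frac)
    return out


normalized = zscore([1.0, 2.0, 3.0])
upsampled = resample_linear([0.0, 2.0, 4.0], 5)
```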
Salary
We offer a competitive salary, including equity.