ByteDance

Machine Learning Engineer - Model Training Infrastructure

ByteDance, Seattle, Washington, US, 98127


Responsibilities:

Responsible for the design and implementation of a global-scale machine learning training system for feeds, ads, and search ranking models.
Responsible for the design and implementation of the orchestration layer for offline/online machine learning training processes.
Responsible for improving the usability and flexibility of the training APIs.
Responsible for profiling and optimizing both training and validation frameworks to ensure efficient use of resources.
Responsible for creating, managing, and optimizing data pipelines to ensure data availability for training.

Qualifications:

Proficient in C/C++/CUDA/Python, with solid programming skills.
Familiar with deep learning frameworks (TensorFlow/PyTorch).
Experience in developing and deploying large-scale systems.
Ability to work independently and complete projects from beginning to end in a timely manner.
Good communication and teamwork skills, with the ability to explain technical concepts clearly to teammates.
Experience improving core machine learning infrastructure (TensorFlow, PyTorch, and JAX).

Preferred Qualifications:

Experience contributing to an open-source machine learning framework (TensorFlow/PyTorch).
Experience with big data frameworks (e.g., Spark/Hadoop/Flink), and with resource management and task scheduling for large-scale distributed systems.
Strong background in one of the following fields: Hardware-Software Co-Design, High Performance Computing, ML Hardware Acceleration (e.g., GPU/RDMA), or ML for Systems.
