TikTok

Software Engineer, ML System Scheduling

TikTok, Seattle, Washington, US, 98127


Responsibilities

TikTok is the leading destination for short-form mobile video. Our mission is to inspire creativity and bring joy. TikTok has global offices including Los Angeles, New York, London, Paris, Berlin, Dubai, Singapore, Jakarta, Seoul and Tokyo.

At TikTok, our people are humble, intelligent, compassionate and creative. We create to inspire - for you, for us, and for more than 1 billion users on our platform. We lead with curiosity and aim for the highest, never shying away from taking calculated risks and embracing ambiguity as it comes. Here, the opportunities are limitless for those who dare to pursue bold ideas that exist just beyond the boundary of possibility. Join us and make impact happen with a career at TikTok.

Team Intro

The AML (Applied Machine Learning) Machine Learning System team focuses on the research and implementation of cutting-edge technologies in the field of Machine Learning systems, providing high-performance, highly reliable, scalable systems.

On the team, you'll have the opportunity to build a large-scale heterogeneous system integrating GPU, RDMA, and storage and keep it running stably and reliably; enrich your expertise in coding, performance improvement and problem analysis; and be involved in the decision-making process.

Responsibilities:

- Design and develop resource scheduling, including model training, model evaluation and model inference in various scenarios (LLM/AIGC/NLP/CV/Speech, etc.)
- Orchestrate various computing resources (GPU, CPU, other heterogeneous hardware) optimally, making rational use of stable resources, tidal resources, mixed resources, and multi-cloud resources
- Combine computing resources, RDMA high-speed network resources, and storage resources optimally, making full use of the power of large-scale distributed clusters
- Schedule offline and online workloads across global data centers, integrating multi-cloud scenarios to achieve rational distribution

Qualifications

Minimum Qualifications:

- Proficient in 1 to 2 programming languages such as Go/Python/Shell in a Linux environment
- Familiar with Kubernetes architecture and container technology such as Docker/Containerd/Kata/Podman, with extensive experience in Machine Learning system practice and development
- Understand the principles of distributed systems and have experience in the design, development and maintenance of large-scale distributed systems
- Excellent logical analysis ability; able to reasonably abstract and split business logic
- Strong sense of responsibility, good learning ability, communication skills and self-drive; able to respond and act quickly

Preferred Qualifications:

- Familiar with at least one major Machine Learning framework (TensorFlow/PyTorch)
- Experience in one of the following fields: AI Infrastructure, HW/SW Co-Design, High Performance Computing, ML Hardware Architecture (GPU, Accelerators, Networking)

Job Information:

Compensation Description (annually)

The base salary range for this position in the selected city is $184,300 - $337,250 annually.

Compensation may vary outside of this range depending on a number of factors, including a candidate’s qualifications, skills, competencies and experience, and location. Base pay is one part of the Total Package that is provided to compensate and recognize employees for their work, and this role may be eligible for additional discretionary bonuses/incentives, and restricted stock units.

Our company benefits are designed to convey company culture and values, to create an efficient and inspiring work environment, and to support our employees to give their best in both work and life. We offer the following benefits to eligible employees:

We cover 100% of the premium for employee medical insurance and approximately 75% of the premium for dependents, and offer a Health Savings Account (HSA) with a company match. We also offer Dental, Vision, Short/Long term Disability, Basic Life, Voluntary Life and AD&D insurance plans, in addition to Flexible Spending Account (FSA) options such as Health Care, Limited Purpose and Dependent Care.

Our time off and leave plans are: 10 paid holidays per year plus 17 days of Paid Personal Time Off (PPTO) (prorated upon hire and increased by tenure) and 10 paid sick days per year, as well as 12 weeks of paid Parental leave and 8 weeks of paid Supplemental Disability.

We also provide generous benefits such as mental and emotional health benefits through our EAP and Lyra, a 401K company match, and gym and cellphone service reimbursements. The Company reserves the right to modify or change these benefits programs at any time, with or without notice.
