XPENG Motors

Staff AI Infrastructure Engineer: Inference Platform

XPENG Motors, Santa Clara, California, US, 95053


XPeng Motors is one of China's leading smart electric vehicle ("EV") companies. We design, develop, manufacture, and market smart EVs that are seamlessly integrated with advanced Internet, AI, and autonomous driving technologies. We are committed to in-house R&D and intelligent manufacturing to create a better mobility experience for our customers. We strive to transform smart electric vehicles with technology and data, shaping the mobility experience of the future.

We're looking for people who are as excited as we are to solve the complex technical challenges in autonomous driving, see the results of their work in mass-production EV cars, and make a tremendous impact on our future.

Job Responsibilities:

- Design, implement, and operate components of our novel model inference platform (e.g., quota management, job scheduling, and queuing systems). You will play a critical role in scheduling GPU resources.
- Identify performance bottlenecks and optimization opportunities.
- Work closely with Machine Learning Engineers to evolve the inference platform to fit their use cases.
- Monitor system health, diagnose and troubleshoot issues, and perform routine maintenance tasks to ensure the reliability of the distributed inference infrastructure.
- Build and maintain documentation for infrastructure components and systems.

Minimum Skill Requirements:

- Advanced degree (MS or PhD) in Computer Science or a related field
- 5+ years of industry or research experience in ML infrastructure and model inference
- Expertise in programming languages such as Python, Java, or C++, and experience with distributed computing frameworks
- Experience with high-throughput, fault-tolerant system design
- Proficiency with Docker and Kubernetes
- Experience with Jenkins, GitHub CI/CD, or similar tools
- Experience with Prometheus, Grafana, or similar monitoring solutions
- Excellent problem-solving skills and attention to detail
- Strong communication skills and the ability to work in a collaborative environment

Preferred Skill Requirements:

- Strong background in building and maintaining large-scale distributed systems
- Strong background in performance optimization and system scaling
- Experience scheduling jobs on heterogeneous computation resources
- Deep understanding of cloud computing platforms
- Deep knowledge of monitoring and observability practices
- Experience with CUDA packages
- Experience with PyTorch, TensorFlow, or similar frameworks

What We Provide:

- A dynamic, supportive, and engaging work environment where creativity thrives.
- The opportunity to make a significant impact on the transportation revolution through advancements in autonomous driving.
- Exposure to cutting-edge technologies alongside top industry talent.
- Competitive compensation package.
- Perks including snacks, lunches, and organized fun activities.

The base salary range for this full-time position is $180,000-$300,000, in addition to bonus, equity, and benefits. Our salary ranges are determined by role, level, and location. The range displayed on each job posting reflects the minimum and maximum target for new-hire salaries for the position across all US locations. Within the range, individual pay is determined by work location and additional factors, including job-related skills, experience, and relevant education or training.

We are an Equal Opportunity Employer. It is our policy to provide equal employment opportunities to all qualified persons without regard to race, age, color, sex, sexual orientation, religion, national origin, disability, veteran status, marital status, or any other protected category set forth in federal or state regulations.