Senior Software Engineer, AI Training & Infrastructure
Posted 45 days ago
Job Description
This job posting has expired and is no longer accepting applications.
Company Overview
At Skild AI, we are building the world's first general-purpose robotic intelligence that is robust and adapts to unseen scenarios without failing. We believe massive scale through data-driven machine learning is the key to unlocking these capabilities for the widespread deployment of robots within society. Our team consists of individuals with varying levels of experience and backgrounds, from new graduates to domain experts. Relevant industry experience is important, but ultimately less so than your demonstrated abilities and attitude. We are looking for passionate individuals who are eager to explore uncharted waters and contribute to our innovative projects.
Position Overview
Skild AI, Inc. seeks a Senior Software Engineer, AI Training & Infrastructure in San Mateo, CA responsible for building and scaling training infrastructure and tools that support the full ML lifecycle—data preparation, training orchestration, evaluation, and deployment—for real-world robotics applications. This includes performance, reliability, observability, and developer productivity across distributed training systems, as well as data processing for multimodal datasets, performance tuning of training jobs, and media processing/compression (e.g., ffmpeg). Specific duties include: (i) architecting, building, and maintaining distributed training pipelines and frameworks spanning data ingest/preprocessing, large-scale training, and evaluation; (ii) optimizing training performance and resource utilization by identifying bottlenecks and implementing improvements in data loading, I/O, caching, sharding, and prefetching; (iii) integrating state-of-the-art ML techniques into production training systems in collaboration with research/ML teams; (iv) implementing monitoring, logging, alerting, automated testing, and CI/CD for reliable training operations; and (v) developing developer tooling and documentation, including dashboards and utilities, to streamline experimentation at scale and improve engineer productivity.
Responsibilities
- Architecting, building, and maintaining distributed training pipelines and frameworks spanning data ingest/preprocessing, large-scale training, and evaluation.
- Optimizing training performance and resource utilization by identifying bottlenecks and implementing improvements in data loading, I/O, caching, sharding, and prefetching.
- Integrating state-of-the-art ML techniques into production training systems in collaboration with research/ML teams.
- Implementing monitoring, logging, alerting, automated testing, and CI/CD for reliable training operations.
- Developing developer tooling and documentation, including dashboards and utilities, to streamline experimentation at scale and improve engineer productivity.
Minimum Requirements
- Must have a master’s degree (or foreign equivalent) in Computer Science, Robotics, Engineering, or a related field and two (2) years of experience in machine learning infrastructure. Experience can be concurrent.
- Must also have two (2) years of experience designing and operating distributed training pipelines at scale, including data preprocessing, orchestration, and evaluation. Experience can be concurrent.
- Must have any experience with each of the following: (i) Python or C++ and at least one deep learning library (e.g., PyTorch, TensorFlow, or JAX); and (ii) CI/CD and automated testing for ML/infra services. Experience can be concurrent.
- Must have knowledge of: (i) optimizing data loading and I/O for deep learning workloads (e.g., PyTorch DataLoader, sharding, prefetching, or caching); (ii) processing multimodal datasets and formats (e.g., HDF5, TFRecord, Parquet, or equivalent) and image processing/compression (e.g., OpenCV or ffmpeg); (iii) cloud-based training in AWS, Google Cloud, or Azure; (iv) implementing monitoring, logging, and alerting for training systems; (v) Linux OS fundamentals and operation at large scale; (vi) distributed systems and ML training techniques/models; and (vii) core software engineering principles, including algorithms, data structures, and system design. Experience can be concurrent.
Apply online at skild.ai/career.
Skild AI
Apr 9, 2026
May 9, 2026