Senior Backend Developer - Machine Learning Platform
Are you ready to play a key role in simplifying the deployment of Machine Learning models?
Are you passionate about cloud-native technologies, automation, and developer experience? Coveo is looking for a Senior Developer to join our ML Model Training team! Your mission? Build and evolve the infrastructure that powers thousands of model rebuilds every day, enabling our Data Scientists and Applied Scientists to train their models at scale, reliably, and efficiently.
You’ll focus on simplifying the ML model development experience, designing tools and systems that abstract away complexity while giving internal users the visibility and control they need to iterate with confidence. Your work will directly impact how fast, how often, and how safely models are trained across Coveo’s AI ecosystem.
Here’s what you’ll be responsible for:
- Design simple, powerful interfaces and tools that enable scientists to configure and launch training jobs with minimal friction, whether for prototyping or production.
- Develop smart orchestration and automation mechanisms to prioritize, batch, retry, or roll back training jobs at massive scale.
- Champion performance and cost optimization, helping the organization manage compute usage responsibly without sacrificing velocity or quality.
- Implement robust observability layers so users can monitor performance, track metrics, and debug model training workflows.
- Collaborate with applied scientists and data engineers to understand their needs, improve developer experience, and continuously raise the bar on reliability and efficiency.
Here is what will qualify you for the role:
- 8+ years of backend or platform engineering experience, with a strong focus on cloud-native and distributed systems (Java, Python, AWS preferred).
- Deep understanding of scalable system design, CI/CD, and container orchestration (Kubernetes, ECS, or similar).
- Passion for developer experience: you care about ergonomics and eliminating friction for internal users.
- A problem-solving mindset, with the resourcefulness to analyze, optimize, and debug large-scale systems while continuously embracing a growth-oriented approach.
Here is what would make you stand out:
- Familiarity with Terraform & Kubernetes for infrastructure automation and container orchestration.
- Experience building ML infrastructure or internal platforms used by data science teams.
- Hands-on experience with job orchestration, task queues, or pipelines at scale.
- Solid grasp of observability practices (logs, metrics, traces), and how to build systems that are easy to monitor and debug.
Do you think you can bring this role to life? Send us your application. We want to get to know you!
Join the Coveolife!
We encourage all qualified candidates to apply regardless of, for example, age, gender, disability, gaps in CV, national or ethnic background. We know that applying for a new role is a lot of work and we really appreciate your time.
#li-hybrid