Site Reliability Engineer - AI Platform

Posted 1 day ago

Company Description

Ubisoft is a global leader in gaming with teams across the world creating original and memorable gaming experiences, from Assassin’s Creed, Rainbow Six, to Just Dance and more. We believe diverse perspectives help both players and teams thrive. If you’re passionate about innovation and pushing entertainment boundaries, join our journey and help us create the unknown!

Founded in 1996, the Ubisoft Shanghai studio is a vibrant and exciting place where our talents get opportunities to co-develop great AAA blockbuster games, create cutting-edge online games, or produce fun mobile games.

To learn more, please visit: www.ubisoftgroup.com

Job Description

About the Role

Join the AI Initiatives team as a Site Reliability Engineer and help operate, scale, and evolve the foundation that powers AI products across the company.
This role sits within the AI team and focuses on ensuring that AI platforms, services, and agent-based systems are reliable, scalable, observable, and secure in production.

This is not a pure operations role. The position requires strong software engineering skills combined with deep experience in cloud infrastructure, DevOps practices, and system reliability. A genuine interest in AI systems and how they behave in real-world production environments is essential.

Responsibilities

As a Site Reliability Engineer - AI Platform, you will play a critical role in enabling the reliable delivery and operation of AI-powered products and platforms used across the organization.

Build and Operate Reliable AI Infrastructure
- Design, deploy, and operate cloud-native infrastructure supporting AI workloads, including LLM services, RAG pipelines, agent-based systems, and internal AI platforms.
Full-Stack DevOps & Engineering
- Develop automation, tooling, and services to support CI/CD, deployment, configuration, and lifecycle management of AI systems. Balance hands-on development work with infrastructure ownership and operational responsibilities.
Infrastructure as Code & Automation
- Define and manage infrastructure using Infrastructure as Code (e.g. Terraform, CloudFormation), and build automation for provisioning, scaling, recovery, and routine operations.
Observability & Incident Management
- Design and maintain observability solutions (monitoring, logging, tracing, alerting) to ensure high availability, fast detection of issues, and effective incident response for AI services.
System Architecture & Reliability
- Partner with AI engineers and product teams to review system designs, identify reliability risks, define SLOs/SLIs, and improve fault tolerance, scalability, and resilience of AI-powered systems.
Cloud Native Delivery
- Operate and evolve containerized platforms using Docker and Kubernetes; support safe and frequent deployments through robust CI/CD pipelines.
AI-Aware Operations
- Develop an understanding of AI-specific operational challenges such as model serving, LLM latency, rate limits, cost control, caching, retries, fallbacks, and data pipeline reliability.
Cross-Team Collaboration
- Work closely with AI engineers, software engineers, and product teams to ensure that reliability, operability, and scalability are first-class concerns throughout the product lifecycle.

Qualifications

We are seeking a seasoned professional with a strong technical background and a passion for building world-class AI applications.

Must-Have Qualifications:

8+ years of experience in software engineering, SRE, DevOps, or platform engineering roles.
Strong programming skills (e.g. Python, Go, JavaScript, or similar), with experience building internal tools and automation.
Solid experience with cloud platforms (AWS, GCP, or Azure) and cloud-native architectures.
Hands-on experience with DevOps practices, CI/CD pipelines, and container orchestration (Docker, Kubernetes).
Strong knowledge of Infrastructure as Code (Terraform, CloudFormation, or equivalent).
Experience designing and operating observability systems (monitoring, logging, alerting).
Strong understanding of system architecture, reliability engineering, and production operations.
Passion for AI technologies and curiosity about how AI systems behave at scale.

Nice-to-Have Qualifications:

Experience supporting AI or data-intensive systems in production environments.
Familiarity with AI/ML workloads, such as model serving, RAG pipelines, or agent-based systems.
Understanding of reliability challenges specific to AI systems (latency, cost control, scaling, failure modes).
Experience operating enterprise-grade platforms with high availability, security, and compliance requirements.
Familiarity with AI service platforms, e.g. AWS Bedrock or Azure Foundry.
Experience with AI agents and Model Context Protocol (MCP), including operating, integrating, or supporting agent-based systems in production environments.

Additional Information

Growth Opportunities

Joining our team as a Site Reliability Engineer - AI Platform offers a unique chance to work on industry-leading projects that shape the future of AI technology. You will have the opportunity to:

Engage in continuous learning and professional development to stay at the forefront of AI advancements.
Take on increased responsibilities, influence the strategic direction of our AI product offerings, and drive impactful innovation.