Job Description
As an MLOps & AI Infrastructure Technical Referent at dLocal, you will be the senior technical reference for how we build, operate and evolve our ML and AI infrastructure.
Your mission is to enable Data Science and AI teams to take models and AI-powered services from idea to production in a reliable, observable and compliant way. You will own the technical direction of our MLOps stack, introduce AI safely into our engineering workflows, and help the team scale its impact as usage and complexity grow.
A core part of this role is to use agents and AI services to automate as much of our MLOps work as possible (from feature store and platform operations to fraud/anomaly workflows and ML cost optimization), working side by side with the AI team.
This is a hands-on architecture and leadership role: you won’t own product models yourself, but you will deeply influence how every model and AI component is trained, deployed, monitored and run in production.
What will I be doing?
1. Technical strategy & architecture (MLOps)
- Define and evolve the end-to-end ML platform architecture (data, training, registry, serving, monitoring, governance) used by multiple squads.
- Design standard patterns for:
  - Reproducible training pipelines and experiment tracking.
  - Model packaging, versioning and promotion flows (dev → staging → production).
  - Online and batch inference, with safe rollout strategies (canary, shadow, rollback).
- Balance reliability, performance and cost for ML workloads, working closely with SRE/Infra and Finance/FinOps.
2. Day‑to‑day MLOps enablement & operations
- Act as the go‑to person for complex MLOps questions: how to structure pipelines, choose serving patterns, or design monitoring and rollback.
- Review and challenge designs and deployments for new models and data pipelines, ensuring they follow platform standards and non‑functional requirements.
- Partner with Fraud, Anomaly and other product squads to ensure:
  - Clear SLAs/SLOs for ML components.
  - Proper logging, metrics and alerts for incidents and regressions.
- Contribute to on‑call readiness: playbooks, dashboards, incident reviews and continuous improvement of our operational posture.
3. AI infrastructure & AI‑assisted operations
- Define infrastructure, contracts and guardrails so that we can safely consume agents and AI services built by the AI team, and extend them when needed from MLOps.
- Design patterns and tooling so that AI and agents automate as much of our MLOps work as possible, for example:
  - Feature platform operations (feature store pipelines, backfills, parity checks, DQ/drift monitoring).
  - MLOps platform workflows (training/eval pipelines, promotion gates, rollbacks, documentation and runbook generation).
  - Operational flows in Fraud / Anomaly (triage of alerts, log/metric analysis, enrichment of incident context).
  - Platform FinOps & cost optimization (suggesting right‑sizing, schedule changes, decommissioning opportunities).
- Contribute to evaluation, observability and safety for these AI‑powered automations (e.g. prompts, policies, redaction, auditability), in close collaboration with dedicated AI teams.
4. Governance, security & compliance
- Set and maintain technical standards for:
  - Model and data access control, PII handling and redaction.
  - Auditability of model changes, deployments and runtime behavior.
  - Environment separation and change management for ML/AI workloads.
- Work with InfoSec and Architecture to ensure the platform aligns with regulatory and internal requirements while remaining practical for engineers and data scientists.
- Mentor MLOps and Data/ML engineers on:
  - System design, reliability and observability.
  - Good practices for CI/CD, testing and rollback in ML systems.
- Lead design and architecture reviews, helping teams de‑risk decisions and converge on simple, robust solutions.
- Collaborate closely with:
  - Data Science squads and the AI Team (to understand needs and shape the platform).
  - SRE/Infra (for capacity, reliability, networking and security).
  - Product/Engineering leaders (to align roadmap, trade‑offs and priorities).
What skills do I need?
- Solid experience owning or designing MLOps platforms or ML infrastructure used by multiple teams.
- Strong background in distributed systems and data/stream processing (e.g. Spark, Flink, or similar technologies).
- Experience building production‑grade ML pipelines:
  - Experiment tracking, reproducible training and model registry.
  - CI/CD for models and data pipelines.
  - Online and batch inference at scale.
- Familiarity with cloud‑based ML platforms (e.g. Databricks, SageMaker, Vertex AI, or equivalent) and container‑based deployments.
- Strong understanding of observability for ML systems:
  - Metrics, logs and traces.
  - Data and model drift, freshness and quality checks.
- Ability to communicate clearly with both technical and non‑technical stakeholders, translating infra and AI/ML trade‑offs into business language.
Nice to have
- Experience rolling out AI assistants (code or infra copilots, AI log analysis, etc.) inside engineering organizations, including policies and best practices.
- Exposure to LLM and AI infrastructure (gateways, vector stores, evaluation harnesses), even if not as a core focus.
- Prior responsibilities as Technical Referent / Tech Lead / Architect for platforms or shared services.
- Contributions to internal standards, RFCs, guilds or tech communities.
dLocal
About the job
Mar 13, 2026
Apr 12, 2026