AI Evaluation Engineer
Posted 1 day ago
Job Description
About Distyl AI
Distyl is an applied AI technology company partnering with the world’s most ambitious institutions to rearchitect critical operations for the frontier of AI. Our customers include the largest companies in telecom, healthcare, insurance, manufacturing, consumer goods, and global social organizations.
We research and deploy technologies that power AI-native operations — both for our partners and for Distyl itself. Our work spans research into self-constructing systems, the development of the most reliable execution of AI systems, and products that transform mission-critical workflows. As a result, Distyl's technologies affect some of the world's largest operations — from hundreds of millions of consumer interactions to tens of millions of supply chain transactions and millions of patient journeys.
Distyl is backed by leading investors including Lightspeed Venture Partners, Khosla Ventures, Coatue, DST Global, and board members of more than 20 Fortune 500 companies. The results reflect this approach: a 100% production deployment success rate for our customers, and one of the few enterprise AI companies running a profitable business.
What We Are Looking For
At Distyl, we build AI systems using Evaluation-Driven Development—an approach where evaluation is not an afterthought, but the primary mechanism for iterating, improving, and trusting AI behavior in production.
AI Evaluation Engineers focus on designing and implementing the evaluation systems that drive this process. They are hands-on engineers who write production Python code, build evaluation pipelines, and use structured signals to guide system design, prompt iteration, and deployment decisions for real customer-facing AI systems.
This role is for engineers who believe that AI systems only improve when measurement is tightly coupled to development—and who want to apply that philosophy directly to systems that matter.
Key Responsibilities
Design and implement evaluation frameworks that enable Evaluation-Driven Development for AI systems deployed in customer environments
Define how system quality is measured in each domain, ensuring that evaluation signals reflect real user needs, domain constraints, and business objectives
Build and maintain golden test cases and regression suites in Python, using both human-authored and AI-assisted test generation to capture critical behaviors and edge cases. These test suites are treated as first-class system components that evolve alongside the AI system itself
Develop and maintain evaluation pipelines—offline and online—that integrate directly into system iteration loops. Evaluation results inform prompt design, agent logic, model selection, and release readiness, ensuring that system changes are driven by measurable improvements rather than intuition alone
Define, calibrate, and operate LLM-based graders, aligning automated judgments with expert human assessments. Investigate where evaluation signals diverge from real-world outcomes and refine grading approaches to maintain signal quality as systems and domains evolve
Work closely with Forward Deployed AI Engineers, Architects, Product Engineers, AI Strategists, and domain experts to ensure evaluation frameworks meaningfully guide system development and deployment in production
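For candidates unfamiliar with the golden-test and grader pattern described above, a minimal sketch follows. All names here (`GoldenCase`, `keyword_grader`, `run_suite`) are illustrative inventions, not Distyl's actual framework, and a deterministic keyword check stands in for an LLM-based judge:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class GoldenCase:
    """One human-authored test: an input plus the behaviors the output must show."""
    case_id: str
    prompt: str
    required_phrases: tuple[str, ...]  # critical behaviors to check for

def keyword_grader(response: str, case: GoldenCase) -> float:
    """Deterministic stand-in for an LLM judge: fraction of required phrases present."""
    hits = sum(1 for p in case.required_phrases if p.lower() in response.lower())
    return hits / len(case.required_phrases)

def run_suite(system: Callable[[str], str],
              cases: list[GoldenCase],
              threshold: float = 0.8) -> dict[str, bool]:
    """Run every golden case through the AI system; a case passes if its score clears the threshold."""
    return {c.case_id: keyword_grader(system(c.prompt), c) >= threshold for c in cases}

if __name__ == "__main__":
    cases = [
        GoldenCase("refund-policy", "What is the refund window?",
                   ("30 days", "original receipt")),
    ]
    # A stub system standing in for the real deployed model.
    fake_system = lambda prompt: "Refunds are accepted within 30 days with the original receipt."
    print(run_suite(fake_system, cases))  # {'refund-policy': True}
```

In practice the grader call would be an LLM judgment rather than string matching, but the shape of the loop (golden cases as data, a scoring function, a pass threshold gating releases) is the part that matters.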
What We Require
2+ years of software engineering experience
Strong Python Engineering Skills: Write clean, maintainable Python and are comfortable building evaluation and experimentation pipelines that run in production environments. You treat evaluation code with the same rigor as application code
Experience with Evaluation-Driven or Experiment-Driven Development: Experience using structured evaluation or experimentation frameworks to drive system iteration, and understand the pitfalls of overfitting to metrics that don’t reflect real outcomes
Ability to Translate Human Judgment into Code: Work with subject matter experts to elicit high-quality judgments and encode them into test cases, scoring functions, and graders that scale
Systems-Oriented Mindset: Understand how evaluation interacts with prompts, agents, data, and deployment. You design evaluation systems that support fast iteration while maintaining trust and safety in production
AI-Native Working Style: Use AI tools to generate tests, analyze failures, explore edge cases, and accelerate debugging and iteration
Travel: Ability to travel 25-50%
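The calibration work mentioned above (aligning automated graders with expert human assessments) often starts with a simple agreement check between grader verdicts and human labels. A sketch, assuming binary pass/fail labels; the chance-corrected agreement (Cohen's kappa) is written out by hand rather than imported:

```python
def agreement_rate(grader: list[bool], human: list[bool]) -> float:
    """Fraction of cases where the automated grader matches the expert label."""
    assert len(grader) == len(human) and human
    return sum(g == h for g, h in zip(grader, human)) / len(human)

def cohens_kappa(grader: list[bool], human: list[bool]) -> float:
    """Chance-corrected agreement between grader and expert labels."""
    n = len(human)
    po = agreement_rate(grader, human)                   # observed agreement
    p_pass = (sum(grader) / n) * (sum(human) / n)        # chance both say pass
    p_fail = (1 - sum(grader) / n) * (1 - sum(human) / n)  # chance both say fail
    pe = p_pass + p_fail
    return (po - pe) / (1 - pe) if pe < 1 else 1.0

if __name__ == "__main__":
    grader = [True, True, False, True, False, True]
    human  = [True, True, False, False, False, True]
    print(round(agreement_rate(grader, human), 3))  # 0.833
    print(round(cohens_kappa(grader, human), 3))    # 0.667
```

Raw agreement alone can be misleading when one label dominates, which is why a chance-corrected statistic is worth computing alongside it before trusting a grader to stand in for human review.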
What We Offer
The base salary range for this role is $150K – $250K, depending on experience, location, and level. In addition to base compensation, this role is eligible for meaningful equity, along with a comprehensive benefits package
100% covered medical, dental, and vision for employees and dependents
401(k) with additional perks (e.g., commuter benefits, in‑office lunch)
Access to state‑of‑the‑art models, generous usage of modern AI tools, and real‑world business problems
Ownership of high‑impact projects across top enterprises
A mission‑driven, fast‑moving culture that prizes curiosity, pragmatism, and excellence
Distyl has offices in San Francisco and New York. This role follows a hybrid collaboration model with 3+ days per week (Tuesday–Thursday) in‑office.
We’re grateful for the strong interest in this role. The best way to get your profile in front of our team is to apply directly through our careers page, where all applications are reviewed. Due to the high volume of interest, we’re not able to review or respond to all direct emails or LinkedIn messages. We will be in touch with every applicant once we’ve completed our review, regardless of the decision.
About the job: Apr 23, 2026 – May 23, 2026