Distinguished Software Engineer - AI/ML Engineer - Agentic Systems
Posted 1 day ago
Job Description
Position Summary...
What you'll do...
As a Distinguished AI/ML Engineer within Walmart Global Tech’s Reliability Engineering Organization, you will lead the technical development of next-generation agentic AI systems and intelligent automation solutions that ensure mission-critical reliability, scalability, and operational excellence across Walmart’s entire technology ecosystem. You will architect and implement cutting-edge machine learning platforms and autonomous agents that transform how we manage change and performance, monitor, predict, and automatically resolve issues across all Walmart systems, supporting millions of associates and customers globally.
Walmart Global Tech’s Reliability Engineering Organization is built with hybrid systems and software engineers who take technical ownership for change engineering, change management, performance engineering, reliability, scalability, automation, and mission-critical issues related to uptime, availability, and rapid continuous improvement across Walmart’s e-commerce, stores, and omni-channel platforms. As a technical expert in this domain, you will drive the evolution of practices into AI-powered, self-healing, and autonomous systems built on modern technology stacks with intelligent change management and predictive performance optimization. You will also define and implement unified, intelligent, and operationally robust technical solutions and tools for Walmart Technology organizations across all channels and geographies.
About the Team
The Reliability Engineering Organization at Walmart Global Tech is responsible for ensuring the reliability, availability, and performance of all systems that power the world’s largest retailer. As a Fortune #1 company, our work impacts hundreds of millions of customers and associates globally, across every transaction, search, and interaction spanning Walmart’s digital and physical ecosystem.
We are the guardians of system reliability for Walmart’s e-commerce platform, supply chain systems, in-store technology, financial services, and all critical business operations. Our Reliability Engineering organization is at the forefront of applying advanced AI/ML technologies to reliability challenges, building autonomous systems that can predict, prevent, and resolve issues before they impact customers or business operations.
Reliability Engineering is a core engineering discipline within Walmart Global Tech, working closely with all product and engineering teams across the enterprise to ensure every system meets the highest standards of reliability, scalability, and performance. We are deeply invested in building a robust, intelligent, and highly automated technology foundation that supports Walmart’s mission to help people live better through innovation and operational excellence.
What You’ll Do
AI/ML & Agentic Systems Technical Leadership
- Architect and develop advanced agentic AI systems that autonomously manage complex reliability engineering workflows, predictive failure analysis, and self-optimization across Walmart’s technology ecosystem.
- Design and implement multi-agent orchestration platforms that coordinate autonomous agents for change management, capacity planning, and performance optimization across e-commerce, supply chain, and in-store systems.
- Build intelligent observability and monitoring platforms using ML-driven anomaly detection, predictive analytics, and autonomous resolution across Walmart’s entire technology landscape.
- Develop self-healing infrastructure platforms that leverage AI to predict, prevent, and automatically remediate system issues before they impact customers, associates, or business operations.
- Design, write, and build advanced tools to improve latency, availability, scalability, and change management across Walmart Technology systems, including:
  - Engineering reliability using metrics and measurements across all domains
  - Enabling system scaling through technical solutions, automation, and process optimization
  - Building tools and automation to prevent recurrence of failures across mission-critical services
  - Enhancing instrumentation to create a cohesive, end-to-end view of system health with particular focus on failure points
- Architect and implement fault-tolerant systems and services across Walmart’s hybrid cloud infrastructure with emphasis on autonomous recovery and intelligent failure prediction.
- Collaborate with engineering teams and leadership to reduce mean time to detect (MTTD) and mean time to restore (MTTR) through intelligent automation and predictive capabilities.
- Partner with service owners across e-commerce, supply chain, stores, fintech, and other domains to define SLA breach detection and change-related anomalies, ensuring systems meet SLAs while maintaining optimal performance and user experience.
- Perform complex troubleshooting and analysis of large-scale distributed systems using deep expertise in coding, algorithms, and distributed systems design.
- Partner with engineering organizations across E-commerce, Supply Chain, Store Technology, Fintech, and Data Platforms to deliver autonomous reliability solutions using advanced machine learning, natural language processing, and computer vision.
- Drive development of MLOps and AIOps platforms that enable continuous learning, deployment, monitoring, and autonomous optimization of reliability systems.
- Innovate in agentic AI technologies for Reliability Engineering, including:
  - Large language models for automated incident response
  - Reinforcement learning agents for capacity optimization
  - Multi-modal AI for infrastructure monitoring
  - Federated learning for cross-domain reliability insights
- Implement advanced CI/CD pipelines for reliability platforms with automated validation, deployment, rollback, and built-in observability.
- Establish platform engineering excellence by building reusable reliability infrastructure, intelligent monitoring platforms, and developer productivity tools.
- Provide technical mentorship and thought leadership across Walmart Technology through code reviews, design discussions, and knowledge sharing.
- Bachelor’s or Master’s degree in Engineering, Computer Science, or a related field with 12+ years of hands-on experience in Reliability Engineering, AI/ML Engineering, or Platform Engineering.
- Proven record as a senior individual contributor influencing architecture and driving technical excellence across large organizations.
- Deep experience operating mission-critical systems, with expertise in MTTD, MTTR, availability, change management, model performance, and autonomous system reliability.
- Expert-level AI/ML engineering experience, including deep learning frameworks such as TensorFlow and PyTorch and large-scale production ML deployments.
- Advanced experience with agentic AI systems, including multi-agent frameworks, autonomous decision-making systems, LLM-based agents, and agent orchestration platforms.
- Comprehensive Reliability Engineering expertise, including service management (Incident, Problem, and Change Management) and performance and capacity engineering for AI/ML systems.
- Expert-level cloud engineering experience (Azure, GCP, AWS) with containerization (Kubernetes, Docker), serverless architectures, and cloud-native AI services.
- Deep observability experience across distributed tracing, metrics, logs, APM, and AI-driven anomaly detection.
- Strong platform engineering background including infrastructure as code, service mesh architectures, API gateways, and self-service developer platforms.
- MLOps and model lifecycle management using platforms such as MLflow, Kubeflow, or Seldon.
- NLP and computer vision expertise for intelligent log analysis, automated incident response, and visual infrastructure monitoring.
- Edge computing and distributed systems experience for retail stores and distribution centers.
- Real-time streaming architectures (Kafka, Pulsar).
- Chaos engineering, fault injection, and performance optimization for large-scale distributed systems.
- Open-source contributions in reliability, observability, or infrastructure automation.
At Walmart, we offer competitive pay as well as performance-based bonus awards and other great benefits for a happier mind, body, and wallet. Health benefits include medical, vision and dental coverage. Financial benefits include 401(k), stock purchase and company-paid life insurance. Paid time off benefits include PTO (including sick leave), parental leave, family care leave, bereavement, jury duty, and voting. Other benefits include short-term and long-term disability, company discounts, Military Leave Pay, adoption and surrogacy expense reimbursement, and more.
You will also receive PTO and/or PPTO that can be used for vacation, sick leave, holidays, or other purposes. The amount you receive depends on your job classification and length of employment. It will meet or exceed the requirements of paid sick leave laws, where applicable. For information about PTO, see https://one.walmart.com/notices.
Live Better U is a Walmart-paid education benefit program for full-time and part-time associates in Walmart and Sam's Club facilities. Programs range from high school completion to bachelor's degrees, including English Language Learning and short-form certificates. Tuition, books, and fees are completely paid for by Walmart.
Eligibility requirements apply to some benefits and may depend on your job classification and length of employment. Benefits are subject to change and may be subject to a specific plan or program terms.
For information about benefits and eligibility, see One.Walmart.
The annual salary range for this position is $169,000.00 - $338,000.00. Additional compensation includes annual or quarterly performance bonuses. Additional compensation for certain positions may also include:
- Stock
Minimum Qualifications...
Outlined below are the required minimum qualifications for this position. If none are listed, there are no minimum qualifications.
Option 1: Bachelor's degree in computer science, computer engineering, computer information systems, software engineering, or related area and 6 years’ experience in software engineering or related area.
Option 2: 8 years’ experience in software engineering or related area.
Preferred Qualifications...
Outlined below are the optional preferred qualifications for this position. If none are listed, there are no preferred qualifications.
Master’s degree in computer science, computer engineering, computer information systems, software engineering, or related area and 4 years' experience in software engineering or related area.
We value candidates with a background in creating inclusive digital experiences, demonstrating knowledge in implementing Web Content Accessibility Guidelines (WCAG) 2.2 AA standards, assistive technologies, and integrating digital accessibility seamlessly. The ideal candidate would have knowledge of accessibility best practices and join us as we continue to create accessible products and services following Walmart’s accessibility standards and guidelines for supporting an inclusive culture.
Primary Location...
1345 Crossman Ave, Sunnyvale, CA 94089-1114, United States of America
Walmart and its subsidiaries are committed to maintaining a drug-free workplace and have a no-tolerance policy regarding the use of illegal drugs and alcohol on the job. This policy applies to all employees and aims to create a safe and productive work environment.
About the job
Posted on: Mar 19, 2026
Apply before: Apr 18, 2026
Job type: Full-time
Salary range: $169,000 - $338,000
Category: ML Engineer