Cloud Hardware Development Engineer, Cloud AI/ML/storage server teams
Posted 9 days ago
Job Description
As a Cloud Hardware Development Engineer, you will be an end-to-end owner of storage and/or accelerator (AI/ML/GPU) server platforms — from New Product Introduction (NPI) through fleet health in production. You own the full lifecycle: design, development, qualification, launch, and ongoing operational excellence of servers running at scale in the AWS fleet.
You will work closely with internal customers to understand their technical needs and business goals, leveraging your experience with server design and the knowledge of various teams to architect solutions we deploy at scale. To deliver your products, you will work with an interdisciplinary team of component, firmware, power, mechanical, electrical, test, qualification, and manufacturing engineers, and lead our ODMs (design and manufacturing partners) to bring these servers to the data center. After launch, you own the fleet: monitoring quality, driving reliability improvements, and ensuring servers continue to meet customer requirements throughout their operational life.
This role demands deep technical curiosity and the willingness to jump in and personally solve the hardest problems. When a complex system failure occurs — whether during NPI qualification or in a production fleet of hundreds of thousands of servers — you roll up your sleeves, dive into the details across hardware, firmware, software, and physical layers, and drive to root cause. You don't wait for someone else to figure it out.
You will own end-to-end system reliability — proactively identifying deficiencies and driving toward zero-touch operations where automation detects, diagnoses, and resolves issues before customer impact. You will decompose complex server system problems (testability, reliability, diagnostics) into deliverable tasks and features, leading delivery yourself and through others in parallel.
This is a fast-paced, intellectually challenging position. You'll work with thought leaders in multiple technology areas, hold high standards for yourself and everyone you work with, and constantly look for ways to improve your products' performance, quality, and cost. We're changing an industry, and we want individuals who are ready for this challenge and want to reach beyond what is possible today.
Key job responsibilities
NPI — New Product Introduction
- Own the end-to-end NPI lifecycle for storage and/or accelerator (AI/ML/GPU) server platforms — from architecture definition through design, qualification, manufacturing ramp, and launch
- Lead technical solutions for complex server and rack system architectural challenges
- Work with ODM/manufacturing partners to develop, validate, and manufacture server products at scale
- Develop functional specifications, design verification plans, and test procedures
- Drive qualification and readiness milestones, ensuring new platforms meet performance, reliability, and cost targets before fleet deployment
- Identify and resolve technical risks early in the development cycle — don't let problems reach production
Fleet Health, Diagnostics & Automation
- Own fleet health for the server platforms you launch — reliability doesn't end at ship
- Design and implement predictive failure detection systems using telemetry, sensor data, error trending, and log correlation to identify hardware issues before they cause customer impact
- Drive toward zero-touch operations: help build detection, diagnosis, and remediation of faults without human intervention
- Debug complex system failures in time-sensitive settings — personally diving deep when the problem demands it
- Perform root cause analysis correlating across firmware, kernel, driver, thermal, power, and physical layers
Systems Design & Technical Depth
- Apply expertise across hardware, software, system design, x86 architecture, processes, and operations (compute, storage, network, GPU)
- Design and implement solutions to address system-level issues at large scale
- Decompose complex server system problems (testability, reliability, diagnostics) into deliverable tasks and features
- Collaborate with hardware, software, manufacturing, supply chain, and product management teams
Cross-Team Collaboration
- Work closely with internal customers to ensure new server hardware meets data path and control path requirements
- Identify potential problems with onboarding new servers into customer ecosystems early
- Collaborate across Hardware Engineering, component, firmware, test, qualification, and integration teams
- Partner with datacenter operations to close the loop between field failures and design improvements
A day in the life
Your day-to-day responsibilities include interfacing with internal and external customers to understand product requirements and facilitate system development on top of your server designs. You will learn about the operational challenges facing our existing fleet, with the goal of improving the current customer experience and developing better systems for future designs. You will work directly with vendors and ODMs (manufacturing partners) to scale your product. Some days you're reviewing a new platform design with your ODM; other days you're deep in logs and telemetry data chasing a failure mode across the fleet. You thrive on that range.
Amazon
About the job
Posted on
May 29, 2026
Apply before
Jun 28, 2026
Job type
Full-time
Location
US, TX