SoC Firmware Engineering Manager, Annapurna Labs Machine Learning Acceleration, AWS
Job Description
When a new Trainium or Inferentia chip comes back from the fab, our code is the first software to touch it. We're looking for a hands-on engineering manager who lives and breathes low-level software — someone who's debugged register-level issues at 3am and wants to build a team that does it better.
Our SoC HAL (Hardware Abstraction Layer) team owns the lowest layer of user-space software on AWS's custom ML accelerator chips: the firmware that boots, configures, and manages every hardware block on the SoC. Your software runs as a shared library on embedded Linux, reaching into the chip to program PCIe links, initialize HBM controllers, configure PLLs, manage interrupt controllers, and orchestrate fabric interconnects across 270+ hardware block instances per chip — all deployed across millions of servers in AWS's global fleet.
Tech stack: C++17, CMake, GoogleTest, Python, SystemVerilog DPI, SPI, APB/AXI bus protocols, PCIe, UCIe, HBM, PLL, custom IPs
As the SoC Firmware Manager, you will:
- Manage, coach, and grow a team of 6 engineers — set technical direction, own hiring, and create an environment where strong engineers want to stay
- Coordinate deliverables across chip architects, RTL designers, verification engineers, validation engineers, and platform software teams — you're the single point of accountability for HAL readiness on every new chip program
- Own bring-up for new SoC tape-outs, from first-silicon power-on through production fleet deployment
- Prioritize work across multiple concurrent chip programs and customer teams, balancing urgent bring-up needs against long-term architecture investments
- Drive the architecture of our C++ template metaprogramming framework, BUTR (Built-in Unit Test for Registers), and HITL (Hardware-in-the-Loop) test infrastructure
- Ship the same C++ codebase to three execution environments: SystemVerilog DPI for chip verification, QEMU for emulation, and Carbon OS on embedded microcontrollers for the production fleet
- Get into the weeds alongside your team — debug register-level HW/SW interactions, review code, and write code yourself when it matters
Most firmware teams target one platform and ship to a few thousand units. We target three platforms from a single source tree and deploy across AWS's global fleet — where a single register misconfiguration can impact millions of servers. Our software must be stateless, survive live-updates on running production servers without reboots, and be correct down to individual register bits. The microcontroller can reboot at any time — including during customer workloads — and the HAL must resume managing the SoC by querying hardware state on-demand. No cached state, no assumptions.
Your pre-silicon software runs in simulation and emulation months before real silicon arrives. When the chip comes back from the fab, you validate those predictions on real hardware — and when they don't match, you figure out whether it's a silicon bug or a software bug. For Trainium3, our HAL enabled a full ML training workload within 12 hours of first power-on: https://www.aboutamazon.com/news/aws/trainium-3-ultraserver-faster-ai-training-lower-cost
No ML background needed. Your firmware is the foundation that enables ML training across clusters of thousands of interconnected accelerators — you'll work on components like PCIe and HBM, but won't need to understand ML itself.
This role can be based in Cupertino, CA or Austin, TX. The team is split between the two sites.
Amazon
About the job
Posted on: Apr 3, 2026
Apply before: May 3, 2026
Job type: Full-time
Category: Machine Learning
Location: US, CA