AMD
AI Performance Engineer - GPU
Amsterdam, Netherlands
Job Description
WHAT YOU DO AT AMD CHANGES EVERYTHING

At AMD, our mission is to build great products that accelerate next-generation computing experiences: from AI and data centers to PCs, gaming, and embedded systems. Grounded in a culture of innovation and collaboration, we believe real progress comes from bold ideas, human ingenuity, and a shared passion to create something extraordinary. When you join AMD, you'll discover the real differentiator is our culture. We push the limits of innovation to solve the world's most important challenges, striving for execution excellence while being direct, humble, collaborative, and inclusive of diverse perspectives. Join us as we shape the future of AI and beyond. Together, we advance your career.

AI Performance Engineers focus on pushing machine learning workloads to peak hardware efficiency, with an emphasis on low-level optimization and kernel performance.

As an AI Performance Engineer, you will:
- Analyze and explore recent ML models and workloads, understand their compute, memory, and instruction-level behavior, and optimize them on AMD GPUs for both inference and training
- Design, implement, and tune GPU kernels at a low level, including C++, intrinsics, and hand-written GPU assembly
- Perform deep profiling and bottleneck analysis across compute, memory, and execution pipelines
- Optimize instruction scheduling, memory access patterns, and occupancy to achieve the final 5% of performance uplift
- Work closely with hardware, compiler, and software teams to drive performance improvements across the full stack

Collaboration with others:
- The position is part of an AI and GPU performance optimization workstream at AMD
- Collaborate with AI developers, compiler engineers, and hardware architects to understand performance limits and opportunities
- Work with multiple teams located in Finland and the UK, as well as internationally
- Communicate performance bottlenecks, solutions, and optimization strategies clearly across teams

Main goals in the first 6 months:
- Benchmark, analyze, and optimize the performance of key machine learning workloads on single- and multi-GPU AMD systems
- Design, implement, and tune high-performance GPU kernels for tensor operations such as matrix multiplication, attention, and convolutions (a minimal illustrative kernel sketch follows this section)
- Apply instruction-level and memory-level optimizations to achieve measurable performance improvements
- Deliver high-quality, well-documented, production-ready performance-critical code

Ideal candidate profile:
The ideal candidate has a strong interest in:
- Low-level performance optimization
- GPU programming and hardware architecture
- Instruction scheduling, memory hierarchies, and execution pipelines
- Extracting the maximum achievable performance from hardware under real-world constraints

You are passionate about getting the best out of the hardware and motivated by the challenge of delivering that extra 5% performance uplift that separates good from exceptional.
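For context on the kind of low-level kernel work described above, the sketch below shows a minimal shared-memory tiled matrix-multiply kernel written in HIP (C++). It is an illustrative, generic example only: the tile size, problem size, and launch configuration are assumptions made for this sketch and are not taken from AMD code or from this role's actual workloads.

```cpp
// Illustrative sketch only: a generic tiled SGEMM in HIP (C++), showing the
// shared-memory (LDS) tiling and memory-access patterns this role tunes.
// Tile size, problem size, and launch configuration are assumptions.
#include <hip/hip_runtime.h>
#include <cstdio>
#include <vector>

constexpr int TILE = 16;  // assumed tile edge; real kernels tune this per GPU

// C = A * B for square N x N matrices. Each tile of A and B is staged in LDS
// once and reused TILE times, instead of re-reading global memory for every
// multiply-accumulate.
__global__ void sgemm_tiled(const float* A, const float* B, float* C, int N) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < N; t += TILE) {
        // Cooperative, coalesced loads of one tile of A and one tile of B.
        As[threadIdx.y][threadIdx.x] =
            (row < N && t + threadIdx.x < N) ? A[row * N + t + threadIdx.x] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] =
            (t + threadIdx.y < N && col < N) ? B[(t + threadIdx.y) * N + col] : 0.0f;
        __syncthreads();

        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    if (row < N && col < N) C[row * N + col] = acc;
}

int main() {
    const int N = 512;  // assumed problem size
    const size_t bytes = size_t(N) * N * sizeof(float);
    std::vector<float> hA(N * N, 1.0f), hB(N * N, 2.0f), hC(N * N, 0.0f);

    float *dA, *dB, *dC;
    hipMalloc(reinterpret_cast<void**>(&dA), bytes);
    hipMalloc(reinterpret_cast<void**>(&dB), bytes);
    hipMalloc(reinterpret_cast<void**>(&dC), bytes);
    hipMemcpy(dA, hA.data(), bytes, hipMemcpyHostToDevice);
    hipMemcpy(dB, hB.data(), bytes, hipMemcpyHostToDevice);

    dim3 block(TILE, TILE);
    dim3 grid((N + TILE - 1) / TILE, (N + TILE - 1) / TILE);
    hipLaunchKernelGGL(sgemm_tiled, grid, block, 0, 0, dA, dB, dC, N);
    hipDeviceSynchronize();

    hipMemcpy(hC.data(), dC, bytes, hipMemcpyDeviceToHost);
    std::printf("C[0] = %f (expected %f)\n", hC[0], 2.0f * N);

    hipFree(dA); hipFree(dB); hipFree(dC);
    return 0;
}
```

In practice, a kernel like this would be further tuned per GPU architecture (register pressure, LDS bank conflicts, instruction scheduling, occupancy), which is the kind of final-percent optimization work this role focuses on.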
Required skills & qualifications:
- Strong understanding of GPU architectures and low-level optimization techniques, including memory hierarchy, instruction scheduling, and performance tradeoffs
- GPU software development using HIP, CUDA, or OpenCL
- Strong C++ skills for performance-critical systems programming
- Experience with profiling, debugging, benchmarking, and performance analysis tools
- Experience in high-performance computing (HPC) or performance-critical systems
- Familiarity with modern ML frameworks and libraries (e.g., PyTorch, MIOpen)
- Familiarity with tile programming and related frameworks (e.g., Triton, CUTLASS)
- Strong written and spoken English
- Experience with Docker, Singularity, Slurm, or Kubernetes is a plus
- BSc, MSc, PhD, or equivalent experience in Computer Science, Engineering, Physics, or a related technical field

How to stand out from the crowd:
- Experience with GPU assembly programming and/or compiler backends
- Experience with CPU assembly (x86 or Arm) and microarchitectural optimization
- Background in compilers, code generation, or instruction-level optimization

#LI-MH3 #LI-HYBRID

Benefits offered are described in AMD benefits at a glance.

AMD does not accept unsolicited resumes from headhunters, recruitment agencies, or fee-based recruitment services. AMD and its subsidiaries are equal opportunity, inclusive employers and will consider all applicants without regard to age, ancestry, color, marital status, medical condition, mental or physical disability, national origin, race, religion, political and/or third-party affiliation, sex, pregnancy, sexual orientation, gender identity, military or veteran status, or any other characteristic protected by law. We encourage applications from all qualified candidates and will accommodate applicants' needs under the respective laws throughout all stages of the recruitment and selection process.