Senior AI/ML Infrastructure Engineer

Austin, Texas

Full-time

Job Description

WHAT YOU DO AT AMD CHANGES EVERYTHING

We care deeply about transforming lives with AMD technology to enrich our industry, our communities, and the world. Our mission is to build great products that accelerate next-generation computing experiences – the building blocks for the data center, artificial intelligence, PCs, gaming and embedded. Underpinning our mission is the AMD culture. We push the limits of innovation to solve the world’s most important challenges. We strive for execution excellence while being direct, humble, collaborative, and inclusive of diverse perspectives.

AMD together we advance_

THE ROLE:

AMD is looking for a specialized software engineer who is passionate about improving the performance of key applications and benchmarks. You will be a member of a core team of incredibly talented industry specialists and will work with the very latest hardware and software technology.

THE PERSON:

The ideal candidate should be passionate about software engineering and possess the leadership skills to drive sophisticated issues to resolution. The candidate should also be able to communicate effectively and work well with different teams across AMD.

KEY RESPONSIBILITIES:

Architect and maintain robust, scalable infrastructure for training and deploying machine learning and large language models, ensuring optimal performance.
Collaborate with AI researchers, data scientists, and software engineers to streamline the end-to-end AI model lifecycle, from development to deployment and monitoring.
Design, develop, and fine-tune large-scale language models and other deep learning models for various applications.
Implement and manage CI/CD pipelines for AI models, facilitating continuous integration, continuous deployment, and continuous training practices.
Monitor the performance of machine learning and large language models, identifying and addressing issues related to data drift, model degradation, and resource constraints.
Develop and enforce best practices for version control, testing, and deployment of AI models, ensuring compliance with industry standards and regulatory requirements.
Optimize computing resources for training and inference processes, leveraging cloud technologies and on-premises solutions.
Stay updated with the latest advancements in AI/ML technologies, tools, and practices, integrating them into our operations to enhance efficiency and effectiveness.
Implement best practices in model training, including managing overfitting and underfitting and ensuring model generalizability across various domains.
Fine-tune models for specific tasks or industries using targeted techniques, and adapt models to new domains or applications.
Develop and maintain tools and frameworks to streamline the model training, validation, and deployment process.
Document methodologies, processes, and findings; effectively communicate complex technical information to both technical and non-technical stakeholders.
Mentor junior team members and contribute to the team's collective knowledge and expertise in deep learning and AI.

PREFERRED EXPERIENCE:

Software Development (Systems Engineering Focus): Proven experience in designing, developing, and maintaining robust software systems, with a deep understanding of performance, scalability, and reliability.
ML Ops Expertise: Hands-on experience in deploying, monitoring, and managing machine learning models in production environments, including automation of pipelines and CI/CD practices.
Strong proficiency in Python and familiarity with deep learning frameworks like TensorFlow, PyTorch, and Keras.
Problem-Solving: Demonstrated ability to troubleshoot complex issues, resolve critical bottlenecks, and drive root cause analysis under time-sensitive conditions.
Cloud & Infrastructure Knowledge: Familiarity with cloud platforms (AWS, Azure, GCP) and containerization/orchestration technologies (Docker, Kubernetes).
Understanding of the ethical considerations and security implications of deploying AI models, particularly large language models.
Collaboration & Communication: Strong cross-functional collaboration skills with the ability to clearly communicate technical concepts to both technical and non-technical stakeholders.
Continuous Learning & Adaptability: Proven track record of quickly adapting to new technologies, tools, and methodologies in a fast-paced environment.

ACADEMIC CREDENTIALS:

Bachelor's or Master's degree in Computer Science, Computer Engineering, Electrical Engineering, or equivalent.

Benefits offered are described: AMD benefits at a glance.

AMD does not accept unsolicited resumes from headhunters, recruitment agencies, or fee-based recruitment services. AMD and its subsidiaries are equal opportunity, inclusive employers and will consider all applicants without regard to age, ancestry, color, marital status, medical condition, mental or physical disability, national origin, race, religion, political and/or third-party affiliation, sex, pregnancy, sexual orientation, gender identity, military or veteran status, or any other characteristic protected by law. We encourage applications from all qualified candidates and will accommodate applicants’ needs under the respective laws throughout all stages of the recruitment and selection process.

