High Performance Computing Engineer

Santa Clara HQ

Job Description

This job posting has expired and is no longer accepting applications.

Boson AI is a startup building large language tools for everyone to use. Our founders (Alex Smola, Mu Li) and a team of Deep Learning, Optimization, NLP, AutoML and Statistics scientists and engineers are working on high-quality generative AI models for language, audio, and entertainment.

About The Role

We are looking for a Senior High Performance Computing Engineer to help us operate the GPUs, network, and filesystem in our datacenter deployment in Toronto. The ideal candidate has strong problem-solving skills and an ability to learn new tools. Experience with Slurm, MAAS, Ceph, InfiniBand, NVIDIA DeepOps, Ethernet networking, and related tools is a big plus. You should be comfortable performing some amount of hardware configuration.

You will have the opportunity to work with NVIDIA H100 and A100 GPUs, over 20 PB of storage, terabit networking, and hundreds of computers. You will be responsible for deploying and operating a broad range of infrastructure technologies and hardware systems.

A day in the life:

Manage large, private, high-end GPU clusters

Own the full lifecycle of physical systems, including deployment of new hardware, operations, triage, and troubleshooting

Configure and maintain network switches (Tomahawk Ethernet, Mellanox InfiniBand)

Configure and maintain MAAS, Ceph, Slurm and Kubernetes

Configure and automate on-premises Linux-based systems at scale using infrastructure-as-code practices

Configure and maintain the network, e.g. Layer 3 networking

Learn about new tools and deploy them

You might be a great fit if you have:

Strong background in high performance computing

Experience with on-premises data center operations and technologies

Experience in managing a large hardware cluster

Proficiency in at least one programming language (e.g. Python) and ability to write clean, maintainable code

Experience in designing, deploying, and maintaining production-grade machine learning systems at scale

Familiarity with GPU utilization for machine learning workloads and optimization techniques

Experience managing firmware and system updates, e.g. on SuperMicro systems

The ability to solve problems and to learn new techniques is key.
