d-Matrix
Company
AI Infrastructure Solution Architect, Principal
Job Description
At d-Matrix, we are focused on unleashing the potential of generative AI to power the transformation of technology. We are at the forefront of software and hardware innovation, pushing the boundaries of what is possible. Our culture is one of respect and collaboration.
We value humility and believe in direct communication. Our team is inclusive, and our differing perspectives allow for better solutions. We are seeking individuals who are passionate about tackling challenges and driven by execution. Ready to come find your playground? Together, we can help shape the endless possibilities of AI.
Location:
Hybrid, working onsite at our Santa Clara, CA headquarters 3-5 days per week.
The Role: AI Infrastructure Solutions Architect, Principal
We are seeking a Solution Architect to develop and deliver comprehensive reference solutions that enable scalable, observable, and manageable deployment of AI inference workloads with d-Matrix hardware and software. You will define and implement full-stack reference solutions including provisioning, telemetry, alerting, and performance monitoring for Gen AI inference clusters. You will also collaborate closely with customers and ecosystem partners (OEMs, ISVs) to enable successful integration and deployment of d-Matrix-based solutions.
What You Will Do:
Develop end-to-end AI infrastructure reference solutions optimized for d-Matrix servers including compute, networking, storage, and orchestration layers, in collaboration with various internal teams.
Create reference blueprints that integrate smoothly into cloud-native and on-prem environments.
Develop infrastructure-as-code templates and examples using Ansible, Terraform, and Helm for provisioning d-Matrix-based nodes and clusters.
Integrate with Kubernetes-based systems to enable model deployment, auto-scaling, and fault-tolerant execution.
Design and deploy telemetry and monitoring frameworks to support real-time visibility into d-Matrix cluster health, job status, and system performance.
Integrate with industry-standard observability stacks (e.g., Prometheus, Grafana, OpenTelemetry) for data collection, visualization, and alerting.
Develop dashboards, health check systems, and metric pipelines that track performance, availability, and operational KPIs.
Collaborate with performance and software teams to validate infrastructure using real-world workloads and benchmarks.
Incorporate telemetry hooks for benchmark reporting and feedback-driven tuning.
Create and publish detailed infrastructure deployment guides, monitoring configuration templates, and operational best practices.
Collaborate with customers and the OEM/ISV ecosystem, enabling them to adopt and customize reference solutions for their specific datacenter environments and/or software stacks.
What You Will Bring:
Bachelor's or Master's degree in Computer Science or a related technical field.
10+ years of experience in infrastructure solution architecture, systems management, DevOps, or platform engineering roles.
Experience working with GPUs, custom AI accelerators, or heterogeneous compute environments.
Proven expertise in building, managing, and monitoring full-stack AI infrastructure at scale.
Strong scripting/automation skills: Python, Bash, Ansible, Terraform, Helm, Docker/Kubernetes.
Deep understanding of orchestration technologies (e.g., Kubernetes, Ray, KServe), containerization, server clusters, and multi-tenant serving.
Experience with observability stacks (e.g., Prometheus, Grafana, OpenTelemetry).
Familiarity with model serving and orchestration platforms (e.g., Triton Inference Server, Ray Serve, Kubeflow).
Strong system debugging and incident response skills.
Outstanding collaboration and communication skills.
Equal Opportunity Employment Policy
d-Matrix is proud to be an equal opportunity workplace and affirmative action employer. We’re committed to fostering an inclusive environment where everyone feels welcomed and empowered to do their best work. We hire the best talent for our teams, regardless of race, religion, color, age, disability, sex, gender identity, sexual orientation, ancestry, genetic information, marital status, national origin, political affiliation, or veteran status. Our focus is on hiring teammates with humble expertise, kindness, dedication and a willingness to embrace challenges and learn together every day.
d-Matrix does not accept resumes or candidate submissions from external agencies. We appreciate the interest and effort of recruitment firms, but we kindly request that individuals interested in opportunities with d-Matrix apply directly through our official channels. This approach allows us to streamline our hiring processes and maintain a consistent and fair evaluation of all applicants. Thank you for your understanding and cooperation.