Location: San Francisco (Onsite)
Type: Full-time
Start Date: ASAP
Responsibilities:
- Design and build infrastructure for deploying, scaling, and managing AI/ML workloads
- Develop automation for GPU cluster provisioning, configuration, and orchestration
- Build systems for hardware-aware model deployment and inference optimization
- Create tooling for AI infrastructure observability, debugging, and performance tuning
- Work on integration between hardware intelligence and ML frameworks
- Collaborate with customers deploying large-scale AI systems in production
- Optimize resource utilization across heterogeneous compute (GPUs, TPUs, custom accelerators)
Strong experience with:
- GPU cluster management and orchestration (SLURM, Kubernetes, Ray)
- ML infrastructure and frameworks (PyTorch, TensorFlow, JAX, NVIDIA stack)
- Distributed training and inference systems
- Containerization and orchestration for ML workloads (Docker, Kubernetes, Kubeflow)
- Linux systems programming and performance optimization
- Python and systems scripting
Familiarity with:
- Hardware architectures for AI (NVIDIA GPUs, AMD GPUs, custom accelerators)
- High-performance networking for distributed ML (NCCL, InfiniBand, RoCE)
- Model serving infrastructure (Triton, vLLM, TensorRT)
- Storage systems for ML workloads (distributed filesystems, object storage)
- Infrastructure as Code and GitOps workflows
We're looking for an AI infrastructure engineer who understands the full stack from silicon to model serving — and can build systems that make AI deployment effortless.
You should have:
- Deep understanding of what it takes to run AI workloads at scale
- Experience with the operational challenges of GPU clusters and ML infrastructure
- Ability to debug performance issues across hardware, networking, and software
- Comfort working across infrastructure, ML frameworks, and developer experience
- Excitement about building the foundational layer for physical AI systems
Requirements:
- Bachelor's or Master's in Computer Science, Computer Engineering, or equivalent experience
- 3+ years of experience in ML infrastructure, MLOps, or AI platform engineering
- Willingness to work startup hours (weekends included), in person at our San Francisco office
- Work authorization in the United States
About Cosmic Labs:
We're building the intelligence layer for hardware — real-time systems that control physical machines with zero tolerance for latency or failure.
What we offer:
- Startup-level equity and highly competitive salary
- Ownership over AI infrastructure that powers next-generation systems
- Problems at the intersection of hardware intelligence and machine learning
- Close collaboration with customers pushing the boundaries of AI deployment
How to apply:
Email: team@cosmiclabs.io
Subject line: AI Infrastructure / [Your Name]
Include in your email:
- Your name
- Why this role and why Cosmic Labs
- What you bring technically
- Earliest available start date
- GitHub or GitLab link
- Confirmation of work authorization in the U.S.
- Confirmation of willingness to work full-time, in-person in San Francisco
Attach: PDF resume