Location: San Francisco (Onsite)
Type: Full-time
Start Date: ASAP
Responsibilities:
- Design and build infrastructure for deploying, scaling, and managing AI/ML workloads
- Develop automation for GPU cluster provisioning, configuration, and orchestration
- Build systems for hardware-aware model deployment and inference optimization
- Create tooling for AI infrastructure observability, debugging, and performance tuning
- Work on integration between hardware intelligence and ML frameworks
- Collaborate with customers deploying large-scale AI systems in production
- Optimize resource utilization across heterogeneous compute (GPUs, TPUs, custom accelerators)
Strong experience with:
- GPU cluster management and orchestration (SLURM, Kubernetes, Ray)
- ML infrastructure and frameworks (PyTorch, TensorFlow, JAX, NVIDIA stack)
- Distributed training and inference systems
- Containerization and orchestration for ML workloads (Docker, Kubernetes, Kubeflow)
- Linux systems programming and performance optimization
- Python and systems scripting
Familiarity with:
- Hardware architectures for AI (NVIDIA GPUs, AMD GPUs, custom accelerators)
- High-performance networking for distributed ML (NCCL, InfiniBand, RoCE)
- Model serving infrastructure (Triton, vLLM, TensorRT)
- Storage systems for ML workloads (distributed filesystems, object storage)
- Infrastructure as Code and GitOps workflows
We're looking for an AI infrastructure engineer who understands the full stack from silicon to model serving — and can build systems that make AI deployment effortless.
You should have:
- Deep understanding of what it takes to run AI workloads at scale
- Experience with the operational challenges of GPU clusters and ML infrastructure
- Ability to debug performance issues across hardware, networking, and software
- Comfort working across infrastructure, ML frameworks, and developer experience
- Excitement about building the foundational layer for physical AI systems
Requirements:
- Bachelor's or Master's in Computer Science, Computer Engineering, or equivalent experience
- 3+ years of experience in ML infrastructure, MLOps, or AI platform engineering
- Willingness to work startup hours (weekends included), in person at our San Francisco office
- Work authorization in the United States
About Cosmic Labs:
We're building the intelligence layer for hardware — real-time systems that control physical machines with zero tolerance for latency or failure.
What we offer:
- Startup-level equity and highly competitive salary
- Ownership over AI infrastructure that powers next-generation systems
- Problems at the intersection of hardware intelligence and machine learning
- Close collaboration with customers pushing the boundaries of AI deployment
How to apply:
Email: team@cosmiclabs.io
Subject line: AI Infrastructure / [Your Name]
Include in your email:
- Your name
- Why this role and why Cosmic Labs
- What you bring technically
- Earliest available start date
- GitHub or GitLab link
- Confirmation of work authorization in the U.S.
- Confirmation of willingness to work full-time, in-person in San Francisco
Attach: PDF resume