AI Infrastructure Engineer
Architecting the backbone of global intelligence. We are looking for an elite systems engineer to build, scale, and optimize the distributed environments that power Sabalynx’s production AI deployments across 20+ countries.
The Foundation of Scale
At Sabalynx, AI is not a laboratory experiment; it is a mission-critical production utility. As an AI Infrastructure Engineer, you sit at the intersection of DevOps, Data Engineering, and Machine Learning. Your mission is to eliminate the friction between model development and global deployment.
You will architect high-availability inference clusters, manage massive-scale GPU orchestration, and implement zero-trust data pipelines that meet the stringent regulatory requirements of our Fortune 500 clients in healthcare, finance, and defense.
Target Environment Specs
SYSTEM STATUS: OPTIMIZED for multi-region failover and elastic GPU provisioning.
What You Will Master
Your daily work involves solving the “Hard Problems” of AI deployment.
Distributed GPU Orchestration
Design and maintain Kubernetes-based clusters (EKS, GKE, AKS) optimized for NVIDIA A100/H100 fleets, utilizing NVIDIA-Docker and MIG (Multi-Instance GPU) for maximum resource efficiency.
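To make the MIG requirement concrete, here is a minimal sketch of a pod manifest that pins a container to a single MIG slice. The resource and label names follow the NVIDIA device plugin's conventions but are assumptions; adjust them to your cluster's MIG strategy.

```python
# Sketch: build a Kubernetes pod manifest requesting one MIG slice of an
# A100. The resource name ("nvidia.com/mig-1g.10gb") and node label are
# assumed from the NVIDIA device plugin's mixed-strategy conventions.

def mig_pod_manifest(name: str, image: str,
                     mig_profile: str = "nvidia.com/mig-1g.10gb") -> dict:
    """Return a pod spec pinning the container to a single MIG slice."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "containers": [{
                "name": "inference",
                "image": image,
                "resources": {"limits": {mig_profile: 1}},
            }],
            # Schedule only onto nodes partitioned with the matching profile.
            "nodeSelector": {"nvidia.com/mig.config": "all-1g.10gb"},
        },
    }

manifest = mig_pod_manifest("triton-shard-0",
                            "nvcr.io/nvidia/tritonserver:24.05-py3")
```

Serialized to YAML, this is the shape a scheduler-aware deployment pipeline would emit per shard.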
MLOps Pipeline Engineering
Architect end-to-end CI/CD/CT (Continuous Training) pipelines using Kubeflow, MLflow, or TFX to automate the transition from Jupyter notebooks to hardened production APIs.
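The "CT" part often comes down to an automated promotion gate at the end of the pipeline. A minimal sketch (metric names and thresholds are illustrative, not a prescribed policy; a real pipeline would pull these values from the experiment tracker):

```python
# Sketch: the promotion gate that closes a continuous-training loop.
# A retrained model ships only if it clears an absolute quality floor
# AND beats the serving model by a meaningful margin.

def should_promote(candidate_auc: float, production_auc: float,
                   min_gain: float = 0.002, floor: float = 0.75) -> bool:
    """Promote the candidate only if it is both good enough in absolute
    terms and a real improvement over the production model."""
    meets_floor = candidate_auc >= floor
    meaningful_gain = (candidate_auc - production_auc) >= min_gain
    return meets_floor and meaningful_gain
```

Rejecting noise-level "improvements" prevents churn: a model that barely matches production never triggers a redeploy.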
Infrastructure as Code (IaC)
Lead the transition to fully declarative infrastructure using Terraform and Crossplane, ensuring multi-cloud parity and rapid disaster recovery capabilities across regions.
Inference Optimization
Collaborate with ML engineers to implement and scale inference servers (Triton, TGI, vLLM) and optimize model serving through quantization, pruning, and caching strategies.
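As one example of the caching strategies mentioned, a sketch of an LRU response cache placed in front of a model server; the `generate` callable is a stand-in for a real Triton/TGI/vLLM client:

```python
import hashlib
from collections import OrderedDict

# Sketch: an LRU cache keyed on a hash of the prompt, short-circuiting
# repeat requests before they reach the GPU-backed inference server.

class InferenceCache:
    def __init__(self, generate, capacity: int = 1024):
        self._generate = generate          # stand-in for a model client
        self._capacity = capacity
        self._store = OrderedDict()

    def __call__(self, prompt: str) -> str:
        key = hashlib.sha256(prompt.encode()).hexdigest()
        if key in self._store:
            self._store.move_to_end(key)   # refresh LRU position on a hit
            return self._store[key]
        result = self._generate(prompt)    # cache miss: call the model
        self._store[key] = result
        if len(self._store) > self._capacity:
            self._store.popitem(last=False)  # evict least-recently-used
        return result
```

The same shape extends naturally to semantic caching, where the key is an embedding rather than an exact hash.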
Observability & Drift Detection
Deploy sophisticated monitoring stacks using Prometheus, Grafana, and specialized tools like Arize or WhyLabs to track system health and model performance decay in real time.
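One concrete drift signal such stacks compute is the Population Stability Index (PSI). A minimal sketch, assuming pre-binned feature proportions and the conventional 0.2 alert threshold:

```python
import math

# Sketch: Population Stability Index between a training baseline and live
# traffic. Inputs are per-bin proportions; by convention, PSI < 0.1 is
# stable, and 0.2+ is a common alerting threshold.

def psi(baseline: list[float], live: list[float], eps: float = 1e-6) -> float:
    score = 0.0
    for b, l in zip(baseline, live):
        b, l = max(b, eps), max(l, eps)   # guard against empty bins
        score += (l - b) * math.log(l / b)
    return score

stable  = psi([0.25, 0.25, 0.25, 0.25], [0.24, 0.26, 0.25, 0.25])
drifted = psi([0.25, 0.25, 0.25, 0.25], [0.55, 0.15, 0.15, 0.15])
```

In production this runs on a schedule against feature histograms, and a breach feeds the retraining trigger rather than paging a human.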
Security & Sovereignty
Implement Zero-Trust architectures and VPC-SC perimeters. Ensure data residency compliance (GDPR, HIPAA) through automated encryption and rigorous IAM auditing.
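IAM auditing of this kind is automatable. A sketch of a policy check that flags wildcard grants, assuming AWS-style JSON policy documents (field names mirror that grammar):

```python
# Sketch: a minimal IAM audit pass flagging over-broad Allow statements.
# Real audits would also inspect Principal, Condition blocks, and
# resource-level permissions; this shows only the core pattern.

def audit_policy(policy: dict) -> list[str]:
    """Return human-readable findings for over-broad statements."""
    findings = []
    for i, stmt in enumerate(policy.get("Statement", [])):
        if stmt.get("Effect") != "Allow":
            continue                       # only grants can be over-broad
        actions = stmt.get("Action", [])
        actions = [actions] if isinstance(actions, str) else actions
        if any(a == "*" or a.endswith(":*") for a in actions):
            findings.append(f"statement {i}: wildcard action {actions}")
        if stmt.get("Resource") == "*":
            findings.append(f"statement {i}: unrestricted resource")
    return findings
```

Wired into CI, a non-empty findings list blocks the Terraform plan that introduced the policy.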
Large-Scale Vector DBs
Manage and scale vector database infrastructure (Pinecone, Milvus, Weaviate) to support RAG-based Generative AI applications for millions of concurrent users.
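Under the hood, the core operation a vector database serves is top-k similarity search. A brute-force sketch for intuition; production systems replace the linear scan with ANN indexes, which is exactly what makes scaling them an infrastructure problem:

```python
import heapq
import math

# Sketch: brute-force cosine top-k over an in-memory corpus, standing in
# for what Pinecone/Milvus/Weaviate accelerate with ANN indexes at scale.

def top_k(query: list[float], corpus: dict[str, list[float]], k: int = 3):
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm
    # (score, doc_id) pairs, highest similarity first
    return heapq.nlargest(k, ((cosine(query, v), d) for d, v in corpus.items()))
```

For RAG at millions of concurrent users, the engineering work is sharding, replication, and index rebuilds around this primitive.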
Edge & Hybrid Deployment
Architect solutions for hybrid-cloud and edge computing scenarios, ensuring low-latency intelligence in disconnected or bandwidth-constrained environments.
Technical DNA
Proven Background
5+ years of experience in SRE, DevOps, or Infrastructure Engineering, with at least 2 years specifically focused on supporting Machine Learning workloads.
Development Proficiency
Expertise in Python, Go, or Rust. You don’t just “use” tools; you write custom providers, operators, and middleware to bridge infrastructure gaps.
Container Mastery
Deep knowledge of Docker internals and Kubernetes orchestration, including custom resource definitions (CRDs) and service mesh implementations (Istio/Linkerd).
Cloud Sovereignty
Professional certification and hands-on experience across AWS, GCP, and Azure, with specific expertise in their respective AI/ML managed services.
Nice-to-Have Skills
- Experience with low-level CUDA optimization or Triton kernels.
- Contributions to open-source projects (Ray, Kubeflow, BentoML).
- Background in financial high-frequency trading or large-scale log analysis.
- Familiarity with quantization techniques (AWQ, GPTQ).
What We Offer
Beyond the competitive salary, we offer the chance to work on the most complex AI deployments on the planet.
Radical Ownership
We hire experts to tell us what to do. You’ll have the autonomy to choose the right tools and define the architecture.
Learning Stimulus
Annual $5k budget for conferences, certifications, and high-end hardware. If it makes you better, we pay for it.
Truly Global
Work from anywhere. We provide co-working stipends and host annual all-hands retreats in world-class locations.
Equity Stake
Everyone at Sabalynx is an owner. We offer meaningful equity packages that align your success with the company’s.
Build the Future of AI Infrastructure
We are reviewing applications on a rolling basis. Join a team where your infrastructure expertise is the primary catalyst for global innovation.
Architecting the Substrate of Intelligence
At Sabalynx, being an AI Infrastructure Engineer transcends standard DevOps or SRE roles. You are the architect of the high-performance computing environments that power global enterprise transformation.
Complex Heterogeneous Computing
We solve the “Cold Start” problem at a global scale. You will be managing multi-region, multi-cloud GPU clusters (A100s, H100s, and specialized ASICs) across 20+ countries. Our infrastructure is designed for 99.99% availability of inference services, necessitating deep expertise in NVIDIA Triton Inference Server, TensorRT, and dynamic quantization strategies.
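As a flavor of the quantization work involved, a sketch of symmetric int8 weight quantization; engines like TensorRT apply the same idea per-tensor or per-channel with calibration data:

```python
# Sketch: symmetric int8 quantization of a weight vector. One scale maps
# floats into [-127, 127]; dequantizing recovers values within one scale
# step. Real engines choose scales per channel using calibration data.

def quantize_int8(weights: list[float]) -> tuple[list[int], float]:
    scale = max(abs(w) for w in weights) / 127.0 or 1.0  # avoid zero scale
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q: list[int], scale: float) -> list[float]:
    return [x * scale for x in q]

q, s = quantize_int8([0.5, -1.27, 0.0])
restored = dequantize(q, s)   # close to the originals, within one scale step
```

The infrastructure question is when to apply it: statically at build time, or dynamically per deployment target.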
Industrial-Grade MLOps
We move beyond simple CI/CD. Our engineers build autonomous retraining pipelines, drift detection systems that trigger at the edge, and feature stores that handle petabyte-scale streaming data. You’ll be working with Kubernetes Operators designed specifically for machine learning workloads, ensuring that our models remain performant and ethical in production.
Infrastructure as Code (IaC) at Scale
Our deployments are deterministic. We utilize Terraform, Pulumi, and Crossplane to treat infrastructure as a living, evolving software product. You will lead the migration from static provisioning to intent-based networking and software-defined storage, ensuring our AI diagnostic and financial fraud detection systems are resilient to regional outages.
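Deterministic multi-region deployment boils down to stamping every region from one source of truth. A sketch of that property in Python (region names and settings are illustrative; in practice Terraform modules or Crossplane compositions play this role):

```python
import copy

# Sketch: one baseline rendered per region, with overrides applied
# explicitly. Parity across regions then becomes a testable invariant
# rather than a hope.

BASELINE = {
    "gpu_node_pool": {"machine": "a2-highgpu-1g", "min": 2, "max": 32},
    "inference": {"replicas": 4, "autoscale_target_qps": 500},
}

def render_region(region: str, overrides=None) -> dict:
    cfg = copy.deepcopy(BASELINE)        # never mutate the source of truth
    cfg["region"] = region
    for key, patch in (overrides or {}).items():
        cfg[key].update(patch)           # overrides are explicit, per-key
    return cfg

regions = {r: render_region(r) for r in ("eu-west", "us-east", "ap-south")}
```

A CI check can then assert that any two regions differ only in their declared overrides, which is what makes regional failover predictable.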
Security & Sovereignty
Working in 20+ countries means navigating complex data residency laws (GDPR, CCPA, and beyond). Our engineers build “Confidential Computing” environments using Intel SGX and AMD SEV to ensure that even at the inference layer, data remains encrypted and sovereign. We are the vanguard of secure AI.
Built for the Top 1%
Our interview process is rigorous, transparent, and designed to evaluate your ability to solve high-stakes engineering challenges under real-world constraints. We don’t do brain teasers; we do architecture.
The Architectural Deep-Dive
A 60-minute session with a Senior Staff Engineer. We skip the pleasantries and dive into your most complex deployment. We want to hear about the failures, the edge cases, and the specific metrics you used to validate your infrastructure choices. Expect questions on service mesh optimization and container orchestration internals.
System Design: Global Scale AI
This is the “Masterclass” stage. You will be asked to design an infrastructure for a hypothetical Sabalynx client — for example, a real-time computer vision network for 5,000 retail locations. You must account for latency, bandwidth costs, edge-vs-cloud trade-offs, and security. We evaluate your ability to balance cost-efficiency with extreme performance.
The Automation Stress Test
A hands-on coding and scripting session. You won’t be inverting a binary tree. Instead, you’ll be writing a Kubernetes controller, optimizing a Dockerfile for GPU acceleration, or debugging a Terraform state conflict in a multi-region environment. We value clean, maintainable code and a “security-first” mindset.
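For candidates wondering what "writing a Kubernetes controller" means in practice, the heart of any controller is a reconcile step that converges actual state toward desired state. A toy sketch, with in-memory dicts standing in for the API server:

```python
# Sketch: a single reconcile pass over replica counts per deployment.
# A real controller watches the API server and applies these actions;
# here we just compute them, which is the part interviews probe.

def reconcile(desired: dict[str, int], actual: dict[str, int]) -> list[str]:
    """Return the actions needed to make `actual` match `desired`."""
    actions = []
    for name, want in desired.items():
        have = actual.get(name, 0)
        if have < want:
            actions.append(f"scale-up {name} {have}->{want}")
        elif have > want:
            actions.append(f"scale-down {name} {have}->{want}")
    for name in actual.keys() - desired.keys():
        actions.append(f"delete {name}")   # garbage-collect orphans
    return actions
```

The key property interviewers look for is idempotence: running reconcile again after the actions apply must yield an empty action list.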
Leadership & Vision Alignment
A final conversation with our CTO or Lead Technical Architect. We discuss the future of AI infrastructure — things like Quantum Computing integration, neuromorphic hardware, and sustainable AI. We want to know how you stay ahead of the curve and how you contribute to a culture of radical transparency and continuous learning.
Ready to Deploy AI Infrastructure?
Bridge the gap between experimental notebooks and production-grade reliability. Whether you are grappling with GPU orchestration, high-latency inference, or fragmented data pipelines, our engineering leads bring the technical rigor required to stabilize and scale your AI ecosystem. Let’s audit your current stack and architect a high-availability solution designed for the next decade of innovation.