AI inference needs GPUs. Web serving does not. Running both on the same cluster means mixed node pools and custom scheduling.
Our Approach
- Mixed Node Pools: CPU nodes serve web traffic, GPU nodes serve inference; node affinity and taints keep each workload on the right pool
- Model Preloading: Init containers download model weights to a shared volume before the server starts, eliminating cold-start downloads
- Spot Instances: 60-70% cost savings; pods drain in-flight requests gracefully when an eviction notice arrives
- Request Queuing: A bounded request queue smooths traffic spikes so GPU workers stay busy without being overwhelmed
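A minimal sketch of how the first three pieces might fit together in a single Deployment manifest. The pool label (`pool: gpu`), the spot taint key (shown here as GKE's `cloud.google.com/gke-spot`), the image names, the model URL, and the volume path are all illustrative assumptions, not our production manifests:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: inference
spec:
  replicas: 2
  selector:
    matchLabels: { app: inference }
  template:
    metadata:
      labels: { app: inference }
    spec:
      # Pin inference pods to the GPU pool (label name is hypothetical).
      nodeSelector:
        pool: gpu
      # Tolerate the spot-node taint so pods can land on preemptible nodes
      # (taint key shown is GKE's; other clouds use different keys).
      tolerations:
        - key: cloud.google.com/gke-spot
          operator: Equal
          value: "true"
          effect: NoSchedule
      # Give the server time to drain in-flight requests on spot eviction.
      terminationGracePeriodSeconds: 60
      initContainers:
        # Download model weights to a shared volume before serving starts.
        - name: preload-model
          image: alpine:3.19
          command: ["sh", "-c", "wget -O /models/model.bin https://example.com/model.bin"]
          volumeMounts:
            - { name: model-cache, mountPath: /models }
      containers:
        - name: server
          image: inference-server:latest   # hypothetical image
          resources:
            limits:
              nvidia.com/gpu: 1
          volumeMounts:
            - { name: model-cache, mountPath: /models }
      volumes:
        - name: model-cache
          emptyDir: {}
```

With the weights on an `emptyDir` volume, a container restart on the same node reuses the cached copy instead of re-downloading.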
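The request-queuing idea can be sketched as a bounded queue in front of a fixed worker pool: producers shed load when the queue is full, and workers drain it at whatever rate the GPU sustains. This is a simplified stand-in (names like `RequestQueue` and the `handler` callback are ours, not a real library API):

```python
import queue
import threading

class RequestQueue:
    """Bounded queue that smooths traffic spikes: submissions are rejected
    when the queue is full, while a fixed worker pool drains requests at
    the rate the backend can sustain."""

    def __init__(self, handler, workers=2, maxsize=100):
        self.handler = handler
        self.q = queue.Queue(maxsize=maxsize)
        self.results = {}
        for _ in range(workers):
            threading.Thread(target=self._worker, daemon=True).start()

    def submit(self, request_id, payload, timeout=0.1):
        """Enqueue a request; return False if the queue is full (load shed)."""
        try:
            self.q.put((request_id, payload), timeout=timeout)
            return True
        except queue.Full:
            return False

    def _worker(self):
        while True:
            request_id, payload = self.q.get()
            self.results[request_id] = self.handler(payload)
            self.q.task_done()

    def drain(self):
        """Block until every queued request has been processed."""
        self.q.join()

# Example: a cheap stand-in for a GPU inference call.
rq = RequestQueue(handler=lambda x: x * 2, workers=4)
for i in range(10):
    rq.submit(i, i)
rq.drain()
print(rq.results[3])  # → 6
```

In production the `maxsize` bound is what turns a traffic spike into fast rejections instead of unbounded latency growth.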
Economics
Running a T4 GPU node 24/7 at on-demand pricing costs roughly $2,400/month. With spot instances and workload-aware scheduling, the same capacity runs at about $680/month, a 72% reduction.
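The 72% figure follows directly from the two monthly prices quoted above:

```python
on_demand = 2400   # $/month, T4 node running 24/7 at on-demand pricing
optimized = 680    # $/month, with spot instances + scheduling
savings_pct = (on_demand - optimized) / on_demand * 100
print(f"{savings_pct:.0f}%")  # → 72%
```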