🎯 Spot Instance Strategy Guide
How to save 30-90% on GPU costs with preemptible instances
⚠️ Critical: Spot Instances Can Terminate Anytime
Spot instances are surplus capacity that the provider can reclaim with as little as a few seconds' to a few minutes' notice (varies by provider; see the table below). Only use them for fault-tolerant workloads with checkpointing enabled.
📊 Spot Pricing Comparison (January 2024)
| Provider | GPU Model | On-Demand | Spot Price | Savings | Termination Notice | Typical Availability |
|---|---|---|---|---|---|---|
| AWS EC2 | A100 40GB | $3.22/hr | $0.97/hr | 70% | 2 minutes | 65% |
| Google Cloud | A100 40GB | $2.77/hr | $0.83/hr | 70% | 30 seconds | 80% |
| Azure | A100 80GB | $3.67/hr | $1.47/hr | 60% | 30 seconds | 70% |
| Lambda Labs | A100 40GB | $2.40/hr | $1.20/hr | 50% | 5 minutes | 85% |
| Vast.ai | A100 40GB | $1.79/hr | $0.51/hr | 71% | No guarantee | Variable |
| RunPod | A100 80GB | $1.89/hr | $0.76/hr | 60% | 10 seconds | 75% |
🚀 The 5-Step Spot Instance Strategy
Step 1: Enable Aggressive Checkpointing
Save model weights every 10-30 minutes. This is non-negotiable for spot instances.
# PyTorch Lightning automatic checkpointing
from pytorch_lightning.callbacks import ModelCheckpoint

checkpoint_callback = ModelCheckpoint(
    dirpath='./checkpoints',
    filename='model-{epoch:02d}-{val_loss:.2f}',
    save_top_k=3,
    monitor='val_loss',
    every_n_epochs=1,
    save_on_train_epoch_end=True,  # Critical for spot
    auto_insert_metric_name=False
)
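To hit the 10-30 minute cadence directly rather than once per epoch, newer PyTorch Lightning releases let ModelCheckpoint save on a wall-clock interval. A minimal sketch, assuming your installed version supports the train_time_interval argument (the 15-minute interval is illustrative):

# Wall-clock checkpointing: save every 15 minutes regardless of epoch length.
# Assumes a PyTorch Lightning version with the `train_time_interval` argument.
from datetime import timedelta
from pytorch_lightning.callbacks import ModelCheckpoint

time_checkpoint = ModelCheckpoint(
    dirpath='./checkpoints',
    filename='model-{step}',
    train_time_interval=timedelta(minutes=15),  # illustrative interval
    save_top_k=-1  # keep every time-based snapshot
)

Both callbacks can usually be passed to the Trainer together, so you keep the best-by-val-loss checkpoints and a recent wall-clock snapshot to resume from.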
Step 2: Implement Termination Handlers
Detect termination signals and save state immediately.
# AWS Spot Instance termination handler
import signal
import requests
import torch

def check_spot_termination():
    """Check the EC2 instance metadata service for a spot termination notice."""
    try:
        r = requests.get(
            'http://169.254.169.254/latest/meta-data/spot/instance-action',
            timeout=1
        )
        if r.status_code == 200:  # 200 means a termination is scheduled
            return True
    except requests.RequestException:
        pass  # metadata endpoint unreachable or no notice yet
    return False

def emergency_checkpoint(model, optimizer, epoch, path):
    """Emergency save on termination"""
    torch.save({
        'epoch': epoch,
        'model_state': model.state_dict(),
        'optimizer_state': optimizer.state_dict(),
    }, f"{path}/emergency_checkpoint.pt")
    print("🚨 Emergency checkpoint saved!")

# Register signal handler (pass your own model, optimizer, epoch, and path)
signal.signal(signal.SIGTERM, lambda s, f: emergency_checkpoint(...))
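check_spot_termination above is defined but never called; in practice you poll it between training steps so you can save before the notice window closes. A sketch of how it might slot into a loop, where train_loader, train_step, model, optimizer, and epoch are placeholders for your own training objects:

# Poll for a termination notice every N steps and checkpoint before shutdown.
# `train_loader`, `train_step`, `model`, `optimizer`, and `epoch` are
# placeholders for your own training objects.
CHECK_EVERY_N_STEPS = 50

for step, batch in enumerate(train_loader):
    train_step(model, optimizer, batch)  # your usual forward/backward/update
    if step % CHECK_EVERY_N_STEPS == 0 and check_spot_termination():
        emergency_checkpoint(model, optimizer, epoch, "./checkpoints")
        break  # exit cleanly; the orchestrator restarts and resumes the job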
Step 3: Use Bid Strategies
Set a maximum price you are willing to pay and spread requests across multiple regions or availability zones for better availability.
# Terraform spot requests spread across availability zones
data "aws_availability_zones" "available" {
  state = "available"
}

resource "aws_spot_instance_request" "gpu_spot" {
  count         = 3                 # one request per AZ for diversification
  ami           = "ami-gpu-ubuntu"  # placeholder AMI ID
  instance_type = "p3.2xlarge"      # V100
  spot_price    = "1.00"            # Max bid price (USD/hr)

  # Spread across availability zones
  availability_zone = element(
    data.aws_availability_zones.available.names,
    count.index
  )

  tags = {
    Name = "GPU-Spot-${count.index}"
  }
}
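Before settling on a bid, it helps to see what spot capacity has actually been trading at in each availability zone. A sketch using boto3's describe_spot_price_history; the region and instance type simply mirror the Terraform example above:

# Query recent spot prices per availability zone to decide where to bid.
# Region and instance type are illustrative; requires boto3 and AWS credentials.
from datetime import datetime, timezone
import boto3

ec2 = boto3.client("ec2", region_name="us-east-2")
resp = ec2.describe_spot_price_history(
    InstanceTypes=["p3.2xlarge"],
    ProductDescriptions=["Linux/UNIX"],
    StartTime=datetime.now(timezone.utc),
    MaxResults=20,
)

# Keep one recent price per AZ, then sort cheapest-first
latest = {}
for entry in resp["SpotPriceHistory"]:
    latest.setdefault(entry["AvailabilityZone"], float(entry["SpotPrice"]))
print(sorted(latest.items(), key=lambda kv: kv[1]))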
Step 4: Implement Auto-Recovery
Automatically resume training when instances are terminated.
# Kubernetes Job with spot node selector
apiVersion: batch/v1
kind: Job
metadata:
  name: training-job
spec:
  backoffLimit: 100  # Retry on termination
  template:
    spec:
      restartPolicy: OnFailure
      nodeSelector:
        node.kubernetes.io/lifecycle: spot
      tolerations:
      - key: "spot"
        operator: "Equal"
        value: "true"
        effect: NoSchedule
      containers:
      - name: trainer
        image: my-training:latest
        command: ["python", "train.py", "--resume-from-checkpoint"]
Step 5: Mix Spot with On-Demand
Use a hybrid strategy: run critical components (the head or coordinator node) on-demand and workers on spot.
# Ray cluster with mixed instance types
# (illustrative layout, not the literal Ray autoscaler YAML schema)
cluster_config = {
    "head_node": {
        "instance_type": "g4dn.xlarge",  # On-demand
        "spot": False
    },
    "worker_nodes": {
        "instance_type": "g4dn.12xlarge",  # Spot
        "spot": True,
        "min_workers": 2,
        "max_workers": 10,
        "spot_price": 3.00
    }
}
⚡ Best Practices by Workload Type
✅ PERFECT for Spot Instances:
- Hyperparameter tuning (parallel trials)
- Batch inference on large datasets
- Data preprocessing and augmentation
- Distributed training with fault tolerance
- Monte Carlo simulations
- Grid search / Random search
⚠️ AVOID OR USE WITH CAUTION:
- ❌ Single long-running training (>24h)
- ❌ Time-critical production inference
- ❌ Stateful applications without checkpointing
- ❌ Real-time serving endpoints
- ❌ Training without resumption capability
🎯 Provider-Specific Strategies
AWS EC2 Spot
Strategy: Use Spot Fleet with diversified instance types
- Best regions: us-east-2, us-west-2 (highest capacity)
- EC2 Spot Blocks (fixed 1-6 hour durations) have been retired, so design for interruption instead
- Monitor Spot Advisor for interruption rates
- Typical savings: 70-90%
Google Cloud Preemptible
Strategy: Combine with sustained use discounts
- Maximum runtime: 24 hours (automatic termination)
- Best for batch processing workloads
- Use with GKE for automatic rescheduling
- Typical savings: 60-80%
Azure Spot VMs
Strategy: Use eviction policies and max price caps
- Set eviction policy: Deallocate or Delete
- Use Azure Batch for managed spot pools
- Best regions: Central US, North Europe
- Typical savings: 60-72%
📈 Real-World Savings Examples
✅ Case Study: 70B LLM Fine-tuning
Setup: 8x A100 80GB spot instances on AWS
Strategy: Checkpoint every 500 steps, auto-resume on termination
Results:
- Total training time: 168 hours (7 days)
- Interruptions: 12 times
- On-demand cost: $4,200
- Spot cost: $1,260
- Savings: $2,940 (70%)
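A useful sanity check on the checkpoint interval in this setup is to bound how much work an interruption can destroy. The per-step time below is an assumption for illustration, not a figure from the case study:

# Worst-case rework bound for the case study above (step time is assumed).
steps_between_checkpoints = 500
assumed_seconds_per_step = 20   # illustrative; measure your own workload
interruptions = 12

max_lost_hours = interruptions * steps_between_checkpoints * assumed_seconds_per_step / 3600
print(f"Worst-case rework: {max_lost_hours:.0f} hours")  # ~33 of 168 total hours

Even in that worst case the rework is roughly a fifth of the run, so the spot bill would still land far below the $4,200 on-demand price.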
🛠️ Monitoring & Automation Tools
Recommended Tools:
- Spotinfo.io - Real-time spot price tracking
- EC2 Spot Advisor - AWS interruption frequency data
- Kubernetes Cluster Autoscaler - Auto-replace terminated nodes
- Ray Autoscaler - Distributed training with spot support
- SkyPilot - Multi-cloud spot orchestration
🔄 Recovery Script Template
#!/bin/bash
# Auto-recovery script for spot instances

CHECKPOINT_DIR="/mnt/checkpoints"
TRAINING_SCRIPT="train.py"
MAX_RETRIES=50
RETRY_COUNT=0

while [ $RETRY_COUNT -lt $MAX_RETRIES ]; do
    echo "Starting training attempt $((RETRY_COUNT + 1))..."

    # Check for existing checkpoint
    if [ -f "$CHECKPOINT_DIR/latest.pt" ]; then
        echo "Resuming from checkpoint..."
        python $TRAINING_SCRIPT --resume "$CHECKPOINT_DIR/latest.pt"
    else
        echo "Starting fresh training..."
        python $TRAINING_SCRIPT --checkpoint-dir "$CHECKPOINT_DIR"
    fi
    EXIT_CODE=$?

    if [ $EXIT_CODE -eq 0 ]; then
        echo "Training completed successfully!"
        break
    elif [ $EXIT_CODE -eq 143 ]; then  # 128 + 15 = SIGTERM
        echo "Spot instance terminated. Waiting 30s before retry..."
        sleep 30
        RETRY_COUNT=$((RETRY_COUNT + 1))
    else
        echo "Training failed with code $EXIT_CODE"
        exit $EXIT_CODE
    fi
done

if [ $RETRY_COUNT -eq $MAX_RETRIES ]; then
    echo "Maximum retries reached. Exiting."
    exit 1
fi
🎯 Golden Rule
Time Value Equation: If your time is worth more than the savings, use on-demand. If you can tolerate interruptions and have good checkpointing, spot instances are free money.
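To make that equation concrete, here is a back-of-the-envelope comparison using the AWS A100 rates from the pricing table; the job size, setup time, and hourly value of your time are assumed inputs for illustration:

# Back-of-the-envelope: is spot worth the engineering overhead for this job?
# On-demand and spot rates come from the table above; the rest are assumptions.
on_demand_rate = 3.22   # $/GPU-hour (AWS A100 40GB, on-demand)
spot_rate = 0.97        # $/GPU-hour (AWS A100 40GB, spot)
gpu_hours = 100         # assumed size of the job
setup_hours = 2         # assumed time to wire up checkpointing and retries
hourly_value = 150      # assumed value of your time, $/hour

savings = (on_demand_rate - spot_rate) * gpu_hours  # $225
overhead = setup_hours * hourly_value               # $300
print("Use spot" if savings > overhead else "Use on-demand")

For a small one-off job the overhead can eat the savings; for anything measured in thousands of GPU-hours, spot wins decisively.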