🎯 Spot Instance Strategy Guide
How to save 30-90% on GPU costs with preemptible instances
⚠️ Critical: Spot Instances Can Terminate Anytime
Spot instances are surplus capacity that the provider can reclaim with as little as a few seconds' to a few minutes' notice (varies by provider; see the table below). Only use them for fault-tolerant workloads with checkpointing enabled.
📊 Spot Pricing Comparison (January 2024)
| Provider | GPU Model | On-Demand | Spot Price | Savings | Termination Notice | Typical Availability |
|---|---|---|---|---|---|---|
| AWS EC2 | A100 40GB | $3.22/hr | $0.97/hr | 70% | 2 minutes | 65% |
| Google Cloud | A100 40GB | $2.77/hr | $0.83/hr | 70% | 30 seconds | 80% |
| Azure | A100 80GB | $3.67/hr | $1.47/hr | 60% | 30 seconds | 70% |
| Lambda Labs | A100 40GB | $2.40/hr | $1.20/hr | 50% | 5 minutes | 85% |
| Vast.ai | A100 40GB | $1.79/hr | $0.51/hr | 71% | No guarantee | Variable |
| RunPod | A100 80GB | $1.89/hr | $0.76/hr | 60% | 10 seconds | 75% |
🚀 The 5-Step Spot Instance Strategy
Step 1: Enable Aggressive Checkpointing
Save model weights every 10-30 minutes. This is non-negotiable for spot instances.
# PyTorch Lightning automatic checkpointing
from pytorch_lightning.callbacks import ModelCheckpoint

checkpoint_callback = ModelCheckpoint(
    dirpath='./checkpoints',
    filename='model-{epoch:02d}-{val_loss:.2f}',
    save_top_k=3,
    monitor='val_loss',
    every_n_epochs=1,
    save_on_train_epoch_end=True,  # Critical for spot
    auto_insert_metric_name=False
)
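To hit the 10-30 minute cadence directly rather than once per epoch, newer PyTorch Lightning releases let ModelCheckpoint save on a wall-clock interval. A minimal sketch, assuming your installed version supports the train_time_interval argument (the 15-minute interval is illustrative):

# Wall-clock checkpointing: save every 15 minutes regardless of epoch length.
# Assumes a PyTorch Lightning version with the `train_time_interval` argument.
from datetime import timedelta
from pytorch_lightning.callbacks import ModelCheckpoint

time_checkpoint = ModelCheckpoint(
    dirpath='./checkpoints',
    filename='model-{step}',
    train_time_interval=timedelta(minutes=15),  # illustrative interval
    save_top_k=-1  # keep every time-based snapshot
)

Both callbacks can usually be passed to the Trainer together, so you keep the best-by-val-loss checkpoints and a recent wall-clock snapshot to resume from.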
Step 2: Implement Termination Handlers
Detect termination signals and save state immediately.
# AWS Spot Instance termination handler
import signal
import requests
import torch

def check_spot_termination():
    """Check the EC2 instance metadata service for a spot termination notice."""
    try:
        r = requests.get(
            'http://169.254.169.254/latest/meta-data/spot/instance-action',
            timeout=1
        )
        if r.status_code == 200:  # 200 means a termination is scheduled
            return True
    except requests.RequestException:
        pass  # metadata endpoint unreachable or no notice yet
    return False

def emergency_checkpoint(model, optimizer, epoch, path):
    """Emergency save on termination"""
    torch.save({
        'epoch': epoch,
        'model_state': model.state_dict(),
        'optimizer_state': optimizer.state_dict(),
    }, f"{path}/emergency_checkpoint.pt")
    print("🚨 Emergency checkpoint saved!")

# Register signal handler (pass your own model, optimizer, epoch, and path)
signal.signal(signal.SIGTERM, lambda s, f: emergency_checkpoint(...))
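check_spot_termination above is defined but never called; in practice you poll it between training steps so you can save before the notice window closes. A sketch of how it might slot into a loop, where train_loader, train_step, model, optimizer, and epoch are placeholders for your own training objects:

# Poll for a termination notice every N steps and checkpoint before shutdown.
# `train_loader`, `train_step`, `model`, `optimizer`, and `epoch` are
# placeholders for your own training objects.
CHECK_EVERY_N_STEPS = 50

for step, batch in enumerate(train_loader):
    train_step(model, optimizer, batch)  # your usual forward/backward/update
    if step % CHECK_EVERY_N_STEPS == 0 and check_spot_termination():
        emergency_checkpoint(model, optimizer, epoch, "./checkpoints")
        break  # exit cleanly; the orchestrator restarts and resumes the job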
Step 3: Use Bid Strategies
Set a maximum price you are willing to pay and spread requests across multiple regions or availability zones for better availability.
# Terraform spot requests spread across availability zones
data "aws_availability_zones" "available" {
  state = "available"
}

resource "aws_spot_instance_request" "gpu_spot" {
  count         = 3                 # one request per AZ for diversification
  ami           = "ami-gpu-ubuntu"  # placeholder AMI ID
  instance_type = "p3.2xlarge"      # V100
  spot_price    = "1.00"            # Max bid price (USD/hr)

  # Spread across availability zones
  availability_zone = element(
    data.aws_availability_zones.available.names,
    count.index
  )

  tags = {
    Name = "GPU-Spot-${count.index}"
  }
}
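Before settling on a bid, it helps to see what spot capacity has actually been trading at in each availability zone. A sketch using boto3's describe_spot_price_history; the region and instance type simply mirror the Terraform example above:

# Query recent spot prices per availability zone to decide where to bid.
# Region and instance type are illustrative; requires boto3 and AWS credentials.
from datetime import datetime, timezone
import boto3

ec2 = boto3.client("ec2", region_name="us-east-2")
resp = ec2.describe_spot_price_history(
    InstanceTypes=["p3.2xlarge"],
    ProductDescriptions=["Linux/UNIX"],
    StartTime=datetime.now(timezone.utc),
    MaxResults=20,
)

# Keep one recent price per AZ, then sort cheapest-first
latest = {}
for entry in resp["SpotPriceHistory"]:
    latest.setdefault(entry["AvailabilityZone"], float(entry["SpotPrice"]))
print(sorted(latest.items(), key=lambda kv: kv[1]))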
Step 4: Implement Auto-Recovery
Automatically resume training when instances are terminated.
# Kubernetes Job with spot node selector
apiVersion: batch/v1
kind: Job
metadata:
  name: training-job
spec:
  backoffLimit: 100  # Retry on termination
  template:
    spec:
      restartPolicy: OnFailure
      nodeSelector:
        node.kubernetes.io/lifecycle: spot
      tolerations:
      - key: "spot"
        operator: "Equal"
        value: "true"
        effect: NoSchedule
      containers:
      - name: trainer
        image: my-training:latest
        command: ["python", "train.py", "--resume-from-checkpoint"]
Step 5: Mix Spot with On-Demand
Use a hybrid strategy: run critical components (the head or coordinator node) on-demand and workers on spot.
# Ray cluster with mixed instance types
# (illustrative layout, not the literal Ray autoscaler YAML schema)
cluster_config = {
    "head_node": {
        "instance_type": "g4dn.xlarge",  # On-demand
        "spot": False
    },
    "worker_nodes": {
        "instance_type": "g4dn.12xlarge",  # Spot
        "spot": True,
        "min_workers": 2,
        "max_workers": 10,
        "spot_price": 3.00
    }
}
⚡ Best Practices by Workload Type
✅ PERFECT for Spot Instances:
- Hyperparameter tuning (parallel trials)
- Batch inference on large datasets
- Data preprocessing and augmentation
- Distributed training with fault tolerance
- Monte Carlo simulations
- Grid search / Random search
⚠️ AVOID OR USE WITH CAUTION:
- ❌ Single long-running training (>24h)
- ❌ Time-critical production inference
- ❌ Stateful applications without checkpointing
- ❌ Real-time serving endpoints
- ❌ Training without resumption capability
🎯 Provider-Specific Strategies
AWS EC2 Spot
Strategy: Use Spot Fleet with diversified instance types
- Best regions: us-east-2, us-west-2 (highest capacity)
- EC2 Spot Blocks (fixed 1-6 hour durations) have been retired, so design for interruption instead
- Monitor Spot Advisor for interruption rates
- Typical savings: 70-90%
Google Cloud Preemptible
Strategy: Combine with sustained use discounts
- Maximum runtime: 24 hours (automatic termination)
- Best for batch processing workloads
- Use with GKE for automatic rescheduling
- Typical savings: 60-80%
Azure Spot VMs
Strategy: Use eviction policies and max price caps
- Set eviction policy: Deallocate or Delete
- Use Azure Batch for managed spot pools
- Best regions: Central US, North Europe
- Typical savings: 60-72%
📈 Real-World Savings Examples
✅ Case Study: 70B LLM Fine-tuning
Setup: 8x A100 80GB spot instances on AWS
Strategy: Checkpoint every 500 steps, auto-resume on termination
Results:
- Total training time: 168 hours (7 days)
- Interruptions: 12 times
- On-demand cost: $4,200
- Spot cost: $1,260
- Savings: $2,940 (70%)
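A useful sanity check on the checkpoint interval in this setup is to bound how much work an interruption can destroy. The per-step time below is an assumption for illustration, not a figure from the case study:

# Worst-case rework bound for the case study above (step time is assumed).
steps_between_checkpoints = 500
assumed_seconds_per_step = 20   # illustrative; measure your own workload
interruptions = 12

max_lost_hours = interruptions * steps_between_checkpoints * assumed_seconds_per_step / 3600
print(f"Worst-case rework: {max_lost_hours:.0f} hours")  # ~33 of 168 total hours

Even in that worst case the rework is roughly a fifth of the run, so the spot bill would still land far below the $4,200 on-demand price.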
🛠️ Monitoring & Automation Tools
Recommended Tools:
- Spotinfo.io - Real-time spot price tracking
- EC2 Spot Advisor - AWS interruption frequency data
- Kubernetes Cluster Autoscaler - Auto-replace terminated nodes
- Ray Autoscaler - Distributed training with spot support
- SkyPilot - Multi-cloud spot orchestration
🔄 Recovery Script Template
#!/bin/bash
# Auto-recovery script for spot instances

CHECKPOINT_DIR="/mnt/checkpoints"
TRAINING_SCRIPT="train.py"
MAX_RETRIES=50
RETRY_COUNT=0

while [ $RETRY_COUNT -lt $MAX_RETRIES ]; do
    echo "Starting training attempt $((RETRY_COUNT + 1))..."

    # Check for existing checkpoint
    if [ -f "$CHECKPOINT_DIR/latest.pt" ]; then
        echo "Resuming from checkpoint..."
        python $TRAINING_SCRIPT --resume "$CHECKPOINT_DIR/latest.pt"
    else
        echo "Starting fresh training..."
        python $TRAINING_SCRIPT --checkpoint-dir "$CHECKPOINT_DIR"
    fi
    EXIT_CODE=$?

    if [ $EXIT_CODE -eq 0 ]; then
        echo "Training completed successfully!"
        break
    elif [ $EXIT_CODE -eq 143 ]; then  # 128 + 15 = SIGTERM
        echo "Spot instance terminated. Waiting 30s before retry..."
        sleep 30
        RETRY_COUNT=$((RETRY_COUNT + 1))
    else
        echo "Training failed with code $EXIT_CODE"
        exit $EXIT_CODE
    fi
done

if [ $RETRY_COUNT -eq $MAX_RETRIES ]; then
    echo "Maximum retries reached. Exiting."
    exit 1
fi
🎯 Golden Rule
Time Value Equation: If your time is worth more than the savings, use on-demand. If you can tolerate interruptions and have good checkpointing, spot instances are free money.
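To make that equation concrete, here is a back-of-the-envelope comparison using the AWS A100 rates from the pricing table; the job size, setup time, and hourly value of your time are assumed inputs for illustration:

# Back-of-the-envelope: is spot worth the engineering overhead for this job?
# On-demand and spot rates come from the table above; the rest are assumptions.
on_demand_rate = 3.22   # $/GPU-hour (AWS A100 40GB, on-demand)
spot_rate = 0.97        # $/GPU-hour (AWS A100 40GB, spot)
gpu_hours = 100         # assumed size of the job
setup_hours = 2         # assumed time to wire up checkpointing and retries
hourly_value = 150      # assumed value of your time, $/hour

savings = (on_demand_rate - spot_rate) * gpu_hours  # $225
overhead = setup_hours * hourly_value               # $300
print("Use spot" if savings > overhead else "Use on-demand")

For a small one-off job the overhead can eat the savings; for anything measured in thousands of GPU-hours, spot wins decisively.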