Working with Checkpoints
What is a Checkpoint?
A Checkpoint is a saved snapshot of your job's state at a specific point in time. In machine learning workflows, this is typically the model weights at a particular iteration or epoch.
Creating checkpoints allows you to recover in the event of a failure or crash.
This is especially helpful when using spot instances, which are more likely to be terminated during a run.
Creating Checkpoints
Manual Checkpoint Saving
To save a checkpoint manually, use the lab.save_checkpoint() function from the Transformer Lab Python SDK. It takes the path to your checkpoint file or directory and, optionally, a name for the checkpoint.
from lab import lab
# Initialize lab
lab.init(experiment_id="my_experiment")
# Inside your training loop - save a checkpoint file
checkpoint_file = "/path/to/your/checkpoint.pt"
saved_path = lab.save_checkpoint(checkpoint_file, name="epoch_5_checkpoint.pt")
lab.log(f"Saved checkpoint: {saved_path}")
# Or save an entire checkpoint directory (common with HuggingFace models)
checkpoint_dir = "/path/to/checkpoint-1000"
saved_path = lab.save_checkpoint(checkpoint_dir, name="checkpoint-1000")
The function will copy your checkpoint to the job's checkpoints folder and track it in the job metadata.
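For example, in a plain PyTorch training loop you might write a checkpoint to local disk and then register it with lab.save_checkpoint(). The sketch below is illustrative: the toy model, optimizer, paths, and the save_every interval are placeholders, not part of the SDK.
import torch
from torch import nn, optim
from lab import lab

lab.init(experiment_id="my_experiment")

# Toy model and optimizer standing in for your real training setup
model = nn.Linear(10, 1)
optimizer = optim.SGD(model.parameters(), lr=0.01)

num_epochs = 10
save_every = 5  # illustrative: checkpoint every 5 epochs

for epoch in range(num_epochs):
    # ... your training step(s) for this epoch go here ...

    if (epoch + 1) % save_every == 0:
        # Write the checkpoint to local disk first
        local_path = f"/tmp/checkpoint_epoch_{epoch + 1}.pt"
        torch.save(
            {
                "epoch": epoch,
                "model_state": model.state_dict(),
                "optimizer_state": optimizer.state_dict(),
            },
            local_path,
        )
        # Then copy it into the job's checkpoints folder and track it
        saved_path = lab.save_checkpoint(local_path, name=f"epoch_{epoch + 1}_checkpoint.pt")
        lab.log(f"Saved checkpoint: {saved_path}")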
Automatic Checkpoint Saving with LabCallback
If you're using HuggingFace's Trainer or SFTTrainer, you can enable automatic checkpoint saving with the built-in LabCallback. The callback saves a checkpoint to Transformer Lab whenever the Trainer saves one locally (an SFTTrainer variant is sketched at the end of this section).
from lab import lab
from transformers import Trainer, TrainingArguments
# Initialize lab
lab.init(experiment_id="my_experiment")
# Get the automatic checkpoint callback
callback = lab.get_hf_callback()
# Configure training arguments with checkpoint saving
training_args = TrainingArguments(
output_dir="./checkpoints",
save_steps=500, # Save checkpoint every 500 steps
save_strategy="steps",
save_total_limit=3, # Keep only the last 3 checkpoints
)
# Create trainer with the callback
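# model and dataset are assumed to be defined earlier in your script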
trainer = Trainer(
model=model,
args=training_args,
train_dataset=dataset,
callbacks=[callback], # Add the callback here
)
# Start training - checkpoints will be saved automatically
trainer.train()
The LabCallback automatically:
- Saves checkpoints to Transformer Lab when the Trainer creates them
- Updates training progress in the UI
- Logs training metrics (loss, etc.)
- Tracks epoch completion
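The same callback can be attached to TRL's SFTTrainer, which inherits its saving behaviour and callbacks argument from the HuggingFace Trainer. This is a minimal sketch assuming a recent trl release where SFTConfig carries the training arguments; model and dataset are placeholders, just as in the example above.
from lab import lab
from trl import SFTConfig, SFTTrainer

lab.init(experiment_id="my_experiment")
callback = lab.get_hf_callback()

# SFTConfig subclasses TrainingArguments, so the same save settings apply
sft_args = SFTConfig(
    output_dir="./checkpoints",
    save_strategy="steps",
    save_steps=500,
    save_total_limit=3,
)

trainer = SFTTrainer(
    model=model,            # your causal LM (placeholder)
    args=sft_args,
    train_dataset=dataset,  # your training dataset (placeholder)
    callbacks=[callback],   # checkpoints are mirrored to Transformer Lab
)

trainer.train()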
Managing Checkpoints
Viewing Checkpoints
You can view all saved checkpoints for a job directly in the Jobs panel. The checkpoints list shows each saved snapshot along with its timestamp and associated metadata.
Restarting from a Checkpoint
If you wish to fork a job or retry a specific run from a previous state:
1. Open the Checkpoints list for the job.
2. Find the checkpoint you wish to use.
3. Click Restart from Checkpoint.
This will launch a new job initialized with the data saved in that snapshot.
Handling Failures & Auto-Recovery
Transformer Lab is designed to handle interruptions gracefully: a correctly written training script can resume from the last successful checkpoint after a crash or interruption.
To enable this, your script must check for existing checkpoints on startup and load the latest one if found.
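With the HuggingFace Trainer, this check can be as simple as locating the most recent checkpoint in your output_dir and passing it to trainer.train(). The sketch below reuses the lab and trainer objects from the earlier example; get_last_checkpoint is the standard transformers utility for finding the newest checkpoint-* folder.
import os
from transformers.trainer_utils import get_last_checkpoint

output_dir = "./checkpoints"  # same directory as TrainingArguments.output_dir

# Look for a checkpoint left behind by a previous, interrupted run
last_checkpoint = None
if os.path.isdir(output_dir):
    last_checkpoint = get_last_checkpoint(output_dir)

if last_checkpoint is not None:
    lab.log(f"Resuming from checkpoint: {last_checkpoint}")
    trainer.train(resume_from_checkpoint=last_checkpoint)
else:
    lab.log("No existing checkpoint found, starting training from scratch")
    trainer.train()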
Sample Code: View a robust implementation of auto-recovery logic in our GitHub Repository here