Task YAML Structure
This guide explains how to format YAML files for creating tasks in Transformer Lab. Tasks define jobs that run on compute providers and can include training scripts, evaluation scripts, or any other computational workloads.
Note: For detailed information about defining task parameters with validation and custom UI, see the Task Parameters guide.
Basic Structure
The basic structure of a task YAML file includes the following sections:
name: task-name
resources:
  compute_provider: provider-name-in-your-transformerlab-workspace
  cpus: 2
  memory: 4
minutes_requested: 60
envs:
  KEY: value
setup: "command"
run: "command"
git_repo: "url"
git_repo_directory: "dir"
parameters: {...}
sweeps:
  sweep_config: {...}
  sweep_metric: "metric"
  lower_is_better: true
Required Fields
name
The task name. It is sanitized to produce a safe filename and the name of the cluster on the compute provider.
Type: String
Example:
name: my-training-task
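The exact sanitization rule is not described here. As a rough illustration of what "sanitized" could mean, the sketch below lowercases the name and replaces unsafe characters with hyphens. The function `sanitize_task_name` and its rules are hypothetical and are not Transformer Lab's actual implementation.

```python
import re


def sanitize_task_name(name: str) -> str:
    """Hypothetical sanitizer; NOT Transformer Lab's actual rule."""
    # Lowercase, then replace each run of characters outside [a-z0-9-] with one hyphen.
    cleaned = re.sub(r"[^a-z0-9-]+", "-", name.lower())
    # Collapse repeated hyphens and trim them from both ends.
    cleaned = re.sub(r"-{2,}", "-", cleaned).strip("-")
    # Fall back to a generic name if nothing usable is left.
    return cleaned or "task"


print(sanitize_task_name("My Training Task (v2)"))
```

Choosing a name that is already lowercase and hyphenated, such as my-training-task, avoids any surprise about how the name appears on the provider.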
Resources Configuration
The resources section defines the compute resources required for the task.
resources.compute_provider
The name of the compute provider to use. This should match a configured provider name in your workspace.
Type: String
Example:
resources:
  compute_provider: skypilot-provider
Note: If not specified, the system will use the first available provider as a fallback.
resources.cpus
Number of CPUs to allocate.
Type: Integer or String
Example:
resources:
  cpus: 4
resources.memory
Amount of memory to allocate (in GB).
Type: Integer or String
Example:
resources:
  memory: 16
resources.disk_space
Amount of disk space to allocate (in GB).
Type: Integer or String
Example:
resources:
  disk_space: 100
resources.accelerators
Accelerator specification (e.g., GPU type and count). The format depends on the provider. For supported formats, refer to the SkyPilot accelerator documentation or the SLURM GPU documentation.
Type: String
Example:
resources:
  accelerators: "H100:8"
resources.num_nodes
Number of nodes for distributed training.
Type: Integer
Example:
resources:
  num_nodes: 2
Complete Resources Example:
resources:
  compute_provider: aws-ec2
  cpus: 8
  memory: 32
  disk_space: 200
  accelerators: "1xA100"
  num_nodes: 1
Commands
setup
Command(s) to run before the main task execution. This is typically used for installing dependencies, setting up the environment, or downloading data.
Type: String
Example:
setup: "pip install -r requirements.txt"
Multi-line Setup:
setup: |
  pip install -r requirements.txt
  apt-get update
  apt-get install -y git
  python download_data.py
run
The main command to execute for the task. This is the primary script or command that performs the actual work.
Type: String
Example:
run: "python train.py"
With Arguments:
run: "python train.py --epochs 10 --batch-size 32"
Multi-line Run:
run: |
  python train.py \
    --epochs 10 \
    --batch-size 32 \
    --learning-rate 2e-5
Environment Variables
envs
Environment variables to set for the task execution. These are passed as key-value pairs.
Type: Dictionary (key-value pairs)
Example:
envs:
  CUDA_VISIBLE_DEVICES: "0"
  WANDB_API_KEY: "your-api-key"
  HF_TOKEN: "your-huggingface-token"
Quota Tracking
minutes_requested
Estimated number of minutes the task will run. This is used for quota tracking and resource allocation. When specified, a quota hold is created to reserve the estimated compute time.
Type: Integer
Example:
minutes_requested: 60
Note: This is an optional field but recommended for tasks running on remote compute providers to enable quota tracking and better resource management.
GitHub Integration
git_repo
GitHub repository URL to clone before running the task. The repository will be cloned to the working directory.
Type: String
Example:
git_repo: "https://github.com/username/repo.git"
git_repo_directory
Subdirectory within the GitHub repository to use as the working directory. Useful when the repository contains multiple projects.
Type: String
Example:
git_repo: "https://github.com/username/multi-project-repo.git"
git_repo_directory: "project1"
Note: The cloned folder is available at ~/git_repo_directory if a directory is specified, or at ~/git_repo_name otherwise.
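The path rule in the note can be sketched as a small helper. This is a literal reading of the note rather than the platform's actual logic: `cloned_path` is a hypothetical name, and deriving the repository name by dropping the `.git` suffix from the last URL segment is an assumption.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def cloned_path(git_repo, git_repo_directory=None):
    """Hypothetical helper mirroring the note: ~/<directory> or ~/<repo name>."""
    if git_repo_directory:
        return f"~/{git_repo_directory}"
    # Assumption: the repo name is the last URL path segment without ".git".
    repo_name = PurePosixPath(urlparse(git_repo).path).name
    if repo_name.endswith(".git"):
        repo_name = repo_name[: -len(".git")]
    return f"~/{repo_name}"


print(cloned_path("https://github.com/username/multi-project-repo.git", "project1"))
print(cloned_path("https://github.com/username/repo.git"))
```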
Complete GitHub Example:
git_repo: "https://github.com/transformerlab/examples.git"
git_repo_directory: "training/llm-finetuning"
setup: "pip install -r requirements.txt"
run: "python train.py"
Parameters
parameters
Task parameters (hyperparameters, configuration, etc.) that are accessible via lab.get_config() in your scripts. These are passed to the job and can be used to configure the training or evaluation process.
Detailed documentation on this field is on its own page; see the Task Parameters guide.
Type: Dictionary (any JSON-serializable values)
Example:
parameters:
  model_name: "gpt2"
  learning_rate: 2e-5
  batch_size: 8
  num_epochs: 3
  max_seq_length: 512
  warmup_ratio: 0.03
  weight_decay: 0.01
Nested Parameters:
parameters:
  model:
    name: "gpt2"
    architecture: "GPT2LMHeadModel"
  training:
    learning_rate: 2e-5
    batch_size: 8
    num_epochs: 3
  data:
    dataset_name: "wikitext"
    max_seq_length: 512
Note: Parameters can be accessed in your Python scripts using the Lab SDK:
from lab import lab
lab.init()
config = lab.get_config()
learning_rate = config.get("learning_rate")
model_name = config.get("model_name")
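For nested parameters, and for values that may arrive as strings (the sweep examples in this guide quote their values), a script may want to look up keys defensively and cast them. The sketch below uses a plain dictionary as a stand-in for the result of lab.get_config(). That nested parameters come back as nested dictionaries is an assumption, and `get_nested` is a hypothetical helper, not part of the Lab SDK.

```python
# Stand-in for the dict returned by lab.get_config(); the real call needs the Lab SDK.
config = {
    "training": {"learning_rate": "3e-5", "batch_size": "8"},
    "model": {"name": "gpt2"},
}


def get_nested(cfg, path, default=None):
    """Walk a dotted path such as 'training.learning_rate' through nested dicts."""
    node = cfg
    for key in path.split("."):
        if not isinstance(node, dict) or key not in node:
            return default
        node = node[key]
    return node


# Cast explicitly so the code works whether the value is a string or a number.
learning_rate = float(get_nested(config, "training.learning_rate", 2e-5))
batch_size = int(get_nested(config, "training.batch_size", 8))
model_name = get_nested(config, "model.name", "gpt2")
print(learning_rate, batch_size, model_name)
```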
📖 For comprehensive parameter documentation, including:
- Parameter types (int, float, bool, enum, string, json, model, dataset)
- Schema validation (min, max, multipleOf)
- UI customization (ui_widget options)
- Special model and dataset selectors
- Complete examples
See the Task Parameters guide.
Hyperparameter Sweeps
sweeps
Configuration for hyperparameter sweeps. When sweeps are enabled, the system will generate multiple jobs, one for each combination of parameter values.
sweeps.sweep_config
Dictionary mapping parameter names to lists of values to try. The system will generate jobs for all combinations of these values.
Type: Dictionary (parameter name -> list of values)
Example:
sweeps:
  sweep_config:
    learning_rate: ["1e-5", "3e-5", "5e-5"]
    batch_size: ["4", "8", "16"]
    lora_rank: ["8", "16", "32"]
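To estimate how many jobs a sweep will create, the "all combinations" rule can be reproduced as a Cartesian product. The sketch below illustrates that rule only and is not the platform's own code; `expand_sweep` is a hypothetical name.

```python
from itertools import product

sweep_config = {
    "learning_rate": ["1e-5", "3e-5", "5e-5"],
    "batch_size": ["4", "8", "16"],
    "lora_rank": ["8", "16", "32"],
}


def expand_sweep(config):
    """Return one dict of parameter values per combination (Cartesian product)."""
    names = list(config)
    value_lists = [config[name] for name in names]
    return [dict(zip(names, values)) for values in product(*value_lists)]


jobs = expand_sweep(sweep_config)
print(len(jobs))  # 3 * 3 * 3 = 27 combinations
print(jobs[0])
```

The job count is the product of the list lengths, so it grows quickly with each added parameter. Keeping each list short helps control compute cost.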
sweeps.sweep_metric
The metric to optimize during the sweep. This should match a metric name that your script logs (e.g., via wandb or in evaluation results).
Type: String
Example:
sweeps:
  sweep_config:
    learning_rate: ["1e-5", "3e-5", "5e-5"]
  sweep_metric: "eval/loss"
Common Metrics:
- "eval/loss" - Evaluation loss
- "train/loss" - Training loss
- "eval/accuracy" - Evaluation accuracy
- "eval/f1_score" - F1 score
- "eval/bleu" - BLEU score
sweeps.lower_is_better
Whether lower values of the sweep metric are better (true) or higher values are better (false).
Type: Boolean
Example:
sweeps:
  sweep_config:
    learning_rate: ["1e-5", "3e-5", "5e-5"]
  sweep_metric: "eval/loss"
  lower_is_better: true # Lower loss is better
or
sweeps:
  sweep_config:
    learning_rate: ["1e-5", "3e-5", "5e-5"]
  sweep_metric: "eval/accuracy"
  lower_is_better: false # Higher accuracy is better
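The effect of lower_is_better can be pictured as choosing between a minimum and a maximum over the recorded metric values. The sketch below is only an illustration: the result structure (a list of parameter/metric pairs) is hypothetical, and the metric values are made up.

```python
def best_run(results, lower_is_better):
    """Pick the (params, metric) pair with the best metric value."""
    pick = min if lower_is_better else max
    return pick(results, key=lambda item: item[1])


# Made-up metric values, for illustration only.
loss_results = [
    ({"learning_rate": "1e-5"}, 2.41),
    ({"learning_rate": "3e-5"}, 2.17),
    ({"learning_rate": "5e-5"}, 2.30),
]

print(best_run(loss_results, lower_is_better=True))
```

Setting the flag the wrong way round for a metric such as eval/loss would rank the worst run first, so it is worth checking alongside sweep_metric.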
Complete Sweeps Example:
name: hyperparameter-sweep
resources:
  compute_provider: aws-ec2
  cpus: 4
  memory: 16
  accelerators: "1xV100"
run: "python train.py"
parameters:
  model_name: "gpt2"
  dataset_name: "wikitext"
sweeps:
  sweep_config:
    learning_rate: ["1e-5", "3e-5", "5e-5"]
    batch_size: ["4", "8"]
    lora_rank: ["8", "16"]
  sweep_metric: "eval/loss"
  lower_is_better: true
Complete Examples
Example 1: Simple Training Task
name: simple-training
resources:
  compute_provider: local
  cpus: 4
  memory: 8
minutes_requested: 30
setup: "pip install transformers torch"
run: "python train.py"
parameters:
  model_name: "gpt2"
  learning_rate: 2e-5
  batch_size: 8
  num_epochs: 3
Example 2: Training Task with GitHub Repository
name: finetune-llm
resources:
  compute_provider: skypilot-provider
  cpus: 8
  memory: 32
  accelerators: "H100:1"
minutes_requested: 120
git_repo: "https://github.com/username/llm-training.git"
git_repo_directory: "finetuning"
setup: |
  pip install -r requirements.txt
  pip install wandb
envs:
  WANDB_API_KEY: "your-api-key"
  HF_TOKEN: "your-huggingface-token"
run: "python train.py"
parameters:
  model_name: "meta-llama/Llama-2-7b-hf"
  dataset_name: "wikitext-2"
  learning_rate: 2e-5
  batch_size: 4
  gradient_accumulation_steps: 8
  num_epochs: 3
  max_seq_length: 512
  warmup_ratio: 0.03
  weight_decay: 0.01
Example 3: Evaluation Task
name: evaluate-model
resources:
  compute_provider: local
  cpus: 2
  memory: 4
setup: "pip install transformers datasets"
run: "python evaluate.py"
parameters:
  model_name: "gpt2"
  dataset_name: "wikitext"
  batch_size: 16
  max_samples: 1000
Example 4: Hyperparameter Sweep
name: lora-sweep
resources:
  compute_provider: skypilot-provider
  cpus: 4
  memory: 16
  accelerators: "H100:1"
minutes_requested: 180
git_repo: "https://github.com/username/llm-training.git"
setup: |
  pip install -r requirements.txt
  pip install wandb
envs:
  WANDB_API_KEY: "your-api-key"
run: "python train.py"
parameters:
  model_name: "gpt2"
  dataset_name: "wikitext"
  num_epochs: 3
sweeps:
  sweep_config:
    learning_rate: ["1e-5", "3e-5", "5e-5"]
    batch_size: ["4", "8"]
    lora_rank: ["8", "16", "32"]
    lora_alpha: ["16", "32", "64"]
  sweep_metric: "eval/loss"
  lower_is_better: true
Best Practices
- Use Descriptive Names: Choose clear, descriptive task names that indicate what the task does.
  name: finetune-gpt2-wikitext # Good
  name: task1 # Bad
- Specify Resources Appropriately: Match resources to your workload. Don't request more than you need, but ensure you have enough for the task.
  # For small models
  resources:
    cpus: 4
    memory: 8
  # For large models
  resources:
    cpus: 16
    memory: 64
    accelerators: "1xA100"
- Use Setup for Dependencies: Install dependencies in the setup command rather than in the run command.
  setup: "pip install -r requirements.txt" # Good
  run: "python train.py"
- Store Sensitive Data Securely: Don't hardcode API keys or tokens in YAML files. Use environment variables or secure configuration.
  # Good - use environment variables
  envs:
    WANDB_API_KEY: "${WANDB_API_KEY}"
  # Bad - hardcoded
  envs:
    WANDB_API_KEY: "abc123xyz"
- Use Parameters for Configuration: Store hyperparameters and configuration in the parameters section so they're accessible via lab.get_config().
  parameters:
    learning_rate: 2e-5
    batch_size: 8
- Document Complex Sweeps: When using sweeps, document what you're optimizing and why.
  sweeps:
    # Testing different learning rates and batch sizes
    sweep_config:
      learning_rate: ["1e-5", "3e-5", "5e-5"]
      batch_size: ["4", "8"]
    sweep_metric: "eval/loss"
    lower_is_better: true
- Use GitHub for Code: Store your code in a GitHub repository and reference it with git_repo rather than uploading files manually.
  git_repo: "https://github.com/username/my-project.git"
  git_repo_directory: "training"
- Test Locally First: Test your task configuration locally before running on expensive cloud resources.
  resources:
    compute_provider: local # Test locally first
- Use Multi-line Strings for Long Commands: Use YAML's | or > syntax for multi-line commands.
  setup: |
    pip install -r requirements.txt
    python download_data.py
    python preprocess_data.py
- Validate YAML Syntax: Ensure your YAML is valid before submitting. Use a YAML validator or linter.
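On the script side, the practice of keeping secrets out of the YAML file pairs well with failing early when an expected variable is missing. The sketch below uses only the standard library. `require_env` is a hypothetical helper, and it assumes the variable has already been set for the job (for example through envs).

```python
import os


def require_env(name):
    """Return the value of an environment variable, or raise with a clear message."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


# Demo only: set a throwaway variable so the example runs anywhere.
os.environ["DEMO_TOKEN"] = "placeholder"
print(require_env("DEMO_TOKEN"))
```

Calling such a helper at the top of train.py surfaces a missing key before any expensive setup or training starts.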
Common Issues and Solutions
Issue: YAML Parsing Errors
Problem: Invalid YAML syntax causes parsing errors.
Solution: Validate your YAML syntax. Common issues:
- Missing colons after keys
- Incorrect indentation (use spaces, not tabs)
- Unquoted strings with special characters
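Tab characters in indentation are easy to miss by eye. A quick check like the sketch below can flag them before a task is submitted. It covers only the tab issue and is no substitute for a full YAML validator; `find_tab_indented_lines` is a hypothetical helper.

```python
def find_tab_indented_lines(text):
    """Return 1-based line numbers whose leading whitespace contains a tab."""
    bad = []
    for number, line in enumerate(text.splitlines(), start=1):
        # Leading whitespace is whatever lstrip() would remove.
        indent = line[: len(line) - len(line.lstrip())]
        if "\t" in indent:
            bad.append(number)
    return bad


sample = "resources:\n\tcpus: 4\n  memory: 8\n"
print(find_tab_indented_lines(sample))
```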
Issue: Parameters Not Accessible
Problem: Parameters defined in YAML are not accessible via lab.get_config().
Solution: Ensure parameters are defined at the root level of the file, under the parameters key:
parameters:
  learning_rate: 2e-5 # Correct
Not:
config:
  parameters:
    learning_rate: 2e-5 # Wrong
Issue: Sweeps Not Running
Problem: Sweeps are defined but not generating multiple jobs.
Solution: Ensure the sweeps section includes all required fields:
sweeps:
  sweep_config:
    learning_rate: ["1e-5", "3e-5"]
  sweep_metric: "eval/loss" # Required
  lower_is_better: true # Required
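A pre-submission check for these fields can be sketched as follows. It works on a task that has already been loaded into a Python dictionary and tests only the three keys listed above. `check_sweeps` is a hypothetical helper, not part of Transformer Lab.

```python
REQUIRED_SWEEP_KEYS = ("sweep_config", "sweep_metric", "lower_is_better")


def check_sweeps(task):
    """Return a list of problems found in the task's sweeps section (empty if none)."""
    sweeps = task.get("sweeps")
    if not isinstance(sweeps, dict):
        return ["sweeps section is missing or not a mapping"]
    problems = [f"missing sweeps.{key}" for key in REQUIRED_SWEEP_KEYS if key not in sweeps]
    config = sweeps.get("sweep_config")
    if isinstance(config, dict):
        # Each swept parameter should map to a non-empty list of values.
        for name, values in config.items():
            if not isinstance(values, list) or not values:
                problems.append(f"sweeps.sweep_config.{name} must be a non-empty list")
    elif "sweep_config" in sweeps:
        problems.append("sweeps.sweep_config must be a mapping")
    return problems


task = {"sweeps": {"sweep_config": {"learning_rate": ["1e-5", "3e-5"]}}}
print(check_sweeps(task))
```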
Issue: Provider Not Found
Problem: compute_provider name doesn't match any configured provider.
Solution: Check the exact provider name in your workspace. The system will use the first available provider as a fallback, but it's better to specify the correct name.