Training Big Models on Small Machines with Colossal-AI
Sept 13, 2025
Training large AI models like GPT-2 or BERT has often been out of reach for developers with limited hardware: you usually need multiple GPUs with plenty of VRAM, or you hit out-of-memory errors right away. This is where Colossal-AI steps in. It’s an open-source library designed to scale model training efficiently while keeping costs manageable.
Introduction
Training large AI models like GPT-2 or BERT has traditionally required heavy infrastructure: multi-GPU servers or expensive cloud clusters. For most developers and researchers, this has meant that experimenting with large language models was out of reach. Colossal-AI changes this. It’s an open-source library built to scale deep learning efficiently, making it possible to train and fine-tune big models even on smaller hardware.
In this tutorial, we’ll explore how Colossal-AI helps you overcome memory limits with just a few changed lines of code. By the end, you’ll have run a miniature training session with GPT-2 on your own machine.

Why Colossal-AI?
At its core, Colossal-AI brings distributed training techniques like data parallelism, tensor parallelism, and pipeline parallelism into an easy-to-use package. What makes it especially powerful for smaller setups is its Gemini Plugin, which automatically manages GPU and CPU memory. When your GPU memory fills up, Colossal-AI can offload parts of the model to the CPU without breaking training. Combined with mixed precision (using bf16 or fp16), this lets you stretch your hardware much further than standard PyTorch allows.
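To give a feel for those memory knobs, here is a rough sketch of a more aggressive Gemini configuration than the defaults used later in this tutorial. The parameter names (`offload_optim_frac`, `offload_param_frac`, `pin_memory`) are based on recent Colossal-AI releases; check the docs for your installed version before relying on them.

```python
from colossalai.booster import Booster
from colossalai.booster.plugin import GeminiPlugin

# A sketch of a memory-saving setup: keep model parameters on the GPU,
# but push all optimizer states to CPU RAM.
plugin = GeminiPlugin(
    precision="fp16",           # or "bf16" on GPUs that support it
    placement_policy="static",  # fixed CPU/GPU split ("auto" lets Gemini decide at runtime)
    offload_optim_frac=1.0,     # fraction of optimizer states offloaded to CPU
    offload_param_frac=0.0,     # fraction of parameters offloaded to CPU
    pin_memory=True,            # pinned host memory for faster CPU<->GPU transfers
)
booster = Booster(plugin=plugin)
```

Offloading optimizer states is usually the cheapest win, since Adam-style optimizers keep two extra copies of every parameter.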
The beauty is that you don’t need to redesign your code. Colossal-AI integrates seamlessly with PyTorch and Hugging Face, so you can keep your existing training loops and just wrap them with Colossal-AI’s booster.
What We’ll Build
We’ll fine-tune a tiny GPT-2 model (distilgpt2) on a small slice of the WikiText-2 dataset. This won’t give us a state-of-the-art language model, but it will demonstrate how Colossal-AI enables training without running out of memory. You’ll see loss values printed as the model learns, and at the end, it will generate a short sample text.
Step 1: Install Dependencies
Start with a clean Python environment and install the following:
pip install "torch>=2.2" transformers datasets accelerate
pip install colossalai
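Before moving on, it’s worth confirming that everything imports cleanly. This small helper (a convenience sketch, not part of the tutorial script) reports which of the packages from the install step are available:

```python
import importlib.util

def check_packages(packages=("torch", "transformers", "datasets", "colossalai")):
    """Return a dict mapping each package name to whether it can be imported."""
    return {pkg: importlib.util.find_spec(pkg) is not None for pkg in packages}

if __name__ == "__main__":
    for pkg, ok in check_packages().items():
        print(f"{pkg}: {'OK' if ok else 'MISSING'}")
```

If anything shows MISSING, re-run the corresponding pip command before continuing.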
Step 2: Create the Training Script
Save the following code as mini_gpt2_colossalai.py:
import torch
from torch.utils.data import DataLoader
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM
import colossalai
from colossalai.booster import Booster
from colossalai.booster.plugin import GeminiPlugin
from colossalai.cluster import DistCoordinator
from colossalai.nn.optimizer import HybridAdam
# Set up the distributed environment (required even on a single GPU;
# older Colossal-AI releases expect colossalai.launch_from_torch(config={}))
colossalai.launch_from_torch()
coordinator = DistCoordinator()
# Config
MODEL_NAME = "distilgpt2"
SEQ_LEN, BATCH_SIZE, MAX_STEPS = 128, 2, 40
# Data
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=SEQ_LEN, padding="max_length")
dataset = load_dataset("wikitext", "wikitext-2-raw-v1")["train"].select(range(512))
dataset = dataset.map(tokenize, batched=True, remove_columns=["text"])
dataset.set_format(type="torch", columns=["input_ids", "attention_mask"])
loader = DataLoader(dataset, batch_size=BATCH_SIZE, shuffle=True, drop_last=True)
# Model + Optimizer
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
optimizer = HybridAdam(model.parameters(), lr=3e-5)
# Colossal-AI Booster
plugin = GeminiPlugin(precision="bf16" if torch.cuda.is_bf16_supported() else "fp16",
                      placement_policy="auto")
booster = Booster(plugin=plugin)
model, optimizer, _, loader, _ = booster.boost(model, optimizer=optimizer, dataloader=loader)
# Training loop (for causal LM, the inputs double as the labels)
model.train()
for step, batch in enumerate(loader):
    if step >= MAX_STEPS:
        break
    batch = {k: v.cuda() for k, v in batch.items()}
    loss = model(**batch, labels=batch["input_ids"]).loss
    booster.backward(loss, optimizer)
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)
    if coordinator.is_master() and step % 10 == 0:
        print(f"step={step} loss={loss.item():.4f}")
# Quick text generation (unwrap the boosted model to reach .generate)
if coordinator.is_master():
    inputs = tokenizer("In software testing,", return_tensors="pt").to("cuda")
    gen = model.unwrap().generate(**inputs, max_new_tokens=40)
    print(tokenizer.decode(gen[0], skip_special_tokens=True))
Step 3: Run the Script
Launch it with the colossalai runner, which sets up the distributed environment even for a single GPU:
colossalai run --nproc_per_node 1 mini_gpt2_colossalai.py
On a single GPU, you’ll see a few lines of output with step numbers and loss values. After training, the script will generate a short sample continuation of the text “In software testing,”.
If you have more than one GPU, you can scale this up without changing the code:
colossalai run --nproc_per_node 2 mini_gpt2_colossalai.py
Step 4: Observe the Results
The script only trains for 40 steps on 512 samples, so don’t expect fluent text yet. The goal is to confirm that the training loop runs smoothly without running out of memory. You’ll notice that Colossal-AI automatically handles precision scaling and memory offloading, which is what makes this possible on modest hardware.
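One quick way to see the headroom this buys you is to print PyTorch’s peak-allocation counter at the end of a run. The helper below is a small sketch you could append to the training script; it reports 0 when no GPU is visible:

```python
import torch

def peak_gpu_memory_mb() -> float:
    """Peak GPU memory allocated by this process, in MB (0.0 without CUDA)."""
    if not torch.cuda.is_available():
        return 0.0
    return torch.cuda.max_memory_allocated() / (1024 ** 2)

print(f"peak GPU memory: {peak_gpu_memory_mb():.1f} MB")
```

Comparing this number with and without the Gemini plugin makes the offloading visible, not just implied.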
You’ll also see how little the training code differs from plain PyTorch. By adding the Colossal-AI Booster, you’ve unlocked advanced memory management and distributed training features without rewriting your workflow.
Conclusion
Colossal-AI lowers the barrier to experimenting with large models. It allows researchers, students, and developers to train models that would otherwise be impossible on limited hardware. By automating memory management, optimizing precision, and supporting multi-GPU scaling, it makes deep learning more accessible and cost-efficient.
If you’ve ever wanted to fine-tune large models like GPT-2 or BERT but thought your hardware wasn’t enough, Colossal-AI is the library to explore. Start small, test ideas locally, and scale up only when you need to.
