In this tutorial, we explore how to apply post-training quantization to an instruction-tuned language model using llmcompressor. We start with an FP16 baseline and then compare multiple compression strategies, including FP8 dynamic quantization, GPTQ W4A16, and SmoothQuant with GPTQ W8A8. Along the way, we benchmark each model variant for disk size, generation latency, throughput, perplexity, and output quality. We also prepare a reusable calibration dataset, save compressed model artifacts, and inspect how each recipe changes practical inference behavior. By the end, we get a practical understanding of how different quantization methods affect model efficiency, deployment readiness, and performance trade-offs.
import subprocess, sys

def pip(*pkgs):
    subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", *pkgs])

pip("llmcompressor", "compressed-tensors", "transformers>=4.45", "accelerate", "datasets")

import os, gc, time, json, math
from pathlib import Path

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset

assert torch.cuda.is_available(), "Enable a GPU: Runtime > Change runtime type > T4 GPU"
print("GPU:", torch.cuda.get_device_name(0), "| CUDA:", torch.version.cuda, "| torch:", torch.__version__)

MODEL_ID = "Qwen/Qwen2.5-0.5B-Instruct"
WORKDIR = Path("quant_lab"); WORKDIR.mkdir(exist_ok=True)
os.chdir(WORKDIR)

def free_mem():
    gc.collect(); torch.cuda.empty_cache()

def dir_size_gb(path):
    total = 0
    for root, _, files in os.walk(path):
        for f in files:
            total += os.path.getsize(os.path.join(root, f))
    return total / 1e9

def time_generation(model, tok, prompt, max_new_tokens=64):
    """Greedy decode; reports latency & tokens/sec after a brief warmup."""
    inputs = tok(prompt, return_tensors="pt").to(model.device)
    _ = model.generate(**inputs, max_new_tokens=4, do_sample=False)
    torch.cuda.synchronize()
    t0 = time.time()
    out = model.generate(**inputs, max_new_tokens=max_new_tokens,
                         do_sample=False, pad_token_id=tok.eos_token_id)
    torch.cuda.synchronize()
    dt = time.time() - t0
    new_ids = out[0][inputs["input_ids"].shape[1]:]
    return tok.decode(new_ids, skip_special_tokens=True), dt, max_new_tokens / dt

@torch.no_grad()
def wikitext_ppl(model, tok, seq_len=512, max_chunks=20, stride=512):
    """Light WikiText-2 perplexity probe (fast, indicative)."""
    ds = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
    text = "\n\n".join(t for t in ds["text"][:400] if t.strip())
    enc = tok(text, return_tensors="pt").input_ids.to(model.device)
    nll_sum, tok_count = 0.0, 0
    for begin in range(0, enc.size(1) - seq_len, stride):
        chunk = enc[:, begin:begin + seq_len]
        out = model(chunk, labels=chunk)
        nll_sum += out.loss.float().item() * seq_len
        tok_count += seq_len
        if tok_count // seq_len >= max_chunks:
            break
    return math.exp(nll_sum / tok_count)

results = {}
PROMPT = ("<|im_start|>user\nIn two sentences, explain why post-training "
          "quantization works for large language models.<|im_end|>\n"
          "<|im_start|>assistant\n")

def benchmark(label, model_path_or_id):
    free_mem()
    print(f"\n──── benchmarking: {label} ────")
    tok = AutoTokenizer.from_pretrained(model_path_or_id)
    m = AutoModelForCausalLM.from_pretrained(
        model_path_or_id, torch_dtype="auto", device_map="cuda").eval()
    sample, dt, tps = time_generation(m, tok, PROMPT)
    ppl = wikitext_ppl(m, tok)
    size = dir_size_gb(model_path_or_id) if os.path.isdir(str(model_path_or_id)) else None
    results[label] = {"size_gb": size, "ppl": round(ppl, 3),
                      "latency_s": round(dt, 3), "tok_per_s": round(tps, 1),
                      "sample": sample.strip().replace("\n", " ")[:180]}
    print(json.dumps(results[label], indent=2))
    del m; free_mem()
We install all required libraries, import the core packages, and verify that a CUDA-enabled GPU is available in Colab. We define the base Qwen2.5 instruction model, create a working directory, and prepare helper functions for memory cleanup, model size calculation, generation timing, and perplexity evaluation. We also create a reusable benchmark function that loads any model variant, tests its generation speed, calculates perplexity, and stores the results for final comparison.
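Under the hood, the perplexity probe relies on a single identity: perplexity is the exponential of the mean per-token negative log-likelihood, which wikitext_ppl approximates by accumulating each chunk's mean loss weighted by its token count. A minimal sketch with toy tensors (illustrative only, not the tutorial's model) makes the relationship concrete:

import math
import torch
import torch.nn.functional as F

# Toy demonstration (hypothetical tensors): perplexity = exp(mean NLL per token).
torch.manual_seed(0)
vocab, tokens = 100, 8
logits = torch.randn(tokens, vocab)           # fake next-token logits
targets = torch.randint(0, vocab, (tokens,))  # fake "true" next tokens

nll = F.cross_entropy(logits, targets)        # mean negative log-likelihood
print("perplexity:", math.exp(nll.item()))    # roughly vocab size for random logits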
print("n════════════ Baseline (FP16) ════════════") benchmark("00_fp16_baseline", MODEL_ID) from llmcompressor import oneshot from llmcompressor.modifiers.quantization import QuantizationModifier print("n════════════ Recipe 1: FP8_DYNAMIC ════════════") model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto") tok = AutoTokenizer.from_pretrained(MODEL_ID) recipe_fp8 = QuantizationModifier( targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"], ) oneshot(model=model, recipe=recipe_fp8) FP8_DIR = "Qwen2.5-0.5B-FP8-Dynamic" model.save_pretrained(FP8_DIR, save_compressed=True) tok.save_pretrained(FP8_DIR) del model; free_mem() benchmark("01_fp8_dynamic", FP8_DIR)
We first benchmark the original FP16 model to establish a reliable baseline for subsequent comparisons. We then apply FP8 dynamic quantization using llmcompressor, where linear layers are compressed while the language modeling head remains in higher precision. We save the compressed FP8 model and run the same benchmark again to compare its size, latency, throughput, and perplexity against the baseline.
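Assuming the save step above succeeded, a quick sanity check is to inspect the quantization metadata that llmcompressor writes into the checkpoint's config.json. The sketch below reads it defensively, since the exact key names can vary across llmcompressor and compressed-tensors versions:

import json
from pathlib import Path

# Read the quantization metadata saved with the FP8 checkpoint; key names
# ("quantization_config", "format", "config_groups") may differ between
# llmcompressor / compressed-tensors versions, so access them defensively.
cfg = json.loads((Path(FP8_DIR) / "config.json").read_text())
qcfg = cfg.get("quantization_config", {})
print("format:", qcfg.get("format"))
print("config groups:", list(qcfg.get("config_groups", {}).keys()))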
NUM_CALIB_SAMPLES = 256
MAX_SEQ_LEN = 1024
tok = AutoTokenizer.from_pretrained(MODEL_ID)

raw = load_dataset("HuggingFaceH4/ultrachat_200k", split=f"train_sft[:{NUM_CALIB_SAMPLES}]")

def to_text(ex):
    return {"text": tok.apply_chat_template(ex["messages"], tokenize=False)}

def tokenize(ex):
    return tok(ex["text"], padding=False, truncation=True,
               max_length=MAX_SEQ_LEN, add_special_tokens=False)

calib_ds = (raw.shuffle(seed=42)
               .map(to_text)
               .map(tokenize, remove_columns=raw.column_names))

print("Calibration set:", len(calib_ds), "samples, max_seq_len =", MAX_SEQ_LEN)
We build a small calibration dataset using UltraChat samples so that the calibrated quantization recipes can observe realistic instruction-style inputs. We convert each chat example into model-compatible text through the tokenizer’s chat template. We then tokenize the samples with a fixed maximum sequence length, creating a reusable dataset for GPTQ and SmoothQuant-based compression.
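Before launching calibration, it is worth spot-checking that the chat template survived tokenization. A short sketch, assuming calib_ds and tok from the cell above are still in scope:

# Decode the first calibration example back to text to confirm it still
# carries the chat-template markers the instruct model expects.
sample = calib_ds[0]
print("token count:", len(sample["input_ids"]))
print(tok.decode(sample["input_ids"])[:300])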
from llmcompressor.modifiers.quantization import GPTQModifier

print("\n════════════ Recipe 2: GPTQ W4A16 ════════════")
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")

recipe_w4a16 = GPTQModifier(
    targets="Linear",
    scheme="W4A16",
    ignore=["lm_head"],
    dampening_frac=0.01,
)
oneshot(
    model=model,
    dataset=calib_ds,
    recipe=recipe_w4a16,
    max_seq_length=MAX_SEQ_LEN,
    num_calibration_samples=NUM_CALIB_SAMPLES,
)

W4A16_DIR = "Qwen2.5-0.5B-W4A16-G128"
model.save_pretrained(W4A16_DIR, save_compressed=True)
tok.save_pretrained(W4A16_DIR)
del model; free_mem()

benchmark("02_gptq_w4a16", W4A16_DIR)
We apply GPTQ W4A16 quantization to compress the model’s linear weights into 4-bit precision while keeping activations in higher precision. We use the calibration dataset to enable GPTQ to reduce reconstruction error and preserve model quality during compression. We save the W4A16 compressed model and benchmark it to study how aggressive 4-bit weight compression affects speed, size, and perplexity.
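The directory name includes G128 because W4A16 schemes typically store one higher-precision scale per group of weights. Under the assumption of a group size of 128 with one FP16 scale per group (zero-points, if stored, would add slightly more), a back-of-the-envelope estimate of the effective bits per weight looks like this:

# Rough storage estimate for grouped 4-bit weights (assumptions: group size
# 128, one FP16 scale shared per group; zero-points would add a bit more).
group_size = 128
weight_bits = 4
scale_bits = 16
bits_per_weight = weight_bits + scale_bits / group_size
print(f"effective bits/weight ≈ {bits_per_weight:.3f}")        # ≈ 4.125
print(f"weights ≈ {16 / bits_per_weight:.1f}x smaller than FP16")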
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier

print("\n════════════ Recipe 3: SmoothQuant + GPTQ W8A8 ════════════")
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")

recipe_w8a8 = [
    SmoothQuantModifier(smoothing_strength=0.8),
    GPTQModifier(targets="Linear", scheme="W8A8", ignore=["lm_head"]),
]
oneshot(
    model=model,
    dataset=calib_ds,
    recipe=recipe_w8a8,
    max_seq_length=MAX_SEQ_LEN,
    num_calibration_samples=NUM_CALIB_SAMPLES,
)

W8A8_DIR = "Qwen2.5-0.5B-W8A8-SmoothQuant"
model.save_pretrained(W8A8_DIR, save_compressed=True)
tok.save_pretrained(W8A8_DIR)
del model; free_mem()

benchmark("03_smoothquant_w8a8", W8A8_DIR)

print("\n══════════════════════ FINAL SUMMARY ══════════════════════")
print(f"{'Variant':<26}{'Size GB':>9}{'PPL':>10}{'tok/s':>9}{'Latency':>11}")
print("-" * 65)
for k, v in results.items():
    size = f"{v['size_gb']:.3f}" if v['size_gb'] else " (hub) "
    print(f"{k:<26}{size:>9}{v['ppl']:>10.2f}{v['tok_per_s']:>9.1f}"
          f"{v['latency_s']:>10.2f}s")

print("\nSample completions (greedy, 64 new tokens):")
for k, v in results.items():
    print(f"\n[{k}]\n → {v['sample']}")
We combine SmoothQuant with GPTQ W8A8 to create an advanced quantization pipeline that handles activation outliers before applying 8-bit compression. We save and benchmark this SmoothQuant-based model using the same evaluation setup as the earlier variants. Also, we print a summary table and sample completions to compare all quantized models against the FP16 baseline in one place.
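SmoothQuant's core trick is purely algebraic: for Y = XW, it divides each activation channel by a per-channel factor s and multiplies the matching weight row by the same s, leaving the product unchanged while shrinking activation outliers. The toy sketch below illustrates that identity with the same smoothing strength (0.8) used in the recipe; it mimics the idea, not llmcompressor's internals:

import torch

torch.manual_seed(0)
X = torch.randn(4, 8) * torch.tensor([1, 1, 1, 1, 1, 1, 1, 50.0])  # one outlier channel
W = torch.randn(8, 6)
alpha = 0.8  # matches smoothing_strength in the recipe above

# Per-channel factors migrate activation outliers into the weights.
s = X.abs().amax(dim=0) ** alpha / W.abs().amax(dim=1) ** (1 - alpha)

X_smooth = X / s            # activations become easier to quantize
W_smooth = W * s[:, None]   # weights absorb the scales

print("max |X| before/after:", X.abs().max().item(), X_smooth.abs().max().item())
print("output unchanged:", torch.allclose(X @ W, X_smooth @ W_smooth, atol=1e-4))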
In conclusion, we built a complete quantization workflow that compresses and evaluates a small instruction-tuned LLM using modern PTQ techniques. We saw that FP8 dynamic quantization offers a fast, data-free option, while GPTQ-based methods use calibration data to achieve stronger compression with better accuracy recovery. We also compared all variants through consistent benchmarks, which clarifies the trade-offs among size, throughput, latency, and perplexity. By saving each quantized model and testing generation quality, we bring the workflow closer to a real deployment pipeline. This leaves us with a reusable, Colab-ready framework for evaluating LLM compression methods before deploying efficient models in real-world inference systems.

