Technology – Thejesh GN

Thank you for attending Back to Basics: Build Your Own LLM from Scratch

Thejesh GN — Thu, 18 Jun 2026 12:34:19 +0000

I sent this email to all workshop attendees. It made sense to publish it as a post as well.

Thank you for attending the “Back to Basics: Build Your Own LLM from Scratch” session. It was great to see so much curiosity and so many questions in the room, and the feedback many of you shared echoed that. I’m really glad you found it useful.

As we discussed during the session, I have put together a detailed blog post that walks through everything we covered, including the slides. You can also find the code from the sessions on GitHub and Codeberg.

Blog with slides and code: https://thejeshgn.com/2026/06/14/back-to-basics-build-your-own-llm-from-scratch/
GitHub: https://github.com/thejeshgn/workshop-back-basics-llm-build-scratch
Codeberg: https://codeberg.org/thejeshgn/workshop-back-basics-llm-build-scratch

I would also like all of you to keep building on what you learned. An easy way to do that is to fork the code above and add more features. Some ideas for improvement:

Swap the simple character tokenizer we used for a word-level tokenizer, or go further and implement a BERT-style WordPiece/subword tokenizer, then compare vocabulary size and how each handles unseen words. Also, what effect does it have on the model?
Train on a larger corpus than the very small sample we used in the workshop. Project Gutenberg’s Top 100 is a great source of popular, public-domain books that are not very large.
You can also use thematic corpora, such as public-domain poetry collections, recipe collections, or the collected works of Gandhi or Ambedkar, and see how the model’s generated text picks up the style and vocabulary of that specific body of work. For some of these, you will also have to write data cleaners. Be mindful when downloading external content: use only public-domain material and make sure you don’t overload the source’s servers. Before scraping, check whether the site already provides bulk downloads via BitTorrent or in other formats.
Experiment with model parameters (in GPTConfig) such as context length, heads, and layers once your pipeline works end to end. See how changes in parameters change the model and its output.
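
As a starting point for the tokenizer idea above, here is a minimal sketch of a word-level tokenizer that keeps the same interface as the workshop's CharTokenizer (from_text, encode, decode, vocab_size). The class name WordTokenizer, the `<unk>` placeholder, and the regex used for splitting are assumptions made for illustration; they are not part of the workshop code. Unlike CharTokenizer, which drops unknown characters, this sketch maps unseen words to `<unk>`, so you can see how often that happens.

```python
import re


class WordTokenizer:
    """Word-level counterpart to CharTokenizer (illustrative sketch only)."""

    UNK = "<unk>"  # hypothetical placeholder token for words not seen in training

    def __init__(self, vocab: list[str]):
        self.vocab = vocab
        self.stoi = {tok: i for i, tok in enumerate(vocab)}
        self.itos = {i: tok for i, tok in enumerate(vocab)}

    @staticmethod
    def split(text: str) -> list[str]:
        # Words, whitespace runs, and single punctuation marks each become one
        # token, so joining the tokens reproduces the original text exactly.
        return re.findall(r"\w+|\s+|[^\w\s]", text)

    @classmethod
    def from_text(cls, text: str) -> "WordTokenizer":
        # Sorted for a deterministic vocab across runs; UNK always gets ID 0.
        return cls([cls.UNK] + sorted(set(cls.split(text))))

    def encode(self, s: str) -> list[int]:
        unk = self.stoi[self.UNK]
        return [self.stoi.get(tok, unk) for tok in self.split(s)]

    def decode(self, ids: list[int]) -> str:
        return "".join(self.itos[i] for i in ids)

    @property
    def vocab_size(self) -> int:
        return len(self.vocab)


corpus = "To be, or not to be, that is the question."
tok = WordTokenizer.from_text(corpus)

print("char vocab size:", len(set(corpus)))
print("word vocab size:", tok.vocab_size)
# "sleep" never occurs in the corpus, so it comes back as the UNK token.
print(tok.decode(tok.encode("To be, or not to sleep")))
```

On a single sentence the two vocabularies are similar in size. On a full book, the word vocabulary typically grows into the thousands, which makes the embedding table and output head much larger, while the character vocabulary stays small. Comparing both on the same corpus is a good first experiment.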

If you build something interesting, feel free to reach out. I’d love to hear what you create. Thanks again for being part of the session.

Back to Basics: Build Your Own LLM from Scratch

Thejesh GN — Sat, 13 Jun 2026 19:06:52 +0000

I ran a workshop titled “Back to Basics: Build Your Own LLM from Scratch” at IITM/Paradox 2026. It covered some basic theory on how a transformer works and then moved on to building a very small LLM. The idea was to demystify an LLM (or transformer) by understanding what goes on inside it and then building one to deepen that understanding. I had to skip some slides because the session was only two hours long. Ideally, I want it to be around four hours, split into two sessions: one for theory and one for lab. Next time, I will plan for four hours so I can go at a slower pace.

Of course, there are other similar workshops available online, and some of them are linked in the references section. This is just my take on it and what I used for my own understanding.

If you want to try it at your own pace, go through the slides below and then read and run the annotated code. The slides and code are also in a repo (Codeberg, GitHub) if you prefer that.

#!/usr/bin/env -S uv run --script
# /// script
# requires-python = ">=3.10"
# dependencies = [
#   "torch==2.8.0",
# ]
# [tool.uv]
# extra-index-url = ["https://download.pytorch.org/whl/cpu"]
# ///
"""
build_and_test.py  (ANNOTATED VERSION FOR WORKSHOP)
A minimal, single-file GPT for "Back to Basics: Build Your Own LLM from Scratch".

═══════════════════════════════════════════════════════════════════════════════
WALKTHROUGH MAP — suggested order
═══════════════════════════════════════════════════════════════════════════════

  ① GPTConfig            ← slide "The whole model in 5 numbers"
  ② CharTokenizer        ← slides "Tokens" / "Tokenizer"
  ③ GPT.__init__/forward ← slide "Recap: what we just built" (the big pipeline)
  ④ CausalSelfAttention  ← slides "Single-head attention" → "Combining heads with Wo"
  ⑤ FeedForward          ← slide "Feed-Forward Network (FFN)"
  ⑥ TransformerBlock     ← slide "One full transformer block"
  ⑦ get_batch            ← (where x/y "next-token" pairs come from)
  ⑧ train()              ← slides "Cross-entropy" → "Training"
  ⑨ GPT.generate()       ← slide "Generation: from logits to text"

Comments marked 💬 are things worth SAYING out loud.
Comments marked ❓ are good questions to ASK the room.
Comments marked ⚠️ are common gotchas / likely audience questions.

Usage:
    # Train on a text file (CPU only, by design)
    uv run build_and_test.py train --data ../data/shakespeare.txt --max-steps 2000

    # Generate from a saved checkpoint
    uv run build_and_test.py generate --checkpoint ../checkpoints/run1/final_checkpoint.pt --prompt "To be, or not " --num-new-tokens 200 --temperature 0.8 --top-k 40 --seed 42
"""

import argparse
import csv
import math
import os
import sys
import time
from dataclasses import dataclass, asdict

import torch
import torch.nn as nn
import torch.nn.functional as F


# ═════════════════════════════════════════════════════════════════════════════
# ① CONFIG                                  [Slide: "The whole model in 5 numbers"]
# ═════════════════════════════════════════════════════════════════════════════
# 💬 "These five numbers ARE the model. Everything else is derived from them."
#    Point back to this class every time a shape like (B, T, C) appears below:
#       B = batch_size, T ≤ block_size, C = n_embd.

@dataclass
class GPTConfig:
    # Architecture (the 5 numbers from the slides)
    vocab_size: int = 65        # how many unique tokens (set from data, after tokenizer)
    n_embd:     int = 256       # C: the vector size each token is represented by
    n_head:     int = 4         # parallel attention heads (d_k = n_embd/n_head = 64)
    block_size: int = 256       # T_max: the longest context the model can ever see
    n_layer:    int = 4         # how many TransformerBlocks we stack

# Training knobs — deliberately NOT in GPTConfig:
# 💬 "These shape the *training run*, not the *model*. A checkpoint doesn't need them."
batch_size = 32   # sequences per training step (the B in (B, T, C))
dropout = 0.1     # ⚠️ on during training, automatically off in eval()
                  #    [Slide: "Training vs. inference" — dropout row]

output_path = "../checkpoints/run1"

# ═════════════════════════════════════════════════════════════════════════════
# ② TOKENIZER                                    [Slides: "Tokens" / "Tokenizer"]
# ═════════════════════════════════════════════════════════════════════════════
# 💬 "A tokenizer has exactly two jobs: encode (text → IDs) and decode (IDs → text).
#    We picked character-level — the simplest of the three choices on the slide.
#    GPT/LLaMA/Claude use subword BPE; same idea, fancier vocab."
#
# ❓ Ask: "If our vocab is the 65 unique characters in Shakespeare, what happens
#    when you prompt with an emoji?" → see encode(): it silently drops unknowns.

class CharTokenizer:
    """Smallest possible tokenizer: one character = one token."""

    def __init__(self, vocab: list[str]):
        self.vocab = vocab
        # stoi = "string to int", itos = "int to string" — two dicts, that's it.
        self.stoi = {ch: i for i, ch in enumerate(vocab)}
        self.itos = {i: ch for i, ch in enumerate(vocab)}

    @classmethod
    def from_text(cls, text: str) -> "CharTokenizer":
        # 💬 "The vocab is just every unique character that occurs in the data."
        # Sorted so the vocab is deterministic across runs
        # ⚠️ Without sorted(), set() ordering varies → token IDs change between runs
        #    → an old checkpoint would decode to garbage. This one line is why we
        #    can reload checkpoints reliably.
        vocab = sorted(set(text))
        return cls(vocab)

    def encode(self, s: str) -> list[int]:
        # text → list of integers.  [Slide: "Tokenizer" — encode example]
        # `if c in self.stoi`: characters not in the training data are dropped.
        return [self.stoi[c] for c in s if c in self.stoi]

    def decode(self, ids: list[int]) -> str:
        # integers → text. Perfect inverse of encode (for known chars).
        return "".join(self.itos[i] for i in ids)

    @property
    def vocab_size(self) -> int:
        # 💬 "This becomes the first of our 5 numbers — vocab_size in GPTConfig."
        return len(self.vocab)


# ═════════════════════════════════════════════════════════════════════════════
# ④ ATTENTION             [Slides: "Why attention?" → "Combining heads with Wo"]
# ═════════════════════════════════════════════════════════════════════════════
# 💬 "This class is the heart of the workshop. The 7 numbered steps in forward()
#    map one-to-one onto the attention slides. Everything else is plumbing."
#
# Teaching tip: walk forward() with a concrete shape, e.g. B=32, T=256, C=256,
# n_head=4, d_k=64 — and write the shapes on the board as you go.

class CausalSelfAttention(nn.Module):
    """Multi-head causal self-attention. One linear projects to Q,K,V together."""

    def __init__(self, cfg: GPTConfig):
        super().__init__()
        # d_k = n_embd / n_head must divide evenly — each head gets a clean slice.
        # [Slide: "Multi-head attention" — d_k = n_embd / n_head]
        assert cfg.n_embd % cfg.n_head == 0, "n_embd must be divisible by n_head"
        self.n_head = cfg.n_head
        self.n_embd = cfg.n_embd
        self.d_k = cfg.n_embd // cfg.n_head

        # 💬 "The slides show three separate matrices Wq, Wk, Wv, each (C × C).
        #    In code we fuse them into ONE (C × 3C) matrix for efficiency —
        #    one matmul instead of three. Same math, same parameter count."
        # [Slide: "Single-head attention: Q, K, V"]
        self.qkv = nn.Linear(cfg.n_embd, 3 * cfg.n_embd)

        # Wo from the slides: lets the heads talk to each other after running
        # independently. Without it, heads would be siloed.
        # [Slide: "Combining heads with Wo"]
        self.proj = nn.Linear(cfg.n_embd, cfg.n_embd)
        self.dropout = nn.Dropout(dropout)

        # 💬 "The causal mask is NOT learned — it's a fixed triangle of 1s.
        #    Row i has 1s up to column i: 'token i may look at tokens 0..i'."
        # [Slide: "Causal mask"] and [Slide: "What gets learned, what stays fixed"]
        # register_buffer = "part of the model, moves with .to(device),
        # saved in state_dict, but NO gradients" — perfect for a constant.
        mask = torch.tril(torch.ones(cfg.block_size, cfg.block_size))
        self.register_buffer("mask", mask.view(1, 1, cfg.block_size, cfg.block_size))

    def forward(self, x):
        B, T, C = x.shape  # batch, seq_len, n_embd — write these on the board

        # ── 1) Project to Q, K, V ──────────────── [Slide: "Q, K, V"]
        # One big matmul gives (B, T, 3C); split() carves it into three (B, T, C).
        # 💬 "Query: what am I looking for? Key: what do I offer? Value: what do
        #    I pass along if matched?"
        q, k, v = self.qkv(x).split(self.n_embd, dim=2)

        # ── 2) Split into heads ────────── [Slide: "Multi-head attention"]
        # (B, T, C) → (B, T, n_head, d_k) → transpose → (B, n_head, T, d_k)
        # 💬 "No new computation here — we're just reshaping so each head can run
        #    the SAME attention math independently on its own d_k-sized slice."
        q = q.view(B, T, self.n_head, self.d_k).transpose(1, 2)
        k = k.view(B, T, self.n_head, self.d_k).transpose(1, 2)
        v = v.view(B, T, self.n_head, self.d_k).transpose(1, 2)

        # ── 3) Scaled dot-product scores ──── [Slide: "Attention scores (scaled)"]
        # (B, nh, T, d_k) @ (B, nh, d_k, T) → (B, nh, T, T)
        # 💬 "A T×T grid per head: how relevant is every token to every other token."
        # ⚠️ The 1/√d_k scaling is the line students forget. Without it, dot
        #    products grow with d_k → softmax saturates → gradients vanish.
        scores = (q @ k.transpose(-2, -1)) * (1.0 / math.sqrt(self.d_k))

        # ── 4) Causal mask ──────────────────────── [Slide: "Causal mask"]
        # Where the triangle has 0 (future positions), drop in -inf.
        # 💬 "-inf BEFORE softmax becomes exactly 0 AFTER softmax — the model
        #    literally cannot peek at the answer."
        # [:T, :T] crops the precomputed block_size mask to the actual seq length.
        scores = scores.masked_fill(self.mask[:, :, :T, :T] == 0, float("-inf"))

        # ── 5) Softmax → attention weights ──── [Slide: "Softmax intuition"]
        # Each row becomes a probability distribution: positive, sums to 1,
        # behaves like "importance".
        attn = F.softmax(scores, dim=-1)
        attn = self.dropout(attn)  # regularization: randomly drop some attention links

        # ── 6) Apply attention to V ────────────── [Slide: "Apply attention"]
        # (B, nh, T, T) @ (B, nh, T, d_k) → (B, nh, T, d_k)
        # 💬 "Each output row is a weighted BLEND of value vectors from the
        #    current and earlier positions. THIS is the heart of the transformer."
        out = attn @ v

        # ── 7) Re-combine heads + Wo ──── [Slide: "Combining heads with Wo"]
        # (B, nh, T, d_k) → (B, T, nh, d_k) → (B, T, C): concat of all heads.
        # ⚠️ .contiguous() is needed because transpose only changes the view,
        #    not memory layout — .view() requires contiguous memory.
        out = out.transpose(1, 2).contiguous().view(B, T, C)
        out = self.proj(out)  # Wo: mixes information across heads
        return out


# ═════════════════════════════════════════════════════════════════════════════
# ⑤ FFN                                  [Slide: "Feed-Forward Network (FFN)"]
# ═════════════════════════════════════════════════════════════════════════════
# 💬 "Attention mixes information ACROSS tokens; the FFN processes each token
#    INDEPENDENTLY — same two layers applied to every position. Expand to 4×
#    (room to think), non-linearity, compress back."

class FeedForward(nn.Module):
    """Two-layer MLP: expand to 4x, GELU, compress back."""

    def __init__(self, cfg: GPTConfig):
        super().__init__()
        d_ff = 4 * cfg.n_embd            # the classic 4× expansion (slide: d_ff = 4d)
        self.fc1 = nn.Linear(cfg.n_embd, d_ff)   # W1: expand  (C → 4C)
        self.fc2 = nn.Linear(d_ff, cfg.n_embd)   # W2: compress (4C → C)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # expand → GELU (GPT's choice over ReLU) → compress → dropout
        # ❓ Ask: "Why is the non-linearity essential?" → without it, fc2(fc1(x))
        #    collapses into a single linear layer; depth would buy nothing.
        return self.dropout(self.fc2(F.gelu(self.fc1(x))))


# ═════════════════════════════════════════════════════════════════════════════
# ⑥ TRANSFORMER BLOCK                  [Slide: "One full transformer block"]
# ═════════════════════════════════════════════════════════════════════════════
# 💬 "This is the repeating unit — the slide diagram in 3 lines of code.
#    PRE-norm: LayerNorm goes BEFORE each sublayer (GPT-2/modern convention),
#    and the residual 'x +' is the highway that lets gradients flow through
#    deep stacks."
# [Slide: "Residual + LayerNorm"]

class TransformerBlock(nn.Module):
    """Pre-norm block: LN -> Attn -> +residual -> LN -> FFN -> +residual"""

    def __init__(self, cfg: GPTConfig):
        super().__init__()
        self.ln1 = nn.LayerNorm(cfg.n_embd)   # γ, β — learned (2 × n_embd params)
        self.attn = CausalSelfAttention(cfg)
        self.ln2 = nn.LayerNorm(cfg.n_embd)   # second LN, own γ, β
        self.ffn = FeedForward(cfg)

    def forward(self, x):
        # 💬 Read these aloud as: "x plus attention-of-normalized-x" —
        #    the residual means each sublayer only learns a CORRECTION to x.
        x = x + self.attn(self.ln1(x))   # sublayer 1: communicate (across tokens)
        x = x + self.ffn(self.ln2(x))    # sublayer 2: compute (per token)
        return x


# ═════════════════════════════════════════════════════════════════════════════
# ③ THE FULL GPT                          [Slide: "Recap: what we just built"]
# ═════════════════════════════════════════════════════════════════════════════
# 💬 Teaching tip: show __init__ + forward() FIRST as the bird's-eye view —
#    it mirrors the recap-slide pipeline line by line — then descend into
#    attention. People hold details better once they've seen the skeleton.

class GPT(nn.Module):
    """The whole model: embeddings + N blocks + final LN + LM head."""

    def __init__(self, cfg: GPTConfig):
        super().__init__()
        self.cfg = cfg

        # Token embedding table: (vocab_size × n_embd), learned lookup.
        # [Slide: "Embeddings" — "initialized randomly, updated during training"]
        self.tok_emb = nn.Embedding(cfg.vocab_size, cfg.n_embd)

        # LEARNED positional encoding (GPT-style), one vector per position.
        # [Slide: "Positional encoding" — the 'Learned' flavor, not sinusoidal]
        self.pos_emb = nn.Embedding(cfg.block_size, cfg.n_embd)

        self.drop = nn.Dropout(dropout)

        # The stack: n_layer identical-shaped blocks, each with its OWN weights.
        # [Slide: "Stacking layers"]
        self.blocks = nn.ModuleList([TransformerBlock(cfg) for _ in range(cfg.n_layer)])

        self.ln_f = nn.LayerNorm(cfg.n_embd)  # final LN before the head

        # LM head: project (B, T, C) back to (B, T, vocab) — a score per token.
        # [Slide: "Output logits"]
        self.head = nn.Linear(cfg.n_embd, cfg.vocab_size, bias=False)

        # 💬 WEIGHT TYING: the output head and the input embedding SHARE one
        #    matrix (used transposed in the matmul). Saves vocab_size × n_embd
        #    params and works well in practice — mentioned on the logits slide.
        # ⚠️ This is why the parameter count printout is ~16K lower than the
        #    worked example on the slides (which counts the head separately).
        self.head.weight = self.tok_emb.weight

        # Initialize all weights small and Gaussian (std 0.02, the GPT-2 recipe).
        # ❓ "Why not zeros?" → all-zero weights = all neurons identical = no
        #    symmetry breaking; nothing distinct to learn.
        self.apply(self._init_weights)

    def _init_weights(self, m):
        if isinstance(m, nn.Linear):
            nn.init.normal_(m.weight, mean=0.0, std=0.02)
            if m.bias is not None:
                nn.init.zeros_(m.bias)
        elif isinstance(m, nn.Embedding):
            nn.init.normal_(m.weight, mean=0.0, std=0.02)

    def num_parameters(self) -> int:
        # 💬 Compare the printout with the slide's worked example (~3.25M).
        # [Slide: "Parameter count: worked example"]
        return sum(p.numel() for p in self.parameters() if p.requires_grad)

    def forward(self, idx, targets=None):
        """
        idx: (B, T) token IDs
        targets: (B, T) next-token IDs (for training); None for inference
        returns: logits (B, T, vocab_size), loss or None

        💬 "ONE forward pass serves both training and inference — same brain,
           different loop. targets=None is the only switch."
        [Slide: "Training vs. inference"]
        """
        B, T = idx.shape
        assert T <= self.cfg.block_size, f"sequence length {T} > block_size {self.cfg.block_size}"

        # ── The recap-slide pipeline, line by line ──
        tok = self.tok_emb(idx)                                       # (B, T, C)  token IDs → vectors
        pos = self.pos_emb(torch.arange(T, device=idx.device))        # (T, C)     position 0..T-1 → vectors
        x = self.drop(tok + pos)                                      # (B, T, C)  X = E + position
        # ⚠️ tok is (B,T,C), pos is (T,C) — broadcasting adds the same position
        #    vectors to every sequence in the batch. Worth pausing on.

        for block in self.blocks:        # n_layer blocks, each refines x
            x = block(x)
        x = self.ln_f(x)                 # final LayerNorm  [Slide: "Stacking layers"]
        logits = self.head(x)                                         # (B, T, vocab)

        loss = None
        if targets is not None:
            # ── TRAINING branch ──        [Slides: "Cross-entropy loss"]
            # 💬 "The model predicts the next token at EVERY position in
            #    parallel — T predictions per sequence, not 1. That's why
            #    transformer training is so efficient."
            # Flatten (B, T, vocab) → (B·T, vocab) and (B, T) → (B·T,)
            # because F.cross_entropy wants (N, classes) and (N,) of true IDs.
            # ⚠️ cross_entropy takes raw LOGITS — it applies softmax + -log(p_t)
            #    internally. Don't softmax twice (a classic live-coding bug).
            loss = F.cross_entropy(
                logits.view(-1, logits.size(-1)),
                targets.view(-1),
            )
        return logits, loss

    # ═════════════════════════════════════════════════════════════════════════
    # ⑨ GENERATION                  [Slide: "Generation: from logits to text"]
    # ═════════════════════════════════════════════════════════════════════════
    @torch.no_grad()   # inference: no gradients, no backprop — weights frozen
    def generate(self, idx, num_new_tokens: int, temperature: float = 1.0,
                 top_k: int | None = None) -> torch.Tensor:
        """Autoregressively generate num_new_tokens tokens.

        💬 "The slide's loop: last logits → softmax → sample → append → repeat.
           One token at a time. This is how GPT writes a sentence."
        """
        self.eval()  # switches dropout OFF [Slide: "Training vs. inference"]
        for _ in range(num_new_tokens):
            # If the running text exceeds block_size, keep only the last
            # block_size tokens — the model can't attend beyond its context.
            # 💬 "This IS the 'context window' people talk about in big LLMs."
            idx_cond = idx if idx.size(1) <= self.cfg.block_size else idx[:, -self.cfg.block_size:]

            logits, _ = self(idx_cond)   # full forward pass; loss is None here

            # Take only the LAST position's logits — the next-token prediction.
            # (Training used all T positions; inference uses just one.)
            # TEMPERATURE: divide logits before softmax.
            #   <1.0 sharpens (more confident/repetitive), >1.0 flattens (wilder).
            # ⚠️ max(temperature, 1e-8) guards against divide-by-zero at temp=0.
            logits = logits[:, -1, :] / max(temperature, 1e-8)

            # TOP-K: keep only the k highest-scoring tokens, set the rest to
            # -inf (so softmax gives them probability 0). Stops the model from
            # ever sampling a wildly unlikely character.
            if top_k is not None and top_k > 0:
                v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
                logits[logits < v[:, [-1]]] = float("-inf")  # v[:, [-1]] = k-th best score

            probs = F.softmax(logits, dim=-1)                 # scores → probabilities
            next_id = torch.multinomial(probs, num_samples=1) # SAMPLE (not argmax) (B, 1)
            # ❓ Ask: "What changes if we use argmax instead?" → deterministic,
            #    and typically loops/repeats. Sampling is where variety comes from.
            idx = torch.cat([idx, next_id], dim=1)            # append & loop
        return idx


# ═════════════════════════════════════════════════════════════════════════════
# ⑦ DATA: making (input, target) pairs
# ═════════════════════════════════════════════════════════════════════════════
# 💬 "Where do the 'answer keys' come from? The text itself. The target is the
#    input shifted one character to the right — free labels, no human needed.
#    This is what 'self-supervised' means."
#
#    data:   [T, h, e, _, c, a, t]
#    x  =     [T, h, e, _, c, a]
#    y  =        [h, e, _, c, a, t]   ← y[i] is the 'next token' after x[i]

def get_batch(data: torch.Tensor, block_size: int, batch_size: int,
              device: torch.device) -> tuple[torch.Tensor, torch.Tensor]:
    """Sample batch_size random windows of length block_size from data."""
    # Random start indices; -1 leaves room for the shifted target.
    ix = torch.randint(0, len(data) - block_size - 1, (batch_size,))
    x = torch.stack([data[i:i + block_size] for i in ix])          # inputs
    y = torch.stack([data[i + 1:i + 1 + block_size] for i in ix])  # same, shifted +1
    return x.to(device), y.to(device)
    # ⚠️ Random windows ≠ epochs. We sample with replacement, so "one epoch"
    #    isn't well-defined here — we just count steps. Fine at this scale.


# ═════════════════════════════════════════════════════════════════════════════
# ⑧ TRAIN COMMAND        [Slides: "Why we minimize it" → "Training"]
# ═════════════════════════════════════════════════════════════════════════════
# 💬 The training slide's loop, in code:
#    batch → forward → loss → backward (gradients) → optimizer step → repeat.
#    Map each line of the loop below onto that diagram as you scroll.

def train(args):
    device = torch.device("cpu")  # CPU-only, by design for the workshop
    torch.manual_seed(1337)       # fixed seed → everyone in the room gets the
                                  # same loss curve and ends up with the same model.

    # ── 1) Load data: just a plain text file ──
    if not os.path.exists(args.data):
        sys.exit(f"Data file not found: {args.data}")
    with open(args.data, "r", encoding="utf-8") as f:
        text = f.read()
    print(f"Loaded {len(text):,} characters from {args.data}")

    # ── 2) Build tokenizer FROM the data ──
    # 💬 "vocab_size isn't chosen by us — it falls out of the data. For tiny
    #    Shakespeare it's 65: letters, digits, punctuation, newline."
    tokenizer = CharTokenizer.from_text(text)
    print(f"Vocab size: {tokenizer.vocab_size}")

    # ── 3) Encode the whole corpus ONCE; split 90/10 train/val ──
    # ❓ Ask: "Why hold out a validation set?" → train loss can fall from
    #    memorization; val loss tells us if the model GENERALIZES. Watch the
    #    gap between the two columns in the printout.
    data = torch.tensor(tokenizer.encode(text), dtype=torch.long)
    n_train = int(0.9 * len(data))
    train_data = data[:n_train]
    val_data = data[n_train:]
    print(f"Train tokens: {len(train_data):,}   Val tokens: {len(val_data):,}")

    # ── 4) Build the model from the 5 numbers ──
    cfg = GPTConfig(vocab_size=tokenizer.vocab_size)
    model = GPT(cfg).to(device)
    # 💬 Pause on this printout and reconcile it with the parameter-count
    #    slides (~3.25M). Slight difference = weight tying (head not double-counted).
    print(f"Model parameters: {model.num_parameters():,}")

    # ── 5) Optimizer: AdamW ──   [Slide: "From gradients to weight updates"]
    # 💬 "AdamW = the slide's `w -= lr × gradient`, but with per-weight adaptive
    #    step sizes from running averages of past gradients. Used by GPT-2/3,
    #    LLaMA — and by us."
    # weight_decay gently pulls weights toward 0 — regularization.
    optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)

    # LR schedule: linear WARMUP (first 100 steps), then COSINE DECAY to min_lr.
    # 💬 "Warmup: start gentle while weights are random garbage. Cosine: take
    #    smaller steps as we converge — like slowing down when parallel parking."
    # ❓ "What's lr at step 0? At step `warmup`? At step max_steps?" → trace it.
    def lr_at(step: int, max_steps: int, base_lr: float = 3e-4,
              warmup: int = 100, min_lr: float = 3e-5) -> float:
        if step < warmup:
            return base_lr * (step + 1) / warmup            # ramp base_lr/warmup → base_lr
        progress = (step - warmup) / max(1, max_steps - warmup)
        progress = min(1.0, progress)
        return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))

    # ── 6) Training loop ──
    # Fixed sample prompts so the audience can WATCH the same prompt improve
    # from noise → words → Shakespeare-ish as training progresses.
    sample_prompts = ["To be, or not ", "For I am falser than vows made in"]
    log_path = f"{output_path}/loss_log.csv"      # for graphing the loss curve later
    os.makedirs(output_path, exist_ok=True)       # make sure the output directory exists (no-op if it does)
    log_file = open(log_path, "w", newline="")
    log_writer = csv.writer(log_file)
    log_writer.writerow(["step", "train_loss", "val_loss", "lr"])

    # Derived intervals: roughly 10 evals, 5 sample dumps, and 4 checkpoints per
    # run (exact counts depend on rounding and on step 0 being skipped),
    # regardless of --max-steps.
    eval_every = max(1, args.max_steps // 10)
    sample_every = max(1, args.max_steps // 5)
    ckpt_every = max(1, args.max_steps // 4)

    t0 = time.time()
    model.train()  # dropout ON
    for step in range(args.max_steps):
        # Set this step's learning rate (PyTorch optimizers read it from
        # param_groups; we overwrite it each step with our schedule).
        lr = lr_at(step, args.max_steps)
        for g in optimizer.param_groups:
            g["lr"] = lr

        # ══ THE four lines that ARE training ══  [Slide: "Training"]
        xb, yb = get_batch(train_data, cfg.block_size, batch_size, device)
        _, loss = model(xb, yb)               # 1. forward → loss (one number)

        optimizer.zero_grad(set_to_none=True) # 2. clear last step's gradients
        # ⚠️ Forgetting zero_grad is THE classic bug: PyTorch ACCUMULATES
        #    gradients by default, so they'd pile up across steps.
        loss.backward()                       # 3. backprop: chain rule, automatic
                                              #    [Slide: "How backprop computes gradients"]
        optimizer.step()                      # 4. nudge EVERY weight a tiny bit

        # ── Periodic eval + logging ──
        if step % eval_every == 0 or step == args.max_steps - 1:
            model.eval()                      # dropout OFF for a fair measurement
            with torch.no_grad():             # no gradient bookkeeping needed
                xv, yv = get_batch(val_data, cfg.block_size, batch_size, device)
                _, val_loss = model(xv, yv)
            elapsed = time.time() - t0
            # 💬 Narrate the first line: loss ≈ 4.17 ≈ ln(65) — the loss of a
            #    UNIFORM guess over 65 chars ("Model B" on the cross-entropy
            #    slide). Watching it fall below that = the model is learning.
            print(f"step {step:5d} | lr {lr:.2e} | train {loss.item():.4f} | "
                  f"val {val_loss.item():.4f} | {elapsed:.1f}s")
            log_writer.writerow([step, loss.item(), val_loss.item(), lr])
            log_file.flush()                  # so the CSV is graphable mid-run
            model.train()                     # back to training mode

        # ── Periodic samples: the workshop's "wow" moment ──
        # 💬 Early samples are gibberish; mid-run grows words and line breaks;
        #    late samples look like a drunk Shakespeare. Same prompts each time
        #    makes the progress visible.
        if step % sample_every == 0 and step > 0:
            model.eval()
            for p in sample_prompts:
                ids = torch.tensor([tokenizer.encode(p)], dtype=torch.long, device=device)
                out = model.generate(ids, num_new_tokens=80, temperature=0.8, top_k=20)
                generated = tokenizer.decode(out[0].tolist())
                print(f"   sample: {generated!r}")
            model.train()

        # ── Periodic checkpoints (resume / compare across training stages) ──
        if step > 0 and step % ckpt_every == 0:
            save_checkpoint(model, tokenizer, cfg, f"{output_path}/ckpt_step_{step}.pt")

    # Final checkpoint — this is what the generate command loads.
    save_checkpoint(model, tokenizer, cfg, f"{output_path}/final_checkpoint.pt")
    log_file.close()
    print(f"Done. Wrote final_checkpoint.pt and {log_path}.")


def save_checkpoint(model: GPT, tokenizer: CharTokenizer, cfg: GPTConfig, path: str):
    # 💬 "A checkpoint must contain everything needed to rebuild the model:
    #    1. the weights, 2. the 5 numbers (shape), 3. the vocab (so token IDs
    #    decode to the same characters). Forget the vocab → garbage output."
    torch.save({
        "model_state": model.state_dict(),   # every learned tensor, by name
        "config": asdict(cfg),               # the 5 numbers
        "vocab": tokenizer.vocab,            # the character list
    }, path)
    print(f"  saved {path}")


# ═════════════════════════════════════════════════════════════════════════════
# GENERATE COMMAND — checkpoint in, text out
# ═════════════════════════════════════════════════════════════════════════════
# 💬 "Inference = rebuild the exact same model, load frozen weights, loop
#    generate(). No loss, no gradients, no optimizer — compare the two columns
#    of the 'Training vs. inference' slide."

def generate(args):
    device = torch.device("cpu")
    if args.seed is not None:
        torch.manual_seed(args.seed)  # same seed + same prompt → same output
        # 💬 Good demo: run twice with --seed 42 (identical), then without (varies).

    if not os.path.exists(args.checkpoint):
        sys.exit(f"Checkpoint not found: {args.checkpoint}")
    # ⚠️ weights_only=False because our checkpoint also carries config + vocab
    #    (not just tensors). Fine for our OWN files; for untrusted downloads
    #    you'd want weights_only=True (it restricts unpickling).
    ckpt = torch.load(args.checkpoint, map_location=device, weights_only=False)

    # Rebuild the exact architecture and tokenizer the checkpoint was saved with:
    cfg = GPTConfig(**ckpt["config"])        # the 5 numbers → same shapes
    tokenizer = CharTokenizer(ckpt["vocab"]) # same vocab → same ID↔char mapping

    model = GPT(cfg).to(device)
    model.load_state_dict(ckpt["model_state"])  # pour the learned weights back in
    model.eval()                                # inference mode: dropout off

    # prompt → IDs → generate → IDs → text. The full round trip from slide 1.
    ids = torch.tensor([tokenizer.encode(args.prompt)], dtype=torch.long, device=device)
    out = model.generate(
        ids,
        num_new_tokens=args.num_new_tokens,
        temperature=args.temperature,   # ❓ live demo: try 0.2 vs 1.5 and compare
        top_k=args.top_k,
    )
    print(tokenizer.decode(out[0].tolist()))


# ═════════════════════════════════════════════════════════════════════════════
# CLI — two subcommands, as promised on the "Hands-on: the plan" slide
# ═════════════════════════════════════════════════════════════════════════════

def main():
    parser = argparse.ArgumentParser(description="Tiny GPT: train and generate")
    sub = parser.add_subparsers(dest="cmd", required=True)

    # train: only data + steps are CLI args; architecture lives in GPTConfig.
    p_train = sub.add_parser("train", help="Train the model from a text file")
    p_train.add_argument("--data", required=True, help="path to UTF-8 text file")
    p_train.add_argument("--max-steps", type=int, default=2000)
    p_train.set_defaults(func=train)

    # generate: checkpoint + prompt + the sampling knobs from the slides.
    p_gen = sub.add_parser("generate", help="Generate text from a checkpoint")
    p_gen.add_argument("--checkpoint", required=True)
    p_gen.add_argument("--prompt", required=True)
    p_gen.add_argument("--num-new-tokens", type=int, default=500)
    p_gen.add_argument("--temperature", type=float, default=0.8)
    p_gen.add_argument("--top-k", type=int, default=40)
    p_gen.add_argument("--seed", type=int, default=None)
    p_gen.set_defaults(func=generate)

    args = parser.parse_args()
    args.func(args)   # dispatch to train() or generate()


if __name__ == "__main__":
    main()
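
Two ideas from the comments above can be checked in plain Python, without PyTorch: the "uniform guess" starting loss of ln(65), and what the temperature and top_k knobs do to a set of scores. This is a minimal sketch for intuition only, not the workshop's GPT.generate. The helper names uniform_baseline_loss and sample_next and the toy scores are made up for illustration.

```python
import math
import random


def uniform_baseline_loss(vocab_size: int) -> float:
    """Cross-entropy (in nats) of guessing every token with equal probability."""
    return math.log(vocab_size)


def sample_next(logits, temperature=0.8, top_k=None, rng=random):
    """Pick one token ID from raw scores using temperature and top-k."""
    # Rank token IDs by score and keep only the top_k best (all if top_k is None).
    ids = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    if top_k is not None:
        ids = ids[:top_k]
    # Temperature rescales the scores: below 1 sharpens, above 1 flattens.
    scaled = [logits[i] / temperature for i in ids]
    # Softmax weights over the survivors (subtract the max for stability).
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(ids, weights=weights, k=1)[0]


# A 65-character vocabulary gives the starting loss mentioned in the comments.
print(round(uniform_baseline_loss(65), 2))  # 4.17

# Hypothetical scores for a 4-token vocabulary.
toy_logits = [2.0, 1.0, 0.5, -1.0]
print(sample_next(toy_logits, temperature=0.8, top_k=2, rng=random.Random(42)))
```

With top_k=1 the choice is always the highest-scoring token, whatever the temperature. A larger top_k and a higher temperature let lower-ranked tokens through more often.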

You can read this blog using RSS Feed. But if you are the person who loves getting emails, then you can join my readers by signing up.

Exploring Epicure the Food Embedding Model

Thejesh GN — Thu, 11 Jun 2026 09:54:38 +0000

FlavorGraph is a large-scale graph network that combines data from over a million recipes with chemical compound information from 1,500+ flavor molecules to predict ingredient pairings. It uses graph embedding methods to represent foods as dense vectors, enabling data-driven food pairing suggestions that go beyond human or chef intuition. In FlavorGraph, the chemical and recipe context signals are fused at training time via a fixed metapath design, leaving no inference-time knob to adjust their relative weights in the final embeddings.

One could call Epicure an enhanced FlavorGraph. It builds on FlavorGraph to produce 300-D embeddings, but instead of a single embedding that combines both chemical and recipe-context signals, it has three embedding models: Cooc, Chem, and Core. That way, as a user, you can choose the embedding you want. It also includes recipes from other languages, not just English. Recipes from English, Chinese, Russian, Vietnamese, Spanish, Turkish, Indonesian, German, and Indian-English are included. They are machine-translated into English for this. It’s good to see Indian recipes covered as well.

They have also normalized the raw ingredient strings to 1,790 canonical entries using an LLM. I found this interesting. I also found that this doesn’t necessarily cover everything that an Indian recipe needs. For example, there are only coriander and coriander_root. In India, a recipe usually calls for coriander seeds or coriander leaves, and their list offers no way to tell the two apart.

The most interesting part is how they traverse the graph to construct metapaths for training Metapath2Vec models. The three models follow different logic to construct it. But they all have the same architecture and hyperparameters: 300-dim embeddings, walks_per_node=100, walk_length=50, context_size=7, 5 negative samples, batch_size=32,768, lr=0.0025, 20 epochs, no warm restart.

From the paper, the methodology (I have embedded the SVG flowchart below) is easy to understand and, if required, can be replicated. But the raw data is not available, though sources are mentioned. Maybe it can be replicated with a different dataset. They have also used LLMs quite a bit in the data and training pipeline. For me, constructing the metapaths was the most interesting part.

Epicure-Cooc. Walks the Cooc graph: pure I–I random walks weighted by NPMI. No compound nodes.
Epicure-Core. Walks the typed-compound graph and injects pure I–I walks at --ii_repeat=10 alongside the typed-compound metapaths. Edge transitions are weighted so I–C hops are not oversampled relative to the smaller I–I edge set. The resulting embedding blends chemical and recipe-context signal.
Epicure-Chem. Walks the typed-compound graph but with --ii_repeat=0: the I–I templates are absent and the only walks the skip-gram sees are compound-mediated. The chemistry extreme of the family.
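
To get a feel for "pure I–I random walks weighted by NPMI", here is a toy sketch. It is not the Epicure code. The probabilities are invented, the helper names are hypothetical, and I am assuming that the next hop is picked with probability proportional to NPMI and that edges with non-positive NPMI are dropped. The paper's exact weighting may differ.

```python
import math
import random


def npmi(p_xy, p_x, p_y):
    """Normalized pointwise mutual information, in the range [-1, 1]."""
    return math.log(p_xy / (p_x * p_y)) / -math.log(p_xy)


def build_graph(p_single, p_joint):
    """Ingredient-ingredient graph keeping only positively associated pairs."""
    graph = {name: {} for name in p_single}
    for (a, b), p_ab in p_joint.items():
        weight = npmi(p_ab, p_single[a], p_single[b])
        if weight > 0:
            graph[a][b] = weight
            graph[b][a] = weight
    return graph


def random_walk(graph, start, walk_length, rng):
    """Walk the graph, choosing each hop proportionally to its edge weight."""
    walk = [start]
    while len(walk) < walk_length:
        neighbors = graph[walk[-1]]
        if not neighbors:  # isolated node: nothing to walk to
            break
        nxt = rng.choices(list(neighbors), weights=list(neighbors.values()), k=1)[0]
        walk.append(nxt)
    return walk


# Invented recipe statistics: how often each ingredient and each pair appears.
p_single = {"rice": 0.30, "turmeric": 0.10, "cumin": 0.10, "chocolate": 0.05}
p_joint = {
    ("rice", "turmeric"): 0.06,
    ("rice", "cumin"): 0.05,
    ("turmeric", "cumin"): 0.04,
    ("rice", "chocolate"): 0.005,  # rarer than chance, so NPMI is negative
}

graph = build_graph(p_single, p_joint)
print(random_walk(graph, "rice", walk_length=6, rng=random.Random(0)))
```

Each walk is then treated like a sentence and fed to skip-gram, which is why ingredients that co-occur strongly end up close together in the Cooc embedding.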

Epicure Methodology Flowchart

You can explore the embedding online or using a simple script below. In my exploration, I found that it is okay to use the nearest neighbors for an ingredient in a recipe, in chemistry, or in mixed contexts. But I don’t think it’s at a level where we could just replace ingredients. It also doesn’t have any information about allergens, texture, etc. But it is small and can be used to build on top of it.

Online explorer Epicure three sibling ingredient embeddings.

# /// script
# requires-python = ">=3.11"
# dependencies = [
#   "numpy",
#   "huggingface_hub",
# ]
# ///

from epicure import Epicure

m = Epicure.from_pretrained("Kaikaku/epicure-core")

print("neighbors : chicken")
print(m.neighbors("chicken", k=5))

print("neighbors : coriander")
print(m.neighbors("coriander", k=5))

print("slerp : rice, cuisine:South_Asian")
print(m.slerp("rice", "cuisine:South_Asian", theta_deg=30, k=5))

print("closest_mode : chocolate")
print(m.closest_mode("chocolate", kind="factor", k=3))

To run the above script, include the epicure.py file from the HF repo in the same folder. The epicure package on PyPI is a different package.

# Output of above script
neighbors : chicken
[('pork', 0.5807677507400513), ('beef', 0.5712239742279053), ('chicken_broth', 0.5498523116111755), ('peanut', 0.5233361124992371), ('cream_of_chicken_soup', 0.5217206478118896)]
neighbors : coriander
[('cumin', 0.7044016122817993), ('scallion', 0.6711025238037109), ('chili_pepper', 0.6609357595443726), ('turmeric', 0.6468461155891418), ('chicken_broth', 0.6463753581047058)]
slerp : rice, cuisine:South_Asian
[('turmeric', 0.7607102225586673), ('mustard_seed', 0.756659547118539), ('fenugreek_seed', 0.7468295496269882), ('coriander', 0.7428115853809022), ('cumin', 0.7388300342030064)]
closest_mode : chocolate
[('F_15/M0', 'American sweet confections and dessert bases', 0.7751955986022949), ('F_7/M1', 'Sweet liqueurs and confections', 0.7553147673606873), ('F_8/M4', 'Sweet confections and dessert ingredients', 0.7396374344825745)]

Definitions

Word2Vec is a neural method for building word embeddings from raw text. Words appearing in similar contexts end up close together in the vector space. It comes in two variants: skip-gram and CBOW.

skip-gram: predict the surrounding context words from a target word
CBOW: predict the target word from its surrounding context
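
The difference between the two variants is easiest to see in the training examples each one builds from the same sentence. A minimal sketch (the function names and the three-word "sentence" are made up for illustration, and no neural network is trained here):

```python
def skipgram_pairs(tokens, window):
    """(target, context) pairs: the target word predicts each nearby word."""
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs


def cbow_examples(tokens, window):
    """(context_words, target) examples: the nearby words predict the target."""
    examples = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + window + 1]
        if context:
            examples.append((context, target))
    return examples


sentence = ["rice", "turmeric", "cumin"]
print(skipgram_pairs(sentence, window=1))
print(cbow_examples(sentence, window=1))
```

Skip-gram produces one example per (target, neighbor) pair, while CBOW produces one example per position with all neighbors bundled together.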

Metapath2Vec (PDF) does random walks through the graph guided by a defined metapath, then feeds those walks to a skip-gram model to learn node embeddings. It works well for heterogeneous graphs. Basically, it’s like creating sentences based on the metapath, then running Word2Vec skip-gram on them.

A metapath is a typed node sequence, such as author – paper – venue – paper – author. It’s predefined manually by the model designer.
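
Here is a small sketch of what "a walk guided by a metapath" means, using the author – paper – venue – paper – author example. The tiny graph, the node names, and the metapath_walk helper are all hypothetical. It does one pass over the metapath and leaves out the skip-gram training step.

```python
import random

# Toy heterogeneous graph: every node has a type, edges are undirected.
NODE_TYPE = {
    "alice": "author", "bob": "author",
    "p1": "paper", "p2": "paper",
    "kdd": "venue",
}
EDGES = {
    "alice": ["p1"],
    "bob": ["p2"],
    "p1": ["alice", "kdd"],
    "p2": ["bob", "kdd"],
    "kdd": ["p1", "p2"],
}


def metapath_walk(start, metapath, edges, node_type, rng):
    """Random walk in which step i may only land on a node of type metapath[i]."""
    if node_type[start] != metapath[0]:
        raise ValueError("start node does not match the first metapath type")
    walk = [start]
    for wanted_type in metapath[1:]:
        candidates = [n for n in edges[walk[-1]] if node_type[n] == wanted_type]
        if not candidates:  # dead end: no neighbor of the required type
            break
        walk.append(rng.choice(candidates))
    return walk


metapath = ["author", "paper", "venue", "paper", "author"]
walk = metapath_walk("alice", metapath, EDGES, NODE_TYPE, random.Random(7))
print(walk)
```

The printed walk is one "sentence". Many such walks, started from every node, make up the corpus that skip-gram is trained on.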


Embedding user code in your app using Extism

Thejesh GN — Tue, 19 May 2026 14:40:06 +0000

Every application I love has some kind of power-user mode where I can add my own code or scripting to make it useful to me. Simple examples are Firefox with its addons or VLC with its plugins. Ideally, any significant or valuable application should have such a feature.

But as a developer or builder, it’s not easy to build such a feature securely. I have tried it before by embedding Lua with some restrictions, but it isn’t great at sandboxing. I’ve always had an eye on WASM-based implementations because browsers have decent sandboxing. Very recently, when I was updating Navidrome, I came across its plugin system, which uses Extism and is based on WebAssembly. Ideally, at some point, even WordPress could do this, allowing us to sandbox plugins.

So I wrote a simple plugin to try out this infrastructure. I wrote a plugin with internet access to a given URL. Its only goal is to test the plugin’s ability to make web API calls in a controlled way. And to explore how plugins work. The language I chose for the plugin is Python, but there are many better supported languages. I tested it using CLI. And it’s not difficult to call it from a host language.

I installed and used Extism CLI, version 1.6.2, as a direct binary download from the repository. I wrote a simple Python script that, as a plugin, expects a JSON input containing the URL and payload for a web POST. There are other ways to pass the data to the plugin, but JSON seemed the most flexible.

Inside the plugin, I read the JSON input, parse the parameters, and make a web POST request using Extism’s Http. Remember, you need to use the Python PDK (Plugin Development Kit). Those PDKs can have some limitations of their own. So be mindful when choosing a plugin language.

Also as you can see, a plugin can have many functions. The ones annotated with @extism.plugin_fn can be called from the host environment. The code is easy to understand.

#webhook.py
import json
import extism
from extism import Http

@extism.plugin_fn
def post_json():
    params = extism.input_json()

    url = params["url"]
    payload = params["payload"]

    response = Http.request(
        url,
        meth="POST",
        headers={"Content-Type": "application/json"},
        body=json.dumps(payload),
    )

    result = {
        "status": response.status_code,
        "body": response.data_str(),
    }

    extism.output_str(json.dumps(result))

To call this plugin, I first need to build it. To build it, use extism-py, which is installed as part of the Python PDK. Then run

extism-py webhook.py webhook.wasm

That will produce webhook.wasm, the plugin code that one would probably ship. You can call or invoke it via the CLI for testing. As you can see, I have to grant internet access to the WASM by allowing access to the host webhook.site.

extism call webhook.wasm post_json \
--allow-host "webhook.site" --wasi \
--input '{ "url": "https://webhook.site/0b757518-7120-4919-a12f-252d3dfbc8b5", "payload": { "name": "Thejesh from inside the sandbox", "seq": 1 } }'
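
Conceptually, the --allow-host flag is an allowlist check on the hostname of every outgoing request. The sketch below only illustrates that idea in plain Python. It is not Extism’s implementation: the host_allowed function is hypothetical, and the glob-style matching via fnmatch is my assumption, so check the Extism documentation for the real matching rules.

```python
from fnmatch import fnmatch
from urllib.parse import urlsplit


def host_allowed(url, allowed_hosts):
    """Return True if the URL's hostname matches any pattern in the allowlist."""
    host = urlsplit(url).hostname or ""
    return any(fnmatch(host, pattern) for pattern in allowed_hosts)


allowed = ["webhook.site"]
print(host_allowed("https://webhook.site/some-path", allowed))
print(host_allowed("https://example.com/steal-data", allowed))
```

The useful property is the default: with an empty allowlist, nothing is reachable, so a plugin gets network access only to the hosts you explicitly name.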

But in the real world, you would call it from a host program. Let’s say your host language is also Python; then you can call this WASM plugin using the following piece of code. Again, it’s not that difficult.

# host.py
# /// script
# dependencies = ["extism"]
# ///

import json
import extism

wasm_file = "webhook.wasm"

input_data = {
    "url": "https://webhook.site/dc895b92-0b21-46db-a9c0-766dd87e8b0f",
    "payload": {
        "name": "Thejesh from inside the sandbox",
        "seq": 1,
    },
}

manifest = {
    "wasm": [{"path": wasm_file}],
    "allowed_hosts": ["webhook.site"],
}

with extism.Plugin(
    manifest,
    wasi=True,
) as plugin:
    result = plugin.call("post_json", json.dumps(input_data).encode())
    print(result.decode())

All of a sudden, safe plugin implementation doesn’t sound that difficult, does it?


30 Days of DXing

Thejesh GN — Mon, 20 Apr 2026 09:45:23 +0000

#30DaysOfDXing is where I try to receive various radio wave transmissions and listen to them using my radio, SDR, etc. I am currently using ShortwaveSchedule, Short-Wave Info, ShorWave DB, QSL.net, etc. as my sources. Look at the project page for more details.

I am a radio enthusiast and also an E&C engineer. Though fields and waves, or antenna theory, were not my favorite subjects in engineering, I love radio waves and listening to them. I think if I had had more practice-oriented classes during my engineering studies, I would have loved those subjects then as much as I do now that I practice. Maybe that is something for me to remember when I teach or share.

DXing, taken from DX, the telegraphic shorthand for “distance” or “distant”, is the hobby of receiving and identifying distant radio or television signals, or making two-way radio contact with distant stations in amateur radio, citizens band radio or other two-way radio communications. – Wikipedia

If you are new to Short Wave Listening (SWL) or the HAM radio world, DXing generally means listening to distant signals. Still, I am not limiting myself to only the “distant” signals in this project, nor to radio or television signals. I am going to try local signals as well. For example, the Non-Directional Beacon (NDB) from Bangalore Airport is fair game. I might also do local FM stations, just to learn. The focus is on learning and trying new things.

SW 13710kHz on 20260415 at 1834 GMT heard in Bengaluru. Transmission by China Radio International, Kunming Anning, China. Received using Eton Elite Traveler.

My plan is not to spend more than 30 minutes a day on this. I should be able to achieve it. I have already started logging them at #30DaysOfDXing, along with all the details. If it interests you, contact me. Maybe we can do it together.

NESDR and RH 795 with SMA Male to BNC Female cable.

Eton Elite Traveler Radio


Running Bash programs on Moodle CodeRunner

Thejesh GN — Mon, 09 Mar 2026 06:09:45 +0000

I have been teaching a course at APU that includes Bash scripting. I have a love-hate relationship with Bash. It’s a weird combination of a programming language and an interactive CLI. It’s confusing, easy to make mistakes in, and hard to debug, but on the other hand, it’s available on almost all systems. It builds on the power of the CLI and the CLI tools available in the OS. It’s easy to write small, useful scripts that automate your daily painful work. Hence, it’s worth knowing a bit of it even today.

The platform we (APU) use is Moodle. I have been in the MOOC industry for a decade now and have heard a lot about Moodle, but this is the first time I have used it to run a course. To run a programming course, you need an easy programming environment to challenge students. Previously, we used an NSJail-based environment with CourseBuilder (now called Seek), which works really well. But in this case, for Moodle, it’s the CodeRunner plugin. It seems fairly easy to use. That said, Bash is not supported out of the box as a user language, so I had to use Python to make it possible. This also assumes the environment (CodeRunner/JOBE) has Bash installed, though it is not directly accessible through the API as a user language.

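import subprocess

# CodeRunner fills in the {{ ... }} placeholders before this template runs.
# Final script = test setup code + student's answer + extra test code.
script = """{{ TEST.testcode | e('py') }}""" + '\n' + """{{ STUDENT_ANSWER | e('py') }}""" + '\n' + """{{ TEST.extra | e('py') }}"""
stdin_text = """{{ TEST.stdin | e('py') }}"""  # standard input for this test case

with open('__prog__.sh', 'w') as outfile:
    outfile.write(script)

# Run the combined script with Bash, feed it the test's stdin, give up after 5 seconds.
result = subprocess.run(['/bin/bash', '__prog__.sh'], capture_output=True, text=True, input=stdin_text, timeout=5)
stdout = result.stdout.strip()
print(stdout)  # the printed text is what gets compared with the expected output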

The template Python code takes TEST.testcode and prepends it to the user-entered STUDENT_ANSWER code. It also takes the TEST.extra code and appends it. Then it runs the result as a Bash script using Python’s subprocess module, passing TEST.stdin as input, captures STDOUT, and prints it for comparison.

CodeRunner in Moodle. Customization using Templates for using Python to run Bash scripts written by user.

Example Question 1: Read an input as score. If the score is greater than or equal to 40, then print P. If the score is less than 40, then print U.

CodeRunner in Moodle. Example Bash question test case where STDIN is read and used.

Example Question 2: Write a function called cube. If a number is passed, it returns the cube of that number. For example cube 5 # will return 125

CodeRunner in Moodle. Example Bash question where we want to call a function user has written at the end. So it can be tested.
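
To make the two example questions concrete, here is a sketch of what the template effectively does once the placeholders are filled in, runnable outside Moodle. The student answers are ones I made up, the helper names are hypothetical, and I use bash -c instead of writing __prog__.sh to keep it short. It only runs the scripts if bash is found on the machine.

```python
import shutil
import subprocess


def build_script(testcode, student_answer, extra):
    """Same concatenation as the template: testcode, then answer, then extra."""
    return testcode + "\n" + student_answer + "\n" + extra


def run_bash(script, stdin_text, timeout=5):
    """Run the script with Bash and return its trimmed standard output."""
    result = subprocess.run(
        ["bash", "-c", script],
        capture_output=True, text=True, input=stdin_text, timeout=timeout,
    )
    return result.stdout.strip()


# Question 1: the answer reads the score from stdin; no testcode or extra needed.
answer_q1 = 'read score\nif [ "$score" -ge 40 ]; then echo P; else echo U; fi'
script_q1 = build_script("", answer_q1, "")

# Question 2: the answer only defines cube; TEST.extra calls it at the end.
answer_q2 = "cube() { echo $(( $1 * $1 * $1 )); }"
script_q2 = build_script("", answer_q2, "cube 5")

if shutil.which("bash"):
    print(run_bash(script_q1, "45\n"))  # P
    print(run_bash(script_q1, "39\n"))  # U
    print(run_bash(script_q2, ""))      # 125
```

Question 2 shows why TEST.extra exists: a function definition alone prints nothing, so the test case has to append the call whose output gets compared.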

We did about three in-class labs using this. It worked well for all our cases. Though I must say we tried only Bash basics. Maybe there are use cases where this might break, but I think for most of it, this should work.


Motorola T82 Extreme is my PMR446 Walkie-Talkie

Thejesh GN — Mon, 09 Feb 2026 01:50:54 +0000

As a kid, I always wanted a set of Walkie Talkies to chat with friends. We got cell phones as adults, but that desire remained. A few years back, I took the HAM exam, but then COVID hit us, and I didn’t apply for a call sign. Now I have to retake the exams online, as the system now doesn’t allow me to apply for a call sign with paper exam results. I will take that exam again. But in the meantime, I also wanted something that I could use with friends and family who I don’t think are interested in taking an exam or applying for a license. Hence this search.

I do have a kind of Walkie-talkie in the form of Helmet communications. There are some issues with using it as a general communications device. Some that matter to me are

It’s not open; it’s proprietary. The only part of that ecosystem that is open is Sena’s Universal Intercom protocol, which uses Bluetooth. It comes with its own limitations.
It’s made for helmet communication. Even if you plan to reuse it, it’s challenging to use in other situations due to its form factor.
It uses 2.4 GHz (the same frequency range as Bluetooth, Wi-Fi, etc.), so its range is somewhat limited. In our tests, it was around 700-900 m in direct line of sight. Some support a proprietary mesh network to increase the range.

Requirements

With all this, I was left to look for a walkie-talkie.

License-free in India. Ideally across the world, but at least in India.
Open protocols
Decent range
Multi purpose
Easy to use, it shouldn’t take more than five minutes to learn its functionality
Easy to manage, charge, and rugged. I plan to use it everywhere.
Widely recognized brand and model. So I don’t get stopped at the airport and other places while I am carrying it.
Supported in India
True walkie-talkie, not PTT over the cell network.

License-free and open

There are two bands used for public, license-free communication without encryption. The Family Radio Service (FRS) is used in the USA and Canada. FRS operates between 462.5625 MHz and 467.7125 MHz. PMR446 (private mobile radio) is used in most of Europe and India.

PMR446 is a private mobile radio band that operates between 446.0 and 446.2 MHz in India. It’s a license-free band, as per the Gazette Notification G.S.R. 1047(E). Since I am dealing with India, I will focus on it. The rules limit output power to 0.5 W. It doesn’t seem like much, but you can reach a kilometer or more in urban conditions with buildings and trees, and 5 km to 10 km in direct line-of-sight conditions. It’s much better than my helmet communication system. The rules allow channel spacing of 6.25 kHz (digital) and 12.5 kHz (analog). The frequency range can accommodate 16 analog or 32 digital channels. I have a dedicated page about the PMR446 band, channels, channel guarding, etc. The most important thing to know is that we need to select walkie-talkies that operate on PMR446 in India.
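
The 16 and 32 channel counts follow directly from the band width and the channel spacing. The sketch below does that arithmetic in whole hertz. The assumption that channels are packed edge to edge, with each centre half a spacing above the lower channel edge, is mine; please use the Gazette table or a proper PMR446 channel list as the authoritative source.

```python
BAND_START_HZ = 446_000_000  # 446.0 MHz
BAND_END_HZ = 446_200_000    # 446.2 MHz


def channel_centres_hz(spacing_hz):
    """Centre frequencies for channels packed edge to edge across the band."""
    count = (BAND_END_HZ - BAND_START_HZ) // spacing_hz
    first_centre = BAND_START_HZ + spacing_hz // 2
    return [first_centre + n * spacing_hz for n in range(count)]


analog = channel_centres_hz(12_500)   # 12.5 kHz spacing
digital = channel_centres_hz(6_250)   # 6.25 kHz spacing

print(len(analog), len(digital))  # 16 32
for number, hz in enumerate(analog[:3], start=1):
    print(f"analog channel {number}: {hz / 1e6:.5f} MHz")
```

Halving the spacing doubles the channel count, which is where the 32 digital channels come from.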

G.S.R. 1047(E) [PART II—SEC. 3(i)] Table V – Personal Mobile Radio at 446 MHz

I have shown the snippet from the Gazette and given link to the rules for your reference. Please get your own legal advice. I am not a lawyer, and this is not legal advice.

The protocols are standardized, and most follow the ETSI Harmonised European Standard. Since this is something all vendors use, if someone wants to build, the protocol is not a secret sauce. This also makes the technology and walkies vendor-neutral.

Privacy

There is no encryption on PMR446. So you shouldn’t communicate anything that needs encryption on these bands. There is no privacy. There are ways to keep a set of people independent by using a specific channel and Squelch (CTCSS or DCS). Squelch is called PL (Private Line) tone, CG (Channel Guard) tone, or QC (Quiet Channel) tone, depending on the vendor, but all refer to the same functionality. Basically, you use a channel and a specific CTCSS or DCS code to keep your conversations independent from others. This setup doesn’t stop others from using the same combo and listening to you.

Motorola T82 Extreme

Armed with this information and these requirements, I started my search, and my final list was:

Motorola Talkabout T82 Extreme – Manual
Kenwood TK 3501
Wavex PT100
Vertel Team Talkie Radio
Sanchar G3U
Baofeng GT-68 PMR Walkie Talkie

If I had money, I would have tried each of them and then decided which to go with. Based on the information I found online, I went with the T82 by Motorola. If you have any other model listed here, let me know, and we can test them together.

In terms of cost, it wasn’t cheap; a pair of them cost me INR 18,000. It came in a package with two walkies, a carrying case, a charger, NiMH rechargeable batteries, earpieces, belt clips, etc. It’s probably the most expensive pair. The G3U or PT100 is half the price.

Things I like

Brand recognition. Everyone knows Motorola. It also looks colorful, playful, rugged, and harmless. The support in India is good, and the community around it is also good. I have had other Motorola phones (pre-smartphone era cell phones); they were very rugged, and my experience with them has been very positive in general.
It’s rugged. It can be used in conditions where others can’t be used. I won’t be scared to put it on a motorcycle handlebar or hang it outside my backpack. I am also not worried about water splashing on it (IPX4), though I wouldn’t submerge it. Also, not worried about accidentally dropping them.
A replaceable, rechargeable, 800 mAh NiMH battery powers it. It can be recharged using a micro USB charger. That said, you can replace the NiMH battery with 3 x AA alkaline batteries in an emergency. The promised battery life is around 18 hours. Even if it’s just half, I will be happy. One can also upgrade to a 1300 mAh NiMH battery if required.
It comes with a headset with a boom mic. That makes it usable inside the helmet if you want, or hands-free. It has VOX/iVOX (Voice Operated Transmission / Internal Voice Operated Transmission), which lets you transmit without pressing a button. It has three levels of sensitivity for voice activation. By default, it is in PTT (Push To Talk) mode, where you need to press and hold a button to talk. In VOX mode, sound activates the transmission. So all you do is speak, and the radio will transmit for you. This is especially useful when you are riding or doing manual work.
It supports 8 PMR Channels. User expandable to 16 Channels in countries where it’s allowed by government authorities. And 121 Sub-Codes (38 CTCSS Codes & 83 DCS codes). So, there are plenty of options to isolate your group from others.
It’s very easy to use. Manuals and settings are very easy. It probably takes less than 5 minutes to teach someone to start using it effectively.
It has an Emergency Alert Mode with a dedicated button that can signal members of your group for help.
It has dual-channel monitoring. It lets you listen to two channels and engage with the primary one.
There are other features, such as a flashlight, roger tone, channel monitoring and scanning, easy pairing, etc.
The range has been good. I was able to get at least 1 km in urban settings with buildings. I continue to do more range tests. I will write a separate post about it.
As such, there is no limit to the number of folks you can have on the same channel. This is true for most Walkie-Talkies.

Things to improve

Micro USB. It should have been USB-C even if it was just 5 watts.
Price. It is expensive. I am looking to try the Wavex PT100 and Vertel, which are half the price. If you have one, let me know. We can compare.

Conclusion

Should you get it? Maybe. It depends on your needs. But once you start using it, you will see how useful it becomes, especially if you ride or drive a lot in teams, have a farm, work in a place with poor or no network, trek, or run events. As for me, I am still going to use BluArmor while riding. But I will surely carry it along with me.

It’s also a good entry point into HAM radio. It’s good enough to generate interest among kids and adults about analog communications.


One Video Format

Thejesh GN — Mon, 19 Jan 2026 18:14:45 +0000

I have a bunch of screen-casts that I want to release. I have not found a good PeerTube instance to host my videos. I am happy to pay a monthly fee, and I hope one day there will be a platform that provides this.

For now, I will add them to my site, Archive.org, and YouTube. They are all CC-BY-SA.

I am thinking of uploading just one version of each video, with no server-side transcoding. Hence, I want to convert the videos into the most optimized format and size. I have been doing some experiments and seem to have found decent settings. It’s 1080p or 720p at 24 frames per second.

If it’s a mobile screencast, I will pad it with color so it becomes a proper FHD or HD frame.

I am going to use WebM as the container with AV1 as video codec and Opus as audio codec. Most screencasts could be just 720p.
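
For anyone who wants to try a similar conversion, here is a sketch that builds an ffmpeg command for WebM with AV1 video and Opus audio at 24 frames per second. These are not my exact settings: the libsvtav1 encoder, the CRF of 35, the 64k audio bitrate, and the helper name av1_webm_command are assumptions for illustration, and your ffmpeg build needs libsvtav1 and libopus enabled. The function only builds the argument list and does not run ffmpeg.

```python
import shlex


def av1_webm_command(src, dst, height=720, fps=24, crf=35, pad_color=None):
    """Build an ffmpeg argument list for AV1 + Opus in a WebM container."""
    width = height * 16 // 9  # 1280 for 720p, 1920 for 1080p
    if pad_color is None:
        # Landscape source: scale to the target height, keep the aspect ratio.
        video_filter = f"scale=-2:{height}"
    else:
        # Portrait (mobile) source: fit inside the frame, then pad with color.
        video_filter = (
            f"scale={width}:{height}:force_original_aspect_ratio=decrease,"
            f"pad={width}:{height}:(ow-iw)/2:(oh-ih)/2:color={pad_color}"
        )
    return [
        "ffmpeg", "-i", src,
        "-vf", video_filter,
        "-r", str(fps),
        "-c:v", "libsvtav1", "-crf", str(crf),
        "-c:a", "libopus", "-b:a", "64k",
        dst,
    ]


# Desktop screencast to 720p, and a mobile screencast padded to 720p.
print(shlex.join(av1_webm_command("desktop.mkv", "desktop.webm")))
print(shlex.join(av1_webm_command("mobile.mp4", "mobile.webm", pad_color="black")))
```

A lower CRF means better quality and a larger file, so it is worth trying a few values on a short clip before converting everything.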

OSMAnd routing using GPX file, 720p aka HD, 2 minutes, 17 seconds, 10MB

Test video – Big Buck Bunny 1080p, 10 seconds, 1 MB file size
