Enter a corpus, train the educational model, and see how tokens become next-token probabilities.
The model is not trained yet. Click "Train model".
Line thickness in the attention block shows average attention between tokens in the current context.
The row is the current token, and the column is the token it attends to. A darker cell means a higher attention weight.
In this educational visualization, each circle in a layer stands for a whole token vector, not for a single vector element (a "neuron" in the usual sense). One circle is one token whose vector has the configured "Vector size"; the individual elements v1, v2, v3 are not drawn separately.
The calculation is a rough educational estimate that assumes every value is stored as a JavaScript Float64 number, at 8 bytes per number. Actual browser memory use is higher because of object, array, and runtime overhead.
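The arithmetic behind such an estimate can be sketched in a few lines of Python. Which numbers the tool actually counts is not stated here, so the breakdown below (one vector per vocabulary token plus a square token-to-token matrix) is an assumption, and `estimate_bytes` is a hypothetical helper name.

```python
BYTES_PER_FLOAT64 = 8  # from the text: 8 bytes per number


def estimate_bytes(vocab_size: int, vector_size: int) -> int:
    """Rough lower bound on memory for the model's numbers.

    Assumed breakdown (not confirmed by the tool):
    - one vector of `vector_size` numbers per vocabulary token
    - one vocab_size x vocab_size matrix of transition logits
    """
    embedding_numbers = vocab_size * vector_size
    transition_numbers = vocab_size * vocab_size
    return (embedding_numbers + transition_numbers) * BYTES_PER_FLOAT64


total = estimate_bytes(vocab_size=100, vector_size=16)
print(f"{total} bytes, about {total / 1024:.1f} KiB")
```

With a 100-token vocabulary and 16-element vectors this counts 11,600 numbers, or 92,800 bytes, before any runtime overhead.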
This tab shows the trainable weights of the educational model: the matrix of logits for the "current token -> next token" transition. Attention on the "Layers" tab is computed during each pass over the text, while the weights shown here are the ones that training updates.
This tool is a visual educational model of a large language model. It does not call external neural networks and does not send text to the server: training and generation run in the browser. The service shows tokenization, vocabulary, embeddings, positional features, multi-head attention, feed-forward processing, next-token probabilities, and step-by-step generation.
Real LLMs have billions of parameters trained on huge corpora. This simulator uses a small educational model: it builds a vocabulary from the training corpus, creates numeric token vectors, and trains a next-token transition matrix with softmax gradient steps. It is simplified, but preserves the core idea: the model receives previous tokens and computes probabilities for the next token.
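As an illustration of what training a next-token transition matrix with softmax gradient steps can look like, here is a minimal Python sketch. It is not the simulator's own code: the whitespace tokenizer, the function names, and the default learning rate and epoch count are all assumptions chosen for the example.

```python
import math


def softmax(logits):
    """Turn a list of logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_bigram(tokens, epochs=200, learning_rate=0.5):
    """Train logits[current][next] on adjacent token pairs."""
    vocab = sorted(set(tokens))
    index = {tok: i for i, tok in enumerate(vocab)}
    n = len(vocab)
    logits = [[0.0] * n for _ in range(n)]
    pairs = [(index[a], index[b]) for a, b in zip(tokens, tokens[1:])]
    for _ in range(epochs):
        for cur, nxt in pairs:
            probs = softmax(logits[cur])
            for j in range(n):
                # Gradient of cross-entropy w.r.t. logit j: p_j - target_j
                target = 1.0 if j == nxt else 0.0
                logits[cur][j] -= learning_rate * (probs[j] - target)
    return vocab, index, logits


tokens = "the cat sat on the mat".split()  # assumed whitespace tokenization
vocab, index, logits = train_bigram(tokens)
after_cat = softmax(logits[index["cat"]])
for tok, p in zip(vocab, after_cat):
    print(f"P({tok} | cat) = {p:.3f}")
```

In this toy corpus "cat" is always followed by "sat", so that transition ends up with most of the probability mass, while "the" splits its probability between "cat" and "mat".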
The attention visualization shows which tokens influence the current token most. Multiple attention heads can highlight different relationships: nearby words, similar numeric vectors, repeated tokens, and context structure. Layer cards show how data moves through a Transformer: attention mixes information between tokens, then the feed-forward block transforms each token separately.
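The mixing step can be sketched with plain scaled dot-product attention. This is a simplified illustration under stated assumptions: it uses the token vectors directly as queries, keys, and values, with no learned projections, no causal mask, and a single head, so the tool's own computation may differ. The function names are hypothetical.

```python
import math


def softmax(scores):
    """Normalize scores into weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(vectors):
    """Row i = current token, column j = token it attends to."""
    d = len(vectors[0])
    weights = []
    for query in vectors:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
            for key in vectors
        ]
        weights.append(softmax(scores))
    return weights


def attend(vectors):
    """Mix token vectors using the attention weights."""
    weights = attention_weights(vectors)
    d = len(vectors[0])
    mixed = [
        [sum(w * v[i] for w, v in zip(row, vectors)) for i in range(d)]
        for row in weights
    ]
    return weights, mixed


# Three toy token vectors; the first and third are identical.
vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
weights, mixed = attend(vectors)
for row in weights:
    print([round(w, 3) for w in row])
```

Each printed row sums to 1, and the first token gives equal weight to itself and to the identical third token, which is the "similar numeric vectors" relationship a head can highlight.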
Settings let you change layers, attention heads, vector size, learning rate, epochs, generation temperature, and output length. Low temperature favors likely continuations; high temperature explores rarer variants. This makes it visible that an LLM repeatedly chooses the next token from a probability distribution.
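The effect of temperature can be shown with a short Python sketch: logits are divided by the temperature before softmax, and generation repeatedly samples one token from the result. The logit values and function names below are made up for illustration and are not taken from the tool.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then apply softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_next(logits, temperature, rng):
    """Pick one token index from the temperature-scaled distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


logits = [2.0, 1.0, 0.0]  # hypothetical next-token logits
for t in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])

rng = random.Random(0)  # seeded so repeated runs behave the same
generated = [sample_next(logits, 1.0, rng) for _ in range(10)]
print(generated)
```

At temperature 0.2 almost all probability sits on the top token, while at 2.0 the distribution is much flatter, so rarer tokens are sampled more often.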
The service is useful for students and beginning developers: it explains why word order matters, why vectors are needed, how attention connects tokens to each other, what error-based training means, and why text generation is a repeated next-token choice.