Understanding Word2Vec

Gaurav Chopra · November 9, 2025

A Step-by-Step Interactive Journey Through Word Embeddings

What is Word2Vec? It's a breakthrough technique in Natural Language Processing that transforms words into mathematical vectors, allowing computers to understand semantic relationships between words. Words with similar meanings get similar vector representations.

Today's Journey: We'll explore an interactive browser-based Word2Vec demo that uses TensorFlow.js. We'll walk through each step of the process using a real example to see how machines learn the meaning of words.

Step 1: Text Input & Tokenization

Every Word2Vec journey begins with text. The first step is breaking down your input into individual units called tokens (usually words).

Our Input Text:

The emperor and the empress explored the royal courtyard, while the puppy chased a dragonfly along the lake, as children jumped with shiny balloons near the playground.

What Happens:

The tool performs tokenization - lowercasing the text and splitting it by spaces and punctuation to create a vocabulary of unique words. Each word becomes a token, and we count how often it appears.

Resulting Vocabulary:

Token        Frequency   Word Index
the          6           0
emperor      1           1
and          1           2
empress      1           3
explored     1           4
royal        1           5
courtyard    1           6
while        1           7
puppy        1           8
chased       1           9
a            1           10
dragonfly    1           11
along        1           12
lake         1           13
as           1           14
children     1           15
jumped       1           16
with         1           17
shiny        1           18
balloons     1           19
near         1           20
playground   1           21

Key Insight: We have 22 unique words. Notice that "the" appears 6 times - it's a common word. Each word gets assigned a unique index number (0-21) which will be important in later steps.
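
To make the vocabulary step concrete, here is a minimal Python sketch of it. It is not the demo's actual code (the demo runs on TensorFlow.js); the function names `tokenize` and `build_vocab` are made up for illustration, and it assumes tokens are lowercased runs of letters indexed in order of first appearance, which matches the table above.

```python
import re
from collections import Counter

TEXT = (
    "The emperor and the empress explored the royal courtyard, "
    "while the puppy chased a dragonfly along the lake, "
    "as children jumped with shiny balloons near the playground."
)

def tokenize(text):
    """Lowercase the text and keep each run of letters as one token."""
    return re.findall(r"[a-z]+", text.lower())

def build_vocab(tokens):
    """Count tokens and give each unique word an index by first appearance."""
    counts = Counter(tokens)
    word_to_index = {}
    for token in tokens:
        if token not in word_to_index:
            word_to_index[token] = len(word_to_index)
    return counts, word_to_index

tokens = tokenize(TEXT)
counts, word_to_index = build_vocab(tokens)

# Print the same three columns as the table: token, frequency, index
for word, index in word_to_index.items():
    print(f"{word:<12}{counts[word]:>3}{index:>4}")
```

Indexing by first appearance is only one possible convention; many implementations sort by frequency instead.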

Step 2: Creating Training Pairs (Context-Target Generation)

Word2Vec learns from context - the fundamental principle is that words appearing in similar contexts have similar meanings. To train the model, we create pairs of words that appear near each other.

The Sliding Window Approach:

We use a "window size" (let's say 2) that looks at words within 2 positions on either side of each target word.

Example from our text:

Sequence: "The emperor and the empress explored"

When target = "emperor":

  • Context words: "The", "and", "the"
  • Training pairs created:
    • "The" → "emperor"
    • "and" → "emperor"
    • "the" → "emperor"

When target = "empress":

  • Context words: "and", "the", "explored" (within this excerpt; in the full text the window also reaches the "the" that follows "explored")
  • Training pairs created:
    • "and" → "empress"
    • "the" → "empress"
    • "explored" → "empress"

When target = "puppy":

  • Context words: "while", "the", "chased", "a"
  • Training pairs created:
    • "while" → "puppy"
    • "the" → "puppy"
    • "chased" → "puppy"
    • "a" → "puppy"
Why This Matters: By creating these pairs, the model learns that "emperor" and "empress" appear in similar contexts (both appear next to "the" and "and"). This helps the model understand they're related concepts!

Our sentence has 27 tokens (22 unique words). Counting every context-target occurrence, repeats included, a window size of 2 yields 102 training pairs and a window size of 1 yields 52. Each pair teaches the model something about word relationships.
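
The sliding window can be sketched in a few lines of Python. This is an illustrative version, not the demo's source: `make_pairs` is a hypothetical helper, and it assumes every (context, target) occurrence is kept, repeats included.

```python
import re

TEXT = (
    "The emperor and the empress explored the royal courtyard, "
    "while the puppy chased a dragonfly along the lake, "
    "as children jumped with shiny balloons near the playground."
)

def make_pairs(tokens, window=2):
    """Return (context, target) pairs for every position in the token list."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)                 # clip the window at the start
        end = min(len(tokens), i + window + 1)     # and at the end of the text
        for j in range(start, end):
            if j != i:                             # a word is not its own context
                pairs.append((tokens[j], target))
    return pairs

tokens = re.findall(r"[a-z]+", TEXT.lower())
pairs = make_pairs(tokens, window=2)

# Pairs whose target is "emperor", in the order the window visits them
emperor_pairs = [pair for pair in pairs if pair[1] == "emperor"]
print(emperor_pairs)
print(len(pairs))
```

Words near the start or end of the text get fewer pairs because the window is clipped there, which is why "emperor" has only three context words.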

Step 3: Converting Words to Numbers (Vectorization)

Neural networks can't process text directly - they need numbers. This step transforms our words into numerical vectors.

Two-Step Conversion Process:

Step 3A: Word Indexing

Each unique word gets a number (we already saw this in Step 1):

"the" → 0 "emperor" → 1 "and" → 2 "empress" → 3 "explored" → 4 "royal" → 5 "courtyard" → 6 "while" → 7 "puppy" → 8 ... "playground" → 21

Step 3B: One-Hot Encoding

Each word becomes a vector of 22 numbers (one for each word in vocabulary) - all zeros except one 1.

Interactive Visualization:

Interactive heatmap: Each row is a word, each column is a position. Yellow = 1 (active), Purple = 0 (inactive). Hover to see exact values.

Notice the pattern: Each word has exactly ONE yellow cell (value = 1) at its unique index position, and all other cells are purple (value = 0). This is why it's called "one-hot" encoding!
Key Insight: This one-hot encoding is sparse (mostly zeros) and doesn't capture any semantic meaning yet. That's what the neural network will learn! The training pairs now become:
  • Input: one-hot vector for "the" → Output: one-hot vector for "emperor"
  • Input: one-hot vector for "chased" → Output: one-hot vector for "puppy"
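
As a small sketch of the encoding, assuming the word indexes from Step 1 (the helper name `one_hot` is just for illustration):

```python
def one_hot(index, size):
    """Return a list of `size` zeros with a single 1 at position `index`."""
    vector = [0] * size
    vector[index] = 1
    return vector

VOCAB_SIZE = 22

the_vec = one_hot(0, VOCAB_SIZE)       # "the" has index 0
emperor_vec = one_hot(1, VOCAB_SIZE)   # "emperor" has index 1
chased_vec = one_hot(9, VOCAB_SIZE)    # "chased" has index 9

# A training pair is simply two one-hot vectors: (input, expected output)
training_example = (the_vec, emperor_vec)
print(chased_vec)
```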

Step 4: Neural Network Architecture & Training

Now comes the magic! We build a neural network that will learn meaningful word representations.

Network Architecture:

Interactive diagram: The hidden layer learns the word embeddings!

Training Process:

Let's walk through what happens when we train on the pair: "chased" → "puppy"

1. Forward Pass:

  1. Input: One-hot vector for "chased" [0,0,0,0,0,0,0,0,0,1,0,0,...]
  2. Hidden Layer: Multiplies input by weights, produces embedding: [0.3, 0.6, 0.2, -0.4, 0.7]
  3. Output Layer: Produces probabilities for each word
    • "the": 2%
    • "emperor": 1%
    • "puppy": 45% ← Prediction!
    • "dragonfly": 38%
    • ... others ...
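
The forward pass above can be sketched in plain Python. This is a rough illustration under stated assumptions, not the demo's TensorFlow.js model: the weights are random and untrained (so the probabilities will not match the percentages above), there are no bias terms, and the names `w_in`, `w_out` and `forward` are invented for the example.

```python
import math
import random

VOCAB_SIZE = 22
EMBED_DIM = 5
CHASED = 9  # index of "chased" from Step 1

random.seed(0)  # arbitrary seed so the sketch is repeatable
# Hypothetical untrained weights; a real run learns these values.
w_in = [[random.uniform(-0.5, 0.5) for _ in range(EMBED_DIM)] for _ in range(VOCAB_SIZE)]
w_out = [[random.uniform(-0.5, 0.5) for _ in range(VOCAB_SIZE)] for _ in range(EMBED_DIM)]

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def forward(word_index):
    """One-hot input -> embedding -> one probability per vocabulary word."""
    hidden = w_in[word_index]  # a one-hot input times w_in selects this row
    scores = [
        sum(hidden[d] * w_out[d][k] for d in range(EMBED_DIM))
        for k in range(VOCAB_SIZE)
    ]
    return hidden, softmax(scores)

hidden, probs = forward(CHASED)
print(len(hidden), len(probs), round(sum(probs), 6))
```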

2. Calculate Loss:

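Using categorical cross-entropy, we measure how wrong the prediction was. With a one-hot target, the loss is the negative log of the probability given to the correct word: since the target was "puppy" and the model gave it 45%, the loss is -ln(0.45) ≈ 0.80 (lower is better).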
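
Here is the same loss calculation as a short Python sketch. The probability list reuses the percentages from the forward pass; the last entry lumps all remaining words together, which is an assumption made only so the list sums to 1.

```python
import math

def cross_entropy(predicted_probs, target_index):
    """Categorical cross-entropy against a one-hot target: -ln(p_target)."""
    return -math.log(predicted_probs[target_index])

# "the", "emperor", "puppy", "dragonfly", all other words combined (assumed)
probs = [0.02, 0.01, 0.45, 0.38, 0.14]
PUPPY_POSITION = 2

loss = cross_entropy(probs, PUPPY_POSITION)
print(round(loss, 2))  # prints 0.8
```

A perfect prediction (100% on "puppy") would give a loss of 0, and the loss grows without bound as the probability of the correct word approaches 0.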

3. Backpropagation:

The network adjusts its weights to improve. Next time it sees "chased", it should predict "puppy" with higher confidence.
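
To see backpropagation nudge the prediction, here is a hedged sketch of repeated updates on the single pair "chased" → "puppy". It swaps in plain stochastic gradient descent rather than Adam to keep the arithmetic visible, uses random starting weights and no bias terms, and the names `predict` and `train_step` are hypothetical.

```python
import math
import random

VOCAB_SIZE, EMBED_DIM = 22, 5
CHASED, PUPPY = 9, 8  # word indexes from Step 1

random.seed(0)  # arbitrary seed so the sketch is repeatable
w_in = [[random.uniform(-0.5, 0.5) for _ in range(EMBED_DIM)] for _ in range(VOCAB_SIZE)]
w_out = [[random.uniform(-0.5, 0.5) for _ in range(VOCAB_SIZE)] for _ in range(EMBED_DIM)]

def predict(word_index):
    """Forward pass: embedding lookup, output scores, softmax."""
    hidden = w_in[word_index]
    scores = [sum(hidden[d] * w_out[d][k] for d in range(EMBED_DIM)) for k in range(VOCAB_SIZE)]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def train_step(input_index, target_index, lr=0.1):
    """One SGD update on a single pair; returns the loss before the update."""
    hidden = list(w_in[input_index])  # copy, because w_in is modified below
    probs = predict(input_index)
    loss = -math.log(probs[target_index])
    # Gradient of softmax + cross-entropy w.r.t. the scores: probs - one_hot(target)
    d_scores = [p - (1.0 if k == target_index else 0.0) for k, p in enumerate(probs)]
    # Gradient w.r.t. the embedding, computed before w_out changes
    d_hidden = [sum(w_out[d][k] * d_scores[k] for k in range(VOCAB_SIZE)) for d in range(EMBED_DIM)]
    for d in range(EMBED_DIM):
        for k in range(VOCAB_SIZE):
            w_out[d][k] -= lr * hidden[d] * d_scores[k]
        w_in[input_index][d] -= lr * d_hidden[d]
    return loss

before = predict(CHASED)[PUPPY]
losses = [train_step(CHASED, PUPPY) for _ in range(20)]
after = predict(CHASED)[PUPPY]
print(f"P(puppy | chased): {before:.3f} -> {after:.3f}")
```

Only the "chased" row of the input weights changes in each step, because every other row is multiplied by a zero in the one-hot input.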

The Magic - Word Embeddings:

Here's the breakthrough insight: The hidden layer weights ARE your word embeddings!

After training, the weight matrix from Input → Hidden layer contains a row for each word. These rows are the learned embeddings:

Interactive heatmap: Learned embeddings (5 dimensions). Notice how similar words have similar color patterns! Darker blue = negative values, Darker red = positive values

Notice: "emperor" and "empress" have very similar patterns (similar colors across dimensions)! "puppy" and "dragonfly" also show similarities! The network learned semantic relationships by converting sparse 22-dimensional one-hot vectors into dense 5-dimensional embeddings.
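
Why are the weight rows the embeddings? Multiplying a one-hot vector by the Input → Hidden weight matrix simply picks out one row. The sketch below checks this on a made-up 4-word, 3-dimensional matrix; the values and the helper `vector_times_matrix` are purely illustrative.

```python
def vector_times_matrix(vector, matrix):
    """Compute vector @ matrix using plain Python lists."""
    columns = len(matrix[0])
    return [
        sum(vector[r] * matrix[r][c] for r in range(len(matrix)))
        for c in range(columns)
    ]

# Hypothetical weight matrix: one row per word, one column per dimension
w_in = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [1.0, 1.1, 1.2],
]
one_hot_word_2 = [0, 0, 1, 0]

via_matmul = vector_times_matrix(one_hot_word_2, w_in)
via_lookup = w_in[2]
print(via_matmul == via_lookup)  # prints True
```

This is also why practical implementations skip the multiplication and look the row up directly.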

Training Parameters:

Optimizer (Adam recommended):

The algorithm that updates weights. Adam is smart - it tracks momentum and adapts learning rates automatically for each parameter.
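
For the curious, here is a minimal sketch of the Adam update rule for a single weight. The decay rates and epsilon are the commonly used defaults (0.9, 0.999, 1e-8), assumed here rather than taken from the demo, and the gradients are made up.

```python
import math

def adam_step(param, grad, state, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update to a single scalar parameter."""
    state["t"] += 1
    # Running average of gradients (the "momentum" part)
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    # Running average of squared gradients (drives the per-parameter step size)
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad * grad
    # Bias correction: both averages start at zero and would otherwise be too small
    m_hat = state["m"] / (1 - beta1 ** state["t"])
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    return param - lr * m_hat / (math.sqrt(v_hat) + eps)

state = {"t": 0, "m": 0.0, "v": 0.0}
weight = 0.5
for grad in [0.8, 0.6, -0.1]:  # made-up gradients for illustration
    weight = adam_step(weight, grad, state)
    print(round(weight, 4))
```

Dividing by the square root of the squared-gradient average means the very first step has a size close to the learning rate, whatever the gradient's magnitude.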

Learning Rate (e.g., 0.01):

Controls step size during training:

  • Too high (0.5): Training is unstable, might never converge
  • Just right (0.01): Steady progress, converges in reasonable time
  • Too low (0.0001): Very slow, might need 10x more epochs
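
The three regimes are easy to reproduce on a toy problem. The sketch below runs plain gradient descent (not Adam) on the made-up function f(x) = 5x², chosen only because these three learning rates behave differently on it; where the thresholds fall for a real model depends on its loss surface and optimizer.

```python
def descend(lr, steps=50, start=1.0):
    """Run plain gradient descent on f(x) = 5 * x**2 and return the final x."""
    x = start
    for _ in range(steps):
        grad = 10 * x   # derivative of 5 * x**2
        x -= lr * grad  # step against the gradient
    return x

# The minimum is at x = 0, so a final value near 0 means it converged
for lr in (0.5, 0.01, 0.0001):
    print(lr, descend(lr))
```

With 0.5 each step overshoots the minimum and lands further away than it started, with 0.01 the value shrinks steadily toward 0, and with 0.0001 it has barely moved after 50 steps.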

Epochs (e.g., 50):

How many times to go through all training pairs. Each epoch refines the embeddings (the loss values below are illustrative):

  • Epoch 1: Loss = 3.2 (random weights, poor predictions)
  • Epoch 10: Loss = 1.5 (learning patterns)
  • Epoch 30: Loss = 0.4 (good embeddings formed)
  • Epoch 50: Loss = 0.1 (well-trained!)

Step 5: t-SNE Visualization

We now have word embeddings, but there's a problem: they live in more dimensions than we can plot. Step 4 used 5 dimensions; the example below uses 10 to make the point. Either way, we can't visualize that space directly, so we need to compress it to 2D.

Enter t-SNE:

t-SNE (t-distributed Stochastic Neighbor Embedding) is an algorithm that reduces high-dimensional data to 2D or 3D while trying to keep points that were close neighbors in the original space close together in the plot.

The Transformation:

BEFORE t-SNE (10 dimensions - invisible):

"emperor" = [0.62, 0.73, 0.18, -0.25, 0.81, 0.12, -0.45, 0.33, 0.88, -0.19]
"empress" = [0.58, 0.76, 0.15, -0.28, 0.79, 0.14, -0.42, 0.31, 0.85, -0.17]
"puppy" = [-0.42, 0.15, 0.88, 0.65, -0.31, 0.52, 0.23, -0.61, 0.14, 0.77]
"dragonfly" = [-0.38, 0.18, 0.91, 0.62, -0.28, 0.49, 0.21, -0.58, 0.16, 0.74]

↓↓↓ t-SNE Magic ↓↓↓

AFTER t-SNE (2 dimensions - can plot!):

"emperor" = [2.3, 5.2]
"empress" = [2.5, 5.4] ← Close to emperor!
"puppy" = [7.1, 3.3]
"dragonfly" = [7.4, 3.1] ← Close to puppy!
"courtyard" = [4.2, 2.1]
"playground" = [4.5, 2.3] ← Close to courtyard!
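
Before trusting the 2D picture, it can help to check closeness in the original space. A common measure is cosine similarity, sketched here on the 10-dimensional example vectors above. Using cosine similarity is a choice made for this sketch; the demo may compare vectors differently.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: near 1 = same direction, below 0 = opposing."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

emperor = [0.62, 0.73, 0.18, -0.25, 0.81, 0.12, -0.45, 0.33, 0.88, -0.19]
empress = [0.58, 0.76, 0.15, -0.28, 0.79, 0.14, -0.42, 0.31, 0.85, -0.17]
puppy = [-0.42, 0.15, 0.88, 0.65, -0.31, 0.52, 0.23, -0.61, 0.14, 0.77]
dragonfly = [-0.38, 0.18, 0.91, 0.62, -0.28, 0.49, 0.21, -0.58, 0.16, 0.74]

print("emperor vs empress:", round(cosine_similarity(emperor, empress), 3))
print("puppy vs dragonfly:", round(cosine_similarity(puppy, dragonfly), 3))
print("emperor vs puppy:  ", round(cosine_similarity(emperor, puppy), 3))
```

The first two pairs score very close to 1, while "emperor" and "puppy" score below 0, which is the structure t-SNE then tries to keep in 2D.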

The Visualization Plot:

Interactive plot: Hover over points to see words, zoom and pan to explore clusters

What You'll Discover:

Semantic Clusters Can Emerge (with a corpus this small, the layout varies from run to run):
  • Royalty: "emperor", "empress", "royal" group together
  • Animals: "puppy", "dragonfly" are near each other
  • Places: "courtyard", "playground", "lake" form a cluster
  • Actions: "explored", "chased", "jumped" may group
  • Common words: "the", "and", "a" typically in their own area

t-SNE Parameters:

Perplexity (5-50):

  • Low (5-10): Focuses on local structure, very tight clusters
  • Medium (15-30): Balanced view, recommended for most cases
  • High (40-50): Emphasizes global structure, looser clusters (needs more points than the perplexity value, so it suits larger vocabularies than our 22 words)
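
Perplexity can be read as "how many neighbors effectively count" for each point. In t-SNE's definition it is 2 raised to the entropy of a Gaussian-weighted neighbor distribution, and the algorithm tunes each point's Gaussian width until the target perplexity is reached. The sketch below only evaluates the perplexity for a fixed width; the distances and the `sigma` value are made up.

```python
import math

def perplexity(squared_distances, sigma):
    """Perplexity of the Gaussian neighbor distribution around one point.

    squared_distances: squared distance from the point to every other point.
    sigma: width of the Gaussian centered on the point.
    """
    weights = [math.exp(-d / (2 * sigma ** 2)) for d in squared_distances]
    total = sum(weights)
    probs = [w / total for w in weights]
    entropy = -sum(p * math.log2(p) for p in probs if p > 0)
    return 2 ** entropy

# 21 other words, all equally far away: every neighbor counts the same
print(perplexity([1.0] * 21, sigma=1.0))

# 3 close neighbors and 18 distant ones: only about three really count
print(perplexity([0.1] * 3 + [25.0] * 18, sigma=1.0))
```

Since a point with 21 possible neighbors can never have a perplexity above 21, very high settings have nothing to work with on a 22-word vocabulary.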

Learning Rate (10-200):

How fast points move during t-SNE optimization. Usually 10-50 works well.

Iterations (250-1000):

More iterations = better positioning, but slower. Usually 500 is sufficient.

Putting It All Together

Let's trace one word's complete journey from text to visualization:

The Journey of "empress":

  1. Tokenization: "empress" is identified as token #3
  2. Training Pairs: Creates pairs like:
    • "and" → "empress"
    • "the" → "empress"
    • "explored" → "empress"
  3. One-Hot Encoding: [0, 0, 0, 1, 0, 0, ...] (22 numbers)
  4. Neural Network Training: Over 50 epochs, learns embedding:
    [0.58, 0.76, 0.15, -0.28, 0.79] (5 numbers - dense representation!)
  5. t-SNE Visualization: Compressed to [2.5, 5.4] for plotting
    Appears next to "emperor" on the 2D plot ✓

Try It Yourself!

The Word2Vec demo makes all of this interactive and visual. You can experiment with different texts, adjust parameters, and see how word embeddings capture meaning.

What makes this tool special:

  • Runs entirely in your browser (TensorFlow.js)
  • No installation or setup required
  • Real-time visualization of embeddings
  • Educational and hands-on learning
  • Adjust parameters and see immediate results

Key Takeaway: Word2Vec transforms sparse, meaningless word representations into dense, meaningful vectors that capture semantic relationships. Words that appear in similar contexts end up with similar embeddings - this is how machines begin to "understand" language!

Written with passion for making AI accessible and understandable.

Gaurav is a Co-Founder of Eightgen AI