Understanding Word2Vec

Gaurav Chopra·November 9, 2025

A Step-by-Step Interactive Journey Through Word Embeddings

Published: November 2025 | Reading Time: 15 minutes

What is Word2Vec? It's a breakthrough technique in Natural Language Processing that transforms words into mathematical vectors, allowing computers to understand semantic relationships between words. Words with similar meanings get similar vector representations.

Today's Journey: We'll explore an interactive browser-based Word2Vec demo that uses TensorFlow.js. We'll walk through each step of the process using a real example to see how machines learn the meaning of words.

Step 1: Text Input & Tokenization

Every Word2Vec journey begins with text. The first step is breaking down your input into individual units called tokens (usually words).

Our Input Text:

The emperor and the empress explored the royal courtyard, while the puppy chased a dragonfly along the lake, as children jumped with shiny balloons near the playground.

What Happens:

The tool performs tokenization: it lowercases the text (so "The" and "the" count as the same word) and splits it on spaces and punctuation to create a vocabulary of unique words. Each word becomes a token, and we count how often it appears.

Resulting Vocabulary:

Token	Frequency	Word Index
the	6	0
emperor	1	1
and	1	2
empress	1	3
explored	1	4
royal	1	5
courtyard	1	6
while	1	7
puppy	1	8
chased	1	9
a	1	10
dragonfly	1	11
along	1	12
lake	1	13
as	1	14
children	1	15
jumped	1	16
with	1	17
shiny	1	18
balloons	1	19
near	1	20
playground	1	21

Key Insight: We have 22 unique words. Notice that "the" appears 6 times - it's a common word. Each word gets assigned a unique index number (0-21) which will be important in later steps.
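The tokenization and counting described above can be sketched in a few lines of plain Python. This is a minimal illustration and not the demo's actual TensorFlow.js code; the regular expression, the lowercasing step, and the names `tokenize` and `build_vocab` are assumptions chosen to reproduce the table.

```python
import re
from collections import Counter

TEXT = ("The emperor and the empress explored the royal courtyard, "
        "while the puppy chased a dragonfly along the lake, "
        "as children jumped with shiny balloons near the playground.")

def tokenize(text):
    """Lowercase the text and keep each run of letters as one token."""
    return re.findall(r"[a-z]+", text.lower())

def build_vocab(tokens):
    """Map each unique token to an index, in order of first appearance."""
    word_to_index = {}
    for token in tokens:
        if token not in word_to_index:
            word_to_index[token] = len(word_to_index)
    return word_to_index

tokens = tokenize(TEXT)
frequency = Counter(tokens)
word_to_index = build_vocab(tokens)

print(len(tokens), len(word_to_index))               # total tokens, unique words
print(frequency["the"], word_to_index["playground"])  # count of "the", last index
```

The sentence contains 27 tokens in total, of which 22 are unique, so the vocabulary size and the sentence length are two different numbers.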

Step 2: Creating Training Pairs (Context-Target Generation)

Word2Vec learns from context - the fundamental principle is that words appearing in similar contexts have similar meanings. To train the model, we create pairs of words that appear near each other.

The Sliding Window Approach:

We use a "window size" (let's say 2) that looks at words within 2 positions on either side of each target word.

Example from our text:

Sequence: "The emperor and the empress explored"

When target = "emperor":

Context words: "The", "and", "the"
Training pairs created:
- "The" → "emperor"
- "and" → "emperor"
- "the" → "emperor"

When target = "empress":

Context words: "and", "the", "explored", "the" (the second "the" follows "explored" in the full sentence)
Training pairs created:
- "and" → "empress"
- "the" → "empress"
- "explored" → "empress"
- "the" → "empress"

When target = "puppy":

Context words: "while", "the", "chased", "a"
Training pairs created:
- "while" → "puppy"
- "the" → "puppy"
- "chased" → "puppy"
- "a" → "puppy"

Why This Matters: By creating these pairs, the model learns that "emperor" and "empress" appear in similar contexts (both are paired with "the" and "and"). That overlap nudges their vectors toward each other, which is how the model comes to treat them as related concepts!

Our sentence has 27 tokens (22 unique words). With a window size of 2 it generates 102 training pairs; a window size of 1 would generate 52. Each pair teaches the model something about word relationships.
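Here is a small sketch of the sliding window in plain Python. The function name `make_pairs` and the (context, target) tuple layout are assumptions for illustration; the demo may organize its pairs differently.

```python
def make_pairs(tokens, window_size=2):
    """Return (context, target) pairs for every position in tokens."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window_size)                # clip at the left edge
        end = min(len(tokens), i + window_size + 1)    # clip at the right edge
        for j in range(start, end):
            if j != i:                                 # a word is not its own context
                pairs.append((tokens[j], target))
    return pairs

tokens = ("the emperor and the empress explored the royal courtyard "
          "while the puppy chased a dragonfly along the lake "
          "as children jumped with shiny balloons near the playground").split()

pairs = make_pairs(tokens, window_size=2)
puppy_contexts = [context for context, target in pairs if target == "puppy"]

print(len(pairs))        # 102
print(puppy_contexts)    # ['while', 'the', 'chased', 'a']
```

Words at the very start or end of the sentence get fewer pairs because the window is clipped there, which is why "emperor" (the second word) has only three context words.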

Step 3: Converting Words to Numbers (Vectorization)

Neural networks can't process text directly - they need numbers. This step transforms our words into numerical vectors.

Two-Step Conversion Process:

Step 3A: Word Indexing

Each unique word gets a number (we already saw this in Step 1):

"the" → 0
"emperor" → 1
"and" → 2
"empress" → 3
"explored" → 4
"royal" → 5
"courtyard" → 6
"while" → 7
"puppy" → 8
...
"playground" → 21

Step 3B: One-Hot Encoding

Each word becomes a vector of 22 numbers (one for each word in vocabulary) - all zeros except one 1.

Interactive Visualization:

Interactive heatmap: Each row is a word, each column is a position. Yellow = 1 (active), Purple = 0 (inactive). Hover to see exact values.

Notice the pattern: Each word has exactly ONE yellow cell (value = 1) at its unique index position, and all other cells are purple (value = 0). This is why it's called "one-hot" encoding!

Key Insight: This one-hot encoding is sparse (mostly zeros) and doesn't capture any semantic meaning yet. That's what the neural network will learn! The training pairs now become:

Input: one-hot vector for "the" → Output: one-hot vector for "emperor"
Input: one-hot vector for "chased" → Output: one-hot vector for "puppy"
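One-hot encoding is simple enough to write by hand. The sketch below uses a hypothetical helper named `one_hot` and the word indices from the Step 1 table.

```python
def one_hot(index, vocab_size):
    """Return a list of zeros with a single 1 at the given index."""
    vector = [0] * vocab_size
    vector[index] = 1
    return vector

VOCAB_SIZE = 22
chased = one_hot(9, VOCAB_SIZE)   # "chased" has index 9 in the Step 1 table
puppy = one_hot(8, VOCAB_SIZE)    # "puppy" has index 8

print(chased)
print(puppy)
```

Every vector has the same length as the vocabulary, so a larger vocabulary makes each vector longer while still holding a single 1.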

Step 4: Neural Network Architecture & Training

Now comes the magic! We build a neural network that will learn meaningful word representations.

Network Architecture:

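Interactive diagram: Input layer (22 units, one per vocabulary word) → Hidden layer (5 units) → Output layer (22 units, one probability per word). The hidden layer learns the word embeddings!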

Training Process:

Let's walk through what happens when we train on the pair: "chased" → "puppy"

1. Forward Pass:

Input: One-hot vector for "chased" [0,0,0,0,0,0,0,0,0,1,0,0,...]
Hidden Layer: Multiplies input by weights, produces embedding: [0.3, 0.6, 0.2, -0.4, 0.7]
Output Layer: Produces probabilities for each word
- "the": 2%
- "emperor": 1%
- "puppy": 45% ← Prediction!
- "dragonfly": 38%
- ... others ...
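The forward pass can be sketched without any framework. The code below is an illustration under stated assumptions: randomly initialized weights, a 5-dimensional hidden layer, and a softmax output. The names `W_in`, `W_out`, and `forward` are hypothetical, and the probabilities it prints come from untrained weights, so they will not match the percentages listed above.

```python
import math
import random

VOCAB_SIZE, EMBED_DIM = 22, 5
rng = random.Random(0)   # fixed seed so the sketch is repeatable

# Input -> Hidden weights: one row of EMBED_DIM numbers per word.
W_in = [[rng.uniform(-0.5, 0.5) for _ in range(EMBED_DIM)]
        for _ in range(VOCAB_SIZE)]
# Hidden -> Output weights: one column of scores per vocabulary word.
W_out = [[rng.uniform(-0.5, 0.5) for _ in range(VOCAB_SIZE)]
         for _ in range(EMBED_DIM)]

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)                        # subtract the max for stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def forward(one_hot_input):
    """Return (hidden, probabilities) for a one-hot input vector."""
    hidden = [sum(one_hot_input[i] * W_in[i][d] for i in range(VOCAB_SIZE))
              for d in range(EMBED_DIM)]
    scores = [sum(hidden[d] * W_out[d][k] for d in range(EMBED_DIM))
              for k in range(VOCAB_SIZE)]
    return hidden, softmax(scores)

chased = [0] * VOCAB_SIZE
chased[9] = 1                                 # one-hot vector for "chased"
hidden, probs = forward(chased)

print(hidden == W_in[9])   # the one-hot input simply selects row 9 of W_in
print(sum(probs))          # probabilities add up to 1 (up to rounding)
```

Because the input is one-hot, the matrix multiplication reduces to a row lookup, which is why real implementations usually skip the multiplication and index the weight matrix directly.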

2. Calculate Loss:

Using categorical cross-entropy, we measure how wrong the prediction was. Since the target was "puppy" and the model assigned it a 45% probability, the loss is -ln(0.45) ≈ 0.80 (lower is better; a perfect prediction would give 0).
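The loss for this one pair can be checked directly. In the sketch below, the function name `cross_entropy` is hypothetical, and the probabilities for the 18 words not listed in the forward pass are filled in evenly, an assumption made only so the list sums to 1.

```python
import math

def cross_entropy(probs, target_index):
    """Categorical cross-entropy with a one-hot target: -ln(p[target])."""
    return -math.log(probs[target_index])

# Probabilities from the walkthrough; the 18 unlisted words share the
# remaining 14% evenly (an assumption made only to complete the list).
probs = [0.14 / 18] * 22
probs[0], probs[1], probs[8], probs[11] = 0.02, 0.01, 0.45, 0.38

loss = cross_entropy(probs, target_index=8)   # 8 is the index of "puppy"
print(round(loss, 4))              # 0.7985
print(round(math.log(22), 2))      # 3.09, the loss of a uniform guess over 22 words
```

Only the probability given to the correct word enters the loss, so raising "puppy" from 45% toward 100% is the only way to push this pair's loss toward 0.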

3. Backpropagation:

The network adjusts its weights to improve. Next time it sees "chased", it should predict "puppy" with higher confidence.

The Magic - Word Embeddings:

Here's the breakthrough insight: The hidden layer weights ARE your word embeddings!

After training, the weight matrix from Input → Hidden layer contains a row for each word. These rows are the learned embeddings:

Interactive heatmap: Learned embeddings (5 dimensions). Notice how similar words have similar color patterns! Darker blue = negative values, Darker red = positive values

Notice: "emperor" and "empress" have very similar patterns (similar colors across dimensions)! "puppy" and "dragonfly" also show similarities! The network learned semantic relationships by converting sparse 22-dimensional one-hot vectors into dense 5-dimensional embeddings.

Training Parameters:

Optimizer (Adam recommended):

The algorithm that updates weights. Adam is smart - it tracks momentum and adapts learning rates automatically for each parameter.
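To make "tracks momentum and adapts learning rates" concrete, here is the standard Adam update rule applied to a single parameter. The toy objective f(x) = x^2, the starting point, and the function name `adam_minimize` are assumptions for illustration; in the real network the same update runs independently for every weight.

```python
import math

def adam_minimize(grad_fn, x, learning_rate=0.1, steps=500,
                  beta1=0.9, beta2=0.999, eps=1e-8):
    """Minimize a one-parameter function with the Adam update rule."""
    m = 0.0   # running average of gradients (the momentum part)
    v = 0.0   # running average of squared gradients (the adaptive part)
    for t in range(1, steps + 1):
        g = grad_fn(x)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)   # bias correction: m and v start at zero
        v_hat = v / (1 - beta2 ** t)
        x -= learning_rate * m_hat / (math.sqrt(v_hat) + eps)
    return x

# Toy objective f(x) = x^2 with gradient 2x; the minimum is at x = 0.
x_final = adam_minimize(lambda x: 2 * x, x=5.0)
print(x_final)
```

Dividing by the square root of `v_hat` is what gives each parameter its own effective step size: parameters with consistently large gradients take proportionally smaller steps.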

Learning Rate (e.g., 0.01):

Controls step size during training:

Too high (0.5): Training is unstable, might never converge
Just right (0.01): Steady progress, converges in reasonable time
Too low (0.0001): Very slow, might need 10x more epochs

Epochs (e.g., 50):

How many times to go through all training pairs. Each epoch refines the embeddings (the loss values below are illustrative):

Epoch 1: Loss = 3.2 (random weights, poor predictions)
Epoch 10: Loss = 1.5 (learning patterns)
Epoch 30: Loss = 0.4 (good embeddings formed)
Epoch 50: Loss = 0.1 (well-trained!)
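Putting the forward pass, loss, and backpropagation together gives a complete training loop. The sketch below is plain Python and deliberately swaps in ordinary stochastic gradient descent instead of Adam to keep the update to one line; the learning rate of 0.1, the seed, and all variable names are assumptions, and none of this is the demo's TensorFlow.js code.

```python
import math
import random

TEXT = ("the emperor and the empress explored the royal courtyard "
        "while the puppy chased a dragonfly along the lake "
        "as children jumped with shiny balloons near the playground")
tokens = TEXT.split()
vocab = list(dict.fromkeys(tokens))            # unique words, first-seen order
index = {word: i for i, word in enumerate(vocab)}
V, D, WINDOW = len(vocab), 5, 2

# (context_index, target_index) pairs from a sliding window.
pairs = [(index[tokens[j]], index[tokens[i]])
         for i in range(len(tokens))
         for j in range(max(0, i - WINDOW), min(len(tokens), i + WINDOW + 1))
         if j != i]

rng = random.Random(42)
W_in = [[rng.uniform(-0.5, 0.5) for _ in range(D)] for _ in range(V)]   # embeddings
W_out = [[rng.uniform(-0.5, 0.5) for _ in range(V)] for _ in range(D)]

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def train_epoch(learning_rate):
    """Run one pass over all pairs and return the mean cross-entropy loss."""
    total_loss = 0.0
    rng.shuffle(pairs)
    for context, target in pairs:
        hidden = W_in[context][:]              # one-hot input selects a row
        probs = softmax([sum(hidden[d] * W_out[d][k] for d in range(D))
                         for k in range(V)])
        total_loss += -math.log(probs[target])
        # Gradient of the loss w.r.t. the output scores: probs - one_hot(target).
        error = probs[:]
        error[target] -= 1.0
        # Gradient w.r.t. the hidden vector, computed before W_out changes.
        grad_hidden = [sum(W_out[d][k] * error[k] for k in range(V))
                       for d in range(D)]
        for d in range(D):
            for k in range(V):
                W_out[d][k] -= learning_rate * hidden[d] * error[k]
            W_in[context][d] -= learning_rate * grad_hidden[d]
    return total_loss / len(pairs)

losses = [train_epoch(learning_rate=0.1) for _ in range(50)]
print(f"epoch 1 loss: {losses[0]:.2f}, epoch 50 loss: {losses[-1]:.2f}")
embedding_for_empress = W_in[index["empress"]]   # 5 numbers, one row of W_in
```

The exact numbers depend on the seed and will differ from the illustrative figures above. On a single sentence the loss also cannot reach zero, because a context word such as "the" is paired with many different targets and the model has to split its probability among them.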

Step 5: t-SNE Visualization

We now have word embeddings, but there's a problem: whether they have 5 dimensions (as in Step 4) or 10 (as in the example below), we can't plot them directly! We need to compress them to 2D for visualization.

Enter t-SNE:

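t-SNE (t-distributed Stochastic Neighbor Embedding) is an algorithm that reduces high-dimensional data to 2D or 3D while trying to keep points that are close in the original space close in the plot. It preserves local neighborhoods well; distances between far-apart clusters are less reliable.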

The Transformation:

BEFORE t-SNE (10 dimensions - invisible):
"emperor" = [0.62, 0.73, 0.18, -0.25, 0.81, 0.12, -0.45, 0.33, 0.88, -0.19]
"empress" = [0.58, 0.76, 0.15, -0.28, 0.79, 0.14, -0.42, 0.31, 0.85, -0.17]
"puppy" = [-0.42, 0.15, 0.88, 0.65, -0.31, 0.52, 0.23, -0.61, 0.14, 0.77]
"dragonfly" = [-0.38, 0.18, 0.91, 0.62, -0.28, 0.49, 0.21, -0.58, 0.16, 0.74]

↓↓↓ t-SNE Magic ↓↓↓

AFTER t-SNE (2 dimensions - can plot!):
"emperor" = [2.3, 5.2]
"empress" = [2.5, 5.4] ← Close to emperor!
"puppy" = [7.1, 3.3]
"dragonfly" = [7.4, 3.1] ← Close to puppy!
"courtyard" = [4.2, 2.1]
"playground" = [4.5, 2.3] ← Close to courtyard!
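Closeness can also be checked numerically in the original space, before any 2D compression. A common measure is cosine similarity; the sketch below applies it to the 10-dimensional example vectors above. The function name `cosine_similarity` is just a label for this illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

emperor = [0.62, 0.73, 0.18, -0.25, 0.81, 0.12, -0.45, 0.33, 0.88, -0.19]
empress = [0.58, 0.76, 0.15, -0.28, 0.79, 0.14, -0.42, 0.31, 0.85, -0.17]
puppy = [-0.42, 0.15, 0.88, 0.65, -0.31, 0.52, 0.23, -0.61, 0.14, 0.77]

print(round(cosine_similarity(emperor, empress), 3))
print(round(cosine_similarity(emperor, puppy), 3))
```

For these example vectors the first value is very close to 1 and the second is negative, which matches what the 2D plot suggests: "emperor" sits next to "empress" and far from "puppy".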

The Visualization Plot:

Interactive plot: Hover over points to see words, zoom and pan to explore clusters

What You'll Discover:

Semantic Clusters Emerge (on a text this short, the groupings are suggestive and can shift between runs):

Royalty: "emperor", "empress", "royal" group together
Animals: "puppy", "dragonfly" are near each other
Places: "courtyard", "playground", "lake" form a cluster
Actions: "explored", "chased", "jumped" may group
Common words: "the", "and", "a" typically in their own area

t-SNE Parameters:

Perplexity (5-50):

Low (5-10): Focuses on local structure, very tight clusters
Medium (15-30): Balanced view, recommended for most cases
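High (40-50): Emphasizes global structure, looser clusters. Perplexity has to stay below the number of points, so this range only applies to vocabularies much larger than our 22 words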
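Perplexity has a precise meaning: in t-SNE it is 2 raised to the Shannon entropy (in bits) of a point's neighbor distribution, so it can be read as an effective number of neighbors. The helper below, with the hypothetical name `perplexity`, shows the idea on two hand-made distributions.

```python
import math

def perplexity(probabilities):
    """Return 2 ** H, where H is the Shannon entropy in bits."""
    entropy = -sum(p * math.log2(p) for p in probabilities if p > 0)
    return 2 ** entropy

print(perplexity([0.25, 0.25, 0.25, 0.25]))   # four equally weighted neighbors
print(perplexity([0.85, 0.05, 0.05, 0.05]))   # one dominant neighbor
```

Four equally weighted neighbors give a perplexity of 4, while a distribution dominated by one neighbor gives a value between 1 and 4. Setting the perplexity parameter tells t-SNE how many neighbors each word should effectively pay attention to.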

Learning Rate (10-200):

How fast points move during t-SNE optimization. Usually 10-50 works well.

Iterations (250-1000):

More iterations = better positioning, but slower. Usually 500 is sufficient.

Putting It All Together

Let's trace one word's complete journey from text to visualization:

The Journey of "empress":

Tokenization: "empress" is identified as token #3
Training Pairs: Creates pairs like:
- "and" → "empress"
- "the" → "empress"
- "explored" → "empress"
One-Hot Encoding: [0, 0, 0, 1, 0, 0, ...] (22 numbers)
Neural Network Training: Over 50 epochs, learns embedding:
[0.58, 0.76, 0.15, -0.28, 0.79] (5 numbers - dense representation!)
t-SNE Visualization: Compressed to [2.5, 5.4] for plotting
Appears next to "emperor" on the 2D plot ✓

Try It Yourself!

The Word2Vec demo makes all of this interactive and visual. You can experiment with different texts, adjust parameters, and see how word embeddings capture meaning.

What makes this tool special:

Runs entirely in your browser (TensorFlow.js)
No installation or setup required
Real-time visualization of embeddings
Educational and hands-on learning
Adjust parameters and see immediate results

Try the Word2Vec Demo →

Key Takeaway: Word2Vec transforms sparse, meaningless word representations into dense, meaningful vectors that capture semantic relationships. Words that appear in similar contexts end up with similar embeddings - this is how machines begin to "understand" language!

Written with passion for making AI accessible and understandable.

Gaurav Chopra

Gaurav is a Co-Founder of Eightgen AI
