Text Vectorization in NLP: How Machines Read Text (With Python Code)


Hi, I’m Vineet.

I’m exploring artificial intelligence, machine learning, and computing systems through hands-on projects, experiments, and writing. My interests span NLP, representation learning, software engineering, and applied research. I use this space to document what I learn, share practical insights, and build a strong technical foundation for research and industry.

Always learning. Always building.

Ever wondered how your phone figures out you love cheeseburgers just from skimming a few reviews?
Or how a spam filter catches shady emails in milliseconds?

The secret begins with one surprisingly simple yet absolutely crucial step:
turning messy human language into cold, hard numbers that machines can actually understand.

Welcome to text vectorization, the invisible bridge that connects everyday words to the mathematical world of machine learning models.

What actually is text vectorization?

At its core, text vectorization is the process of converting raw text, such as words, sentences, or entire documents, into numerical vectors (lists or arrays of numbers) so that machine learning models can process them.

Why do we need numbers at all?

Machine learning algorithms are built entirely on mathematical operations, including addition, multiplication, distance calculations, and gradients.
Words like "cheeseburger" or "awesome" have zero mathematical meaning on their own until we assign them coordinates in a number space.

Simple mental model

Imagine your entire collection of text (your "corpus") has only these unique words:

cheeseburger, love, I, reviews, find, best

→ That's your vocabulary (6 words).

Now every document is turned into a vector with exactly 6 dimensions (one per word):

  • "I love cheeseburger" → [1, 1, 1, 0, 0, 0]
    (1 = word appears once, 0 = doesn't appear)

  • "Find best cheeseburger reviews" → [1, 0, 0, 1, 1, 1]

The position (dimension) tells the model which word we're talking about, and the value tells it how much that word is present.
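To make the mapping concrete, here is a minimal sketch in plain Python. It assumes the six-word vocabulary above is fixed by hand and splits text on whitespace; the helper name `to_vector` is made up for this illustration and is not part of any library.

```python
from collections import Counter

# Vocabulary from the example above, in a fixed column order.
vocabulary = ["cheeseburger", "love", "i", "reviews", "find", "best"]

def to_vector(text, vocab):
    """Count how often each vocabulary word occurs in `text` (hypothetical helper)."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocab]

print(to_vector("I love cheeseburger", vocabulary))             # [1, 1, 1, 0, 0, 0]
print(to_vector("Find best cheeseburger reviews", vocabulary))  # [1, 0, 0, 1, 1, 1]
```

Words that are not in the vocabulary are simply ignored, which is why the vector length never changes.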

Visual example (for intuition)

Here’s what the full matrix looks like for three tiny documents:

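| Document | cheeseburger | love | I | reviews | find | best |
|---|---|---|---|---|---|---|
| I love cheeseburger | 1 | 1 | 1 | 0 | 0 | 0 |
| Find best cheeseburger reviews | 1 | 0 | 0 | 1 | 1 | 1 |
| Best reviews ever | 0 | 0 | 0 | 1 | 0 | 1 |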

This matrix is what a basic vectorizer produces once the vocabulary is fixed:
rows = documents, columns = words.
("ever" is not in our six-word vocabulary, so it is ignored here.)

Key concepts to understand

  • High-dimensional space
    Real-world datasets usually have thousands or even millions of unique words → vectors can easily reach 10,000–100,000+ dimensions.
    Computers handle this fine, but the vectors are very sparse (mostly zeros).

  • Sparse vs dense (quick preview)
    In classic vectorization (what we're covering here), most values are 0 → sparse representation.
    Later, with learned embeddings (Word2Vec, BERT, etc.), we get dense vectors where almost every value is a small non-zero number that captures semantic meaning.

  • The order is usually ignored
    In the simplest methods (the "Bag-of-Words" family), word order doesn't matter.
    "I love cheeseburger" and "cheeseburger love I" produce the same vector.
    (More advanced techniques preserve order; we'll discuss them in future posts.)
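Two of these ideas are easy to check in a few lines. The sketch below is only an illustration: the vocabulary size and the column indices are arbitrary numbers picked for the demo, and `bag_of_words` is a made-up helper.

```python
from collections import Counter

def bag_of_words(text):
    """Lowercase, split on whitespace, and count tokens (order is discarded)."""
    return Counter(text.lower().split())

# Word order does not affect the counts.
same = bag_of_words("I love cheeseburger") == bag_of_words("cheeseburger love I")
print(same)  # True

# Sparse storage: keep only the non-zero entries of a long vector.
vocab_size = 10_000            # assumed vocabulary size, for illustration only
dense = [0] * vocab_size
dense[42] = 1                  # arbitrary column indices
dense[977] = 2
sparse = {i: v for i, v in enumerate(dense) if v != 0}
print(sparse)                  # {42: 1, 977: 2}
print(len(sparse) / vocab_size)  # 0.0002
```

Storing two entries instead of ten thousand is the whole point of a sparse representation.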

The simplest version: Bag-of-Words with CountVectorizer

Now that we know what a vector looks like, let's create one the easiest way by simply counting words.

Core intuition

Picture this: you throw every word from a document into a bag, shake it, and only count which words appear and how many times, ignoring:

  • word order

  • grammar

  • sentence structure

  • punctuation

That's the whole idea.
The name Bag of Words comes directly from this: words are treated as an unordered collection (a multiset), and you just tally them up.

Why it actually works well

For many tasks, especially basic classification, the presence and frequency of certain words carry most of the signal, while exact order matters surprisingly little.

Examples:

  • Spam detection → words like "free", "viagra", "lottery" scream spam no matter the sentence

  • Sentiment analysis → lots of "great", "awesome" vs "boring", "terrible"

  • Topic classification → "goal", "penalty", "football" vs "stock", "earnings"

In these cases, raw word counts already give strong clues.

How CountVectorizer implements it

CountVectorizer (from scikit-learn) turns this intuition into numbers in three steps:

  1. Builds a vocabulary of all unique words across documents (lowercased + tokenized)

  2. Assigns each word a column index

  3. Counts occurrences → each document becomes a vector where value = word count

Result: a sparse matrix (mostly zeros)
→ rows = documents, columns = vocabulary words
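The three steps can be sketched by hand in plain Python. This is a simplified illustration, not scikit-learn's actual implementation: it assumes a regex that keeps tokens of two or more word characters (which mirrors the library's default token pattern) and sorts the vocabulary alphabetically.

```python
import re

documents = [
    "I love cheeseburger",
    "Cheeseburger is great",
    "I love great cheeseburger",
]

def tokenize(text):
    # Lowercase and keep tokens of 2+ word characters,
    # so single letters like "I" are dropped.
    return re.findall(r"\b\w\w+\b", text.lower())

# Step 1: build the vocabulary of unique words across all documents.
vocabulary = sorted({token for doc in documents for token in tokenize(doc)})

# Step 2: assign each word a column index.
column = {word: j for j, word in enumerate(vocabulary)}

# Step 3: count occurrences, one row per document.
matrix = []
for doc in documents:
    row = [0] * len(vocabulary)
    for token in tokenize(doc):
        row[column[token]] += 1
    matrix.append(row)

print(vocabulary)  # ['cheeseburger', 'great', 'is', 'love']
for row in matrix:
    print(row)
# [1, 0, 0, 1]
# [1, 1, 1, 0]
# [1, 1, 0, 1]
```

A real vectorizer stores the result in a sparse format instead of nested lists, but the logic is the same.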

Quick visual

Think of a giant spreadsheet:

  • Rows = documents

  • Columns = every unique word

  • Cells = count of that word in that document
    → in real corpora, often more than 99% zeros = very sparse

Quick code example

Let's see it in action with scikit-learn:

from sklearn.feature_extraction.text import CountVectorizer

# Our toy documents
documents = [
    "I love cheeseburger",
    "Cheeseburger is great",
    "I love great cheeseburger"
]

# Create and fit the vectorizer
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(documents)

# See what we got
print("Vocabulary:", vectorizer.get_feature_names_out())
# Output: ['cheeseburger' 'great' 'is' 'love']

print("\nSparse matrix (dense view):\n", X.toarray())
# Output:
# [[1 0 0 1]    ← "I love cheeseburger"
#  [1 1 1 0]    ← "Cheeseburger is great"
#  [1 1 0 1]]   ← "I love great cheeseburger"

Notice how:

  • The vocabulary is built automatically (only 4 unique words: text is lowercased, and the default tokenizer drops single-character tokens such as "I").

  • Each row is a document vector.

  • Values are simple counts (no weighting yet).

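  • The result is stored as a sparse matrix. This toy example is too small to show it, but with a realistic vocabulary almost every entry is zero.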

This is pure Bag-of-Words: order ignored, just raw frequencies.

Main limitations

  • No word order → "dog bites man" = "man bites dog"

  • No meaning/similarity → "good" and "great" are unrelated (different columns)

  • High dimensionality → thousands of columns

  • Common words ("the", "is") dominate unless removed

That's why we usually move to TF-IDF next (downweights frequent words) and later to dense embeddings that capture semantics.
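The first two limitations are easy to see for yourself. Here is a small sketch using only the standard library; `bow` and `cosine` are hypothetical helpers written for this demo, and cosine similarity is used as one common way to compare count vectors.

```python
import math
from collections import Counter

def bow(text):
    """Bag-of-Words counts for lowercased, whitespace-separated tokens."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two Counter-based count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

# No word order: both sentences get identical counts.
print(bow("dog bites man") == bow("man bites dog"))  # True

# No notion of similarity: "good" and "great" sit in unrelated columns,
# so these two one-word documents share nothing.
print(cosine(bow("good"), bow("great")))             # 0.0
```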

Smarter weighting: TF-IDF

Why CountVectorizer isn't always enough

In real-world text, extremely common words like "the", "is", "and", or even dataset-specific frequent terms, appear in almost every document.

They rack up high counts everywhere → but carry almost no discriminative power for topic, sentiment, or meaning.

TF-IDF fixes this by automatically downweighting words that are frequent across the whole collection, while boosting words that are frequent in one document but rare overall.

Quick intuition

TF-IDF score = (how often a word appears in this document) × (how rare the word is across all documents)

  • High score → word is important here and distinctive (not everywhere) → jackpot!

  • Low score → word is either rare (or absent) in this document or too common everywhere

It's like giving extra credit to words that truly characterize a document.

The formulas (gentle version)

  • TF (Term Frequency) = how often the word appears in this document
    (usually raw count, or normalized by document length)

  • IDF (Inverse Document Frequency) = log( total documents / documents containing the word )
    (The log prevents extreme values; +1 smoothing is common)

  • TF-IDF = TF × IDF

Higher IDF = rarer word → bigger boost
Lower IDF = common word → gets shrunk
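As a quick worked example, the sketch below computes IDF for the three toy cheeseburger documents used in this article. The document frequencies are entered by hand, and the function names are made up for the demo. The smoothed variant follows the formula documented for scikit-learn's default settings, ln((1 + n) / (1 + df)) + 1; other libraries may smooth differently.

```python
import math

def plain_idf(n_docs, df):
    """Textbook IDF: log(total documents / documents containing the word)."""
    return math.log(n_docs / df)

def smoothed_idf(n_docs, df):
    """Smoothed variant: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + df)) + 1

n_docs = 3
# Number of toy documents that contain each word.
doc_freq = {"cheeseburger": 3, "great": 2, "is": 1, "love": 2}

for word, df in doc_freq.items():
    print(f"{word}: plain={plain_idf(n_docs, df):.3f}, "
          f"smoothed={smoothed_idf(n_docs, df):.3f}")
# cheeseburger: plain=0.000, smoothed=1.000
# great: plain=0.405, smoothed=1.288
# is: plain=1.099, smoothed=1.693
# love: plain=0.405, smoothed=1.288
```

Notice that the plain formula gives a word that appears in every document an IDF of exactly 0, wiping it out completely, while the smoothed version keeps it at 1.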

Code example (same documents)

from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "I love cheeseburger",
    "Cheeseburger is great",
    "I love great cheeseburger"
]

tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(documents)

print("Vocabulary:", tfidf.get_feature_names_out())
# ['cheeseburger' 'great' 'is' 'love']

print("\nTF-IDF matrix (rounded):\n", X_tfidf.toarray().round(3))
# [[0.613 0.    0.    0.79 ]   ← "I love cheeseburger"
#  [0.425 0.548 0.72  0.   ]   ← "Cheeseburger is great"
#  [0.481 0.62  0.    0.62 ]]  ← "I love great cheeseburger"

Quick side-by-side comparison (Count vs TF-IDF)

| Document | cheeseburger (Count → TF-IDF) | great | is | love |
|---|---|---|---|---|
| I love cheeseburger | 1 → 0.613 | 0 → 0 | 0 → 0 | 1 → 0.790 |
| Cheeseburger is great | 1 → 0.425 | 1 → 0.548 | 1 → 0.720 | 0 → 0 |
| I love great cheeseburger | 1 → 0.481 | 1 → 0.620 | 0 → 0 | 1 → 0.620 |

What changed?

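  • "cheeseburger" appears in all three documents, so it gets the lowest IDF and ends up with the smallest non-zero weight in every row (0.425 to 0.613).

  • "is" appears in only one document of this tiny corpus, so TF-IDF actually gives it the highest weight (0.720). In a realistic corpus it would show up almost everywhere and be pushed down, which is also why stop-word removal is still useful.

  • "love" and "great" appear in two of the three documents and land in between.

  • Each row is L2-normalized by default: the squares of its values sum to 1.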

This makes the representation more meaningful for most downstream tasks (classification, clustering, similarity).
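If you want to check where these numbers come from, here is a hand-rolled sketch for the second document. It assumes raw counts as TF, the smoothed IDF ln((1 + n) / (1 + df)) + 1, and L2 normalization, which are scikit-learn's documented defaults; the helper names are hypothetical.

```python
import math

n_docs = 3
# Number of toy documents that contain each word.
doc_freq = {"cheeseburger": 3, "great": 2, "is": 1, "love": 2}
vocabulary = sorted(doc_freq)

def smooth_idf(df):
    return math.log((1 + n_docs) / (1 + df)) + 1

def tfidf_row(counts):
    """Raw count x smoothed IDF, then scale the row to unit L2 length."""
    weights = [counts.get(w, 0) * smooth_idf(doc_freq[w]) for w in vocabulary]
    norm = math.sqrt(sum(x * x for x in weights))
    return [x / norm for x in weights]

# "Cheeseburger is great"
row = tfidf_row({"cheeseburger": 1, "is": 1, "great": 1})
print([round(x, 3) for x in row])         # [0.425, 0.548, 0.72, 0.0]
print(round(sum(x * x for x in row), 6))  # 1.0
print(sum(row) > 1)                       # True
```

The squared values add up to 1, while the plain sum of the row is larger than 1. That is what L2 normalization means.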

Quick practical tips

  • Use stop_words='english' in both vectorizers to auto-remove "the", "is", etc.

  • Try max_features=5000 if your vocabulary explodes.

  • Set ngram_range=(1, 2) to add bigrams ("cheese burger") alongside single words if context matters a bit.

  • Prefer TfidfVectorizer over CountVectorizer for almost every real task, unless you're using count-based probabilistic models such as Multinomial Naive Bayes.
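To get a feel for what the first three tips do, here is a rough pure-Python analogue. It is not how scikit-learn implements these options: the stop-word list is a tiny made-up one, `extract_features` is a hypothetical helper, and "keep the most frequent features" only approximates the idea behind max_features.

```python
import re
from collections import Counter

documents = [
    "I love cheeseburger",
    "Cheeseburger is great",
    "I love great cheeseburger",
]
stop_words = {"is", "the", "and"}  # tiny hypothetical list, not scikit-learn's

def extract_features(text, ngram_range=(1, 2)):
    """Tokens minus stop words, expanded into n-grams of the requested sizes."""
    tokens = [t for t in re.findall(r"\b\w\w+\b", text.lower())
              if t not in stop_words]
    low, high = ngram_range
    features = []
    for n in range(low, high + 1):
        for i in range(len(tokens) - n + 1):
            features.append(" ".join(tokens[i:i + n]))
    return features

print(extract_features("Cheeseburger is great"))
# ['cheeseburger', 'great', 'cheeseburger great']

# Rough analogue of max_features: keep only the k most frequent features.
totals = Counter(f for doc in documents for f in extract_features(doc))
top = [feature for feature, _ in totals.most_common(3)]
print(top)
```

With bigrams switched on, the feature count grows quickly, which is exactly when a cap like max_features becomes handy.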

Summary & What's Next

We started with the big why: machines need numbers, not words.
We then built intuition around vectors, saw how simple Bag-of-Words (via CountVectorizer) turns text into count matrices, and improved it with TF-IDF to give meaningful weights instead of raw frequencies.

Key takeaways:

  • Vectorization = the essential bridge from human language to ML.

  • CountVectorizer → easy but naive (raw counts, common words dominate).

  • TF-IDF → smarter (downweights ubiquitous words, highlights distinctive ones).

  • Both are sparse → great for classical ML, but limited in capturing meaning or order.

This is still just the foundation. In the next post, we'll go deeper: we'll design and build a TF-IDF vectorizer from scratch in pure Python/NumPy (no scikit-learn) to really see what's happening under the hood and understand why libraries are so convenient.

If this helped you understand how your phone "reads" reviews or your email filter works, drop a clap, share it, or comment below:

  • What's your favorite NLP task?

  • Any questions about when to choose Count vs TF-IDF?

See you in the next one — where we roll up our sleeves and design it ourselves. 🚀