TF-IDF Calculator

Rank terms by importance within a document relative to a corpus

TF-IDF (Term Frequency-Inverse Document Frequency) is a numerical statistic that reflects how important a word is to a document relative to a collection of documents (corpus). It is one of the most fundamental techniques in information retrieval and natural language processing, used in search engines, document clustering, and keyword extraction.

How to Use This Calculator

  1. Target document — paste the single document you want to analyze.
  2. Corpus — paste multiple documents, one per line. The IDF component uses this corpus to penalize common words.
  3. Top N terms — how many highest-scoring terms to display (default 10).
  4. The calculator shows each term's TF, IDF, TF-IDF score, and document frequency in the corpus.
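The steps above can be sketched in plain Python. This is a minimal reimplementation under stated assumptions (letters-only lowercase tokenization, no stop-word removal), not the calculator's actual code; the smoothing matches the formula given below:

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Lowercase and keep runs of letters; a real tool may also strip stop words.
    return re.findall(r"[a-z]+", text.lower())

def top_tfidf(document, corpus, n=10):
    """Return the n highest-scoring (term, tfidf) pairs for `document`."""
    doc_tokens = tokenize(document)
    counts = Counter(doc_tokens)
    total = len(doc_tokens)
    corpus_sets = [set(tokenize(d)) for d in corpus]
    num_docs = len(corpus_sets)
    scores = {}
    for term, count in counts.items():
        tf = count / total
        df = sum(term in doc for doc in corpus_sets)
        idf = math.log((num_docs + 1) / (df + 1)) + 1  # smooth IDF
        scores[term] = tf * idf
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:n]

corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "convolution layers power modern vision models",
]
print(top_tfidf("the convolution layer applies a convolution to the input", corpus, n=3))
```

With this toy corpus, "convolution" ranks first: it appears twice in the target document but in only one corpus document, so both its TF and its IDF are high.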

The TF-IDF Formula

TF(t, d) = (number of times term t appears in document d) / (total terms in d)

IDF(t) = log((N + 1) / (df(t) + 1)) + 1   (smooth IDF)

TF-IDF(t, d) = TF(t, d) × IDF(t)

Where N is the total number of documents and df(t) is how many documents contain term t. Smooth IDF adds 1 to both the numerator and the denominator, preventing division by zero for terms that never appear in the corpus; the +1 outside the logarithm keeps every IDF positive.
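Plugging numbers in (values chosen purely for illustration): a term appearing 5 times in a 100-word document, found in 1 of 10 corpus documents, scores as follows:

```python
import math

def tf(count, total_terms):
    return count / total_terms

def smooth_idf(df, n_docs):
    # log((N + 1) / (df + 1)) + 1, matching the formula above
    return math.log((n_docs + 1) / (df + 1)) + 1

# TF = 5/100 = 0.05; smooth IDF = log(11/2) + 1 ≈ 2.705
score = tf(5, 100) * smooth_idf(1, 10)
print(round(score, 3))  # → 0.135
```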

Why TF-IDF Works

High TF means the term appears frequently in the document. High IDF means the term is rare across the corpus. Their product is highest for terms that are frequent in the document but rare overall — exactly the kind of distinctive keywords that characterize a document's topic.

Common words like "the", "is", and "and" sit at the minimum IDF the formula allows, because they appear in nearly every document; with smooth IDF that floor is exactly 1, which is why stop words are usually removed outright rather than relied on to score low. Specialized terms like "convolution" in a machine learning document have high TF-IDF because they appear rarely across general text.
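Under the smooth variant above, even a word present in every document bottoms out at IDF = 1 rather than 0, thanks to the +1 outside the logarithm. A quick check:

```python
import math

def smooth_idf(df, n_docs):
    # log((N + 1) / (df + 1)) + 1
    return math.log((n_docs + 1) / (df + 1)) + 1

n = 50
print(smooth_idf(n, n))  # a word in all 50 documents: exactly 1.0
print(smooth_idf(2, n))  # a word in only 2 documents: much higher
```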

Real-World Examples

Search engines: early ranking functions relied on TF-IDF-style term weighting to determine which documents were most relevant to a query; BM25, a refinement of the same idea, remains in wide use today.

Document clustering: TF-IDF vectors are used as features in k-means clustering to group similar articles together.

Keyword extraction: The top TF-IDF terms in a news article can serve as automatic keywords for tagging and summarization.

Spam filtering: words like "discount" and "offer" that score highly across a training corpus of spam become strong features for classifiers separating spam from legitimate mail.
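As a sketch of how TF-IDF vectors serve as features (pure Python, example documents invented for illustration): a real pipeline would hand these vectors to k-means or a classifier, but the cosine similarity those algorithms build on can be shown directly:

```python
import math
import re
from collections import Counter

def tfidf_vector(document, corpus):
    """Map each term in `document` to its TF-IDF score against `corpus`."""
    tokens = re.findall(r"[a-z]+", document.lower())
    counts = Counter(tokens)
    docs = [set(re.findall(r"[a-z]+", d.lower())) for d in corpus]
    n = len(docs)
    vec = {}
    for term, count in counts.items():
        df = sum(term in d for d in docs)
        vec[term] = (count / len(tokens)) * (math.log((n + 1) / (df + 1)) + 1)
    return vec

def cosine(u, v):
    # Cosine similarity over sparse dict vectors; 1.0 means identical direction.
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = lambda w: math.sqrt(sum(x * x for x in w.values()))
    return dot / (norm(u) * norm(v))

corpus = [
    "the market rallied as interest rates fell",
    "the central bank raised interest rates again",
    "the team won the championship game in overtime",
]
vecs = [tfidf_vector(d, corpus) for d in corpus]
# The two finance headlines are closer to each other than to the sports one.
print(cosine(vecs[0], vecs[1]), cosine(vecs[0], vecs[2]))
```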

Frequently Asked Questions

What does TF-IDF measure?
TF-IDF measures the relative importance of a word in a specific document compared to its frequency across a whole corpus. Words that appear often in one document but rarely elsewhere get high scores — these are typically the most informative and distinctive terms.
What is the difference between TF and IDF?
TF (Term Frequency) measures how often a term appears in the target document normalized by document length. IDF (Inverse Document Frequency) measures how rare the term is across all documents — common words get low IDF. TF-IDF multiplies them together.
Why are stop words removed?
Stop words like "the", "and", "is" appear in almost every document, so they sit at the minimum IDF the formula allows (exactly 1 with smooth IDF). Removing them reduces noise, speeds up computation, and makes the results more meaningful by focusing on content words.
Why does the corpus need multiple documents?
IDF requires comparing across documents. With only one document, every term has an IDF of 1 and TF-IDF degenerates into just term frequency. The more diverse the corpus, the more meaningful the IDF scores become.
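The degenerate case is easy to verify: in a one-document corpus, every term that appears has df = 1 and N = 1, so smooth IDF is log(2/2) + 1 = 1 and TF-IDF reduces to TF:

```python
import math

def smooth_idf(df, n_docs):
    return math.log((n_docs + 1) / (df + 1)) + 1

# One-document corpus: every present term has df = 1, N = 1
print(smooth_idf(1, 1))  # → 1.0
```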
What is smooth IDF?
Smooth IDF adds 1 to both the numerator and denominator: log((N+1)/(df+1)) + 1. This prevents division by zero for terms not appearing in the corpus, and the +1 outside the log ensures the result is always positive.
Can TF-IDF handle multi-word phrases?
Standard TF-IDF operates on individual tokens (unigrams). For phrase-level analysis you would need n-grams. This calculator uses unigrams only. For bigram TF-IDF, preprocess your text to join common phrases before pasting.
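A minimal sketch of that preprocessing step (the underscore-joining convention is an illustrative assumption, not a feature of this calculator):

```python
def join_bigrams(text, phrases):
    """Replace known multi-word phrases with single underscore-joined tokens."""
    for phrase in phrases:
        text = text.replace(phrase, phrase.replace(" ", "_"))
    return text

doc = "the neural network learns a neural network representation"
print(join_bigrams(doc, ["neural network"]))
# Every "neural network" becomes the single token "neural_network"
```

After this pass, the joined tokens flow through unigram TF-IDF unchanged, effectively giving phrase-level scores.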