TF-IDF Calculator

Rank terms by importance within a document relative to a corpus

TF-IDF (Term Frequency-Inverse Document Frequency) is a numerical statistic that reflects how important a word is to a document relative to a collection of documents (corpus). It is one of the most fundamental techniques in information retrieval and natural language processing, used in search engines, document clustering, and keyword extraction.

How to Use This Calculator

  1. Target document — paste the single document you want to analyze.
  2. Corpus — paste multiple documents, one per line. The IDF component uses this corpus to penalize common words.
  3. Top N terms — how many highest-scoring terms to display (default 10).
  4. The calculator shows each term's TF, IDF, TF-IDF score, and document frequency in the corpus.
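The steps above can be sketched in plain Python. This is a minimal reimplementation under stated assumptions (letters-only lowercase tokenization, no stop-word removal), not the calculator's actual code; the smoothing matches the formula given below:

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Lowercase and keep runs of letters; a real tool may also strip stop words.
    return re.findall(r"[a-z]+", text.lower())

def top_tfidf(document, corpus, n=10):
    """Return the n highest-scoring (term, tfidf) pairs for `document`."""
    doc_tokens = tokenize(document)
    counts = Counter(doc_tokens)
    total = len(doc_tokens)
    corpus_sets = [set(tokenize(d)) for d in corpus]
    num_docs = len(corpus_sets)
    scores = {}
    for term, count in counts.items():
        tf = count / total
        df = sum(term in doc for doc in corpus_sets)
        idf = math.log((num_docs + 1) / (df + 1)) + 1  # smooth IDF
        scores[term] = tf * idf
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:n]

corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "convolution layers power modern vision models",
]
print(top_tfidf("the convolution layer applies a convolution to the input", corpus, n=3))
```

With this toy corpus, "convolution" ranks first: it appears twice in the target document but in only one corpus document, so both its TF and its IDF are high.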

The TF-IDF Formula

TF(t, d) = (number of times term t appears in document d) / (total terms in d)

IDF(t) = log((N + 1) / (df(t) + 1)) + 1   (smooth IDF)

TF-IDF(t, d) = TF(t, d) × IDF(t)

Where N is the total number of documents and df(t) is how many documents contain term t. Smooth IDF adds 1 to both the numerator and the denominator, preventing division by zero for terms that never appear in the corpus; the +1 outside the logarithm keeps every IDF positive.
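Plugging numbers in (values chosen purely for illustration): a term appearing 5 times in a 100-word document, found in 1 of 10 corpus documents, scores as follows:

```python
import math

def tf(count, total_terms):
    return count / total_terms

def smooth_idf(df, n_docs):
    # log((N + 1) / (df + 1)) + 1, matching the formula above
    return math.log((n_docs + 1) / (df + 1)) + 1

# TF = 5/100 = 0.05; smooth IDF = log(11/2) + 1 ≈ 2.705
score = tf(5, 100) * smooth_idf(1, 10)
print(round(score, 3))  # → 0.135
```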

Why TF-IDF Works

High TF means the term appears frequently in the document. High IDF means the term is rare across the corpus. Their product is highest for terms that are frequent in the document but rare overall — exactly the kind of distinctive keywords that characterize a document's topic.

Common words like "the", "is", and "and" sit at the minimum IDF the formula allows, because they appear in nearly every document; with smooth IDF that floor is exactly 1, which is why stop words are usually removed outright rather than relied on to score low. Specialized terms like "convolution" in a machine learning document have high TF-IDF because they appear rarely across general text.
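Under the smooth variant above, even a word present in every document bottoms out at IDF = 1 rather than 0, thanks to the +1 outside the logarithm. A quick check:

```python
import math

def smooth_idf(df, n_docs):
    # log((N + 1) / (df + 1)) + 1
    return math.log((n_docs + 1) / (df + 1)) + 1

n = 50
print(smooth_idf(n, n))  # a word in all 50 documents: exactly 1.0
print(smooth_idf(2, n))  # a word in only 2 documents: much higher
```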

Real-World Examples

Search engines: early ranking functions relied on TF-IDF-style term weighting to determine which documents were most relevant to a query; BM25, a refinement of the same idea, remains in wide use today.

Document clustering: TF-IDF vectors are used as features in k-means clustering to group similar articles together.

Keyword extraction: The top TF-IDF terms in a news article can serve as automatic keywords for tagging and summarization.

Spam filtering: words like "discount" and "offer" that score highly across a training corpus of spam become strong features for classifiers separating spam from legitimate mail.
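As a sketch of how TF-IDF vectors serve as features (pure Python, example documents invented for illustration): a real pipeline would hand these vectors to k-means or a classifier, but the cosine similarity those algorithms build on can be shown directly:

```python
import math
import re
from collections import Counter

def tfidf_vector(document, corpus):
    """Map each term in `document` to its TF-IDF score against `corpus`."""
    tokens = re.findall(r"[a-z]+", document.lower())
    counts = Counter(tokens)
    docs = [set(re.findall(r"[a-z]+", d.lower())) for d in corpus]
    n = len(docs)
    vec = {}
    for term, count in counts.items():
        df = sum(term in d for d in docs)
        vec[term] = (count / len(tokens)) * (math.log((n + 1) / (df + 1)) + 1)
    return vec

def cosine(u, v):
    # Cosine similarity over sparse dict vectors; 1.0 means identical direction.
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = lambda w: math.sqrt(sum(x * x for x in w.values()))
    return dot / (norm(u) * norm(v))

corpus = [
    "the market rallied as interest rates fell",
    "the central bank raised interest rates again",
    "the team won the championship game in overtime",
]
vecs = [tfidf_vector(d, corpus) for d in corpus]
# The two finance headlines are closer to each other than to the sports one.
print(cosine(vecs[0], vecs[1]), cosine(vecs[0], vecs[2]))
```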

Frequently Asked Questions

What does TF-IDF measure?
TF-IDF measures the relative importance of a word in a specific document compared to its frequency across a whole corpus. Words that appear often in one document but rarely elsewhere get high scores — these are typically the most informative and distinctive terms.
What is the difference between TF and IDF?
TF (Term Frequency) measures how often a term appears in the target document normalized by document length. IDF (Inverse Document Frequency) measures how rare the term is across all documents — common words get low IDF. TF-IDF multiplies them together.
Why are stop words removed?
Stop words like "the", "and", "is" appear in almost every document, so they sit at the minimum IDF the formula allows (exactly 1 with smooth IDF). Removing them reduces noise, speeds up computation, and makes the results more meaningful by focusing on content words.
Why does the corpus need multiple documents?
IDF requires comparing across documents. With only one document, every term has an IDF of 1 and TF-IDF degenerates into just term frequency. The more diverse the corpus, the more meaningful the IDF scores become.
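The degenerate case is easy to verify: in a one-document corpus, every term that appears has df = 1 and N = 1, so smooth IDF is log(2/2) + 1 = 1 and TF-IDF reduces to TF:

```python
import math

def smooth_idf(df, n_docs):
    return math.log((n_docs + 1) / (df + 1)) + 1

# One-document corpus: every present term has df = 1, N = 1
print(smooth_idf(1, 1))  # → 1.0
```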
What is smooth IDF?
Smooth IDF adds 1 to both the numerator and denominator: log((N+1)/(df+1)) + 1. This prevents division by zero for terms not appearing in the corpus, and the +1 outside the log ensures the result is always positive.
Can TF-IDF handle multi-word phrases?
Standard TF-IDF operates on individual tokens (unigrams). For phrase-level analysis you would need n-grams. This calculator uses unigrams only. For bigram TF-IDF, preprocess your text to join common phrases before pasting.
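A minimal sketch of that preprocessing step (the underscore-joining convention is an illustrative assumption, not a feature of this calculator):

```python
def join_bigrams(text, phrases):
    """Replace known multi-word phrases with single underscore-joined tokens."""
    for phrase in phrases:
        text = text.replace(phrase, phrase.replace(" ", "_"))
    return text

doc = "the neural network learns a neural network representation"
print(join_bigrams(doc, ["neural network"]))
# Every "neural network" becomes the single token "neural_network"
```

After this pass, the joined tokens flow through unigram TF-IDF unchanged, effectively giving phrase-level scores.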