TF-IDF Calculator
Rank terms by importance across a document and corpus
TF-IDF (Term Frequency-Inverse Document Frequency) is a numerical statistic that reflects how important a word is to a document relative to a collection of documents (corpus). It is one of the most fundamental techniques in information retrieval and natural language processing, used in search engines, document clustering, and keyword extraction.
How to Use This Calculator
- Target document — paste the single document you want to analyze.
- Corpus — paste multiple documents, one per line. The IDF component uses this corpus to penalize common words.
- Top N terms — how many highest-scoring terms to display (default 10).
- The calculator shows each term's TF, IDF, TF-IDF score, and document frequency in the corpus.
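The steps above can be sketched in a few lines of Python. This is only an illustration of the workflow, not the calculator's actual code: the `tokenize` and `top_terms` names are hypothetical, the tokenizer (lowercase, split on letters, digits, and apostrophes) is an assumption, and so is the choice to count only the pasted corpus lines, not the target document, toward N. The IDF uses the smooth formula from the next section with a natural logarithm.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Assumption: lowercase, then keep runs of letters, digits, and apostrophes.
    return re.findall(r"[a-z0-9']+", text.lower())

def top_terms(target, corpus_text, top_n=10):
    """Return (term, tf, idf, tfidf, df) rows for the top_n terms of target."""
    # One corpus document per non-empty line, mirroring the Corpus input.
    docs = [set(tokenize(line)) for line in corpus_text.splitlines() if line.strip()]
    n_docs = len(docs)
    tokens = tokenize(target)
    rows = []
    for term, count in Counter(tokens).items():
        tf = count / len(tokens)
        df = sum(1 for d in docs if term in d)          # document frequency
        idf = math.log((n_docs + 1) / (df + 1)) + 1     # smooth IDF
        rows.append((term, tf, idf, tf * idf, df))
    # Highest score first; ties are broken alphabetically for stable output.
    rows.sort(key=lambda r: (-r[3], r[0]))
    return rows[:top_n]

target = "the convolution layer applies a convolution filter"
corpus_text = """the cat sat on the mat
the layer of paint is dry
a filter cleans the water"""

for term, tf, idf, score, df in top_terms(target, corpus_text, top_n=3):
    print(f"{term:12s} tf={tf:.3f} idf={idf:.3f} tfidf={score:.3f} df={df}")
```

In this toy input, "convolution" ranks first because it appears twice in the target and in none of the corpus lines.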
The TF-IDF Formula
TF(t, d) = (number of times term t appears in document d) / (total terms in d)
IDF(t) = log((N + 1) / (df(t) + 1)) + 1 (smooth IDF)
TF-IDF(t, d) = TF(t, d) × IDF(t)
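Where N is the total number of documents in the corpus and df(t) is the number of documents that contain term t. Adding 1 to both N and df(t) acts as if one extra document contained every term, which prevents division by zero when a term appears in no corpus document. The final + 1 keeps terms that appear in every document from receiving an IDF of zero.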
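As a rough check of the formulas, here is a minimal Python sketch. The function names are hypothetical, documents are assumed to be pre-tokenized lists of words, and the logarithm is assumed to be natural (the formula above does not state a base).

```python
import math

def tf(term, doc_tokens):
    """Occurrences of term divided by the total number of tokens in the document."""
    if not doc_tokens:
        return 0.0
    return doc_tokens.count(term) / len(doc_tokens)

def smooth_idf(term, corpus_tokens):
    """log((N + 1) / (df + 1)) + 1, with a natural log (an assumption)."""
    n_docs = len(corpus_tokens)
    df = sum(1 for doc in corpus_tokens if term in doc)
    return math.log((n_docs + 1) / (df + 1)) + 1

def tf_idf(term, doc_tokens, corpus_tokens):
    return tf(term, doc_tokens) * smooth_idf(term, corpus_tokens)

corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "ran"],
    ["the", "cat", "ran", "home"],
]
doc = corpus[0]

print(smooth_idf("the", corpus))   # in all 3 documents: log(4/4) + 1 = 1.0
print(smooth_idf("dog", corpus))   # in 1 document:      log(4/2) + 1 ≈ 1.693
print(smooth_idf("bird", corpus))  # in 0 documents:     log(4/1) + 1 ≈ 2.386
print(tf_idf("cat", doc, corpus))  # (1/3) * (log(4/3) + 1) ≈ 0.429
```

Note that the smooth IDF never drops below 1, so a term found in every document still contributes its raw TF to the score.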
Why TF-IDF Works
High TF means the term appears frequently in the document. High IDF means the term is rare across the corpus. Their product is highest for terms that are frequent in the document but rare overall — exactly the kind of distinctive keywords that characterize a document's topic.
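Common words like "the", "is", and "and" score low because they appear in nearly every document, so their IDF sits at or near its minimum (1 under the smooth formula above, 0 under the unsmoothed log(N / df(t))). Specialized terms like "convolution" in a machine learning document score high because few documents in a general corpus contain them.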
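A small experiment makes the contrast concrete. The sketch below uses a made-up four-document corpus with whitespace tokenization and a natural-log smooth IDF (all assumptions for illustration), and compares the most frequent word in a document with its highest TF-IDF term.

```python
import math
from collections import Counter

# Toy corpus: the first document is the one being analyzed.
corpus = [
    "the network applies convolution and the convolution uses the kernel".split(),
    "the cat sat on the mat".split(),
    "the dog chased the ball".split(),
    "the rain fell on the roof".split(),
]
doc = corpus[0]
n_docs = len(corpus)

counts = Counter(doc)
scores = {}
for term, count in counts.items():
    df = sum(1 for d in corpus if term in d)       # documents containing the term
    idf = math.log((n_docs + 1) / (df + 1)) + 1    # smooth IDF
    scores[term] = (count / len(doc)) * idf

by_count = max(counts, key=counts.get)
by_tfidf = max(scores, key=scores.get)
print(by_count)  # the
print(by_tfidf)  # convolution
```

Raw frequency picks "the" (3 occurrences), while TF-IDF promotes "convolution": it occurs only twice, but its IDF of about 1.92 outweighs the IDF of 1.0 that "the" gets for appearing in every document.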
Real-World Examples
Search engines: TF-IDF and related weighting schemes such as BM25 are classic signals for scoring how relevant a document is to a query term.
Document clustering: TF-IDF vectors are used as features in k-means clustering to group similar articles together.
Keyword extraction: The top TF-IDF terms in a news article can serve as automatic keywords for tagging and summarization.
Spam filtering: TF-IDF weights for words like "discount" and "offer", computed over a training corpus of labeled email, can serve as features for a classifier that separates spam from legitimate messages.
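To illustrate the search engine case, the sketch below ranks documents by the summed TF-IDF of the query terms. It is a deliberately simplified scoring scheme, not how any particular search engine works: the `rank` function is hypothetical, tokenization is a plain whitespace split, and the IDF is the smooth variant with a natural log.

```python
import math
from collections import Counter

def rank(query, docs):
    """Return document indices ordered by summed TF-IDF of the query terms."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    results = []
    for i, tokens in enumerate(tokenized):
        counts = Counter(tokens)
        total = max(len(tokens), 1)  # guard against empty documents
        score = 0.0
        for term in query.lower().split():
            df = sum(1 for t in tokenized if term in t)
            idf = math.log((n_docs + 1) / (df + 1)) + 1
            score += (counts[term] / total) * idf
        results.append((score, i))
    # Best score first; ties keep the original document order.
    return [i for score, i in sorted(results, key=lambda r: (-r[0], r[1]))]

docs = [
    "the cat sat on the mat",
    "convolution layers in a neural network",
    "the network was down all day",
]
print(rank("neural network", docs))  # [1, 2, 0]
```

Document 1 matches both query terms, document 2 matches only "network", and document 0 matches neither, so it scores zero.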