Frequency Analyzer

Analyze character, word or bigram frequencies in any text

What is it and how does it work?

Character and word frequency analysis counts how many times each character or word appears in a text. It is the fundamental technique behind classical cryptanalysis: in English, E is the most common letter (about 12.7%), followed by T (9.1%), A (8.2%), O (7.5%), I (7.0%) and N (6.7%). If a letter in a monoalphabetic substitution ciphertext appears in roughly 13% of positions, it is very likely E, provided the text is long enough for the statistics to be reliable. Given enough ciphertext, frequency analysis breaks simple monoalphabetic substitution ciphers, and polyalphabetic ciphers were later designed to resist it.
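As a minimal sketch of the counting step described above, the following Python function builds a letter frequency table. The function name `char_frequencies` is hypothetical, and the choice to count only A-Z case-insensitively is an assumption for illustration, not necessarily how this tool behaves.

```python
from collections import Counter


def char_frequencies(text):
    """Return (letter, count, percent) rows, most frequent first.

    Assumption: only the letters A-Z are counted, case-insensitively;
    spaces, digits and punctuation are ignored.
    """
    letters = [c for c in text.upper() if "A" <= c <= "Z"]
    total = len(letters)
    if total == 0:
        return []
    counts = Counter(letters)
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(ch, n, 100.0 * n / total) for ch, n in ranked]


if __name__ == "__main__":
    for ch, n, pct in char_frequencies("Hello, World!"):
        print(f"{ch}  {n:3d}  {pct:5.1f}%")
```

On a ciphertext, the top rows of this table are the candidates to match against E, T, A and so on. Short inputs give noisy percentages, so the match is only a starting guess.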

Modern uses go beyond cryptography: word frequency analysis identifies the most common terms in a corpus for keyword research, NLP preprocessing (stopword lists, TF-IDF), content analysis (which words dominate a politician's speeches), and stylometric analysis (identifying an author by their characteristic vocabulary). This tool shows frequency tables with counts, percentages, and visualisations for both character and word frequency in any text you paste.
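A word frequency table can be sketched in much the same way. The tokenizer below is a deliberately simple assumption (lowercase runs of letters and apostrophes), and `word_frequencies` is a hypothetical name; real NLP pipelines usually use a more careful tokenizer.

```python
import re
from collections import Counter


def word_frequencies(text, top=10):
    """Return the `top` most common words as (word, count, percent) rows.

    Assumption: a "word" is a run of ASCII letters or apostrophes,
    compared case-insensitively.
    """
    words = re.findall(r"[a-z']+", text.lower())
    total = len(words)
    counts = Counter(words)
    # Descending count, then alphabetical order for stable output on ties.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(w, n, 100.0 * n / total) for w, n in ranked[:top]]


if __name__ == "__main__":
    sample = "The cat and the hat. The end."
    for word, n, pct in word_frequencies(sample, top=3):
        print(f"{word:<6} {n:3d}  {pct:5.1f}%")
```

The words that dominate such a table in ordinary English text ("the", "of", "and") are the usual candidates for a stopword list.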

Common use cases

Frequently asked questions

What are the most common letters in English?

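In order of frequency: E (12.7%), T (9.1%), A (8.2%), O (7.5%), I (7.0%), N (6.7%), S (6.3%), H (6.1%), R (6.0%), D (4.3%), L (4.0%), C (2.8%). The sequence "ETAOIN SHRDLU", familiar to old Linotype operators from the first two columns of keys on the machine, approximates this order. It ends in U rather than C because the two letters are nearly tied at about 2.8%, and exact rankings vary with the corpus.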

What is the difference between character frequency and word frequency?

Character frequency counts individual letters (most implementations ignore spaces and punctuation). Word frequency counts whole words as tokens. For cryptanalysis, character frequency is key. For NLP and content analysis, word frequency is more useful, either normalized as relative frequency (a word's count divided by the total number of words) or weighted with TF-IDF, which down-weights words that appear in most documents of a corpus.
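To make the difference between relative frequency and TF-IDF concrete, here is a small sketch. It assumes one common textbook variant, tf = count / document length and idf = ln(N / document frequency); libraries often use smoothed variants that give different numbers. The function name `tf_idf` is hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Score each term in each document.

    docs: list of token lists. Returns one {term: score} dict per document.
    Assumed formula: tf = count / len(doc), idf = ln(n_docs / doc_frequency).
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))

    scores = []
    for doc in docs:
        term_counts = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in term_counts.items()
        })
    return scores


if __name__ == "__main__":
    corpus = [["the", "cat"], ["the", "dog"]]
    for doc_scores in tf_idf(corpus):
        print(doc_scores)
```

In this toy corpus "the" occurs in every document, so its score is zero, while "cat" and "dog" keep a positive weight even though all three words have the same relative frequency within their document.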

What is Zipf's Law in word frequency?

Zipf's Law states that in natural language the frequency of a word is roughly inversely proportional to its rank: the 2nd most common word appears about half as often as the 1st, the 3rd about a third as often, and so on. In English, "the" appears about twice as often as "of" and roughly three times as often as "and". This power-law distribution shows up in almost all natural language corpora.
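One way to check this on your own text is to compare each word's observed count with the count Zipf's Law would predict from the top word (top count divided by rank). The sketch below does that; `zipf_table` is a hypothetical helper, and it expects a list of already tokenized words.

```python
from collections import Counter


def zipf_table(words, top=5):
    """Return (rank, word, observed_count, zipf_predicted_count) rows.

    The prediction is the idealized form of Zipf's Law:
    predicted(rank) = count_of_top_word / rank.
    """
    ranked = sorted(Counter(words).items(), key=lambda kv: (-kv[1], kv[0]))[:top]
    if not ranked:
        return []
    top_count = ranked[0][1]
    return [
        (rank, word, count, top_count / rank)
        for rank, (word, count) in enumerate(ranked, start=1)
    ]


if __name__ == "__main__":
    tokens = ["the"] * 6 + ["of"] * 3 + ["and"] * 2
    for rank, word, observed, predicted in zipf_table(tokens):
        print(f"{rank}  {word:<4} observed={observed}  predicted={predicted:.1f}")
```

Real corpora only follow the prediction approximately, and small samples can deviate a lot, so treat the predicted column as a reference curve rather than an exact expectation.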

How does Index of Coincidence differ from simple frequency analysis?

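Simple frequency analysis counts character occurrences. The Index of Coincidence (IC) measures the probability that two characters chosen at random from a text are the same. English plaintext has IC ≈ 0.065; uniformly random letters have IC ≈ 0.038 (1/26). IC is used to detect polyalphabetic ciphers: Vigenère ciphertext has an IC between the random and English values, moving closer to random as the key gets longer. To estimate the key length, split the ciphertext into N columns for each candidate N. At the correct length every column is a monoalphabetic cipher, so the column IC rises back toward the English value, and frequency analysis can then be applied to each column.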
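The IC can be computed directly from letter counts as the sum of n(n - 1) over all letters, divided by N(N - 1), where n is each letter's count and N is the total number of letters. The sketch below implements that formula and a column-averaged version for testing candidate key lengths. Both function names are hypothetical, and counting only A-Z is an assumption.

```python
from collections import Counter


def index_of_coincidence(text):
    """Probability that two letters drawn without replacement are equal.

    Assumption: only A-Z are counted, case-insensitively.
    Returns 0.0 for texts with fewer than two letters.
    """
    letters = [c for c in text.upper() if "A" <= c <= "Z"]
    n = len(letters)
    if n < 2:
        return 0.0
    counts = Counter(letters)
    return sum(k * (k - 1) for k in counts.values()) / (n * (n - 1))


def average_column_ic(text, key_len):
    """Average IC of the key_len columns formed by taking every key_len-th letter."""
    letters = [c for c in text.upper() if "A" <= c <= "Z"]
    columns = ["".join(letters[i::key_len]) for i in range(key_len)]
    return sum(index_of_coincidence(col) for col in columns) / key_len


if __name__ == "__main__":
    ciphertext = "ABABABAB"  # toy input, not a real ciphertext
    for candidate in (1, 2, 3):
        print(candidate, round(average_column_ic(ciphertext, candidate), 3))
```

For a real Vigenère ciphertext, the candidate key length whose average column IC comes closest to the English value is a reasonable first guess. Multiples of the true key length tend to score well too.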
