I know many of these statistics are out there on the internet, but searching for them takes time, so this post collects a bunch of them in one place (likely just for my future consumption).
English Language Statistics
Peter Norvig has a list of the 100,000 most frequent English words, derived from the “Google Web Trillion Word Corpus” [1].
English Word Length Relative Frequencies
I like relative frequencies, so here are some histograms showing English word length and relative frequency.
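As a rough sketch of how such a histogram can be computed (assuming Norvig’s list lives at the URL below and is one tab-separated word/count pair per line; weighting each word’s length by its corpus count is one reasonable reading of “relative frequency”):

```python
from collections import Counter
from urllib.request import urlopen

# Assumed location and format of Norvig's 100k word-count list;
# each line is taken to be "WORD<TAB>COUNT".
URL = "https://norvig.com/ngrams/count_1w100k.txt"

length_mass = Counter()
with urlopen(URL) as f:
    for line in f:
        word, count = line.decode("utf-8").strip().split("\t")
        # Weight each word's length by its corpus count, so the histogram
        # reflects how often each length occurs in running text.
        length_mass[len(word)] += int(count)

total = sum(length_mass.values())
pmf = {n: c / total for n, c in sorted(length_mass.items())}
```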
English Word Length Cumulative Frequencies
It’s also useful to have cumulative frequencies around, so here they are.
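The cumulative frequencies fall straight out of the PMF; a minimal sketch, reusing the `pmf` dict from the snippet above:

```python
import numpy as np

lengths = sorted(pmf)                       # word lengths, ascending
cmf = np.cumsum([pmf[n] for n in lengths])  # running total; final entry is 1.0
```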
Tokenizer Statistics
tiktoken is a “fast BPE tokeniser for use with OpenAI’s models”. I was curious how token lengths have looked across OpenAI’s models over time, so I generated plots, treating OpenAI’s tokenizers as (mostly) representative of tokenizers in general.
Disclaimer: I took each token’s bytes, decoded them into UTF-8 strings, and computed the string lengths. I think this should be fine, but I haven’t checked it as closely as I normally would.
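For what it’s worth, here is roughly what that computation looks like with tiktoken (a sketch, not the exact code behind the plots; the handling of undecodable ids and invalid UTF-8 is my own guess):

```python
from collections import Counter

import tiktoken

enc = tiktoken.get_encoding("o200k_base")  # or cl100k_base, p50k_base, r50k_base, gpt2

lengths = Counter()
for token_id in range(enc.n_vocab):
    try:
        token_bytes = enc.decode_single_token_bytes(token_id)
    except KeyError:
        continue  # some ids (e.g. gaps around special tokens) don't decode
    # A single token's bytes need not be valid UTF-8 on their own,
    # so decode permissively and measure the resulting string.
    text = token_bytes.decode("utf-8", errors="replace")
    lengths[len(text)] += 1

total = sum(lengths.values())
pmf = {n: c / total for n, c in sorted(lengths.items())}
```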
The GPT-4o Tokenizer
The GPT-4o tokenizer is o200k_base [2]. Both the PMF and CMF are plotted below.
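The plots themselves are straightforward to reproduce; a matplotlib sketch using the `pmf` computed in the snippet above (the styling is mine, not the post’s):

```python
import matplotlib.pyplot as plt
import numpy as np

xs = sorted(pmf)              # token lengths (in characters)
ps = [pmf[x] for x in xs]
cs = np.cumsum(ps)            # cumulative mass

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.bar(xs, ps)
ax1.set(xlabel="token length (chars)", ylabel="probability", title="o200k_base PMF")
ax2.plot(xs, cs)
ax2.set(xlabel="token length (chars)", ylabel="cumulative probability", title="o200k_base CMF")
fig.tight_layout()
plt.show()
```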
The GPT-4 Tokenizer
The GPT-4 tokenizer is cl100k_base [2]. Both the PMF and CMF are plotted below.
The Codex Tokenizer
The Codex tokenizer is p50k_base. PMF and CMF found below.
The GPT-3 Tokenizer
The original GPT-3 tokenizer is r50k_base. PMF and CMF found below.
The GPT-2 Tokenizer
The original GPT-2 tokenizer is gpt2. PMF and CMF found below.
The Llama 3 Tokenizer
The Llama 3 tokenizer (not included in tiktoken) has, for lack of a better word, a surprising token length distribution [3].
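Since Llama 3’s tokenizer isn’t in tiktoken, the same statistic can be computed via Hugging Face transformers; a sketch, assuming access to the gated meta-llama/Meta-Llama-3-8B checkpoint (the repo name here is my assumption):

```python
from collections import Counter

from transformers import AutoTokenizer

# Assumes the gated meta-llama checkpoint; substitute any Llama 3 tokenizer repo.
tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")

# Decode each id individually and measure the resulting string,
# mirroring the tiktoken approach above.
lengths = Counter(len(tok.decode([i])) for i in range(tok.vocab_size))

total = sum(lengths.values())
pmf = {n: c / total for n, c in sorted(lengths.items())}
```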