The Largest Vocabulary in Hip Hop
Matt Daniels is a designer, coder, and data scientist living in New York City. His past works include the Etymology of "Shorty" and Outkast, in graphs and charts. He decided to examine the vocabulary of hip hop artists, and this is what he found. – May 2014
Literary elites love to rep Shakespeare’s vocabulary: across his entire corpus, he uses 28,829 unique words, suggesting he knew over 100,000 words and arguably had the largest vocabulary ever.
I decided to compare this data point against the most famous
artists in hip hop. I used the first 35,000 words of each artist’s lyrics. That way,
prolific artists, such as Jay-Z, could be compared to newer artists,
such as Drake.
35,000 words covers 3-5 studio albums and EPs. I included mixtapes if the artist was just short
of the 35,000 words. Quite a few rappers don’t have enough official
material to be included (e.g., Biggie, Kendrick Lamar). As a benchmark, I
included data points for Shakespeare and Herman Melville, using the
same approach (35,000 words across several plays for Shakespeare, first
35,000 of Moby Dick).
I used a research methodology called token analysis to determine each artist’s vocabulary. Each unique word is counted once, and different forms of a word count separately, so pimps, pimp, pimping, and pimpin
are four unique words. To avoid issues with apostrophes (e.g., pimpin’
vs. pimpin), apostrophes are removed from the dataset. It still isn’t perfect.
Hip hop is full of slang that is hard to transcribe (e.g., shorty vs.
shawty), compound words (e.g., king sh*t), featured vocalists, and
It’s still directionally interesting. Of the 85 artists in the dataset, let’s take a look at who is on top.
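Here's a rough sketch of that counting approach in Python. It is not the code behind the chart, just one way the steps described above could look. The function names are made up for illustration, and lowercasing the text and splitting on non-word characters are my assumptions; the write-up only specifies removing apostrophes, counting each distinct word once, and capping at 35,000 words.

```python
import re

def tokenize(text):
    """Split lyrics into word tokens.

    Assumptions (not stated in the article): text is lowercased and
    split on anything that is not a letter, digit, or underscore.
    Apostrophes (straight and curly) are removed first, as described.
    """
    cleaned = re.sub(r"['\u2018\u2019]", "", text.lower())
    return re.findall(r"\w+", cleaned)

def unique_word_count(text, limit=35_000):
    """Count distinct tokens among the first `limit` tokens of `text`."""
    tokens = tokenize(text)[:limit]
    return len(set(tokens))

# No stemming: each spelling is its own word, but pimpin' and pimpin merge.
print(unique_word_count("pimps pimp pimping pimpin' pimpin"))  # 4
```

A splitter this simple will break a censored word such as sh*t into two tokens, which is the kind of transcription noise that keeps the results directional rather than exact.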
When I first published this analysis, I excluded Aesop Rock,
figuring he was too obscure. The Reddit hip hop community was in an uproar,
claiming Aesop would absolutely be #1. Sure enough, Aesop Rock
is well above every artist in my dataset and I was obliged to add him to
the chart. In fact, his data point is so far to the right that he should
be off the chart (I'm lazy and didn't adjust the scale).
#2, #6, #7, #9, #20, and #23 Wu-Tang Clan aint nothin ta wit
Wu-Tang Clan at #6 is impressive given
that 10 members, with vastly different styles, are equally contributing
lyrics. On top of that, GZA, Ghostface, Raekwon, and Method Man's solo
works are also in the top 20, most notably GZA at #2. Perhaps their countless hours of studio time together (and RZA’s mentorship) exposed each rapper to the others’ vocabularies.
Let’s take a deeper look at Wu-Tang’s first five studio albums to better
understand each member’s contribution. Here's a breakdown of the number
and percent of words used by each member.
To understand each rapper's vocabulary (# of unique words) in
Wu-Tang's first five albums, I chose a 3,500 word threshold so that each
person was on an equal footing. That way, I could include GZA, but
unfortunately I had to exclude Ol' Dirty Bastard, Cappadonna, and Masta
Killa, who have too few verses across Wu-Tang's corpus.
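The per-member comparison could be sketched along these lines. This is a hypothetical illustration, not the original analysis: the `(member, verse_text)` input shape, the function names, and the tokenizing rules (lowercase, strip apostrophes, split on non-word characters) are all assumptions. The only parts taken from the text are the 3,500-word threshold and the exclusion of anyone who falls short of it.

```python
import re

def tokenize(text):
    """Lowercase, drop apostrophes, split into word tokens (assumed rules)."""
    cleaned = re.sub(r"['\u2018\u2019]", "", text.lower())
    return re.findall(r"\w+", cleaned)

def vocab_by_member(verses, threshold=3_500):
    """Compare members on an equal number of words.

    verses: iterable of (member, verse_text) pairs in release order
            (a hypothetical input format).
    Returns (included, excluded): unique-word counts over each member's
    first `threshold` words, and the members with too few words.
    """
    tokens_by_member = {}
    for member, text in verses:
        tokens_by_member.setdefault(member, []).extend(tokenize(text))

    included, excluded = {}, []
    for member, tokens in tokens_by_member.items():
        if len(tokens) >= threshold:
            included[member] = len(set(tokens[:threshold]))
        else:
            excluded.append(member)
    return included, excluded
```

Cutting every member off at the same word count matters because unique-word totals keep growing with sample size, so a member with more verses would otherwise look more diverse simply for having rapped more.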
U-God and GZA clearly bolster the group’s average. Raekwon and
Method Man’s contributions have a lower average compared to other
members, but their data points would still exceed those of most artists
in hip hop.
#3 - 5 Kool Keith, Canibus, CunninLynguists
Moving past Wu-Tang’s dominance, the next three artists are less well known. Of the three, Kool Keith has the most diverse vocabulary. For a taste of his work, check out his album with the largest vocab: Dr. Octagonecologyst. #4 and #5 are two relatively underground (yet accomplished) acts: Jamaican-born rapper Canibus and southern-based group CunninLynguists.
#14 - 15 Outkast and E-40
Of course E-40 is
in the top 20; he’s credited with inventing a lot of slang. Just a
few terms that he’s been responsible for: all good, pop ya collar, shizzle,
and you feel me.
At #15, Outkast’s deep vocabulary is definitely a function of
their style: frequent use of portmanteaus (e.g., ATLiens, Stankonia),
southern drawl (e.g., nahmsayin, ery’day), and made-up slang (e.g.,
As expected, other southern-based acts aren’t in Outkast’s league. Take a look at the regional break-out below:
The South has the lowest average (4,268) and the East Coast the
highest (4,804). In fact, only 4 of the 17 southern-based artists in the
dataset are above average. My guess is that this is a function of crunk
music's call-and-response style, resulting in more repetition of words.
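For anyone who wants to run a regional break-out like this on their own numbers, here's a minimal sketch. It assumes one row per artist in the form `(artist, region, unique_words)`, which is a made-up layout, and it reads "above average" as above the average of the whole dataset, which the text doesn't spell out.

```python
from collections import defaultdict
from statistics import mean

def regional_summary(rows):
    """Summarize unique-word counts by region.

    rows: list of (artist, region, unique_words) tuples (assumed layout).
    For each region, returns its average, how many of its artists beat
    the overall dataset average, and how many artists it has.
    """
    overall = mean(count for _, _, count in rows)

    counts_by_region = defaultdict(list)
    for _, region, count in rows:
        counts_by_region[region].append(count)

    return {
        region: {
            "average": mean(counts),
            "above_overall": sum(c > overall for c in counts),
            "artists": len(counts),
        }
        for region, counts in counts_by_region.items()
    }
```

Reporting the count of artists above the overall average alongside each regional mean helps show whether a low average comes from the whole region or from a few outliers.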
#26 and #33 Busta Rhymes and Twista
Since both rappers are known for their speed, it’s nice to see that their verses are just as lyrically diverse as their peers.
And skipping ahead to the bottom of the dataset...
#67, #68, #71, and #72 Snoop Dogg, 2pac, Kanye West, and Lil Wayne
Some of the biggest names in hip hop were in the bottom 20%. Let’s take another look at the data:
While Lil Wayne has never been celebrated for the complexity of
his word choices, I expected 2pac, Snoop, and Kanye to be well above
It's also worth noting that Drake, one of the most popular artists of late, is #83 on this list.
At #85 and in last place: DMX. But this shouldn't undermine an
artist whose raw energy and honesty were the most memorable qualities of
So what's all this mean?
io9 writer Robert Gonzalez
blew my mind with this point: "On The Black Album track 'Moment of
Clarity,' Jay-Z contrasts his lyricism with that of Common and Talib
Kweli (both of whom 'rank' higher than him, when it comes to the
diversity of their vocabulary):
I dumbed down for my audience to double my dollars
They criticized me for it, yet they all yell "holla"
If skills sold, truth be told, I'd probably be
Lyrically Talib Kweli
Truthfully I wanna rhyme like Common Sense
But I did 5 mil - I ain't been rhyming like Common since