Thursday, December 31, 2020

Green Eggs and Zipf's Law

In the 1930s and 40s, American linguist George Kingsley Zipf wrote about a pattern found in written texts. Now known as Zipf's Law, the pattern shows an eerie similarity in the distribution of word frequencies across essentially all documents, whether they are books, essays, short stories or letters. If you take a given text, count up all the words and order them by frequency, you will find that the second most common word occurs roughly half as often as the first. You will also find that the third most common occurs about one third as often as the first, the fourth about one fourth as often, the fifth about one fifth, and so on, with each word occurring about 1/n as often as the most common word, where n is its place in the order.
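To make the 1/n rule concrete, here is a minimal Python sketch that counts the words in a text and puts each observed count next to the count Zipf's Law would predict from the most common word. The function name `zipf_table` and the sample sentence are made up for illustration, and the tokenizer is deliberately crude (lowercase letters and apostrophes only).

```python
import re
from collections import Counter

def zipf_table(text, top=5):
    """Return rows of (rank, word, observed count, Zipf-predicted count)."""
    # Crude tokenizer (an assumption): runs of lowercase letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    ranked = Counter(words).most_common(top)
    if not ranked:
        return []
    top_count = ranked[0][1]
    # Zipf's prediction for rank n is (count of the top word) / n.
    return [(rank, word, count, top_count / rank)
            for rank, (word, count) in enumerate(ranked, start=1)]

# Made-up sample sentence, far too short to say anything about Zipf itself.
sample = "the cat and the dog and the bird saw the cat"
for row in zipf_table(sample, top=3):
    print(row)
```

A text this short only shows the mechanics. The pattern itself tends to show up in much longer texts.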

Pretty neat and pretty weird, but as researchers from different disciplines began to look for distributions in other data sets, they started seeing the same pattern everywhere: city populations, website traffic, citations of scholarly articles, deaths in wars, TV ratings. How long you remember things follows this pattern. Even randomly generated texts follow Zipf.

We don't fully understand why language follows Zipf's law, but the basis of why all these very deliberate human systems follow the same pattern is somewhat simple: human behavior, even in systems as complex as cities and global societies, is still governed by natural laws and, ultimately, math. Even something like the population of a city, which is governed by what feel like very personal decisions like taking a job or moving to be closer to family, is subject to the ebbs and flows of pure numbers: the means and variances of statistics will, given time, determine how people are distributed among the cities of a country or of the world.

This is similar to another statistical idea: the Pareto principle, named for Italian economist Vilfredo Pareto. It states that roughly 80% of a given system's consequences are due to only 20% of that system's causes. More concretely, Pareto discovered that 80% of land in Italy was owned by 20% of people. This too has been applied to countless human systems: business managers will tell you that 80% of revenue comes from 20% of clients; IT departments and software engineers know that 80% of computer crashes come from 20% of the bugs; and 20% of your carpet endures 80% of the wear.
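The 80/20 split is easy to check for any list of numbers. The sketch below is one way to do it in Python: sort the values, take the largest 20%, and see what share of the total they hold. The name `top_share` and the holdings list are hypothetical and chosen so the arithmetic is easy to follow; they are not real data.

```python
def top_share(values, fraction=0.2):
    """Share of the total held by the largest `fraction` of the values."""
    ordered = sorted(values, reverse=True)
    total = sum(ordered)
    if not ordered or total == 0:
        return 0.0
    # Number of items in the top group (at least one).
    k = max(1, round(len(ordered) * fraction))
    return sum(ordered[:k]) / total

# Made-up holdings for ten owners, not real data.
holdings = [50, 30, 5, 4, 3, 3, 2, 1, 1, 1]
print(top_share(holdings))  # 0.8: the top 2 of 10 owners hold 80 of 100

# A perfectly even split gives the top 20% exactly 20%.
print(top_share([10] * 10))  # 0.2
```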

When I first learned about Zipf and Pareto my mind eventually went to the stark wealth inequality that exists in the US and in the world.  Do these seemingly ubiquitous natural laws mean that we are locked in to the current state of affairs with roughly 20% of people controlling 80% of the world's resources?  But that's when I found a way out: Dr. Seuss.  

Famously, Theodor Geisel, better known by his pen name, Dr. Seuss, wrote the book Green Eggs and Ham on a bet that he couldn't write a book using only 50 words. This bet produced one of the great children's author's most enduring classics. I wondered: could this fairly extreme constraint get around Zipf? I counted up all the words and plunked them into Excel. And this is what I found:


Green Eggs and Ham does not follow Zipf! Just looking at the first few words, "not," "I," and "them," you can pretty clearly see that the second and third appear much more often than 1/2 and 1/3 as frequently as the first. It's not even really close. Another way to look at these data is that, plotted on log-log axes, a Zipf distribution should be an almost perfectly straight line with slope = -1. This line, especially after you get to "could," is nowhere near straight; from that point on it's a gentle slope towards the last word.
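The "straight line with slope = -1" test can also be done numerically rather than by eye: take the logarithm of each rank and each count, then fit a least-squares line through those points. The Python sketch below assumes that approach. The name `loglog_slope` is hypothetical, and both count lists are synthetic; neither is the Green Eggs and Ham data.

```python
import math

def loglog_slope(counts):
    """Least-squares slope of log(count) against log(rank).

    Assumes at least two counts, all greater than zero.
    """
    ordered = sorted(counts, reverse=True)
    xs = [math.log(rank) for rank in range(1, len(ordered) + 1)]
    ys = [math.log(count) for count in ordered]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

# Synthetic counts that follow Zipf exactly: rank n gets 1200 / n.
ideal = [1200 / rank for rank in range(1, 51)]
print(round(loglog_slope(ideal), 3))  # -1.0

# Synthetic counts where all 50 words are equally common.
even = [24] * 50
print(abs(loglog_slope(even)) < 1e-9)  # True: a flat line has slope 0
```

The closer a set of word counts gets to a slope of 0, the more evenly the words are used; the closer to -1, the more Zipf-like the text.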

Green Eggs and Ham gives me hope.  If we just let systems motor along as they are, we'll inevitably end up with Zipf.  But if we manage to figure out a way to put some creative constraints on our larger systems, it looks like we may actually be able to push them towards equity.  I'm not remotely saying that will be an easy task but with the right tweaks, the right concerted effort in the right direction we might be able to pull a Zipf distribution into something a little more even.  I do so like green eggs and ham!  Thank you!  Thank you, Sam-I-am!

Sources:

https://upcommons.upc.edu/bitstream/handle/2117/180136/Ferrer-i-Cancho_EPJB_2005.pdf;jsessionid=0AA5762D175B3844791253058CFB645F?sequence=1

http://pdfs.semanticscholar.org/5a9a/0438af964d00249bafff11f0e85ef924a61a.pdf

http://pages.stern.nyu.edu/~xgabaix/papers/zipf.pdf

https://www.researchgate.net/profile/Wentian_Li/publication/253290454_Zipf's_Law_Everywhere/links/5cf59ef9299bf1fb185617ff/Zipfs-Law-Everywhere.pdf

https://www.biography.com/news/dr-seuss-green-eggs-and-ham-bet

https://betterexplained.com/articles/understanding-the-pareto-principle-the-8020-rule/

Vsauce: https://www.youtube.com/watch?v=fCn8zs912OE

Seuss, Dr. (1960). Green Eggs and Ham. New York, NY: Beginner Books.
