How can language diversity be measured, and what are the most linguistically diverse parts of the world? This interactive map represents my first attempt at answering these questions.
Using population data scraped off of Ethnologue, the linguistic diversity of each of the world’s countries is quantified as the entropy of the probability distribution over the country’s languages. The resulting entropy figures are plotted on a color-coded map with the help of Mike Bostock‘s amazing d3.js library.
The word “entropy” evokes a range of vaguely negative associations. Entropy measures disorder; entropy is what prevails after the heat death of the universe; and so on. What is entropy, and why is it suitable as a measure of language diversity?
Entropy, as the word is used in information theory, quantifies the uncertainty associated with a set of outcomes. The larger the set and the more even the dispersal of probability over different outcomes, the greater the entropy.
Entropy provides a very natural characterization of the diversity of populations grouped into types (e.g. of a country’s population grouped by first language). Each type can be treated as a member of the set of outcomes, and the proportion of the overall population associated with the type can be used as that outcome’s probability. The entropy of this set of outcomes encodes not only the number of different types but also the evenness of the distribution. By using entropy as our diversity index, a set with many outcomes that overwhelmingly favors one over the others (e.g. a population with many languages where speakers of one language outnumber all others a thousand to one) will be rated as less diverse than a set with fewer but more equiprobable outcomes. This is a more intuitive measure of diversity than, for example, a mere count of types. Others are possible, but I thought this was a promising way to start.
So entropy is actually a good thing, when we’re looking at people coexisting! A highly entropic distribution over languages is characteristic of a genuine diversity and mutual respect for difference. Other examples aren’t hard to imagine: an entropic distribution over political views might characterize a thriving democracy, for example. I hope this map does a little to rehabilitate entropy in the minds of at least a few readers.
All data for this map were collected from the 2009 web edition of Ethnologue with the help of a simple Python script written and run on the Scraperwiki platform. The data comprise a language name, an ISO 3166-1 alpha-2 country code, and a speaker population figure for each language spoken in each country in the Ethnologue database.
For each country, a probability distribution over languages was constructed by summing the speaker populations counts of each language and dividing each language’s count by the total. The entropy of this distribution was computed in the usual way with a Clojure script.
Using d3.js and d3.geo.js, each country’s entropy is plotted on the globe. The globe can be rotated by clicking and dragging.
As described by the scale in the upper left corner of the visualization, each country shape on the map is filled with a color whose deepness is proportional to the country’s linguistic entropy. The overall frequency distribution for the various levels of linguistic entropy is also displayed in a bar graph above that same scale. The colors for the scale were chosen with the help of Colorbrewer.
Since the shape file encodes political boundaries slightly different from Ethnologue’s, some adjustments were necessary to patch up blank areas (e.g. assigning South Sudan to Sudan, the West Bank to Israel, etc.).
I assume that few will be surprised to find that Africa is as rich in language diversity as it is in other forms of human diversity. Nor is it very surprising that India, a veritable continent of cultures comparable to the combined European Union, is also extremely diverse.
But this map still had a few surprises for me, particularly in its extremities.
On the high end of the scale, even though I was well aware of Papua New Guinea’s diversity, I was still astounded at how thoroughly it whipped the rest of the world.
At the other end, to quote what my co-blogger David asked me in surprise: “So how is Australia the same color as Antarctica?”—a very good question! Doesn’t Australia have a vast number of indigenous languages? Another friend of mine also found it surprising that Brazil was so pale.
But this is what we get by using entropy as our diversity index. Yes, Australia and Brazil have a vast number of languages. But those countries’ majority languages, English and Portuguese, utterly dwarf every other language in the country in numbers of speakers. So these countries’ overall diversity, as measured by entropy—the evenness of the distribution over languages—is very low. Compare the situation in the Democratic Republic of the Congo, where each of the country’s many languages is spoken by a large share of the population.