
Mapping language diversity

How can language diversity be measured, and what are the most linguistically diverse parts of the world? This interactive map represents my first attempt at answering these questions.

Using population data scraped from Ethnologue, the linguistic diversity of each of the world’s countries is quantified as the entropy of the probability distribution over the country’s languages. The resulting entropy figures are plotted on a color-coded map with the help of Mike Bostock’s amazing d3.js library.

Why entropy?

The word “entropy” evokes a range of vaguely negative associations. Entropy measures disorder; entropy is what prevails after the heat death of the universe; and so on. What is entropy, and why is it suitable as a measure of language diversity?

Entropy, as the word is used in information theory, quantifies the uncertainty associated with a set of outcomes. The larger the set and the more even the dispersal of probability over different outcomes, the greater the entropy.

Entropy provides a very natural characterization of the diversity of populations grouped into types (e.g. of a country’s population grouped by first language). Each type can be treated as a member of the set of outcomes, and the proportion of the overall population associated with the type can be used as that outcome’s probability. The entropy of this set of outcomes encodes not only the number of different types but also the evenness of the distribution. With entropy as our diversity index, a set with many outcomes that overwhelmingly favors one over the others (e.g. a population with many languages where speakers of one language outnumber all others a thousand to one) is rated as less diverse than a set with fewer but more equiprobable outcomes. This is a more intuitive measure of diversity than, for example, a mere count of types. Other measures are possible, but I thought this was a promising way to start.
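
The contrast described above can be checked with a few lines of Python. This is only an illustrative sketch: the two populations are made up, and measuring entropy in bits (base-2 logarithm) is an assumption, since the post does not say which base was used.

```python
import math

def entropy(proportions):
    """Shannon entropy, in bits, of a list of proportions that sum to 1."""
    # Zero-probability outcomes contribute nothing, so they are skipped.
    return sum(p * math.log2(1 / p) for p in proportions if p > 0)

# Hypothetical population 1: ten languages, one of which outnumbers
# each of the other nine a thousand to one.
skewed_counts = [1000] + [1] * 9
skewed = [c / sum(skewed_counts) for c in skewed_counts]

# Hypothetical population 2: only four languages, all equally common.
even = [0.25] * 4

print("skewed:", entropy(skewed))
print("even:  ", entropy(even))
```

The four-language population comes out at exactly 2 bits, while the ten-language population scores far lower, even though it has more than twice as many types.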

So entropy is actually a good thing, when we’re looking at people coexisting! A highly entropic distribution over languages is characteristic of a genuine diversity and mutual respect for difference. Other examples aren’t hard to imagine: an entropic distribution over political views might characterize a thriving democracy, for example. I hope this map does a little to rehabilitate entropy in the minds of at least a few readers.

Data

All data for this map were collected from the 2009 web edition of Ethnologue with the help of a simple Python script written and run on the Scraperwiki platform. The data comprise a language name, an ISO 3166-1 alpha-2 country code, and a speaker population figure for each language spoken in each country in the Ethnologue database.

For each country, a probability distribution over languages was constructed by summing the speaker population counts of all the country’s languages and dividing each language’s count by that total. The entropy of this distribution was computed in the usual way with a Clojure script.
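
The original computation was done in Clojure, and that script is not reproduced here. As a rough sketch of the same steps in Python, assuming records shaped as (language name, country code, speaker count) tuples and entropy measured in bits, the calculation might look like the following. The function name and the sample records are hypothetical, not Ethnologue figures.

```python
import math
from collections import defaultdict

def country_entropies(records):
    """Return {country_code: entropy in bits} for an iterable of
    (language, country_code, speakers) records."""
    # Group the speaker counts by country.
    counts_by_country = defaultdict(list)
    for _language, country, speakers in records:
        if speakers > 0:  # languages with no reported speakers add nothing
            counts_by_country[country].append(speakers)

    entropies = {}
    for country, counts in counts_by_country.items():
        total = sum(counts)
        # Each language's probability is count / total; the entropy is
        # the sum of p * log2(1 / p) over the country's languages.
        entropies[country] = sum(
            (c / total) * math.log2(total / c) for c in counts
        )
    return entropies

# Made-up records for illustration only.
sample_records = [
    ("Language A", "XX", 500),
    ("Language B", "XX", 500),
    ("Language C", "YY", 1000),
]
result = country_entropies(sample_records)
print(result)
```

In this toy input, country XX splits evenly between two languages and scores 1 bit, while the monolingual country YY scores 0.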

Visualization

Using d3.js and d3.geo.js, each country’s entropy is plotted on the globe. The globe can be rotated by clicking and dragging.

As described by the scale in the upper left corner of the visualization, each country shape on the map is filled with a color whose depth is proportional to the country’s linguistic entropy. The overall frequency distribution for the various levels of linguistic entropy is also displayed in a bar graph above that same scale. The colors for the scale were chosen with the help of Colorbrewer.
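
The map itself was built in JavaScript with d3.js. Purely to illustrate the idea of a proportional color scale and its accompanying bar graph, here is a small Python sketch. It assumes equal-width color classes between zero and the maximum entropy; the number of classes, the function names, and the sample values are all hypothetical and are not taken from the actual visualization code.

```python
def color_class(value, max_value, n_classes):
    """Map an entropy value to a color class index from 0 (palest)
    to n_classes - 1 (deepest), using equal-width bins."""
    if max_value <= 0:
        return 0
    index = int(value / max_value * n_classes)
    # The maximum value itself would land one past the last bin.
    return min(index, n_classes - 1)

def class_histogram(values, max_value, n_classes):
    """Count how many values fall into each color class."""
    counts = [0] * n_classes
    for v in values:
        counts[color_class(v, max_value, n_classes)] += 1
    return counts

# Made-up entropy values, binned into four classes.
sample_entropies = [0.0, 0.5, 1.0, 2.0, 4.0]
bins = class_histogram(sample_entropies, max_value=4.0, n_classes=4)
print(bins)
```

Each index would then be looked up in a sequential palette, such as one chosen with Colorbrewer, and the per-class counts supply the heights of the bars.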

The GeoJSON shape file for the world map was borrowed from Mike Bostock’s interactive d3.geo.azimuthal example—as indeed were portions of the visualization’s JavaScript code, in particular the code which rotates the map.

Since the shape file encodes political boundaries that differ slightly from Ethnologue’s, some adjustments were necessary to patch up blank areas (e.g. assigning South Sudan to Sudan and the West Bank to Israel).

Remarks

I assume that few will be surprised to find that Africa is as rich in language diversity as it is in other forms of human diversity. Nor is it very surprising that India, a veritable continent of cultures comparable to the combined European Union, is also extremely diverse.

But this map still had a few surprises for me, particularly in its extremities.

On the high end of the scale, even though I was well aware of Papua New Guinea’s diversity, I was still astounded at how thoroughly it whipped the rest of the world.

At the other end, to quote what my co-blogger David asked me in surprise: “So how is Australia the same color as Antarctica?”—a very good question! Doesn’t Australia have a vast number of indigenous languages? Another friend of mine also found it surprising that Brazil was so pale.

But this is what we get by using entropy as our diversity index. Yes, Australia and Brazil have a vast number of languages. But those countries’ majority languages, English and Portuguese, utterly dwarf every other language in the country in numbers of speakers. So these countries’ overall diversity, as measured by entropy—the evenness of the distribution over languages—is very low. Compare the situation in the Democratic Republic of the Congo, where each of the country’s many languages is spoken by a large share of the population.

Update July 9, 2014: two regrets

I’ve had a long time to think about this post, my first on this blog—and indeed on any blog associated with my name. It has given me its fair share of regrets. I’m writing this update to make two of those regrets public.

When this post was first making the rounds (in its small way), it quickly drew the ire of a certain linguistics professor who complained of the naivety and crudeness of the quantitative analysis it contained (giving a detailed list of suggestions for refinement) and who accused me of doing “violence against knowledge”. I was shocked: hadn’t I made it clear in my accompanying text that I was just fooling around with a dataset and that I was fully aware that my work was so simple as to be meaningless?

My attitude showed a real lack of understanding of the way infographics are consumed. Now as when I first posted the map, it is in the nature of infographics that they are re-circulated without reflection. They are first and foremost graphics, images that impart pleasure and attract attention: publishing a graphic is an implicit invitation to call attention to a set of conclusions, and it should not be expected that viewers will reflect on the analysis itself. A graphic is meant to be passed around and admired. That’s what it’s for. It’s irresponsible to publish an infographic knowing that the analysis it represents is half-assed.

So my first regret is simply that I committed such an irresponsibility, publishing a very weak analysis in an eye-grabbing clickbait form and probably misleading a number of people.

My second regret is the gallingly sappy and naive perspective I inscribed alongside the map. “A genuine diversity and mutual respect for difference”? Really, Mr. Ashton! And this in the DRC?

How did I arrive at that insanity? Well, when I was first gearing up to quit grad school, I was energized by a feeling that I had to go out and do some good in this suffering world of ours and that my detached life in the academy had prevented me from knowing what that meant. That was of course true, as far as it went … but it also meant that my first attempts to work through the problem upon leaving the ivory tower inevitably amounted to filling my lungs with the prevailing ideology of my environs and class. The result was a simple-minded celebration of the liberal Canadian value of multiculturalism—and a bewilderingly wrongheaded attempt to read the world as though this value were already realized.

This update is not the place to explain my changed political perspective. Suffice it to say that I now appreciate the reality of conflict, and I regret the sunny attitude I show above, which is foolish and offensive.

5 thoughts on “Mapping language diversity”

  1. Chung-chieh Shan

    Fascinating! Thanks.

    It feels slightly unfair to smaller countries that each country is judged on its entire population’s linguistic diversity (so if France and Germany merged then the resulting country would be darker than either ex-country). But I’m not sure how to correct for it.

  2. Christian DiCanio

    While I laud your attempt to make sense of what “diversity” entails, isn’t entropy just a measure of how many speakers of a lingua franca exist in a country? In all cases, this will dwarf the number of speakers of all the other languages. For instance, there are over a billion speakers of Mandarin in China, but this is the state language and most people in the country do not speak it as their only language.

    I’m also confused about how you think greater diversity is equated with greater tolerance for difference. Any concrete examples of this?

  3. nmashton Post author

    I think the speaker population figure that Ethnologue reports is a ballpark estimate of the number of self-proclaimed first-language speakers. Ethnologue gives the number of speakers of Mandarin as 840,000,000, for example. So perhaps entropy is a measure of the extent to which people are inclined to declare their “first language” as something other than their state’s administrative language on the census.

    As for examples of diversity being coextensive with tolerance: ah, you caught me in wishful thinking, I think! Perhaps I was distracted by my map’s bright, cheery colors. I was assuming that where diversity exists, it is stable and healthy and happy for all involved. But the violent impermanence of state boundaries (redrawn significantly even in the three years since the data was collected) suggests otherwise. That, as well as what Chung-chieh points out about the unfairness of state lines to smaller countries, suggests that some unit other than the state might be more appropriate for a map like this…

  4. John Emerson

    Points should be given for how distantly related the languages are, in which case the Caucasus would win. As I understand, the “Caucasian languages” actually consist of two or three unrelated groups which are also not related to any group outside the Caucasus.

  5. Pingback: Blogspektrogramm 3/2013 – Sprachlog
