The Wisdom of the Language

Note: This document is a “work in progress”; I will post links to previous editions here. It was first published Thursday, 25 Jan 2007 on omidyar.net, but is no longer available at that address. Please come and discuss with us / share your opinion: here’s the discussion thread.

OK, the goal of this thread will be to FF from “web 2.0” to “online 4.6”. ;D

Please note that although I am first going to get a little heavy WRT “theory” (or something like that), at the end there will (hopefully) be a practical result (sorry, no prizes here ;)

Let me start at the very beginning: links. Links are great: they let you say something and show how the thing that you’re saying is related to other things that have been said before. The World Wide Web did not invent links; before the WWW, this stuff was called “footnotes”. What the WWW did was to transform written stuff (i.e. “records” and/or “recordings” [in the most general sense of those terms]) into machine-readable formats (e.g. “text” into “text files”, “images” into “image files”, audio [records/recordings] into “audio files”). Some of the data in such machine-readable formats can be analyzed algorithmically. Even before Google started doing statistical evaluations of links, such methods were used to evaluate academic research (“publish or perish”, for instance, was gauged using e.g. “citation abstracts”). And even earlier than that, Zipf analyzed words in written text, and this was perhaps one of the first forays into the field known as “computational linguistics” (and/or “natural language processing”).
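To make the Zipf idea a bit more concrete, here is a minimal sketch (in Python) of the kind of counting involved: rank the words of a text by how often they occur. The function name rank_frequency and the sample sentence are my own inventions for illustration, and a one-line sample is far too small to show Zipf's law itself (in large texts, a word's frequency is roughly inversely proportional to its rank).

```python
import re
from collections import Counter


def rank_frequency(text):
    """Return (rank, word, count) tuples, most frequent word first."""
    # Crude tokenizer (an assumption): lowercase runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    return [(rank, word, count)
            for rank, (word, count) in enumerate(counts.most_common(), start=1)]


# Hypothetical sample text, invented for this sketch.
sample = "the cat and the dog and the bird saw the house"

for rank, word, count in rank_frequency(sample)[:3]:
    # In a large corpus, rank * count would stay roughly constant.
    print(rank, word, count, rank * count)
```

Note that this only counts word forms; it says nothing about what any of the words mean.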

Although some of the insights gained in the field of linguistics have been very enlightening, it became clear to me (though perhaps not so obvious to others) that even though some aspects of language (e.g. the lexicon — especially the most “basic” words) are very static over long periods of time (e.g. words like “house”, “home”, “happy”, “sad”, “you”, “me”, “he”, “she”, “it”, “and”, “the” …), the fact that languages are constantly in flux will make it next to impossible to “calculate” individual statements or even entire texts (i.e., to “figure out their meaning”).

Therefore, although Google’s PageRank (which is, as mentioned above, quite similar to methods that have been used for several generations) might be a good statistical method for evaluating importance, the methods used by Google’s search engine for evaluating relevance are comparatively poor (even though they may be among the “best of breed”); it may simply be that the methods of computational linguistics and/or natural language processing are insufficient for the rather complex task.
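For readers who have not seen it, the importance side of this can be sketched in a few lines. What follows is the textbook power-iteration form of PageRank, not Google’s actual system; the toy link graph, the damping value of 0.85 and the handling of pages without outgoing links are all assumptions made for the sake of the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Textbook PageRank by power iteration.

    links maps each page to the list of pages it links to.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with the "random jump" share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on in equal parts to the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumption: a page with no outgoing links spreads its rank evenly.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank


# Hypothetical four-page web: "a", "b" and "d" all link to "c"; "c" links back to "a".
toy_web = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
scores = pagerank(toy_web)
print(max(scores, key=scores.get))  # the page with the highest score
```

The calculation only ever looks at who links to whom. Nothing in it reads a single word on any page, which is why it can say something about importance but nothing about relevance.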

Initially Google’s failure to discern the meanings of texts (and to evaluate whether such texts might be relevant for solving problem/task X) was not readily apparent, since in the early years of the WWW the vast majority of texts were written in a highly structured, almost “artificial” language (a “restricted” language; see e.g. the work of Basil Bernstein on “socio-linguistics”), in the form of “academic papers”, “articles for scientific journals”, “research reports”, etc. Therefore, a large degree of Google’s success relied on the “way the web was”: namely, academic.

That has since changed. First commerce came online, and in the meantime (in the United States) the wider population has gotten online. At the moment, the web is characterized by its academic legacy and also by its primarily North American population. Although other technologically advanced economies (such as Germany and other parts of northern/eastern Europe) may even be “ahead” of the “state of the art” in the United States in some respects (e.g. integration with “mobile devices”), the “fact” that, in the early days of the web, the United States used to be a free and open economy and also home to some of the world’s most respected research communities gave the United States a head start on the web (and therefore the developments in the rest of the world have in the past decade been largely overshadowed by the “attention skew” given to the United States). Today, people are recognizing this skew, and slowly they are becoming more aware of the global economy (e.g. “The World is Flat”).

Moving right along, we need to recognize that it is not only “the economy” that is becoming more and more a global phenomenon: language is also undergoing globalization. If someone sitting somewhere in the United States types in “ringtones”, then he/she is no longer referring to something distinctly American. Indeed, although this term appears to be “English”, it is by no means clear that “ringtones” is an American (or British or Australian or whatever) phenomenon; it is, actually, a global phenomenon.

There is almost no question that the spread of English as “lingua franca” will be “speeded up” by the Internet (though perhaps some other languages, maybe a dozen, will continue to be used alongside English). What this means for information retrieval on the Internet is that we have gone from a very specific and particularly restricted language to a very generic and extremely fuzzy language. This has perhaps played an important role in the rise of “communities”, since within such communities people are again able to use a more “restricted language” that is “tailored” to their specific needs (i.e. jargon).

Because of the early web focus on the United States, many of the first communities popped up there (incidentally, the “message boards” of the late 80’s and early 90’s are quite different from the type of communities that are currently popular; more on that in a moment). Because these communities were usually relatively close together geographically (and in this regard perhaps more importantly: within the national boundaries of the United States), it has become quite common for these communities to “meet up” in “meatspace” / “face to face”. Currently, my hunch is that much of the excitement around the “communities concept” revolves around creating understanding by building on “like-mindedness”, without necessarily relying on “speaking the same (focused, restricted, targeted, etc.) language”. Therefore, the “wisdom of the crowds” seems to be: 1. generalized and 2. localized (in the very early days of “message boards”, participants were relatively “few and far between”, and therefore the “communities” of those early days, which were generally topically focused, are perhaps even prototypical for future developments).

Now let’s take it “a step further”: what do words like “freedom”, “democracy”, etc. mean in Africa, Asia, Europe, etc.? Do they mean something different than those terms in the United States? What about “download”, or “movie”? Is there such a thing as a “community of artists”? Would such a community know national boundaries? Would such a community be focused / targeted / restricted enough to “meet up” face to face? Would it be necessary to meet up face to face?

I think not. I think that the “wisdom of the language” will ultimately make the “wisdom of the crowds” superfluous, especially because language, unlike crowds, scales well (e.g., it is easy to use language to focus on the “ringtones” community or the “movie” community [etc.]). OK: Bring on the language! Bring on the crowds! Let’s communicate!!!

:D nmw

ps: Ah, and on the topic of the promised “practical result”. Well, I’m sorry to say that was a hoax (only kidding ;). No, the practical result (well, you probably could guess it) is that communities will become more and more terminologically focused, as in something like: “when we say democracy, we mean invading countries that have oil”. And since such people will want to meet up (at least “virtually”) with other “like-minded” people, they will not search the entire web for “democracy” (since other people might use that term differently); they will simply register a domain (like “democracy.us” or “democracy.tv” or whatever [“democracy.mil”?]) and then they will talk about stuff related to that “mindset” (or “frame” or whatever). Other people might define democracy differently, and then they would register another “democracy” domain. Maybe “joey’s pizza parlor” would want to promote the idea of “pizza for the people” and then people could vote on the website which pizza is best and then that would be the “weekly special” ==> “democracy.ws” !!!
