Graphing English

Given two words, can we connect them by a chain of synonyms? For example: minuscule — little — short — poor — wretched — ugly — frightful — tremendous — enormous. Try some:

[Interactive tool: connect two words with a chain of synonyms, or look up a word.]

In the graphs above, each node is a word and an edge connects each pair of displayed words that can be synonymous. The raw data on semantic relationships is from Princeton's WordNet project.
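
If you want to rebuild the graph yourself, here is a minimal sketch using NLTK's WordNet interface (an assumed toolchain on my part; the tool above may well read Princeton's database files directly). Two words get an edge whenever they share a synset:

```python
# Sketch: build the synonym graph from WordNet via NLTK (an assumed
# toolchain; the actual tool may parse the WordNet files directly).
# Requires: pip install nltk, then nltk.download('wordnet').
from collections import defaultdict
from nltk.corpus import wordnet as wn

def build_synonym_graph():
    graph = defaultdict(set)
    for synset in wn.all_synsets():
        # Words sharing a synset can be synonymous; connect them all.
        lemmas = [lemma.name() for lemma in synset.lemmas()]
        for a in lemmas:
            for b in lemmas:
                if a != b:
                    graph[a].add(b)
    return graph
```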

Many word pairs are not linked at all. How large is the largest connected component, that is, the largest set of words such that any pair in the set can be joined by a chain of synonyms? This is an interesting graph-theoretic question.
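
With the graph in hand, the answer falls out of a breadth-first search over each component. A sketch, reusing the graph dict from the snippet above:

```python
# Sketch: component sizes via breadth-first search over the synonym
# graph built above (a dict mapping each word to its set of synonyms).
from collections import deque

def component_sizes(graph):
    seen, sizes = set(), []
    for start in list(graph):
        if start in seen:
            continue
        queue, size = deque([start]), 0
        seen.add(start)
        while queue:
            word = queue.popleft()
            size += 1
            for neighbor in graph[word]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        sizes.append(size)
    return sorted(sizes, reverse=True)
```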

The answer is that the largest component has 25,105 words, or 17% of all words in the database. Meanwhile, the second largest component is over six hundred times smaller, with only 38 words.
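
Chains like the minuscule-to-enormous one at the top are just shortest paths in this graph, which BFS with parent pointers recovers. Another sketch over the same graph dict:

```python
# Sketch: shortest chain of synonyms between two words, by BFS with
# parent pointers over the synonym graph built earlier.
from collections import deque

def synonym_chain(graph, source, target):
    parents = {source: None}
    queue = deque([source])
    while queue:
        word = queue.popleft()
        if word == target:
            chain = []
            while word is not None:  # walk parent pointers back
                chain.append(word)
                word = parents[word]
            return chain[::-1]
        for neighbor in graph.get(word, ()):
            if neighbor not in parents:
                parents[neighbor] = word
                queue.append(neighbor)
    return None  # different components: no chain exists
```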

Is this structure bizarre? Actually, it's roughly what one would expect from random graph theory, which I'll now attempt to explain in one paragraph. A classic (Erdős–Rényi) random graph consists of n nodes connected by a bunch of edges chosen uniformly at random. Suppose the number of edges is such that, on average, each node has d neighbors (for us, d synonyms). Imagine exploring a connected component by starting at one node and expanding outward: first looking at the nodes one step away from the starting point, then two steps away, and so on. What happens to the size of this frontier? If d < 1, the frontier tends to shrink by a certain percentage in each step, so with high probability it dies out before the component gets very large. On the other hand, if d > 1, the frontier tends to expand by a certain percentage in each step. Chances are pretty good that the frontier just keeps growing until the component includes a good fraction of all n nodes in the graph. That's the giant component. The second largest component must be very small, specifically O(log n) nodes, since otherwise it would likely intersect with, and thus be absorbed by, the giant component.
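
You can watch this phase transition in a quick simulation. The sketch below samples a random graph with average degree d (I fix the edge count at m = dn/2, which is essentially the Erdős–Rényi model) and reuses component_sizes from above:

```python
# Sketch: the giant-component phase transition in a random graph.
# We sample m = d*n/2 edges so the average degree is (about) d;
# duplicate edges are possible but rare and harmless here.
import random
from collections import defaultdict

def random_graph(n, d, seed=0):
    rng = random.Random(seed)
    graph = defaultdict(set)
    for node in range(n):
        graph[node]  # make sure isolated nodes are counted too
    for _ in range(int(d * n / 2)):
        i, j = rng.randrange(n), rng.randrange(n)
        if i != j:
            graph[i].add(j)
            graph[j].add(i)
    return graph

for d in (0.5, 1.5, 2.92):
    sizes = component_sizes(random_graph(10_000, d))
    print(d, sizes[:2])  # largest and second-largest component sizes
```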

Our graph of English has d = 2.92 synonyms per word. But of course English is not a random graph—which, with the same d, would have a largest component of 93% of its nodes and a second largest of about 5 nodes. Considering that we get within an order of magnitude without modeling any of the structure of the language except the number d, this is not so bad. And WordNet doubtless does not perfectly represent English: for example, it's plausible that common words are better annotated with synonymy relations than, say, abulia or vituperation.
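
For the curious, the 93% figure comes from the standard giant-component equation: the fraction s of nodes in the giant component satisfies s = 1 - e^(-ds), because a node lies outside the giant component exactly when none of its roughly d neighbors is inside it. A quick fixed-point iteration recovers it:

```python
# Sketch: the giant-component fraction s for average degree d solves
# s = 1 - exp(-d*s); iterate the map until it settles.
import math

def giant_component_fraction(d, iterations=100):
    s = 0.5  # any start in (0, 1] converges for d > 1
    for _ in range(iterations):
        s = 1 - math.exp(-d * s)
    return s

print(giant_component_fraction(2.92))  # ~0.93, the figure quoted above
```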

I was led to think of this topic during conversation with some folks from UCL and CMU. But others have built a graph out of WordNet too (after all, it is called WordNet). A paper by Kamps, Marx, Mokken, and de Rijke, "Using WordNet to Measure Semantic Orientations of Adjectives", scores words based on their relative distance to "good" vs. "bad".
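
Their measure is easy to approximate with the machinery above: compare a word's BFS distance to "good" against its distance to "bad". A sketch of the idea (my paraphrase, not their exact formula or code):

```python
# Sketch: score a word's orientation by comparing its graph distance
# to "good" versus "bad" (a paraphrase of the Kamps et al. idea).
from collections import deque

def distance(graph, source, target):
    depths = {source: 0}
    queue = deque([source])
    while queue:
        word = queue.popleft()
        if word == target:
            return depths[word]
        for neighbor in graph.get(word, ()):
            if neighbor not in depths:
                depths[neighbor] = depths[word] + 1
                queue.append(neighbor)
    return None  # unreachable

def orientation(graph, word):
    d_good = distance(graph, word, 'good')
    d_bad = distance(graph, word, 'bad')
    if d_good is None or d_bad is None:
        return None  # word is outside the giant component
    return d_bad - d_good  # positive leans "good", negative leans "bad"
```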

4 comments:

  1. you misspelled Erdős =)

  2. őőps. Unicode translation issues...

  3. Great program! Should be more popular than it currently is.

  4. Whoa! I had thought of a similar idea while creating a quiz with a friend, and though we never actually sat down to code it, we did figure out the logic you explained and anticipated this result, if only naively. What I wonder now: suppose we conduct a simple experiment where we create a simple language with, say, 13 literals and impose rules of grammar and synonymy. Grammar: literals may not repeat more than twice, and no two adjacent literals may be adjacent in any word. Synonymy: some randomly generated database (because imposing a rule would simply make the graph structure dependent on it). Would the number of literals in the word at the current node affect the result? Will the result be a function of the algorithm or of its implementation?
