Word Play

2010-09-14 : Every time I try to read the Old Testament, I'm always amazed at how repetitive it is. That me to wonder if I could measure this repetitiveness. What is the longest, most repeated phrase in the Bible? Note that this is an ambiguous question, and part of this exercise is to figure out how to resolve the ambiguity.

I'd previously done letter and word counters with Daniel Richard G., but this time I wanted to calculate word sequences. Also, the Bible is moderately large - over 790,000 words. That's nothing if you are a computational biologist, but naive O(n2) or worse algorithms aren't going to cut it.

It seemed to be that the "sorting an array of all suffix strings" approach would work pretty well for making it easy to find duplicate strings. However, I also wanted to parameterize my approach so I could use the same algorithm but vary things like how punctuation was handled and whether comparison was done on a character by character or word by word basis.

Side note: Interestingly enough, there's a word for creating a sorted index of every word in a book: a concordance. Even more interesting, the first concordance was created for . . . the Bible!

I started out by making a "histogram" of sequences of tokens of length n. Tokens can be single characters or words (character array ranges), and I can change the tokenizer to get different results.

The Results

Source: https://latenighthacking.com:8443/svn/code_root/2010/WordPlay

Speed: For the Bible (below) (4,137,819 bytes):
For lower case words:

For characters :

Not too bad. Good enough for interactive experimentation.

The Bible

My copy of the Bible comes from O-Bible, King James Version. I'm using the whole thing, not just the Old Testament.

Character histogram (lower case) of 4,137,819 characters: " ":758,536, "e":410,139, "t":316,031, "h":282,012, "a":274,632, "o":241,643, "n":223,958, "i":192,750, "s":189,132, "r":169,113, "d":157,546, "l":129,348, "f":83,095, "u":82,940, "m":79,542, ",":70,683, "w":65,213, "y":58,248, "g":54,852, "c":54,430, "b":48,555, "p":42,738, "\n":31,102, "v":30,249, ".":26,145, "k":22,111, ":":12,721, ";":10,139, "j":8,778, "?":3,297, "z":2,966, "'":1,997, "x":1,449, "q":953, "!":313, "(":221, ")":221, "-":21

Here are some interesting results, just looking at 791,421 (lower case) words.

While the results are correct, as we get to longer sequences, the result is not exactly what I want. We see the same phrase hits the top 10 multiple times, probably pushing off other interesting phrases. It's a sliding window problem: if the phrase is actually 15 words long, it will show up five times in the top ten as the window slides along. I would prefer it show up only once.

There is also the opposite problem, where a core phrase has slightly different words at the beginning or end. They really are "different" phrases though they are very similar. For example, from the above:

            "in the book of the chronicles of the kings of":34
"not written in the book of the chronicles of the":29
Even though they share the same core, they have different numbers of occurrences.

Other interesting long sequences:

And to really bring the sliding window problem home, The longest phrase used more than once is just a variation of the above that prepends two words: length 101: "did offer his offering...":2.

The Qur'an

So there are lots of translations and clearly the results are going to depend on the translation. I've fed through a couple of translations, and the general feel of the results remains the same (thankfully).

Using the Pickthall translation (source), which has fairly Biblical English, here are some interesting results. Looking at 155,833 (lower case) words:

Interestingly, by length 3 we are already under 200 occurrences: By length 6 we are already under 100 occurrences: Longest phrase used more than once:

Well, what about the original Arabic? Well, sure, the program can handle it, assuming a space is a reasonable word break. I don't know anything about Arabic, Arabic script, or it's Unicode encoding (other than that it's a right-to-left language and requires complex context-sensitive glyph shaping rules to render correctly) so I'm likely mangling things. I can't exactly read the results. Plus, if we really want to compare originals, then I'd need to find some original Aramaic, Hebrew, and Greek for the Bible. But all we really care about is the fun of seeing the numbers, not the exegesis, right? Right!

Looking at 281,694 words:

Well, we can see that the sliding window problem still happens. It's also interesting to note that the most frequent sequence of length 10 happens 114 times in the Qur'an compared to 34 times in the Bible, even though the Bible has 2.8 times as many words (under the poor assumption English and Arabic words have similar information densities). This suggests that the Qur'an is even more repetitive than the Bible! As the English translations are less repetitive, this suggests that there are some errors in our assumptions (or that the translators like to throw in some variety).

We can improve our estimate the information density by noting that in the translation we are using, there are about 1.8 Arabic words per English word. So the length 12 sequence that happens 114 times (see below) should be similar to an English sequence of length 6.6. From the Bible, the most frequent 6 word phrase happens 125 times and the most frequent 7 word phrase happens 72 times. This suggests the Qur'an is about equally repetitious as the Bible.

A few last interesting data points:

Longest phrase used more than once:

Alice's Adventures in Wonderland

I'm doing this quickly and simplifying contractions by removing apostrophes. From 27,354 words:

Longest phrase used more than twice: Longest phrase used more than once:

War and Peace

From 565,711 words:

Longest phrase used more than twice: Longest phrase used more than once:

The Complete Works of William Shakespeare

From 887,289 words:

Longest phrase used more than twice: Longest phrase used more than once:

Pride and Prejudice

From 122,149 words:

Longest phrase used more than twice: Longest phrase used more than once:
Future work

Solve the sliding window problem.

It might be possible to capture the core/affix relations using some sort of tree, where each branch represents the addition of an affix and a corresponding weight. There would need to be some sort of canonicalization of the ording of prefixes and suffixes lest the tree become a graph. Also, a token can belong to more than one tree so we would need a way to decide which tree gets it. Assuming these problems can be overcome, it would probably be interesting to visualize the result as a tree map.

Zapf distributions!

A Second Algorithm

2010-10-06 : I implemented the algorithm that finds the longest duplicated sequences and removes them iterativey. The results are interesting, but I'm not sure how to turn it into a metric to measure repetitiveness.

It seems like it would be very interesting to feed into a tree map, since each word will be counted only once (vs. the simple method where each word starts a subtree and thus the words are counted n2/2 times). However, this doesn't really measure repetitiveness, but rather word distribution and highlights common phrases.

Perhaps a better way to do it would be to create a histogram of number of words in repeated phrases of length n.

Anyway, here's some sample output. Top 20 longest repeated phrases

The Bible

The Quran

Alice's Adventures in Wonderland

War and Peace

The Complete Works of William Shakespeare

Pride and Prejudice

C o m m e n t s :    
(nothing yet)