Chimpanzees love nothing more than sitting at keyboards processing and generating text. Elephants have a prodigious ability to store and recall information, and will carry huge amounts of cargo with great determination. The chimpanzees and the elephants realized there was a real business opportunity from combining their strengths, and so they formed the Chimpanzee and Elephant Data Shipping Corporation.
They were soon hired by a publishing firm to translate the works of Shakespeare into every language. In the system they set up, each chimpanzee sits at a typewriter doing exactly one thing well: read a set of passages, and type out the corresponding text in a new language. Each elephant has a pile of books, which she breaks up into "blocks" (a consecutive bundle of pages, tied up with string).
We’re hardly as clever as one of these multilingual chimpanzees, but even we can translate text into Pig Latin. For the unfamiliar, you turn standard English into Pig Latin as follows:
-
If the word begins with a consonant-sounding letter or letters, move them to the end of the word adding "ay": "happy" becomes "appy-hay", "chimp" becomes "imp-chay" and "yes" becomes "es-yay".
-
In words that begin with a vowel, just append the syllable "way": "another" becomes "another-way", "elephant" becomes "elephant-way".
Pig Latin translator, actual version is a program to do that translation. It’s written in Wukong, a simple library to rapidly develop big data analyses. Like the chimpanzees, it is single-concern: there’s nothing in there about loading files, parallelism, network sockets or anything else. Yet you can run it over a text file from the commandline or run it over petabytes on a cluster (should you somehow have a petabyte crying out for pig-latinizing).
CONSONANTS = "bcdfghjklmnpqrstvwxz" UPPERCASE_RE = /[A-Z]/ PIG_LATIN_RE = %r{ \b # word boundary ([#{CONSONANTS}]*) # all initial consonants ([\w\']+) # remaining wordlike characters }xi each_line do |line| latinized = line.gsub(PIG_LATIN_RE) do head, tail = [$1, $2] head = 'w' if head.blank? tail.capitalize! if head =~ UPPERCASE_RE "#{tail}-#{head.downcase}ay" end yield(latinized) end
for each line, recognize each word in the line and change it as follows: separate the head consonants (if any) from the tail of the word if there were no initial consonants, use 'w' as the head give the tail the same capitalization as the word change the word to "{tail}-#{head}ay" end emit the latinized version of the line end
-
The first few lines define "regular expressions" selecting the initial characters (if any) to move. Writing their names in ALL CAPS makes them be constants.
-
Wukong calls the
each_line do … end
block with each line; the|line|
part puts it in theline
variable. -
the
gsub
("globally substitute") statement calls itsdo … end
block with each matched word, and replaces that word with the last line of the block. -
yield(latinized)
hands off thelatinized
string for wukong to output
//// Is the Ruby helper necessary? It may be, but wanted to query it as I don’t see it appearing often as a feature. Amy////
To test the program on the commandline, run
wu-local examples/text/pig_latin.rb data/magi.txt -
The last line of its output should look like
Everywhere-way ey-thay are-way isest-way. Ey-thay are-way e-thay agi-may.
So that’s what it looks like when a cat
is feeding the program data; let’s see how it works when an elephant is setting the pace.
Each day, the chimpanzee’s foreman, a gruff silverback named J.T., hands each chimp the day’s translation manual and a passage to translate as they clock in. Throughout the day, he also coordinates assigning each block of pages to chimps as they signal the need for a fresh assignment.
Some passages are harder than others, so it’s important that any elephant can deliver page blocks to any chimpanzee — otherwise you’d have some chimps goofing off while others are stuck translating King Lear into Kinyarwanda. On the other hand, sending page blocks around arbitrarily will clog the hallways and exhaust the elephants.
The elephants' chief librarian, Nanette, employs several tricks to avoid this congestion.
Since each chimpanzee typically shares a cubicle with an elephant, it’s most convenient to hand a new page block across the desk rather then carry it down the hall. J.T. assigns tasks accordingly, using a manifest of page blocks he requests from Nanette. Together, they’re able to make most tasks be "local".
//// This would be a good spot to consider offering a couple additional, real-world analogies to aid your reader in making the leap to how this Chimp and Elephant story correlates to the needs for data manipulation in the readers' world - that is, relate it to science, epidemiology, finance, or what have you, some area of real-world application. Amy////
Second, the page blocks of each play are distributed all around the office, not stored in one book together. One elephant might have pages from Act I of Hamlet, Act II of The Tempest, and the first four scenes of Troilus and Cressida [1]. Also, there are multiple 'replicas' (typically three) of each book collectively on hand. So even if a chimp falls behind, JT can depend that some other colleague will have a cubicle-local replica. (There’s another benefit to having multiple copies: it ensures there’s always a copy available. If one elephant is absent for the day, leaving her desk locked, Nanette will direct someone to make a xerox copy from either of the two other replicas.)
Nanette and J.T. exercise a bunch more savvy optimizations (like handing out the longest passages first, or having folks who finish early pitch in so everyone can go home at the same time, and more). There’s no better demonstration of power through simplicity.