Random Lexical Experiments Continued: Phonemes and Graphemes


Poem of the day: It sort of reminds me of the futuristic Hawaiian dialect from Cloud Atlas.

freckled te mortality
terrible violent wounded great ti take
great mighty veritable
sperm sperm commentator
sperm Norwegian ta wondrous American to certain
American sperm State
dying to Greenland to mistake
Greenland Greenland ke stand
Greenland sperm important
sperm te entire Trumpa Physeter pa momentary
Sperm Greenland ki take
right po sperm pe right right ti distance
sperm humpbacked important
Greenland Greenland captain
Greenland sperm totally
sperm Hyena instant
Tusked involuntarily
Horned te Unicorn Folio ta stand
white white white to white metaphysical
white white te stand
white white ti white Albino po sperm sperm Captain
entire sperm pe sperm white established
white sperm po tallest
sperm sperm po sperm te sperm to instance
right sperm metaphysical
English snowy entailed
true take
Right to stranded living to take
Greenland boiling foremost particular Patagonian
sperm spermaceti sperm stricken ke sperm waning te talk
same heaving understanding
tremendous great beheaded involuntarily
mightiest pa great sperm dead towing tapered
fagged pe sperm tapping
right sperm tar
sperm stricken substantial
wounded potatoes
sunken to sunken first po stricken flying stature
towing whole mountaineers
sperm towing to tablecloth
unborn stricken schoolmaster controverted te unaccountable
drugged other understand
blasted ki other lighter te Dutch slack stricken stayed
sperm white other table
adult last hunted great eternal standing
living last pa dead dead stains
stricken white famous ti white tanning
gliding sperm ki take
white white uncomfortableness
white fatal
white hated before Stammering

What I did was generate a bunch of words via regex searches in Moby Dick, then compute the frequencies of letter pairs and randomly inject those pairs into the lines, and then find word pairs within words and add them to the end of lines. I would like to experiment more with stanzas and punctuation here.
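Here is a minimal sketch of the injection step only, not the exact script I ran. It assumes the line's words are already chosen and that the letter pairs are drawn at random in proportion to their counts. The names make_line, pair_weights, and inject_prob are made up for illustration.

```python
import random

def make_line(words, pair_weights, rng, inject_prob=0.3):
    """Join words into a line, randomly injecting letter pairs between them.

    pair_weights maps a letter pair (e.g. "te") to its corpus count, so
    frequent pairs are injected more often. This is an assumed scheme.
    """
    pairs = list(pair_weights)
    weights = [pair_weights[p] for p in pairs]
    out = []
    for i, word in enumerate(words):
        out.append(word)
        # Only inject between words, never after the last one.
        if i < len(words) - 1 and rng.random() < inject_prob:
            out.append(rng.choices(pairs, weights=weights, k=1)[0])
    return " ".join(out)

# A few pair counts of the kind a letter-pair frequency table gives you.
pair_weights = {"te": 6929, "ta": 3339, "to": 7126, "pa": 1694, "ki": 930}
rng = random.Random(0)  # seeded so a run can be repeated
print(make_line(["sperm", "Greenland", "white", "take"], pair_weights, rng))
```

Raising inject_prob makes the lines sound more like chanting; setting it to 0 returns the plain words.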

I was chatting with Colin about my Python poetic experiments, and he said that he had built a library a long time ago for rhyming based on espeak (the cool Unix version of say).  So what is a phoneme? It is a unit of sound, like p or th. When you want to rhyme, it is useful to know the phonemes of the word endings.  There is also the grapheme, which is a way of writing down a phoneme. I was thinking of sound when I generated my poem today. Chapter 3 is about tokenization, file I/O, and some encoding, so it was somewhat useful in this endeavor.

I was really interested in finding the frequencies of pairs of letters, specifically a consonant followed by a vowel. Here the frequency of ka is 40 and of pa is 1694 (in, I think, Moby Dick).

       a      e     i     o     u
k     40   2727   930    24    39
p   1694   3122  1124  2372   552
r   3347  11627  4113  4672   863
s   2059   6577  2721  3199  1799
t   3339   6929  4933  7126  1669
v    737   5695  1437   481    18
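A table like this can be built with nothing but the standard library. The sketch below is one way to do it, assuming the counts come from consonant-vowel pairs found inside lowercased words. The function names cv_pair_counts and tabulate are my own, and the three sample words stand in for a real corpus.

```python
import re
from collections import Counter

def cv_pair_counts(words, consonants="kprstv", vowels="aeiou"):
    """Count every consonant+vowel pair that occurs inside the given words."""
    pattern = re.compile(f"[{consonants}][{vowels}]")
    counts = Counter()
    for word in words:
        counts.update(pattern.findall(word.lower()))
    return counts

def tabulate(counts, consonants="kprstv", vowels="aeiou"):
    """Lay the counts out as a consonant-by-vowel grid."""
    lines = ["  " + "".join(f"{v:>6}" for v in vowels)]
    for c in consonants:
        lines.append(f"{c} " + "".join(f"{counts[c + v]:>6}" for v in vowels))
    return "\n".join(lines)

# Tiny stand-in word list; swap in the corpus tokens for real counts.
sample = ["potatoes", "Patagonian", "take"]
print(tabulate(cv_pair_counts(sample)))
```

Passing the Moby Dick tokens instead of sample should give counts on the scale of the table, though I have not checked that it reproduces those exact numbers.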

Things get interesting when you load in a corpus like:

>>> import nltk
>>> from nltk.corpus import gutenberg, nps_chat
>>> moby = nltk.Text(gutenberg.words('melville-moby_dick.txt'))

So if I want to match a _ whale:

>>> moby.findall(r"<a> (<.*>) <whale>")
dead; great; mightier; right; live; good; southern; white; white;
white; particular; sperm; sperm; sperm; sperm; flying; dead;
Greenland; Polar; sperm; small; dead; right; sperm; DEAD; nursing;
dead; lone; fine; blasted; second; blasted; sick; certain; discovery


>>> moby.findall(r"(<.*>) <fast>")
was; it; they; a; those; him; got; when; got; locked; go; as; still;
iron; was; ;; and; be; making; so; get; him; him; when; party;
technically; technically; her; very; ,; as; walks; himself; my; the;
very; reefed; was; and; ,; held; themselves; hold; now; all; so;

>>> moby.findall(r"(<.*>) <blubber>")
their; and; thousand; “; de; de; of; the; great; of; the; the; the;
the; the; his; That; same; the; or; his; the; the; fresh; the; of;
the; of; veteran; of; shrivelled; of; and; curved

This is fun! Unfortunately, findall prints its matches to stdout instead of returning them as a string, so I need to do some massaging to make the results actually usable.
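One way to do that massaging is to redirect stdout into a buffer while the call runs, then split the captured text. This is a sketch that assumes findall writes with ordinary print calls, as the output above suggests. fake_findall is a made-up stand-in so the example runs without NLTK, and capture_printed and split_hits are my own helper names.

```python
import io
from contextlib import redirect_stdout

def capture_printed(func, *args, **kwargs):
    """Run func and return whatever it printed, as a single string."""
    buffer = io.StringIO()
    with redirect_stdout(buffer):
        func(*args, **kwargs)
    return buffer.getvalue()

def split_hits(printed):
    """Turn 'a; b; c' output, possibly wrapped over lines, into a list."""
    flat = printed.replace("\n", " ")
    return [hit.strip() for hit in flat.split("; ") if hit.strip()]

def fake_findall():
    # Hypothetical stand-in that prints the way the output above looks.
    print("dead; great;\nmightier; right")

hits = split_hits(capture_printed(fake_findall))
print(hits)
```

With the real corpus loaded, the call would look like capture_printed(moby.findall, r"<a> (<.*>) <whale>").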

Stemmers are also introduced in ch. 3. Stemming is how we determine the root of a word, like “go” for “going”. There are different types of stemmers, and you just have to use the one you like; that is sort of the advice that the documentation gives.  The WordNetLemmatizer only strips an ending if the resulting word is in its dictionary. So, for example, if “go” is in the dictionary and “going” is not, “go” is returned.  We looked at lemmas the other day; lemmatization, like stemming, removes inflections and endings to return the root word.
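To make the idea concrete, here is a deliberately naive stemmer that just strips one common suffix with a regex. It is a toy under my own assumptions (the suffix list and the name naive_stem are made up for illustration), not the Porter stemmer or anything NLTK ships.

```python
import re

# Shortest possible stem, followed by at most one known suffix.
SUFFIX_PATTERN = re.compile(r"^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$")

def naive_stem(word):
    """Strip one common English suffix; return the word as-is if none fits."""
    return SUFFIX_PATTERN.match(word).group(1)

for word in ["going", "walked", "totally", "take"]:
    print(word, "->", naive_stem(word))
```

It overcuts happily ("whales" becomes "whal"), which is part of why real stemmers carry many more rules.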

I am not sure how this would be interesting poetically.  Perhaps you want to create a rhythm, or use root words at different parts in a line, the endings perhaps.  For languages where words are not divided by spaces, tokenization is more difficult. The analysis of segmentation addresses this and is fascinating. Basically, word boundaries are marked by a binary string that runs alongside the characters (a 1 means that character ends a word).  One idea is to remove the spaces from sections of Moby Dick and reconstitute them based on the distribution of word length and sentence length.
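A small sketch of both ideas: segment cuts a spaceless string wherever the binary string has a 1, and random_segs invents new boundaries by drawing word lengths from a sample. The second function is my guess at how the reconstitution could work (it ignores sentence length), and both names are assumptions.

```python
import random

def segment(text, segs):
    """Split text after every position where segs has a '1'.

    segs has one digit per character except the last, since the final
    character always ends a word.
    """
    words = []
    last = 0
    for i, flag in enumerate(segs):
        if flag == "1":
            words.append(text[last:i + 1])
            last = i + 1
    words.append(text[last:])
    return words

def random_segs(n_chars, word_lengths, rng):
    """Build a boundary string by sampling word lengths until text runs out."""
    segs = ["0"] * max(n_chars - 1, 0)
    pos = 0
    while True:
        pos += rng.choice(word_lengths)  # lengths must be positive
        if pos >= n_chars:
            break
        segs[pos - 1] = "1"
    return "".join(segs)

text = "callmeishmael"
print(segment(text, "000101000000"))  # boundaries after "call" and "me"
rng = random.Random(0)
print(segment(text, random_segs(len(text), [1, 2, 3, 4, 7], rng)))
```

In practice word_lengths would be the list of word lengths taken from the original passage, so common lengths get picked more often.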

Sumana posted this fantastic Python library, olipy, for this sort of poetic generation.  Oulipo is a writing style where you introduce certain constraints into the writing, like writing without the letter e. Christian Bök is one of my favorite Oulipo-esque writers.  Look what happens when I google for an article on Bök and Oulipo: an article on Bök, Oulipo, and Bergvall comes up. That, my friends, is synchronicity. Maybe.