The manatee watches...

The Magical Word-o-Matic - or - Markov Text Analysis for Fun and Non-Profit Oct 29, 2009

I've just completed a fun new side project. I call it "The Magical Word-o-Matic". What follows is a technical analysis about how it works. If technical stuff isn't your thing, feel free to skip over this and jump straight to the fun part.


I've been reading the Iliad and I've found that the names of the characters are, simply put, quite awesome. One of the interesting things about the Greek names was that they all seemed to be composed of very similar phonemes. I started wondering if there was a way I could programmatically combine together common letter combinations to create my own bad-ass Greek names.

I started brainstorming very forms of statistical analysis I could run on the names to generate a finite-state-machine of sorts that would create names on the fly (yes, I am a huge nerd.) Then, earlier this week, I stumbed upon Markov Text Analysis quite by accident. I did some more research and discovered that this was exactly the kind of algorithm I'd been brainstorming in my head. Not only that, but the technique is generally applicable to language and text analysis; you can analyze words at the character level (as I wanted to do to create Greek names) or at the word level, generating sentences and whole compositions.

Markov analysis works by taking the input and generating from it a set of probable next steps for each item in the input. That is to say, it tells you, given your current state, what you should do next. Take the word "Mississippi". If we analyze this at the character level, we'll get something that looks like the following:

start
        -'M': 100%
        M
        -'i': 100%
        i
        -'s': 50%
        -'p':25%
        -end: 25%
        s
        -s: 50%
        -i: 50%
        p:
        -p: 50%
        -i: 50%

Explained further: The first letter in our text will always be an 'M' and after an 'M' will always come an 'i'. All of our newly generated words will therefore start with 'Mi'. After 'i', things become more interesting - 'i' can be followed by an 's', 'p', or it can simply be the end of the word. 's' and 'p' in turn can result in more s's and p's or another 'i'. The following words could all therefore be generated: "Mi", "Misi", "Mippppppissssipi". Adding more words to the input allow for different starting and ending letters, along with different letter combinations throughout.

Now, obviously a word like "Mipppppppppppi" looks a little silly thanks to the ridiculous number of repeating letters. English never has more than two repeating letters in a row (to the best of my knowledge.) English only has a single word that actually contains more than two letters in a row - "Goddessship" - and that's a rather silly word so its safe to build our analyzer as though we never want more than two repeating letters. To account for this, we need to make our analysis smarter - make it aware of the fact that its input won't generally have more than 2 repeating letters. To do this, we simply make it look at 2 letters at a time when it does its analysis and generation. Analyzing "Mississippi" this way, we get:

start
        -'Mi': 100%
        Mi
        -'ss': 100%
        is
        -'si': 100%
        ss
        -'is': 50%
        -'ip': 50%
        si
        -'ss': 50%
        -'pp': 50%
        ip
        -'pi': 100%
        pp
        -'i': 100%
        pi
        -end: 100%

Now possible words look more like "Missippi" or "Mississississippi". Much more sane, relatively speaking. You may notice that, if you entered in a word that has 4 repeating letters, you can end up back in a a situation where you have long chains of single letters. If you spelled the word "Missssissippi", then you end end up with a chance that the letters 'ss' get followed up by another 'ss'. This can be fixed by increasing the analysis size to 3 characters or more, but you end up with a trade off - larger analysis sizes require larger inputs to generate unique combinations. From anecdotal testing, a analysis size of two seems to give a good result in terms of the naturalness of the word.

You may also notice that, if you step through the above analysis, not all character pairs are reachable. You will always start with "Mi" which will always be followed by "ss" and from there you'll find yourself only able to repeat "issississi" or bail out with an "ippi". This is not terribly interesting.

There are two ways to fix this. One is to enter more words into the input. If the new words contain similar letter pairs, new avenues for combination are introduced. This actually works well assuming that the words one adds to the input are similar, but we can achieve better results with smaller inputs as well. We do this by analyzing words in two letter chunks but only recording single letters for the next step in our word. Analyzing "Mississippi" this way, we get:

start
        -'Mi': 100%
        Mi
        -'s': 100%
        is
        -'s': 100%
        ss
        -'i': 100%
        si
        -'s': 50%
        -'p': 50%
        ip
        -'p': 100%
        pp
        -'i': 100%
        pi
        -end: 100%

Now, before we get too excited, one will note that this generates the same words as the previous analysis, just slower. That's fair, but one will find that, with a larger source input, this will allow for a more dynamic spelling vocabulary.

Also, one will note that we included a two letter output for our starting step. That is because each subsequent step requires two letters for input, so we need two letters to start with. We could have also started with:

start
        -'M': 100%
        M
        -'i': 100%

That would require making our generator more complex however, as it would have to include logic to do a single letter step after the first letter. The end result would be the same.

So, where does this leave us? It leaves us with some kickass, made-up Greek warrior names, that's where. Names like "Dolocheptor", "Adresius", and "Ilionestor". Moreover, when you input the text of Lewis Carroll's Jabberwocky, you get words like "throgovested", "Jabbersnack", and "swortled". All and all, a few hours time well spent, if I do say so myself. Of course, if you want to use your own source text, you're more than welcome to give it a whirl.

Projects

Tags