The Right Glue
January 2009
By Dean
If you thought my last post was too technical, then you'd best just skip this one.
After a grueling week of upgrading my PC's hardware a dealing with flawed Seagate firmware updates, I decided to celebrate by adding yet another Markov chain feature to KevBot. There's not a lot of story to this. Qedi (who still hasn't updated his website since his reply to one of my very first posts), decided to call me on my claim that I couldn't make a full Markov conditional probability matrix out of KevBot's factoid database. He claimed that a graph-like data structure should be able to store the entire Markov table and also that it could be computed in 1.5 seconds.
With this in mind I took out my whiteboard and started coming up with a data structure that would be suited. Since qedi is interested in this structure, I will be outlining what I required and why I designed the structure the way I did. (The things I do for ex-roommates.)
There were three main requirements for the data structure. The first is the operations that need to be available on the structure: given a word, get the next word; access any random word; write a new Markov relationship. The second is that it has to be fast. All read/write operations have to be of O(1) time complexity. The third is that the structure should be serializable and deserializable so that the probabilities can be stored and retrieved across runs.
Qedi's main idea was to use a directed-graph structure to store the probabilities. Basically, each unique word stores a pointer to each other unique word that can come after this one. This allows O(1)-time access to any given next word. Its disadvantage is it uses O(n)-space (in the number of word relationships), so it can get really big. So each unique word is wrapped in a structure called a MarkovNode, and each MarkovNode stores a list of all words that can possibly follow from the current word, along with their weights.
A mere graph is not enough to satisfy the other requirements. A graph is not inherently searchable for a given word. What's a good structure for retrieving a given wrapper object given its name? There are several. A hash table comes to mind, but one can't guarantee O(1)-time access in every case with a hash table. The structure I chose for this is a trie. Tries aren't as fast on average as hash tables, but they certainly are consistently efficient. Each MarkovNode is referenced by a trie by its name, so that given an input string, the appropriate node can be found.
Tries are terrible at serialization and random access. That's where the final part of the structure comes into play. Each MarkovNode is referenced in a large unsorted ArrayList object. When a random MarkovNode needs to be accessed, it can be found in the ArrayList by index. Additionally, the ArrayList is very easy to serialize: for each node in the list, output the word and all of its children. Rebuilding the Markov graph from this kind of list is trivial. Serialization and deserialization are both O(n) operations (where n is the number of relationships), which is pretty much as good as can be expected.
Since the data structure is completely different from the existing structures KevBot is using, I decided to not use the existing database to populate it. I decided to use all input from the IRC channel to drive the structure instead of just what I've arbitrarily called factoids. Every time someone says something in the channel (#keveX on irc.slashnet.org), the sentence is read by the trie. Each word in the sentence is converted to a MarkovNode (a node is created and added to the ArrayList if it doesn't already exist for a given word) as found in the trie, and the MarkovNode objects are chained together based on the order of the words in the new sentence.
To generate output, one need only find a MarkovNode and follow along its children according to their weights. Finding a MarkovNode can be done randomly using an index into the ArrayList, or using a string by following the trie structure. Both operations are O(1) in the number of relationships.
So, in conclusion, large is one is something that the end up the algorithm.
By Dean
Weapons, a poem by KevBot:
When the reticle turns red,
The reticle turns red,
The guy is dead.
WARNING: This post assumes you have at least some idea of what conditional probability is, at least in the context of Markov chains. For those who do not know, here is my poor rough explanation: let's say someone has split a deck of cards into two 26-card piles. If I pick a card from either pile, I have 25% chance to get a diamond. But if I knew that the pile on the left only has red cards, and the pile on the right only has black cards, I know that the probably of picking a diamond from the left pile is 50% and 0% from the right pile. That's conditional probability: the change in probability of an outcome given some extra information. Probability is fun stuff, guys.
As you'll recall, I have something of a passion for creating things whose primary purpose is to cause general aggravation in my peers. The last time I updated my IRC bot, I added a Markov chaining algorithm whose output resembles English statistically rather than syntactically or grammatically (words that end in -tically are fun to say aloud).
Since my primary instance of KevBot ("living" in #keveX on irc.slashnet.org) has a database of nearly 100 000 sentences, it is infeasible to search through every single one of them to generate a Markov chain. Because of this I made a little heuristic that randomly picks a subset of sentences from its database and only checks that subset every time it needs to find the next word. Basically what it would do is search 4000 sentences for the current word, and if it was found, it would record the next word. Once it finds some nice next words, it picks one of those at random. Then it continues, using the last word as the new current word. It stops when it reaches 20 words or when it finds a period, exclamation mark or question mark. It's not truly a Markov chain because it doesn't work with the full set of sentences.
[Aside: While I could generate a true conditional probability table (you know, an actual Markov chain), it would take a very, very long time to calculate and would have to be recalculated at least partially every time a new sentence is added to the database. New sentences are added often enough that I never gave this option more than a few seconds' thought.]
Because the number of sentences is so large, KevBot has a very wide vocabulary and therefore is rather unlikely to create a phrase that makes any sense at all. Because it uses a different set of sentences for each word, there is a chance that it won't be able to find the current word in this set, which means it can't find the next word.
These things make the old Markov algorithm both incorrect and unnecessarily incoherent. The guys from #clone-army (on irc.rizon.net) asked me about why their instance of the bot was babbling such nonsense at them, and made a few suggestions about modifications to the algorithm.
I took a few of the ideas and got to work. An hour and two tacos later I had the new algorithm figured out. It was still a Markov chain, but worked on a rather different fundamental principle than the previous version: instead of resampling the full database of sentences for each word, it creates a set of sentences at the very beginning, and creates its chain from just those sentences.
This method has the advantage of being much faster than the previous version (since it works with many fewer sentences). It also can't miss a word, since by definition any word it finds is in a sentence it will be using on the next iteration (so even if the word "zebra" is only used once in KevBot's entire database, the fact that KevBot found "zebra" once means it will find it again). Since it's faster, I was able to increase the maximum sentence size from 20 to 30. And since it generates the sample list of sentences just once, I can parameterize how big that list is to change how random KevBot's responses are.
The fewer sentences KevBot uses to create output, the more likely it is to make sense (the poem above was generated from only two source sentences). The more it uses, the bigger the vocabulary and the longer the output.
Its biggest disadvantage is the small vocabulary domain. The wisdom command takes an optional parameter which is used as the first word in the chain if it is supplied. Since KevBot would only search, say, 100 sentences for that word, it is very unlikely that it would find it. I solved this problem by taking 5000 sentences at random, looking for the first word. If it finds it, it uses the last 100 sentences it found as a source. If it doesn't, then it gives up (to avoid an infinite loop).
The end result is very interesting. It's a true Markov chain because it uses all sentences in its sample (the conditional probabilities are correct for the sample, if not the whole database), so the output tends to be a lot more coherent than the previous version. Here are some examples:
<KevBot> a slow push through little plot, blam blam, action, go somehwere else retarted.
<KevBot> drum solo is also in the antenna
<KevBot> mine is to a brilliant idea, where you take into account a) how many dfeaths you have
<KevBot> an IQ of respects is stronger
<KevBot> a very thing is it more like a case-by-case statistical thing is C, 80-90 gets C, one standard deviation better is A, case-by-case statistical thing is supposed to the letters related
<KevBot> stpeeed on you are filled with cabinets, a lifeguard at math.
As always, you are welcome to appear in keveX to enjoy my bot's silly output as much as you'd like. It is indeed a lifeguard at math.
This is some kind of footnote. This webpage is awesome and can be viewed in any browser. Even ones that suck ass like Safari and Firefox. Isn't that awesome? This site is best viewed with browsers that aren't maximized on large-resolution displays (> 1024 pixels in width). But then again, if you are running a large resolution and browsing maximized, then you're a terrible person so you don't really deserve to see this site at its finest. Jerk. I mean, seriously. I spend all this time making a nice site and your silly browsing habits ruin its look. That's really cold, man. If you're using IE6, then in order to see the cool avatar effects you need to enable JavaScript. No rights reserved by Dean Whelton (who is awesome) of any of the content, images, design, backend or electrons used in this site. Steal at your convenience. None of it is worth stealing anyway, so there. I have even made an RSS feed for more efficient theft of my intellectual property: CLICK IT NOW!!! Now, don't say I'm not generous. I guess if you want to know more about me, you can visit the about page. It's not really an about page, though. It's just one of the first posts. I don't feel like making a real about page. You can contact me, too. If you feel like it. Are you really wasting time reading this? Go outside or something.