Markov Chains

Basics

A Markov chain is a random process that moves from one state to another with a probability determined entirely by the current state.

A simple example would be if the odds of it raining tomorrow were determined only by the weather today, as in the following table (called the transition matrix):

Today's weather   Chance of rain tomorrow   Chance of sun tomorrow
rain              80%                       20%
sun               10%                       90%

If we initially set a state of 'rain', then we could use a random number generator to produce a prediction for one week's worth of weather:

rain -> rain -> rain -> rain -> sun -> sun -> sun

Of course, we would likely get a different sequence of events if we made a second prediction:

rain -> rain -> rain -> sun -> sun -> rain -> rain

This obviously isn't a great system for extended weather forecasts. But if we had a more elaborate matrix, we could perhaps use it for something else.
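
Here is a minimal Perl sketch of the weather chain above, just to make the mechanics concrete. The probabilities come from the transition matrix, and the starting state of 'rain' matches the example:

    use strict;
    use warnings;

    # Transition matrix from the table above: chance of tomorrow's
    # weather, given today's.
    my %transition = (
        rain => { rain => 0.80, sun => 0.20 },
        sun  => { rain => 0.10, sun => 0.90 },
    );

    # Pick the next state by "spending" a uniform random roll against
    # the probabilities in the current state's row.
    sub next_state {
        my ($current) = @_;
        my $roll = rand();
        for my $state (sort keys %{ $transition{$current} }) {
            return $state if $roll < $transition{$current}{$state};
            $roll -= $transition{$current}{$state};
        }
        return $current;    # guard against floating-point rounding
    }

    # Start with 'rain' and predict a week of weather.
    my @forecast = ('rain');
    push @forecast, next_state($forecast[-1]) for 1 .. 6;
    print join(' -> ', @forecast), "\n";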

One possibility is to build a matrix from many observations, and then use that to generate potential observations that are hopefully similar. For example, if we treat the letters of a name as a sequence of states, we can generate a transition matrix from a list of names, and then make up new names.

Using US census data from the past hundred years or so (full of names like "John", "William", and "Mary") to build a transition matrix results in generated names like this:

Close enough for government work. But I left out an important part: each letter in a name is still a separate state, but to generate such (relatively) good names, the transition matrix must track two states at a time. So while a "normal" Markov chain for names might have a transition matrix like this:

Current letter   Odds of 'a'   Odds of 'b'   Odds of 'c'   Odds of 'd'   ...
'A'              3.01%         1.02%         3.46%         9.22%         ...
'B'              4.21%         0.00%         0.00%         0.00%         ...
'C'              2.91%         0.00%         0.00%         0.00%         ...
...              ...           ...           ...           ...           ...

The matrix used to generate the names above would look more like this:

Current pair   Odds of 'a'   Odds of 'b'   Odds of 'c'   Odds of 'd'   ...
'Aa'           0.01%         2.01%         0.46%         2.22%         ...
'Ab'           4.21%         3.65%         0.02%         0.87%         ...
'Ac'           1.91%         0.03%         0.73%         0.02%         ...
...            ...           ...           ...           ...           ...

This is called a Markov chain of order 2, since it tracks two states at a time.
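
To make the order-2 idea concrete, here is a small, self-contained Perl sketch (not the String::Markov implementation described below) that builds an order-2 character chain from a handful of sample names and invents new ones. The sample list is just a stand-in for real training data:

    use strict;
    use warnings;

    # Toy training data; real runs would use thousands of census names.
    my @samples = qw(John William Mary James Margaret Johanna);

    # Count transitions from each *pair* of characters to the next
    # character. '^' and '$' are not regexes here; they simply mark the
    # start and end of a name.
    my %counts;
    for my $name (@samples) {
        my @chars = ('^', '^', split(//, $name), '$');
        for my $i (0 .. $#chars - 2) {
            $counts{ $chars[$i] . $chars[$i+1] }{ $chars[$i+2] }++;
        }
    }

    # Pick a key from a hash of counts, weighted by those counts.
    sub weighted_pick {
        my ($dist) = @_;
        my $total = 0;
        $total += $_ for values %$dist;
        my $roll = rand($total);
        for my $item (sort keys %$dist) {
            return $item if $roll < $dist->{$item};
            $roll -= $dist->{$item};
        }
    }

    # Walk the chain from the start markers until the end marker shows up.
    sub generate_name {
        my ($p1, $p2, @out) = ('^', '^');
        while (@out < 20) {
            my $next = weighted_pick($counts{ $p1 . $p2 });
            last if $next eq '$';
            push @out, $next;
            ($p1, $p2) = ($p2, $next);
        }
        return join '', @out;
    }

    print generate_name(), "\n" for 1 .. 5;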

String::Markov

Since playing with Markov chains is fun, I wrote a CPAN module to make it easy to create finite, text-oriented Markov chains in Perl: String::Markov.

The finite part means that it is meant for sequences with a defined beginning and end (it would not work well with the weather example), and the text-oriented part means that it's designed to work on Unicode strings (like names). There are some fancy bits in the module (like Unicode normal forms), but it is simple enough to have 100% test coverage without too much effort.
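
Typical usage looks roughly like the following. This is a sketch rather than the module's exact synopsis, so treat the option and method names (order, add_sample, generate_sample) as approximate and check the POD; names.txt is just a placeholder input file:

    use strict;
    use warnings;
    use String::Markov;

    # Build an order-2 chain over the characters of each sample.
    my $chain = String::Markov->new(order => 2);

    # Train on one name per line; names.txt stands in for real data.
    open my $fh, '<:encoding(UTF-8)', 'names.txt' or die "names.txt: $!";
    while (my $name = <$fh>) {
        chomp $name;
        $chain->add_sample($name);
    }
    close $fh;

    # Spit out some newly invented names.
    binmode STDOUT, ':encoding(UTF-8)';
    print $chain->generate_sample(), "\n" for 1 .. 10;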

The latest updates have been to make the code faster. The original implementation used a hash reference to store the transition matrix, and iterated through the hash with Perl's each function. Iterating through hashes is slow. Simply replacing the hash with two arrays, and iterating through one of them to find the correct index into the other, resulted in a big boost: slow test cases ran over twice as fast, while already fast cases got at least 20% faster. Testing on my cellphone showed the new version running up to 168% faster.
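
The shape of that change, as a rough sketch rather than the module's actual code: keep each row's counts and states in two parallel arrays, scan the counts to find where a random roll lands, and use the resulting index to look up the state.

    use strict;
    use warnings;

    # For one row of the matrix, the counts and the states they belong
    # to live in parallel arrays instead of one hash.
    sub pick_next {
        my ($counts, $states, $total) = @_;
        my $roll = rand($total);
        my $i = 0;
        # Scan the counts array to find where the roll lands ...
        while ($i < $#$counts && $roll >= $counts->[$i]) {
            $roll -= $counts->[$i];
            $i++;
        }
        # ... then use that index to pull the state out of the other array.
        return $states->[$i];
    }

    # Made-up row: states that might follow the pair 'Jo'.
    print pick_next([70, 20, 10], ['h', 's', 'e'], 100), "\n";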

Transforming the probabilities into a cumulative distribution suitable for binary search had mixed results, and was always a net loss if the appropriate XS module was unavailable. Since the main win for this approach was in unusual situations (giant state spaces), I decided to leave out the extra complexity (and the extra dependency).
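
For comparison, the rejected approach replaces the linear scan with a binary search over running totals. A pure-Perl sketch of the idea (the XS module would supply a faster search) looks like this:

    use strict;
    use warnings;

    # Made-up cumulative counts for one row: raw counts (70, 20, 10)
    # become running totals, paired with the states they belong to.
    my @cumulative = (70, 90, 100);
    my @states     = ('h', 's', 'e');

    sub pick_next {
        my $roll = rand($cumulative[-1]);
        my ($lo, $hi) = (0, $#cumulative);
        # Find the first running total greater than the roll.
        while ($lo < $hi) {
            my $mid = int(($lo + $hi) / 2);
            if ($roll < $cumulative[$mid]) { $hi = $mid } else { $lo = $mid + 1 }
        }
        return $states[$lo];
    }

    print pick_next(), "\n";

Each pick becomes logarithmic in the number of possible next states instead of linear, which is why the payoff only shows up for giant state spaces.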

Markov Server

To show off what the module could do, I put together a small PSGI app to generate random names: http://markov.gmathews.com/names. Of course, once I started, I couldn't stop myself from adding more features and sources, eventually ending up with the Markov Server, which supports many different chains.

The multitude of chain types is not without cost, and the single-core VPS I'm using is not resource-rich. To mitigate this problem, I run all the chains "simultaneously" as async blocks with Coro. Each chain writes to a dedicated Coro::Channel, so that data is ready before a request is made. Hopefully, the VPS has enough time between servicing requests to refill the channels with data and keep response times low. This soaks up some RAM, but modern machines (even tiny VMs) have plenty of space. The much bigger downside to this approach is that specific results can't be recalled, since all the data that users see is pre-generated and must wait in the queue.
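
In outline, the producer side looks something like the following. The chain names, queue size, and generate_line_for helper are placeholders for illustration, not the server's actual code:

    use strict;
    use warnings;
    use Coro;
    use Coro::Channel;

    # Stand-in for the real per-chain generators.
    sub generate_line_for { my ($chain) = @_; return "$chain: " . int(rand(1000)) }

    # One bounded channel per chain; put() blocks when the channel is
    # full, so each producer pauses until requests drain some results.
    my %channel = map { ($_ => Coro::Channel->new(20)) } qw(names places);

    for my $name (keys %channel) {
        async {
            $channel{$name}->put( generate_line_for($name) ) while 1;
        };
    }

    # In the request handler: pull a pre-generated line off the queue.
    # get() only blocks if the producers haven't kept up.
    sub next_line { my ($chain) = @_; return $channel{$chain}->get }

    print next_line('names'), "\n";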

With a faster VPS, I could skip the queued approach and provide each result set with a "permalink" that stores the state of the PRNG in the query string. Then results could be regenerated by seeding the PRNG appropriately, and a full page of lines could be "saved" in 48 bits or less.
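
The sketch below shows the idea with Perl's built-in srand and rand. A real version would need a PRNG whose sequence is stable across builds and platforms (which the built-in does not promise), and generate_line here is just a stand-in for a chain-driven generator:

    use strict;
    use warnings;

    # Stand-in generator; the only requirement is that it consumes
    # rand() deterministically.
    sub generate_line { join '', map { chr(97 + int(rand(26))) } 1 .. 8 }

    # Generate a page of results from a fresh 48-bit seed, and keep the seed.
    my $seed = int(rand(2**48));
    srand($seed);
    my @page = map { generate_line() } 1 .. 20;

    # The permalink only needs to carry the seed.
    my $permalink = "/names?seed=$seed";

    # Re-seeding with the same value replays the same page of results.
    srand($seed);
    my @same_page = map { generate_line() } 1 .. 20;

    print "regenerated OK\n" if "@page" eq "@same_page";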

[UPDATE]

The latest version of the Markov Server allows results to be saved with a simple link, while still letting pre-generated results from a queue fulfill most requests. This was accomplished with five changes:

  1. Instead of generating results line-by-line, results are generated in full batches. The per-chain queues now operate atomically on batches of lines, and cannot send sequential lines to different coroutines.

  2. The String::Markov module now generates stable chains, so that the memory layout of the hash used to store the state vectors does not affect the order in which states are tested against the chosen threshold. Recent Perl versions randomize hash iteration order, which caused the same internal state and the same seed to produce different results on different runs. Fixing this was essential for allowing results to be recalled across server restarts.

  3. A queue of just random seeds is used when generating batches of lines, so that each batch is generated from a known seed.

  4. The seed for a batch is used to generate the permalink. If a request comes in with a specific seed value, then that value is used to generate results instead of pulling results from a queue (see the sketch after this list).

  5. LRU caches were added for results pulled out of a queue and for results specified by seed value. A hit in the former cache moves data to the latter cache, creating something resembling a two-level cache system.
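
A condensed sketch of how points 3 through 5 fit together; the generate_batch helper, the batch size, and the plain hashes standing in for the bounded LRU caches are all simplifications:

    use strict;
    use warnings;

    # Stand-in for the real per-chain generator: a known seed always
    # reproduces the same batch of lines.
    sub generate_batch {
        my ($seed, $count) = @_;
        srand($seed);
        return [ map { join '', map { chr(97 + int(rand(26))) } 1 .. 8 } 1 .. $count ];
    }

    my @seed_queue = map { int(rand(2**48)) } 1 .. 10;   # point 3: queue seeds, not lines
    my (%queue_cache, %seed_cache);                      # point 5: two caches (LRU in reality)

    sub handle_request {
        my ($requested_seed) = @_;

        if (defined $requested_seed) {                   # point 4: permalink carries the seed
            # A hit in the queue cache is promoted to the seed cache;
            # a miss in both regenerates the batch from the seed.
            $seed_cache{$requested_seed} //=
                delete($queue_cache{$requested_seed}) // generate_batch($requested_seed, 20);
            return { seed => $requested_seed, lines => $seed_cache{$requested_seed} };
        }

        my $seed  = shift @seed_queue;                   # fresh batch from a queued seed
        my $batch = $queue_cache{$seed} = generate_batch($seed, 20);
        return { seed => $seed, lines => $batch };       # the seed doubles as the permalink
    }

    my $res   = handle_request();                        # anonymous request
    my $again = handle_request($res->{seed});            # permalink request replays it
    print "replayed OK\n" if "@{$res->{lines}}" eq "@{$again->{lines}}";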