Markov chain

From Caves of Qud Wiki

A Markov chain, in the context of Caves of Qud, is a method of generating new text from a "corpus" of existing text. The game pulls from two files in the base game directory: QudCorpus.txt, a compilation of all of the game's dialogue, descriptions, quest info, and help text; and LibraryCorpus.txt, a collection of public domain books from Project Gutenberg. Specific titles include The Machinery of the Universe: Mechanical Concepts of Physical Phenomena and Zoological Mythology; or, The Legends of Animals. This method is used to generate text for white-titled books, dreaming creatures, graffiti, urn engravings, and glowcrow dialogue.

Jason Grinblat, aka Ptychomancer, also gave a talk at the Roguelike Celebration about Markov generation:

This talk also covers the generation of the Ruin of House Isner, which this article does not go into.

Preparing the Model

The general algorithm chops the corpus up into key-value pairs, with the size of each key determined by the "order" of the Markov chain. Caves of Qud uses an order of two, so each key is a pair of words.

As a simplified example, take the sentence "Oh, a quetzal is a pretty bird in the trogon family." The game will split this into the following pairs:

{
["Oh, a"]: {"quetzal"}, # This is the start of the sentence, so these are OpeningWords
["a quetzal"]: {"is"},
["is a"]: {"pretty"},
...
}
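The pairing step above can be sketched in Python. This is illustrative only, not the game's actual code; `build_model` and its parameters are invented names.

```python
from collections import defaultdict

def build_model(sentences, order=2):
    """Chop sentences into key-value pairs: each key is a tuple of
    `order` consecutive words; each value lists the words seen after it."""
    model = defaultdict(list)
    opening_words = []  # keys that begin a sentence (the OpeningWords)
    for sentence in sentences:
        words = sentence.split()
        if len(words) <= order:
            continue
        opening_words.append(tuple(words[:order]))
        for i in range(len(words) - order):
            model[tuple(words[i:i + order])].append(words[i + order])
    return model, opening_words

model, openers = build_model(
    ["Oh, a quetzal is a pretty bird in the trogon family."])
# model[("Oh,", "a")] == ["quetzal"]; openers == [("Oh,", "a")]
```

With a larger corpus, keys that occur in many sentences accumulate multiple successor words, which is where the variety in generation comes from.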

The already-processed output also ships as a file, likewise titled QudCorpus.txt, which is actually a .json file. Note that this file does not look like the example above: it requires additional parsing to be usable.

Generation

The game chooses two starting words. These are usually chosen from among the OpeningWords, so the result sounds coherent. Using this very small corpus as an example, the Markov chain generates the first two words: "Oh, a". The "chain" part comes from the fact that once a starting point is established, the rest follows from what is likely to occur next. Plugging "Oh, a" into the dictionary returns "quetzal", so the current output is now "Oh, a quetzal". The chain then takes the last words, as many as the order, and repeats the process: "a quetzal" are the last two words, and looking them up in the dictionary yields the next word, "is".

[Oh, a] => quetzal
Oh, [a quetzal] => is
Oh, a [quetzal is] => a
Oh, a quetzal [is a] => pretty
...

This continues until the chain comes across a period or 300 words have been generated. Of course, with a corpus one sentence long, it can only output that one sentence: "Oh, a quetzal is a pretty bird in the trogon family." This is also why some sentences are repeated word for word in white-titled books: the source sentence uses such unique words and/or phrasing that only one possible word can follow at each step. For example, "Klanq puff on debt".

The complexity arises as the corpus grows. In the full corpus, there are several possible values that "is a" can lead to: "billowing", "small", etc.

{
...
["is a"]: {"pretty", "billowing", "measurement", "heavy", ...}
...
}

When there are multiple possible values, one will be randomly chosen. Some possible sentences:

Oh, a quetzal [is a] billowing cloak...
Oh, a quetzal [is a] measurement of how well protected you are...
Oh, a quetzal [is a] heavy fall of Karkayan in 24PP.
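The walk described above can be sketched as follows. Again this is illustrative Python, not the game's code: `generate` and its parameters are invented. The loop looks up the last two words, appends a randomly chosen successor, and stops at a period or a word cap.

```python
import random

def generate(model, opening_words, max_words=300, order=2, rng=random):
    # Start from a randomly chosen sentence-opening key.
    output = list(rng.choice(opening_words))
    while len(output) < max_words:
        key = tuple(output[-order:])      # last `order` words
        choices = model.get(key)
        if not choices:                   # dead end: no known successor
            break
        word = rng.choice(choices)        # pick one successor at random
        output.append(word)
        if word.endswith("."):            # stop at the end of a sentence
            break
    return " ".join(output)

# One-sentence model from the example above: every key has one successor.
model = {
    ("Oh,", "a"): ["quetzal"], ("a", "quetzal"): ["is"],
    ("quetzal", "is"): ["a"], ("is", "a"): ["pretty"],
    ("a", "pretty"): ["bird"], ("pretty", "bird"): ["in"],
    ("bird", "in"): ["the"], ("in", "the"): ["trogon"],
    ("the", "trogon"): ["family."],
}
print(generate(model, [("Oh,", "a")]))
# prints: Oh, a quetzal is a pretty bird in the trogon family.
```

With the full corpus, `model[("is", "a")]` would hold many successors, and `rng.choice` is what produces the varied sentences shown above.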

Limitations

The game has an optional seeding parameter for choosing the initial two starting words. Cryptogull, the official Discord server's helper bot, also exposes this parameter. This section applies to both unless specified otherwise, although it only significantly affects Cryptogull, because the model is not designed to be a general text generator.

  • Initial Phrasing - Because of how the data is organized, seeds must be exactly two words long. There is no fuzzy searching, so the seed must match the corpus exactly in case and punctuation. For example, the model considers "Of the" and "of the" to be two distinct phrases: the first occurs only at the start of a sentence, while the second appears only somewhere in the middle.
  • Corpus Size - The corpus consists only of the game's descriptions and dialogue plus some public domain texts. This comes to about 1 MB, compared to the 40 GB of text GPT-2 was trained on. The two are completely different approaches to text generation, but the comparison gives a sense of scale. Because the corpus is comparatively small, an arbitrary phrase will not work as a seed: that exact phrase must be in the corpus. To check whether a phrase is in the corpus, you can Ctrl+F QudCorpus.txt to see if it appears. If you are using Cryptogull, you can use ?incorpus <word> to see all possible phrases that contain that word.
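Because keys are stored as exact word tuples, seed lookup amounts to a plain dictionary access, which is why case and punctuation matter. A hypothetical illustration follows; the keys and their successor words here are invented, not taken from the real corpus.

```python
# Hypothetical model fragment: successor words are invented for illustration.
model = {
    ("Of", "the"): ["six"],   # sentence-initial: capitalized key
    ("of", "the"): ["deep"],  # mid-sentence: lowercase key
}

def seed_lookup(model, seed):
    """Exact-match lookup: no fuzzy search, no case folding."""
    key = tuple(seed.split())
    if len(key) != 2:
        return None  # seeds must be exactly two words
    return model.get(key)  # None if the phrase never occurs in the corpus

seed_lookup(model, "Of the")  # ["six"]  -- distinct from "of the"
seed_lookup(model, "of the")  # ["deep"]
seed_lookup(model, "OF THE")  # None: exact case required
```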

Further Reading