I fed 24 years of my blog posts to a Markov model
Posted by zdw 1 day ago
Comments
Comment by srean 1 day ago
A Markov Model is anything that has state, emits tokens based only on its current state, and undergoes state transitions. The token emissions and state transitions are usually probabilistic -- a statistical/probabilistic analogue of a state machine. The deterministic state machine is a special case where the transition probabilities are degenerate (concentrated at a unique point).
For a Markov Model to be a non-vacuous, non-vapid discussion point, however, one needs to specify very precisely the relationships allowed between state and tokens/observations: whether the state is hidden or visible, discrete or continuous, fixed or variable context length, causal or non-causal ...
The simplest such model is one where the state is a specified, computable function of the last k observations. One such simple function is the identity function -- the state is then the last k tokens. This is called a k-th order Markov Chain and is a restriction of the bigger class -- Markov Models.
One can make the state a specified, computable function of the k previous states and the k most recent tokens/observations (equivalently, RNNs).
The functions may be specified only up to a class of computable functions, finite or infinite in size. They may be stochastic in the sense that they define only the state transition probabilities.
You can make the context length a computable function of the k most recent observations (therefore they can be of varying length), but you have to ensure that the contexts are always full for this model to be well defined.
Context length can be a computable function of both the l most recent states and the k most recent observations.
Crazy ones emit more than one token based on the current state.
On and on.
Not all Markov Models are learnable.
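A minimal sketch of the simplest case above -- a k-th order Markov Chain whose state is just the last k tokens (the corpus file name and k are placeholders):

    import random
    from collections import defaultdict

    def train(tokens, k=2):
        """Count next-token continuations for each k-token state."""
        table = defaultdict(list)
        for i in range(len(tokens) - k):
            table[tuple(tokens[i:i + k])].append(tokens[i + k])
        return table

    def generate(table, length=50):
        """Walk the chain: the state is always the last k emitted tokens."""
        state = random.choice(list(table))
        out = list(state)
        while len(out) < length:
            successors = table.get(state)
            if not successors:                    # dead end: restart from a random state
                state = random.choice(list(table))
                continue
            nxt = random.choice(successors)
            out.append(nxt)
            state = state[1:] + (nxt,)            # slide the k-token window
        return " ".join(out)

    tokens = open("corpus.txt").read().split()    # hypothetical corpus file
    print(generate(train(tokens, k=2)))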
Comment by canjobear 10 hours ago
Comment by srean 4 hours ago
It depends on whether the state is visible in the observations or not. Hidden or not is an orthogonal axis of variation compared to the other variations mentioned in the comment.
In a non-hidden model there is no ambiguity or uncertainty about what the current state is.
Comment by NuclearPM 21 hours ago
Comment by srean 21 hours ago
The most basic/naive one is where one can estimate the unknown parameters of the model given example token streams generated by the model.
Comment by TomatoCo 18 hours ago
Comment by igorkraw 16 hours ago
Identifiability means that out of all possible models, you can learn the correct one given enough samples. Causal identifiability has some other connotations.
See here https://causalai.net/r80.pdf as a good start (a node in a causal graph is Markov given its parents, and a k-step Markov chain is a k-layer causal DAG)
Comment by sebastianmestre 1 day ago
> Itheve whe oiv v f vidleared ods alat akn atr. s m w bl po ar 20
Using pairs of consecutive characters (order-2 Markov model) helps, but not much:
> I hateregratics.pyth fwd-i-sed wor is wors.py < smach. I worgene arkov ment by compt the fecompultiny of 5, ithe dons
Triplets (order 3) are a bit better:
> I Fed tooks of the say, I just train. All can beconsist answer efferessiblementate
> how examples, on 13 Debian is the more M-x: Execute testeration
LLMs usually do some sort of tokenization step prior to learning parameters. So I decided to try out order-1 Markov models over text tokenized with byte pair encoding (BPE).
Trained on TFA I got this:
> I Fed by the used few 200,000 words. All comments were executabove. This value large portive comment then onstring takended to enciece of base for the see marked fewer words in the...
Then I bumped up the order to 2
> I Fed 24 Years of My Blog Posts to a Markov Model
> By Susam Pal on 13 Dec 2025
>
> Yesterday I shared a little program calle...
It just reproduced the entire article verbatim. This makes sense: BPE run to completion merges any token pair that repeats, so every remaining bigram occurs only once and the order-2 Markov transitions become fully deterministic.
I've heard that in NLP applications, it's very common to run BPE only up to a certain number of different tokens, so I tried that out next.
Before limiting, BPE was generating 894 distinct tokens. Even a slight limit (800) stops it from being deterministic.
> I Fed 24 years of My Blog Postly coherent. We need to be careful about not increasing the order too much. In fact, if we increase the order of the model to 5, the generated text becomes very dry and factual
It's hard to judge how coherent the text is vs the author's trigram approach because the text I'm using to initialize my model has incoherent phrases in it anyways.
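For reference, the idea can be sketched like this (a sketch only; I'm using the Hugging Face `tokenizers` library as one possible BPE implementation, and `article.txt` is a stand-in for TFA's text -- the comment above doesn't say which BPE implementation was actually used):

    import random
    from collections import defaultdict
    from tokenizers import Tokenizer
    from tokenizers.models import BPE
    from tokenizers.pre_tokenizers import Whitespace
    from tokenizers.trainers import BpeTrainer

    # Train BPE with a capped vocabulary so order-2 doesn't become deterministic.
    tok = Tokenizer(BPE(unk_token="[UNK]"))
    tok.pre_tokenizer = Whitespace()
    tok.train(["article.txt"], BpeTrainer(vocab_size=800, special_tokens=["[UNK]"]))

    ids = tok.encode(open("article.txt").read()).ids

    # Order-2 Markov chain over token ids: state = previous two tokens.
    table = defaultdict(list)
    for a, b, c in zip(ids, ids[1:], ids[2:]):
        table[(a, b)].append(c)

    state = random.choice(list(table))
    out = list(state)
    for _ in range(200):
        nxt = random.choice(table[state]) if state in table else random.choice(ids)
        out.append(nxt)
        state = (state[1], nxt)
    print(tok.decode(out))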
Anyways, Markov models are a lot of fun!
Comment by andai 1 day ago
I'm considering just deleting, from the db, all tokens that have only one possible descendant. I think that would solve that problem. I could also raise that threshold so that, e.g., a token needs to have at least 3 possible outputs.
However, that's too heavy-handed: there are a lot of phrases or grammatical structures that would get deleted by that. What I'm actually trying to avoid is long chains where there's only one next token. I haven't figured out how to solve that yet, though.
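For the simple version, the filter itself would just be something like this (a sketch; it assumes the db is a mapping from each token to the list of successors observed for it):

    # Keep only tokens with enough distinct continuations.
    MIN_BRANCHES = 3

    pruned = {
        token: successors
        for token, successors in table.items()
        if len(set(successors)) >= MIN_BRANCHES
    }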
Comment by vunderba 1 day ago
You'll also need a "sort of traversal stack" so you can rewind if you get stuck several plies in.
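Something along these lines, say (a rough sketch for an order-1 table mapping each token to its surviving successors after pruning):

    import random

    def generate_with_backtracking(table, start, length=50):
        # Keep a stack of untried choices per position so we can rewind
        # several plies when a path dead-ends in the pruned table.
        stack = [list(table.get(start, []))]
        out = [start]
        while len(out) < length and stack:
            untried = stack[-1]
            if not untried:              # stuck: rewind one ply
                stack.pop()
                out.pop()
                continue
            nxt = untried.pop(random.randrange(len(untried)))
            out.append(nxt)
            stack.append(list(table.get(nxt, [])))
        return out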
Comment by countWSS 1 day ago
Comment by Tallain 1 day ago
Comment by yard2010 1 day ago
Comment by sebastianmestre 1 day ago
Comment by travisjungroth 1 day ago
Comment by samus 19 hours ago
Comment by sebastianmestre 15 hours ago
Comment by vunderba 1 day ago
I used it as a kind of “dream well” whenever I wanted to draw some muse from the same deep spring. It felt like a spiritual successor to what I used to do as a kid: flipping to a random page in an old 1950s Funk & Wagnalls dictionary and using whatever I found there as a writing seed.
Comment by wrp 13 hours ago
[0] https://archive.org/details/Babble_1020, https://vetusware.com/download/Babble%21%202.0/?id=11924
Comment by Tallain 1 day ago
Comment by vunderba 1 day ago
The only thing I'm a bit wary of is the submission size - a minimum of 50,000 words. At that length, it'd be really difficult to maintain a cohesive story without manual oversight.
Comment by davely 1 day ago
It was pretty fun!
Comment by boznz 1 day ago
Comment by bitwize 1 day ago
Comment by echelon 1 day ago
I spend all of my time with image and video models and have very thin knowledge when it comes to running, fine tuning, etc. with language models.
How would one start with training an LLM on the entire corpus of one's writings? What model would you use? What scripts and tools?
Has anyone had good results with this?
Do you need to subsequently add system prompts, or does it just write like you out of the box?
How could you make it answer your phone, for instance? Or discord messages? Would that sound natural, or is that too far out of domain?
Comment by ipaddr 1 day ago
You could use a vector database.
You could train a model from scratch.
Probably easiest to use OpenAI tools. Upload documents. Make custom model.
How do you make it answer your phone? You could use the Twilio API + a script + an LLM + a voice model. Want it to sound natural? Use a service.
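If "make a custom model" means fine-tuning rather than a custom GPT, the hosted route looks roughly like this (a sketch with the OpenAI Python SDK; the JSONL file name, example format, and model name are assumptions to check against the current docs):

    # Sketch: fine-tune a hosted model on your own writing.
    # Each line of my_writing.jsonl is a chat-format example, e.g.
    # {"messages": [{"role": "user", "content": "Write a post about X"},
    #               {"role": "assistant", "content": "<one of your posts>"}]}
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    training_file = client.files.create(
        file=open("my_writing.jsonl", "rb"),
        purpose="fine-tune",
    )

    job = client.fine_tuning.jobs.create(
        training_file=training_file.id,
        model="gpt-4o-mini-2024-07-18",  # assumed; use whatever is currently fine-tunable
    )
    print(job.id)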
Comment by echelon 1 day ago
Wouldn't fine tuning produce better results so long as you don't catastrophically forget? You'd preserve more context window space, too, right? Especially if you wanted it to memorize years of facts?
Are LoRAs a thing with LLMs?
Could you train certain layers of the model?
Comment by dannyw 1 day ago
Comment by idiotsecant 1 day ago
Comment by vunderba 1 day ago
The problem with that is either your n-gram level is too low in which case it can't maintain any kind of cohesion, or your n-gram level is too high and it's basically just spitting out your existing corpus verbatim.
For me, I was more interested in something that could potentially combine two or three highly disparate concepts found in my previous works into a single outputted sentence - and then I would ideate upon it.
I haven't opened the program in a long time, so I just spun it up and generated a few outputs:
A giant baby is navel corked which if removed causes a vacuum.
I'm not sure what the original pieces of text were that this particular sentence was based on, but it starts making me think about a kind of strange void Harkonnen with heart plugs that lead to weird negatively pressurized areas. That's the idea behind the dream well.
Comment by kqr 22 hours ago
Very The Age of Wire and String.
Comment by user_7832 1 day ago
Here’s a link: https://botnik.org/content/harry-potter.html
Comment by mattacular 23 hours ago
It is hollow text. It has no properties of what I'd want to get out of even the worst book produced by human minds.
Even more sophisticated models have a ceiling of pablum.
Comment by grahamnorton39 22 hours ago
That said, it’s obviously not to everyone’s tastes!
Comment by user_7832 22 hours ago
Comment by cluckindan 8 hours ago
Comment by monoidl 1 day ago
IIRC there was some research on "infini-gram", a very large n-gram model that allegedly got performance close to LLMs in some domains, a couple of years back.
Comment by Legend2440 17 hours ago
It achieved state-of-the-art performance at tasks like spelling correction at the time. However, unlike an LLM, it can't generalize at all; if an n-gram isn't in the training corpus it has no idea how to handle it.
https://research.google/blog/all-our-n-gram-are-belong-to-yo...
Comment by tqian 8 hours ago
Ngrams are surprisingly powerful for how little computation they require. They can be trained in seconds even with tons of data.
Comment by GarnetFloride 1 day ago
Comment by LanceH 1 day ago
Comment by lacunary 1 day ago
Comment by frumiousirc 1 day ago
Comment by lacunary 22 hours ago
Comment by nurettin 1 day ago
Comment by pavel_lishin 1 day ago
Comment by lloydatkinson 1 day ago
Comment by hilti 1 day ago
Giving 24 years of your experience, thoughts, and lifetime to us.
This is special in these times of wondering, baiting, and consuming only.
Comment by rumgewieselt 16 hours ago
Comment by Aperocky 1 day ago
npm package of the markov model if you just want to play with it on localhost/somewhere else: https://github.com/Aperocky/weighted-markov-generator
Comment by anthk 1 day ago
Comment by OuterVale 1 day ago
Comment by kazinator 1 day ago
Comment by litver 20 hours ago
Comment by Peteragain 1 day ago
Comment by microtonal 1 day ago
I recommend reading Bengio et al.'s 2003 paper, which describes this issue in more detail and introduces distributional representations (embeddings) in a neural language model to avoid this sparsity.
While we are using transformers and sentence pieces now, this paper aptly describes the motivation underpinning modern models.
Comment by Peteragain 1 day ago
Comment by microtonal 1 day ago
Distributional representations, not distributed.
https://en.wikipedia.org/wiki/Distributional_semantics#Distr...
Comment by frizlab 1 day ago
Comment by andai 1 day ago
Except we fine-tuned GPT-2 instead. (As was the fashion at the time!)
We used this one, I think https://github.com/minimaxir/gpt-2-simple
I think it took 2-3 hours on my friend's Nvidia something.
The result was absolutely hilarious. It was halfway between a markov chain and what you'd expect from a very small LLM these days. Completely absurd nonsense, yet eerily coherent.
Also, it picked up enough of our personality and speech patterns to shine a very low resolution mirror on our souls...
###
Andy: So here's how you get a girlfriend:
1. Start making silly faces
2. Hold out your hand for guys to swipe
3. Walk past them
4. Ask them if they can take their shirt off
5. Get them to take their shirt off
6. Keep walking until they drop their shirt
Andy: Can I state explicitly this is the optimal strategy
Comment by Tepix 1 day ago
Comment by msapaydin 1 day ago
Comment by hexnuts 1 day ago
Comment by jacquesm 1 day ago
"Do me a favor, boy. This scam of yours, when it's over, you erase this god-damned thing."
Comment by swyx 1 day ago
Comment by 0_____0 1 day ago
Comment by bitwize 1 day ago
Comment by fragmede 1 day ago
Comment by keithalewis 10 hours ago
Comment by ikhatri 1 day ago
Comment by anthk 1 day ago
Usage:
hailo -t corpus.txt -b brain.brn
Where "corpus.txt" should be a file with one sentence per line.
Easy to do under sed/awk/perl.
hailo -b brain.brn
This spawns the chatbot with your trained brain. By default Hailo chooses the easy engine. If you want something more "realistic", pick the advanced one mentioned at 'perldoc hailo' with the -e flag.
Comment by pessimizer 19 hours ago
https://archive.org/details/Babble_1020
A fairly prescient example of how long ago 4 years was:
https://forum.winworldpc.com/discussion/12953/software-spotl...
Comment by manthangupta109 1 day ago
Comment by anthk 1 day ago
cpanm -n local::lib
cpanm -n Hailo
~/perl5/bin/hailo -E Scored -t corpus.txt -b brain.brn
~/perl5/bin/hailo -b brain.brn
As corpus.txt, you can use a book from Gutenberg run through a Perl/sed command, for instance (one sentence per line).
I forgot to put the '-E' flag in my previous comments, so here it is. It selects a more 'complex' engine, so the text output looks less like gibberish.
Comment by delfugal 17 hours ago
Comment by mamma5211 1 day ago
Comment by huflungdung 13 hours ago
Comment by atum47 1 day ago
Comment by pavel_lishin 1 day ago
Respectfully, absolutely nobody wants to read a copy-and-paste of a chat session with ChatGPT.
Comment by atum47 1 day ago
I was having a discussion about similarities between Markov Chains and LLMs, and shortly after I found this topic on HN. When I wrote "I can share if you like", it was meant as proof of the coincidence.
Comment by empiko 1 day ago
Comment by famouswaffles 1 day ago
Comment by chpatrick 1 day ago
Comment by famouswaffles 1 day ago
The problem is that this definition strips away what makes Markov models useful and interesting as a modeling framework. A “Markov text model” is a low-order Markov model (e.g., n-grams) with a fixed, tractable state and transitions based only on the last k tokens. LLMs aren't that: they model using unfixed, long-range context (up to the window). For Markov chains, k is non-negotiable. It's a constant, not a variable. Once you make it a variable, nearly any process can be described as Markovian, and the word is useless.
Comment by chpatrick 1 day ago
Comment by sigbottle 1 day ago
And in classes, the very first trick you learn to skirt around history is to add Boolean variables to your "memory state". Your system now models "did it rain on each of the previous N days?" The issue, obviously, is that this is exponential if you're not careful. Maybe you can get clever by just making your state a "sliding window history"; then it's linear in the number of days you remember. Maybe mix both. Maybe add even more information. Tradeoffs, tradeoffs.
I don't think LLMs embody the markov property at all, even if you can make everything eventually follow the markov property by just "considering every single possible state". Of which there are (size of token set)^(length) states at minimum because of the KV cache.
Comment by chpatrick 1 day ago
Comment by sigbottle 1 day ago
The Markov property states that the transition probabilities to your next state depend entirely on your current state.
These states inhabit a state space. The way you encode "memory" if you need it -- say you need to remember whether it rained on each of the last 3 days -- is by expanding said state space. In that case you'd go from 1 bit of state to 3 bits, i.e. 2^3 states if you need the precise binary information for each day. Being "clever", maybe you assume only the number of days it rained in the past 3 days matters, and you can get a 'linear' amount of memory.
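Concretely, the 3-day version looks something like this (toy transition numbers, just to show the state-space expansion):

    import itertools, random

    # Encode "memory" of the last 3 days by expanding the state space to
    # every 3-bit history: 2^3 = 8 states instead of 1 bit of "today".
    states = list(itertools.product([0, 1], repeat=3))
    print(len(states), "states")

    def p_rain_today(history):
        # made-up transition rule: more recent rain -> rain more likely today
        return 0.2 + 0.2 * sum(history)

    def step(state):
        rain_today = random.random() < p_rain_today(state)
        return state[1:] + (int(rain_today),)   # slide the 3-day window

    state = (0, 0, 0)
    for _ in range(10):
        state = step(state)
        print(state)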
Sure, an LLM is a "markov chain" of state space size (# tokens)^(context length), at minimum. But that's not a helpful abstraction, and it defeats the original purpose of the Markov observation. The entire point of the Markov observation is that you can represent a seemingly huge predictive model with just a couple of variables in a discrete state space, and ideally you're the clever programmer/researcher and can significantly collapse said space by being, well, clever.
Are you deliberately missing the point or what?
Comment by chpatrick 1 day ago
Okay, so we're agreed.
Comment by famouswaffles 1 day ago
Again, no they can't, unless you break the definition. K is not a variable. It's as simple as that. The state cannot be flexible.
1. The markov text model uses k tokens, not k tokens sometimes, n tokens other times and whatever you want it to be the rest of the time.
2. A Markov model is explicitly described as 'assuming that future states depend only on the current state, not on the events that occurred before it'. Defining your 'state' such that every event imaginable can be captured inside it is a 'clever' workaround, but it is ultimately describing something that is decidedly not a Markov model.
Comment by chpatrick 1 day ago
Comment by famouswaffles 1 day ago
2. “Fixed-size block” is a padding detail, not a modeling assumption. Yes, implementations batch/pad to a maximum length. But the model is fundamentally conditioned on a variable-length prefix (up to the cap), and it treats position 37 differently from position 3,700 because the computation explicitly uses positional information. That means the conditional distribution is not a simple stationary “transition table” the way the n-gram picture suggests.
3. “Same as a lookup table” is exactly the part that breaks. A classic n-gram Markov model is literally a table (or smoothed table) from discrete contexts to next-token probabilities. A transformer is a learned function that computes a representation of the entire prefix and uses that to produce a distribution. Two contexts that were never seen verbatim in training can still yield sensible outputs because the model generalizes via shared parameters; that is categorically unlike n-gram lookup behavior.
I don't know how many times I have to spell this out for you. Calling LLMs markov chains is less than useless. They don't resemble them in any way unless you understand neither.
Comment by chpatrick 1 day ago
Comment by saithound 1 day ago
Comment by famouswaffles 1 day ago
My response to both of you is the same.
LLMs do depend on previous events, but you say they don't because you've redefined state to include previous events. It's a circular argument. In a Markov chain, state is well defined, not something you can insert any property you want to or redefine as you wish.
It's not my fault neither of you understand what the Markov property is.
Comment by chpatrick 1 day ago
Comment by famouswaffles 22 hours ago
They don't, because n-gram orders are too small and rigid to include the history in the general case.
I think srean's comment up the thread is spot on. This current situation, where the state can be anything you want it to be, just does not make for a productive conversation.
Comment by famouswaffles 1 day ago
My point, which seems so hard to grasp for whatever reason, is that in a Markov chain, state is a well-defined thing. It's not a variable you can assign any property to.
LLMs do depend on the previous path taken. That's the entire reason they're so useful! And the only reason you say they don't is because you've redefined 'state' to include that previous path! It's nonsense. Can you not see the circular argument?
The state is required to be a fixed, well-defined element of a structured state space. Redefining the state as an arbitrarily large, continuously valued encoding of the entire history is a redefinition that trivializes the Markov property, which a Markov chain should satisfy. Under your definition, any sequential system can be called Markov, which means the term no longer distinguishes anything.
Comment by wizzwizz4 1 day ago
Comment by chpatrick 1 day ago
Comment by srean 1 day ago
Comment by sophrosyne42 1 day ago
Comment by ben_w 1 day ago
An LLM could be implemented with a Markov chain, but the naïve matrix is ((vocab size)^(context length))^2, which is far too big to fit in this universe.
Like, the Bekenstein bound means that if you write down the transition matrix for an LLM with just 4k context (and 50k vocabulary) at just one bit of resolution, the first row alone (out of a bit more than 10^18795 rows) ends up needing a black hole >10^9800 times larger than the observable universe.
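The row count checks out as a quick back-of-the-envelope (taking "4k context" as 4,000 tokens):

    import math

    vocab, context = 50_000, 4_000            # 50k vocabulary, "4k" context
    log10_rows = context * math.log10(vocab)  # rows = vocab ** context
    print(log10_rows)                         # ~18795.9, i.e. a bit more than 10^18795 rows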
Comment by sophrosyne42 15 hours ago
The case for brain states and ideas is similar to QM and massive objects. While certain metaphysical presuppositions might hold that everything must be physical and describable by models for physical things, science, which should eschew metaphysical assumptions, has not shown that to be the case.
Comment by arboles 1 day ago
If you use syllable-level tokens in a Markov model, the model can't form real words much beyond the second syllable, and you have no way of making it make more sense other than increasing the token size, which exponentially decreases originality. This is the simplest way I can explain it, though I had to address why scaling doesn't work.
[1] There are 400,000^4 possible 4-word sequences in English (barring grammar), meaning only a corpus with 8 times that many words and with no repetition could offer two ways to chain each possible 4-word sequence.
Comment by cwyers 1 day ago
* LLMs don't use Markov chains,
* LLMs don't predict words.
Comment by arboles 1 day ago
* The R package markovchain[1] may look like it's using Markov chains, but it's actually using the R programming language, zeros and ones.
[1] https://cran.r-project.org/web/packages/markovchain/index.ht...
Comment by srean 1 day ago
Comment by empiko 1 day ago
Comment by srean 1 day ago
MCs have a constant and finite context length, their state is the most recent k-tuple of emitted alphabet symbols, and their transition probabilities are invariant (to time and to the tokens emitted)
Comment by empiko 21 hours ago
Comment by srean 20 hours ago
Comment by empiko 20 hours ago
Comment by srean 19 hours ago
You can certainly feed k-grams one at a time to estimate the probability distribution over the next token, use that to simulate a Markov Chain, and reinitialize the LLM (drop context) each time. In this process the LLM is just a lookup table used to simulate your MC.
But an LLM on its own doesn't drop context to generate; its transition probabilities change depending on the tokens emitted so far.
Comment by atum47 1 day ago
Hate to be that guy, but I remember this place being nicer.
Comment by roarcher 1 day ago
Everyone has access to ChatGPT. If we wanted its "opinion" we could ask it ourselves. Your offer is akin to "Hey everyone, want me to Google this and paste the results page here?". You would never offer to do that. Ask yourself why.
These posts are low-effort and add nothing to the conversation, yet the people who write them seem to expect everyone to be impressed by their contribution. If you can't understand why people find this irritating, I'm not sure what to tell you.
Comment by pavel_lishin 20 hours ago