A language model that emits raw VM opcodes instead of text

Posted by ilbert 10 hours ago


A few months ago I posted asking why AI agents control machines through human text instead of emitting machine instructions directly. I've now built a demo of that idea.

I replaced the decoder head of a Qwen 1.5B with a small cross-attention head (38M params) that emits raw CHIP-8 opcodes. The LLM encodes the instruction once and never generates a token. The head outputs machine instructions by attending to the actual machine state.
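The repo has the real implementation; as a rough mental model, here is a minimal numpy sketch of a single cross-attention step where the machine state is the query and the frozen LLM's hidden states are the keys/values. All dimensions, names, and the output parameterization (16 logits) are made up for illustration; the actual 38M-param head is surely deeper and decodes opcodes differently.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def opcode_head(state_q, llm_h, Wq, Wk, Wv, Wo):
    """One cross-attention step: the machine state queries the frozen
    instruction encoding; the output is logits for the next opcode.
    (Illustrative sketch, not the actual architecture.)"""
    Q = state_q @ Wq                                  # (1, dk) query from emulator state
    K = llm_h @ Wk                                    # (T, dk) keys from LLM hidden states
    V = llm_h @ Wv                                    # (T, dk) values from LLM hidden states
    attn = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))    # (1, T) attention over instruction tokens
    return (attn @ V) @ Wo                            # (1, n_out) opcode logits

rng = np.random.default_rng(0)
d, dk, T, n_out = 64, 32, 10, 16                      # toy dimensions
logits = opcode_head(rng.normal(size=(1, d)),         # machine-state query
                     rng.normal(size=(T, d)),         # frozen instruction encoding
                     rng.normal(size=(d, dk)), rng.normal(size=(d, dk)),
                     rng.normal(size=(d, dk)), rng.normal(size=(dk, n_out)))
```

The point of this shape is that the LLM runs once as an encoder and the head re-attends to that fixed encoding at every emulator step, which is why per-opcode latency can stay in the low milliseconds.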

It handles arithmetic with BCD, subroutine calls, timer wait loops, and conditional branching, at 1-3 ms per opcode. Every opcode executes on a real CHIP-8 emulator.
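For readers unfamiliar with CHIP-8: BCD comes from the real Fx33 opcode, which stores the decimal digits of register VX at memory locations I, I+1, I+2 so they can be drawn as digit sprites. A minimal Python implementation of just that opcode:

```python
def op_fx33(vx, memory, i):
    """CHIP-8 Fx33: store the BCD digits of register VX at memory[I..I+2]."""
    memory[i]     = vx // 100        # hundreds digit
    memory[i + 1] = (vx // 10) % 10  # tens digit
    memory[i + 2] = vx % 10          # ones digit

mem = [0] * 4096                     # CHIP-8 has 4 KB of RAM
op_fx33(137, mem, 0x300)
# mem[0x300:0x303] is now [1, 3, 7]
```

This is why an "add two numbers and draw the result" program needs BCD at all: the sum lives in a register as binary, and Fx33 is the bridge to displayable digits.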

Interesting failure: "3 plus 5" works and draws 8, but "two plus three" produces wrong operands. The frozen LLM's hidden states for "two" and "2" are nearly orthogonal (cosine sim 0.09) in this context. Removing the decoder apparently removes the path the LLM would use to bridge word-form and digit-form numbers.
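The 0.09 figure above comes from the actual Qwen hidden states; the measurement itself is just cosine similarity between the two vectors. A self-contained sketch with made-up vectors standing in for the real hidden states:

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two hidden-state vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Stand-ins for the real hidden states of "two" and "2"; near-orthogonal
# vectors like these give a similarity close to 0, as observed.
h_word  = np.array([1.0, 0.0, 0.1])
h_digit = np.array([0.0, 1.0, 0.1])
print(cosine_sim(h_word, h_digit))   # small value, near 0
```

A similarity near zero means the head, which only sees these frozen vectors, has no linear handle connecting the word form to the digit form.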

Demo and code: https://github.com/ilbertt/reflex

Comments

Comment by turtleyacht 10 hours ago

What was its result for two plus three?

Comment by ilbert 9 hours ago

It produces an arithmetic program but with wrong operands. The frozen LLM's hidden states for "two" and "2" are nearly orthogonal (cosine sim 0.09) in this context, so the head can't extract the right numbers. "2 plus 3" works fine and draws 5. The model understands the task structure but can't bridge word-form to digit-form without token generation