A language model that emits raw VM opcodes instead of text

Posted by ilbert 10 hours ago


A few months ago I posted asking why AI agents control machines through human text instead of emitting machine instructions directly. I've now built a demo of that idea.

I replaced the decoder head of a Qwen 1.5B with a small cross-attention head (38M params) that emits raw CHIP-8 opcodes. The LLM encodes the instruction once and never generates a token. The head outputs machine instructions by attending to the actual machine state.
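The repo has the real implementation; as a rough mental model, here is a minimal numpy sketch of a single cross-attention step where the machine state is the query and the frozen LLM's hidden states are the keys/values. All dimensions, names, and the output parameterization (16 logits) are made up for illustration; the actual 38M-param head is surely deeper and decodes opcodes differently.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def opcode_head(state_q, llm_h, Wq, Wk, Wv, Wo):
    """One cross-attention step: the machine state queries the frozen
    instruction encoding; the output is logits for the next opcode.
    (Illustrative sketch, not the actual architecture.)"""
    Q = state_q @ Wq                                  # (1, dk) query from emulator state
    K = llm_h @ Wk                                    # (T, dk) keys from LLM hidden states
    V = llm_h @ Wv                                    # (T, dk) values from LLM hidden states
    attn = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))    # (1, T) attention over instruction tokens
    return (attn @ V) @ Wo                            # (1, n_out) opcode logits

rng = np.random.default_rng(0)
d, dk, T, n_out = 64, 32, 10, 16                      # toy dimensions
logits = opcode_head(rng.normal(size=(1, d)),         # machine-state query
                     rng.normal(size=(T, d)),         # frozen instruction encoding
                     rng.normal(size=(d, dk)), rng.normal(size=(d, dk)),
                     rng.normal(size=(d, dk)), rng.normal(size=(dk, n_out)))
```

The point of this shape is that the LLM runs once as an encoder and the head re-attends to that fixed encoding at every emulator step, which is why per-opcode latency can stay in the low milliseconds.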

It handles arithmetic with BCD, subroutine calls, timer wait loops, and conditional branching, at 1-3 ms per opcode. Every opcode executes on a real CHIP-8 emulator.
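For readers unfamiliar with CHIP-8: BCD comes from the real Fx33 opcode, which stores the decimal digits of register VX at memory locations I, I+1, I+2 so they can be drawn as digit sprites. A minimal Python implementation of just that opcode:

```python
def op_fx33(vx, memory, i):
    """CHIP-8 Fx33: store the BCD digits of register VX at memory[I..I+2]."""
    memory[i]     = vx // 100        # hundreds digit
    memory[i + 1] = (vx // 10) % 10  # tens digit
    memory[i + 2] = vx % 10          # ones digit

mem = [0] * 4096                     # CHIP-8 has 4 KB of RAM
op_fx33(137, mem, 0x300)
# mem[0x300:0x303] is now [1, 3, 7]
```

This is why an "add two numbers and draw the result" program needs BCD at all: the sum lives in a register as binary, and Fx33 is the bridge to displayable digits.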

Interesting failure: "3 plus 5" works and draws 8, but "two plus three" produces wrong operands. The frozen LLM's hidden states for "two" and "2" are nearly orthogonal (cosine sim 0.09) in this context. Removing the decoder apparently removes the path the LLM would use to bridge word-form and digit-form numbers.
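The 0.09 figure above comes from the actual Qwen hidden states; the measurement itself is just cosine similarity between the two vectors. A self-contained sketch with made-up vectors standing in for the real hidden states:

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two hidden-state vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Stand-ins for the real hidden states of "two" and "2"; near-orthogonal
# vectors like these give a similarity close to 0, as observed.
h_word  = np.array([1.0, 0.0, 0.1])
h_digit = np.array([0.0, 1.0, 0.1])
print(cosine_sim(h_word, h_digit))   # small value, near 0
```

A similarity near zero means the head, which only sees these frozen vectors, has no linear handle connecting the word form to the digit form.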

Demo and code: https://github.com/ilbertt/reflex

Comments

Comment by turtleyacht 10 hours ago

What was its result for two plus three?

Comment by ilbert 9 hours ago

It produces an arithmetic program but with wrong operands. The frozen LLM's hidden states for "two" and "2" are nearly orthogonal (cosine sim 0.09) in this context, so the head can't extract the right numbers. "2 plus 3" works fine and draws 5. The model understands the task structure but can't bridge word-form to digit-form without token generation