Bypassing the kernel for 56ns cross-language IPC

Posted by riyaneel 4 days ago

Comments

Comment by riyaneel 4 days ago

I am the author of this library. The goal was to reach RAM-speed communication between independent processes (C++, Rust, Python, Go, Java, Node.js) without any serialization overhead or kernel involvement on the hot path.

I managed to hit a p50 round-trip time of 56.5 ns (for 32-byte payloads) and a throughput of ~13.2M RTT/sec on a standard CPU (i7-12650H).

Here are the primary architectural choices that make this possible:

- Strict SPSC & No CAS: I went with a strict Single-Producer Single-Consumer topology. There are no compare-and-swap loops on the hot path. acquire_tx and acquire_rx are essentially just a load, a mask, and a branch using memory_order_acquire / release.

- Hardware Sympathy: Every control structure (message headers, atomic indices) is padded to 128-byte boundaries. False sharing between the producer and consumer cache lines is structurally impossible.

- Zero-Copy: The hot path is entirely in a memfd shared memory segment after an initial Unix Domain Socket handshake (SCM_RIGHTS).

- Hybrid Wait Strategy: The consumer spins for a bounded threshold using cpu_relax(), then falls back to a sleep via SYS_futex (Linux) or __ulock_wait (macOS) to prevent CPU starvation.

The core is C++23, and it exposes a C ABI to bind the other languages.

I am sharing this here for anyone building high-throughput polyglot architectures and dealing with cross-language ingestion bottlenecks.

Comment by nly 2 days ago

Your statements on queues ignore the state of the art

> MPSC (multiple-producer single-consumer) requires a compare-and-swap loop on the head pointer so that two producers can each reserve a contiguous slot without overlap.

Martin Thompsons designs, as used in Aerons logbuffer implementation, don't require a CAS retry loop. Multiple producers can reserve and commit concurrently without blocking one another.

The trade off is you must carefully decide on an upper bound for message size and the number of producer threads (in the hundreds typically). A caretaker thread also needs to run periodically to reclaim/zero memory off the hot path. Typically though, this isn't a problem.

Aeron itself, which you compare at ~250ns, I think (not entirely sure) is paying the price for being multi consumer as well as multi producer, and perhaps implementing flow control to pace producers. You can turn off multi producer by using an exclusive publication, which eliminates one atomic RMW operation on reserve. I'm not sure where the other nanos are lost.

For SPSC, doing away with 2 atomic shared counters and moving to a single counter + inline headers is a win for thread to thread latency. The writer only needs to read the readers new position from a shared cache line when it believes the queue is full. The reader can batch writes to this counter, so it doesn't have to write to memory at all most of the time. The writer has one fewer contended cache line to write to in general since the header lives in the first cacheline of the message, which it's writing anyway. This is where the win comes from under contention (when the queue is ~empty)

Comment by riyaneel 2 days ago

You're right on the MPSC point, the ADR overstates it. Aeron's claim-based approach uses a single fetch_add per producer, no retry loop. The real constraints are bounded message size upfront and a caretaker thread for reclamation, not a CAS retry. The wording needs fixing. On the SPSC counter argument, Tachyon already does most of what you describe: inline headers, head/tail on separate 128-byte cache lines, cached tail only reloaded on apparent fullness, tail writes amortized across 32 messages. If you have numbers comparing the single-counter approach against this specific layout I'd be genuinely curious.

Comment by nly 2 days ago

The main issue with dual counters is that most of the time, in low latency usecases, your consumer is ~1 message behind the producer.

This means your consumer isn't getting a lot of benefit from caching the producers position. The queue appears empty the majority of the time and it has to re-load the counter (causing it to claim the cacheline).

Meanwhile the producer goes to write message N+1 and update the counter again, and has to claim it back (S to M in MESI), when it could have just set a completion flag in the message header that the consumer hasn't touched in ages (since the ring buffer last lapped). And it's just written data to this line anyway so already has it exclusively.

So when your queue is almost always empty, this counter is just another cache line being ping ponged between cores.

This gets back to Aeron. In Aerons design the reader can get ahead of the writer and it's safe.

Comment by riyaneel 2 days ago

Fair point on the head cache line. Tachyon's target is cross-language zero-copy IPC, not squeezing the last nanosecond out of a pure C++ ring. Different tradeoff.

Comment by amluto 1 day ago

I gave this a quick skim, and:

> - Strict SPSC & No CAS: I went with a strict Single-Producer Single-Consumer topology. There are no compare-and-swap loops on the hot path. acquire_tx and acquire_rx are essentially just a load, a mask, and a branch using memory_order_acquire / release.

> - Hybrid Wait Strategy: The consumer spins for a bounded threshold using cpu_relax(), then falls back to a sleep via SYS_futex (Linux) or __ulock_wait (macOS) to prevent CPU starvation.

You can't actually achieve both of these at once, right? In "pure_spin" mode you can write without seq_cst, but in hybrid wait mode you need some seq_cst operation to avoid a race that would cause you to fail to wake the consumer, I think. This is an IMO obnoxious general problem with any sort of lightweight wake operation, and I haven't seen a great solution. I wish there was one, and I imagine it would be doable with only smallish amounts of hardware help or maybe even very clever kernel help. And you can avoid it (at extreme) cost with membarrier(), but I struggle to imagine the use case where it's a win, and it's certainly not a win in cases where you really want to avoid tail latency.

Comment by zekrioca 2 days ago

Why report p50 and not p95?

Comment by riyaneel 2 days ago

Tail latency p99.9 (122ns) are reported

Comment by etaioinshrdlu 2 days ago

This is super cool. If you're interested in this, you might enjoy this proof of concept as well: https://github.com/deepai-org/omnivm

It instead embeds a bunch of runtimes onto the same OS thread.

Comment by nnx 2 days ago

Looks really interesting but the Go binding sadly uses cgo. Could the binding be done in pure Go? Or at least purego (the cgo alternative using Go assembly for FFI) ?

Comment by yencabulator 1 day ago

So a shared-memory ringbuffer? Better make it clear that sender can perform TOCTOU attacks on the receiver. There seems to be a fuzz tester for the header, but the application logic would be the real target.

Comment by riyaneel 1 day ago

Exactly, the application logic is the target. Actually doing seccomp bpf base but for managed bindings (Java, Node, Go, ...) add a lot of complexity....

Comment by oasisaimlessly 1 day ago

What?

> Exactly, the application logic is the target. Actually doing seccomp bpf base but for managed bindings (Java, Node, Go, ...) add a lot of complexity....

Maybe proofread the slop before posting it next time?

Comment by riyaneel 1 day ago

Just having a bad english. But yes, the application logic is where the vulnerability can occur. I am adding support for seccomp-BPF but this is complicated for managed runtimes like Go, JVM, Node, Python.

Comment by BobbyTables2 2 days ago

Would be interesting to see performance comparisons between this and the alternatives considered like eventfd.

Sure, the “hot path” is probably very fast for all, but what about the slow path?

Comment by riyaneel 2 days ago

eventfd always pays a syscall on both sides (~200-400ns) regardless of load. Tachyon slow path only kick in under genuine starvation: the consumer spins first, then FUTEX_WAIT, and the producer skips FUTEX_WAKE entirely if the consumer still spinning. At sustainable rates the slow path never activates.

Comment by mananaysiempre 2 days ago

> eventfd always pays a syscall on both sides (~200-400ns) regardless of load.

It’s fairly standard to make the waiting side spin a bit after processing some data, and only issue another wait syscall if no more data arrives during the spin period.

(For instance, io_uring, which does this kind of IPC with a kernel thread on the receiving side, literally lets you configure how long said kernel thread should spin[1].)

[1] https://unixism.net/loti/tutorial/sq_poll.html

Comment by riyaneel 2 days ago

Fair point. The real difference is the narrower: with a futex the producer can inspect consumer_sleeping directly in shared memory and skip the FUTEX_WAKE entirely if the consumer is still spinning. With eventfd you need a write() regardless, or you add shared state to gate it, which is essentially rebuilding futex. Same idea but slightly less clean.

Comment by sunnypq 2 days ago

It's not really bypassing the kernel if you're using futexes, is it?

Comment by riyaneel 2 days ago

Exactly, futexes are only used if the consumer opts-in for the sleep phase. The low-latency hot path uses pure spin mode which completely bypasses kernel and fences.

Comment by sunnypq 2 days ago

Thanks for clarifying, I didn't spot that waiting on a futex was optional.

I wonder to what extent the performance would be affected with a middle-ground option to spin a few times and then call sched_yield() syscall before spinning again.

Comment by JSR_FDED 2 days ago

What would need to change when the hardware changes?

Comment by riyaneel 2 days ago

Absolutely not, the code following all Hardware principles (Cache coherence/locality, ...) not software abstraction. That not means the code is for a dedicated hardware but designed for modern CPUs.

Comment by ajb 2 days ago

Would be more convincing if you enumerated the assumptions. For example, 128b cache lines. Presumably, that is a speed assumption but not a soundness assumption.

Comment by riyaneel 1 day ago

Im currently adding doc, but 128b alignas are for the Adjacent Cache Line prefetcher and avoid to silence the MESI protocol

Comment by Onavo 1 day ago

Good work btw, beating Aeron is non trivial.

Comment by yc-kraln 2 days ago

How do you handle noisy neighbors?

Comment by riyaneel 2 days ago

Tachyon is lock-free and uses a strict alignment to avoid the MESI protocol, but relies on the environment for isolation. You still need core pinning and CPU isolation for true hardware determinism.

Comment by iberator 1 day ago

It's a shame to know that there are some amazing architectures back from 70" that handled IPC better. Nowadays is still a hard problem to tackle despite decades of x86

Comment by Fire-Dragon-DoL 2 days ago

Wow, congrats!

Comment by riyaneel 2 days ago

Thanks!

Comment by Fire-Dragon-DoL 2 days ago

I will be discussing this at work on monday, will let you know what they think.

I wouldn't be surprised if somebody develops a cross-language framework with this.

Comment by riyaneel 2 days ago

Would love to hear the feedback