A kernel bug froze my machine: Debugging an async-profiler deadlock
Posted by bluestreak 18 hours ago
Comments
Comment by SerCe 17 hours ago
[1]: https://www.youtube.com/watch?v=u7-S-Hn-7Do
[2]: https://netflixtechblog.com/netflix-flamescope-a57ca19d47bb
Comment by jerrinot 17 hours ago
Heatmaps are amazing for pattern spotting. I also use them when hunting irregular hiccups or outliers. More people should know about this feature.
Comment by kreelman 12 hours ago
Great that you had the time to be curious and dig into what was going on. QEMU is quite an amazing tool.
I'm kind of surprised there isn't a fairly robust kernel test around this issue, since it locks the machine down and I think the fix was to prevent a stuck CPU last time as well?
It's also vaguely surprising that this hasn't been encountered more often, particularly given the user https://news.ycombinator.com/user?id=everlier, commenting on this HN post, talking about "20-30 containers" running simultaneously and occasionally locking up the machine.
If you're still thinking about the bug a little, you could look over how other kernel tests work and implement a failing test around it....?
I imagine the tests have some way of detecting a locked-up kernel... I don't know exactly how they'd do it, but they probably have a technique. Most likely, since the kernel is literally stuck in a loop, it won't respond to anything, so starting any process, even one as simple as printing "Hello World!!", would fail and indicate the machine is locked up.
Perhaps this is one of those cases where something like UserModeLinux would allow a test to be easily put together, rather than spawning complete VMs via some kind of VM software. Again, would be interesting to know what Linux does with this kind of test.
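The "can a trivial process still be started?" check described above can be sketched roughly as follows. This is only an illustration, not how the kernel's own tests detect lockups: `probe_alive` and its timeout are hypothetical, and in a real test the probe would have to be driven from outside the machine under test (for example from the host of a VM), since a locked-up kernel cannot run the checker either.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if a trivial child process could be
 * created and reaped within timeout_ms, 0 otherwise. */
int probe_alive(int timeout_ms)
{
    pid_t pid = fork();
    if (pid < 0)
        return 0;              /* could not even create a process */
    if (pid == 0)
        _exit(0);              /* child: stands in for "Hello World!!" */

    const struct timespec step = { 0, 10 * 1000 * 1000 }; /* 10 ms */
    for (int waited = 0; waited <= timeout_ms; waited += 10) {
        int status;
        pid_t r = waitpid(pid, &status, WNOHANG);
        if (r == pid)          /* child finished in time */
            return WIFEXITED(status) && WEXITSTATUS(status) == 0;
        if (r < 0)
            return 0;          /* waitpid itself failed */
        nanosleep(&step, NULL);
    }

    /* Timed out: clean up the child and report failure. */
    kill(pid, SIGKILL);
    waitpid(pid, NULL, 0);
    return 0;
}
```

A test harness would call something like `probe_alive(2000)` after triggering the suspected bug and treat a 0 as "machine looks stuck".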
Comment by pjmlp 2 hours ago
Comment by ChuckMcM 17 hours ago
The bug being that the precedence of || is higher than the precedence of != ? Consider writing it as: if ((event->state != PERF_EVENT_STATE_ACTIVE) || (event->hw_state & PERF_HES_STOPPED))
This coming from a person who has too many scars from not parenthesizing my expressions in conditionals to ensure they work the way I meant them to work.
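For what it's worth, a quick way to check how C parses the unparenthesized condition is to compare it against the fully parenthesized form. In C, != binds tighter than &, which in turn binds tighter than ||, so the two forms should agree on every input. A minimal sketch; `STATE_ACTIVE` and `HES_STOPPED` are stand-in values, not the kernel's definitions:

```c
#include <stdbool.h>

/* Stand-in values for illustration; the real constants live in the
 * kernel headers and may differ. */
enum { STATE_ACTIVE = 1, HES_STOPPED = 0x01 };

/* Condition written without extra parentheses. */
bool cond_bare(int state, int hw_state)
{
    return state != STATE_ACTIVE || hw_state & HES_STOPPED;
}

/* Same condition, fully parenthesized. */
bool cond_parens(int state, int hw_state)
{
    return (state != STATE_ACTIVE) || (hw_state & HES_STOPPED);
}

/* Counts the inputs (over a small range) on which the two forms disagree. */
int count_mismatches(void)
{
    int mismatches = 0;
    for (int state = 0; state < 3; state++)
        for (int hw = 0; hw < 4; hw++)
            if (cond_bare(state, hw) != cond_parens(state, hw))
                mismatches++;
    return mismatches;
}
```

If `count_mismatches()` returns 0, the extra parentheses do not change behavior and are purely for readability, which is still a fine reason to add them.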
Comment by jerrinot 16 hours ago
Comment by unsnap_biceps 16 hours ago
Comment by anematode 15 hours ago
Comment by Artoooooor 5 hours ago
Comment by everlier 16 hours ago
Comment by jerrinot 15 hours ago
Comment by broken_broken_ 7 hours ago
I do not have much experience with it, but I think you can see the kernel call stack with it and I know you can also see the return value (in eax). That would be less effort than qemu + gdb + disabling kernel aslr, etc.
Comment by jerrinot 6 hours ago
Comment by bluuewhale 13 hours ago
This kind of "debugging journey" post is gold.
Comment by jerrinot 17 hours ago
Comment by snvzz 16 hours ago
Now, with the complexity (MLoCs!) of the Linux kernel, this is definitely not the only bug to be found in there.
This is why Linux is just an interim kernel for these use cases in which we still cannot use seL4[0].