This is the story of the time I wrote some code, deployed it to production, and ended up bricking the server it was running on by frying the kernel.

[Image: beautiful rendition of me frying the kernel]

Spoiler alert: This is a story about the perils of concurrency and race conditions. My code was nearly correct, but ultimately, there were two major bugs. This blog post gets into the weeds at times, but if you aren't very familiar with the Linux kernel, I think you will find it an informative story in how Linux keeps track of information under the hood, as well as how it uses lock-free programming techniques for synchronization when high performance is needed. It's also a painful story in how not to deploy code to production :) In this post, I assume you have some understanding of how files and concurrency work; I will try to explain everything else!

A nightmare begins

I have been working on building a graphical debugger for C Playground for some time, allowing users to run code in the browser and visualize how their programs execute. As part of this work, I had to implement a kernel module. (I'll explain more about this project in the next section.) After testing the code locally for a few months, I pushed it to production.

The next morning, I woke up to a text from my roommate: "I think the server crashed." Uh oh, that's not supposed to happen. I quickly pulled out my laptop and tried to SSH into the server to pull the logs, but to my surprise, I couldn't. I logged into DigitalOcean to restart the machine. While I was doing that, I noticed a spike in the server's CPU usage graph around the time my roommate texted me. Normally, if a process running on the machine is hogging the CPU, we should expect to see slight fluctuations around 100%, but that was not the case here.

After force-restarting the server via DigitalOcean and rolling back the debugging feature, I started going through logs to get a sense for what happened. My kernel module has print statements, and these get saved to the kernel logs. I scanned through this file, hoping to find some clues, but this only confused me even more: there was nothing in the kernel logs from any time near when the server locked up. The logs suggest my kernel module wasn't even running at that time. Yet this had to be a problem with my kernel module: nothing in user space could cause a computer to lock up the way it did, completely unresponsive and pegged at 100% CPU.

If the kernel logs didn't give me anything helpful, maybe there might be some application-side records that would indicate what happened. However, when I checked the C Playground log file, I felt things only getting stranger:

    Successfully created gdb socket at /srv/cplayground/data/dfb5e628-595f-464d-a4dd-1559db7b78d8-gdb.sock
    Starting container: docker run -it --name dfb5e628-595f-464d-a4dd-1559db7b78d8 --read-only --tmpfs /cplayground:mode=0777,size=32m,exec -v /srv/cplayground/data/dfb5e628-595f-464d-a4dd-1559db7b78d8:/cplayground/code.cpp:ro -v /srv/cplayground/data/dfb5e628-595f-464d-a4dd-1559db7b78d8-include.zip:/cplayground/include.zip:ro -e COMPILER=g++ -e CFLAGS=-g -std=c++17 -O0 -Wall -no-pie -lm -pthread -e SRCPATH=/cplayground/code.cpp --cap-drop=all --memory 96mb --memory-swap 128mb --memory-reservation 32mb --cpu-shares 512 --pids-limit 16 --ulimit cpu=10:11 --ulimit nofile=64 --network none -v /srv/cplayground/data/dfb5e628-595f-464d-a4dd-1559db7b78d8-gdb.sock:/gdb.sock --cap-add=SYS_PTRACE -e CPLAYGROUND_DEBUG=1 cplayground /run.py
    Resize info received: ^@^@^@^@^@^@

What on earth is that? After some Googling, I figured out that is how less displays null bytes. So, my log file is filled with null bytes.

Then it occurred to me: filesystem writes aren't synchronous. When a process writes to a file, the data is usually not immediately written to disk. To improve performance, the data is written to a buffer in memory. The kernel flushes this buffer to disk periodically, so that the data is eventually persisted. But if the kernel is incapacitated, there is no way the data can reach the disk, and when the machine force-restarts, it is lost forever. From the above output, I could see that parts of the logs made it to disk, but later parts of the logs were not so lucky, appearing as scrambled null bytes instead.

I started to feel a sense of dread creeping in. This was reminding me of the extremely late nights and brain-frying debugging sessions from that time I took an operating systems class. Except this might be even worse: this problem is only happening in production, and I can't reproduce it, and I can't get any useful logs.

In the next section, I'll explain what my kernel module was doing, so that you can follow along with my debugging process. Then, I'll talk you through the long, long process I went through to identify the two bugs that caused this problem. (Spoiler alert: there were two race conditions.)