Saturday, March 29, 2014

Something's not write

This is a recount of a recent high pressure debugging session I helped with. As with most difficult debugging sessions it ended up being educational and fun. I'm posting here mostly for posterity and to enlighten the future me.

Two coworkers and I were called in to support a project (not our own) that was going to demo in two days but was experiencing a variety of system-halting problems. This was a system with which I had almost no experience and to complicate matters, since demo was so close, there was no chance we could take down the system to investigate. Our instructions were to engage the production system while the entire user base was attempting to squeeze in last minute testing. Time to go spelunking.

The tasks were broken up between the three of us and my focus was a master-slave setup supporting system automation through a web interface. The machines were running FreeBSD with code written in a variety of languages including C, Python, Perl, and PHP all tied together and coordinated with databases and XML-RPC. The symptom I had to go on was that every once in a while when trying to create a data capture (a file that could ultimately be several Gbs) the resulting file would end up being 0 bytes on disk representing a complete loss of data for that day. There was also a message in the log file that appeared when the data capture failed: 'write failed: Permission Denied'. The strange thing was that this message would appear approximately half way through the transfer so by that time the file had to exist and would have already been written to.

I started by looking at the source code for the offending process. The error was being reported as a result of a failed write call. Strangely, 'Permission Denied' (EACCES) is not listed as a valid error in the write man pages. I wanted to avoid tearing apart the build system to recompile a single source file so I started investigating ways other than editing source to try and figure out why this was happening.

I'm quite unfamiliar with FreeBSD but on Linux I would use something like strace to try an establish a baseline of what is happening. Luckily, FreeBSD has truss which provided close to the same functionality. Unfortunately, the process I wanted to examine was exec'd from deep within some Perl code that kept failing when I tried to prefix the command with truss -o truss.out. Since it was a long running process I ended up attaching to the running process with truss -o truss.out -p 12345.

Now I had a running process that I could monitor and all I needed was to trigger the effect - enter: the heisenbug. Several hours of creating large files without issue forced me to look elsewhere.

I started collecting what information I could about the system and df showed that the data capture file was being written to an NFS mounted location. Since nothing else was out of the ordinary I tried to understand why a write to an NFS file would fail. My first instinct was that there was an overload so I tried several parallel writes of substantial size without issues - no dice. Setting up a tcpdump between the master and slave would have also been tough since all communication throughout the system (including NFS) was handled between these two machines.

Since the process I was monitoring was writing to the master machine and the local code seemed fine I tried digging into why mounted locations would experience random interruptions. I ended up finding a thread on a FreeBSD mailing list that described this exact problem. According to the dialog there, mountd could cause this problem if it received a SIGHUP while the transfer was taking place. I ran a test where I tried transferring a large file while sitting on the destination machine executing kill -s HUP `cat /var/run/mountd.pid`. It took three attempts but I finally triggered the bug - it seems to be a race condition even when doing the very thing that will cause the error.

The only thing left was to determine why mountd might be receiving a SIGHUP. This would happen when calling mount by hand and, as it turns out, some of the tests in the system mount new locations as part of spinning up a test environment. So in the very rare case of creating a data capture while another test mounts a new location and that mount happens to be at the precice moment to interfere with network writes to NFS you get issues. In the general operation this might not ever occur; however, given the load on the system during prep for demo it became a more noticeable (and repeatable) problem.

Considering the the mailing list thread was discussing the issue in FreeBSD 8.3 and this system is still seeing it in 9.1, the ultimate solution was to modify the C code to handle the EACESS case on error from write. As to why the man pages for write do not list EACESS as an error condition... they probably aren't allowed.