Wednesday, March 5, 2008

Xen and the Art of Not Functioning

For the most part I enjoy doing system administrator work ... like setting up the software for our group to use. However, the real meat of a sysadmin's job is not dealing with software that works well; it's dealing with, and working around, software that's broken (most of the time without the ability to change the code).

Such was the case with an Ubuntu bug I ran into that caused Apache and other processes to hang (sometimes) when calling readdir_r or readdir64_r (the thread-safe versions of the readdir call). The bug initially manifested when I first installed the Ubuntu O/S image for my new Virtual Private Server (VPS). I've had a VPS with another provider, but this was my first time using Ubuntu Gutsy (version 7.10). My VPS is provided by VPSLink (a subsidiary of Spry), and I found the following report on the VPSLink forums of problems with the particular O/S image I had installed. I followed the advice there, installed a different (still Ubuntu) O/S image, and set up all the software packages I needed by hand (mainly just Apache).

Everything was good for a while until I ran into a problem while attempting an SVN move (which is implemented as a delete and a copy from the previous revision). The WebDAV "COPY" command was failing because the Apache server was misconfigured (this document describes several misconfiguration scenarios including mine). I fixed the configuration problem and attempted to restart Apache ... lo and behold, Apache went into what appeared to be a busy-wait loop, just pegging the CPU. What's better is that the process didn't appear to be executing any code that I could see with userland tools. The system call tracer strace sat silent when attached to the process. When I started Apache under strace I could see that the process stopped after calling getdents. The library call tracer ltrace showed that execution stopped inside of apr_dir_read, which coincides with what strace said (since it's likely that apr_dir_read would call getdents).
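A quick way to check whether directory reads hang on a box like this, without wedging your own shell, is to do the read in a child process with a timeout. What follows is only a minimal sketch: the function name directory_read_completes and the five-second timeout are made up for illustration, and Python's os.listdir is assumed to reach getdents through the C library's ordinary readdir path, so it may not exercise readdir64_r specifically.

```python
import subprocess
import sys


def directory_read_completes(path, timeout_s=5.0):
    """Return True if listing `path` in a child process finishes in time.

    Hypothetical helper for illustration only. The listing runs in a
    separate interpreter so that a hung or spinning directory read cannot
    take the calling process down with it.
    """
    # The child lists the directory and exits; on Linux this ends up in
    # the getdents system call via the C library.
    child_code = "import os, sys; os.listdir(sys.argv[1])"
    try:
        result = subprocess.run(
            [sys.executable, "-c", child_code, path],
            timeout=timeout_s,
            capture_output=True,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child when the timeout expires.
        return False
    # A non-zero exit code means the listing failed outright
    # (for example, the directory does not exist).
    return result.returncode == 0
```

Calling directory_read_completes("/etc/apache2") (or whatever directory the stuck process was reading) returns False if the read hangs, spins, or fails, which at least tells you whether the problem is specific to Apache.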

APR is the Apache Portable Runtime, which is supposed to provide a cross-platform shim for C applications. Out of desperation (since I'd killed Apache in the middle of trying to get my group members set up to use SVN) I tried installing an Apache build with a different concurrency model (the non-threaded model), even though this didn't really seem like a likely cause of the problem, since it was clear from the strace output that Apache wasn't even getting past reading its configuration file. To my chagrin this did not help ... but discovering that readdir64_r was the last thing on the stack (by attaching gdb to the running process, which I had never done with gdb before) led me to ask the Oracle (aka Google) about bugs involving readdir64_r, Ubuntu, and Xen, and I found this bug report, which describes an almost identical problem.

The only difference is that they were seeing a segfault instead of the process grinding the CPU. At the bottom of that bug report they suggest installing the Xen variant of the C library. This solution fixed everything. Clearly my VPS provider does not do adequate testing of their O/S images.
