Saturday, March 8, 2008

Google Groups, Losing E-Mail, GMail Marking Google Groups (DomainKey signed) Email as Spam

We decided today to 86 my choice of Google Groups as our group communication mechanism. Our group lead (Ben) posted several emails to the list about meeting times for this Saturday, none of which went through. So the meeting ended up being rather last minute (since no one knew where it was). I had problems with my GMail account thinking that the Google Groups email was spam.

Whenever I sent email from the wrong account (the default address GMail sends from is not the one that's subscribed to our list), I wouldn't get any feedback from the list saying that I couldn't post to the list (since that address is not a member). I happened to peruse my spam folder and found all the "you can't post to this list" emails lying there. GMail's spam classifier apparently didn't think the chain of valid domain keys (from one of Google's own domains, even) was enough to keep the mail from being marked as spam. Needless to say, Google Groups is out; hopefully a mailman list provided by the CS department's System Support Group is in.

Wednesday, March 5, 2008

Xen and the Art of Not Functioning

For the most part I enjoy doing system administrator work ... like setting up the software for our group to use. However, the real meat of a sysadmin's job is not dealing with software that works well, it's dealing with, and working around, software that's broken (most of the time without the ability to change the code).

Such was the case with an Ubuntu bug I ran into that caused Apache and other processes to hang (sometimes) when calling readdir_r or readdir64_r (the thread-safe versions of the readdir call). The bug initially manifested when I first installed the Ubuntu O/S image for my new Virtual Private Server (VPS). I've had a VPS with another provider before, but this was my first time using Ubuntu Gutsy (version 7.10). My VPS is provided by VPSLink (a subsidiary of Spry), and I found a report on the VPSLink forums of problems with the particular O/S image I had installed. I followed the advice there, installed a different (still Ubuntu) O/S image, and set up all the software packages I needed by hand (mainly just Apache).

Everything was good for a while until I ran into a problem when attempting an SVN move (which is implemented as a delete and a copy from the previous revision). The WebDAV "COPY" command was failing because the Apache server was misconfigured (this document describes several misconfiguration scenarios, including mine). I fixed the configuration problem and attempted to restart Apache ... lo and behold, Apache went into what appeared to be a busy-wait loop, just pegging the CPU. What's better is that the process didn't appear to be executing any code that I could see with userland tools. The system call tracer strace would sit silent when attached to the process. When I started Apache under strace, I could see that the process was stopping after calling getdents. The library call tracer ltrace showed that execution stopped inside of apr_read_dir, which coincides with what strace said (since it's likely that apr_read_dir would call getdents).

APR is the Apache Portable Runtime, which is supposed to provide a cross-platform shim for C applications. Out of desperation (since I'd killed Apache in the middle of trying to get my group members set up to use SVN) I tried installing an Apache build with a different concurrency model (the non-threaded model), even though this didn't really seem like a likely cause of the problem: it was clear from the strace output that Apache wasn't even getting past reading its configuration file. To my chagrin this did not help ... but discovering that readdir64_r was the last thing on the stack (by attaching gdb to the running process, which I had never done with gdb before) led me to ask the Oracle (aka Google) about bugs involving readdir64_r, Ubuntu, and Xen, and I found this bug report, which describes an almost identical problem.

The only difference is that they were seeing a segfault instead of the process grinding the CPU. At the bottom of the above bug report they suggest installing the Xen variant of the C library. This solution fixed everything. Clearly my VPS provider does not do adequate testing of their O/S images.
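
A hang like this one (a process stuck while reading a directory) can also be probed outside of Apache. What follows is only a minimal Python sketch and not something that was used during the debugging above. The function name check_listing and the CHILD_CODE constant are made up for illustration. It lists a directory in a child process and gives up after a timeout, so a stuck directory read shows up as "hung" instead of wedging the shell. Note the assumption: Python's os.listdir goes through the C library's directory-reading code, but not necessarily through the same readdir64_r entry point that Apache hit, so an "ok" here does not prove the image is healthy.

```python
import subprocess
import sys

# Code run by the child process: enumerate one directory, then exit.
CHILD_CODE = "import os, sys; os.listdir(sys.argv[1])"


def check_listing(path, timeout=5.0):
    """Return 'ok', 'hung', or 'failed' for a directory listing of path.

    The listing runs in a separate process so that a hang inside the
    C library cannot block the caller for longer than `timeout` seconds.
    """
    try:
        result = subprocess.run(
            [sys.executable, "-c", CHILD_CODE, path],
            timeout=timeout,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except subprocess.TimeoutExpired:
        # The child was still running when the timeout expired.
        return "hung"
    # A non-zero exit means the listing raised (e.g. missing directory).
    return "ok" if result.returncode == 0 else "failed"
```

Running check_listing against the directory holding the Apache configuration (the exact path depends on the install) before and after swapping in the Xen variant of the C library would be one way to see whether the fix took.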