Testing for race conditions

Race conditions are among the worst kind of problem to debug: they tend to appear only rarely and unrepeatably, often arise only in long running cases in production, and leave little evidence of what went wrong. This also was the case for a online payment solution provider in Stockholm.

Mnesia is a database written in Erlang that is used by for example in telecommunication equipment and financial transaction databases. Erlang is not immune to race conditions, despite its excellent support for concurrency. Race conditions can give rise to rare intermittent failures in software such as mnesia. Mnesia was known to fail “once every month or two” in production, and race conditions are one likely cause. After six weeks of work engineers responsible for this software could not find the cause of the errors.

We extended QuickCheck with features to generate concurrent unit tests to provoke race conditions, and shrink them to minimal examples. With less than 2 days of work and 200 lines of QuickCheck code, we were able to generate concurrent tests of dets, the disk storage layer of mnesia, and provoke five separate race conditions in short order. This not only means that we can find these kind of concurrency errors easily, it also means we can find them quicker, by already addressing them at the unit testing level.

We publihed a research paper about this cutting edge technology.