Thursday, September 13, 2012

Suggestions for Jenkins on multi-platform projects

Our team uses Jenkins as our Continuous Integration tool. In this post I would like to describe how we use it and suggest a few ideas that could improve this great tool.  But first, let me explain what we are building...

What we build

The product that we are building is a Mathematical Programming engine. In its most basic usage, you feed it the mathematical formulation of the business problem that you want to solve and ask for the best possible solution to this problem.  The program then cranks up all the CPUs/cores it can find on your machine and returns with an answer after a few tenths of a second, or a few hours (some problems are REALLY complicated).

The core of the engine is a library built from 500,000 lines of C code. In addition to this library, we have APIs in half a dozen other languages (C++, Java, Python, etc.), and connectors for several third-party applications.  No fewer than 14 platforms (including Windows, Linux on various CPU types, Mac OS, AIX, HP-UX, etc.) are supported.

We are therefore very glad to use continuous builds: you really don't want to discover a possible compiler bug, or non-deterministic behavior in the code on some exotic platform, just before the release!

Our current setup

On our master branch (the one that receives most of the commits from the developers) we use two job families.

The first one builds the software in Debug mode and runs a comprehensive suite of tests. We have around 20 such jobs for all the platform/compiler/settings combinations we support. Run times vary widely: some of the jobs are done in an hour and a half, while others need almost 7 hours.

The second family (the 'distrib' jobs) builds the Release versions of the product. There is one job per platform. Each job builds all the components for this platform (e.g. on Windows32, we support both Visual Studio 2008 and 2010), packages them into some releasable form (could be a Zip, a TarZ or an installer) and tests the basic functionality of the software (e.g. the distributed samples). For those jobs, the run times vary even more: from 30 minutes to 10 hours, depending on the platform.

This setup has been in place for some time now. It works, and it's extremely useful!


Although Jenkins is a great tool, it doesn't yet have all the features I'd like.  So here are a few ideas, just in case the developers don't already have enough...

Detect stale jobs

We sometimes have jobs that stop running (no new run is triggered, or no nodes are available).  This is of course not intended, and it would be nice to be able to detect those easily.  I suggest adding a 'Last build' column to the list view that displays the time since the job entered its current state.  Something like 'Ended 8.6 hr' or 'Queued 1.3 hr' or 'Started 12 min'...

Then I'd know that if the code changed 3 hours ago, I shouldn't see any number larger than 3 hours...
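
To make the idea concrete, here is a minimal sketch in Python of how such a column could be computed and how stale jobs could be flagged. It is only an illustration, not how Jenkins works: the JobState record, its field names and the job names are all hypothetical, and the timestamps are assumed to be seconds since the epoch.

```python
from dataclasses import dataclass


@dataclass
class JobState:
    """Hypothetical record: one job and the time it entered its current state."""
    name: str
    state: str          # 'Ended', 'Queued' or 'Started'
    since_epoch: float  # seconds since the epoch (assumed unit)


def format_age(seconds):
    """Render an age as in the examples above: '12 min' or '8.6 hr'."""
    if seconds < 3600:
        return f"{int(seconds // 60)} min"
    return f"{seconds / 3600:.1f} hr"


def last_build_column(job, now):
    """Text of the suggested 'Last build' column for one job."""
    return f"{job.state} {format_age(now - job.since_epoch)}"


def stale_jobs(jobs, last_commit_epoch):
    """Names of jobs whose state has not changed since the last commit."""
    return [j.name for j in jobs if j.since_epoch < last_commit_epoch]
```

With a last commit 3 hours ago, a job showing 'Ended 8.6 hr' would be reported by stale_jobs, while 'Queued 1.3 hr' and 'Started 12 min' would not.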

Detect hung jobs

We have many jobs running, typically 20 to 30 simultaneously, and some builds last for several hours.  Sometimes tests hang or run abnormally slowly.  These situations should be detected as early as possible so that they can be investigated.

Unfortunately, the 'Build History' list is not very helpful here. It shows too few builds for us: with 50 builds, only the last 5 hours are covered, which is less than the duration of many of our builds. And if this limit were increased, we'd probably need a list of 200 or so builds, which would not be easy to handle.

I would thus suggest allowing filtering on the 'building' status.  With this filter set, the 'Build History' would only display the builds that are currently running.
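
As a rough sketch of what such a filter could look like, and of how it could be extended to point at suspected hangs, consider the Python below. The build records are plain dictionaries; the keys 'building', 'timestamp' and 'estimatedDuration' are modeled loosely on fields that Jenkins exposes for builds and should be treated as assumptions here, while the 'job' key and the slack factor of 1.5 are purely illustrative. Times are assumed to be in milliseconds.

```python
def running_builds(builds):
    """Keep only the builds whose 'building' flag is set."""
    return [b for b in builds if b["building"]]


def suspected_hangs(builds, now_ms, slack=1.5):
    """Names of jobs whose running build exceeds slack x its estimated duration.

    The threshold is an arbitrary illustrative choice, not a Jenkins setting.
    """
    hung = []
    for b in running_builds(builds):
        elapsed_ms = now_ms - b["timestamp"]
        if elapsed_ms > slack * b["estimatedDuration"]:
            hung.append(b["job"])
    return hung
```

Comparing against an estimated duration per job, rather than one global timeout, matters in a setup like ours, where a normal build can take anywhere from 30 minutes to 10 hours.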

A view 'by revision'

I often need to check whether a given revision of the source has been built by a given job, or what the latest revision is that is good on a set of jobs.  For example, I may want to merge this revision to some 'stable' branch for other teams to use.

I think that a grid view with the following attributes would be very useful for this: each row is a commit id or SVN revision, each column is a job, and each cell is blue, red or gray (or even empty if this revision has not yet been part of a run of the job, or the run is not finished yet).
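
The data behind such a grid is simple to build. The following Python sketch assumes each run is described by a hypothetical dictionary with a job name, the list of revisions it covered, and a result string (None while the run is unfinished). The mapping from results to colors is my assumption, and revisions are assumed to be integers, as with SVN, so that "latest" simply means "largest".

```python
# Assumed mapping from a run's result to a cell color; anything else is gray.
COLORS = {"SUCCESS": "blue", "FAILURE": "red"}


def revision_grid(runs):
    """Build a mapping revision -> job -> color from finished runs."""
    grid = {}
    for run in runs:
        if run["result"] is None:
            continue  # unfinished run: leave the cell empty
        color = COLORS.get(run["result"], "gray")
        for rev in run["revisions"]:
            grid.setdefault(rev, {})[run["job"]] = color
    return grid


def latest_good_revision(grid, jobs):
    """Largest revision that is blue on every job in jobs, or None."""
    good = [rev for rev, row in grid.items()
            if all(row.get(job) == "blue" for job in jobs)]
    return max(good) if good else None
```

The second function answers the question from the paragraph above: which revision is safe to merge to a 'stable' branch for a chosen set of jobs.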

Do you think these would be useful additions?