Monday, 30 December 2013

Rewriting history in Subversion with the help of Erlang

Note: This text was originally written for the redhoterlang.org blog in 2011, but since that is now defunct, I'm republishing it here to make it available online again. When I wrote this, we had only been using the resulting repository for a few months and our experience with git was quite limited. Now three years later, I can say that I'm very very happy indeed that I spent a lot of effort importing the entire development history into git. Being able to run "git log" or "git blame" and see every change back to when the initial code base was created almost 10 years ago is incredibly useful.

Our development department at Klarna has grown quite a lot over the last year [2009-2010], and because we are trying to be as agile as we can, using scrum and kanban, this has meant more teams, more branches, and more coordination and merging of branches. When summer came, we knew we had a big batch of recruits due to start in August, and that we had to do something about our version control system before then, because it was getting to be a real problem.

Once upon a time, our R&D consisted of a handful of developers getting along fairly well using CVS. After a year or two, they took the step of converting the repository to Subversion, and that worked well enough for a couple of dozen developers, as long as we tended to follow the basic pattern of creating project branches from trunk and only merging them back to trunk when they were done. In Subversion, regularly merging out from trunk to a branch to keep it up to date works fine, but you really want to avoid merging a branch back to trunk except when you're done with it and you're going to remove the branch afterwards.

A typical problem we ran into was that we wanted to have separate per-sprint branches for each team, from which the developers in the team could create individual work branches that they merged back to the team branch at the end of the sprint. The team branches were then reviewed and tested, and reintegrated to trunk. But occasionally, a work branch was not mature enough to be allowed to be merged to the team branch in this sprint. Of course, you'd want to reschedule the work as part of the next sprint and keep the branch. But the team branch (from which the work branch was taken) will get reintegrated to trunk, and can't be kept around - Subversion gets severe hiccups if you try to keep working on it after reintegration. So what do you do with the work branch? There are a few tricks you can try in this case: some discard the history, and some call for manual tweaking of svn:mergeinfo. Neither is good, and both require time and effort by a local Subversion master.

Another annoyance was that Subversion had only had mergeinfo support for a short time, and most of our repository history had no mergeinfo properties. This meant that 'svn blame' was mostly useless, generally showing only who had performed the final merge to trunk. And even when you asked svn to do its best with the mergeinfo it had, it simply took forever to compute the annotations because of the amount of history it had to download from the server. The svn:mergeinfo properties themselves were also getting ridiculous: even though we had only had them for a year, the mergeinfo was already hundreds of lines long, due to all the work branches that had been created and reintegrated along the way. There were other problems as well, but to cut a long story short, Subversion no longer felt like a viable solution. Eventually we decided that Git was the way we wanted to go - Mercurial was the main contender - and that should mean that all we had to do was import the Subversion repository into Git using git-svn, right?

But things are never that easy. Far from it, in fact. Our initial attempts to use git-svn ran for a few hours and then aborted with some odd error message. Apparently, our Subversion history was just too complicated for git-svn to understand. We experimented with different options, tried using svndumpfilter to extract only the most important parts of the repository, and fiddled with graft points to recreate missing information. All to no avail - the result was so incomplete, it was not funny.

At this point, we could have decided to waste no more time and just make a clean import of the sources into Git, keeping the Subversion repository as read-only for historical purposes, but it felt worthwhile to try to import the entire history into git, so we would get full use of 'git blame' and not have to use a separate tool for digging up old details. There was only one way to do this: rewrite the Subversion history so that git-svn would accept it. But that was no simple task - there were many idiosyncrasies in our repository:
  • The first few thousand commits came from CVS, via cvs2svn.
  • At first we had a simple layout with trunk/tags/branches at the top level (and - probably because of the cvs2svn conversion - with an extra directory level under trunk, as in "trunk/foo/*").
  • Later, we reorganized the repository to use a typical multi-project layout: "foo/{trunk,tags,branches}", "bar/{trunk,tags,branches}", etc.
  • Some sub-projects that had originally been part of "foo", such as "foo/trunk/lib/baz" were moved out to be separate projects with their own trunk: "baz/{trunk,tags,branches}".
  • There were lots of commits where people had copied branches to the wrong locations (or copied from the wrong directory level) and then moved or deleted them again. On average, that probably happened once or twice each week. That is harmless if you stay with Subversion forever, but if you want to change to another version control system, such commits become a big problem, because they confuse tools like git-svn.
At first, I tried using a combination of svndumpfilter and various shell scripts and temporary files to pry apart the original commits, rewrite them, and join them again, but svndumpfilter is a very simple tool, and does no sanity checking: in general, you only discover consistency problems when you try to load the final dump file back into an empty repository and get an error message - usually after several hours of disk activity. (The original dump of our entire Subversion repository was about 10 GB in size.) The feedback loop was way too slow, and I had to change strategy. I had intentionally postponed thinking about writing my own down-and-dirty dumpfile filtering tool, because I knew that it could potentially be a huge timesink, but August was getting closer and I felt that I had no choice if we were to be able to switch to Git before the new recruits started working.

A good Subversion dumpfile filtering tool had to be able to:
  • Read dumpfiles, as created by 'svnadmin dump', and represent them internally in a symbolic form suitable for easy matching and rewriting.
  • Write the possibly rewritten entries back to disk, in valid dumpfile format so that they can be read by 'svnadmin load'.
  • Track a lot of information to be able to perform consistency checks and give early warnings, in order to save time when writing a filter.
  • Handle reading and writing in a streaming fashion, since it was out of the question to load a 10 GB dump file into main memory. (I was doing this on my laptop, which had 4GB RAM.)
As it turned out, Erlang is a really good language for doing this sort of thing! I had expected that the parsing and output should be pretty simple, mainly because of Erlang's binary syntax, but the rest also turned out very well, even if I had to make a bunch of performance hacks. Most of all, the fact that you really want to work with structured symbolic data for a task like this, and not just plain strings, made it a great match for Erlang.

First, I hunted down and deciphered all the information I could find about the Subversion dumpfile format and implemented a parser and writer for it. That took only a couple of days, and as usual, the hardest part was figuring out the little undocumented things by parsing some real dump files, writing them back out, and trying to load them into Subversion again. Once I had that working, I started working on doing the filtering/rewriting while streaming.
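To give a feel for the streaming approach, here is a minimal sketch in Python (not the Erlang of the actual tool, and not its real interface). It assumes the usual dumpfile layout: blocks of "Key: value" header lines ended by a blank line, optionally followed by a body whose size is given by the Content-length header. The function names read_headers and read_records are made up for this illustration, and header values are assumed to be valid UTF-8.

```python
def read_headers(stream):
    """Read one block of 'Key: value' lines from a binary stream.

    Leading blank lines (padding between records) are skipped.
    Returns a dict, or None when the input is exhausted.
    """
    headers = {}
    while True:
        line = stream.readline()
        if not line:                      # end of input
            return headers or None
        line = line.rstrip(b"\n")
        if not line:
            if headers:                   # a blank line ends the header block
                return headers
            continue                      # padding before the next record
        key, _, value = line.partition(b": ")
        headers[key.decode()] = value.decode()


def read_records(stream):
    """Yield (headers, body) pairs, holding only one record in memory at a time."""
    while True:
        headers = read_headers(stream)
        if headers is None:
            return
        # Records without a Content-length header have no body.
        length = int(headers.get("Content-length", 0))
        yield headers, stream.read(length)
```

Because read_records is a generator, a filter can inspect or rewrite each record and write it out again before the next one is read, so memory use does not grow with the size of the dump.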

It turned out that in order to do all the necessary consistency checks - for example, checking that if you copy the path "foo/bar/x" to "foo/baz/x" in revision 17, the parent path "foo/baz" actually exists at that point - you really have to keep track of all existing paths in all revisions. To see why, imagine that in revision 54321, you look up the path "foo/baz/x" as it was in revision 17, and copy it to "foo/fum/x". Furthermore, in each revision, there may be tens of thousands of paths that were live at that point. This calls for some serious sharing of internal data structures - just the sort of thing that a functional language is good at! Basically, I needed to represent the structure of the repository in much the same way Subversion itself represents it: as a directory tree with file and directory nodes. Each revision is a pointer to the current top node in that revision, but most of the nodes are shared between revisions. (For speed, I used the process dictionary to map revisions to nodes - this still preserves the sharing.)
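The sharing idea can be sketched in a few lines of Python. This is only an illustration under simple assumptions, not the data structure used in the actual tool: directories are plain dicts that are never modified after creation, files are represented by whatever leaf value you store (here a placeholder string), and the names add and lookup are invented for the example.

```python
def lookup(node, parts):
    """Return the node found by following the path components in `parts`."""
    for name in parts:
        node = node[name]
    return node


def add(node, parts, leaf):
    """Return a new tree with `leaf` stored at the path `parts`.

    The old tree is left untouched. Only the directories along the path
    are copied; every other subtree is shared between old and new tree.
    Raises KeyError if a parent directory does not exist.
    """
    head, rest = parts[0], parts[1:]
    if not rest:
        child = leaf
    else:
        if not isinstance(node.get(head), dict):
            raise KeyError("missing parent directory: " + head)
        child = add(node[head], rest, leaf)
    copy = dict(node)        # shallow copy: siblings are shared, not duplicated
    copy[head] = child
    return copy


# Each revision simply points at its own root node.
revisions = {0: {}}
revisions[1] = add(revisions[0], ["foo"], {})
revisions[2] = add(revisions[1], ["foo", "bar"], {})
revisions[3] = add(revisions[2], ["foo", "bar", "x"], "contents-of-x")
revisions[4] = add(revisions[3], ["foo", "baz"], {})
# Copy foo/bar/x as it was in revision 3 to foo/baz/x.
revisions[5] = add(revisions[4], ["foo", "baz", "x"],
                   lookup(revisions[3], ["foo", "bar", "x"]))
```

After this, revisions 3 and 5 both contain the very same "foo/bar" directory object, so an unchanged subtree costs nothing extra per revision, and a copy into a non-existent parent fails immediately.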

Obviously, you don't want to store the contents of the files, only their paths, but you do want to check that the MD5 sum of the contents matches what the dumpfile metadata says that they should be - and if your filter modifies the contents, the metadata needs to be automatically updated in the output.
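As a rough illustration of that bookkeeping, the following Python sketch checks a node's text against its Text-content-md5 header and recomputes the affected headers when a filter replaces the text. The function names verify_text and rewrite_text are hypothetical, and the headers are assumed to be held in a plain dict of strings; the actual tool may organise this quite differently.

```python
import hashlib


def verify_text(headers, text):
    """Check `text` (bytes) against the MD5 recorded in the node headers.

    Returns the computed MD5 so that the caller can track it per path.
    """
    actual = hashlib.md5(text).hexdigest()
    expected = headers.get("Text-content-md5")
    if expected is not None and expected != actual:
        raise ValueError("MD5 mismatch: expected %s, got %s" % (expected, actual))
    return actual


def rewrite_text(headers, new_text):
    """Return a copy of `headers` that is consistent with `new_text`."""
    prop_length = int(headers.get("Prop-content-length", 0))
    updated = dict(headers)
    updated["Text-content-length"] = str(len(new_text))
    updated["Text-content-md5"] = hashlib.md5(new_text).hexdigest()
    # Content-length covers the property section plus the text.
    updated["Content-length"] = str(prop_length + len(new_text))
    # Any other checksum headers in the record would need the same treatment.
    return updated
```

Keeping the update in one place means a filter rule only has to supply the new contents, and the metadata cannot silently drift out of sync.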

Furthermore, it turned out that in order to be certain that path copy operations were correct, you want to compare the source and target MD5 sums. If the information in the dumpfile says that "in rev. 54321, we copy foo/baz/x of revision 17, whose MD5 is A1B2...FEDC, to foo/fum/x", then you want to check that the file at foo/baz/x in revision 17 really has this checksum. This means that for each path in each revision, you also need to track its current MD5 sum. That's even more data that you only want to store once.
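A check along these lines might look as follows in Python. This is a sketch only: it assumes the copy is described by the Node-copyfrom-rev, Node-copyfrom-path and Text-copy-source-md5 headers, and that the tracked checksums can be looked up in a mapping keyed by (revision, path). The mapping and the name check_copy are stand-ins invented for the example.

```python
def check_copy(headers, tracked_md5):
    """Verify the claimed source checksum of a file copy.

    `tracked_md5` maps (revision, path) to the MD5 we recorded earlier.
    Returns the checksum, which also becomes the checksum of the target.
    """
    source = (int(headers["Node-copyfrom-rev"]), headers["Node-copyfrom-path"])
    if source not in tracked_md5:
        raise KeyError("copy source does not exist: %s@%d" % (source[1], source[0]))
    actual = tracked_md5[source]
    claimed = headers.get("Text-copy-source-md5")
    if claimed is not None and claimed != actual:
        raise ValueError("copy source %s@%d has MD5 %s, dump claims %s"
                         % (source[1], source[0], actual, claimed))
    return actual
```

Failing at the first mismatch is what keeps the feedback loop short: the error names the offending path and revision long before 'svnadmin load' would have noticed anything.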

Even though I kept both the paths and the MD5 sums as binaries, not lists, I kept running out of memory: there were lots and lots of occurrences of the same character sequences. To get around this, I had to cache all binaries in the process dictionary. Each time a path or MD5 is read from disk, we look it up in the process dictionary; if it is found, we use the old copy and let the new one be garbage collected, and if it is not found, we store it in the dictionary. (Using atoms would not have been possible, because of the limits on the total number of atoms and on the length of atom names.)

This, and some other small things, made it possible to stream the entire 10 GB Subversion dump file of our repository (more than 38 thousand revisions) in under an hour on my measly laptop, keeping all the necessary metadata in memory and sanity checking every rewrite I made. The program would fail quickly and report any inconsistencies, making it easy to keep adding rules until it all worked. In the end, a couple of hundred individual rewrite rules were needed to clean up our mess. The result could be loaded back into an empty Subversion repository, and handed over to git-svn, which imported it without complaints!

There was only one problem: the merge information was incomplete. For a large number of merge points, Git didn't know where the merged changes came from - only the log message hinted that a merge had occurred. If we had been using the standard svnmerge script for simplifying merges in Subversion, then git-svn should have been able to pick up the info from the generated commit messages - even from the time before svn:mergeinfo was implemented. However, we had for some reason always used a merge script of our own. Also, the script had been tweaked from time to time, so the commit messages it produced were not even always in the same form. I had to add functionality to the filter for extracting the correct merge information from our commit messages and reinserting it as svn:mergeinfo properties for git-svn to use.
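The extraction step might be sketched like this in Python. The two commit message formats below are pure inventions for the sake of the example (the real messages from our merge script are not shown in this post), as is the function name mergeinfo_from_log. The only thing taken from Subversion is the shape of an svn:mergeinfo line: a source path, a colon, and a revision range.

```python
import re

# HYPOTHETICAL message formats: one pattern per historical variant of the script.
PATTERNS = [
    re.compile(r"^Merged revisions (?P<first>\d+)-(?P<last>\d+) from (?P<path>/\S+)",
               re.MULTILINE),
    re.compile(r"^MERGE (?P<path>/\S+) \(r(?P<first>\d+)-r(?P<last>\d+)\)",
               re.MULTILINE),
]


def mergeinfo_from_log(message):
    """Return an svn:mergeinfo line for a merge commit, or None if no pattern matches."""
    for pattern in PATTERNS:
        match = pattern.search(message)
        if match:
            return "%s:%s-%s" % (match.group("path"),
                                 match.group("first"),
                                 match.group("last"))
    return None
```

Trying the patterns in order makes it easy to add one more variant each time the filter reports a merge commit it could not interpret.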

But there was another complication: for that to work, when you copy a path in Subversion for branching or tagging, it is crucial that the copy gets the merge information from the source path. This meant that apart from MD5 sums, I now also had to track the complete merge information on all trunks, tags and branches, in all revisions. (Thankfully, there was no need to do this for all paths, just "root paths" - I scrubbed away all other svn:mergeinfo properties on sub-paths.) Once I had this working, things just fell into place, and git-svn was able to pick up practically all branchings and taggings that had happened, even from the bad old CVS days. And I was still running the entire conversion on my laptop!
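A simplified Python sketch of that tracking is shown below. It assumes that the merge information for each root path is kept as a set of (source path, first revision, last revision) entries, which is a simplification chosen for the example; overlapping ranges are not combined here. The names copy_root, record_merge and format_mergeinfo are hypothetical.

```python
def copy_root(mergeinfo, source, target):
    """Branch or tag: the new root path inherits the merge info of its source."""
    updated = dict(mergeinfo)
    updated[target] = mergeinfo.get(source, frozenset())
    return updated


def record_merge(mergeinfo, target, source, first, last):
    """Note that revisions first..last of `source` were merged into `target`."""
    updated = dict(mergeinfo)
    updated[target] = mergeinfo.get(target, frozenset()) | {(source, first, last)}
    return updated


def format_mergeinfo(entries):
    """Render entries as svn:mergeinfo text: one 'path:ranges' line per source."""
    by_source = {}
    for source, first, last in sorted(entries):
        # A single revision is written as just its number, not as a range.
        text = str(first) if first == last else "%d-%d" % (first, last)
        by_source.setdefault(source, []).append(text)
    return "\n".join("%s:%s" % (source, ",".join(ranges))
                     for source, ranges in by_source.items())
```

Since each update returns a new mapping and leaves the old one alone, the state for earlier revisions stays available for later copies, in the same spirit as the shared directory tree.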

We now had a complete and consistent history in Git, although it was not, strictly speaking, true: it looks as if we had the same nice directory layout all the way from the start. The drawback is that we can't check out a very old revision and expect it to build and run out of the box, because paths in Makefiles and in the code probably won't fully match the layout on disk. If that sort of thing is important to you, think carefully about each modification you make to the history: will it break an old build?

In one of the first weeks of August, we announced that everybody should commit any work they had lying around, and on the Friday afternoon we disabled write access to the Subversion server. The following Monday morning, developers got to clone the new Git repository, and that was that. The first couple of weeks were a bit chaotic while people were getting to grips with Git, but we've never looked back.

The code I wrote for processing Subversion dumps is open source and can be found at https://github.com/richcarl/erlsvn. It is divided into a generic library module and a user-defined filter module: there is a trivial filter example as well as a stripped-down version of the filter I wrote for our production repository. I've cleaned up the code a bit, so it shouldn't be too hard to read, but if you think some parts of it are ugly, remember that I was under hard time and space constraints when I wrote it. Feel free to improve on the interface as well as the coding style, and let me know if you use it, be it successfully or not.

- Richard