Source-controlled database backups

Monday, December 5, 2011

This is in beautiful Kauai in case you were wondering I’m not very satisfied with traditional database backup solutions. It seems like almost no progress was made since SQL Server 6.5 (at least for the built-in tools). They are still outputting monolithic binary backup files that only do the job if the only thing you expect from backups is the ability to recover from catastrophic failures. I expect more, but before I explain, we need a disclaimer:

I am no expert of database backup and may very well miss some crucial points or some aspects of the current state of the art. The solution exposed in this post is the result of my own experiments and has no pretense at exhaustiveness or even reliability. It’s something I use on my own simple projects, and no more. Database backups are serious business and your data is valuable. Use this at your own risk.

What I wanted from my database backup solution can be summarized as follows:

Works on Azure databases (can you believe that as I’m writing this there is no valid built-in backup solution on Azure?)
Backups are not stored in the same place as the database
Has more than one state to rollback to
Can be automated
Easy to restore
Can make diffs between backup states
Can merge backup states

The last two, diffs and backups, can seem a little peculiar and require some explanation. I work on several Orchard projects, where a team of developers collaborate. The team is virtual, with people in the US, and in various places in Europe. Because Orchard is a CMS, the database contains more than just data: it also stores the configuration of the site, something that developers modify as part of their routine job. Since it is impractical to work off a central database because of latency, and also because one developer can break everyone’s instance with a bad change, we need multiple copies of the database, and a way to reconcile them.

What works in a distributed fashion, resolves conflicts, and stores history? Why, a Distributed Version Control System of course! This is what gave me the idea of storing my database backups in a private repository on BitBucket, using Mercurial as the DVCS.

In order to be able to do diffs and merges, we need a text-based persistence format. I chose to use the bcp command-line utility to emit that format because my past experience of it has been that it’s fast and reliable. The code could in principle use the bulk copy SQL commands, but having never worked with them I went for bcp. If someone wants to modify the code and make a pull request using bulk copy that’s fine by me. Bcp produces flat text files for each table that are very easy to diff and merge.

My solution, HgDBackup, comes with two command-lines that automate bcp, one for backup and one for restore. HgDBackup.exe enumerates the tables in the database and spits out one text file per table. HgDRestore.exe enumerates the tables, tries to find a text file with the same name, and if found, drops all rows in the table before importing the contents of the file into the table.

The schema of the database is not backed up, so you need to do that as a separate operation every time it changes, or as a separate automated task.

In the distributed development environment I described above, the process to take a new version of the database will be as follows:

execute HgDBackup
commit local changes
pull and update from the repository
execute HgDRestore
run tests
commit if a merge was necessary
push to repository

Both commands have built-in help that you can bring up with hgdbackup /? or hgdrestore /?.

The backup command responds to the following flags:

/Server or /S: the address of the server to backup (e.g. tcp:myazuredb.database.windows.net or .\SQLExpress)
/Database or /D: the name of the database to backup (e.g. Orchard)
/User or /U: user name to use on the database
/Password or /P: password to use on the database
/Output or /O: output directory where the tables will be dumped (e.g. C:\backup)
/Template or /T: connection string template

The restore command responds to the same flags, except that Output is replaced with Input.

The default connection string pattern is SQLExpress with integrated security (so no password has to be on the command-line).

For example, here is what I use to backup one of my Azure databases:

HgDBackup.exe
  /S:tcp:something.database.windows.net /D:Orchard
  /U:mysubscription@weirdstuff /P:**********
  /O:C:\Projects\nwazet.Backup\Tables
  /T:Server={0},1433;Database={1};User={2};Password={3};
     Trusted_Connection=False;Encrypt=True;

And here is what I use to import that backup into my local SQLExpress instance:

HgDRestore.exe
  /S:.\SQLExpress /D:Orchard
  /I:C:\Projects\nwazet.backup\Tables

That nwazet.backup directory is a clone of a private repository on BitBucket. It’s of course important that the repository is private because databases contain secrets.

The code is available on BitBucket:
https://bitbucket.org/bleroy/hgdbackup

You may also download the already compiled commands:
http://weblogs.asp.net/blogs/bleroy/HgDBackup.zip

And remember, use at your own risk…

Hey Bertrand - liking the effort - it's very good and bare metal! No fancy stuff just do the right job in a hurry, please! ;~)

That being said I note that you drop data from each table before you insert. How do you handle forgein keys between the tables and also how do you handle autoincrementing ID-columns in relation to that?

Cheers,

M.

Magnus Mårtensson - Monday, December 5, 2011 7:23:11 AM

@Magnus: from what I understand bcp suspends referential integrity checks and identity auto-increments while doing the import (in its default configuration, there are flags to modify that) so I suppose you can get into inconsistent states until your import is done. If I were implementing that in a more industrial setting (for the moment I'm only using it in dev settings but that might change soon), I would add a step after restore to check and enforce integrity. I'll follow-up if necessary once I know more about this.

Bertrand Le Roy - Monday, December 5, 2011 8:41:13 PM

beautiful post and exactly what I need for orchard. Thanks for sharing!

John - Tuesday, December 6, 2011 1:14:13 AM

Maybe the site config shouldn't go in a SQL database

gill bates - Tuesday, December 6, 2011 3:57:58 AM

(sigh)

Bertrand Le Roy - Wednesday, December 7, 2011 3:18:36 AM

Uuummm...no. You definitely got me on that one, doesn't support Azure...which is a pretty significant difference (that I missed).

Yex - Wednesday, December 21, 2011 9:13:58 PM

6 Comments