Source-controlled database backups
I’m not very satisfied with traditional database backup solutions. It seems like almost no progress was made since SQL Server 6.5 (at least for the built-in tools). They are still outputting monolithic binary backup files that only do the job if the only thing you expect from backups is the ability to recover from catastrophic failures. I expect more, but before I explain, we need a disclaimer:
I am no expert of database backup and may very well miss some crucial points or some aspects of the current state of the art. The solution exposed in this post is the result of my own experiments and has no pretense at exhaustiveness or even reliability. It’s something I use on my own simple projects, and no more. Database backups are serious business and your data is valuable. Use this at your own risk.
What I wanted from my database backup solution can be summarized as follows:
- Works on Azure databases (can you believe that as I’m writing this there is no valid built-in backup solution on Azure?)
- Backups are not stored in the same place as the database
- Has more than one state to rollback to
- Can be automated
- Easy to restore
- Can make diffs between backup states
- Can merge backup states
The last two, diffs and backups, can seem a little peculiar and require some explanation. I work on several Orchard projects, where a team of developers collaborate. The team is virtual, with people in the US, and in various places in Europe. Because Orchard is a CMS, the database contains more than just data: it also stores the configuration of the site, something that developers modify as part of their routine job. Since it is impractical to work off a central database because of latency, and also because one developer can break everyone’s instance with a bad change, we need multiple copies of the database, and a way to reconcile them.
What works in a distributed fashion, resolves conflicts, and stores history? Why, a Distributed Version Control System of course! This is what gave me the idea of storing my database backups in a private repository on BitBucket, using Mercurial as the DVCS.
In order to be able to do diffs and merges, we need a text-based persistence format. I chose to use the bcp command-line utility to emit that format because my past experience of it has been that it’s fast and reliable. The code could in principle use the bulk copy SQL commands, but having never worked with them I went for bcp. If someone wants to modify the code and make a pull request using bulk copy that’s fine by me. Bcp produces flat text files for each table that are very easy to diff and merge.
My solution, HgDBackup, comes with two command-lines that automate bcp, one for backup and one for restore. HgDBackup.exe enumerates the tables in the database and spits out one text file per table. HgDRestore.exe enumerates the tables, tries to find a text file with the same name, and if found, drops all rows in the table before importing the contents of the file into the table.
The schema of the database is not backed up, so you need to do that as a separate operation every time it changes, or as a separate automated task.
In the distributed development environment I described above, the process to take a new version of the database will be as follows:
- execute HgDBackup
- commit local changes
- pull and update from the repository
- execute HgDRestore
- run tests
- commit if a merge was necessary
- push to repository
Both commands have built-in help that you can bring up with hgdbackup /? or hgdrestore /?.
The backup command responds to the following flags:
- /Server or /S: the address of the server to backup (e.g. tcp:myazuredb.database.windows.net or .\SQLExpress)
- /Database or /D: the name of the database to backup (e.g. Orchard)
- /User or /U: user name to use on the database
- /Password or /P: password to use on the database
- /Output or /O: output directory where the tables will be dumped (e.g. C:\backup)
- /Template or /T: connection string template
The restore command responds to the same flags, except that Output is replaced with Input.
The default connection string pattern is SQLExpress with integrated security (so no password has to be on the command-line).
For example, here is what I use to backup one of my Azure databases:
HgDBackup.exe
/S:tcp:something.database.windows.net /D:Orchard
/U:mysubscription@weirdstuff /P:**********
/O:C:\Projects\nwazet.Backup\Tables
/T:Server={0},1433;Database={1};User={2};Password={3};
Trusted_Connection=False;Encrypt=True;
And here is what I use to import that backup into my local SQLExpress instance:
HgDRestore.exe
/S:.\SQLExpress /D:Orchard
/I:C:\Projects\nwazet.backup\Tables
That nwazet.backup directory is a clone of a private repository on BitBucket. It’s of course important that the repository is private because databases contain secrets.
The code is available on BitBucket:
https://bitbucket.org/bleroy/hgdbackup
You may also download the already compiled commands:
http://weblogs.asp.net/blogs/bleroy/HgDBackup.zip
And remember, use at your own risk…