Automated Snapshot Backups
It's quite easy to implement a snapshot-style backup on Linux that, in essence, gives you a time-stamped version of your data while using hardly any disk space beyond the original data set you are backing up.
Total disk usage is similar to what you'd see with incremental backups, though it behaves a bit differently in some respects.
This style backup will work if:
* You are on a *nix system with a cp command that supports the -al switches.
* You can keep your "master" copy and its revisions on the same partition, as hard linking does not work across different volumes.
The basics:
cp -al (archive mode, link) does all the work. "-a" is shorthand for "-dpR", and "-l" causes hard links to be created instead of actually copying the files.
Everything else is built around this behavior to take advantage of the "coolness" of linking vs. copying files.
Why it works: Linux has the ability to link files instead of actually copying them. If you copy a 1K file from one location to another and then run ls -i, you will see that the two files are separate: each has its own inode, each is 1K in size, and together they take 2K of space.
Now, do the same copy of the original file using the -al switches (cp -al), and you will see the two files have the same inode. That means they are really the same file, both names pointing at the same data. And for both files, total disk usage is.... 1K. Two "files", each 1K, taking a total of 1K of space. Weird. But oh, so exploitable.
It's this behavior that allows "cp -al" snapshot coolness. Some of you sharper ones are thinking about now, "OK, so the next time I update my master copy and a file is deleted, the other file with the same inode is toast too, right?" Fortunately, Linux "unlinks" files rather than erasing the data outright, so if you delete a file in the master set, your other same-inode file is safe and stands on its own. For example, with our two 1K files above, if one linked file is deleted, the other still exists, intact. Try it if you don't believe me. You'll see.
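Here's the experiment in a nutshell, using a throwaway directory (the paths are just examples):
-------------------
mkdir -p /tmp/linktest && cd /tmp/linktest
dd if=/dev/zero of=original bs=1K count=1   # make a 1K test file
cp -al original linked                      # "copy" it as a hard link
ls -li original linked                      # same inode number on both lines, link count of 2
du -sh .                                    # only one copy's worth of data on disk
rm original                                 # delete one name...
ls -l linked                                # ...and the other is still there, intact
-------------------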
Expand this idea to a directory. Say you have a directory where your daily backup puts its data, full of stuff you want to snapshot -
/backup/master/mydata
Let's make another folder -
/backup/snapshot
Then, let's copy "mydata" to it using cp -al
cp -al /backup/master/mydata /backup/snapshot
Now, you have:
/backup/master/mydata
/backup/snapshot/mydata
(the directories themselves are not hard-linked, only the files inside them - a good thing if you contemplate it...)
If the original "mydata" set takes 1GB of space, since linking has occurred instead of copying, you now have a "point in time" copy, and your total usage for both data sets? 1 GB. Just like our 1K + 1K = 1K, your 1G + 1G = 1G.
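A quick way to convince yourself, using the example paths above: du counts hard-linked data only once per run, so the combined total comes out to roughly the size of a single copy.
-------------------
du -sh /backup/master/mydata      # reports the full size, about 1GB
du -sh /backup/snapshot/mydata    # also reports about 1GB on its own
du -sh /backup                    # combined total is still about 1GB, since the data blocks are shared
-------------------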
Oh, linking, where have you been all my life? Does this stuff work on money??
cp -al /paycheck/mymoney /bankaccount/moremoney
Shoot, doesn't work.
But, I digress..
Now, we run a regular backup script the next day that updates /backup/master/mydata. Some files have been deleted, some have been added, some have changed. Any file that changed or was deleted gets unlinked from its old inode, while unchanged files keep the inodes they already had.
Then, we run the snapshot script again, this time copying to another folder:
cp -al /backup/master/mydata /backup/snapshot2/
Now, we have a snapshot of that day under /snapshot2. Under /snapshot, we still have a point-in-time representation of the data as it stood when that set was copied (yesterday, let's pretend).
So, with this simple idea: have some other script run, say, daily, to back up important data to our master directory; then, after we are sure that process has finished, run a script to cp -al that data to a snapshot directory, naming the snapshot directory for the current day (more on that in a minute). In essence we end up with a master plus separate snapshot directories, all together consuming only the space of the master plus the files that changed between versions.
To give a real-world example, I have one data set where the master data uses about 1.5TB. Each day, I run a snapshot copy after my main backup has finished updating /backup/master. The entire disk space I have to work with, for storing the master and all snapshots, is 5.4TB. Keep in mind, this data is heavily used, and many files are updated, deleted, and added each day. So, common sense tells you that 5.4TB divided by 1.5TB = 3.6, meaning you would expect room for about 3 1/2 full copies in the total usable space. Using this snapshot trick, here are my real-world values:
Total master data volume: 1.5TB
Total disk space to work with: 5.4TB
Number of daily snapshots I have at this moment: 156 (yes, 156 *daily* revisions)
Total free space left right now on this same volume: 1.3TB
This means that in 4.1TB of used space, I have the original Master data of 1.5TB, plus 156 *daily* revisions of this data, with 1.3TB free to hold more. My oldest snapshot is from 11-16-2008, and I have one snapshot for every day between then and today, 4-18-2009.
So that's, give or take, 5 months worth of daily snapshots. About 2.7TB is consumed to make these 156 revisions, so not quite twice the space of the original data. And I have about 3-4 months to go before the volume becomes full. Nice, huh?
You can apply this to any size data set you have. The only situation I have found where this style doesn't work well is a volume where almost every file changes between snapshot intervals. Even for normal "heavy" usage, it has always worked well for me in almost every circumstance. In real-world use, it's mainly database files that do not work well, but that's about it. I've tried it with SQL dump files and FoxPro databases (similar to a flat-file DB), and it just doesn't pay off because all of those large files are rewritten every day. That is the only limiting factor I've found so far.
"So, how can I get started?"
Glad you asked; I'll include the info and script you need to get started. This is in no way to be considered complete, as *how* you get the data to the Master is up to you, but you do want some type of mirroring to your Master, so that data deleted from the real, in-use data set is also removed from your Master backup set. This is important because you want a true "point in time" version with each snapshot. This means if you use rsync, use the --delete option. If you use robocopy, use the /MIR option, etc.
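For instance, a mirror-style rsync to the Master might look like this (the source path is made up; substitute your own):
-------------------
# Mirror the live data into the master backup set (trailing slashes matter to rsync).
# --delete removes files from the master that were deleted at the source,
# so every snapshot is a true point-in-time picture.
rsync -a --delete /home/mydata/ /backup/master/mydata/
-------------------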
I personally use a few different methods. On one server, I even mount the volume I want to back up over CIFS, so I can simply rsync what appears to be local filesystem data to my Master backup. How the data gets to your Master is unimportant, as long as you don't use a method that is ignorant of linking / unlinking. SMB shares will naturally work, as will rsync (rsync unlinks before deleting), and Linux's native file commands also understand inodes and unlinking.
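A rough sketch of that CIFS approach (server name, share, mount point, and credentials file are all placeholders):
-------------------
# Mount the remote share read-only, mirror it into the master, then unmount
mount -t cifs //fileserver/shared /mnt/fileserver -o ro,credentials=/root/.smbcredentials
rsync -a --delete /mnt/fileserver/ /backup/master/mydata/
umount /mnt/fileserver
-------------------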
What my real-world script adds is a way to stamp each daily snapshot with the date. That way the snapshots name themselves, and you can run the same simple cp -al script each day, since the created snapshot directory name comes from a variable.
Here we go, real-world stuff. I'll give a real example from one of my most generic servers, taken from a working model, with the names changed to protect the innocent - (I like to call my master data "backup.0", so sue me)
Master Data:
/data/backup/master/backup.0
Snapshot dir:
/data/backup/versions/
Script (feel free to cut and paste, don't include the dashed lines, and make sure the file is executable by the user running it) -
-------------------
#!/bin/bash
#Set timestamp variable
timestamp=$(date +%F)
# copy hardlinks from master to timestamped version
cp -al /data/backup/master/backup.0 /data/backup/versions/backup.$timestamp
--------------------
That's it. It's that simple. Feel free to change the timestamp to suit your needs; this example (date +%F) returns yyyy-mm-dd, so your backups are tagged backup.yyyy-mm-dd (as in backup.2009-02-15).
The command "date --help" will give you all the options you can use. Stamp it how you see fit.
Set this script (changed to reflect your directory structure) to run in a cron job AFTER a nightly backup finishes updating to your /data/backup/master/backup.0, and there you go. All done.
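As a hedged example, if the nightly backup wraps up well before 6:00 AM and the script above is saved as /data/backup/scripts/snapshot.sh (both assumptions, adjust to your setup), the crontab entry could be:
-------------------
# minute hour day-of-month month day-of-week  command
0 6 * * * /data/backup/scripts/snapshot.sh
-------------------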
Where to go from here? Ideas based on enhancements I've used (or at least considered) -
Cleanup: I usually watch my data manually, and will issue a delete command and dump the oldest month of backups when the volume starts filling up. This can be scripted also, but I mainly just have an alert on the drive itself to warn when space is low, then I make an intelligent decision on what to dump.
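If you do want to script the cleanup, here's a rough sketch that leans on the sortable backup.yyyy-mm-dd names created above (the 180-day retention is an arbitrary example; adjust it, or trigger it off free space instead):
-------------------
#!/bin/bash
# Remove snapshots older than a cutoff, using the date embedded in the directory name
snapdir=/data/backup/versions
cutoff=$(date -d "180 days ago" +%F)
for dir in "$snapdir"/backup.*; do
    [[ -d "$dir" ]] || continue
    stamp=${dir##*/backup.}
    if [[ "$stamp" < "$cutoff" ]]; then
        echo "Removing old snapshot: $dir"
        rm -rf "$dir"
    fi
done
-------------------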
Easy Access: For users on my network, I create a read-only SMB share to the time-stamped snapshots, so that anyone can go get any file they need from any date retained, but can't alter the data in any way.
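For reference, the share can be as simple as a stanza like this in smb.conf (share name and settings here are just placeholders; season to taste):
-------------------
[snapshots]
   path = /data/backup/versions
   read only = yes
   browseable = yes
   guest ok = no
-------------------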
I've taken this idea and expanded and enhanced it quite a lot over the years, and even turned it into an on-site to off-site backup service, all using very free tools and operating systems. I hope you get as much use out of it as I have.
Enjoy!