Automated Snapshot Backups
It's quite easy to implement a snapshot-style backup on Linux that, in essence, gives you a time-stamped version of your data while using hardly any disk space beyond the original data set you are backing up.
Total disk usage is similar to what you'd see with incremental backups, though it behaves a bit differently in some respects.
This style backup will work if:
* You are on a *nix system with a cp command that supports the -al switches.
* You can keep your "master" copy and its revisions on the same partition, as hard linking does not work across different volumes.
The basics:
cp -al (archive mode, link) does all the work. "-a" is shorthand for "-dpR", and "-l" causes hard links to be created instead of actually copying the files.
Everything else is built around this behavior to take advantage of the "coolness" of linking vs. copying files.
Why it works: Linux has the ability to link files instead of actually copying them. If you copy a 1K file from one location to another and then run ls -i, you will see that the two files are separate: each has its own inode, each is 1K in size, and together they take 2K of space.
Now, do the same copy of the original file using the -al switches (cp -al), and you will see the two files have the same inode. That means they are really the same file, both names pointing at the same data. And for both files, total disk usage is.... 1K. Two "files", each 1K, taking a total of 1K of space. Weird. But oh, so exploitable.
It's this behavior that allows "cp -al" snapshot coolness. Some of you sharper ones are thinking about now, "OK, so the next time I update my master copy and a file is deleted, the other file with the same inode is toast too, right?" Fortunately, Linux "unlinks" files rather than erasing the data outright, so if you delete a file in the master set, your other same-inode file is safe and stands on its own. For example, with our two 1K files above, if one linked file is deleted, the other still exists, intact. Try it if you don't believe me. You'll see.
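Here's the experiment in a nutshell, using a throwaway directory (the paths are just examples):
-------------------
mkdir -p /tmp/linktest && cd /tmp/linktest
dd if=/dev/zero of=original bs=1K count=1   # make a 1K test file
cp -al original linked                      # "copy" it as a hard link
ls -li original linked                      # same inode number on both lines, link count of 2
du -sh .                                    # only one copy's worth of data on disk
rm original                                 # delete one name...
ls -l linked                                # ...and the other is still there, intact
-------------------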
Expand this idea to a directory. Say you have a directory where your daily backup puts its data, full of stuff you want to snapshot -
/backup/master/mydata
Let's make another folder -
/backup/snapshot
Then, let's copy "mydata" to it using cp -al
cp -al /backup/master/mydata /backup/snapshot
Now, you have:
/backup/master/mydata
/backup/snapshot/mydata
(the directories themselves are not hard-linked, only the files inside them - a good thing if you contemplate it...)
If the original "mydata" set takes 1GB of space, since linking has occurred instead of copying, you now have a "point in time" copy, and your total usage for both data sets? 1 GB. Just like our 1K + 1K = 1K, your 1G + 1G = 1G.
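A quick way to convince yourself, using the example paths above: du counts hard-linked data only once per run, so the combined total comes out to roughly the size of a single copy.
-------------------
du -sh /backup/master/mydata      # reports the full size, about 1GB
du -sh /backup/snapshot/mydata    # also reports about 1GB on its own
du -sh /backup                    # combined total is still about 1GB, since the data blocks are shared
-------------------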
Oh, linking, where have you been all my life? Does this stuff work on money??
cp -al /paycheck/mymoney /bankaccount/moremoney
Shoot, doesn't work.
But, I digress..
Now, we run a regular backup script the next day that updates /backup/master/mydata. Some files have been deleted, some have been added, some have changed. Any file that changed or was deleted gets unlinked from its old inode, while unchanged files keep the inodes they already had.
Then, we run the snapshot script again, this time copying to another folder:
cp -al /backup/master/mydata /backup/snapshot2/
Now, we have a snapshot of that day under /snapshot2. Under /snapshot, we still have a point-in-time representation of the data as it stood when that set was copied (yesterday, let's pretend).
So, with this simple idea: have some other script run, say, daily, to back up important data to our master directory; then, after we are sure that process has finished, run a script to cp -al that data to a snapshot directory, naming the snapshot directory for the current day (more on that in a minute). In essence we end up with a master plus separate snapshot directories, all together consuming only the space of the master plus the files that changed between versions.
To give a real-world example, I have one data set where the master data uses about 1.5TB. Each day, I run a snapshot copy after my main backup has finished updating /backup/master. The entire disk space I have to work with, for storing the master and all snapshots, is 5.4TB. Keep in mind, this data is heavily used, and many files are updated, deleted, and added each day. So, common sense tells you that 5.4TB divided by 1.5TB = 3.6, meaning you would expect room for about 3 1/2 full copies in the total usable space. Using this snapshot trick, here are my real-world values:
Total master data volume: 1.5TB
Total disk space to work with: 5.4TB
Number of daily snapshots I have at this moment: 156 (yes, 156 *daily* revisions)
Total free space left right now on this same volume: 1.3TB
This means that in 4.1TB of used space, I have the original Master data of 1.5TB, plus 156 *daily* revisions of this data, with 1.3TB free to hold more. My oldest snapshot is from 11-16-2008, and I have one snapshot for every day between then and today, 4-18-2009.
So that's, give or take, 5 months worth of daily snapshots. About 2.7TB is consumed to make these 156 revisions, so not quite twice the space of the original data. And I have about 3-4 months to go before the volume becomes full. Nice, huh?
You can apply this to any size data set you have. The only situation I have found where this style doesn't work well is a volume where almost every file changes between snapshot intervals. Even for normal "heavy" usage, it has always worked well for me in almost every circumstance. In real-world use, it's mainly database files that do not work well, but that's about it. I've tried it with SQL dump files and FoxPro databases (similar to a flat-file DB), and it just doesn't pay off because all of those large files are rewritten every day. That is the only limiting factor I've found so far.
"So, how can I get started?"
Glad you asked; I'll include the info and script you need to get started. This is in no way to be considered complete, as *how* you get the data to the Master is up to you, but you do want some type of mirroring to your Master, so that data deleted from the real, in-use data set is also removed from your Master backup set. This is important because you want a true "point in time" version with each snapshot. This means if you use rsync, use the --delete option. If you use robocopy, use the /MIR option, etc.
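For instance, a mirror-style rsync to the Master might look like this (the source path is made up; substitute your own):
-------------------
# Mirror the live data into the master backup set (trailing slashes matter to rsync).
# --delete removes files from the master that were deleted at the source,
# so every snapshot is a true point-in-time picture.
rsync -a --delete /home/mydata/ /backup/master/mydata/
-------------------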
I personally use a few different methods. On one server, I even mount the volume I want to back up over CIFS, so I can simply rsync what appears to be local filesystem data to my Master backup. How the data gets to your Master is unimportant, as long as you don't use a method that is ignorant of linking / unlinking. SMB shares will naturally work, as will rsync (rsync unlinks before deleting), and Linux's native file commands also understand inodes and unlinking.
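A rough sketch of that CIFS approach (server name, share, mount point, and credentials file are all placeholders):
-------------------
# Mount the remote share read-only, mirror it into the master, then unmount
mount -t cifs //fileserver/shared /mnt/fileserver -o ro,credentials=/root/.smbcredentials
rsync -a --delete /mnt/fileserver/ /backup/master/mydata/
umount /mnt/fileserver
-------------------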
What my real-world script adds is a way to stamp each daily snapshot with the date. That way the snapshots name themselves, and you can run the same simple cp -al script each day, since the created snapshot directory name comes from a variable.
Here we go, real-world stuff. I'll give a real example from one of my most generic servers, taken from a working model, with the names changed to protect the innocent - (I like to call my master data "backup.0", so sue me)
Master Data:
/data/backup/master/backup.0
Snapshot dir:
/data/backup/versions/
Script (feel free to cut and paste, don't include the dashed lines, and make sure the file is executable by the user running it) -
-------------------
#!/bin/bash
#Set timestamp variable
timestamp=$(date +%F)
# copy hardlinks from master to timestamped version
cp -al /data/backup/master/backup.0 /data/backup/versions/backup.$timestamp
--------------------
That's it. It's that simple. Feel free to change the timestamp to suit your needs; this example (date +%F) returns yyyy-mm-dd, so your backups are tagged backup.yyyy-mm-dd (as in backup.2009-02-15).
The command "date --help" will give you all the options you can use. Stamp it how you see fit.
Set this script (changed to reflect your directory structure) to run in a cron job AFTER a nightly backup finishes updating to your /data/backup/master/backup.0, and there you go. All done.
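As a hedged example, if the nightly backup wraps up well before 6:00 AM and the script above is saved as /data/backup/scripts/snapshot.sh (both assumptions, adjust to your setup), the crontab entry could be:
-------------------
# minute hour day-of-month month day-of-week  command
0 6 * * * /data/backup/scripts/snapshot.sh
-------------------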
Where to go from here? Ideas based on enhancements I've used (or at least considered) -
Cleanup: I usually watch my data manually, and will issue a delete command and dump the oldest month of backups when the volume starts filling up. This can be scripted also, but I mainly just have an alert on the drive itself to warn when space is low, then I make an intelligent decision on what to dump.
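If you do want to script the cleanup, here's a rough sketch that leans on the sortable backup.yyyy-mm-dd names created above (the 180-day retention is an arbitrary example; adjust it, or trigger it off free space instead):
-------------------
#!/bin/bash
# Remove snapshots older than a cutoff, using the date embedded in the directory name
snapdir=/data/backup/versions
cutoff=$(date -d "180 days ago" +%F)
for dir in "$snapdir"/backup.*; do
    [[ -d "$dir" ]] || continue
    stamp=${dir##*/backup.}
    if [[ "$stamp" < "$cutoff" ]]; then
        echo "Removing old snapshot: $dir"
        rm -rf "$dir"
    fi
done
-------------------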
Easy Access: For users on my network, I create a read-only SMB share to the time-stamped snapshots, so that anyone can go get any file they need from any date retained, but can't alter the data in any way.
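For reference, the share can be as simple as a stanza like this in smb.conf (share name and settings here are just placeholders; season to taste):
-------------------
[snapshots]
   path = /data/backup/versions
   read only = yes
   browseable = yes
   guest ok = no
-------------------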
I've taken this idea and expanded and enhanced it quite a lot over the years, and even turned it into an on-site to off-site backup service, all using very free tools and operating systems. I hope you get as much use out of it as I have.
Enjoy!