An interesting alternative to rsync is zsync (http://zsync.moria.org.uk/). A very brief summary of the differences:
* Instead of the sender generating checksums on demand, that work is performed once when the file is "published" and saved in a zsync metadata file.
* This zsync metadata file is fetched (simple copy) and the receiver uses it to decide which portions of the file it needs to request. It then requests only those portions.
* Because of the simplification, the protocol can be reduced to work over simple stateless HTTP. Any HTTPD that supports range requests can be a zsync server. Remote zsync files are represented by HTTP URLs.
* Note that this all but removes the CPU requirement on the sender/server.
I've used zsync in some very large systems to efficiently distribute write-few read-often files with only partial changes to many endpoints. Much more scalable than rsync due to the lack of CPU cost for the server/sender.
I also maintain a fork of zsync which runs using libcurl rather than the original author's custom http client code. This fork is primarily to support SSL: https://github.com/eam/zsync
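To make the publish/fetch flow above concrete, here's a minimal sketch; the host name and file names are made up, but the two commands are the standard zsyncmake/zsync pair:

    # Publisher, run once when the file is "published": generate the metadata
    # next to the file and record the URL clients should fetch ranges from.
    zsyncmake -u http://mirror.example.com/big.iso big.iso    # writes big.iso.zsync

    # Client: fetch the small .zsync file, then request only the byte ranges
    # that differ from a local seed file (-i). Any HTTPD with Range support works.
    zsync -i old-big.iso http://mirror.example.com/big.iso.zsync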
All of this is true, but do note that zsync is, at least for now, a single-file tool. If you are rsyncing thousands of files over a slow connection because only a little has changed, rsync can often do it with just a handful of bytes beyond the actual changes, whereas zsync needs hundreds of bytes per file just to see that nothing has changed.
Use zsync to distribute a small number of large files that have small changes. If you need to rsync hierarchies with lots of files, rsync is still king.
Absolutely true: the zsync client operates on a single file and doesn't manipulate file metadata. But this is a solvable problem, and I have written wrappers which deal with file hierarchies approximately as efficiently as rsync. Here is one I developed to drive a CM system comprised of many small files, most of which are unchanging: https://github.com/yahoo/cm3/tree/master/azsync
The additional process is to generate and send a list of filenames and metadata attributes (which rsync must do as well) and to invoke zsync per-file only if an update is necessary. For large trees of mostly unchanged files this is very efficient - much more so than fetching a zsync manifest per file.
The file path is generally the largest amount of data sent per-file, prior to sending the zsync manifest. This is similar to rsync.
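Here's a hedged sketch of that wrapper idea; the manifest format, host, and layout are hypothetical (not azsync's actual format), but it shows the shape of it:

    # The publisher exposes MANIFEST with one "sha256sum  path" line per file,
    # plus a path.zsync next to each published file (all names made up here).
    base=https://repo.example.com
    curl -s "$base/MANIFEST" | while read -r sum path; do
        # Only invoke zsync when the local copy is missing or its checksum differs.
        if ! echo "$sum  $path" | sha256sum -c --status 2>/dev/null; then
            zsync -o "$path" "$base/$path.zsync"
        fi
    done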
The main problem with the standard rsync utility is the protocol. Check out the Rsync Protocol section of this document:
"A well-designed communications protocol has a number of characteristics."
<list of characteristics>
"Rsync's protocol has none of these good characteristics."
...
"It unfortunately makes the protocol extremely difficult to document, debug or extend. Each version of the protocol will have subtle differences on the wire that can only be anticipated by knowing the exact protocol version."
This is why it is very hard to implement a client program that can communicate with the standard rsync daemon on a server. You can always use the rsync program itself to communicate with the server, but this is not always an option - and even when it is, it can get ugly. On Windows, you need Cygwin or similar to run rsync.exe, which can complicate the deployment of your desktop app or shell extension.
An easy rsync client API would be useful if you were building an app that stores files on an rsync server, because the rsync utility and the rsync algorithm are great ways to efficiently synchronize files.
> On Windows, you need Cygwin or similar to run rsync.exe, which can complicate the deployment of your desktop app or shell extension.
I once tried using rsync to deploy updates to a pre-existing, already-deployed website on a Windows server machine.
The site, which had been running fine, instantly stopped working, because rsync didn't merely copy the files over: it completely reset the existing ACLs and permissions on all the files. The result was that the webserver no longer had permission to access the website's files. This was repeatable on every sync.
librsync is a library for building rsync workalikes. It is not compatible with rsync itself.
librsync and the rdiff binary that wraps it can create a signature from a destination file, create a patch from a signature and a source file, and can apply a patch to a destination file. And that's about it. librsync doesn't concern itself with the networking. That's up to you.
rdiff is a thin wrapper around librsync. librsync can easily do anything rdiff can do, without having to fork a new process. You might wish the rsync executable were built this way, but it is not.
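The whole rdiff round trip is three commands; file names here are just placeholders, and moving the signature and delta between machines is left to you, exactly as described above:

    # 1. On the machine holding the old copy: summarize it into a signature.
    rdiff signature old/file.bin file.sig
    # 2. On the machine holding the new version: compute a delta against that signature.
    rdiff delta file.sig new/file.bin file.delta
    # 3. Back on the first machine: apply the delta to the old copy.
    rdiff patch old/file.bin file.delta file.new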
I'm almost sure that this description is out of date and describes rsync 2.
rsync 3 does not need to create or transfer the entire file list - in fact, it will start immediately and have no idea how many files are left; it's not uncommon for it to keep saying "just 1000 more files left" the whole time while working through a million files. You can force it to prescan all files with -m ("--prune-empty-dirs" or something like that) if you insist.
Also, I might be mistaken, but I think rsync 3 doesn't even transfer the entire file list to the other side - it will treat the directory like a file (one which contains file names, attributes, and checksums) and transfer that using rsync. If nothing changed, this takes a few bytes. If something did change, the directory listing is rsynced to the other side, and it is determined recursively which files and directories actually need to be transferred - with every directory that doesn't have any changes skipped, just like a file that doesn't need any changes.
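If you want to compare the two behaviours yourself, the documented switch for disabling incremental recursion is easy to try; paths here are placeholders:

    # rsync 3 default: incremental recursion, the transfer starts almost immediately.
    rsync -a src/ host:/dst/
    # Old rsync 2 style: build the complete file list up front before transferring.
    rsync -a --no-inc-recursive src/ host:/dst/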
The 'rolling checksum' part of the implementation is brilliant.
I have often wondered why it is that rsync is so life-saving-ly quick and how it is that a few small changes to a massive file (e.g. from mysqldump) can be copied up to a server from the slow end of an ADSL line so quickly. Now I know about the 'rolling checksum' I can see what is going on.
Note that I work with people who use 'FTP' to copy files, or even worse, people who find FTP too complicated and have to send me files on a 'Dropbox' thing so I can download them and upload them for them, notionally with 'FTP'. (I will use rsync instead, not least for the bandwidth control options).
I have even had micro-managers get me to get FTP to work on the server for them, despite my protestations about it being insecure (which it really is if you use a Windows PC and something like Filezilla).
Obviously I only use rsync and scp. Without aforementioned micro-managed requests I would not even know if FTP was installed on the server side.
My point is that it may be easy for a few folks here to criticise rsync; however, there are a lot of people, from clients to managers and even talented programmers, who just don't have a clue about rsync and are stuck in some stone age of using things like FTP.
> (which it really is if you use a Windows PC and something like Filezilla).
What does Windows as a client OS have to do with it? FTP is insecure because it transmits credentials in the clear and because it opens additional ports for the actual transfer of data. Neither of which are a concern of the client.
Thanks for your point - I have just remembered that I once recovered someone's FTP password for them from a TCP/IP stream, and it was fun but not difficult!
However, of the notable attacks I have witnessed recently, FileZilla's plain-text credentials file was the attack vector. Get that and away you go!
[0]: from http://blog.liw.fi/posts/rsync-in-python/ but this site has been on and off regularly, hence the scavenging straight from my browser cache. As of today, the site is up again but the bzr repo is out of order (and bzr is not exactly popular).
I always found myself looking for a simple way to back up a hierarchy of folders to an external device and then keep both copies synced; then I heard about rsync and discovered that it does just that. I've been using it exclusively for all of my backups - really useful.
EDIT: Also since we're talking about rsync, do you think the following options are sufficient for syncing a folder hierarchy from the local disk to an external flash drive?
    rsync -aW --delete /source /destination
My main concern is the -W (--whole-file) option, which skips the usual delta-transfer step (a step that slows down the already long process of syncing) but might end up writing a lot of bytes and wearing out the memory cells of the flash storage.
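For what it's worth, when both the source and destination are local paths rsync defaults to whole-file transfers anyway, so -W mostly matters over a network; a sketch of both cases, with the remote host being hypothetical:

    # Local disk to flash drive: whole-file is already the default here.
    rsync -a --delete /source /destination
    # Over a network link, drop -W (and perhaps add -z) to minimise bytes sent.
    rsync -az --delete /source remote:/destination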
You might enjoy checking out rsnapshot[1], which is a convenient way to store backups as snapshots. It of course uses rsync. I've been using it for several years now, and it's saved my ass on more than one occasion.
Note that I haven't touched the configuration since I set it up. It's really great.
I've given unison a lot of chances over the years, but I keep going back to rsync with "inbox" and "outbox" directories (if that can be done), or [le]git push/pull/sync if not.
Unison is very slow compared to rsync, and the versions at both ends must match (which means you'll likely need to compile your own unless all your machines run the same distro and version).
Something that wasn't clear to me right away is that the generator is running on the remote system (assuming a remote transfer) so in the generator -> sender -> receiver bit each -> is data going over the network.
We use rsync extensively all throughout our deployment pipeline. Here are a few pointers on how we use it.
Don't rsync directly to the location you are running your application from. Instead, upload to a staging directory and then use a symlink to change from one version of your code to the next. Changing a symlink is an atomic operation.
We have a user called something like ~packages which has all the static code and assets in it. This user's data should be read-only to the users that run the actual services. Inside that user dir, we have version directories like tags/0.11.1/1, tags/0.11.1/2 and tags/0.11.2/1. These directories correspond to tags from our version control system.
Switching over to a new build just means stop service, change symlink, start. Some services don't need the stop and start part.
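A minimal sketch of that switch-over; the "current" symlink name is hypothetical, the tag paths are the ones from above, and the atomicity comes from the final rename():

    # Point a scratch symlink at the new build, then atomically replace "current".
    # (mv -T is GNU coreutils' --no-target-directory.)
    ln -s tags/0.11.2/1 current.new
    mv -T current.new current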
You can use hard links to make this process even better. Our build system uses the "--link-dest" option to specify the last build's directory when uploading a new build. This means that files that have not changed from the last build don't consume any extra space on the disk. Since the inodes are the same, they even stay in the file system cache after the deploy.
You can have lots of past versions sitting there on the server without taking up any space. If you have a bad deploy, and need to revert to a past version, just change the symlink again.
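The upload step with hard links might look roughly like this; the tag paths are the hypothetical ones from above, and a relative --link-dest is resolved against the destination directory:

    # Files identical to the previous build become hard links instead of new copies.
    rsync -a --link-dest=../../0.11.1/2/ \
        build/ packages@host:tags/0.11.2/1/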
Rsync is just a file transfer tool with extra options. Deployment involves a lot more pieces. The file transfer component of your deployment could certainly use Rsync, assuming you aren't limited to a particular transport protocol (though rsync does support HTTP proxies!)
Here are some of the neat features of Rsync you can take advantage of for deployments (there's a sketch of a daemon-side config after this list):
* Fault tolerance: when an error happens at any layer (network, local i/o, remote i/o, etc), Rsync will report it to you. Trapping these errors will give you better insight into the status of your deployments.
* Authentication: the Rsync daemon supports its own authentication schemes.
* Logging: report various logs about the transfer process to syslog, and collect from these logs to learn about the deployment status.
* Fine-grained file access: use a 'filter', 'exclude' or 'include' to specify what files a user can read or write, so complex sets of access can be granted for multiple accounts to use the same set of files (you can also specify specific operations that will always be blocked by the daemon)
* Proper permissions: force the permissions of files being transferred, so your clients don't fuck up and transfer them with mode 0000 perms ("My deploy succeeded, but the files won't load on the server! Wtf?")
* Pre/post hooks: you can specify a command to run before the transfer, and after, making deployment set-up and clean-up a breeze.
* Checksums on file transfers for integrity
* Preserves all kinds of file types, ownership and modes, with tons of options to deal with different kinds of local/remote/relative paths, even if you aren't the super-user (including acls/xattrs)
* Tons of options for when to delete files and when to apply the files on the remote side (before, during or after transfer, depending on your needs)
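To make that concrete, here is a hedged sketch of a daemon-side module wiring a few of those features together; the module name, paths, user, and hook scripts are all made up:

    # /etc/rsyncd.conf (sketch only; every name below is hypothetical)
    log file = /var/log/rsyncd.log

    [deploy]
        path = /home/packages/tags
        read only = false
        # Authentication: only this account may write, password in the secrets file.
        auth users = deployer
        secrets file = /etc/rsyncd.secrets
        # Proper permissions: force sane modes on everything that arrives.
        incoming chmod = D755,F644
        # Pre/post hooks for deployment set-up and clean-up.
        pre-xfer exec = /usr/local/bin/notify-deploy-start
        post-xfer exec = /usr/local/bin/notify-deploy-done
        # Operations the daemon will always refuse, regardless of client flags.
        refuse options = delete

Clients then push with something like rsync -a --password-file=... build/ rsync://deployer@host/deploy/0.11.2/1/.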
If your current deployment procedure is "I just scp this directory or zip file up", then yes, rsync may be slightly better. It ultimately depends on how much actually changed between your new build artifact and whatever is actually on your server. If you're deploying using just scp though, I'd strongly suggest looking at a deploy tool (e.g., capistrano).
Where I've found rsync really valuable is for good ol' regular file copying ("I just need to stick this one file or directory on a server"). I've pretty much stopped using scp and replaced it with rsync. rsync is awesome because:
1) you can resume interrupted transfers
2) it's much faster than scp when sending lots of small files
3) it's actually you know, a sync tool, as opposed to just a copy tool
If you miss the little progress bar that scp gives, you can also use --progress with rsync and then it's basically a drop in replacement.
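For reference, the usual scp-replacement incantation looks like this (host and paths are placeholders):

    # -a preserves attributes; -P is --partial --progress, giving a progress bar
    # and keeping partially transferred files so re-running resumes the copy.
    rsync -aP bigfile.tar user@host:/data/
    rsync -aP user@host:/data/bigfile.tar .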
Rsync always sends a list of files (and their attributes). But typically most files haven't changed. They could just send the files that have changed since the last sync.
Yes, in the general case. But in the case of a backup that's done daily, the sender can say: here are all the files that changed since we last did this.
rsync already supports that. It has an "offline"/"batch" mode - you can generate a diff, only send that, and apply it at the other end. However, you are unlikely to save any traffic that way. rsync is super efficient and does not necessarily send complete directory listings.
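The batch mode looks roughly like this; directory names are hypothetical, and the "mirror" is just a local copy representing the receiver's current state:

    # Sender: record what would change, without touching the local mirror.
    rsync -a --only-write-batch=changes.batch /data/ /mirror-of-receiver/
    # Ship changes.batch offline (USB stick, email, ...), then on the receiver:
    rsync -a --read-batch=changes.batch /data/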
Too bad there's not a complementary document, "How Rsync Breaks", because that one would be quite useful as well. I've had it fail in the most annoying and arbitrary ways, and it's dissuaded me from using it in any real production situations.
Yep. The main caveat is that updating files is not transactional. If rsync is stopped (crash, disconnect) in the middle of updating a file, then what you get is a corrupted file.
When rsync needs to update a file it saves the contents to a temporary file first and then renames it into place at the end, which should be an atomic operation on most filesystems. So you shouldn't end up with half-updated files (unless you use the --inplace switch), but you can end up in situations where half the files in a directory are updated and half are not, which can be just as bad.
Interesting, didn't know about the temp file. It doesn't really make updates atomic, but it certainly reduces the chances of ending up with a partially updated file.
No, it DOES make a single file update atomic. What it doesn't do is make multiple updates atomic.
The way rsync works, it CANNOT end up with a partially updated file! (unless you use --inplace or --append which implies it - and it's your fault if you do)
Of course it CAN, and it DOES NOT make the update atomic. If I flip two bits in a large file - one at the head and one at the tail - then no matter how clever the algorithm is, the update cannot be atomic without proper support from the OS, because it would involve two separate writes into the file.
On Windows there's Transactional NTFS whereby you can bind an open file to a transaction and then have either all or no changes applied at once. But that's only Vista+ and I am pretty sure rsync doesn't use it anyhow.
Flip those two bits. What rsync will do on the target system is create a copy of the file you want (with a name like .tempxasdiohkshlksdf-filename.ext) which takes most of the data from the local copy, plus a few kilobytes of patches transferred. Then, when this file has been created, closed, its attributes properly set, and it is an identical copy of the file on the source system - it will rename ("move") the temporary file into the name that it should have. This move operation is what makes everything atomic.
It does cost another copy of the file on disk, but it does NOT leave the file in an inconsistent state. It is either the original file, or the new file - no in between.
You CAN avoid this behavior, by using the "--inplace" switch or the "--append" switch, which tell rsync to just modify the file in-place. However, this is NOT atomic, and NOT the default (for that exact reason).
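For reference, the switches in question (paths are placeholders):

    # Default: build a ".file.XXXXXX" temp file in the destination directory,
    # then rename() it into place once it's complete.
    rsync -a big.db host:/srv/data/
    # Not atomic: patch the destination file directly; an interrupted run
    # leaves a mixed old/new file.
    rsync -a --inplace big.db host:/srv/data/
    # Narrow the "half the files updated" window: stage all updated files and
    # move them into place together at the end of the transfer.
    rsync -a --delay-updates tree/ host:/srv/tree/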
I've used rsync many times and never had any problem with corrupted files (though it might just be luck). Would running rsync a second time fix the corrupted files?