= bfiles - manage large binary files =

bfiles is a Mercurial extension for tracking large binary files. Typically such files are (to some extent) not compressible, not mergeable, and not diffable (i.e. diffs are meaningless, and possibly pointless because small changes result in large diffs). Mercurial's repository structure is built on compressed deltas, which don't work very well on such files. As a result, Mercurial does not handle large binary files very well. Compounding the problem is the distributed nature of the beast: since every Mercurial working copy has a copy of the history of all those files, clones take more disk space, more time, and more bandwidth. Finally, Mercurial typically reads complete files into memory, so many Mercurial operations can consume large amounts of memory when working with large files.

This is often a problem for organizations migrating from centralized version control systems like CVS or Subversion to Mercurial. For various reasons, people tend to put large binary files under source control. And then they build complex workflows around that fact, whether it is a historical mistake or a perfectly valid (if unexpected) use of their source control system. In a centralized system, the history all lives on a central server, so the effects of this pattern are confined to wasting disk space on the server. Converting to Mercurial magnifies the effects: now it slows down every clone operation and wastes considerable disk space on both client and server machines.

bfiles is intended to bridge the gap between centralized VC systems and Mercurial by centralizing the storage of large binary files, while still letting Mercurial's distributed nature work well on text files. Specifically, bfiles works by leaving the history of large files on a central server somewhere, fetching only the revisions that you need for a particular changeset.
The correspondence between changesets and large file revisions is stored in a collection of small version-controlled files.

== The central store ==

Farthest removed from your working copy is the central store, which is just a regular filesystem directory on a server somewhere. You can access it using:

 * a filesystem path (for network mounts or for repositories on the same machine as the central store)
 * an SSH URL
 * an HTTP URL

The structure of the central store is straightforward: each large file has a directory of revisions, where each revision is stored in a separate file named after its SHA-1 hash. For example, if your working copy contains two large files:

    lib/biglib1.jar
    lib/biglib2.jar

then your central store might look like

    http://bigserver.example.com/bfiles/lib/biglib1.jar/546f727d3f52b2f230d65d864b1b3bb4ef2c1d02
    http://bigserver.example.com/bfiles/lib/biglib1.jar/ba35463bd5e25083b932aa61e7731f7a7047ee50
    http://bigserver.example.com/bfiles/lib/biglib1.jar/75f1d83ce27bab5f29fff034fc74aa9f7266f22a
    http://bigserver.example.com/bfiles/lib/biglib2.jar/868c0792233fc78d8c9bac29ac79ade988301318

(That's 3 revisions of lib/biglib1.jar and 1 of lib/biglib2.jar.) If you use bfiles to populate the central store from the beginning, this structure will be built up for you.

By default, bfiles uses the same protocol for both getting large files from and putting large files to the central store. For example, if your .hg/hgrc configures the central store as an HTTP URL:

    [bfiles]
    store = http://bigserver.example.com/bfiles/

then bfiles will use HTTP GET to download large files and HTTP PUT (not POST!) to upload them. (bfiles includes nothing to help you get HTTP PUT working: that's your problem.)

Analogous to Mercurial's 'default-push' configuration setting, bfiles supports 'store-put' if you need to use separate protocols (or even separate locations) for getting and putting large files.
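
To make the store layout concrete, here is a minimal Python sketch of how a revision's location could be derived from the configured store, the repository-relative filename, and the SHA-1 revision ID. The function names `store_url` and `get_put_stores`, and the use of a plain dict for the `[bfiles]` config section, are hypothetical and chosen only for illustration; they are not bfiles' actual code.

```python
def store_url(store, filename, revid):
    """Build <store>/<filename>/<revid>, the layout described above.

    Hypothetical helper: `store` is the configured base location,
    `filename` is the repository-relative path of the large file,
    `revid` is the 40-character SHA-1 hex digest of the revision.
    """
    return "/".join([store.rstrip("/"), filename.strip("/"), revid])


def get_put_stores(config):
    """Return (get_store, put_store) from a dict of [bfiles] settings.

    Assumption: 'store-put' falls back to 'store' when it is not set,
    by analogy with how 'default-push' relates to 'default'.
    """
    get_store = config["store"]
    put_store = config.get("store-put", get_store)
    return get_store, put_store
```

With the example configuration above, `store_url("http://bigserver.example.com/bfiles/", "lib/biglib2.jar", "868c0792233fc78d8c9bac29ac79ade988301318")` yields the last URL in the listing.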

== The local cache ==

In future, bfiles will support a local cache to save repeated downloads of the same revision. There are a number of unresolved design issues, though:

 1) It's desirable to share the local cache between different clones of the same repository on one machine, but it's unclear what the default location of the local cache should be to easily support this. One likely possibility is for the local cache to live under .hg/bfiles by default (not shared), but to allow the user to move it and reconfigure its location for shared use.

 2) Should bfiles also try to save time and disk space by hardlinking from the local cache to the working copy? This sounds nice, except it means that every tool that might update a large file would have to break the hardlink first. That is unlikely, since large files are more likely than source files to be modified in-place. Thus, using hard links to save disk space greatly increases the risk that activity in one working copy modifies the local cache, and even other repositories on the same machine. Not good.

 3) Should bfiles support a hierarchy of caches, say for geographically distributed groups with good local connectivity but slow links to the central store? This could be useful for certain cases, but might complicate things considerably.

Thus, bfiles does *not* currently support a local cache. Every "get" operation downloads the required revision from the central store. This reduces bfiles to Subversion/CVS-like behaviour ... which is no worse than the status quo for people who are converting from Subversion/CVS-with-big-files to Mercurial-with-bfiles.

== The working copy ==

bfiles manifests in the working copy as a directory .hgbfiles/ (in the repository root), a tree of version-controlled stand-ins for actual large files.
For example, if you wish to track large files

    lib/biglib1.jar
    lib/biglib2.jar

then the stand-ins would be

    .hgbfiles/lib/biglib1.jar
    .hgbfiles/lib/biglib2.jar

where each stand-in simply contains the revision ID of the corresponding large file. Large file revisions are identified by the SHA-1 hash of the file contents, so the stand-ins are just 40-byte files:

    $ cat .hgbfiles/lib/biglib1.jar
    75f1d83ce27bab5f29fff034fc74aa9f7266f22a
    $ cat .hgbfiles/lib/biglib2.jar
    868c0792233fc78d8c9bac29ac79ade988301318

(Aside: an early version of this design specified that large files would be tracked in a single file, .hgbfiles, containing revision IDs and filenames. This idea was dropped in favour of .hgbfiles/ as a directory for various reasons:

 * history of an individual big file is trivial: 'hg log .hgbfiles/...'
 * detecting which big files changed in a particular changeset is trivial: just look for manifest differences under .hgbfiles/; no need to diff a single .hgbfiles file
 * file flags (namely the executable bit) are almost free: let Mercurial take care of .hgbfiles/, and bfiles just has to make sure the real big files are in sync with the stand-ins
 * tracking renames is almost free
 * `.hg/dirstate` keeps track of whether a stand-in has been added/removed, meaning bfiles does not have to duplicate that for the real big files

The only disadvantages of .hgbfiles/ as a directory are:

 * more local disk space, depending on the filesystem - but if you're downloading hundreds of megabytes of large files, the overhead of small stand-in files is in the noise
 * slightly more work to list all big files

)

== Getting big files from the central store ==

Of course, bfiles isn't much use if all you have is a tree of 40-byte stand-ins. There has to be a way to actually get the big files into your working copy. The low-level interface is the `bfget` command, which downloads the big file revisions required by the current changeset, as specified in .hgbfiles/.
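
As a rough illustration of what a `bfget`-like operation has to do, the Python sketch below walks .hgbfiles/, reads each stand-in's revision ID, and writes the fetched revision into the working copy. The function names, the `fetch` callback (which stands in for the actual download from the central store), and the SHA-1 check on the downloaded data are all assumptions made for this example; the real command may work differently.

```python
import hashlib
import os


def read_standins(root):
    """Map each big file's repo-relative name to the revision ID
    recorded in its stand-in under <root>/.hgbfiles/."""
    standin_root = os.path.join(root, ".hgbfiles")
    standins = {}
    for dirpath, _dirnames, filenames in os.walk(standin_root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            rel = os.path.relpath(path, standin_root).replace(os.sep, "/")
            with open(path) as f:
                standins[rel] = f.read().strip()
    return standins


def bfget(root, fetch):
    """Fetch every revision named by a stand-in into the working copy.

    `fetch(name, revid)` is a hypothetical callback returning the
    revision's bytes, e.g. via HTTP GET against the central store.
    """
    for name, revid in sorted(read_standins(root).items()):
        data = fetch(name, revid)
        # Assumed sanity check: the revision ID is the SHA-1 of the contents.
        if hashlib.sha1(data).hexdigest() != revid:
            raise ValueError("hash mismatch for %s" % name)
        dest = os.path.join(root, *name.split("/"))
        os.makedirs(os.path.dirname(dest), exist_ok=True)
        with open(dest, "wb") as f:
            f.write(data)
```

Passing the download step in as a callback keeps the sketch independent of the store protocol (filesystem path, SSH, or HTTP).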
(When bfiles grows a local cache, `bfget` will of course go through it.)

For a more convenient interface, you can configure bfiles in "integrated update" mode, where `hg update` (or any command that implicitly does an update, such as `hg pull -u` or `hg clone`) automatically gets the big file revisions required for the requested changeset. This makes bfiles transparent for read-only use (and makes Mercurial+bfiles behave similarly to CVS/Subversion). The downside, of course, is that you don't necessarily need all those large file revisions every time you switch to another changeset.

== Putting big files into the central store ==

Occasionally, big files have to be added or updated. Unsurprisingly, this is a multi-step process:

 1) create the new file (or modify an existing one)
 2) tell Mercurial you have done so with the `bfadd` command (or `bfrefresh` if modifying an existing big file)
 3) upload the new file (or new revision) to the central store with `bfput`
 4) `commit` a new changeset (with modifications in .hgbfiles/) describing the new/revised big file
 5) `push` your new changeset so that others will update to it and download the required big files

Naturally, `bfadd`, `bfrefresh`, and `bfput` can operate on many big files at a time. Steps 3 and 4 can be reversed, so you can add/modify/commit big files while disconnected from the network. And `push` can of course be delayed until you have more changesets to push. But bfiles will prevent you from `push`ing changes until you have `bfput` the large files referenced by them.

`bfadd` and `bfrefresh` are very similar. The exact steps taken by `bfadd` for a given file are:

 1) copy the file to `.hg/bfiles/pending/`, computing the SHA-1 hash on the fly
 2) add an entry for the file to .hg/bfiles/dirstate (so future status checks are fast)
 3) create the stand-in under `.hgbfiles/`, containing the 40-byte SHA-1 hash of the revision that was copied
 4) do the equivalent of `hg add` on the new stand-in

After `bfadd`, the file is in state "pending/added".
This is reported by the `bfstatus` command (or `status` in integrated update mode) as

    BPA

(B = big file, P = pending, A = added; pending means that this revision has not yet been uploaded to the central store.)

`bfrefresh` has slightly less work to do:

 1) copy the file to `.hg/bfiles/pending/`, computing the SHA-1 hash on the fly
 2) replace the contents of the stand-in under `.hgbfiles/` with the 40-byte SHA-1 hash of the just-copied revision

After `bfrefresh`, the state of the file is "pending/modified", reported by `bfstatus` as

    BPM

If the next command run is `bfput`, then its actions are:

 1) upload the contents from .hg/bfiles/pending/ to the central store (into the large file's directory, as a file named after the revision's SHA-1 hash)
 2) delete the pending copy under `.hg/bfiles/pending/`

Now the file is in state "stored/added", which is reported by `bfstatus` as

    B-A

At this point, `commit` does nothing special: it creates a new changeset that incorporates the change(s) to `.hgbfiles/` made by `bfadd` and/or `bfrefresh`. Now the file is in state "stored/committed" (aka "clean"), reported by `bfstatus` as

    B-C

(As with `status`, files in state "clean" are only reported if you explicitly ask for them with the `-c` option to `bfstatus`.)

If you had instead run `commit` before `bfput`, then the states of the file would be slightly different. After `commit`, it would be in state "pending/committed", reported as

    BPC

(This is not a "clean" state, so it will be reported by `bfstatus` by default.) Running `bfput` next would perform the same actions (upload, then delete from `.hg/bfiles/pending`), and the state would again be "stored/committed".
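
The state transitions above can be summarized with a small Python model. This is only a sketch for checking the status codes against the text: the class and function names are hypothetical, and the model tracks nothing but the two pieces of state that the three-letter code reports (pending vs. stored, and added/modified/committed).

```python
from dataclasses import dataclass, replace

# Third letter of the status code, as used in the text above.
LETTERS = {"added": "A", "modified": "M", "committed": "C"}


@dataclass(frozen=True)
class BigFileState:
    pending: bool  # True until bfput has uploaded this revision
    change: str    # "added", "modified", or "committed"

    def code(self):
        """Three-letter status: B, then P or -, then A/M/C."""
        return "B" + ("P" if self.pending else "-") + LETTERS[self.change]


def bfadd():
    return BigFileState(pending=True, change="added")


def bfrefresh():
    return BigFileState(pending=True, change="modified")


def bfput(state):
    # Uploading clears "pending" without touching the change state.
    return replace(state, pending=False)


def commit(state):
    # Committing records the stand-in change; upload status is unaffected.
    return replace(state, change="committed")
```

Both orderings described in the text end in the same place: `commit(bfput(bfadd())).code()` and `bfput(commit(bfadd())).code()` each give `B-C`, with `B-A` and `BPC` as the respective intermediate codes.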