BagIt
BagIt is a hierarchical file packaging format designed to support disk-based storage and network transfer of arbitrary digital content. A "bag" consists of a "payload" (the arbitrary content) and "tags", which are metadata files intended to document the storage and transfer of the bag. A required tag file contains a manifest listing every file in the payload together with its corresponding checksum. The name, BagIt, is inspired by the "enclose and deposit" method,[1] sometimes referred to as "bag it and tag it".
Bags are ideal for digital content normally kept as a collection of files. They are also well suited to exporting, for archival purposes, content normally kept in database structures that receiving parties are unlikely to support. Relying on cross-platform (Windows and Unix) filesystem naming conventions, a bag's payload may include any number of directories and sub-directories (folders and sub-folders). A bag can specify payload content indirectly via a "fetch.txt" file that lists URLs for content that can be fetched over the network to complete the bag; simple parallelization (e.g. running 10 instances of Wget) can exploit this feature to transfer large bags quickly. Benefits of bags include:
- Wide adoption in digital libraries (e.g., the United States' Library of Congress).
- Easy to implement using ubiquitous and ordinary filesystem tools.
- Content that originates as files need only be copied to the payload directory.
- Compared to XML wrapping, content need not be encoded (e.g. in Base64), which saves time and storage space.
- Received content is ready-to-go in a familiar filesystem tree.
- Easy to implement fast network transfer by running ordinary transfer tools in parallel.
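The parallel-transfer idea can be sketched in Python as follows. This is an illustration under stated assumptions, not a reference implementation: it assumes each "fetch.txt" line holds a URL, a length in bytes (or "-"), and a path inside the bag, which should be checked against the specification version in use. The names parse_fetch, download, fetch_all and the transfer parameter are hypothetical.

```python
import shutil
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def parse_fetch(text):
    """Yield (url, relative_path) pairs from the content of a fetch.txt file."""
    for line in text.splitlines():
        if line.strip():
            # Assumed layout: URL, length in bytes (or "-"), then the path.
            url, _length, rel_path = line.split(None, 2)
            yield url, rel_path.strip()


def download(url, target):
    """Default transfer step: stream one URL into a local file."""
    target.parent.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(url) as response, target.open("wb") as out:
        shutil.copyfileobj(response, out)


def fetch_all(bag_dir, workers=10, transfer=download):
    """Run up to `workers` transfers at once; return the paths that were fetched."""
    bag_dir = Path(bag_dir)
    text = (bag_dir / "fetch.txt").read_text(encoding="utf-8")
    entries = list(parse_fetch(text))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(transfer, url, bag_dir / rel) for url, rel in entries]
        for future in futures:
            future.result()  # re-raises any error from a failed transfer
    return [rel for _url, rel in entries]
```

The transfer step is passed in as a parameter so that a different tool can be substituted. A production version would also reject paths that escape the bag directory and would verify the fetched files against the manifest.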
Specification
BagIt is currently defined in an IETF internet draft[2] that defines a simple file naming convention used by the digital curation community for packaging up arbitrary digital content, so that it can be reliably transported via both physical media (hard disk drive, CD-ROM, DVD) and network transfers (FTP, HTTP, rsync, etc.). BagIt is also used for managing the digital preservation of content over time. Discussion about the specification and its future directions takes place on the Digital Curation discussion list.
The BagIt specification is organized around the notion of a “bag”. A bag is a named file system directory that minimally contains:
- a “data” directory that includes the payload, or data files that comprise the digital content being preserved. Files can also be placed in subdirectories, but empty directories are not supported
- at least one manifest file that itemizes the filenames present in the “data” directory, as well as their checksums. The particular checksum algorithm is included as part of the manifest filename. For instance a manifest file with MD5 checksums is named “manifest-md5.txt”
- a “bagit.txt” file that identifies the directory as a bag, the version of the BagIt specification that it adheres to, and the character encoding used for tag files
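The three required parts listed above can be produced with ordinary library calls. The following Python sketch is illustrative only: the function name make_minimal_bag is hypothetical, MD5 is used because it is the algorithm named above, and the version string defaults to 0.97 purely as an example value.

```python
import hashlib
import shutil
from pathlib import Path


def make_minimal_bag(source_dir, bag_dir, version="0.97"):
    """Copy source_dir into bag_dir/data, then write the manifest and bagit.txt."""
    source_dir, bag_dir = Path(source_dir), Path(bag_dir)
    data_dir = bag_dir / "data"
    shutil.copytree(source_dir, data_dir)  # also creates bag_dir; data/ must not exist yet

    # One manifest line per payload file: "<md5 hex digest>  <path relative to the bag>".
    lines = []
    for path in sorted(p for p in data_dir.rglob("*") if p.is_file()):
        digest = hashlib.md5(path.read_bytes()).hexdigest()  # reads the whole file at once
        lines.append(f"{digest}  {path.relative_to(bag_dir).as_posix()}\n")
    (bag_dir / "manifest-md5.txt").write_text("".join(lines), encoding="utf-8")

    # The bag declaration names the specification version and the tag-file encoding.
    (bag_dir / "bagit.txt").write_text(
        f"BagIt-Version: {version}\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )
    return bag_dir
```

Because only files are listed in the manifest, empty directories in the source are not recorded, which matches the restriction noted above.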
On receipt of a bag, software can examine the manifest file to make sure that the payload files are present and that their checksums are correct. This allows accidentally removed or corrupted files to be identified. Below is an example of a minimal bag, “myfirstbag”, that encloses two payload files. The contents of the tag files are included below their filenames.
myfirstbag/
|-- data
|   \-- 27613-h
|       \-- images
|           \-- q172.png
|           \-- q172.txt
|-- manifest-md5.txt
|     49afbd86a1ca9f34b677a3f09655eae9 data/27613-h/images/q172.png
|     408ad21d50cef31da4df6d9ed81b01a7 data/27613-h/images/q172.txt
\-- bagit.txt
      BagIt-Version: 0.97
      Tag-File-Character-Encoding: UTF-8
In this example the payload happens to consist of a Portable Network Graphics image file and an Optical Character Recognition text file. In general, the identification and definition of file formats is out of the scope of the BagIt specification; file attributes are likewise out of scope.
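The receipt check described above, confirming that every file named in the manifest is present and has the recorded checksum, might look like the following in Python. This is a minimal sketch: the function name verify_manifest is hypothetical, and it assumes each manifest line is a checksum followed by whitespace and a path relative to the bag directory.

```python
import hashlib
from pathlib import Path


def verify_manifest(bag_dir, algorithm="md5"):
    """Return (missing, corrupted) lists of the payload paths named in the manifest."""
    bag_dir = Path(bag_dir)
    manifest = bag_dir / f"manifest-{algorithm}.txt"
    missing, corrupted = [], []
    for line in manifest.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        # Split once so that paths containing spaces stay intact.
        expected, rel_path = line.split(None, 1)
        rel_path = rel_path.strip()
        target = bag_dir / rel_path
        if not target.is_file():
            missing.append(rel_path)
            continue
        digest = hashlib.new(algorithm)
        with target.open("rb") as handle:
            # Hash in 1 MiB chunks so large payload files need not fit in memory.
            for chunk in iter(lambda: handle.read(1 << 20), b""):
                digest.update(chunk)
        if digest.hexdigest() != expected.lower():
            corrupted.append(rel_path)
    return missing, corrupted
```

A bag passes this check when both returned lists are empty. The sketch does not look for payload files that are absent from the manifest, which a fuller validator would also report.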
The specification allows for several optional tag files (in addition to the manifest). Their character encoding must be identified in “bagit.txt”, which itself must always be encoded in UTF-8. The specification defines the following optional tag files:
- a “bag-info.txt” file which details metadata for the bag, using colon-separated key/value pairs (similar to HTTP headers)
- a tag manifest file which lists tag files and their associated checksums (e.g. “tagmanifest-md5.txt”)
- a “fetch.txt” file that lists URLs from which payload files can be retrieved, either in addition to or in place of payload files in the “data” directory
The draft also describes how to serialize a bag in an archive file, such as ZIP or TAR.
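Because “bag-info.txt” uses colon-separated key/value pairs, it can be read with a few lines of Python. The sketch below is illustrative: the function name parse_bag_info is hypothetical, the labels in the usage note are example values, and the handling of indented continuation lines is an assumption modeled on header folding that should be checked against the specification.

```python
def parse_bag_info(text):
    """Parse "Label: value" lines into a list of (label, value) pairs."""
    pairs = []
    for line in text.splitlines():
        if not line.strip():
            continue
        if line[0] in " \t" and pairs:
            # Assumed folding rule: an indented line continues the previous value.
            label, value = pairs[-1]
            pairs[-1] = (label, f"{value} {line.strip()}")
            continue
        # Split at the first colon only, so values may themselves contain colons.
        label, _, value = line.partition(":")
        pairs.append((label.strip(), value.strip()))
    return pairs
```

For example, parsing the two lines "Source-Organization: Example Library" and "Bagging-Date: 2010-10-12" yields two pairs in file order. A list is returned rather than a dictionary so that repeated labels and their order are preserved.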
History
The BagIt specification was a natural outgrowth of work done by the Library of Congress and the California Digital Library in transferring digital content created as part of the National Digital Information Infrastructure and Preservation Program. The origins of the idea date back to work done at the University of Tsukuba on the "enclose and deposit" model, for mutually depositing archived resources to enable long-term digital preservation.[3] Using manifests and checksums is fairly common practice, as evidenced by their use in the ZIP and deb file formats, as well as on public FTP sites.
In 2007 the California Digital Library needed to transfer several terabytes of content (largely Web archiving data) to the Library of Congress. The BagIt specification allowed the content to be packaged up in "bags" with package metadata and a manifest that detailed file checksums, which were later verified on receipt of the bags. The specification was written up as an IETF draft by John Kunze in December 2008 and has since seen several revisions.[2] In 2009 the Library of Congress produced a video that describes the specification and the use cases around it.[4][5]
Use
- The Library of Congress is using the BagIt specification in several projects including its Content Transfer Services which allow digital content to be inventoried, and copied to production access and storage environments.
- The Copyright Office uses the format for mandatory deposit of serials published only online.
- Archivematica is an open source digital preservation system which uses BagIt to create OAIS Archival Information Packages (AIP).[6]
- Ghent University Library is using the BagIt specification as an archival format for its digital collections (preserved in the private LOCKSS network SAFE-PLN[7]) and as an interchange format when adding new external collections (such as Google Books) to the local repositories.
- The Dryad Data Repository, a repository of data underlying scientific publications, is using the BagIt specification to share data and related metadata with TreeBASE, a repository of phylogenetic information.
- Towards Interoperable Preservation Repositories (TIPR) is a partnership between the Florida Center for Library Automation, Cornell University and New York University to develop, test and promote a standard interchange format for exchanging information packages among OAIS-based repositories. The proposed RXP format is using the BagIt specification to exchange package bundles via HTTP.[8]
- The Stanford Digital Repository (SDR) uses BagIt as the primary transfer format for content being deposited into the SDR.[9]
- Chronopolis, a large-scale preservation system, uses BagIt as the transfer format for content that is deposited into the system.
- The University of North Texas Libraries uses the BagIt specification as an archival container format in its digital repository and as an interchange format for importing and exporting digital objects from its repository.
- The UCSD Library uses BagIt as the transfer format when sending digital objects to Chronopolis.
- The Rockefeller Archive Center uses the BagIt specification as the transfer format when receiving items from donor institutions, when creating Archival Information Packages in Archivematica, and when depositing digital materials into MetaArchive.
- The ERIS software from the Central Connecticut State University Library uses BagIt to verify archival packages that are deposited on Amazon S3.[10]
- A Drupal module that creates Bags is available.
- BagIt Profiles provide a mechanism for allowing creators and consumers of Bags to agree on optional components of the Bags they are exchanging.
- The University of Kentucky Louie B. Nunn Center for Oral History and AVPreserve use BagIt as the underlying library and specification in an upcoming desktop file packaging application called Sipperfly.
- The DataONE federation of data repositories uses BagIt as a serialization format for transporting data packages from data repositories to end users.[11] These data packages consist of heterogeneous data objects that are collected in BagIt and linked by including an OAI-ORE compatible resource map in a standard location in the bag for describing data relationships.
- Media conservators at The Museum of Modern Art use bagit-java as a tool for establishing chain of custody when receiving digital collections materials.
- Archivists at the Bentley Historical Library employ BagIt to transfer a copy of material (and metadata) to a secure dark archive.
- Islandora objects can be packaged into Bags with Islandora BagIt.
- New York University Libraries use BagIt as a transfer and storage format in NYU's repository infrastructure.
- The Purdue University Libraries Archives and Special Collections and Purdue University Research Repository (PURR) use BagIt to bundle content and metadata for storage and transfer to the MetaArchive Cooperative.
- Research Objects can be serialized as BagIt archives using the Research Object BagIt profile.
- Eclair Preservation is using the BagIt specification as an archival format for its cinema digital collections (preserved in the private Eclair Archive OAIS-compliant system) and as an interchange format when adding new external collections to the local repositories (Eclair laboratories).
- The Academic Preservation Trust uses BagIt for the transfer of digital objects and their long term storage.
Tools
The BagIt specification was designed for ease of use with familiar Unix utilities such as md5deep. However, several BagIt-specific tools have been created that ease bag creation in various programming environments:
- Archive::BagIt: Perl
- BagIt Library: Java
- Bagger GUI: Java
- BagIt gem: Ruby
- bagit: Python
- pybagit: Python
- BagIt GUI: JRuby
- BagItPHP: PHP
- gladstone: JavaScript
References
- [1] "A Collaboration Model between Archival Systems to Enhance the Reliability of Preservation by an Enclose-and-Deposit Method" (PDF). 2005.
- [2] "The BagIt File Packaging Format". Retrieved 12 October 2010.
- [3] Tabata, Koichi. "A Collaboration Model between Archival Systems to Enhance the Reliability of Preservation by an Enclose-and-Deposit Method" (PDF). Retrieved 12 October 2010.
- [4] BagIt: Transferring Digital Content for Preservation. Library of Congress. 2009. Retrieved 12 October 2010.
- [5] "BagIt: Transferring Digital Content for Preservation (Transcript)" (PDF). Library of Congress. 2009. Archived (PDF) from the original on 10 October 2010. Retrieved 12 October 2010.
- [6] "Overview – Archivematica".
- [7] "SAFE PLN Safe Archiving FEderation". Retrieved 16 July 2015.
- [8] Caplan, P.; Kehoe, W.; Pawletko, J. "Towards Interoperable Preservation Repositories: TIPR".
- [9] Cramer, Tom; Kott, Katherine. "Designing and Implementing Second Generation Digital Preservation Services: A Scalable Model for the Stanford Digital Repository". D-Lib Magazine. 16 (9/10). doi:10.1045/september2010-cramer. ISSN 1082-9873.
- [10] Iglesias, Edward; Meesangnil, Wittawat (2010). "Using Amazon S3 in Digital Preservation in a mid sized academic library: A case study of CCSU ERIS digital archive system". Code4Lib Journal (12). ISSN 1940-5758.
- [11] "Data Packaging". DataONE Architecture, Version 1.2. DataONE. Retrieved 14 July 2015.
External links
- BagIt IETF draft: the canonical BagIt specification
- BagIt on GitHub: the latest working copy of the specification, with source files for publishing to IETF.
- Digital Curation Google Group: where most discussion about use of the specification and its continued development takes place.
- BagIt specification from the California Digital Library: CDL has found that it helps to have local documentation about the BagIt specification for development purposes.
- BagIt specification from the Library of Congress: similarly the Library of Congress has made a snapshot of the specification available.