Thoughts on "The Theory and Craft of Digital Preservation"
I just finished reading a book called “The Theory and Craft of Digital Preservation”, by Trevor Owens. I’m going to take a few minutes to talk about what I thought about the book, and how it is influencing my approach to my own digital collections.
This one is for the archive nerds.
I’ve been getting up really early on Saturday mornings, grabbing a cup of coffee, and spending the whole day sitting by myself and reading. I don’t know how much longer I’ll be able to keep this up, but it has really helped me get through some of my non-fiction backlog, which was frankly starting to pile up. I recommend it, if you can pull it off.
Aside over, now on to the book.
Here are some of my key takeaways from this text.
- Digital Preservation is a craft, not a science. It is a set of practices that will evolve over time, and that must be Practiced.
- You can’t say “I have preserved this thing”. You can only say “I am preserving this thing”.
- Preservation means Future Access. You are not preserving a thing if you do not have an access strategy.
- The way you provide access to a collection, or an object in a collection, depends on your goals, but you should strive to provide access quickly and with a minimal amount of upkeep.
- There is no solution to the decay of digital media other than the practice of digital preservation. Lots Of Copies, Kept Safe. Verify your checksums regularly. Keep some copies off site.
- Digital preservation is Risk Management.
Of course, there’s a lot more. It’s a book, this is a list of bullet points. Go read the book. I’m just summarizing some of the points that I want to discuss further.
One question that came up at several points: What aspects of the thing you’re trying to preserve are significant?
That’s such a big question! With born digital objects, the answer varies from the “screen essentialist” answer (i.e. “the finished, rendered file”) to the slightly less obvious “every bit on the hard drive, even if it looks like garbage”. It also includes the even less obvious “a video of the thing as it existed when it was popular” because, for something like the federated social network software Mastodon, or the computer game World of Warcraft, the parts of the object that might be significant can be fairly far removed from the object itself. (What is WoW without servers and players? Do you just want to talk about the art? Are you talking about the community? What is significant to you? Allow that to inform how you undertake preservation.)
The book highlights many examples of the ways that digital objects are more complicated than physical objects, and some ways that digital objects are much simpler than physical objects. I’m not a librarian, and I don’t archive physical objects, so I can’t speak to that. I’m going to focus on the work that I do, and the way this book influenced me.
I don’t work for an archive, either. I’m just a dude that cares about art and stories. I have a collection of disparate digital objects that I’m trying to preserve. I’m going to discuss what these objects are, and what parts of them are significant to me. Then I’m going to explore how I can apply the ideas from “The Theory and Craft of Digital Preservation” to these items.
First, I have a handful of “Born Digital” items. Most of these stem from the ~5 years that I ran a magazine about music, art, and politics in metro Atlanta, and the ~3 years I ran a record label under the same name. There are also some personal items, the artifacts of my years working as a web developer, and the digital objects that represent the constituent parts of the media that I produce.
Born digital artifacts include:
- The hard drive of the computer we used to manage Analog Revolution Records when that was a thing
- The stem files for every album we produced, including artwork
- The stem files, as well as dozens of drafts and other work in progress, for every issue of our magazine
- The websites, with their music, videos, photos, etc. for the magazine, the bands, the record store, etc.
- Every photo my wife and I have taken (both personally and professionally) in the last 15 years
- Digital artwork, web comics, podcasts, etc.
- Videos! Concert videos, interviews, the documentary I was in about record stores, short clips of my family together, etc.
- Websites! I was a web developer. I built a bunch of websites. Some of them were pretty cool, most of them only exist in my collection.
- My normal music library
- My “I know these people” music library
- My “I took this from the Great 78 Project” music library
- Homebrew video games and other independently produced digital objects that I don’t own the rights to, but which are likely to disappear from the web
- TEXT! Just so many text documents.
I’m a writer, apparently, as just the text and word processor files I’ve produced consume nearly 800MB. That seems impossible. Certainly, some of those MB are taken up by copies and copies of copies. Some are consumed by PDF output, or by giant images embedded in documents. In terms of unique text, we’re probably well under 200MB. But even with all that gone, I have an archive of literally millions of words, tens of thousands of pages of text that I’ve produced over the last 20 years.
Oh, right, did I mention? This collection goes back 20 years. It also includes basically every hard drive I’ve ever owned, every computer program I’ve ever used, etc., but that stuff is mostly insignificant (and potentially sensitive or embarrassing) so we can ignore it.
In addition to all that, I have a pretty large collection of physical media which is in various states of digitization. Personally, this includes:
- Family photos. When my mom died, I got 50 years of family photos, and instructions to “do something with them”
- Home Movies (not nearly as many as there are Photos. Video tape was expensive, after all.)
- Artwork. My wife is an artist, my grandfather is an artist. I have a collection of Art.
and … That’s pretty much it for my personal media, but it’s the less personal stuff that’s interesting.
I have a collection of unique, or nearly so, physical media artifacts dating back 100+ years. These include:
- Likely one of a kind copies of albums from a few dozen local bands (including demos and alternate mixes that the bands never even heard, from when I was a studio intern) on quickly decaying CD-Rs
- Private Press LPs and singles from the 60s, 70s, and 80s featuring music that has, as best as I’ve been able to discern, never been digitized.
- Cassettes. So many cassettes. Some of them are even labeled! Mostly, these are rough mixes from a small recording studio that shut down in the mid 90s. Some feature my Grandfather, allegedly. There are also a fair number of concert bootlegs, and a big box of general mixtapes.
- Zines, posters, artwork, and ephemera of the kind that you pick up when you work in a concert venue/record store/coffee shop/magazine publisher
- Photos, magazines, small press books, and other printed goods from the turn of the century that I picked up when I worked for an antique store
and then the biggy
- ~200 (and growing) telecines of TV shows from the 40s and 50s, most of which have never been digitized
- ~25 (and growing) 16mm films from 1920 - 1960 that are entirely unavailable on the home market, or only exist in low quality VHS era transfers.
Add to that ~2TB of public domain video footage that I’ve collected from archive.org and others, organized, and catalogued over the last 6 or 7 years, and you have a pretty sizable collection of early television! (Only including television, and only including what I’ve digitized myself and catalogued so far, I’m sitting at ~2000 episodes.)
I’ve talked about the film preservation stuff before, so I’m not going to get into it here today. I’ll be talking a lot more about it in the near future, as I’m finally able to start doing it again soon. Today is about managing the collection, and getting it to the point that I can say it is being preserved.
My Methods so far
What I have amounts to about 10TB of data. I could normalize and de-dupe it, and then it’s probably more like ~7TB, but that’s a lot of work at the item level, and I’m trying to focus on work at the Collection level right now.
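For a rough sense of how much of a collection is duplicated, without committing to item-level cleanup, a small script can group files by content hash and add up the space the extra copies take. The Python below is only a sketch under assumptions: the function names (`find_duplicates`, `reclaimable_bytes`) are made up for illustration, and it treats two files as duplicates only when their bytes are identical, so re-encoded or renamed-and-edited copies will not be caught. It reports and deletes nothing.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large video files never have to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root: Path) -> dict[str, list[Path]]:
    """Group files under root by content hash; keep only groups with 2+ files."""
    # First pass: group by size, which is cheap and needs no file reads.
    by_size = defaultdict(list)
    for path in root.rglob("*"):
        if path.is_file():
            by_size[path.stat().st_size].append(path)

    # Second pass: hash only files that share a size with another file.
    by_hash = defaultdict(list)
    for same_size in by_size.values():
        if len(same_size) > 1:
            for path in same_size:
                by_hash[sha256_of(path)].append(path)

    return {h: sorted(paths) for h, paths in by_hash.items() if len(paths) > 1}


def reclaimable_bytes(duplicates: dict[str, list[Path]]) -> int:
    """Bytes saved if only one copy from each duplicate group were kept."""
    return sum(
        paths[0].stat().st_size * (len(paths) - 1)
        for paths in duplicates.values()
    )
```

Pointing `find_duplicates` at the root of the online copy and passing the result to `reclaimable_bytes` gives a collection-level estimate; what to do with each duplicate group is still an item-level decision.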
My current strategy, up to this point, has been simple and effective:
- USB > SATA interfaces
- 1TB and 2TB spinning disks (or, in the past, whatever the biggest cheap disk was, 500 GB the first time I did this, I think)
- There’s an online copy for access, and an offline copy for storage.
- The online copy is growing, and I rsync to the offline copy about once a month
- The one of a kind portions of the online copy are mirrored over Nextcloud to a web server, which is my only offsite backup (I manage a VPS with ~4 TB of cloud storage dedicated to this.)
- When I want to share things with people, they go on the PeerTube instance I run at mountaintown.video, or to Archive.org
This was fine when the whole collection was 1TB, and I could be reasonably certain that most of it wasn’t unique. It was fine for me to manage by hand, and I did an okay job of it.
(That’s not true. I lost all of “OfManyTrades” TWICE because I was mismanaging the backup process that dropped backups into Nextcloud so that they would get mirrored elsewhere, and lost some important stuff in the process. Even when you care about it, this stuff is hard.)
Updating my Methods
My strategy above is fine when I stick to it, and verify that it’s working, but it’s missing some stuff and it can be better.
1) Fixity! I don’t really do file hashes to verify that my files aren’t corrupted when I make backups or whatever. That’s bad, and easy to fix, so I should fix it.
2) Filesystems! I do everything in ext3, and I should probably switch to ZFS for this kind of work.
3) I have room in my office for a Raspberry Pi and a couple of external drives. I could easily set that up as a full, live, offsite mirror.
4) Better backups! I mention above that I rsync a copy to cold storage once a month. That’s mostly true, except when I forget.
5) Stop normalizing file formats, and focus on making derivatives for use.
6) Get other people involved.
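The fixity item at the top of that list is the easiest one to sketch in code. The Python below is an illustration under assumptions, not a description of any tooling used on this collection: the function names and the JSON manifest format are invented for the example, and SHA-256 is one reasonable choice of hash among several. The idea is to record a checksum for every file once, then re-run the comparison after each backup (or on a schedule) and look at what went missing or changed.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files never have to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its SHA-256 digest."""
    return {
        path.relative_to(root).as_posix(): sha256_of(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def save_manifest(manifest: dict[str, str], destination: Path) -> None:
    """Write the manifest as JSON. Keep it outside root so it is not hashed too."""
    destination.write_text(json.dumps(manifest, indent=2, sort_keys=True))


def load_manifest(source: Path) -> dict[str, str]:
    """Read a manifest previously written by save_manifest."""
    return json.loads(source.read_text())


def verify(root: Path, manifest: dict[str, str]) -> dict[str, list[str]]:
    """Compare the files under root against a previously saved manifest."""
    current = build_manifest(root)
    return {
        "missing": sorted(set(manifest) - set(current)),
        "changed": sorted(
            p for p in manifest if p in current and current[p] != manifest[p]
        ),
        "new": sorted(set(current) - set(manifest)),
    }
```

Because the manifest stores relative paths, the same file can be checked against the online copy, the offline copy, and an offsite mirror. A non-empty "changed" list for files nobody edited is the signal worth worrying about. Files listed under "new" are expected as the collection grows, and the manifest needs rebuilding to cover them.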
A person cannot do preservation work, because preservation is about Future Access, and a person is a finite thing. If I want my collection of film and TV from the 40s and 50s, and my collection of the music (some of which I helped create) from metro Atlanta circa 2015 to survive longer than I do, I need help. Some of that help exists in The Internet Archive. I could get that help by donating this collection to a library. I’ll probably do both of those things, but I also want to do more.
We’ve been working on establishing a maker space in my new home town. I am working with a team of ~6 people to build a community resource and learning center, to provide access to tools and education. We’re trying to design a center for local culture and art. What better place to run a community archive? Our local library is cash strapped and over worked, like most libraries in late capitalism. I have the expertise, I have the space, I have the money. Building a digital archive for this town seems like a reachable goal, and a valuable project for the maker space.
We’ll see what happens!
Other things I learned
For most of my Born Digital collection, or at least for the media that I produce, I’m probably not capturing enough. I focus on the Screen and the Speakers. I focus on the rendered product. (I’ve been guilty of this for a long time! Original files are Huge and as a byproduct of the way I work, it’s not uncommon for me to only ever save a rendered file. I’m working on it.)
I need to do more to prioritize access. I have a huge collection of early television shows, most of which have been found to be in the public domain. I have done a lot of work to ensure that this collection has good metadata, it’s searchable, it’s browseable, it’s watchable. You can’t do anything with it, though, unless you’re on my LAN. A lot of this content came from elsewhere on the web, where it is less well catalogued. I need to make the work I’ve done documenting it available for others (even if I can’t afford the bandwidth and storage to make all the content available for others.)
Indexing, documenting, providing metadata: After ensuring that the thing continues to exist, and that other people can get to it, this is the most vital work. You can’t access things you can’t find. If you can’t access it, it isn’t being preserved.
“The Theory and Craft of Digital Preservation”? Pretty good. Worth your time, if this is a space you care about. It’s conversational, and approachable. It’s a quick read, and says enough in its 200 pages to merit looking twice.
Frankly, it made me want to write a book of my own (but who has time for that when there are archives to manage and collections to build?)
If you want to know more about the work that I’m doing, or see how my thoughts have evolved on this topic: