Keeping data readable in the long run
(image by frankieleon)
This is a reminder of the obvious, but perhaps one that some of you could use: Be sure to save your files in a format, and on a medium, that you can read in the future.
Let me tell you a little story about an email that I got this week.
I heard from a colleague who was working on a project related to the first paper I published (which was a note about velvet worms that I kept finding while I was collecting ants; much of the field biology of Neotropical velvet worms remains undocumented).
They had a couple questions about the paper, because the discussion directly contradicts the results! This is all a vague whiff of memory, as I did the fieldwork literally twenty years ago. When I took a look at this discrepancy, I detected an error in one of the tables. (It showed a density of 7ish when it really was 2ish.)
I had two thoughts: First, funny that this has never come up before! Second, I was curious whether this was a typesetting error or an error in the file that I sent to the journal. I took a quick look at my hard drive to figure this out.
But I couldn’t.
I found the file that I submitted, and according to the metadata recovered from the jumbled text in the file, I wrote the manuscript in WordPerfect 3.5. I had run the analyses in Statview, and created figures in Cricket Graph III. These were all mightily useful tools that did what I needed at the time, and none of them work now. (This short and surprising 2013 conversation about Cricket Graph tidily sums up this set of challenges, and by the way, I just found an original in-box copy on sale.) I suppose if it really mattered, I could try to get an emulator up and running, and so on, but the best-case scenario here would be a huge hassle, and the worst-case scenario is lost data. Actually, the best way to answer this question is probably to find a printout of the manuscript (which I submitted by post) in a file cabinet in my lab. I haven’t opened that file cabinet in several years.
While the word processing, analysis, and graphic files weren’t openable, it turns out that I saved the associated data in a .txt file. I didn’t give much thought to these files, as this velvet worm paper was a one-off. In my defense, I’d like to point out that for all of my other work, I’ve created archives of the data as .csv files. These are on my hard drive, backed up every day to a separate hard drive, and also in Dropbox. Hard copies of all of these data are also shelved in my office. I don’t think I have any old data that I can’t open because it’s in the wrong format.
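For anyone who wants to see what a plain-text archive can look like in practice, here is a minimal sketch in Python. It is an illustration only, not the workflow described above: the function name `archive_as_text`, the file names, and the column names are all hypothetical. The idea is to write the data as a .csv file and put a short plain-text note next to it explaining what the columns mean, so that neither file depends on any particular program.

```python
import csv
from pathlib import Path


def archive_as_text(rows, columns, out_dir, name, notes):
    """Write rows to <name>.csv plus a plain-text note describing the data.

    All names here are hypothetical; adapt them to your own project.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # The data itself: one header row, then one line per record.
    csv_path = out_dir / f"{name}.csv"
    with csv_path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(columns)
        writer.writerows(rows)

    # A sidecar note, also plain text, explaining columns and units.
    readme_path = out_dir / f"{name}_README.txt"
    readme_path.write_text(notes, encoding="utf-8")

    return csv_path, readme_path
```

A call might look like `archive_as_text([("A", 3), ("B", 0)], ["plot", "count"], "archive", "survey", "plot = plot label; count = individuals found")`, where the values are made up purely for the example. Both output files can be opened in any text editor.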
I started paying close attention to data curation after getting spooked by the prospect of losing material to the Year 2000 problem, otherwise known as the Y2K bug. Just in case some of my programs stopped working, I wanted to make sure I didn’t lose anything. And I’ve stayed in the habit. (Huh. I just noticed that I can’t open the word processing files for my dissertation either. No loss there, as far as I’m concerned.)
Keep in mind that files that we think might have a high longevity may not. For example, will R scripts be useful 20 or 50 years from now? Is whatever lives in Google Docs going to be there for you forever in a usable format?
Moreover, consider that pretty much every medium that we are storing our files on now won’t work in the near future. When I was doing my dissertation, I kept files on 3.5-inch floppies, and for bigger stuff, on Zip disks, if you happen to know what those are. You could get fancy and back things up by burning them to a CD if need be. I still have a ton of those floppies and Zip disks kicking around a box in my lab. Is it too late for those formats? I have a USB Zip drive that I haven’t plugged in for several years; I am guessing it might still work? And I heard a rumor that a guy down the hall in Computer Science has a USB 3.5-inch floppy drive. But as far as I know, all of the files from those disks that I might want are also on my hard drive, so I’m not fussing over it. I suppose I should get 10-year-old movies of my kid off of DV tapes onto a hard drive before it’s too late.
I doubt that, 20 years from now, we’ll be using DVDs, USB drives, SD cards, optical drives, and whatever else we’re using today, at least not in a way that’s easily readable. Are we going to be using PDFs? (I think FastLane will remain unchanged :) )
Keep in mind that this is just me writing vaguely about this in my blog, but this is also a professional matter that falls under the expertise of librarians, who are responsible for the long-term maintenance and curation of digital information. I’m just sayin’, store your files in a text format so that you and others will have better prospects for opening them in the future, and make sure they’re archived in the medium that you’re using at the moment, and re-archive as your medium evolves. Because the long-term prospects for whatever we’re using now, “open” and otherwise, are grim.
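If you do re-archive as media change, it helps to be able to check that nothing got damaged or dropped along the way. Here is one rough way to do that in Python, offered as a sketch under assumptions rather than a recommendation from a digital preservation expert: record a SHA-256 checksum for every file in an archive folder, keep that list with the archive, and re-check it after each copy. The function names and the `MANIFEST.txt` file name are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(archive_dir, manifest_name="MANIFEST.txt"):
    """Record one 'checksum  relative/path' line per file in the archive."""
    archive_dir = Path(archive_dir)
    manifest_path = archive_dir / manifest_name
    lines = []
    for path in sorted(archive_dir.rglob("*")):
        # Skip directories and the manifest itself.
        if path.is_file() and path != manifest_path:
            rel = path.relative_to(archive_dir).as_posix()
            lines.append(f"{sha256_of(path)}  {rel}")
    manifest_path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return manifest_path


def verify_manifest(archive_dir, manifest_name="MANIFEST.txt"):
    """Return a list of problems; an empty list means every file matches."""
    archive_dir = Path(archive_dir)
    problems = []
    manifest_text = (archive_dir / manifest_name).read_text(encoding="utf-8")
    for line in manifest_text.splitlines():
        if not line.strip():
            continue
        expected, rel = line.split("  ", 1)
        target = archive_dir / rel
        if not target.is_file():
            problems.append(f"missing: {rel}")
        elif sha256_of(target) != expected:
            problems.append(f"changed: {rel}")
    return problems
```

The intended use is to run `write_manifest` once on the original archive, copy the whole folder (manifest included) to the new medium, and then run `verify_manifest` on the copy. The manifest is itself a plain text file, so it stays readable without this script.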