The race to save our online lives from a digital dark age
Scott now works as “free-range archivist and software curator” with the Internet Archive, an online library started in 1996 by the internet pioneer Brewster Kahle to save and store information that would otherwise be lost.
Over the past two decades, the Internet Archive has amassed a gigantic library of material scraped from around the web, including that GeoCities content. It doesn’t just save purely digital artifacts, either; it also has a vast collection of digitized books that it has scanned and rescued. Since it began, the Internet Archive has collected more than 145 petabytes of data, including more than 95 million public media files such as movies, images, and texts. It has managed to save almost half a million MTV news pages.
Its Wayback Machine, which lets users rewind to see how certain websites looked at any point in time, has more than 800 billion web pages stored and captures a further 650 million each day. It also records and stores TV channels from around the world and even saves TikToks and YouTube videos. They are all stored across multiple data centers that the Internet Archive owns itself.
It’s a Sisyphean task. As a society, we’re creating so much new stuff that we must always delete more things than we did the year before, says Jack Cushman, director at Harvard’s Library Innovation Lab, where he helps libraries and technologists learn from one another. We “have to figure out what gets saved and what doesn’t,” he says. “And how do we decide?”
Archivists have to make such decisions constantly. Which TikToks should we save for posterity, for example?
We shouldn’t try too hard to imagine what future historians would find interesting about us, says Niels Brügger, an internet researcher at Aarhus University in Denmark. “We cannot imagine what historians in 30 years’ time would like to study about today, because we don’t have a clue,” he says. “So we shouldn’t try to anticipate and sort of constrain the possible questions that future historians would ask.”
Instead, Brügger says, we should just save as much stuff as possible and let them figure it out later. “As a historian, I would definitely go for: Get it all, and then historians will find out what the hell they’re going to do with it,” he says.
At the Internet Archive, it’s the stuff most at risk of being lost that gets prioritized, says Jefferson Bailey, who works there helping develop archiving software for libraries and institutions. “Material that is ephemeral or at risk or has not yet been digitized and therefore is more easily destroyed, because it’s in analog or print format—those do get priority,” he says.