Following an abortive TTRPG session caused by a much-relied-on wiki being down, a friend of mine was shocked and frightened. "What if the niche sites I read are also taken offline?!" It's times like these that remind you just how fragile our dear net is. So he asked me, as someone with a little bit of technical knowledge, for help. Specifically, he wanted LPArchive.org saved for posterity.
I'd used the site myself in the past, so I was interested too. I looked into how technically feasible the project would be and broke it down into the following components:
- Getting a list of all Let's Plays on the website
- Getting a list of all sublinks associated with each Let's Play
- Actually archiving each sublink and saving it in some easily readable format
Thankfully, there are plenty of tools that make something like this rather straightforward. I opted for Python with Beautiful Soup to gather the links, and after exploring a few different avenues, settled on Python again with PDFKit to save each page as a PDF. I then stitched those PDFs together into a single document for each LP. The whole process took about five days, including the 2-3 days it took for the code to run.
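To give a sense of the save-and-stitch step, here's a minimal sketch. The chapter URLs are placeholders, and the stitching uses pypdf, which is my choice of tool here rather than anything the project necessarily settled on:

```python
import pdfkit
from pypdf import PdfWriter

# Placeholder chapter URLs for a single LP; in practice these come
# from the sublink-gathering step described below.
chapter_urls = [
    "https://lparchive.org/Example-LP/Update%2001/",
    "https://lparchive.org/Example-LP/Update%2002/",
]

# Render each chapter page to its own PDF. pdfkit wraps wkhtmltopdf,
# which must be installed separately on the system.
chapter_pdfs = []
for i, url in enumerate(chapter_urls):
    out_path = f"chapter_{i:03d}.pdf"
    pdfkit.from_url(url, out_path)
    chapter_pdfs.append(out_path)

# Stitch the per-chapter PDFs into one document for the whole LP.
merger = PdfWriter()
for path in chapter_pdfs:
    merger.append(path)
merger.write("Example-LP.pdf")
merger.close()
```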
1. Getting the list of LPs
LPArchive made this step really easy. It turns out they have a well-formatted index of every LP and its URL in a file you can find by opening the browser's dev tools and looking at the page's sources. Throw that into a Python program that turns it into a dataframe ready for export to CSV, and we were off and running. Of course, I only noticed this after an hour of trying to scrape another page for all of its links, but I'm glad they ultimately made it easy.
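For illustration, here's roughly what that program looks like. The index file's URL and exact shape here are assumptions (the real file you find in the dev tools may be named and structured differently):

```python
import pandas as pd
import requests

# Hypothetical location of the index file; the real one, found via the
# browser's dev tools, may have a different name and format.
INDEX_URL = "https://lparchive.org/index.json"

resp = requests.get(INDEX_URL, timeout=30)
resp.raise_for_status()
entries = resp.json()  # assumed: a list of {"title": ..., "url": ...} records

# Build a dataframe of every LP and its URL, then export it to CSV so
# later steps can iterate over the list without re-fetching the index.
df = pd.DataFrame(entries)
df["url"] = "https://lparchive.org/" + df["url"].str.lstrip("/")
df.to_csv("lp_index.csv", index=False)
print(f"Saved {len(df)} LPs to lp_index.csv")
```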
2. Getting the list of sublinks
Under construction...