View our collection of 1,483 editions of Robinson Crusoe, scraped from the University of Florida Digital Collection, the HathiTrust Digital Library, and the Internet Archive, here.
Although this may seem like a large collection of texts, at least ten times as many editions exist: we managed to track down metadata for over 15,000 editions published across the globe.
Here is an example of what our raw metadata looks like. Notice the ‘publisher’ column, in which the publisher’s name, city, and date are all combined in wildly different styles.
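To give a sense of what untangling a combined ‘publisher’ field can involve, here is a minimal sketch in Python. It is not the cleaning code used in this project; the function name `split_publisher`, the "City : Name, Year" convention it looks for, and the sample strings are all hypothetical and chosen only for illustration.

```python
import re

# A four-digit year from 1700 to 2099. The lookarounds stop the pattern
# from matching inside a longer run of digits. (Assumed range, not taken
# from the project's data.)
YEAR_RE = re.compile(r"(?<!\d)(1[7-9]\d{2}|20\d{2})(?!\d)")


def split_publisher(raw):
    """Best-effort split of a combined publisher string into city, name, year.

    Hypothetical helper: it only understands the "City : Name, Year" style
    and leaves anything else in the 'name' field for manual review.
    """
    text = raw.strip()

    # Pull out the first plausible year, if any.
    year = None
    match = YEAR_RE.search(text)
    if match:
        year = int(match.group(1))
        text = text[:match.start()] + text[match.end():]

    # Drop brackets such as "[1882]" and leftover punctuation at the edges.
    text = re.sub(r"[\[\]()]", "", text).strip(" ,.;:")

    # Treat a colon as the separator between city and publisher name.
    city, name = None, (text or None)
    if ":" in text:
        left, right = text.split(":", 1)
        city = left.strip(" ,.;") or None
        name = right.strip(" ,.;") or None

    return {"city": city, "name": name, "year": year}


# Made-up sample values in three different styles.
for sample in [
    "London : Macmillan, 1868.",
    "New York, Harper & Brothers [1882]",
    "Philadelphia: J.B. Lippincott",
]:
    print(split_publisher(sample))
```

The second sample shows why a single rule is not enough: without a colon, the city stays fused with the publisher name, so each additional style needs its own handling or a manual pass.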
Here is the latest version that we have been able to compile and process within the scope of our project. This dataset powers our map and our Doc2Vec models. However, it is by no means a ground-truth dataset, as our cleaning could only go so far in a 10-week timeframe.