Datasets

The Scholars Studio also gathers and curates datasets for study and analysis. These include a wide variety of publicly available datasets, as well as data scraped from various media sources. Besides these born-digital datasets, we also curate collections as data.

Top 40,000 Subreddits

Useful for studying how people react to and engage with a variety of topics on social media, this 2.5 TB dataset is available upon request in the LCDSS. It consists of all content from the 40,000 most populated subreddits on the social media site Reddit, from the beginning of Reddit in 2005 through the end of 2023. To access all or part of the dataset, please email digitalscholarship@temple.edu with your needs and we can discuss access.
 
The files in the dataset are in a format called Zstandard-compressed NDJSON. Zstandard is a highly efficient compression format, similar in purpose to a zip file. NDJSON is "Newline Delimited JavaScript Object Notation": a text file with a separate JSON object on each line.

There are a number of ways to interact with these files, but they all have drawbacks due to the massive size of many of the files. The efficient compression means a file like "wallstreetbets_submissions.zst" is 5.5 gigabytes uncompressed, far larger than most programs can open at once. It is recommended that you use a script to process the files one line at a time, aggregating or extracting only the data you actually need. You can also extract the files yourself with 7Zip. You can install 7Zip from here and then install this plugin to extract ZStandard files, or you can directly install the modified 7Zip with the plugin already included from that plugin page. Once you've extracted the data, you'll need a text editor capable of opening very large files. A program like glogg can work, since it lets you open files like this without loading the whole thing at once.
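The line-at-a-time approach recommended above might look something like the following Python sketch. It is only an illustration, not the Scholars Studio's own tooling: it assumes the third-party zstandard package is installed (pip install zstandard), the function names are hypothetical, and the large max_window_size is an assumption that archives like these were compressed with long-range settings.

```python
import io
import json
from collections import Counter


def iter_ndjson(text_stream):
    """Yield one parsed JSON object per non-empty line of a text stream."""
    for line in text_stream:
        line = line.strip()
        if line:
            yield json.loads(line)


def open_zst_text(path):
    """Open a .zst file as a text stream, decompressing as it is read.

    Assumes the third-party `zstandard` package is available.
    """
    import zstandard  # imported lazily so the other helpers work without it

    raw = open(path, "rb")
    # Assumption: a 2 GiB window is allowed in case the archive was
    # compressed with long-range settings; smaller archives are unaffected.
    decompressor = zstandard.ZstdDecompressor(max_window_size=2**31)
    reader = decompressor.stream_reader(raw)
    # TextIOWrapper decodes incrementally, so multi-byte characters that
    # straddle read boundaries are handled correctly.
    return io.TextIOWrapper(reader, encoding="utf-8", errors="replace")


def count_by_field(records, field):
    """Tally how often each value of `field` occurs, keeping only the counts."""
    counts = Counter()
    for record in records:
        counts[record.get(field)] += 1
    return counts
```

With these helpers, something like count_by_field(iter_ndjson(open_zst_text("wallstreetbets_submissions.zst")), "author") would tally posts per author while holding only one line in memory at a time. The "author" field name is an assumption about the records; inspect a few lines of your file to confirm which fields are present.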

The data in this dataset prior to April 2023 was collected by Pushshift; data after that was collected by u/raiderbdev. The dataset was originally extracted, split, and re-packaged by Reddit user u/Watchful1.

Sci-Fi Literature

In collaboration with the Special Collections Research Center, the Scholars Studio has been working to digitize novels and magazines from the Paskow Science Fiction Collection. These image and text files are available for students and teachers to use for research and teaching purposes. Because these materials are under copyright, they can only be accessed through a visit to the Scholars Studio, where users will be required to sign an agreement form to only use these materials at the Scholars Studio for non-consumptive research.

We have also made our corpus available as extracted features. For more information on how to mine the corpus as data, visit our new website.  Over three hundred titles have been ingested into the HathiTrust Digital Library, and they are now available to access both in the Digital Library and the HathiTrust Research Center, where non-consumptive, computational research can be conducted with the HTRC Analytics tool. To browse some of the book cover art from the collection and learn about the history of science fiction, visit our Omeka site, Digitizing the New Wave.

If you would like to learn more about ways this corpus can be used for text mining and image analysis, please visit our blog to see examples of work by students involved with the project. If you would like to access this corpus, please contact Alex Wermer-Colan.

Gen Con

The Gen Con Database includes searchable and downloadable data from all 50 programs of the Gen Con gaming convention. Covering 50 years and over a quarter million Gen Con events, the database offers an exciting glimpse into gaming history and trends. The dataset is publicly available.

Canonical Literature

The Twentieth-Century English Literature Corpus contains plain text files of over 100 canonical twentieth-century novels, as well as the above-mentioned set of 450 science fiction novels. Due to copyright, these texts are only available to Temple University affiliates. Contact the Scholars Studio to learn more about future projects aimed at increasing access to the dataset.