Chapter 2 Data sources

The primary data sources for this project are Yahoo Finance, the Pushshift Reddit API, and Google Trends. Yahoo Finance provides financial news, data, and commentary, including stock quotes, press releases, financial reports, and original content. The pushshift.io Reddit API was designed and created by the /r/datasets mod team to provide enhanced search capabilities for Reddit comments and submissions. Google Trends is a Google website that analyzes the popularity of top search queries in Google Search across regions and languages.

From these sources we collected three datasets: a stock dataset from Yahoo Finance, a Reddit posts dataset from the Pushshift Reddit API, and a Google search score dataset from Google Trends. Each team member was responsible for downloading one dataset from its source. The datasets are described in more detail below.

2.1 Stock dataset

This dataset contains the prices and trading volume of the stocks discussed in the r/wallstreetbets subreddit, which together indicate how each stock performed. We downloaded the data with the R package quantmod, which sources it from Yahoo Finance.

Table 2.1: Columns in the stock dataset
Column Name   Type      Description
Date          Date      Trading date of the prices and volume
Open          numeric   Opening price
High          numeric   Highest price of the day
Low           numeric   Lowest price of the day
Close         numeric   Closing price, adjusted for splits
Volume        numeric   Number of shares traded
Adjusted      numeric   Closing price, adjusted for splits and for dividend and/or capital gain distributions
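
As an illustration of how the Adjusted column can indicate stock performance, the following minimal sketch computes simple day-over-day returns. It is written in Python with only the standard library and is not the quantmod workflow used for the download. The CSV layout mirrors Table 2.1, but the sample rows, the dates, and the function name daily_returns are hypothetical.

```python
import csv
import io

# Hypothetical sample in the Table 2.1 layout; all values are made up.
SAMPLE_CSV = """Date,Open,High,Low,Close,Volume,Adjusted
2021-01-04,10.0,11.0,9.5,10.0,1000,10.0
2021-01-05,10.0,12.5,10.0,12.0,1500,12.0
2021-01-06,12.0,12.0,8.5,9.0,2000,9.0
"""


def daily_returns(csv_text):
    """Return (date, simple return) pairs computed from the Adjusted column."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    returns = []
    # Pair each trading day with the one before it.
    for prev, curr in zip(rows, rows[1:]):
        prev_price = float(prev["Adjusted"])
        curr_price = float(curr["Adjusted"])
        returns.append((curr["Date"], curr_price / prev_price - 1.0))
    return returns


for day, ret in daily_returns(SAMPLE_CSV):
    print(day, f"{ret:+.2%}")
```

Using Adjusted rather than Close keeps dividend and capital gain distributions from showing up as artificial price drops in the returns.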

2.2 Reddit posts dataset

This dataset contains the posts in the r/wallstreetbets subreddit, from which we extracted the stocks being discussed and the attitude expressed in each post. We wrote a Python script that pulls the required data from the Pushshift Reddit API and writes it to CSV files. Because of GitHub's limit on file size, we sliced the data by day, creating one file per day. Compared with the official Reddit API, the Pushshift Reddit API offers fuller search functionality, which let us download only the data we needed.
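
A minimal sketch of how a one-day search window might be expressed is shown here. It is not the script used in this project. The endpoint URL and the parameter names (subreddit, after, before, size) are assumptions based on the historically documented public Pushshift submission search endpoint, which may have changed since. The function name day_query_url and the example date are hypothetical. The sketch only builds the URL and makes no network request.

```python
from datetime import date, datetime, timedelta, timezone
from urllib.parse import urlencode

# Assumed endpoint: the historically documented Pushshift submission search URL.
BASE_URL = "https://api.pushshift.io/reddit/search/submission/"


def day_query_url(day, subreddit="wallstreetbets", size=100):
    """Build a search URL whose time window spans one UTC calendar day."""
    start = datetime(day.year, day.month, day.day, tzinfo=timezone.utc)
    end = start + timedelta(days=1)
    params = {
        "subreddit": subreddit,
        "after": int(start.timestamp()),   # lower bound, epoch seconds (UTC)
        "before": int(end.timestamp()),    # upper bound, epoch seconds (UTC)
        "size": size,                      # assumed page size; the real limit may differ
    }
    return BASE_URL + "?" + urlencode(params)


# Arbitrary example date, chosen only for illustration.
print(day_query_url(date(2021, 1, 4)))
```

Restricting each request to a single subreddit and a single day is what keeps the download limited to the data of interest.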

Table 2.2: Columns in the Reddit posts dataset
Column Name   Type        Description
postid        character   Unique identifier of the post
author        character   Post creator
created_utc   integer     Post creation time as a Unix timestamp (seconds, UTC)
permalink     character   URL of the post
score         integer     Number of upvotes minus number of downvotes
title         character   Post title
selftext      character   Post content
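
The by-day slicing described above can be sketched as follows. This is an illustration rather than the project's actual script: it assumes posts arrive as dictionaries keyed by the Table 2.2 column names, and the function name write_daily_csvs, the YYYY-MM-DD.csv file naming, and the sample posts are all hypothetical.

```python
import csv
import tempfile
from collections import defaultdict
from datetime import datetime, timezone
from pathlib import Path

FIELDS = ["postid", "author", "created_utc", "permalink", "score", "title", "selftext"]


def write_daily_csvs(posts, out_dir):
    """Group posts by UTC calendar day and write one CSV file per day."""
    by_day = defaultdict(list)
    for post in posts:
        created = datetime.fromtimestamp(int(post["created_utc"]), tz=timezone.utc)
        by_day[created.date().isoformat()].append(post)

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for day, day_posts in sorted(by_day.items()):
        path = out_dir / f"{day}.csv"  # hypothetical naming scheme
        with path.open("w", newline="", encoding="utf-8") as f:
            # Missing columns are written as empty strings; extra keys are dropped.
            writer = csv.DictWriter(f, fieldnames=FIELDS, extrasaction="ignore")
            writer.writeheader()
            writer.writerows(day_posts)
        paths.append(path)
    return paths


# Made-up sample posts spanning two UTC days.
sample_posts = [
    {"postid": "a1", "author": "u1", "created_utc": 1609718400, "title": "first"},
    {"postid": "a2", "author": "u2", "created_utc": 1609722000, "title": "second"},
    {"postid": "a3", "author": "u3", "created_utc": 1609804800, "title": "third"},
]
with tempfile.TemporaryDirectory() as tmp:
    for p in write_daily_csvs(sample_posts, tmp):
        print(p.name)
```

Grouping on the UTC date derived from created_utc keeps the file boundaries independent of the time zone of the machine that runs the script.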

Issues with this dataset:

  1. Only the current state of each post can be downloaded. Because most of the r/wallstreetbets discussions we are interested in happened earlier this year, many of those posts have since been deleted by their authors or removed by an administrator. As a result, we can no longer access their content or analyze it. See [Section header text] for more details.

  2. Because of the large volume of posts and the size of the data, we were unable to download all posts for the complete time period. Instead, we downloaded only the posts created in January, February, May, June, and July, the months in which most of the discussions we care about took place.
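
Both issues can be made concrete with a short sketch, given here as an illustration under stated assumptions rather than as the project's code. It assumes that deleted or removed posts carry the placeholder strings "[deleted]" or "[removed]" in selftext, and it matches on month only because the text does not state the year. The helper names in_kept_months and content_available are hypothetical.

```python
from datetime import datetime, timezone

KEPT_MONTHS = {1, 2, 5, 6, 7}             # Jan, Feb, May, Jun, Jul
UNAVAILABLE = {"[deleted]", "[removed]"}  # assumed placeholder strings


def in_kept_months(post):
    """True if the post was created (in UTC) in one of the downloaded months."""
    created = datetime.fromtimestamp(int(post["created_utc"]), tz=timezone.utc)
    return created.month in KEPT_MONTHS


def content_available(post):
    """True unless selftext is one of the assumed deletion placeholders.

    An empty selftext (for example a link-only post) counts as available.
    """
    text = (post.get("selftext") or "").strip()
    return text not in UNAVAILABLE


def usable_posts(posts):
    """Keep posts from the selected months whose content can still be read."""
    return [p for p in posts if in_kept_months(p) and content_available(p)]
```

Counting how many posts fail content_available gives a rough measure of how much of the discussion is no longer recoverable.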