Chapter 2 Data sources

The primary data sources for this project are Yahoo Finance, Pushshift Reddit API and Google Trends. Yahoo Finance provides financial news, data and commentary including stock quotes, press releases, financial reports, and original content. The pushshift.io Reddit API was designed and created by the /r/datasets mod team to help provide enhanced functionality and search capabilities for searching Reddit comments and submissions. Google Trends is a website by Google that analyzes the popularity of top search queries in Google Search across various regions and languages.

From the above data sources we collected and used three major datasets: Stock dataset from Yahoo Finance, Reddit posts dataset from Reddit API and Google search score dataset from Google Trends. Each member in our team was responsible for downloading one dataset from corresponding data sources. More details of the datasets are as following:

2.1 Stock dataset

This dataset mainly contains the prices and volume of the stocks that are discussed in r/wallstreetbets subreddit, which help to indicate the stock performance. We used the R package quantmod to download the data, which is sourced from Yahoo Finance.

Table 2.1: Columns in dataset
Column Name	Type	Description
Date	Date	Date of the prices and volume
Open	numeric	Open price
High	numeric	Highest price of the day
Low	numeric	Lowest price of the day
Close	numeric	Close price adjusted for splits
Volume	numeric	Number of shares traded
Adjusted	numeric	Adjusted close price adjusted for splits and dividend and/or capital gain distributions.

2.2 Reddit posts dataset

This dataset contains the posts in r/wallstreetbets subreddit, which we used to extract out the information about the stocks discussed and the post attitude. We created a python script to pull the data we need from Pushshift Reddit API and write to the csv files. Due to Github’s limit on file size, we slice the data by day to create one file for each day. The Pushshift Reddit API has the advantage over the official Reddit API that it gives full functionality for searching Reddit data, which is very helpful for us to only download the data we need.

Table 2.2: Columns in dataset
Column Name	Type	Description
postid	character	unique identifier for post
author	character	post creator
created_utc	integer	post created time in utc timezone
permalink	character	URL to the post
score	integer	number of upvotes minus the number of downvotes
title	character	post title
selftext	character	post content

Issues with this dataset:

Only the current state of the posts can be downloaded. Because the majority of the discussions on r/wallsteetbets we are interested in happened earlier this year, many of the posts are either deleted by the author or removed by the administrator. As a result, we can no longer access those posts’ content and perform analysis on top it. See [Section header text] for more details.
Because of the big volume of the posts and the large size of the data, we are unable to download all posts over the complete time period. Instead, we only download the posts created in Jan, Feb, May, Jun and Jul, which are the months most of the discussions that we care about happened in.

2.3 Google Trends dataset

Dataset: combined_news.csv and combined_web.csv

This dataset mainly contains the website search and news search information in Google. Since r/wallstreetbets subreddit, the heavily shorted stocks as well as the disagreement between retailer traders and professional traders were relatively new to the general public, we assume that they would actively search in Google to follow this event. Therefore, using Google search data can also serve as another indicator to show the public attention. Both csv files are downloaded from Google Trends. Google assigns a search score to each keyword for each week. The scores represent search interest relative to the highest point for the given region and time. We are focusing on United State for the time period starting from 2021-01-01.

Table 2.3: Columns in dataset
Column Name	Type	Description
postid	character	unique identifier for post
author	character	post creator
created_utc	integer	post created time in utc timezone
permalink	character	URL to the post
score	integer	number of upvotes minus the number of downvotes
title	character	post title
selftext	character	post content