Chapter 2 Data sources
The primary data sources for this project are Yahoo Finance, the Pushshift Reddit API, and Google Trends. Yahoo Finance provides financial news, data, and commentary, including stock quotes, press releases, financial reports, and original content. The pushshift.io Reddit API was designed and created by the /r/datasets mod team to provide enhanced functionality and search capabilities for Reddit comments and submissions. Google Trends is a Google website that analyzes the popularity of top search queries in Google Search across regions and languages.
From these sources we collected three datasets: a stock dataset from Yahoo Finance, a Reddit posts dataset from the Pushshift Reddit API, and a Google search score dataset from Google Trends. Each team member was responsible for downloading one dataset from its source. The datasets are described below.
2.1 Stock dataset
This dataset contains the daily prices and trading volume of the stocks discussed in the r/wallstreetbets subreddit, which we use as indicators of stock performance. We used the R package quantmod to download the data, which is sourced from Yahoo Finance.
Column Name | Type | Description |
---|---|---|
Date | Date | Date of the prices and volume |
Open | numeric | Open price |
High | numeric | Highest price of the day |
Low | numeric | Lowest price of the day |
Close | numeric | Close price adjusted for splits |
Volume | numeric | Number of shares traded |
Adjusted | numeric | Close price adjusted for splits and for dividend and/or capital gain distributions |
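To make the "adjusted for splits" wording in the table concrete, here is a minimal sketch of back-adjusting closing prices for a single split. The function name `split_adjust`, its parameters, and the sample prices are hypothetical and chosen for illustration only. This is not how Yahoo Finance or quantmod computes the columns, and it ignores dividend adjustments.

```python
def split_adjust(closes, split_index, ratio):
    """Back-adjust raw closing prices for one stock split.

    closes: raw closing prices in chronological order (illustrative data).
    split_index: position of the first trading day after the split.
    ratio: new shares per old share, e.g. 4.0 for a 4-for-1 split.
    """
    # Prices before the split are divided by the ratio so that they are
    # comparable with the post-split prices; later prices are unchanged.
    return [
        price / ratio if i < split_index else price
        for i, price in enumerate(closes)
    ]


raw = [400.0, 420.0, 110.0]  # hypothetical prices around a 4-for-1 split
adjusted = split_adjust(raw, split_index=2, ratio=4.0)
print(adjusted)  # [100.0, 105.0, 110.0]
```

Without the adjustment, the drop from 420 to 110 would look like a crash rather than a change in share count.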
2.2 Reddit posts dataset
This dataset contains the posts in the r/wallstreetbets subreddit, from which we extracted the stocks being discussed and the attitude of each post. We wrote a Python script that pulls the data we need from the Pushshift Reddit API and writes it to CSV files. Because of GitHub's limit on file size, we slice the data by day and create one file per day. Compared with the official Reddit API, the Pushshift Reddit API offers fuller search functionality over Reddit data, which lets us download only the data we need.
Column Name | Type | Description |
---|---|---|
postid | character | unique identifier for post |
author | character | post creator |
created_utc | integer | post creation time in UTC |
permalink | character | URL to the post |
score | integer | number of upvotes minus the number of downvotes |
title | character | post title |
selftext | character | post content |
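The slice-by-day step described above could look roughly like the sketch below. It is not the project's actual script: the helper names `group_by_day` and `write_daily_files` and the `YYYY-MM-DD.csv` file naming are assumptions, and so is the reading of `created_utc` as Unix epoch seconds. Fetching from the API is omitted; the functions take posts that are already in memory as dictionaries keyed by the column names in the table.

```python
import csv
from collections import defaultdict
from datetime import datetime, timezone
from pathlib import Path

# Column names taken from the table above.
FIELDS = ["postid", "author", "created_utc", "permalink",
          "score", "title", "selftext"]


def group_by_day(posts):
    """Group post dicts by the UTC calendar day of created_utc.

    Assumes created_utc holds Unix epoch seconds.
    """
    by_day = defaultdict(list)
    for post in posts:
        ts = datetime.fromtimestamp(int(post["created_utc"]), tz=timezone.utc)
        by_day[ts.date().isoformat()].append(post)
    return dict(by_day)


def write_daily_files(posts, out_dir):
    """Write one CSV per UTC day (hypothetical naming: YYYY-MM-DD.csv)."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for day, rows in sorted(group_by_day(posts).items()):
        path = out_dir / f"{day}.csv"
        with path.open("w", newline="", encoding="utf-8") as fh:
            # Keys outside FIELDS are dropped; missing keys become "".
            writer = csv.DictWriter(fh, fieldnames=FIELDS, extrasaction="ignore")
            writer.writeheader()
            writer.writerows(rows)
        paths.append(path)
    return paths
```

Grouping on the UTC date keeps each file aligned with `created_utc`, so a post always lands in the same file regardless of the machine's local time zone.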
Issues with this dataset:
Only the current state of the posts can be downloaded. Because most of the r/wallstreetbets discussions we are interested in took place earlier this year, many of the posts have since been deleted by their authors or removed by the administrators. As a result, we can no longer access the content of those posts or analyze it. See [Section header text] for more details.
Because of the high volume of posts and the large size of the data, we were unable to download all posts for the complete time period. Instead, we downloaded only the posts created in January, February, May, June, and July, the months in which most of the discussions we care about took place.
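Both issues can be expressed as simple filters. The sketch below is illustrative only and rests on several assumptions: the placeholder strings `[deleted]` and `[removed]` are the markers Reddit commonly shows in place of unavailable post text, and may not match exactly what appears in this dataset; the year 2021 is inferred from the Google Trends start date; and the function names are hypothetical.

```python
from datetime import datetime, timezone

KEPT_MONTHS = {1, 2, 5, 6, 7}  # Jan, Feb, May, Jun, Jul
# Assumed placeholder values for unavailable post bodies.
PLACEHOLDERS = {"[deleted]", "[removed]"}


def in_kept_months(created_utc, year=2021):
    """True if the post was created in one of the downloaded months.

    Assumes created_utc is Unix epoch seconds; year=2021 is an assumption.
    """
    ts = datetime.fromtimestamp(int(created_utc), tz=timezone.utc)
    return ts.year == year and ts.month in KEPT_MONTHS


def is_unavailable(post):
    """True if the post body was replaced by a deletion/removal marker."""
    return (post.get("selftext") or "").strip() in PLACEHOLDERS


print(in_kept_months(1611792000))                # True  (2021-01-28)
print(in_kept_months(1615766400))                # False (2021-03-15)
print(is_unavailable({"selftext": "[removed]"})) # True
```

Counting the posts flagged by `is_unavailable` per day gives a quick measure of how much content was lost before the download.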
2.3 Google Trends dataset
Dataset: combined_news.csv and combined_web.csv
This dataset contains Google web search and news search information. Because the r/wallstreetbets subreddit, the heavily shorted stocks, and the disagreement between retail traders and professional traders were relatively new to the general public, we assume that people actively searched Google to follow the event. Google search data can therefore serve as another indicator of public attention. Both CSV files were downloaded from Google Trends. Google assigns each keyword a search score for each week; the scores represent search interest relative to the highest point for the given region and time. We focus on the United States for the period starting 2021-01-01.
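The phrase "relative to the highest point" can be illustrated with a small sketch in which the peak week maps to 100 and every other week is scaled against it. This shows only the relative scaling; Google's actual computation is not reproduced here. The function name `scale_to_peak` and the weekly counts are hypothetical.

```python
def scale_to_peak(values):
    """Rescale non-negative values so that the largest becomes 100."""
    peak = max(values, default=0)
    if peak == 0:
        # No search interest at all: every score is 0.
        return [0 for _ in values]
    return [round(100 * v / peak) for v in values]


weekly = [50, 200, 120, 10]  # hypothetical raw weekly interest
print(scale_to_peak(weekly))  # [25, 100, 60, 5]
```

Because each series is scaled to its own peak, a score of 100 for one keyword does not imply the same search volume as a score of 100 for another keyword that was scored separately.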