Chapter 3 Data transformation
3.1 Reddit Data Transformation
These two waves were driven mainly by discussions in the r/wallstreetbets subreddit. We therefore downloaded the Reddit posts created during these two periods, extending the window somewhat before and after each wave: it takes time for the initial posts to gain public attention before they start to influence the stock price, and including posts from shortly after each wave lets us compare the periods during and after the wave to gauge how strongly the subreddit's posts affected the stock price. We downloaded data for all of January, February, May, June and July, which fully covers the two waves plus the extra time at either end. We also parsed the posts on a daily basis, which helps show the change over time.
3.1.1 Parse Stock tickers from Reddit posts
Before starting the analysis, we need to determine the list of WSB stocks. Generally speaking, any stock discussed in the r/wallstreetbets subreddit could be considered a WSB stock. However, the time frame we consider is relatively long (5 months), and WSB attracted huge public attention and many new Redditors, so a large number of people participated in the discussion and the stocks mentioned are quite diversified. In this project we therefore focus only on the 10 stocks mentioned most often in the posts. To find the top 10 most discussed stocks, we created a Python script to parse the Reddit posts. You can find the code here. We discuss what the script does in the following paragraphs.
In the source data, the daily posts CSV files, there are two columns that may contain stock information: `title` and `selftext`. We therefore need to find stock tickers in these two columns. A stock ticker normally starts with `$` followed by 2 to 4 consecutive upper-case letters. However, the `$` sign does not appear in the source data, so we cannot identify tickers exactly; instead, we extract a list of possible stock tickers by pulling out every substring consisting of 2 to 4 consecutive upper-case letters.
We used a regex in Python to extract the list of possible stock tickers from `title` and `selftext` for each daily Reddit CSV and then combined the two lists. A possible ticker may be mentioned by Redditors in both `title` and `selftext`; in that case we count it only once. Therefore, after combining the two lists, we drop duplicates so that a possible ticker is not double-counted across `title` and `selftext`. This guarantees that the count of every possible stock ticker within one row is at most 1.
We then count the possible stock tickers and summarize the counts in a dictionary. Every daily Reddit CSV yields one count dictionary; we combine them into a final count dictionary and store it in the clean data folder, as sketched below.
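The parsing script itself is written in Python (linked above); as an illustration only, the same logic could look like this in R, assuming daily CSVs with `title` and `selftext` columns and hypothetical file paths:

```r
library(dplyr)
library(stringr)

# Count possible tickers (2-4 consecutive upper-case letters) in one daily CSV.
# The $ prefix is absent from the source data, so this over-matches by design.
count_possible_tickers <- function(csv_path) {
  posts <- read.csv(csv_path, stringsAsFactors = FALSE)
  pattern <- "\\b[A-Z]{2,4}\\b"
  per_post <- lapply(seq_len(nrow(posts)), function(i) {
    unique(c(str_extract_all(posts$title[i], pattern)[[1]],
             str_extract_all(posts$selftext[i], pattern)[[1]]))  # at most once per post
  })
  as.data.frame(table(ticker = unlist(per_post)), stringsAsFactors = FALSE)
}

# Combine the daily counts into the final count table (paths are hypothetical).
daily_files <- list.files("data/reddit_daily", pattern = "\\.csv$", full.names = TRUE)
final_counts <- bind_rows(lapply(daily_files, count_possible_tickers)) %>%
  group_by(ticker) %>%
  summarise(count = sum(Freq), .groups = "drop")
write.csv(final_counts, "data/clean/ticker_counts.csv", row.names = FALSE)
```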
Now, in R, we can read the file containing the final possible stock ticker counts. Since we are only interested in the top 10 WSB stocks, we re-order the data frame by count in descending order.
We noticed some invalid tickers in the data frame, which is expected: without the `$` sign in the data source there is no perfect way to extract tickers. We therefore went through the data frame manually and selected the top 10 tickers, which are `GME`, `AMC`, `BB`, `NOK`, `SND`, `NAKD`, `PLTR`, `CLOV`, `RETA` and `MAR`.
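A minimal sketch of this step in R, assuming the hypothetical file `ticker_counts.csv` produced above with columns `ticker` and `count`:

```r
library(dplyr)

# Read the combined counts and order them so the most mentioned tickers come first.
ticker_counts <- read.csv("data/clean/ticker_counts.csv", stringsAsFactors = FALSE)
ticker_counts %>%
  arrange(desc(count)) %>%
  head(20)   # inspect the top rows, then manually drop invalid tickers to keep 10
```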
3.1.2 r/wallstreetbets Subreddit Overview
We'd like to get an overview of the r/wallstreetbets subreddit's metrics. We first loaded the separate daily posts data files into one data frame, deduplicated the data and then aggregated it in the following steps (see the sketch after the table below):

1. Convert the post create date from a Unix epoch integer to a `Date` column `date_utc`.
2. Add two new boolean columns, `contains_amc` and `contains_gme`, to indicate whether the title/content contains the stock ticker, ignoring case.
3. Group the data by `date_utc`, and then use `summarise` to get aggregated metrics.
After cleaning, the data frame contains 5 columns and 210 records. We specifically count the number of posts mentioning `GME` and `AMC` because we know they are the most prominent stocks in this event.
Column Name | Type | Description |
---|---|---|
date_utc | Date | post created date |
post_cnt | integer | number of posts |
distinct_usr_cnt | integer | number of distinct authors that created posts |
gme_cnt | integer | number of posts containing GME |
amc_cnt | integer | number of posts containing AMC |
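A minimal sketch of this aggregation, assuming the combined posts data frame `posts` has (hypothetical) columns `created_utc`, `author`, `title` and `selftext`:

```r
library(dplyr)
library(stringr)

daily_metrics <- posts %>%
  distinct() %>%                                   # deduplicate
  mutate(
    # post create date: Unix epoch integer -> Date
    date_utc = as.Date(as.POSIXct(created_utc, origin = "1970-01-01", tz = "UTC")),
    # does the title or content mention the ticker, ignoring case?
    contains_gme = str_detect(paste(title, selftext), regex("GME", ignore_case = TRUE)),
    contains_amc = str_detect(paste(title, selftext), regex("AMC", ignore_case = TRUE))
  ) %>%
  group_by(date_utc) %>%
  summarise(
    post_cnt         = n(),
    distinct_usr_cnt = n_distinct(author),
    gme_cnt          = sum(contains_gme),
    amc_cnt          = sum(contains_amc),
    .groups = "drop"
  )
```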
3.1.3 Reddit posts content
We'd also like to look into the content of the posts to understand what people were discussing on Reddit during the time period of the event. To obtain this information, we do word-level analysis on the posts' content. We cleaned the text data with the following steps (a sketch follows this list):

1. Filter to posts with `score` higher than 10.
2. Concatenate `title` and `selftext` into one single string `text`.
3. Select the columns we need, i.e. `date_utc` and `text`.
4. Load all post content text as a corpus using the `tm` package.
5. Convert Latin characters to ASCII.
6. Remove double spaces.
7. Remove special characters, numbers and punctuation from the text.
8. Remove common stop words in order to produce meaningful results and avoid the most frequent words such as "I" or "the".
9. Remove stop words specific to this context, for example `https`, `www`, `amp`, `com`, etc.
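A minimal sketch of these cleaning steps with the `tm` package, assuming the same (hypothetical) posts data frame as above plus a `score` column:

```r
library(dplyr)
library(tm)

posts_text <- posts %>%
  filter(score > 10) %>%                      # keep posts with score higher than 10
  mutate(text = paste(title, selftext)) %>%   # concatenate title and selftext
  select(date_utc, text)

corpus <- Corpus(VectorSource(posts_text$text))
corpus <- tm_map(corpus, content_transformer(
  function(x) iconv(x, to = "ASCII//TRANSLIT", sub = "")))   # Latin characters -> ASCII
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, stripWhitespace)                    # collapse double spaces
corpus <- tm_map(corpus, removeWords, stopwords("english"))  # common stop words
corpus <- tm_map(corpus, removeWords, c("https", "www", "amp", "com"))  # context-specific
```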
3.1.3.1 Frequent keywords in Reddit posts
To find frequent keywords mentioned in the Reddit posts, we created a term-document matrix with the `TermDocumentMatrix` function from the `tm` package to obtain counts for each word mentioned. We then built a data frame containing each word mentioned in the posts in the first column and its frequency in the second column; it has 36,261 rows:
Column Name | Type | Description |
---|---|---|
word | character | English word appeared in the Reddit posts |
freq | numeric | Number of appearances |
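A minimal sketch of this step, following the common `tm` recipe and using the `corpus` built above:

```r
library(tm)

tdm  <- TermDocumentMatrix(corpus)           # rows are words, columns are posts
# note: as.matrix() can be memory-hungry for very large corpora
freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
word_freq <- data.frame(word = names(freq), freq = freq, row.names = NULL)
head(word_freq)
```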
3.1.3.2 Sentiment Analysis on Reddit Posts regarding GME
We are also interested in people's sentiments toward the stocks mentioned. We picked `GME` as the representative, since it is the most mentioned stock in r/wallstreetbets and the most reported stock in the news. We perform word-level sentiment analysis on the posts mentioning GME to get a better understanding of people's views on the subject. In addition to the cleaning steps mentioned earlier, we further cleaned the data in the following steps:

1. Filter to posts that mention `GME` in either the title or the content text.
2. Get stop words from the `tidytext` package and remove them from the data frame.
We then perform a simple sentiment analysis focusing only on the word level. The lexicon we used is `nrc` by Saif Mohammad and Peter Turney, which classifies words into categories such as positive, negative, anger, anticipation, disgust, fear, joy, sadness, surprise and trust. We then summarize the data to get the number of words in each sentiment class for each day.
For complete sentiment analysis code, see here.
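A minimal sketch of the word-level sentiment counting with `tidytext`, assuming a (hypothetical) data frame `gme_posts` with columns `date_utc` and `text` (the `nrc` lexicon is fetched through the `textdata` package on first use):

```r
library(dplyr)
library(tidytext)

nrc <- get_sentiments("nrc")                       # word -> sentiment lookup

gme_sentiment <- gme_posts %>%
  unnest_tokens(word, text) %>%                    # one row per word
  anti_join(stop_words, by = "word") %>%           # remove tidytext stop words
  inner_join(nrc, by = "word") %>%                 # keep words with an nrc sentiment
  mutate(textDate = as.character(date_utc)) %>%
  count(textDate, sentiment, name = "wordCount")   # words per sentiment class per day
```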
We process the Reddit posts one month at a time because of the large data size and limited memory. In the end we combined the monthly data into two data frames: 1. containing Jan and Feb, corresponding to the time period of the first wave; 2. containing May, Jun and Jul, corresponding to the time period of the second wave. Each of them has the following schema:
Column Name | Type | Description |
---|---|---|
textDate | character | post created date |
sentiment | character | nrc classified sentiment |
wordCount | numeric | number of words in the sentiment class |
3.2 Stock Data Transformation
For each of the top 10 mentioned tickers, we download its stock price data from Yahoo Finance using the `quantmod` package for the time period starting from 2020-01-01. The data downloaded with `quantmod` are `xts`/`zoo` objects, and each object has the following attributes: `X.Open`, `X.High`, `X.Low`, `X.Close`, `X.Volume`, `X.Adjusted`, where `X` is the stock ticker. We transformed each of those objects into the format we need in the following steps (a sketch follows this list):
1. Convert the `xts`/`zoo` object to a `data.frame` by loading the `index` into a `Date` column and extracting the core data.
2. Select the columns we need, i.e. `Date`, `X.Close` and `X.Volume`.
3. Rename `X.Close` as `Close` and `X.Volume` as `Volume`.
4. Add a `Symbol` column to indicate the ticker.
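A minimal sketch of these steps for a single ticker (the helper name is ours):

```r
library(dplyr)
library(quantmod)

get_price_df <- function(ticker) {
  x <- getSymbols(ticker, src = "yahoo", from = "2020-01-01", auto.assign = FALSE)
  data.frame(Date = index(x), coredata(x)) %>%   # index -> Date column + core data
    select(Date, Close = 5, Volume = 6) %>%      # X.Close and X.Volume (by position), renamed
    mutate(Symbol = ticker)
}

prices_all <- bind_rows(lapply(c("GME", "AMC", "BB"), get_price_df))  # and so on for all 10
```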
After finishing downloading data for all 10 tickers, we bind them into one big data frame containing all the data, which has 4 columns and 4956 rows:
Column Name | Type | Description |
---|---|---|
Symbol | character | Stock ticker |
Date | Date | Date of the prices and volume |
Close | numeric | Close price adjusted for splits |
Volume | numeric | Number of shares traded |
Looking at the raw price is not very useful when comparing different stocks because they trade at different price levels, so it is more useful to look at the change in price instead. We can calculate the aggregated return in percentage using the `periodReturn` function; the aggregation level is specified by the `period` argument, which we set to `daily`. Similar to how we handled price and volume, for each ticker we performed the following steps to get the data in the format we need (a sketch follows this list):
1. Download the stock price data using the `quantmod` package.
2. Apply the `periodReturn` function with `period = "daily"` to calculate the daily return.
3. Convert the `xts`/`zoo` object to a `data.frame` by loading the `index` into a `Date` column together with `daily.returns`.
4. Multiply the calculated daily return by 100 to convert it to a percentage.
5. Rename `daily.returns` as `Daily_return`.
6. Add a `Symbol` column to indicate the ticker.
7. Select the columns we need, i.e. `Date`, `Daily_return` and `Symbol`.
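And a matching sketch for the daily return calculation with `periodReturn`:

```r
library(dplyr)
library(quantmod)

get_return_df <- function(ticker) {
  x   <- getSymbols(ticker, src = "yahoo", from = "2020-01-01", auto.assign = FALSE)
  ret <- periodReturn(x, period = "daily")        # xts with a daily.returns column
  data.frame(Date = index(ret), coredata(ret)) %>%
    mutate(Daily_return = daily.returns * 100,    # convert to percentage
           Symbol = ticker) %>%
    select(Date, Daily_return, Symbol)
}

returns_all <- bind_rows(lapply(c("GME", "AMC", "BB"), get_return_df))  # and so on for all 10
```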
After finishing downloading and processing data for all 10 tickers, we bind them into one big data frame containing all the data, which has 3 columns and 4956 rows:
Column Name | Type | Description |
---|---|---|
Symbol | character | Stock ticker |
Date | Date | Date of the daily return |
Daily_return | numeric | Daily return in percentage |
3.3 Google Trends Data Transformation
The Google Trends source data assigns a number to each keyword for each period (week). The numbers represent search interest relative to the highest point on the chart for the given region and time: a value of 100 is the peak popularity for the term, a value of 50 means the term is half as popular, and a score of 0 means there was not enough data for the term. A score of <1 means there is some search interest but it is extremely low. We replace all <1 values with 0.5 so that the data frame stays consistent, with all scores as numeric data.
We also need to tidy the dataset because some of the column names are not names of variables but values of a variable: the original column names represent values of the `keyword` variable, the values in those columns represent values of `score`, and each row represents 5 observations, not one. To tidy the dataset, we gather the offending columns into a new pair of variables using `pivot_longer`, as sketched below.
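A minimal sketch of this tidying step, assuming a hypothetical wide data frame `trends_raw` with a `week` column and one column per search keyword:

```r
library(dplyr)
library(tidyr)

trends <- trends_raw %>%
  # "<1" means some, but extremely low, search interest; encode it as 0.5
  mutate(across(-week, ~ as.numeric(ifelse(. == "<1", 0.5, .)))) %>%
  # column names become values of `keyword`, cell values become values of `score`
  pivot_longer(cols = -week, names_to = "keyword", values_to = "score")
```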
After cleaning, it contains 3 columns and 260 records.
Column | Type | Description |
---|---|---|
week | Date | Week of the observation |
keyword | character | Search keyword |
score | numeric | Search interest score |