What’s in a post? Reddit pulls in around 115 million unique visitors each month, amassing a staggering 5 billion page views per month. For a long time, I’ve wondered what factors draw people to certain Reddit posts while shunning others - does it have to do with the time of day a post is submitted? Do certain users have a monopoly on the most viewed posts? What about text posts vs. links?
These are all questions I’m setting out to shed some light on in this short series. I haven’t decided how long the series will be, but this first post will only focus on the data mining portion as well as some initial inferences we can make about the data from a general perspective.
The Purpose of this Post
The purpose of this post is to be an introduction to analyzing Reddit’s data through the use of the R Programming language. With regards to my own motivation, I’m developing my skills as a data scientist and am open to feedback on the process.
Data Mining Reddit using Python
Source: Reddit Data Mining Script
Essentially, I threw together this Python script to scrape the data I thought would be most useful without making it terribly complicated. The Reddit JSON Wiki was really helpful when I was creating this script. Operation is relatively simple - put the list of attributes you want in the 'want' array and watch it run. There are a few customization options that I found useful (run python parseReddit.py --help if you want to see them). After running the script, here is a sample of the data I get:
I should note that this dataset is far from complete - it was collected every 15 minutes over only a couple of days. Remember that the purpose of this post is just to familiarize the reader with techniques for collecting and analyzing Reddit's data on their own. Throughout this post, I have tried my best not to draw inferences that would be affected by the small sample.
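For readers who want a rough idea of how such a scraper works, here is a minimal sketch in Python. This is my own illustration, not the script linked above: the attribute names follow Reddit's JSON listing format, and the sample payload below is made up.

```python
import json

# Attributes to keep from each post, mirroring the 'want' array idea.
WANT = ["domain", "subreddit", "author", "score", "over_18", "num_comments"]

def extract_posts(listing_json):
    """Pull the wanted attributes out of a Reddit JSON listing.

    A live scraper would periodically fetch a listing endpoint such as
    /hot.json; here we just parse an already-downloaded payload.
    """
    listing = json.loads(listing_json)
    rows = []
    for child in listing["data"]["children"]:
        post = child["data"]
        rows.append({key: post.get(key) for key in WANT})
    return rows

# A made-up payload in the nested shape Reddit's listings use.
sample = json.dumps({
    "data": {"children": [
        {"data": {"domain": "i.imgur.com", "subreddit": "funny",
                  "author": "someone", "score": 1204, "over_18": False,
                  "num_comments": 87, "url": "http://i.imgur.com/x.jpg"}},
    ]}
})

print(extract_posts(sample))
```

Attributes not listed in WANT (like url above) are simply dropped, which keeps the resulting CSV narrow.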
Analysis Using R
After importing the data into R, we can view the summary of the data as follows:
> results <- read.csv("~/Developer/Data Mining/Reddit/results.csv")
> summary(results)
                domain               subreddit                author
 i.imgur.com        :14673   funny        : 6452   EntertheWu-Tang:  235
 imgur.com          : 6903   AskReddit    : 5805   m0rris0n_hotel :  162
 self.AskReddit     : 5805   pics         : 3886   __mck__        :  152
 youtube.com        : 3450   aww          : 3742   mrojek         :  148
 self.WritingPrompts: 1360   todayilearned: 2452   IpMedia        :  134
 self.Showerthoughts: 1195   videos       : 2313   Tanglebrook    :  113
 (Other)            :21849   (Other)      :30585   (Other)        :54291
     score          over_18          downs    created_utc
 Min.   :   0.0   False:54031   Min.   :0   Min.   :1.409e+09
 1st Qu.:  25.0   True : 1204   1st Qu.:0   1st Qu.:1.409e+09
 Median : 124.0                 Median :0   Median :1.409e+09
 Mean   : 694.7                 Mean   :0   Mean   :1.409e+09
 3rd Qu.: 887.0                 3rd Qu.:0   3rd Qu.:1.409e+09
 Max.   :6355.0                 Max.   :0   Max.   :1.410e+09
  num_comments          ups
 Min.   :    0.0   Min.   :   0.0
 1st Qu.:    6.0   1st Qu.:  25.0
 Median :   23.0   Median : 124.0
 Mean   :  170.8   Mean   : 694.7
 3rd Qu.:   95.0   3rd Qu.: 887.0
 Max.   :15197.0   Max.   :6355.0
As you would probably expect, the most frequent top links on Reddit come from Imgur, which accounts for almost half of the front-page links. Another interesting metric is the median number of comments on front-page posts, which comes out to only 23 - rather low, it seems to me. Let's do a little data visualization to get a better idea of what's going on here.
First, let’s take a random sample of 100 from our dataset.
> results_sample = results[sample(nrow(results),100),]
Using a scatterplot matrix, we can view some of the attributes I think are interesting and see if we can spot any relationships:
> pairs(~as.integer(subreddit)+score+num_comments,data=results_sample, main="Scatterplot Matrix")
After viewing this scatterplot, it is clear that we need to clean some of the data. This is valid as long as we don't introduce any biases. If you recall, we are interested only in the most popular posts on Reddit, so I'm going to scrub away any post whose score is too low to matter - here, anything at or below 50.
> results_sample = results_sample[results_sample$score > 50, ]
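The same cleaning step is easy to express outside of R as well. Here is a minimal Python sketch of the idea; the rows are invented purely for illustration.

```python
# Keep only posts whose score clears the threshold, mirroring the
# R filter above. The rows here are made up for illustration.
THRESHOLD = 50

rows = [
    {"author": "a", "score": 1204},
    {"author": "b", "score": 12},    # dropped: at or below threshold
    {"author": "c", "score": 887},
]

popular = [row for row in rows if row["score"] > THRESHOLD]
print(len(popular))  # 2 of the 3 sample rows survive
```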
As a teaser until the next post, here is the correlation matrix for some of the interesting variables in this dataset.
> library(ggplot2)
> library(reshape2)
> corr_data = results_sample[,c(1,2,3,4,5)]
> qplot(x=Var1, y=Var2, data=melt(cor(corr_data, use="p")), fill=value, geom="tile") +
+   scale_fill_gradient2(limits=c(-1, 1))
As you can see, the subreddit, the domain the link was posted under, and whether or not the link was marked NSFW are very strongly correlated (for instance, top links in the 'pics' subreddit are dominated by links from Imgur). Among the rest of the attributes, however, there does not seem to be a very strong correlation.
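It is worth pausing on one detail: attributes like subreddit and domain are categorical, so before they can enter a correlation they have to be coded as integers (which is what the as.integer() call in the scatterplot matrix did). Here is a small Python sketch of that idea with made-up posts; because 'pics' posts always come from i.imgur.com and text posts from a self.* domain, the integer codes track each other perfectly and |r| = 1 (the sign depends only on how the alphabetical codes happen to line up).

```python
from statistics import mean

def pearson(xs, ys):
    """Plain Pearson correlation coefficient between two numeric lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    varx = sum((x - mx) ** 2 for x in xs)
    vary = sum((y - my) ** 2 for y in ys)
    return cov / (varx * vary) ** 0.5

def codes(values):
    """Map each distinct value to an integer, like as.integer(factor(x))."""
    lookup = {v: i for i, v in enumerate(sorted(set(values)))}
    return [lookup[v] for v in values]

# Made-up posts: image posts in 'pics', text posts in 'AskReddit'.
subreddits = ["pics", "pics", "AskReddit", "AskReddit", "pics"]
domains    = ["i.imgur.com", "i.imgur.com", "self.AskReddit",
              "self.AskReddit", "i.imgur.com"]

r = pearson(codes(subreddits), codes(domains))
print(abs(r))  # 1.0 - the two codings move in lockstep
```

With real data the relationship is noisier, of course, but the heatmap above shows the same mechanism at work.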
Currently, I am collecting more data over a longer period of time so that I can perform a deeper analysis in R - I am still very interested in answering the questions that require a complete dataset.