What's in a Post, Part 1

What’s in a post? Reddit pulls in around 115 million unique visitors and a staggering 5 billion page views every month. For a long time, I’ve wondered what draws people to certain Reddit posts while others get passed over - does it have to do with the time of day a post is submitted? Do certain users have a monopoly on the most-viewed posts? What about text posts vs. links?

These are all questions I’m setting out to shed some light on in this short series. I haven’t decided how long the series will be, but this first post focuses only on the data mining portion, along with some initial, high-level inferences we can make about the data.

The Purpose of this Post

The purpose of this post is to serve as an introduction to analyzing Reddit’s data with the R programming language. As for my own motivation, I’m developing my skills as a data scientist and am open to feedback on the process.

Data Mining Reddit using Python

Source: Reddit Data Mining Script

Essentially, I threw together this Python script to scrape the post attributes I thought were the most useful without making it terribly complicated. The Reddit JSON Wiki was really helpful when I was creating this script. Operation is relatively simple - put the list of attributes you want in the ‘want’ array and watch it run. There are a few customization options that I found useful (run python parseReddit.py --help if you want to see them). After running the script for a while, you end up with a CSV of posts - the results.csv we’ll load into R below.
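
The linked script is the real source; as a rough illustration, here is a minimal sketch of the same idea, assuming Reddit’s public JSON listings (append .json to a listing URL) and a pared-down ‘want’ list - the script’s command-line options and repeated collection are omitted:

import csv
import requests

# Attributes to pull out of each post; mirrors the script's 'want' array.
want = ["domain", "subreddit", "author", "score", "over_18",
        "downs", "created_utc", "num_comments", "ups"]

# Reddit serves any listing as JSON if you append .json to the URL.
# A descriptive User-Agent helps avoid being rate-limited.
response = requests.get("https://www.reddit.com/hot.json?limit=100",
                        headers={"User-Agent": "whats-in-a-post-demo/0.1"})
posts = response.json()["data"]["children"]

# Append one CSV row per post, in 'want' order.
with open("results.csv", "a") as f:
    writer = csv.writer(f)
    for post in posts:
        writer.writerow([post["data"].get(attr) for attr in want])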

I should note that this dataset is nowhere near complete - it was collected every 15 minutes over just a couple of days. Remember that the purpose of this post is to familiarize the reader with the techniques for collecting and analyzing Reddit’s data on their own. In this post, I have tried my best not to make inferences that would be undermined by the limited data.

Analysis Using R

After importing the data into R, we can view the summary of the data as follows:

> results <- read.csv("~/Developer/Data Mining/Reddit/results.csv")
> summary(results)
                 domain              subreddit                 author          score         over_18          downs    created_utc         num_comments          ups        
 i.imgur.com        :14673   funny        : 6452   EntertheWu-Tang:  235   Min.   :   0.0   False:54031   Min.   :0   Min.   :1.409e+09   Min.   :    0.0   Min.   :   0.0  
 imgur.com          : 6903   AskReddit    : 5805   m0rris0n_hotel :  162   1st Qu.:  25.0   True : 1204   1st Qu.:0   1st Qu.:1.409e+09   1st Qu.:    6.0   1st Qu.:  25.0  
 self.AskReddit     : 5805   pics         : 3886   __mck__        :  152   Median : 124.0                 Median :0   Median :1.409e+09   Median :   23.0   Median : 124.0  
 youtube.com        : 3450   aww          : 3742   mrojek         :  148   Mean   : 694.7                 Mean   :0   Mean   :1.409e+09   Mean   :  170.8   Mean   : 694.7  
 self.WritingPrompts: 1360   todayilearned: 2452   IpMedia        :  134   3rd Qu.: 887.0                 3rd Qu.:0   3rd Qu.:1.409e+09   3rd Qu.:   95.0   3rd Qu.: 887.0  
 self.Showerthoughts: 1195   videos       : 2313   Tanglebrook    :  113   Max.   :6355.0                 Max.   :0   Max.   :1.410e+09   Max.   :15197.0   Max.   :6355.0  
 (Other)            :21849   (Other)      :30585   (Other)        :54291
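
One note for reading this summary: created_utc is a Unix timestamp (seconds since 1970), which is why every quantile prints as roughly 1.409e+09 - that works out to late August 2014. If you want human-readable dates, convert the column:

> as.POSIXct(range(results$created_utc), origin = "1970-01-01", tz = "UTC")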

As you would probably expect, the most frequent top links on Reddit come from Imgur: i.imgur.com and imgur.com together account for roughly 40% of the front-page links in this dataset (21,576 of 55,235 rows). Another interesting metric is the median number of comments on a front-page post, which comes out to only 23 - rather low, it seems to me.
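
If you want to verify the Imgur figure directly, a one-liner over the full dataset does it (this counts any domain containing ‘imgur’, on the assumption that no unrelated domain matches):

> sum(grepl("imgur", results$domain, ignore.case = TRUE)) / nrow(results)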

Let’s do a little data visualization to get a better idea of what’s going on here. First, we take a random sample of 100 rows from our dataset (setting a seed first makes the sample reproducible - the particular seed is arbitrary):

> set.seed(42)
> results_sample <- results[sample(nrow(results), 100), ]

Using a scatterplot matrix, we can view some of the attributes I think are interesting and see if we can spot any relationships:

> pairs(~ as.integer(subreddit) + score + num_comments, data = results_sample, main = "Scatterplot Matrix")

After viewing this scatterplot, it is clear that we need to clean the data a little, which is valid as long as we don’t introduce any bias. If you recall, we are interested in only the most popular posts on Reddit, so I’m going to scrub away any rows that didn’t get a sufficiently high score to matter.

> results_sample <- results_sample[results_sample$score > 50, ]
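
As a quick sanity check, see how much of the sample survives the cut - from the summary earlier, the first quartile of score is 25 and the median is 124, so a score > 50 cutoff should drop somewhere between a quarter and half of the rows:

> nrow(results_sample)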

As a teaser until the next post, here is the correlation matrix for some of the interesting variables in this dataset.

> library(ggplot2)
> library(reshape2)
> # cor() needs numeric input, so convert the factor columns (domain,
> # subreddit, author, over_18) to their integer level codes first
> corr_data <- data.frame(lapply(results_sample[, c(1, 2, 3, 4, 5)], as.integer))
> qplot(x = Var1, y = Var2, data = melt(cor(corr_data, use = "p")), fill = value, geom = "tile") +
+     scale_fill_gradient2(limits = c(-1, 1))

As you can see, the subreddit, the domain the link was posted under, and whether or not the link was marked NSFW are, unsurprisingly, strongly correlated (for instance, the top links in the ‘pics’ subreddit are dominated by links from Imgur). Among the rest of the attributes, however, there does not seem to be a very strong correlation.

Currently, I am collecting more data over a longer period of time to analyze in R - I am still very interested in answering the questions that require a complete dataset.