This is another magnificent plug-in.
Reddit is a popular website for opinion sharing and news aggregation. The site consists of thousands of user-made forums, called subreddits, which cover a broad range of subjects, including politics, sports, technology, personal hobbies, and self-improvement. Given that most Reddit users contribute to multiple subreddits, one might think of Reddit as being organized into many overlapping communities. Moreover, one might understand the connections among these communities as making up a kind of social structure.
Uncovering a population’s social structure is useful because it tells us something about that population’s identity. In the case of Reddit, this identity could be uncovered by figuring out which subreddits are most central to Reddit’s network of subreddits. We could also study this network at multiple points in time to learn how this identity has evolved and maybe even predict what it’s going to look like in the future.
My goal in this post is to map the social structure of Reddit by measuring the proximity of Reddit communities (subreddits) to each other. I’m operationalizing community proximity as the number of posts to different communities that come from the same user. For example, if a user posts something to subreddit A and posts something else to subreddit B, subreddits A and B are linked by this user. Subreddits connected in this way by many users are closer together than subreddits connected by fewer users. The idea that group networks can be uncovered by studying shared associations among the people that make up those groups goes way back in the field of sociology (Breiger 1974). Hopefully this post will demonstrate the utility of this concept for making sense of data from social media platforms like Reddit.1
The data for this post come from an online repository of subreddit submissions and comments that is generously hosted by data scientist Jason Baumgartner. If you plan to download a lot of data from this repository, I implore you to donate a bit of money to keep Baumgartner’s database up and running (pushshift.io/donations/).
Here’s the link to the Reddit submissions data – files.pushshift.io/reddit/submissions/. Each of these files has all Reddit submissions for a given month between June 2005 and May 2019. Files are JSON objects stored in various compression formats that range between .017Mb and 5.77Gb in size. Let’s download something in the middle of this range – a 710Mb file for all Reddit submissions from May 2013. The file is called RS_2013-05.bz2. You can double-click this file to unzip it, or you can use the following command in the Terminal: bzip2 -d RS_2013-05.bz2. The file will take a couple of minutes to unzip. Make sure you have enough room to store the unzipped file on your computer – it’s 4.51Gb. Once I have unzipped this file, I load the relevant packages, read the first line of data from the unzipped file, and look at the variable names.
read_lines(“RS_2013-05”, n_max=1) %>% fromJSON() %>% names
##  “edited” “title”
##  “thumbnail” “retrieved_on”
##  “mod_reports” “selftext_html”
##  “link_flair_css_class” “downs”
##  “over_18” “secure_media”
##  “url” “author_flair_css_class”
##  “media” “subreddit”
##  “author” “user_reports”
##  “domain” “created_utc”
##  “stickied” “secure_media_embed”
##  “media_embed” “ups”
##  “distinguished” “selftext”
##  “num_comments” “banned_by”
##  “score” “report_reasons”
##  “id” “gilded”
##  “is_self” “subreddit_id”
##  “link_flair_text” “permalink”
##  “author_flair_text”
For this project, I’m only interested in three of these variables: the user name associated with each submission (author), the subreddit to which a submission has been posted (subreddit), and the time of submission (created_utc). If we could figure out a way to extract these three pieces of information from each line of JSON we could greatly reduce the size of our data, which would allow us to store multiple months worth of information on our local machine. Jq is a command-line JSON processor that makes this possible.
To install jq on a Mac, you need to make sure you have Homebrew (brew.sh/), a package manager that works in the Terminal. Once you have Homebrew, in the Terminal type brew install jq. I’m going to use jq to extract the variables I want from RS_2015-03 and save the result as a .csv file. To select variables with jq, list the JSON field names that you want like this: [.author, .created_utc, .subreddit]. I return these as raw output (-r) and render this as a csv file (@csv). Here’s the command that does all this:
jq -r ‘[.author, .created_utc, .subreddit] | @csv’ RS_2013-05 >parsed_json_to_csv_2013_05
Make sure the Terminal directory is set to wherever RS_2013-05 is located before running this command. The file that results from this command will be saved as “parsed_json_to_csv_2013_05”. This command parses millions of lines of JSON (every Reddit submission from 05-2013), so this process can take a few minutes. In case you’re new to working in the Terminal, if there’s a blank line at the bottom of the Terminal window, that means the process is still running. When the directory name followed by a dollar sign reappears, the process is complete. This file, parsed_json_to_csv_2013_05, is about 118Mb, much smaller than 4.5Gb.
Jq is a powerful tool for automating the process of downloading and manipulating data right from your harddrive. I’ve written the a bash script that lets you download multiple files from the Reddit repository, unzip them, extract the relevant fields from the resulting JSON, and delete the unparsed files (Reddit_Download_Script.bash). You can modify this script to pull different fields from the JSON. For instance, if you want to keep the content of Reddit submissions, add .selftext to the fields that are included in the brackets.
Now that I have a reasonably sized .csv file with the fields I want, I am ready to bring the data into R and analyze them as a network.
Each row of the data currently represents a unique submission to Reddit from a user. I want to turn this into a dataframe where each row represents a link between subreddits through a user. One problem that arises from this kind of data manipulation is that there are more rows in the network form of this data than there are in the current form of the data. To see this, consider a user who has submitted to 10 different subreddits. These submissions would take up ten rows of our dataframe in its current form. However, this data would be represented by 10 choose 2, or 45, rows of data in its network form. This is every combination of 2 subreddits among those to which the user has posted. This number gets exponentially larger as the number of submissions from the same user increases. For this reason, the only way to convert the data into a network form without causing R to crash is to convert the data into a Spark dataframe. Spark is a distributed computing platform that partitions large datasets into smaller chunks and operates on these chunks in parallel. If your computer has a multicore processor, Spark allows you to work with big-ish data on your local machine. I will be using a lot of functions from the sparklyr package, which supplies dplyr backend to Spark. If you’re new to Spark and sparklyr, check out RStudio’s guide for getting started with Spark in R (spark.rstudio.com/).
Once I have Spark configured, I import the data into R as a Spark dataframe.
To begin, I make a few changes to the data – renaming columns, converting the time variable from utc time to the day of the year, and removing submissions from deleted accounts. I also remove submissions from users who have posted only once – these would contribute nothing to the network data – and submissions from users who have posted 60 or more times – these users are likely bots.
rename(author=V1, created_utc=V2, subreddit=V3) %>%
mutate(dateRestored=timestamp(created_utc + 18000)) %>%
filter(author !=”[deleted]”) %>% group_by(author) %>% mutate(count=count()) %>%
filter(count 1) %>%
Next, I create a key that gives a numeric id to each subreddit. I add these ids to the data, and select the variables “author”, “day”, “count”, “subreddit”, and “id” from the data. Let’s have a look at the first few rows of the data.
subreddit_key % distinct(subreddit) %>% sdf_with_sequential_id()
select(author, day, count, subreddit, id)
## # Source: spark>[?? x 5]
## author day count subreddit id
## 1 Bouda 141 4 100thworldproblems 2342
## 2 timeXalchemist 147 4 100thworldproblems 2342
## 3 babydall1267 144 18 123recipes 2477
## 4 babydall1267 144 18 123recipes 2477
## 5 babydall1267 144 18 123recipes 2477
## 6 babydall1267 144 18 123recipes 2477
We have 5 variables. The count variable shows the number of times a user has posted to Reddit in May 2013, the id variable gives the subreddit’s numeric id, the day variable tells us what day of the year a submission has been posted, and the author and subreddit variables give user and subreddit names. We are now ready to convert this data to network format. The first thing I do is take an “inner_join” of the data with itself, merging by the “author” variable. For each user, the number of rows this ret
I be nuts about extensions, because they are the glamorous.
Copyright for syndicated content belongs to the Linked Source