Recently, I decided to mess around with PSAW, a Python wrapper for the Pushshift Reddit API. All I really planned to do was gather some random posts from Reddit and see what statistics I could get from them.
Initial data collection
First, I gathered about 10,000 of the most recent Reddit submissions from all subreddits, and saved the data to a CSV file for further analysis.
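from psaw import PushshiftAPI
import pandas as pd

api = PushshiftAPI()
# Request up to 10,000 submissions, keeping only the fields used later on.
gen = api.search_submissions(limit=10000, filter=['url', 'author', 'title', 'subreddit', 'over_18'])
# Each result exposes its fields as a dict through .d_
df = pd.DataFrame([thing.d_ for thing in gen])
df.to_csv('out.csv')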
Then I did some very minor data analysis with it using pandas. As a quick sample of what the posts were like, I printed the titles of the posts, which gave me some interesting output.
A small surprise
This just might be the spiciest Komi image (Komi Can't Communicate ch364's cover)
Hi there I lost my job and I have no food can someone help me please?
What genre of porn is your favorite?
New residence pathways for New Zealand Green List , 56 occupations has fast track !
what have european beauty standards done to these poor bastards
pro tips- মেয়েরা প্রথমেই জোরে থাপ্ দিয়ে choda পছন্দ করে না .তাদের আগে ঘন রস টা পড়ার পর তারা অনেক horny হয়ে যখনই পা 2ta ফাঁক করে দেয় তখন ইচ্ছে মত থাপ্ দিয়ে choda যায় .ইউজ it caution can damage your dhon if you do it rough
🥵🔥 TIT AF 👀🔥 NEW AND UPDATED STUFF 👀💦 in comments👇👇
Zelina’s new fig
Katherine McNamara
Best internet provider? Any reviews on Starlink for Florida weather?
This is the first thing you see when you wake up in the morning
🍒 cum play with me 🍒 new content daily 🔥 $4 SALE 🔥
...
Lo and behold, there were a surprising number of apparently NSFW posts sprinkled throughout, so I did a quick run-through of the posts to see how many were NSFW using the over_18 tag that each post had.
Of the 9,996 posts that I collected, 4,496 were NSFW and 5,500 were SFW, meaning that about 45% were NSFW and 55% were SFW. To me, this was a surprisingly large number and definitely not the split you’d expect while browsing through the front page of Reddit.
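A minimal sketch of how this count could be done (not necessarily how the numbers above were produced). The helper name `nsfw_split` is made up for illustration, and it accepts any iterable of flags, so passing `df['over_18']` should work:

```python
from collections import Counter

def nsfw_split(flags):
    """Count NSFW vs. SFW posts from an iterable of over_18 flags."""
    counts = Counter(bool(flag) for flag in flags)
    nsfw, sfw = counts[True], counts[False]
    total = nsfw + sfw
    # Guard against an empty input so we never divide by zero.
    nsfw_pct = 100 * nsfw / total if total else 0.0
    return nsfw, sfw, nsfw_pct

# Sanity check using the totals reported above.
nsfw, sfw, pct = nsfw_split([True] * 4496 + [False] * 5500)
print(f"{nsfw} NSFW, {sfw} SFW ({pct:.1f}% NSFW)")  # 4496 NSFW, 5500 SFW (45.0% NSFW)
```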
Are they real?
After a quick browse of post titles and authors, I noticed that some of the titles felt kind of like bot-spammy type posts you would usually ignore, so I wanted to know if these were really bots. As far as I know, there isn’t an easy way to know whether a Reddit account is really a bot or not, but my personal theory is that usually when an account has an auto-generated name and/or is fairly new, it’s likely to be a bot.
If you’re unfamiliar, what I mean by auto-generated name is that when you make a new Reddit account, some names are automatically suggested for you. These names follow a fairly standard pattern that you’ll see in the following screenshot, which appears when you make a new account:
As you can see, the pattern for these auto-generated names is as follows:
[Word](optional - or _)[Word](optional - or _)[3 or 4 digits]
So, I first used a regex to find usernames that matched this pattern, printing the results to make sure that the expression was right:
import regex  # third-party module (pip install regex), not the standard-library re

for row in df.itertuples():
    if regex.findall(r"^[A-Z][a-z]+[_\-]?[A-Z][a-z]+[_\-]?[0-9]{3,4}$", row.author):
        print(row.author)
Output:
Time_Management_7472
SecureBaker1955
Maximum_Swimmer8112
Glittering_Youth_923
SnooPickles5616
Apprehensive-Use-342
Competitive_Desk7851
Spiritual-Sound-8066
HelpBroad7482
Inevitable-Text-1731
KiwiAccomplished5471
...
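Since the same pattern gets reused a few more times in this post, one option is to compile it once and wrap it in a helper. This is just a sketch: it uses the standard-library `re` module instead of `regex` (this particular pattern works in both), and the names `AUTOGEN_PATTERN` and `is_autogenerated` are made up for illustration.

```python
import re

# Same expression as above, compiled once so it can be reused.
AUTOGEN_PATTERN = re.compile(r"^[A-Z][a-z]+[_\-]?[A-Z][a-z]+[_\-]?[0-9]{3,4}$")

def is_autogenerated(author):
    """Return True if the username looks like a Reddit-suggested name."""
    # Non-string values (e.g. missing authors) are treated as non-matches.
    return isinstance(author, str) and AUTOGEN_PATTERN.match(author) is not None

for name in ["SecureBaker1955", "Apprehensive-Use-342", "AutoModerator"]:
    print(name, is_autogenerated(name))
# SecureBaker1955 True
# Apprehensive-Use-342 True
# AutoModerator False
```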
Then I wanted to see if the NSFW posts had more auto-generated usernames than the SFW ones (and therefore possibly more bots):
nsfw_autogenerated = 0
sfw_autogenerated = 0
for row in df.itertuples():
    if regex.findall(r"^[A-Z][a-z]+[_\-]?[A-Z][a-z]+[_\-]?[0-9]{3,4}$", row.author):
        print(f'{row.author}: {row.title}')
        # Only posts with an auto-generated author are counted here.
        if row.over_18:
            nsfw_autogenerated += 1
        else:
            sfw_autogenerated += 1
And it turns out that I was kind of wrong: 19.5% of NSFW posts had auto-generated usernames, while 18.4% of SFW posts did, which is not as large a difference as I’d expected. This is probably because creating a Reddit account using your Google/Apple account also auto-generates a username for you.
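The percentages come from dividing each counter by the number of posts in its group. Here is a minimal sketch of that arithmetic, under a few assumptions: the helper `autogenerated_share` is hypothetical, it uses the standard-library `re` module, and it expects `(author, over_18)` pairs such as those from `zip(df['author'], df['over_18'])`.

```python
import re

AUTOGEN_PATTERN = re.compile(r"^[A-Z][a-z]+[_\-]?[A-Z][a-z]+[_\-]?[0-9]{3,4}$")

def autogenerated_share(posts):
    """Return the percentage of auto-generated usernames per group.

    posts: iterable of (author, over_18) pairs.
    """
    totals = {"nsfw": 0, "sfw": 0}
    hits = {"nsfw": 0, "sfw": 0}
    for author, over_18 in posts:
        group = "nsfw" if over_18 else "sfw"
        totals[group] += 1
        if isinstance(author, str) and AUTOGEN_PATTERN.match(author):
            hits[group] += 1
    # Avoid dividing by zero when a group has no posts at all.
    return {g: (100 * hits[g] / totals[g] if totals[g] else 0.0) for g in totals}

sample = [
    ("SecureBaker1955", True),
    ("spez", True),
    ("HelpBroad7482", False),
    ("someone", False),
    ("other_user", False),
    ("plainname", False),
]
print(autogenerated_share(sample))  # {'nsfw': 50.0, 'sfw': 25.0}
```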
A plethora of emojis
Something else I noticed is that a lot of the NSFW posts in my data used emojis, like the following:
🥵🔥 TIT AF 👀🔥 NEW AND UPDATED STUFF 👀💦 in comments👇👇
which is kind of weird because I’m pretty sure it’s not that common to use emojis in Reddit post titles. So, I did a quick run-through of the posts that had emojis:
nsfw_emoji = 0
sfw_emoji = 0
emoji_autogenerated = 0
for row in df.itertuples():
    if regex.findall(r"\p{Emoji_Presentation}", row.title):
        print(row.title)
        if row.over_18:
            nsfw_emoji += 1
        else:
            sfw_emoji += 1
        if regex.findall(r"^[A-Z][a-z]+[_\-]?[A-Z][a-z]+[_\-]?[0-9]{3,4}$", row.author):
            emoji_autogenerated += 1
I found that of the posts that used emojis, 81.1% were NSFW, while 18.9% were SFW, which kind of makes sense, since emojis tend to be more eye-grabbing. Also, 24.2% of posts using emojis had auto-generated usernames, which is noticeably higher than the percentage of auto-generated usernames overall (18.9%). Perhaps more of the posts using emojis were created by bots?
[Brackets101]
Something that often appears in Reddit titles is brackets [], at least from what I know. These brackets are usually used to identify someone’s gender and age, e.g. [23F] would mean a 23-year-old female.
So, I used a regex to filter the posts that had brackets in the title:
Output:
[G4A] [20] [Discord] Starbrook Chateau - An RP and community server in a unique steampunk setting! Seeking writers of all types, come join us!
[Syphon Filter] #112 Games have come a long way in terms of graphics and controls. I don’t remember this being that bad, but the nostalgia was great.
[1140 The Bet Las Vegas] "I'm not sold on Denver, just because Russell is there. You don't move on from a guy if you're not ready to do it"
[Chime] Deposit $200 and receive $100 instantly!
28 [M4F] Drinks tonight in Metro
You guys asked for more here she is! 😈🔥 [image]
35 [M4F] Do you live to serve, slut?
...
And there were actually more posts that used emojis than used brackets (1,252 posts vs. 914). Is everything I know about Reddit a lie?
When filtering for SFW vs NSFW, again I found that a higher percentage of posts using brackets were NSFW (69.0%).
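A minimal sketch of how such a bracket filter might look. The pattern here is an assumption (the one actually used for the output above may differ): it matches any non-empty text enclosed in square brackets. The names `BRACKET_PATTERN` and `has_brackets` are hypothetical.

```python
import re

# Assumed pattern: "[", then one or more non-bracket characters, then "]".
BRACKET_PATTERN = re.compile(r"\[[^\[\]]+\]")

def has_brackets(title):
    """Return True if the title contains a [bracketed] tag anywhere."""
    return isinstance(title, str) and BRACKET_PATTERN.search(title) is not None

titles = [
    "28 [M4F] Drinks tonight in Metro",
    "[Chime] Deposit $200 and receive $100 instantly!",
    "Katherine McNamara",
]
print([t for t in titles if has_brackets(t)])
```

With a DataFrame like the one above, something along the lines of `df[df['title'].apply(has_brackets)]` should select the matching rows.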
Subreddit Statistics
To see what subreddits people were posting in, I put all of the subreddits into a dictionary, with the subreddit name being the key, and the number of posts as the value, then sorted the dictionary by value.
# Some definitely unoptimized code.
subreddits = []
for row in df.itertuples():
    subreddits.append(row.subreddit)
subreddits.sort()
subreddit_frequencies = {}
for item in subreddits:
    if item in subreddit_frequencies:
        subreddit_frequencies[item] += 1
    else:
        subreddit_frequencies[item] = 1
# Rebuild the dict in ascending order of post count.
subreddit_frequencies = {k: v for k, v in sorted(subreddit_frequencies.items(), key=lambda item: item[1])}
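For comparison, the same tally can be written more compactly with `collections.Counter` from the standard library. This is a sketch, not the code that produced the numbers in this post; `top_subreddits` is a hypothetical helper, and unlike the dictionary above, `most_common` returns the largest counts first.

```python
from collections import Counter

def top_subreddits(names, n=20):
    """Return the n most frequent subreddit names with their post counts."""
    return Counter(names).most_common(n)

sample = ["AskReddit", "teenagers", "AskReddit",
          "relationship_advice", "AskReddit", "teenagers"]
print(top_subreddits(sample, n=2))  # [('AskReddit', 3), ('teenagers', 2)]
```

Any iterable of names works as input, so `top_subreddits(df['subreddit'])` should give the equivalent of the sorted dictionary.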
And the top 20 subreddits by number of posts were the following (listed from fewest to most posts):
# Subreddit: Number of posts
relationship_advice: 23
needysluts: 24
u_eshaxpeashaex: 24
DirtySnapchat: 28
teenagers: 31
jerkbudss: 33
onlyfansgirls101: 33
OnlyfansNewMegaaaa: 42
GaySnapchat: 48
dirtykikpals: 54
dirtyr4r: 54
SluttyOnlyfans: 57
onlyfanshottest: 65
AskReddit: 68
naughtychicks: 71
AdorableOnlyfans: 73
NaughtyOnlyfans: 74
Dhfhfh: 123
tomy69x: 153
Prueba0101: 200
Like, what? Of the subreddits here, I only recognized a few: AskReddit, teenagers, and relationship_advice. The rest appeared to be NSFW subreddits. Two days after I did this analysis, I went back and checked to see what the top-posted subreddit r/Prueba0101 even was, and was greeted with this page:
I guess the world will never know what the apparently super popular subreddit r/Prueba0101 held (though I suspect that they may have broken Rule 2: No content manipulation).
What did I learn?
That Reddit is apparently full of NSFW content and spammy-botty posts, though this is probably because I took the most recent posts (and therefore posts that hadn’t yet had time to be removed). I do hope to do more data analysis with Reddit though, particularly looking at the age of accounts that post.
Here’s a link to the data that I collected: reddit.csv