Going Down to South Park, Part 2: Text Analysis with R

text analysis with r, south park characters, vertabelo academy, learn r, learn data analysis, r data analysis, r statistics, how to start learning r,

Who do you think is the naughtiest character in South Park? You’ll know the answer by the end of this article—and I’m sure you’ll be surprised!

In the previous article of this series, I showed you how to use R to analyze South Park dialog. We mostly focused on the show overall.

This time, I’ll take a closer look at the most famous South Park characters. We’ll see how much they talk and how their sentiments change across the show. We’ll also see what words they use most often—and if those words are naughty. This will allow us to perform a statistical test to find the naughtiest South Park character!

Overview of South Park characters

How many characters are there in South Park? Using the data from 287 episodes, we find that there are quite a lot—4525! The great majority of these only show up one time, though. All of this data is available in my southparkr package if you’re curious.

Let’s first see a plot that will show us how many characters there are in each episode.

That’s about 40 characters on average per episode, but we can even see some episodes with 60 or more.

That’s all good and well, but we’re really only interested in the characters who speak the most; the next plot shows the top 20 of these. Each bar represents the total number of words spoken by a character across all episodes. Characters with red bars will be used in our more in-depth analyses.

Do you see anything interesting here? Eric Cartman speaks more than the other two main characters, Stan and Kyle, combined! He is indeed a blabbermouth. Kenny, on the other hand, doesn’t speak much. But we all know that, right?

We’ll analyze the six characters highlighted in red above: the four main boys plus Stan’s father Randy—and, of course, my very favorite: Butters.

How do our characters differ?

Do you recall what sentiment analysis is? We analyzed the overall sentiments of South Park in the previous article. Since we’re focusing on just our six lovely characters here, we’ll now look at how their sentiments differ.

Here, I created a column chart, which is great for displaying data with negative values. Kyle, Randy, and Stan have quite a similar sentiment evolution—their curves are a bit bumpy sometimes, but it’s nothing too interesting. Butters, on the other hand, started out as a generally positive person but turned a bit negative after the second half of the show.

Cartman was by far the most negative character from the outset. As you can see, though, this it changed in season 20. There are a few episodes with positive sentiment—he was forced to quit social media and found himself a girlfriend!

Kenny’s sentiment trend looks like a rollercoaster! His values go from one extreme to another. That’s because he doesn’t talk much. Take a look at the next plot for proof of this!

Each bar accounts for the number of spoken words in an episode. You can clearly see that Kenny really doesn’t speak much. There are a lot more interesting things going on here, though.

Take a look at Cartman. This plot proves that he really speaks the most; his trend is also very stable and constant. This, of course, can only mean one thing… South Park has always heavily depended on Eric Cartman. He is simply the edgiest character of the show, and the creators know it and want to keep it that way. That’s what people want!

Stan and Kyle, the other main characters, do have a significant downward trend. I actually noticed this even before conducting my analysis. Admittedly, these characters are becoming a bit stale.

The winner here is definitely Randy Marsh. He’s gaining a lot of attention in recent years, and I love it! Although he’s an adult, he often acts like a kid. And that’s simply awesome!

Let’s wrap this section up by looking at the words most commonly used by our six chosen ones:

Wow! There’s a lot of rich info to unpack here. What do all of the characters have in common? The four main boys sure like to use dude a lot, especially Stan. All of them also say guys quite often. Butters says it, too. But do you see what he uses more often? That’s right—he prefers fellas. For some reason, the creators only use this with Butters!

Butters is also very interesting in terms of how he refers to Cartman; he rarely calls him by his last name like all the other characters do. Instead, he calls him by his first name, Eric!

Notice how Kyle says Cartman a lot, and vice versa. Although they’re friends, these two always rip on each other.

Randy, being the good father that he is, talks with his son a lot. You can see that there are words like Stan, Stanley, and son that show up on his chart.

Now brace yourself… Take a look at Kenny’s words. Three out of his ten most used are a form of the F word! He seems to speak in a pretty dirty manner when he does! But is he naughtier than Cartman? We’ll find out soon enough.

Are naughtier episodes more popular?

This is a question I was really curious about! I just had to label certain words as swear words, but this was easy thanks to my side project, sweary—a database of swear words across different languages, all accessible in R.

With this, I was able to determine the percentage of swear words in each episode. I already had popularity ratings for each episode. So, voilà—I created the following simple plot!

Could you answer that question without me telling you? Take a thorough look at the plot, and read on once you have an idea.

Okay, I’ll tell you now. You just have to notice the slope of the blue trend line. It’s quite constant. More concretely, it’s around 2 percent everywhere. It doesn’t really matter if the rating is 7, 8, or 9 out of 10. That means that there is no relationship between the proportion of swear words used in an episode and its popularity.

Awesome! It’s nice to see that the show doesn’t rely on toilet humor to garner attention, as an uninformed observer might claim it does.

There are a few outliers worth mentioning, especially the episode titled It Hits the Fan. Almost 15% of all its spoken words are naughty—and on average, its characters use a swear word every 8 seconds!

Is Eric Cartman the naughtiest character?

At long last, we finally get to my favorite plot! I was quite confident that Eric Cartman must be the naughtiest character. But confidence itself gets you nowhere. You need a statistical test to be certain!

And there’s a perfect one just for this problem: the proportion test. The R function for conducting this test is aptly named prop.test.

In simple words, this test compares two fractions and determines if they differ significantly. Our simple fraction looks like this: swearWords/allWords. We construct this fraction for Eric Cartman and also for every other character we want to compare with him. Let’s say we choose our top 20 characters again.

This test calculates a so-called estimate and confidence interval for each character pair. The following plot is a graphical representation of our statistical results!

If you don’t understand what you’re looking at, don’t worry. I’ll explain it right now!

Each column contains an error bar. These differ in their vertical position, color, and spread. The vertical position is the statistical estimate of a character. If the color is blue, it means the results for this character are statistically significant. In more technical words, the p-value of that estimate is less than 0.05. If the spread is narrow, we are more confident about the estimate.

I’ll try it again in human terms if you didn’t catch any of that. If the spread is wide, the character doesn’t speak much. If the color is red, we can’t say for sure that Eric Cartman is naughtier than the character. And last but not least, if the vertical position of the bar is above zero, the character is naughtier than Cartman.

Using this information, can you confidently tell whether Eric Cartman is the naughtiest character in South Park? Well, yes you can! And if you followed my explanation, you should know that he is not!

The naughtiest character in South Park is officially Kenny McCormick! The vertical position of his error bar is below zero, which suggests that he is the only character naughtier than Cartman. The color of the bar confirms this—it’s blue (the result is statistically significant)! His large spread also confirms that he doesn’t speak much. This just means that when he does talk, he’s as dirty as you can imagine. And most of the time, you can really only imagine it because of his mumbling, right?

Wrap-up

This text analysis with R showed that intuition can be easily disproved using simple statistical tests. It’s often very easy to do this in R compared to other programming languages. Hand on your heart, did you think Kenny was really the naughtiest? I was surprised!

I didn’t show any code in this part of the series to keep things simple, but you can have a look at the source on my GitHub page. I’d be happy to help you get started with it if you’re interested, so don’t be afraid to reach out to me!

You can also check out the Data Visualization 101 course on Vertabelo Academy. You’ll learn how to use the ggplot2 R package that I used to produce all the plots for this article.

Do you have any specific data questions about South Park or its characters? Drop me a line in the comments section below, and I’ll help you answer them using R!

Patrik Drhlik

Patrik is a freelance data scientist from Liberec, Czech Republic. He’s doing his PhD at the Technical University of Liberec, where he also teaches software engineering courses. He is highly passionate about the R programming language and data in general. He never leaves home without his Rubik’s cube and loves hitchhiking, athletics, mountains, kangaroos, and beer.

comments powered by Disqus

GET ACCESS TO EXPERT CONTENT!

Over 85.000 happy students
and counting!