Going Down to South Park, Part 1: Text Analysis with R

First things first

I had to find a resource with all the text for South Park dialog in a reasonable format. It took just a bit of Googling to find a gold mine: the South Park archives, a wiki page with community-maintained scripts for all episodes. Awesome!

The archive has a list of seasons and their episodes. Each episode page has a nice table with two columns—the first denoting the character’s name, and the second containing the actual line the character said. This is a perfect start.

There was one more thing I wanted to know about each episode: its popularity! I’m sure you’re familiar with IMDB (the Internet Movie Database); it contains the ratings of all movies and TV shows known to man.

But how do I put all this data together? Well, I wrote an R package called southparkr that anyone can use to analyze this data. That package downloads all the information described above and makes it conveniently available, allowing you to simply focus on analyzing the data.

Data acquired. Engage!

The second step was to determine what exactly I wanted to analyze. And for this article, I decided on doing two things:

Conducting a sentiment analysis of South Park dialog.
Determining episode popularity based on IMDB ratings.

We’ll get to these in a minute. We should first have a look at some summary statistics for the data we’ve acquired. The table below has some basic South Park stats:

Number of seasons:	21
Number of episodes:	287
Number of words:	907 797
Number of words without stop words (a, the, this, …):	310 759
% of words used for analysis:	34.23
Average IMDB rating:	8.14
Best episode (9.6):	Scott Tenorman Must Die S05E04
Worst episode (6.3):	Funnybot S15E02

You can see that the show has been running for a solid 21 seasons. All the characters combined have said nearly 1 million words! Of course, that’s if we count all words. If we exclude stop words (prepositions, articles, etc.), we end up with just over 310,000 words, about a third of the total.
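To make the stop-word step concrete, here is a minimal sketch in base R. The example line and the tiny stop-word list are made up for illustration (a real analysis would use a full list, such as the `stop_words` data set in the tidytext package); only the two word totals come from the table above.

```r
# Hypothetical example line and a tiny hand-picked stop-word list.
dialog_line <- "Respect my authority, this is the best day of my life"
toy_stop_words <- c("a", "the", "this", "is", "of", "my")

# Lowercase, strip punctuation, and split on whitespace.
words <- strsplit(gsub("[[:punct:]]", "", tolower(dialog_line)), "\\s+")[[1]]

# Keep only the words that are not stop words.
kept <- words[!words %in% toy_stop_words]
print(kept)

# Share of words kept for analysis, using the totals from the table.
total_words <- 907797
words_without_stop <- 310759
share_kept <- round(100 * words_without_stop / total_words, 2)
print(share_kept)
```

The last value reproduces the "% of words used for analysis" row of the table.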

The episodes have sustained an average rating of roughly 8.1, which is great! (I always consider anything above an 8 worth watching.) You can also see the best and worst episodes in that table above, in case you’re interested in checking those out.

Let’s get sentimental… and dirty!

We’ll tackle the first analysis now. Sentiment analysis involves analyzing and scoring text based on context, patterns, or other characteristics within the text. The scores can be positive or negative and can be expressed as numbers or as labels.

We’ll be using the AFINN dictionary, which scores words from -5 to 5 (where -5 is a very negative word, 0 is neutral, and +5 is very positive). For example, we’d rank bastard (and more vulgar words) as -5 and thrilled as +5.
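Here is a rough sketch of how dictionary-based scoring works, in base R. It uses a toy lexicon holding only the two AFINN values quoted above, and `score_line()` is a hypothetical helper written for this illustration, not a function from southparkr or any other package.

```r
# Toy lexicon: only the two AFINN scores mentioned in the text.
toy_afinn <- c(bastard = -5, thrilled = 5)

# Hypothetical helper: mean score of the lexicon words found in a line.
# Returns NA when the line contains no scored words.
score_line <- function(text, lexicon) {
  words <- strsplit(gsub("[[:punct:]]", "", tolower(text)), "\\s+")[[1]]
  scores <- lexicon[words]            # NA for words missing from the lexicon
  scores <- scores[!is.na(scores)]
  if (length(scores) == 0) return(NA_real_)
  mean(scores)
}

print(score_line("You bastard!", toy_afinn))
print(score_line("I am thrilled, you bastard", toy_afinn))
print(score_line("Hello there", toy_afinn))
```

Words that are not in the dictionary simply don’t contribute, which is why the size and coverage of the dictionary matter so much for the final score.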

All of this has been prepared for you behind the curtains. You’ll now see a few lines of code in R that produce a sentiment score for all episodes:

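library(ggplot2)
library(plotly)

# One bar per episode, plus a smoothed trend line
gg <- ggplot(by_episode, aes(x = episode_number, y = mean_sentiment_score, group = 1, text = text_sent)) +
  geom_col(color = "#592a88") +
  geom_smooth()

ggplotly(gg, tooltip = "text")  # tooltip shows the "text" aesthetic on hover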

Our code created an interactive plot! Each bar represents an episode. You can hover over the bars to see some information: the episode name, episode number, and sentiment score.
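The `by_episode` data frame was prepared behind the scenes, so its exact construction isn’t shown here. As a minimal sketch of what that preparation might look like, the snippet below averages word-level scores per episode in base R. The `word_scores` data and the name `by_episode_toy` are made up; only the column names `episode_number`, `mean_sentiment_score`, and `text_sent` are taken from the plotting code.

```r
# Hypothetical word-level scores: one row per scored word.
word_scores <- data.frame(
  episode_number = c(1L, 1L, 1L, 2L, 2L),
  score          = c(-5, -3, 2, 4, -2)
)

# Average the word scores within each episode.
by_episode_toy <- aggregate(score ~ episode_number, data = word_scores, FUN = mean)
names(by_episode_toy)[2] <- "mean_sentiment_score"

# Build the tooltip text that the "text" aesthetic would display.
by_episode_toy$text_sent <- sprintf(
  "Episode %d, mean sentiment: %.2f",
  as.integer(by_episode_toy$episode_number),
  by_episode_toy$mean_sentiment_score
)

print(by_episode_toy)
```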

It’s just a few lines of code, but the result looks great! That wasn’t too difficult—with the Tidyverse suite of packages, coding in R is almost like writing an English sentence.

You can see that most episodes have a downward-pointing bar, below zero. That’s mostly because South Park characters aren’t afraid to use dirty words. And they do it quite a lot!

You’ll also notice a blue line in the plot; this denotes a trend in sentiment over time. There was a large increase in the score in earlier episodes that peaked roughly around episode 80 and then started to drop. In other words, the language used in South Park changes over time.

Episodes, how popular are you?

Pretty cool, huh? We can do something very similar with episode popularity. I’ll show you a different kind of plot here. Because the ratings can’t fall below zero, it’s better to use points instead of bars.

The data’s been prepared again. The following code produces an interactive plot of South Park episode ratings:

# One point per episode; the dashed red line marks episode 100
gg <- ggplot(by_episode, aes(x = episode_number, y = rating, group = 1, text = text_pop)) +
  geom_point(color = "#592a88", alpha = 0.6, size = 3) +
  geom_vline(xintercept = 100, color = "red", linetype = "dashed") +
  geom_smooth()

ggplotly(gg, tooltip = "text")

Each point represents an episode. If you hover over one of these points, you can see the episode name along with its rating. Can you find the best and the worst episodes we talked about earlier? Give it a try!

I’ve also included a trend line; this helps us determine how the popularity changes over time. Do you see any pattern here? Take a look at how the trend line behaves around the vertical red line, which marks episode 100. Up to that point, the popularity increased. After that, it consistently fell.
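If you’re curious where the trend line comes from: with fewer than 1,000 points, `geom_smooth()` fits a LOESS curve by default. The sketch below does the same kind of smoothing in base R with `loess()` and finds where the smoothed curve peaks. The ratings here are made up (a simple rise and fall around episode 100), so this only shows the technique, not the real IMDB data.

```r
# Made-up ratings that rise until episode 100 and then fall.
# These are NOT the real IMDB ratings.
toy <- data.frame(episode_number = 1:200)
toy$rating <- 8.5 - ((toy$episode_number - 100) / 100)^2

# Fit a LOESS smoother, the same family of curve geom_smooth() uses
# by default for small data sets.
fit <- loess(rating ~ episode_number, data = toy)
toy$trend <- predict(fit)

# Episode at which the smoothed trend is highest.
peak_episode <- toy$episode_number[which.max(toy$trend)]
print(peak_episode)
```

Running the same steps on the real `rating` column would give a number you can compare against the red line in the plot.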

The funny thing is that the creators themselves made a joke that a TV show shouldn’t go past 100 episodes. For South Park, popularity began to decline after its 97th episode, Cancelled. It looks like the creators were right even about their own show! Numbers don’t lie.

Conclusion

In this article, you learned that sentiment analysis scores words using a subjective dictionary or scale. You also saw how to use such information to get an overall feel of a show like South Park based on character dialog. We put all this awesome data together to make an interactive plot with just a few lines of R code.

In my next article in this series, I’ll focus on the main South Park characters; you’ll learn how their individual sentiments evolve as we take a look at some interesting stats. Stay tuned to see how they differ from each other!

And remember: once you have an idea, nothing is impossible. Answering data questions with R is easy. Be curious, and do what you like! R is a very valuable data science skill nowadays. I personally recommend learning to use the Tidyverse. I use it in every analysis and can’t really imagine a woRld without it!

If you already know R and want to explore the data I showed here on your own, check out my GitHub repository. The page comes with instructions to help you get started.

Good luck, and have fun!

Patrik Drhlik

Patrik is a freelance data scientist from Liberec, Czech Republic. He’s doing his PhD at the Technical University of Liberec, where he also teaches software engineering courses. He is highly passionate about the R programming language and data in general. He never leaves home without his Rubik’s cube and loves hitchhiking, athletics, mountains, kangaroos, and beer.
