A few years ago I saw someone tweet about being an excel expert and then their boss asked them to ma<e a pivot table. According to the tweet, that person immediately panicked and left their job. The tweet was funny I could relate because at first, they can seem confusing, especially in excel. But have no fear pivot tables are awesome and in Python, they are quick and simple. Pivot tables are a handy tool in data science. Anyone starting their journey in data science should be familiar with them. Let’s take a quick look at this process and by the end of this, we’ll get rid of any fear of pivot tables you may have.
Pivot tables are a technique where data is rearranged or “pivoted” to summarize certain information.
For instance, think of a product sales dataset. One of the columns might be ‘age categories’, such as young, middle age, and old. If you want to see an average amount of sales by each age category a pivot table would be a great tool to see this analysis. It would give you a new table showing the average amount sold to each category in the column.
Let’s walk through a real-life scenario where a pivot table is useful. We can use it to analyze the data and even draw a few conclusions.
To illustrate this workflow let’s take the following steps.
Let’s say a group of outraged parents think that once again video games are too violent. This time they are saying that video game developers are making too many games with cartoon and make-believe style violence. The ’ESRB’ is rating these as appropriate for children which means the developers can sell their games to a wider audience. To this group, a majority of the games rated for children have this style of cartoon violence. Let’s see if we can find some data and see if their claim has any validity.
“Won’t somebody please think of the children!” — Helen Lovejoy
Before we can even start creating our question or hypothesis we first need to have an understanding of video game ratings. The ESRB has a rating system we need to familiarize ourselves with before we can move forward. These ratings are described in detail on their website but I’ve also summarized the ratings in the table below.
Our group of irate parents was pretty vague in their accusations, but let’s take some liberty with their claim. Will predict a percentage of games they make think would be defined as a “majority” so we can make a hypothesis.
Since the rating EC specifically says “contains no material parents would find inappropriate” we can come up with our first hypothesis. If any ‘EC’ games have this style of violence something is wrong with the rating system. We can use the below hypothesis to communicate this.
Over 0% of the video games rated with ‘EC’ have some form of cartoon violence.
If this hypothesis is true then the parents are correct and there is something wrong with the rating system.
If our hypothesis is false, then the parents are incorrect and the rating system is working how it is designed.
We will also need a hypothesis for games rated ‘E’. ‘E’ is designed to be appropriate for kids over 6 but it can contain cartoon violence. Let’s say over 50% is a majority and use the following hypothesis.
Over 50% of the video games rated as ‘E’ have some sort of cartoon violence.
If our hypothesis is true, then the parents are correct that violence is now being snuck into games geared towards kids. If our hypothesis is false then we can sleep well at night knowing that there isn’t too much violence in video games rated ‘E’.
Kaggle has a perfect dataset named Video Games Rating By ‘ESRB’. From the description, we can see that this dataset contains 1895 games with 34 ESRP rating descriptors and the ESRB given rating. Each descriptor is listed as a binary value where 1 represents if the descriptor is present and 0 if it is not. The dataset also has a few more columns but we are only concerned with the rating descriptors. Let’s download this dataset and import it into Jupyter Notebook. Using Jupyter Notebook will allow us to import the Python libraries we need and give us a nice way to display our results.
import pandas as pd ratings = pd.read_csv("Video_games_esrb_rating.csv")
Now we need to decide which rating descriptors would be considered cartoony or animated. Let’s use the following descriptors: animated_blood, cartoon_violence, mild_cartoon_violence, mild_fantasy_violence. We will store these column names in a list.
cartoon_cols = ["animated_blood", "cartoon_violence", "mild_cartoon_violence", "mild_fantasy_violence"]
Our DataFrame has a method named pivot_table which will build the pivot table for us. We will use two parameters for this example. The first parameter is index which will be the rating. Think of the index as the value we are grouping by. The second parameter is the values which will be the list we made earlier. There is another very important parameter, aggfunc. This parameter will determine how our information is summarized. Since these columns are all boolean values the default value of finding the mean is perfect. The mean of these columns will give us the percentage of how many games have 1 listed for each of the descriptors.
pivot = ratings.pivot_table(index="esrb_rating", values=cartoon_cols) pivot
Awesome! Let’s breakdown what this output is telling us.
Visualizing Pivot Tables with Bar Graphs
The pivot table gave us some quick information in a matter of seconds. It is often easier for people to understand something if it is shown visually. We can use Pandas to make a bar graph with the pivot table.
Since this walkthrough is based on using Jupyter Notebook we need the first line to see the bar graph. We are also using numpy.arange as a shortcut to make 10 ticks on the y axis starting at 0 and going up in .1 increments. The pivot table we created is actually a DataFrame which allows us to call the plot.bar method on it. If we don’t specify the value on the x-axis, the index is used. In this case, that’s perfect as it will use our ‘ESRB’ ratings. The y axis will then show the values generated by each descriptor.
%matplotlib inline import numpy as np pivot.plot.bar(ylim=(0,1),yticks=np.arange(0,1,.1))
Pretty cool! Now we can visualize the information from the pivot table.
Immediately we saw our first hypothesis is false. None of the cartoon/animated violence descriptors showed up in the ‘EC’ rating. This is great! Children under 3 are still safe.
The second hypothesis is a little trickier. The problem stems from the fact that video games can have more than one descriptor. For this example, since we are trying to see if over 50% of the games have cartoon violence we can say no because .052885 + .0185096 is .237981. That number is significantly less than .50 so we are safe to make that conclusion. However, if our hypothesis had said that over 20% of the games had this style of violence we would need to see how many games have both mild cartoon violence and mild fantasy violence. This way we could properly account for the overlap when making our assessment.
There is something a little startling from this analysis. From games rated ‘E’ 18.5% have what they label as mild fantasy violence. Compared to the other ratings this seems really high. Maybe the parents are onto something. This could be something to explore further.
That was super easy! Thanks to Pandas we were able to quickly create a pivot table and visualize it. Pivot tables help us see certain patterns in our data that can help answer questions, analyze the data or even help us engineer new features for our machine learning models.