groupby and aggregation

Groupby and Aggregation with Pandas

Please follow and like us:

Groupby — the Least Understood Pandas Method

Groupby may be one of panda’s least understood commands. While the lessons in books and on websites are helpful, I find that real-world examples are significantly more complex than the ones in tutorials.

For this reason, I have decided to write about several issues that many beginners and even more advanced data analysts run into when attempting to use Pandas groupby.

Groupby can return a dataframe, a series, or a groupby object depending upon how it is used, and the output type issue leads to numerous problems when coders try to combine groupby with other pandas functions.

One especially confounding issue occurs if you want to make a dataframe from a groupby object or series. 

In addition to the complexity of getting what you want from groupby, other methods with the groupby module can also be more complicated than they first appear.

I will go over the use of groupby and the groupby aggregate functions. 

Groupby and Aggregation Tutorial

I used Jupyter Notebook for this tutorial, but the commands that I used will work with most any python installation that has pandas installed. It is important to point out that Jupyter notebook prints output as html, so any formating that you do that you want in the nice Jupyter notebook form, has to output to html. Regular text formating only outputs text not html.

First, as usual, begin by importing pandas and referring to the Pandas object as pd.

import pandas as pd

Then read in the small dataset that I prepared in excel for this tutorial.

df=pd.read_csv(‘test_purchase_data.csv’)

To start out with, it is a good idea to find out the size of the dataframe.

print(df.shape)

Next, it is also advisable to find out the names of the columns for future reference.

print(df.columns)

If you print out the dataframe you will get something like the following:

You can actually do agg functions on dataframe objects, without doing a groupby function. However, these should only be used in particular circumstances, because they perform the functions on all of the columns in the dataframe. For example, the command,

df2=df.agg(([‘sum’, ‘min’]))

will result in completely nonsense dataframe in which pandas performs the sum and min on the entire dataframe. It then attempts to place the result in just two rows.

The text is concatenated for the sum and the the user name is the text of multiple user names put together. This same thing is done to the gender, and the purchase_item. Obviously, no person is 223 years old. Again, the age is added together for the entire dataframe and placed in the sum row. All of the purchase ID numbers are added together and the prices are added together as well. So, the agg, sum function is particularly useless in this case. It does however add up the prices, correctly.

On the other hand, the min function looks almost rational, but be careful. It peformed the min function on each column in the entire dataframe. jacob88 is not female and 15, and did not buy the bo staff for $19.98. This combination might be difficult to catch as nonsense if the min name alphabetically happened to be female. When you perform aggregate functions, even with groupby, you should always be careful that the results are even a real row in the dataframe and not just some combination of drawn from many rows.

Let’s try our first groupby command.

grouped_df1=df.groupby(‘gender’)

If you print out this, you will get the pointer to the groupby object grouped_df1.

On my computer I get, 

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x1133c6cd0>

In this case, you have not referred to any columns other than the groupby column. In such cases, you only get a pointer to the object reference. To get a series you need an index column and a value column. The groupby object above only has the index column.

However if you try:

grouped_df1=df.groupby([‘gender’, ‘age’]).sum()

and you print out the dataframe; then, you get some unusual results that don’t make sense. For example, the item_id numbers are (pointlessly) added together.

If you do group by multiple columns, then to refer to those column values later for other calculations, you will need to reset the index.

grouped_df1.reset_index()

Another use of groupby is to perform aggregation functions. My favorite way of implementing the aggregation function is to apply it to a dictionary. The column name serves as a key, and the built-in Pandas function serves as a new column name. The function is applied to the series within the column with that name. Each function has to be in square brackets and

grouped_df=df.groupby(‘gender’).agg({‘user_name’:[‘nunique’]})

The nunique function finds the number of unique values in the column, in this case user_name.

Dictionaries inside the agg function can refer to multiple columns, and multiple built-in functions can be applied to the each of the original column names. I made the relatively long code line,

grouped_df=df.groupby(‘gender’)
.agg({‘user_name’:[‘nunique’],
‘purchase_id’: [‘count’],
‘price’:[‘min’, ‘max’, ‘mean’, ‘sum’]})

Because the code is within brackets, no continuation characters are needed to add a line break to the code.

One thing to keep in mind is that, when you print out the dataframe object or groupby object that you create, the new column names will be function names like sum, count, nunique, mean, etc. 

If this is not what you want for the column names, you can change the column names. Beyond just the unusual names, you will often have issues performing functions on a series within a column named after a built-in function. The best way to resolve this issue is to rename the columns. You set the grouped_df.columns equal to a list of strings in quotes.

grouped_df.columns=[‘gender_count’, ‘purchase_count’, ‘low_price’, ‘high_price’, ‘average_price’, ‘total_by_gender’]

You will notice that even though gender is the column grouped by, it is not needed in the list of column names, because it is inherent in the groupby that you created.

Make sure you have the correct number of column names in this column name list, excluding gender, in this case, because the groupby object already includes it. If you have the incorrect number of column names in the list, you will get an error.

After naming the columns, you can make your own functions that use values from several different columns. In this simple example, I calculate the percentage of users of each gender. I don’t just make this into a regular variable, but I make it into a new column of the dataframe object. 

By referencing a column that does not yet exist and setting it equal to the result of the gender_percentage equation, the following statement creates the gender_percentage column and populates the column with values from the custom function. I used a line continuation character to continue the line.

grouped_df[‘gender_percentage’]=\(grouped_df[‘gender_count’])/(grouped_df[‘gender_count’].sum()) * 100

So now, when you print the result, you will see a new column, gender percentage.

The data and the code for the tutorial is available at my github page at: https://github.com/scottcm73/pandas_groupby_tutorial

Reference

Python Data Aggregation. python-ds.com, retrieved from http://python-ds.com/python-data-aggregation on Dec. 11, 2019.

Please follow and like us: