Joins with Pandas and SQL

Joins with Pandas

Please follow and like us:

For those of you who are familiar with SQL, you will recognize the logic and functionality behind the Pandas merge feature. However, if you are not familiar with SQL, or need a refresher, this post will go over different kinds of joins using the merge methods. 

One of the main ways that databases are structured is to be normalized. Normalization of databases is useful for transactions and allows you to roll back a transaction if anything goes wrong with few complications.

On the other hand, normalization complicates data analysis. For the most part, data analysis is more straightforward if all of the data is in a single table.

Data analysists will have to be able to get data out of normalized databases and into one big table to work with it more easily.

That is where joins come in. Joins come in nine varieties. Four of them are common, inner join, left join, right join, outer join (often referred to as the full outer join). Two of them are rarely put on diagrams of joins, cross joins, and self joins. 

Cross joins give you all the different combinations of the combining of two tables. Cross joins can be useful in machine learning. 

The other three are less common, but still available for certain cases. These three others come with a variety of naming conventions. I refer to them as minus joins. They only happen when the key of the left dataframe or the key of the right dataframe is null. To fully understand which is which, it is best to look at a Venn diagram.

I produced a ven diagram for just such an occasion.

Shows Join types for Pandas Merge and SQL

Going into which one to use when is beyond the scope of this post. 

StackOverflow has an extensive article about using pandas merging that I found useful. It is one of the few that goes into using the less common types of merges.

First, as with any other Pandas functionality, you have to import pandas, and the conventional way to do it is as pd. 

import pandas as pd

First, before you do any type of join (merge), you need to know which columns are common to the two tables, and if these columns have the same names. 

If the common columns do have the same names, it makes the merge easier.

Then you need to figure out which columns you want in the result. Sometimes you will want to have slightly different names for the columns depending on the subject of the columns. 

For example, if you want to combine tables of Amazon stock prices and Apple stock prices and compare them on various dates. It might be useful to have columns for daily_high for Amazon and another column for the daily high of Apple. In such cases, you can use suffixes, such as _amazon and _apple to designate the column names. 

result = pd.merge(amazon_df, apple_df[[‘Date’, ‘Open’, ‘High’, ‘Low’, ‘Close’, ‘Volume’, ‘Market Cap’]], on=’Date’, how=‘inner’ suffixes=(‘_amazon, ‘_apple’))

If you do not include how and the type of join in the merge statement, the default statement is an inner join. 

Below is a basic example with a diagram that I made of the Amazon and Apple stock price example in which joined only on the dates in common. Hence an inner join. In the example in the diagram that I made, how is not included, but it is an inner join by default.

Another relatively common occurance that you may need to know how to handle is when the indexed columns have different names in each table. 

merge_result=pd.merge(left_df, right_df, left_on=‘customer_id’, right_on=‘id’, how=’inner’)

Reference

Pandas Merging 101. stackoverflow.com. Retrieved from https://stackoverflow.com/questions/53645882/pandas-merging-101 on Dec. 6, 2019.

Please follow and like us: