Many starting in data science and data analytics may get introduced to pandas before being introduced to the structured query language (SQL).
SQL has a lot of analogous methods and functions, with some even having the same name.
However, SQL and relational databases, in general, have one profound and important difference with pandas. Pandas primarily examines at and manipulates data in an unchanging form, with the underlying data remaining the same.
On the other hand, SQL and databases that use SQL are specifically designed to do fast changes, inserts, and updates and deletes. Furthermore, SQL and databases that use SQL can have many calls to insert, update, and delete data at once (concurrent transactions).
Pandas is not meant to make significant changes at all to the original data.
That being said, SQL and Pandas have many similarities in their functionality.
Connecting with as SQL database in python code is more complex
than using the read_csv command of pandas.
For this reason, “Why don’t you just read from and write to CSV files instead of databases?” is a common question for those new to databases.
The answer is that doing so is extremely slow and uses a lot of a computer’s ram to do so, and only one change to a CSV file can be done at a time. So, having multiple people reading from and writing to the same CSV file would get messy very quickly.
The official Pandas documentation gives a direct comparison with SQL functions.
For example, after getting data from a CSV file with pandas the head() function specfies the number of records to return. Similarly, the limit function in SQL returns a specific number of records. The limit function unlike some other functions in SQL does not require parethesis or even an equal sign.
The convention with SQL is that all SQL words that are not tables or columns are capitalized. Also, all SQL commands must end with a semicolon. The semicolon at the end is the first thing to check for if you get a syntax error in running single or multiple commands.
This SQL tutorial will require that you have MySQL installed locally and that you connect with MySQL via MySQL workbench.
I will not go into how to install MySQL and MySQL Workbench. Here are some links about this:
I went with an older version of MySQL, 5.7. They are now up to 8. It would likely also work with 8, because SQL remains the same.
Follow the directions on this website to install MySQL on Macs:
and this one for Windows:
To download the latest version of MySQL workbench go here:
Then you will need to make a connection from MySQL Workbench to the local MySQL Community Server.
Then you will need to download and run the queries to create the sakila database. The SQL files are downloadable from a zip file here:
Make sure to start MySQL Server before opening Workbench.
For macs, MySQL can be started by going to preferences and clicking on the gear icon below MySQL at the bottom of the window. Then, a new dialog box will appear showing MySQL server status. If MySQL server is not running, click on the Start MySQL Server button.
With Macs, clicking on this button raises the password prompt for you administrative password for your mac. Type in you login/administrative password for your system and click ok.
The sakila database can be created by running the sakila sakila-schema.sql commands from the zip file you downloaded and unzipped.
First, open the file by going to the File drop-down menu and selecting Open SQL Script then finding the sakila-schema.sql file.
Run the commands from the file by clicking on the lighting bolt button.
Next, go through the same procedure for the sakila-data.sql commands. After running the commands from this file, the database will be populated with data.
Close the tabs for each of the two open files.
Now, you are ready to begin this tutorial.
While SQL commands are not case sensitive, the names of columns and tables are.
On the left, side of the Workbench window, click on schemas to show the list of databases. Then click on sakila. You should see a list of tables including actor, address alphabetically list through store.
When you click on any one of the tables you will see the corresponding schema data types for that table below.
Find the File drop-down menu and select the new query tab.
The server has to know which database on MySQL Server to use.
So type in the command:
Be sure to end every command including this one with a semicolon!!!
This command will ensure that the SQL commands afterward use the sakila database.
The first and most common SQL command is the SELECT statement. The * is used in place of a list of all of the columns of a table.
So a command to might be:
SELECT * FROM actor;
However, before running this command, we will want to find out how big the actor table is.
You can accomplish this with COUNT.
SELECT COUNT(*) FROM actor;
You will get 200.
The COUNT(*) command in SQL has a subtle but often noticeable difference from the pandas count method. SQL COUNT(*) returns a count of all the records of a table. Whereas pandas.count() returns only the non-null records in the table. In this case, the result is the same. The schema requires the use of not-null values for all of the columns.
With normalized databases null values are often not an issue. However, non-normalized databases can allow null values.
If you use Count (column_name), you will get the non-null values in that column.
So if you try the command,
SELECT COUNT(last_name) FROM actor;
You will get the same number, 200.
LIMIT at the end of a SQL statement limits the number of returned results to the number specified.
SELECT COUNT(last_name) FROM actor LIMIT 5;
Will return one row/record with the value 200. It does not reach the limit of five records returned.
On the other hand if you try,
SELECT COUNT(last_name) FROM actor LIMIT 0;
It will return nothing.
Now, let’s try something that returns multiple records.
SELECT * from actor LIMIT 5;
You will notice that the five rows that the command returns are not sorted in any discernible way.
This can be remedied with the addition of an ORDER BY clause.
You want to order by the last_name.
SELECT * from actor LIMIT 5 ORDER BY last_name;
You will get an error saying that the ORDER BY clause is not valid in that position. The LIMIT clause needs to be last.
If you try,
SELECT * from actor ORDER BY last_name LIMIT 5;
You will get five records beginning with DEBBIE AKROYD.
For simplicity, all of the data was apparently inputted in all caps.
Another clause that is analogous to pandas is group by.
Through experimenting, I found that SQL GROUP BY has to come before ORDER BY. By default, it has to include the same list of columns as in the SELECT clause.
You will find that SQL is quite strict about the order that clauses have to be in to work.
If you attempt to do only one of two columns listed in the SELECT clause,
SELECT first_name, last_name FROM actor GROUP BY first_name LIMIT 5;
You will get an error saying that last_name was not included in the GROUP BY clause. The error will also state that “This is incompatible with sql_mode=only_full_groupby.”
Once you do include both columns first_name, last_name in the GROUP BY clause, the command runs without a hitch.
The returned results begin with ADAM GRANT and ADAM HOPPER.
Let’s switch to looking at a different table with some numbers, payment.Full utilization of the payment table requires joining it to other tables. I will get to that later. Now I want to go over aggregate functions.
Similar to pandas, SQL has its own aggregate functions. The most often used of these include SUM, MAX, MIN, COUNT, COUNT (distinct), and AVG, which is the mean.
Time and date functions are in a separate category of SQL and MySQL functions.
SELECT AVG(amount), customer_id FROM payment GROUP BY customer_id;
This command returns 599 customer payment averages. GROUP BY does not require you to list the aggregate functions of the SELECT clause.
If you try the same command with MAX instead of AVG, you will get the maximum amount paid for each customer_id.SELECT MAX(amount), customer_id FROM payment GROUP BY customer_id;
Similarly, you can calculate the SUM by customer_id;
SELECT SUM(amount), customer_id FROM payment
GROUP BY customer_id;
To find the maximum amount overall for all the customers, you take away the customer_id from the select clause, and you eliminate the GROUP BY.
SELECT MAX(amount) FROM payment;
The above command returns 11.99 without a dollar sign.
You might think that MAX(SUM(amount)) will work. Unfortunately not. The MAX function, like the SUM function only works on an entire set of values from a column with its own GROUP BY.
One way to resolve this issue is to use a subquery. You have be sure to give an alias for the subquery in such a case.
SELECT MAX(X.cust_sum) FROM (SELECT SUM(amount) as cust_sum, customer_id FROM payment GROUP BY customer_id) as X
It is also a good idea to name the result of any type of aggregate function as something to make it easy to refer to.
Subqueries go within parentheses in SQL commands.
Now for joins. Joins in SQL are analogous to merges in pandas, and like pandas an INNER JOIN is the most common and is the default if just the word JOIN is in the command. SQL also has left joins, and right joins.
One thing that SQL is missing is outer join. This can be accomplished by doing a left join, then a right join and concatenating the results of the joins.
Joins require an ON clause. Joins can use aliases that rename the table to a shorter name. This shorter name is usually just one or two letters from the table name. Using aliases requires that in the FROM clause after the table name you but the one or two letter alias. FROM address a, for example.
Another place to put the shortened version of the table name next to the full table name is often right after a the word JOIN. So, INNER JOIN film f for example.
Say you want to find the name of the city for each address. That requires looking at the address table and the city table. Both tables have city_id, but only the city table has the city.
So in this case,
SELECT a.address_id, a.address, c.city FROM address a INNER JOIN city c ON a.city_id = c.city_id LIMIT 10;
Again, the limit clause has to be at the end. The result happens to show the cities listed in alphabetical order.
What if you want it in descending alphabetical order? You add an ORDER BY clause before the limit function.
SELECT a.address_id, a.address, c.city FROM address a INNER JOIN city c ON a.city_id = c.city_id ORDER BY c.city LIMIT 10;
The result gives a list of cities beginning with Z’s.
Let’s say you also want to find the country that each city is in. Unfortunately, the country is in another table. Therefore an additional join is needed.
The additional join has to before the ORDER BY and LIMIT portions of the command, but it also has to be after the ON clause.
Another point to make is that the country table cannot also be aliased as c. I suggest co.
If you try the following, you will not get an error, but you get exactly the same results as the previous command.
SELECT a.address_id, a.address, c.city FROM address a INNER JOIN city c ON a.city_id = c.city_id INNER JOIN country co ON c.country_id=co.country_id ORDER BY c.city DESC LIMIT 10;
You forgot something. The country has to be in the SELECT clause if you want to see it in the results.
SELECT a.address_id, a.address, c.city, co.country FROM address a INNER JOIN city c ON a.city_id = c.city_id INNER JOIN country co ON c.country_id=co.country_id ORDER BY c.city DESC LIMIT 10;
You will notice that SQL is smart enough to know from context that country co refers to the table and co.country refers to the country column of the country table.
So far, we haven’t added any data or added any tables.
This part is tricky because MySQL does not support insertions into multiple tables at the same time. So you have to insert into one and then insert into the other one.
As a conclusion to this lesson, lets start from scratch and create a new database. Run the command:
CREATE DATABASE testing_db;
After running the above command, click on the refresh button for the MySQL schemas. You will see an empty database. Now, you need to make tables with a table schema.
If you double-click on testing_db and click on tables, you will see that the database doesn’t have tables yet.
CREATE TABLE person (person_id MEDIUMINT NOT NULL AUTO_INCREMENT, first_name VARCHAR(30) NOT NULL, last_name VARCHAR(30) NOT NULL, sex CHAR(1), birth DATE, PRIMARY KEY (person_id));
Refresh the schemas and double-click on tables for testing_db. Now you should see the person table. The NOT NULL ensures that the value is added for each record, and the primary key serves as the index.
INSERT INTO table(col1,col2,…) VALUES (val1, val2,…);
So for the person table, you don’t need to insert values into the auto_incremented person_id. However, all the other columns with NOT NULL in the schema do need values. So,
INSERT INTO person(first_name, last_name, sex, birth) VALUES(“ERIC”, “SMITH”, “M”, “19951025”);
Now, to see if the insert worked, we can use the command:
SELECT * FROM person;
You will notice that the birth column now has hyphens in the date.
Let’s say you want to add a column a person’s email address. You also want to allow this
ALTER TABLE person ADD email varchar(50) AFTER birth;
In this case we want the email to allow nulls.
So, check the table again.
SELECT * FROM person;
It now returns the row you inserted with a null value for the email.
I will cover UPDATE and views in the next post.