Install SQL packages:

In [1]:
# !conda install -y psycopg2
# !conda install -y postgresql
# !pip install ipython-sql
# !pip install sqlalchemy

Standard imports + sqlalchemy

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sqlalchemy

%matplotlib inline
%load_ext sql

Establish a database connection to the Postgres database running on my machine localhost using the schema ds100

In [3]:
postgresql_uri = "postgres://jegonzal:@localhost:5432/ds100"
sqlite_uri = "sqlite:///data/ds100.db"
default_db = postgresql_uri
In [4]:
%%sql $postgresql_uri
-- Need to drop views to prevent integrity constraint violations later.
DROP VIEW IF EXISTS date_stats;
Done.
Out[4]:
[]

Basic Table Creation, Manipulation, and Queries

The following example works through some basic table operations including:

  1. creating a table
  2. adding rows
  3. updating rows
  4. deleting rows
  5. querying the table

CREATE TABLE and DROP TABLE

To start, we are going to define a toy relation (a.k.a. table), populate it with some toy data, and work through some basic SQL. Deeper stuff coming soon though, I promise!

First, let's create the table of students

In [5]:
%%sql $default_db

-- Drop the table if it already exists
DROP TABLE IF EXISTS students;

-- Create the table profs
CREATE TABLE students(
    name TEXT PRIMARY KEY, 
    gpa FLOAT CHECK (gpa >= 0.0 and gpa <= 4.0), 
    age INTEGER, 
    dept TEXT, 
    gender CHAR);
Done.
Done.
Out[5]:
[]

Note that each column has a fixed data type.

The DBMS will enforce these types as data is inserted.

Note also the definition of a primary key, as we discussed in the EDA lecture.

The DBMS will enforce the uniqueness of values in the key columns.

To see what we've done, let's run our first query, dumping out the content of the table: every column for each row. We denote every column with *:

In [6]:
%%sql $default_db

SELECT * FROM students;
0 rows affected.
Out[6]:
name gpa age dept gender

INSERTing VALUES

Now let's manually insert some values into the table.

In [7]:
%%sql $default_db

INSERT INTO students VALUES 
 ('Sergey Brin', 2.8, 40, 'CS', 'M'),
  ('Danah Boyd', 3.9, 35, 'CS', 'F'),
  ('Bill Gates', 1.0, 60, 'CS', 'M'),
  ('Hillary Mason', 4.0, 35, 'DATASCI', 'F'),
  ('Mike Olson', 3.7, 50, 'CS', 'M'),
  ('Mark Zuckerberg', 4.0, 30, 'CS', 'M'),
  ('Cheryl Sandberg', 4.0, 47, 'BUSINESS', 'F'),
  ('Susan Wojcicki', 4.0, 46, 'BUSINESS', 'F'),
  ('Marissa Meyer', 4.0, 45, 'BUSINESS', 'F');
9 rows affected.
Out[7]:
[]

Note that strings in SQL must be quoted with a single quote ' character.

Note how insertions need to have values in the same order as the columns in the create table statement! Let's make sure our data is there:

In [8]:
%%sql $default_db

SELECT * FROM students;
9 rows affected.
Out[8]:
name gpa age dept gender
Sergey Brin 2.8 40 CS M
Danah Boyd 3.9 35 CS F
Bill Gates 1.0 60 CS M
Hillary Mason 4.0 35 DATASCI F
Mike Olson 3.7 50 CS M
Mark Zuckerberg 4.0 30 CS M
Cheryl Sandberg 4.0 47 BUSINESS F
Susan Wojcicki 4.0 46 BUSINESS F
Marissa Meyer 4.0 45 BUSINESS F

Primary Key Integrity Constraint

What happens if we try to insert another record with the same primary key (name)?

In [9]:
# %%sql $default_db
# INSERT INTO students VALUES ('Bill Gates', 4.0, 60, 'BUSINESS', 'M')

Using Pandas and SQL

We can populate the database using Pandas as well:

In [10]:
tips_df = sns.load_dataset("tips")
tips_df.head()
Out[10]:
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4

Create a connection with the database

In [11]:
engine = sqlalchemy.create_engine(default_db)

Drop the table if it already exists and then upload the table to the database.

In [12]:
_ = engine.execute("DROP TABLE IF EXISTS tips;")
with engine.connect() as conn:
    tips_df.to_sql("tips", conn)

We can also download tables directly into pandas:

In [13]:
with engine.connect() as conn:
    students = pd.read_sql("SELECT * FROM students", conn)
students
Out[13]:
name gpa age dept gender
0 Sergey Brin 2.8 40 CS M
1 Danah Boyd 3.9 35 CS F
2 Bill Gates 1.0 60 CS M
3 Hillary Mason 4.0 35 DATASCI F
4 Mike Olson 3.7 50 CS M
5 Mark Zuckerberg 4.0 30 CS M
6 Cheryl Sandberg 4.0 47 BUSINESS F
7 Susan Wojcicki 4.0 46 BUSINESS F
8 Marissa Meyer 4.0 45 BUSINESS F

Another Pandas + SQL

We can also evaluate a query string and use the SQL iPython extension to execute the query and construct a DataFrame.

In [14]:
query = """
SELECT * FROM tips
WHERE sex = 'Female';
"""

res = %sql $query

df = res.DataFrame()

df.head()
87 rows affected.
Out[14]:
index total_bill tip sex smoker day time size
0 0 16.99 1.01 Female No Sun Dinner 2
1 4 24.59 3.61 Female No Sun Dinner 4
2 11 35.26 5.00 Female No Sun Dinner 4
3 14 14.83 3.02 Female No Sun Dinner 2
4 16 10.33 1.67 Female No Sun Dinner 3

Exploring the Schema

There is no mechanism in standard SQL to access the schema associated with each database management systems. Here we use the corresponding client tools

  • Sqlite3 schema information:
In [15]:
!sqlite3 data/ds100.db ".schema students"
CREATE TABLE students(
    name TEXT PRIMARY KEY, 
    gpa FLOAT CHECK (gpa >= 0.0 and gpa <= 4.0), 
    age INTEGER, 
    dept TEXT, 
    gender CHAR);
  • PostgreSQL schema information:
In [16]:
!psql ds100 -c "\d students"
        Table "public.students"
 Column |       Type       | Modifiers 
--------+------------------+-----------
 name   | text             | not null
 gpa    | double precision | 
 age    | integer          | 
 dept   | text             | 
 gender | character(1)     | 
Indexes:
    "students_pkey" PRIMARY KEY, btree (name)
Check constraints:
    "students_gpa_check" CHECK (gpa >= 0.0::double precision AND gpa <= 4.0::double precision)

  • Using SQL Alchemy's Generic Driver

I found the following SQL Alchemy Quick Reference Sheet to be very helpful.

In [17]:
engine = sqlalchemy.create_engine(postgresql_uri)
inspector = sqlalchemy.inspect(engine)
for col in inspector.get_columns("students"):
    print(col)
{'name': 'name', 'type': TEXT(), 'nullable': False, 'default': None, 'autoincrement': False}
{'name': 'gpa', 'type': DOUBLE_PRECISION(precision=53), 'nullable': True, 'default': None, 'autoincrement': False}
{'name': 'age', 'type': INTEGER(), 'nullable': True, 'default': None, 'autoincrement': False}
{'name': 'dept', 'type': TEXT(), 'nullable': True, 'default': None, 'autoincrement': False}
{'name': 'gender', 'type': CHAR(length=1), 'nullable': True, 'default': None, 'autoincrement': False}
In [18]:
engine = sqlalchemy.create_engine(sqlite_uri)
inspector = sqlalchemy.inspect(engine)
for col in inspector.get_columns("students"):
    print(col)
{'name': 'name', 'type': TEXT(), 'nullable': True, 'default': None, 'autoincrement': 'auto', 'primary_key': 1}
{'name': 'gpa', 'type': FLOAT(), 'nullable': True, 'default': None, 'autoincrement': 'auto', 'primary_key': 0}
{'name': 'age', 'type': INTEGER(), 'nullable': True, 'default': None, 'autoincrement': 'auto', 'primary_key': 0}
{'name': 'dept', 'type': TEXT(), 'nullable': True, 'default': None, 'autoincrement': 'auto', 'primary_key': 0}
{'name': 'gender', 'type': CHAR(), 'nullable': True, 'default': None, 'autoincrement': 'auto', 'primary_key': 0}

UPDATE values

What is Bill Gates' GPA?

In [19]:
%%sql $default_db

SELECT * FROM students
    WHERE name LIKE  '%Bill%' -- SQL like regular expression
1 rows affected.
Out[19]:
name gpa age dept gender
Bill Gates 1.0 60 CS M

Wow, Bill has a low GPA let's lend him a hand.

In [20]:
%%sql $default_db

UPDATE students
    SET gpa = 1.0 + gpa
    WHERE LOWER(name) = 'bill gates';
1 rows affected.
Out[20]:
[]

And let's check the table now:

In [21]:
%%sql $default_db

SELECT * FROM students
    WHERE name ~'^Bil.'; -- Regular expression 
1 rows affected.
Out[21]:
name gpa age dept gender
Bill Gates 2.0 60 CS M

Suppose Mark logged into the database and tried to give himself a 5.0? Uncomment the following line to see what happens:

In [22]:
# %%sql 

# UPDATE students
#     SET gpa = 5.0
#     WHERE LOWER(name) LIKE '%zuck%';

The above code fails. Why? (check the gpa.)

Reviewing our table

In [23]:
%%sql $default_db

SELECT * FROM students
9 rows affected.
Out[23]:
name gpa age dept gender
Sergey Brin 2.8 40 CS M
Danah Boyd 3.9 35 CS F
Hillary Mason 4.0 35 DATASCI F
Mike Olson 3.7 50 CS M
Mark Zuckerberg 4.0 30 CS M
Cheryl Sandberg 4.0 47 BUSINESS F
Susan Wojcicki 4.0 46 BUSINESS F
Marissa Meyer 4.0 45 BUSINESS F
Bill Gates 2.0 60 CS M

Notice two things:

  1. If you are using Postgres the rows likely came back in a different order than before. Remember: relations do not have a defined order, and in fact two different orders are just two ways of describing the same relation!
  2. Note the relational style of the update statement: we decide which rows get updated based entirely on the values in each row, as checked by the where clause. There is no notion of any information outside the values in the row--e.g. there are no "object identifiers" or "row numbers"... everything is just the data and only the data.

Deleting Records

We can delete rows in much the same way we update rows:

In [24]:
%%sql $default_db

DELETE FROM students 
    WHERE name = 'Sergey Brin'
1 rows affected.
Out[24]:
[]
In [25]:
%%sql $default_db

SELECT * FROM students;
8 rows affected.
Out[25]:
name gpa age dept gender
Danah Boyd 3.9 35 CS F
Hillary Mason 4.0 35 DATASCI F
Mike Olson 3.7 50 CS M
Mark Zuckerberg 4.0 30 CS M
Cheryl Sandberg 4.0 47 BUSINESS F
Susan Wojcicki 4.0 46 BUSINESS F
Marissa Meyer 4.0 45 BUSINESS F
Bill Gates 2.0 60 CS M

Restoring Sergey

In [26]:
%%sql $default_db

INSERT INTO students VALUES
      ('Sergey Brin', 4.0, 40, 'CS', 'M');
    
1 rows affected.
Out[26]:
[]

SELECT Queries

Now let's start looking at some slightly more interesting queries. The canonical SQL query block includes the following clauses, in the order they appear. Square brackets indicate optional clauses.

SELECT ...
  FROM ...
[WHERE ...]
[GROUP BY ...]
[HAVING ...]
[ORDER BY ...]
[LIMIT ...];

Query blocks can reference one or more tables, and be nested in various ways. Before we worry about multi-table queries or nested queries, we'll work our way through examples that exercise all of these clauses on a single table.

The SELECT LIST

The SELECT list determines which columns to include in the output.

In [27]:
%%sql $default_db

SELECT name
FROM students;
9 rows affected.
Out[27]:
name
Danah Boyd
Hillary Mason
Mike Olson
Mark Zuckerberg
Cheryl Sandberg
Susan Wojcicki
Marissa Meyer
Bill Gates
Sergey Brin

Functions in the Selection List

SQL has a wide range of functions that can be applied to each attribute in the select list. Notice that we can alias (name) the columns with AS. The complete list of built in PostreSQL functions is available here.

In [28]:
%%sql $default_db

SELECT UPPER(name) AS n, LOWER(dept) as d, LOG(gpa) * 4.0 AS log_gpa
FROM students;
9 rows affected.
Out[28]:
n d log_gpa
DANAH BOYD cs 2.364258428106
HILLARY MASON datasci 2.40823996531185
MIKE OLSON cs 2.27280689626798
MARK ZUCKERBERG cs 2.40823996531185
CHERYL SANDBERG business 2.40823996531185
SUSAN WOJCICKI business 2.40823996531185
MARISSA MEYER business 2.40823996531185
BILL GATES cs 1.20411998265592
SERGEY BRIN cs 2.40823996531185

Selecting Distinct Rows

As we know, SQL is a multiset logic, preserving the meaning of the number of duplicates in query results. Sometimes, however, we don't want to keep the duplicates, we want to eliminate them. This is done simply by adding the keyword DISTINCT after the SELECT statement:

In [29]:
%%sql $default_db

SELECT DISTINCT dept
FROM students;
3 rows affected.
Out[29]:
dept
CS
BUSINESS
DATASCI

Which rows are used when taking the distinct entries? Does it really matter?

The WHERE Clause

The WHERE clause determines which rows of to include by specifying a predicate (boolean expression). Rows (tuples) that satisfy this expression are returned.

In [30]:
%%sql $default_db

SELECT name, gpa
FROM students
WHERE dept = 'CS'
5 rows affected.
Out[30]:
name gpa
Danah Boyd 3.9
Mike Olson 3.7
Mark Zuckerberg 4.0
Bill Gates 2.0
Sergey Brin 4.0

And of course we can specify both rows and columns explicitly. If we have a primary key, we can filter things down to even the cell level via a select list of one column, and a where clause checking equality on the primary key columns:

In [31]:
%%sql $default_db

SELECT gpa
FROM students
WHERE name = 'Bill Gates';
    
1 rows affected.
Out[31]:
gpa
2.0

Note that even this "single-celled" response still has a uniform data type of a relation.

SQL is Closed Over Tables: SQL expressions take in tables and always produce tables. How does this compare to Pandas?

Now that you can slice and dice tables into columns, rows and cells, you have enough knowledge to poke around in a database. Let's move on to skills that you'll need as a data scientist.

Group By Aggregation

GROUP BY aggregation in SQL is a lot like the group by in Pandas. SQL provides a family of [aggregate functions] for use in the select clause. In the simplest form, queries with aggregates in the select clause generate a single row of output, with each aggregate function performing a summary of all the rows of input. You can have many aggregate functions in your select clause:

A list of built-in aggregate functions in PostgreSQL is here. In our case, the query we are looking for is as follows.

In the following we compute the average GPA as well as the number of students in each department:

In [32]:
%%sql $default_db

SELECT dept, AVG(gpa) as avg_gpa, COUNT(*)
FROM students
GROUP BY dept
3 rows affected.
Out[32]:
dept avg_gpa count
CS 3.52 5
BUSINESS 4.0 3
DATASCI 4.0 1

We can use the HAVING clause to apply a predicate to groups.

In [33]:
%%sql $default_db

SELECT dept, AVG(gpa) as avg_gpa, COUNT(*)
FROM students
GROUP BY dept
HAVING COUNT(*) >= 2
2 rows affected.
Out[33]:
dept avg_gpa count
CS 3.52 5
BUSINESS 4.0 3
In [34]:
%%sql  $default_db

SELECT dept, AVG(gpa) as avg_gpa, COUNT(*)
FROM students
WHERE gender = 'F'
GROUP BY dept
HAVING COUNT(*) >= 2
1 rows affected.
Out[34]:
dept avg_gpa count
BUSINESS 4.0 3

Ordering the output

As a nicety, SQL allows you to order your output rows, in either ascending (ASC) or descending (DESC) order of the values in columns. For example:

In [35]:
%%sql $default_db

SELECT *
FROM students
ORDER BY gpa;
9 rows affected.
Out[35]:
name gpa age dept gender
Bill Gates 2.0 60 CS M
Mike Olson 3.7 50 CS M
Danah Boyd 3.9 35 CS F
Cheryl Sandberg 4.0 47 BUSINESS F
Susan Wojcicki 4.0 46 BUSINESS F
Marissa Meyer 4.0 45 BUSINESS F
Sergey Brin 4.0 40 CS M
Hillary Mason 4.0 35 DATASCI F
Mark Zuckerberg 4.0 30 CS M
In [36]:
%%sql $default_db

SELECT *
FROM students
ORDER BY gpa, age;
9 rows affected.
Out[36]:
name gpa age dept gender
Bill Gates 2.0 60 CS M
Mike Olson 3.7 50 CS M
Danah Boyd 3.9 35 CS F
Mark Zuckerberg 4.0 30 CS M
Hillary Mason 4.0 35 DATASCI F
Sergey Brin 4.0 40 CS M
Marissa Meyer 4.0 45 BUSINESS F
Susan Wojcicki 4.0 46 BUSINESS F
Cheryl Sandberg 4.0 47 BUSINESS F
In [37]:
%%sql $default_db

SELECT *
FROM students
ORDER BY gpa DESC, age ASC;
9 rows affected.
Out[37]:
name gpa age dept gender
Mark Zuckerberg 4.0 30 CS M
Hillary Mason 4.0 35 DATASCI F
Sergey Brin 4.0 40 CS M
Marissa Meyer 4.0 45 BUSINESS F
Susan Wojcicki 4.0 46 BUSINESS F
Cheryl Sandberg 4.0 47 BUSINESS F
Danah Boyd 3.9 35 CS F
Mike Olson 3.7 50 CS M
Bill Gates 2.0 60 CS M

LIMIT Clause

The limit clause limits the number of elements returned. Which elements are returned? While this depends on the order of elements which could be arbitrary beyond anything specified by the ORDER BY clauses.

Is this a random sample?

No

Why do we use the LIMIT clause?

Often the database we are querying is massive and retrieving the entire table as we are debugging the query can be costly in time and system resources. However, we should avoid using LIMIT when constructing a sample of the data.

In [38]:
%%sql
SELECT * FROM students 
LIMIT 3
3 rows affected.
Out[38]:
name gpa age dept gender
Danah Boyd 3.9 35 CS F
Hillary Mason 4.0 35 DATASCI F
Mike Olson 3.7 50 CS M

If this is funny you are Getting IT

Boby Drop tables

... it's funny, believe me.