Install SQL packages:

In [1]:
# !conda install -y psycopg2
# !conda install -y postgresql
# !pip install ipython-sql
# !pip install sqlalchemy

Standard imports + sqlalchemy

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sqlalchemy

%matplotlib inline
%load_ext sql

Establish a database connection to the Postgres database running on my machine (localhost), using the ds100 database

In [3]:
postgresql_uri = "postgres://jegonzal:@localhost:5432/ds100"
sqlite_uri = "sqlite:///data/ds100.db"
default_db = postgresql_uri
In [4]:
%%sql $postgresql_uri
-- Need to drop views to prevent integrity constraint violations later.
DROP VIEW IF EXISTS date_stats;
Done.
Out[4]:
[]

Basic Table Creation, Manipulation, and Queries

The following example works through some basic table operations including:

  1. creating a table
  2. adding rows
  3. updating rows
  4. deleting rows
  5. querying the table

CREATE TABLE and DROP TABLE

To start, we are going to define a toy relation (a.k.a. table), populate it with some toy data, and work through some basic SQL. Deeper stuff coming soon though, I promise!

First, let's create the table of students

In [5]:
%%sql $default_db

-- Drop the table if it already exists
DROP TABLE IF EXISTS students;

-- Create the table students
CREATE TABLE students(
    name TEXT PRIMARY KEY, 
    gpa FLOAT CHECK (gpa >= 0.0 and gpa <= 4.0), 
    age INTEGER, 
    dept TEXT, 
    gender CHAR);
Done.
Done.
Out[5]:
[]

Note that each column has a fixed data type.

The DBMS will enforce these types as data is inserted.

Note also the definition of a primary key, as we discussed in the EDA lecture.

The DBMS will enforce the uniqueness of values in the key columns.

To see what we've done, let's run our first query, dumping out the content of the table: every column for each row. We denote every column with *:

In [6]:
%%sql $default_db

SELECT * FROM students;
0 rows affected.
Out[6]:
name gpa age dept gender

If this is funny, you are Getting IT:

[xkcd's "Little Bobby Tables" comic, on the dangers of unsanitized SQL input]

... it's funny, believe me.

INSERTing VALUES

Now let's manually insert some values into the table.

In [7]:
%%sql $default_db

INSERT INTO students VALUES 
 ('Sergey Brin', 2.8, 40, 'CS', 'M'),
  ('Danah Boyd', 3.9, 35, 'CS', 'F'),
  ('Bill Gates', 1.0, 60, 'CS', 'M'),
  ('Hillary Mason', 4.0, 35, 'DATASCI', 'F'),
  ('Mike Olson', 3.7, 50, 'CS', 'M'),
  ('Mark Zuckerberg', 4.0, 30, 'CS', 'M'),
  ('Cheryl Sandberg', 4.0, 47, 'BUSINESS', 'F'),
  ('Susan Wojcicki', 4.0, 46, 'BUSINESS', 'F'),
  ('Marissa Meyer', 4.0, 45, 'BUSINESS', 'F');
9 rows affected.
Out[7]:
[]

Note that strings in SQL must be quoted with a single quote ' character.

Note how insertions need to have values in the same order as the columns in the CREATE TABLE statement (unless you name the columns explicitly; see the aside after the next query). Let's make sure our data is there:

In [8]:
%%sql $default_db

SELECT * FROM students;
9 rows affected.
Out[8]:
name gpa age dept gender
Sergey Brin 2.8 40 CS M
Danah Boyd 3.9 35 CS F
Bill Gates 1.0 60 CS M
Hillary Mason 4.0 35 DATASCI F
Mike Olson 3.7 50 CS M
Mark Zuckerberg 4.0 30 CS M
Cheryl Sandberg 4.0 47 BUSINESS F
Susan Wojcicki 4.0 46 BUSINESS F
Marissa Meyer 4.0 45 BUSINESS F
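
As an aside, the VALUES only had to follow the CREATE TABLE column order above because we did not name the columns; an INSERT that lists its columns can supply them in any order. A sketch, left commented out so the table keeps its nine rows (the extra student here is made up):

# import sqlalchemy
# engine = sqlalchemy.create_engine(default_db)
# engine.execute("""
#     INSERT INTO students (gpa, name, age, dept, gender)
#     VALUES (3.5, 'Grace Hopper', 85, 'CS', 'F');
# """)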

Primary Key Integrity Constraint

What happens if we try to insert another record with the same primary key (name)?

In [9]:
# %%sql $default_db
# INSERT INTO students VALUES ('Bill Gates', 4.0, 60, 'BUSINESS', 'M')
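
If you uncomment and run it, the DBMS rejects the insert because the key 'Bill Gates' already exists. Here is a minimal Python sketch of the same experiment that traps the error (it assumes the default_db URI defined above):

import sqlalchemy

engine = sqlalchemy.create_engine(default_db)
try:
    # Duplicate primary key: 'Bill Gates' is already in the table.
    engine.execute(
        "INSERT INTO students VALUES ('Bill Gates', 4.0, 60, 'BUSINESS', 'M')")
except sqlalchemy.exc.IntegrityError as e:
    print("Insert rejected:", e.orig)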

Using Pandas and SQL

We can populate the database using Pandas as well:

In [10]:
tips_df = sns.load_dataset("tips")
tips_df.head()
Out[10]:
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4

Create a connection with the database

In [11]:
engine = sqlalchemy.create_engine(default_db)

Drop the table if it already exists and then upload the table to the database.

In [12]:
_ = engine.execute("DROP TABLE IF EXISTS tips;")
with engine.connect() as conn:
    tips_df.to_sql("tips", conn)

We can also download tables directly into pandas:

In [13]:
with engine.connect() as conn:
    students = pd.read_sql("SELECT * FROM students", conn)
students
Out[13]:
name gpa age dept gender
0 Sergey Brin 2.8 40 CS M
1 Danah Boyd 3.9 35 CS F
2 Bill Gates 1.0 60 CS M
3 Hillary Mason 4.0 35 DATASCI F
4 Mike Olson 3.7 50 CS M
5 Mark Zuckerberg 4.0 30 CS M
6 Cheryl Sandberg 4.0 47 BUSINESS F
7 Susan Wojcicki 4.0 46 BUSINESS F
8 Marissa Meyer 4.0 45 BUSINESS F

Exploring the Schema

Standard SQL provides no portable mechanism to inspect the schema information maintained by each database management system. Here we use each system's client tools:

  • SQLite schema information:
In [14]:
!sqlite3 data/ds100.db ".schema students"
CREATE TABLE students(
    name TEXT PRIMARY KEY, 
    gpa FLOAT CHECK (gpa >= 0.0 and gpa <= 4.0), 
    age INTEGER, 
    dept TEXT, 
    gender CHAR);
  • PostgreSQL schema information:
In [15]:
!psql ds100 -c "\d students"
        Table "public.students"
 Column |       Type       | Modifiers 
--------+------------------+-----------
 name   | text             | not null
 gpa    | double precision | 
 age    | integer          | 
 dept   | text             | 
 gender | character(1)     | 
Indexes:
    "students_pkey" PRIMARY KEY, btree (name)
Check constraints:
    "students_gpa_check" CHECK (gpa >= 0.0::double precision AND gpa <= 4.0::double precision)

  • Using SQLAlchemy's generic driver

I found the SQLAlchemy Quick Reference Sheet to be very helpful.

In [16]:
engine = sqlalchemy.create_engine(postgresql_uri)
inspector = sqlalchemy.inspect(engine)
for col in inspector.get_columns("students"):
    print(col)
{'name': 'name', 'type': TEXT(), 'nullable': False, 'default': None, 'autoincrement': False}
{'name': 'gpa', 'type': DOUBLE_PRECISION(precision=53), 'nullable': True, 'default': None, 'autoincrement': False}
{'name': 'age', 'type': INTEGER(), 'nullable': True, 'default': None, 'autoincrement': False}
{'name': 'dept', 'type': TEXT(), 'nullable': True, 'default': None, 'autoincrement': False}
{'name': 'gender', 'type': CHAR(length=1), 'nullable': True, 'default': None, 'autoincrement': False}
In [17]:
engine = sqlalchemy.create_engine(sqlite_uri)
inspector = sqlalchemy.inspect(engine)
for col in inspector.get_columns("students"):
    print(col)
{'name': 'name', 'type': TEXT(), 'nullable': True, 'default': None, 'autoincrement': 'auto', 'primary_key': 1}
{'name': 'gpa', 'type': FLOAT(), 'nullable': True, 'default': None, 'autoincrement': 'auto', 'primary_key': 0}
{'name': 'age', 'type': INTEGER(), 'nullable': True, 'default': None, 'autoincrement': 'auto', 'primary_key': 0}
{'name': 'dept', 'type': TEXT(), 'nullable': True, 'default': None, 'autoincrement': 'auto', 'primary_key': 0}
{'name': 'gender', 'type': CHAR(), 'nullable': True, 'default': None, 'autoincrement': 'auto', 'primary_key': 0}

UPDATE values

What is Bill Gates' GPA?

In [18]:
%%sql $default_db

SELECT * FROM students
    WHERE name LIKE '%Bill%' -- LIKE does simple wildcard matching, not full regular expressions
1 rows affected.
Out[18]:
name gpa age dept gender
Bill Gates 1.0 60 CS M

Wow, Bill has a low GPA; let's lend him a hand.

In [19]:
%%sql $default_db

UPDATE students
    SET gpa = 1.0 + gpa
    WHERE LOWER(name) = 'bill gates';
1 rows affected.
Out[19]:
[]

And let's check the table now:

In [20]:
%%sql $default_db

SELECT * FROM students
    WHERE name ~ '^Bil.'; -- PostgreSQL regular expression match
1 rows affected.
Out[20]:
name gpa age dept gender
Bill Gates 2.0 60 CS M

Suppose Mark logged into the database and tried to give himself a 5.0? Uncomment the following line to see what happens:

In [21]:
# %%sql 

# UPDATE students
#     SET gpa = 1.0 + gpa
#     WHERE LOWER(name) LIKE '%zuck%';

The above code fails. Why? The update would push Mark's GPA to 5.0, violating the CHECK constraint we declared when creating the table (gpa must be between 0.0 and 4.0).

Reviewing our table

In [22]:
%%sql $default_db

SELECT * FROM students
9 rows affected.
Out[22]:
name gpa age dept gender
Sergey Brin 2.8 40 CS M
Danah Boyd 3.9 35 CS F
Hillary Mason 4.0 35 DATASCI F
Mike Olson 3.7 50 CS M
Mark Zuckerberg 4.0 30 CS M
Cheryl Sandberg 4.0 47 BUSINESS F
Susan Wojcicki 4.0 46 BUSINESS F
Marissa Meyer 4.0 45 BUSINESS F
Bill Gates 2.0 60 CS M

Notice two things:

  1. If you are using Postgres the rows likely came back in a different order than before. Remember: relations do not have a defined order, and in fact two different orders are just two ways of describing the same relation!
  2. Note the relational style of the update statement: we decide which rows get updated based entirely on the values in each row, as checked by the where clause. There is no notion of any information outside the values in the row--e.g. there are no "object identifiers" or "row numbers"... everything is just the data and only the data.

Deleting Records

We can delete rows in much the same way we update rows:

In [23]:
%%sql $default_db

DELETE FROM students 
    WHERE name = 'Sergey Brin'
1 rows affected.
Out[23]:
[]
In [24]:
%%sql $default_db

SELECT * FROM students;
8 rows affected.
Out[24]:
name gpa age dept gender
Danah Boyd 3.9 35 CS F
Hillary Mason 4.0 35 DATASCI F
Mike Olson 3.7 50 CS M
Mark Zuckerberg 4.0 30 CS M
Cheryl Sandberg 4.0 47 BUSINESS F
Susan Wojcicki 4.0 46 BUSINESS F
Marissa Meyer 4.0 45 BUSINESS F
Bill Gates 2.0 60 CS M

Restoring Sergey

In [25]:
%%sql $default_db

INSERT INTO students VALUES
      ('Sergey Brin', 4.0, 40, 'CS', 'M');
    
1 rows affected.
Out[25]:
[]

SELECT Queries

Now let's start looking at some slightly more interesting queries. The canonical SQL query block includes the following clauses, in the order they appear. Square brackets indicate optional clauses.

SELECT ...
  FROM ...
[WHERE ...]
[GROUP BY ...]
[HAVING ...]
[ORDER BY ...]
[LIMIT ...];

Query blocks can reference one or more tables, and be nested in various ways. Before we worry about multi-table queries or nested queries, we'll work our way through examples that exercise all of these clauses on a single table.
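
As a hedged preview, here is one query that touches every clause at once, run through pandas (reusing the SQLAlchemy engine from earlier; the per-clause comments are mine):

# One query exercising the full canonical block.
pd.read_sql("""
    SELECT dept, AVG(gpa) AS avg_gpa   -- SELECT list (with an aggregate)
    FROM students                      -- FROM: the input table
    WHERE gender = 'F'                 -- WHERE: filter rows
    GROUP BY dept                      -- GROUP BY: form groups
    HAVING COUNT(*) >= 1               -- HAVING: filter groups
    ORDER BY avg_gpa DESC              -- ORDER BY: sort the output
    LIMIT 3;                           -- LIMIT: cap the number of rows
""", engine)

Logically the clauses are evaluated roughly in FROM, WHERE, GROUP BY, HAVING, SELECT, ORDER BY, LIMIT order, which is why the sections below walk through them one at a time.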

The SELECT LIST

The SELECT list determines which columns to include in the output.

In [26]:
%%sql $default_db

SELECT name
FROM students;
9 rows affected.
Out[26]:
name
Danah Boyd
Hillary Mason
Mike Olson
Mark Zuckerberg
Cheryl Sandberg
Susan Wojcicki
Marissa Meyer
Bill Gates
Sergey Brin

Functions in the Selection List

SQL has a wide range of functions that can be applied to each attribute in the select list. Notice that we can alias (name) the columns with AS. The complete list of built-in PostgreSQL functions is available in the PostgreSQL documentation.

In [27]:
%%sql $default_db

SELECT UPPER(name) AS n, LOWER(dept) as d, gpa * 4.0 AS four_gpa
FROM students;
9 rows affected.
Out[27]:
n d four_gpa
DANAH BOYD cs 15.6
HILLARY MASON datasci 16.0
MIKE OLSON cs 14.8
MARK ZUCKERBERG cs 16.0
CHERYL SANDBERG business 16.0
SUSAN WOJCICKI business 16.0
MARISSA MEYER business 16.0
BILL GATES cs 8.0
SERGEY BRIN cs 16.0

Selecting Distinct Rows

As we know, SQL is a multiset logic, preserving the meaning of the number of duplicates in query results. Sometimes, however, we don't want to keep the duplicates; we want to eliminate them. This is done simply by adding the keyword DISTINCT after the SELECT keyword:

In [28]:
%%sql $default_db

SELECT DISTINCT dept
    FROM students
3 rows affected.
Out[28]:
dept
CS
BUSINESS
DATASCI

Which rows are used when taking the distinct entries? Does it really matter?

The WHERE Clause

The WHERE clause determines which rows to include by specifying a predicate (a boolean expression). Rows (tuples) that satisfy this expression are returned.

In [29]:
%%sql $default_db

SELECT name, gpa
    FROM students
    WHERE dept = 'CS'
5 rows affected.
Out[29]:
name gpa
Danah Boyd 3.9
Mike Olson 3.7
Mark Zuckerberg 4.0
Bill Gates 2.0
Sergey Brin 4.0

And of course we can specify both rows and columns explicitly. If we have a primary key, we can filter things down to even the cell level via a select list of one column, and a where clause checking equality on the primary key columns:

In [30]:
%%sql $default_db

SELECT gpa
FROM students
WHERE name = 'Bill Gates';
    
1 rows affected.
Out[30]:
gpa
2.0

Note that even this "single-celled" response still has a uniform data type of a relation.

SQL is Closed Over Tables: SQL expressions take in tables and always produce tables. How does this compare to Pandas?
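
Pandas, by contrast, is not closed over DataFrames: the result type depends on how you slice. A quick sketch using the students DataFrame we pulled down earlier:

# SQL always returns a relation; pandas can leave "table-land":
students[students['name'] == 'Bill Gates']                  # DataFrame (a table)
students[students['name'] == 'Bill Gates']['gpa']           # Series (a column)
students[students['name'] == 'Bill Gates']['gpa'].iloc[0]   # scalar (a cell)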

Now that you can slice and dice tables into columns, rows and cells, you have enough knowledge to poke around in a database. Let's move on to skills that you'll need as a data scientist.

Group By Aggregation

GROUP BY aggregation in SQL is a lot like the groupby in Pandas. SQL provides a family of aggregate functions for use in the select clause. In the simplest form, queries with aggregates in the select clause generate a single row of output, with each aggregate function performing a summary of all the rows of input. You can have many aggregate functions in your select clause.

A list of built-in aggregate functions is available in the PostgreSQL documentation. In the following query we compute the average GPA as well as the number of students in each department:

In [31]:
%%sql $default_db

SELECT dept, AVG(gpa) as avg_gpa, COUNT(*)
    FROM students
    GROUP BY dept
3 rows affected.
Out[31]:
dept avg_gpa count
CS 3.52 5
BUSINESS 4.0 3
DATASCI 4.0 1

We can use the HAVING clause to apply a predicate to groups.

In [32]:
%%sql $default_db

SELECT dept, AVG(gpa) as avg_gpa, COUNT(*)
    FROM students
    GROUP BY dept
    HAVING COUNT(*) >= 2
2 rows affected.
Out[32]:
dept avg_gpa count
CS 3.52 5
BUSINESS 4.0 3
In [33]:
%%sql  $default_db

SELECT dept, AVG(gpa) as avg_gpa, COUNT(*)
    FROM students
    WHERE gender = 'F'
    GROUP BY dept
    HAVING COUNT(*) >= 2
1 rows affected.
Out[33]:
dept avg_gpa count
BUSINESS 4.0 3

Ordering the output

As a nicety, SQL allows you to order your output rows, in either ascending (ASC) or descending (DESC) order of the values in columns. For example:

In [34]:
%%sql $default_db

SELECT *
FROM students
ORDER BY gpa;
9 rows affected.
Out[34]:
name gpa age dept gender
Bill Gates 2.0 60 CS M
Mike Olson 3.7 50 CS M
Danah Boyd 3.9 35 CS F
Cheryl Sandberg 4.0 47 BUSINESS F
Susan Wojcicki 4.0 46 BUSINESS F
Marissa Meyer 4.0 45 BUSINESS F
Sergey Brin 4.0 40 CS M
Hillary Mason 4.0 35 DATASCI F
Mark Zuckerberg 4.0 30 CS M
In [35]:
%%sql $default_db

SELECT *
FROM students
ORDER BY gpa, age;
9 rows affected.
Out[35]:
name gpa age dept gender
Bill Gates 2.0 60 CS M
Mike Olson 3.7 50 CS M
Danah Boyd 3.9 35 CS F
Mark Zuckerberg 4.0 30 CS M
Hillary Mason 4.0 35 DATASCI F
Sergey Brin 4.0 40 CS M
Marissa Meyer 4.0 45 BUSINESS F
Susan Wojcicki 4.0 46 BUSINESS F
Cheryl Sandberg 4.0 47 BUSINESS F
In [36]:
%%sql $default_db

SELECT *
FROM students
ORDER BY gpa DESC, age ASC;
9 rows affected.
Out[36]:
name gpa age dept gender
Mark Zuckerberg 4.0 30 CS M
Hillary Mason 4.0 35 DATASCI F
Sergey Brin 4.0 40 CS M
Marissa Meyer 4.0 45 BUSINESS F
Susan Wojcicki 4.0 46 BUSINESS F
Cheryl Sandberg 4.0 47 BUSINESS F
Danah Boyd 3.9 35 CS F
Mike Olson 3.7 50 CS M
Bill Gates 2.0 60 CS M

LIMIT Clause

The LIMIT clause limits the number of rows returned. Which rows are returned? That depends on the order of the rows, which is arbitrary beyond anything specified by an ORDER BY clause.

Is this a random sample? NO

Why do we use the LIMIT clause? Often the database we are querying is massive and retrieving the entire table as we are debugging the query can be costly in time and system resources. However, we should avoid using LIMIT when constructing a sample of the data.

In [37]:
%%sql
SELECT * FROM students LIMIT 3
3 rows affected.
Out[37]:
name gpa age dept gender
Danah Boyd 3.9 35 CS F
Hillary Mason 4.0 35 DATASCI F
Mike Olson 3.7 50 CS M

Data Generation

It is often assumed that when working with a database, all relations (tables) must come from outside or be derived from other sources of data. However, it is also possible to construct tables directly in SQL.

Sometimes it's useful to auto-generate data in queries, rather than examine data in the database. This is nice for testing, but also can be useful to play some computational tricks as you'll see in your homework.

SQL has a simple scalar function called random that returns a random value between 0.0 and 1.0. You can use this if you need to generate a column of random numbers. (The PostgreSQL manual doesn't promise much about the statistical properties of this random number generator.)

Let's roll a 6-sided die for each of the students

In [38]:
%%sql $postgresql_uri

SELECT *, ROUND(RANDOM() * 6) as roll_dice 
FROM students;
9 rows affected.
Out[38]:
name gpa age dept gender roll_dice
Danah Boyd 3.9 35 CS F 4.0
Hillary Mason 4.0 35 DATASCI F 1.0
Mike Olson 3.7 50 CS M 4.0
Mark Zuckerberg 4.0 30 CS M 3.0
Cheryl Sandberg 4.0 47 BUSINESS F 2.0
Susan Wojcicki 4.0 46 BUSINESS F 5.0
Marissa Meyer 4.0 45 BUSINESS F 2.0
Bill Gates 2.0 60 CS M 2.0
Sergey Brin 4.0 40 CS M 6.0

Is this a good implementation of a fair 6-sided die?

Suppose we want to generate a whole bunch of random numbers, not tied to any particular stored table -- can we do that in SQL?

SQL has a notion of table-valued functions: functions that return tables, and hence can be used in a FROM clause of a query. The standard table-valued function is called generate_series, and it's much like numpy's arange:

In [39]:
%%sql $postgresql_uri

SELECT * 
FROM generate_series(1,5);
5 rows affected.
Out[39]:
generate_series
1
2
3
4
5
In [40]:
%%sql $postgresql_uri

SELECT * 
FROM generate_series(1,10, 2);
5 rows affected.
Out[40]:
generate_series
1
3
5
7
9

So to generate 5 random real numbers between 0 and 6, we might use this SQL:

In [41]:
%%sql $postgresql_uri

SELECT trial, (6*RANDOM()) AS rando
FROM generate_series(1, 5) AS flip(trial);
5 rows affected.
Out[41]:
trial rando
1 5.57499708328396
2 2.05683271866292
3 3.23084431327879
4 5.27568647358567
5 0.187157448381186

Let's test the distribution of our earlier generator:

In [42]:
%%sql $postgresql_uri


SELECT ROUND(6*RANDOM()) AS rando, COUNT(*)
FROM generate_series(1, 100000) AS flip(trial)
GROUP BY rando
ORDER BY count
7 rows affected.
Out[42]:
rando count
6.0 8171
0.0 8371
5.0 16459
2.0 16637
4.0 16678
1.0 16753
3.0 16931

And if we want integers, we can use a PostgreSQL typecast operator (postfix ::<type>):

In [43]:
%%sql $postgresql_uri

-- NOTE WE ALSO TAKE THE CEIL 

SELECT CEIL(6*RANDOM())::INTEGER AS rando, COUNT(*)
FROM generate_series(1, 100000) AS flip(trial)
GROUP BY rando
ORDER BY count
6 rows affected.
Out[43]:
rando count
3 16499
1 16519
4 16625
6 16705
2 16809
5 16843

Making a Random Matrix in SQL?!

Now suppose we want to populate a "matrix" relation my_matrix(x, y, val) full of random values. In Python during Lecture 7 we used np.random.randn(3,2).

In [44]:
import numpy as np
# normally distributed random numbers, mean 0 variance 1
np.random.randn(3,2)
Out[44]:
array([[-0.72298885, -0.23577457],
       [-1.30124943,  0.49493354],
       [ 0.6366525 , -0.69032592]])

In this relational version we need to explicitly generate the x and y values. We can do this via SQL's built-in cartesian product!

In [45]:
%%sql $postgresql_uri

SELECT rows.x, columns.y, random() AS val
  FROM generate_series(0,2) AS rows(x),
       generate_series(0,1) AS columns(y);
6 rows affected.
Out[45]:
x y val
0 0 0.405977200716734
0 1 0.206699903588742
1 0 0.806940271519125
1 1 0.139019494410604
2 0 0.144303449429572
2 1 0.517987427301705

We may want to store a matrix as a table—in which case we should set up the schema properly to ensure that it remains a legal matrix.

In [46]:
%%sql $postgresql_uri

DROP TABLE IF EXISTS my_matrix;

CREATE TABLE my_matrix(x INTEGER, y INTEGER, val FLOAT, PRIMARY KEY(x,y));

INSERT INTO my_matrix
SELECT rows.x, columns.y, random() AS val
  FROM generate_series(0,2) AS rows(x),
       generate_series(0,1) AS columns(y);
        
SELECT * FROM my_matrix;
Done.
Done.
6 rows affected.
6 rows affected.
Out[46]:
x y val
0 0 0.27934679063037
0 1 0.887475026305765
1 0 0.506572833284736
1 1 0.306468438357115
2 0 0.0938575188629329
2 1 0.238600226584822

A few take-aways from the previous cell:

  • Notice the schema of my_matrix reflects the fact that val is a function of the row (x) and column (y) IDs.
  • We've said before that the order of rows in a table isn't defined in SQL. Is this relational representation of a "matrix" faithful to the mathematical definition of a matrix? Why or why not?
  • Notice the INSERT statement, which contains a SELECT query rather than the VALUES we saw before. You might want to experiment and see what would happen if the SELECT query produces a different schema than my_matrix: try having it produce too few columns, too many columns, columns in different orders, etc.
  • In the INSERT...SELECT statement, notice the definition of output column names via the AS in the SELECT clause. Is that necessary here?
  • In the INSERT...SELECT statement, notice the definition of table and column names in the FROM clause via AS, and the way they get referenced in the SELECT clause. Do we need the tablenames specified in the SELECT clause? Try it and see!
  • Count the rows in the output...does it look good?

User-defined functions (UDFs)

Sometimes we may want a custom scalar function that isn't built into SQL. Some database systems allow you to register your own user-defined functions (UDFs) in one or more programming languages. Conveniently, PostgreSQL allows us to register user-defined functions written in Python. Be aware of two things:

  1. Calling Python for each row in a query is quite a bit slower than using the pre-compiled built-in functions in SQL ... this is akin to the use of Python loops instead of numpy calls. If you can avoid using Python UDFs you should do so to get better performance.

  2. Python is a full-featured programming language with access to your operating system's functionality, which means it can reach outside of the scope of the query and wreak havoc, including running arbitrary UNIX commands. (PostgreSQL refers to this as an untrusted language.) Be very careful with the Python UDFs you use in your Postgres queries! If you want to be safer, write UDFs in a trusted language. PostgreSQL has a number of other languages to choose from, including Java and even R.

First we tell PostgreSQL we want to use the plpythonu package (so named because of "pl" for "programming language", "u" for "untrusted"):

In [47]:
%%sql $postgresql_uri

CREATE EXTENSION IF NOT EXISTS plpythonu;
Done.
Out[47]:
[]

Now let's write some trivial Python code and register it as a UDF using the create function command. Since SQL is a typed language, we need to specify the SQL types for the input and output to our function, in addition to the code (within $$ delimiters) and the language:

In [48]:
%%sql $postgresql_uri

DROP FUNCTION IF EXISTS fib(x INTEGER);

CREATE FUNCTION fib(x INTEGER) RETURNS INTEGER
AS $$
def fib(x):
    if x < 2:
        return x
    else:
        return fib(x-1) + fib(x-2)
return fib(x)
$$ LANGUAGE plpythonu;
Done.
Done.
Out[48]:
[]
In [49]:
%%sql $postgresql_uri

SELECT x, fib(x)
FROM generate_series(1,10) AS row(x);
10 rows affected.
Out[49]:
x fib
1 1
2 1
3 2
4 3
5 5
6 8
7 13
8 21
9 34
10 55

A Discussion on Transactions

It is possible to create transactions that isolate changes. This is done by starting a transaction with BEGIN. We can then proceed to make changes to the database; others will not be able to see our changes until we end the transaction with ROLLBACK or COMMIT:

BEGIN;

UPDATE profs SET luckynumber = 888 WHERE lastname = 'Gonzalez';

SELECT * FROM profs;

ROLLBACK;

SELECT * FROM profs;

Try running this in the postgres shell...
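
The same experiment can also be scripted from Python; a sketch using SQLAlchemy's transaction API, run against the students table for convenience:

import sqlalchemy

engine = sqlalchemy.create_engine(postgresql_uri)
conn = engine.connect()
trans = conn.begin()   # BEGIN
conn.execute("UPDATE students SET gpa = 4.0 WHERE name = 'Bill Gates'")
# Inside the transaction, our connection sees the change:
print(conn.execute("SELECT gpa FROM students WHERE name = 'Bill Gates'").fetchall())
trans.rollback()       # ROLLBACK discards the update...
# ...and the old value is restored:
print(conn.execute("SELECT gpa FROM students WHERE name = 'Bill Gates'").fetchall())
conn.close()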

Descriptive Statistics in SQL

Statistics doesn't deal with individuals, it deals with groups: distributions, populations, samples and the like. As such, computing statistics in SQL focuses heavily on aggregation functions.

All SQL systems have simple descriptive statistics built in as aggregation functions:

  • min, max
  • count
  • sum
  • avg
  • stddev and variance, the sample standard deviation and variance.

PostgreSQL offers many more. Some handy ones, demonstrated in the sketch after this list, include

  • stddev_pop and var_pop: the population standard deviation and variance, which you should use rather than stddev and variance if you know your data is the full population, not a sample.
  • covar_samp and covar_pop: sample and population covariance
  • corr, Pearson's correlation coefficient
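
For example, a quick sketch computing several of these over the students table via pandas (a PostgreSQL connection is assumed, since corr and stddev_pop are not built into SQLite; the column aliases are mine):

pd.read_sql("""
    SELECT COUNT(gpa)      AS n,
           AVG(gpa)        AS mean_gpa,
           STDDEV(gpa)     AS sample_sd,
           STDDEV_POP(gpa) AS pop_sd,
           CORR(gpa, age)  AS gpa_age_corr
    FROM students;
""", sqlalchemy.create_engine(postgresql_uri))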

Order Statistics: Aggregates requiring ordered input

You'll notice that a number of handy statistics are missing from this list, including the median and quartiles. That's because those are order statistics: they are defined based on an ordering of the values in a column.

SQL provides for this by allowing what it calls "ordered set functions", which require a WITHIN GROUP (ORDER BY <columns>) clause to accompany the order-statistic aggregate. For example, to compute the 50th percentile (median) in SQL we can use the following (the 25th and 75th percentiles appear shortly):

In [50]:
%%sql $postgresql_uri

SELECT 
    percentile_cont(0.5) WITHIN GROUP (ORDER BY x) 
FROM generate_series(1,10) AS data(x);
1 rows affected.
Out[50]:
percentile_cont
5.5

There are two versions of the percentile function:

  • percentile_cont (continuous): interpolates between values
  • percentile_disc (discrete): returns an actual entry from the table

What will the following expressions return?

In [51]:
%%sql $postgresql_uri

SELECT 
    percentile_disc(0.5) WITHIN GROUP (ORDER BY x) 
FROM generate_series(1,10) AS data(x);
1 rows affected.
Out[51]:
percentile_disc
5

We can compute the edges and middle of the box in a box plot:

In [52]:
%%sql $postgresql_uri
SELECT 
    percentile_disc(0.25) WITHIN GROUP (ORDER BY x) as lower_quartile,
    percentile_disc(0.5) WITHIN GROUP (ORDER BY x) as median,
    percentile_disc(0.75) WITHIN GROUP (ORDER BY x) as upper_quartile
FROM generate_series(1,10) AS data(x);
1 rows affected.
Out[52]:
lower_quartile median upper_quartile
3 5 8

Working with Real Data in psql

In a separate notebook (load_fec.ipynb) you'll find the commands to load publicly-available campaign finance data from the Federal Election Commission into a PostgreSQL database.

To see what we have in the database, it's simplest to use the PostgreSQL shell command psql to interact with the database. You can run man psql to learn more about it. A few handy tips:

  1. psql supports some useful non-SQL "meta-"commands, which you access via backslash (\). To find out about them, run psql in a bash shell, and at the prompt you can type \?.
  2. psql has builtin documentation for SQL. To see that, at the psql prompt type \help.
  3. psql is an interactive SQL shell, so it is not well suited for use inside a Jupyter notebook. If you want to invoke it within a Jupyter notebook, you should use !psql -c <SQL statement> -- the -c flag tells psql to run the SQL statement and then exit:
In [53]:
!psql ds100 -c "select * from students;"
      name       | gpa | age |   dept   | gender 
-----------------+-----+-----+----------+--------
 Danah Boyd      | 3.9 |  35 | CS       | F
 Hillary Mason   |   4 |  35 | DATASCI  | F
 Mike Olson      | 3.7 |  50 | CS       | M
 Mark Zuckerberg |   4 |  30 | CS       | M
 Cheryl Sandberg |   4 |  47 | BUSINESS | F
 Susan Wojcicki  |   4 |  46 | BUSINESS | F
 Marissa Meyer   |   4 |  45 | BUSINESS | F
 Bill Gates      |   2 |  60 | CS       | M
 Sergey Brin     |   4 |  40 | CS       | M
(9 rows)

Let's see what tables we have in our database after loading the FEC data:

In [54]:
!psql ds100 -c "\d"
             List of relations
 Schema |     Name      | Type  |  Owner   
--------+---------------+-------+----------
 public | boats         | table | jegonzal
 public | ccl           | table | jegonzal
 public | cm            | table | jegonzal
 public | cn            | table | jegonzal
 public | indiv_sample  | table | jegonzal
 public | indiv_sample2 | table | jegonzal
 public | individual    | table | jegonzal
 public | my_matrix     | table | jegonzal
 public | oppexp        | table | jegonzal
 public | pas           | table | jegonzal
 public | profs         | table | jegonzal
 public | rando         | view  | jegonzal
 public | reserves      | table | jegonzal
 public | sailors       | table | jegonzal
 public | students      | table | jegonzal
 public | tips          | table | jegonzal
(16 rows)

And let's have a look at the individual table's schema:

In [55]:
!psql ds100 -c "\d individual"
              Table "public.individual"
     Column      |         Type          | Modifiers 
-----------------+-----------------------+-----------
 cmte_id         | character varying(9)  | 
 amndt_ind       | character(1)          | 
 rpt_tp          | character varying(3)  | 
 transaction_pgi | character(5)          | 
 image_num       | bigint                | 
 transaction_tp  | character varying(3)  | 
 entity_tp       | character varying(3)  | 
 name            | text                  | 
 city            | text                  | 
 state           | character(2)          | 
 zip_code        | character varying(12) | 
 employer        | text                  | 
 occupation      | text                  | 
 transaction_dt  | character varying(9)  | 
 transaction_amt | integer               | 
 other_id        | text                  | 
 tran_id         | text                  | 
 file_num        | bigint                | 
 memo_cd         | text                  | 
 memo_text       | text                  | 
 sub_id          | bigint                | 

If you are curious about the meaning of these columns, check out the FEC data description.

How big is this table?

In [56]:
%%sql $postgresql_uri

SELECT COUNT(*)
FROM individual 
1 rows affected.
Out[56]:
count
20347829

Browsing Tables: LIMIT and sampling

This is not the first topic usually taught in SQL, but it's extremely useful for exploration.

OK, now we have some serious data loaded and we're ready to explore it.

Database tables are often big--hence the use of a database system. When browsing them at first, we may want to look at exemplary rows: e.g., an arbitrary number of rows, or a random sample of the rows.

To look at all of the data in the individual table, we would simply write:

SELECT *
FROM individual;

But that would return 20,347,829 rows into our Jupyter notebook's memory, and perhaps overflow the RAM in your computer. Instead, we could limit the size of the output to the first few rows as follows:

In [57]:
%%sql $postgresql_uri

SELECT *
FROM individual 
LIMIT 4;
4 rows affected.
Out[57]:
cmte_id amndt_ind rpt_tp transaction_pgi image_num transaction_tp entity_tp name city state zip_code employer occupation transaction_dt transaction_amt other_id tran_id file_num memo_cd memo_text sub_id
C00257642 N Q1 P 15020148840 15 IND MASON, MARCIA SEATTLE WA 98199 ESTERLINE CORPORATION VP & GENERAL COUNSEL 01072015 1250 None SA0430158840438 1006120 None None 2043020151242368282
C00257642 N Q1 P 15020148670 15 IND DARRAH, ELIZABETH D MERCER ISLAND WA 98040 NOT EMPLOYED RETIRED 02232015 350 None SA0430158670310 1006120 None None 2043020151242368154
C00257642 N Q1 P 15020148670 15 IND DASCHLE, TOM WASHINGTON DC 20037 THE DASCHLE GROUP CONSULTANT 03312015 1000 None SA0430158670311 1006120 None None 2043020151242368155
C00257642 N Q1 P 15020148670 15 IND DAUDON, MAUD SEATTLE WA 98112 THE SEATTLE METROPOLITAN CHAMBER OF CO PRESIDENT AND CEO 03052015 1000 None SA0430158670312 1006120 None None 2043020151242368156

Some notes on the limit clause:

  1. Not only does it produce a small output, it's quite efficient: the database system stops iterating over the table after producing the first few rows, saving the work of examining the other roughly 20 million rows.
  2. Recall that relations have no intrinsic order, so this is an arbitrary choice of rows. Two issues to keep in mind:
    1. This is a biased choice of rows. Very likely these are the first rows stored in some disk file managed by the database, which may (for example) be the first rows that were entered into the database, so they may not be representative of rows entered later.
    2. The result is non-deterministic. Given that tables are not guaranteed to have an intrinsic order, it is considered correct for an SQL engine to return any rows that satisfy this query, and to return different rows each time depending on the cached data.

Constructing a Bernoulli Sample

As data scientists, we should be concerned about spending much time looking at a biased subset of our data. Instead, we might want an i.i.d. random sample of the rows in the table. There are various methods for sampling from a table. A simple one built into many database systems including PostgreSQL is Bernoulli sampling, in which the decision to return each row is made randomly and independently. As a metaphor, the database engine "flips a coin" for each row to decide whether to return it. We can influence the sampling rate by choosing the probability of a "true" result of the coinflip.

This is done on a per-table basis in the FROM clause of the query like so:

In [58]:
%%sql $postgresql_uri
SELECT *
FROM individual TABLESAMPLE BERNOULLI(.00001) REPEATABLE(42);
2 rows affected.
Out[58]:
cmte_id amndt_ind rpt_tp transaction_pgi image_num transaction_tp entity_tp name city state zip_code employer occupation transaction_dt transaction_amt other_id tran_id file_num memo_cd memo_text sub_id
C00078451 N 12G P 201611039037111950 15 IND BARRETO, MICHAEL RANDOLPH MA 02368 GD INFORMATION TECHNOLOGY MANAGER INFORMATION SYSTEMS 10192016 10 None PR1730464950915 1123284 None P/R DEDUCTION ($10.00 BI-WEEKLY) 4113020161356319420
C00401224 A M5 P 201702089044946205 24T IND THOMAS, ROBERT GAINESVILLE FL 32607 UNIVERSITY OF FLORIDA PROFESSOR 04292016 200 C00042366 SA11AI_47414707 1148088 None EARMARKED FOR DSCC (C00042366) 4032420171384326562

To learn more about the TABLESAMPLE clause, check out the SELECT docs. Note that there is a second sampling method called block sampling, which is a lot like cluster sampling at the level of pages on disk!
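
For reference, a hedged sketch of block sampling via PostgreSQL's SYSTEM method (same parameters as the query above; because whole disk pages are kept or skipped, the returned rows arrive in clumps):

pd.read_sql("""
    SELECT *
    FROM individual TABLESAMPLE SYSTEM(.00001) REPEATABLE(42);
""", sqlalchemy.create_engine(postgresql_uri))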

Three things to note relative to our previous limit construct:

  1. Bernoulli sampling is slow: it scales linearly with the table size by iterating through every row in the table.
  2. The number of rows returned by Bernoulli sampling is probabilistic. For a table with $n$ rows and a sampling probability $p$, the output size comes from a binomial distribution with mean $np$ and variance $np(1-p)$. For a very small $p$, the variance means we could easily get 0 rows back when trying our query!
  3. If we don't know the size of the table, it's hard to choose a practical sampling probability. First we want to count up the number of rows $n$ (see the discussion of aggregation queries below), to inform us of a good $p$ to choose to get our desired output size. That means yet another full pass of the table to compute the count before we compute the sample!

For these reasons, if we want a proper i.i.d. sample, it's a good idea to compute a nice-sized sample and store it, keeping it reasonably large for more general use. Since we will not be updating any rows in our individual table, we can do this without worrying that the sample will get "out of date" with respect to the content of individual.

We can use the CREATE TABLE AS SELECT ... (a.k.a. CTAS) pattern to create a table that saves the output of a query:

In [59]:
%%sql $postgresql_uri
DROP TABLE IF EXISTS indiv_sample;

CREATE TABLE indiv_sample AS
SELECT *
     FROM individual TABLESAMPLE BERNOULLI(.1) REPEATABLE(42);
Done.
20523 rows affected.
Out[59]:
[]

Here is a more manual way to construct a random sample of a fixed size. Note that this is not as efficient, taking several minutes to complete.

In [60]:
# %%sql $postgresql_uri

# SELECT SETSEED(0.5);

# DROP TABLE IF EXISTS indiv_sample2;

# CREATE TABLE indiv_sample2 AS
# SELECT *, RANDOM() AS u
# FROM individual 
# ORDER BY u
# LIMIT 20000;
In [61]:
%%sql $postgresql_uri

SELECT COUNT(*) FROM indiv_sample2
1 rows affected.
Out[61]:
count
20000
In [62]:
%%sql $postgresql_uri

SELECT * FROM indiv_sample2 LIMIT 5
5 rows affected.
Out[62]:
cmte_id amndt_ind rpt_tp transaction_pgi image_num transaction_tp entity_tp name city state zip_code employer occupation transaction_dt transaction_amt other_id tran_id file_num memo_cd memo_text sub_id u
C00197228 N YE None 201601209004528770 15 IND JOYCE, CHRISTOPHER BRANCHBURG NJ 088763311 THE ANTHEM COMPANIES, INC. V.P. HLTHCARE ANALYTIC PLATFORMS 12312015 50 None 010416-1602 1040846 None None 4012020161260705612 2.46800482273102e-08
C00336768 N 30G P 201612019037590262 15 IND WORTHAM, L.D. SHAWNEE OK 74804 ARVEST BANK SELF-EMPLOYEED 11082016 300 None SA11AI.44880 1127612 None None 4120220161356564307 4.05125319957733e-08
C00011114 N M11 None 201511209003542769 15 IND SCHROEDER, PAUL TRENTON NJ 086100000 AFSCME NJ CN 73 STAFF REPRESENTATIVE 10222015 15 None SA11AI.97674 1033992 None None 4112320151257443570 2.73343175649643e-07
C00000935 A M10 P 201706169056570873 15E IND MARSHALL, BRENDA V. ORO VALLEY AZ 857373483 N/A RETIRED 09112016 6 C00401224 VT4C3RVS9S6 1166132 None * EARMARKED CONTRIBUTION: SEE BELOW EARMARKED THROUGH ACTBLUE 4061620171410060171 3.41329723596573e-07
C00399444 A Q3 P 201701319042196755 15 IND MARTIN, JOSIE SANTA BARBARA CA 93108 SELF AUTHOR 09172016 645 None SA11AI.6774 1144996 None None 4013120171369161826 3.48314642906189e-07

Selecting rows and columns, and calling scalar (per-row) functions.

OK, we already had a peek at the individual table. Now let's look at specific attributes (columns) related to who is donating how much.

In addition to referencing the columns of individual in the select clause, we can also derive new columns by writing field-level (so-called "scalar") functions. Typically we reference some table columns in those functions.

In our case, let's compute the log of transaction_amt for subsequent plotting. SQL comes with many typical functions you can use in this way, and PostgreSQL is particularly rich on this front; see the PostgreSQL manual for details.

We'll look at indiv_sample rather than individual while we're just exploring.

In [63]:
%%sql $postgresql_uri

SELECT name, state, cmte_id,
       transaction_amt, log(transaction_amt)
FROM indiv_sample
LIMIT 10;
10 rows affected.
Out[63]:
name state cmte_id transaction_amt log
HARRIS, ANN MO C00431304 50 1.69897000433602
MOEN, ERIK P. MR WA C00012880 500 2.69897000433602
DYKHOUSE, DANA J. SD C00476853 5400 3.73239375982297
SAUNDERS, SANFORD MR. VA C00554261 2700 3.43136376415899
DUEHN, BARBARA TX C00513077 15 1.17609125905568
PHILLIPS, JANE K. OK C00513077 5 0.698970004336019
MEDOW, JONATHAN IL C00435099 500 2.69897000433602
CAMPBELL, JOSEPH CA C00499392 1000 3.0
SABIN, ANDREW None C00461046 2700 3.43136376415899
HORENKAMP, ELIZABETH PA C00461046 204 2.3096301674259

We can combine SQL with Python in the following way:

In [64]:
query = """
SELECT transaction_amt AS amt
FROM indiv_sample
WHERE transaction_amt > 0;
"""
result = %sql $postgresql_uri $query

sns.distplot(result.DataFrame()['amt'])
20158 rows affected.
Out[64]:
<matplotlib.axes._subplots.AxesSubplot at 0x111ef4240>
In [65]:
query = """
SELECT LOG(transaction_amt) AS log_amt
FROM indiv_sample
WHERE transaction_amt > 0;
"""
result = %sql $postgresql_uri $query

sns.distplot(result.DataFrame()['log_amt'])
scales = np.array([1,10,20,  100, 500,  1000, 5000])
_ = plt.xticks(np.log10(scales), scales)
20158 rows affected.

CASE statements: SQL conditionals in the SELECT clause

What about smaller donations?

In [66]:
# %%sql $postgresql_uri

# SELECT name, state, cmte_id,
#        transaction_amt, LOG(transaction_amt)
# FROM indiv_sample
# WHERE transaction_amt < 10
# LIMIT 10;

Uh oh, log is not defined for numbers <= 0! We need a conditional statement in the select clause to decide what function to call. We can use SQL's case construct for that.

In [67]:
%%sql $postgresql_uri

SELECT name, state, cmte_id, transaction_amt,
    CASE WHEN transaction_amt > 0 THEN log(transaction_amt)
         WHEN transaction_amt = 0 THEN 0
         ELSE -1*(log(abs(transaction_amt)))
    END AS log_magnitude
FROM indiv_sample
WHERE transaction_amt < 10
LIMIT 10;
10 rows affected.
Out[67]:
name state cmte_id transaction_amt log_magnitude
PHILLIPS, JANE K. OK C00513077 5 0.698970004336019
TURKAL, MICHAEL OH C00000885 8 0.903089986991944
WILTBERGER, ARLENE CA C00456335 3 0.477121254719662
ARLENE HEYMAN M.D., P.C. NY C00042366 -250 -2.39794000867204
HARRIS, RACQUEL L. AR C00093054 5 0.698970004336019
TAYLOR, FLINT MR. IL C00342907 1 0.0
ROSENBERGER, BONNIE J MS. FL C00030718 1 0.0
MATSUZAKA, ETSUKO None C00498568 5 0.698970004336019
MENDEZ-PADELFORD, DIANE M MRS CA C00540310 -150 -2.17609125905568
O'BRIEN, JOAN IL C00042366 8 0.903089986991944

Selecting rows: more interesting WHERE clauses

We can choose which rows do and do not appear in the query by putting boolean-valued expressions ("predicates") in the WHERE clause, right after the FROM clause. For example, we might be looking for big donations greater than $1000:

In [68]:
%%sql $postgresql_uri

-- Notice that as we become more selective we can return to the full data

SELECT name, city, state, transaction_amt
FROM individual 
WHERE transaction_amt > 1000 
limit 10;
10 rows affected.
Out[68]:
name city state transaction_amt
MASON, MARCIA SEATTLE WA 1250
ROSEN, STAN SEATTLE WA 1250
ROUVELAS, EMANUEL L WASHINGTON DC 2500
SMITH, MICHAEL D CHEVY CHASE MD 2000
SMITH, TODD A ESQ CHICAGO IL 2700
MURRAY, TERRENCE PALM BEACH FL 2600
LOVE, MARCENA W WINNETKA IL 2500
MAHLER, IRENE SEATTLE WA 1150
ROMEIN, STEVE SEATTLE WA 1250
MCLANE, THOMAS W SPOKANE WA 1500

We can form more complex predicates using Boolean connectives AND, OR and NOT:

In [69]:
%%sql $postgresql_uri

SELECT name, city, state, transaction_amt
FROM individual
WHERE transaction_amt > 1000
    AND (state = 'WI' OR state = 'IL')
    AND NOT (city = 'CHICAGO')
LIMIT 10;
10 rows affected.
Out[69]:
name city state transaction_amt
LOVE, MARCENA W WINNETKA IL 2500
BUGELAS-BRANDT, PATRICE WINNETKA IL 2700
BUGELAS-BRANDT, PATRICE WINNETKA IL 2700
YOUNG, CARLA WINNETKA IL 2700
YOUNG, CARLA WINNETKA IL 2700
CORBOY, PHILIP H JR WINNETKA IL 2700
BRANDT, WILLIAM A JR WINNETKA IL 2700
BRANDT, WILLIAM A JR WINNETKA IL 2700
KEEFE, THOMAS Q JR BELLEVILLE IL 2700
ONDRA, STEPHEN WILMETTE IL 2700

Order by

Finally, by combining ORDER BY and LIMIT we can identify the top campaign contributors.

In [70]:
%%sql $postgresql_uri

SELECT name, ROUND(SUM(transaction_amt)/100.0, 2) total_amt
FROM individual
WHERE city = 'SAN FRANCISCO'
GROUP BY name
ORDER BY total_amt DESC
LIMIT 20;
20 rows affected.
Out[70]:
name total_amt
STEYER, THOMAS F. 897269.44
MOSKOVITZ, DUSTIN 141110.00
TUNA, CARI 66230.00
SANDLER, HERBERT M. 54944.00
THIEL, PETER 40667.00
OBERNDORF, WILLIAM E. MR. 24161.06
SCHWAB, HELEN O. MRS. 15027.00
OBERNDORF, WILLIAM E. 13712.00
WENDT FAMILY TRUST 13500.00
AMERICAN PACIFIC INTERNATIONAL CAPITAL INC. 13000.00
WILLIAMS, SARA 12739.60
MOSKOVITZ, DUSTIN A. 10000.00
BUELL, SUSIE T. 9550.00
WILLIAMS, EVAN 8759.31
NEXT GEN 8000.00
BUELL, SUSIE TOMPKINS 7979.71
OBERNDORF, WILLIAM E 7816.00
SCHWAB, CHARLES R. 6534.00
WILSEY, DIANE B. 5543.80
GREER, JIM 5516.15

Note how the combination of ORDER BY and LIMIT gives you the "top k" results. That's often handy!

What's the granularity of our individual table? Transactions? Examining the schema, it doesn't look like there's a key for the donor. Maybe the image_num is a key? Or the file_num?

To determine this, we need to count up the total number of rows and the number of distinct values that occur in the image_num column. To start with, we'll run the query on our sample for a sanity check.
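
A sketch of that sanity check, assuming the postgres connection from earlier (the column aliases are mine):

pd.read_sql("""
    SELECT COUNT(*)                  AS num_rows,
           COUNT(DISTINCT image_num) AS distinct_image_nums,
           COUNT(DISTINCT file_num)  AS distinct_file_nums
    FROM indiv_sample;
""", sqlalchemy.create_engine(postgresql_uri))

If the distinct counts come back well below num_rows, neither column uniquely identifies a row, so neither can be a key.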

Named Queries: Views and CTEs

Up to now we've looked at a single query at a time. SQL also allows us to nest queries in various ways. In this section we look at the cleaner ways of doing this in SQL: views and Common Table Expressions (CTEs).

Views

In earlier examples, we created new tables and populated them from the result of queries over stored tables. There are two main drawbacks of that approach that may concern us in some cases:

  1. The new table uses up storage, even though it is recomputable from other tables.
  2. The stored output can become out of date: it will not reflect subsequent changes to the input tables.

For this reason, SQL provides a notion of logical views: these are basically named queries that are re-evaluated upon each reference.

The syntax is straightforward:

CREATE VIEW <name> AS
<SELECT statement>;

The resulting view <name> can be used in a SELECT query, but not in an INSERT, DELETE or UPDATE query!

As an example, we might want a view that computes some summary statistics of transaction_amt for each date:

In [71]:
%%sql $postgresql_uri


DROP VIEW IF EXISTS date_stats;

CREATE VIEW date_stats AS
SELECT 
    to_date(transaction_dt, 'MMDDYYYY') as day, -- Date Parsing
    min(transaction_amt), 
    avg(transaction_amt), 
    stddev(transaction_amt),
    max(transaction_amt)
FROM indiv_sample
GROUP BY transaction_dt
ORDER BY day;
Done.
Done.
Out[71]:
[]
In [72]:
%%sql
SELECT * from date_stats limit 5;
5 rows affected.
Out[72]:
day min avg stddev max
2015-01-02 1 178.6666666666666667 278.801602099653 500
2015-01-04 50 150.0000000000000000 141.421356237310 250
2015-01-05 0 12.5000000000000000 17.6776695296636881 25
2015-01-06 15 36.2500000000000000 42.5000000000000000 100
2015-01-07 25 25.0000000000000000 None 25

Notice that this did not create a table:

In [73]:
!psql ds100 -c "\dt"
             List of relations
 Schema |     Name      | Type  |  Owner   
--------+---------------+-------+----------
 public | boats         | table | jegonzal
 public | ccl           | table | jegonzal
 public | cm            | table | jegonzal
 public | cn            | table | jegonzal
 public | indiv_sample  | table | jegonzal
 public | indiv_sample2 | table | jegonzal
 public | individual    | table | jegonzal
 public | my_matrix     | table | jegonzal
 public | oppexp        | table | jegonzal
 public | pas           | table | jegonzal
 public | profs         | table | jegonzal
 public | reserves      | table | jegonzal
 public | sailors       | table | jegonzal
 public | students      | table | jegonzal
 public | tips          | table | jegonzal
(15 rows)

Instead it created a view:

In [74]:
!psql ds100 -c "\dv"
           List of relations
 Schema |    Name    | Type |  Owner   
--------+------------+------+----------
 public | date_stats | view | jegonzal
 public | rando      | view | jegonzal
(2 rows)

We can see more detail about the view using the \d+ option:

In [75]:
!psql ds100 -c "\d+ date_stats"
               View "public.date_stats"
 Column |  Type   | Modifiers | Storage | Description 
--------+---------+-----------+---------+-------------
 day    | date    |           | plain   | 
 min    | integer |           | plain   | 
 avg    | numeric |           | main    | 
 stddev | numeric |           | main    | 
 max    | integer |           | plain   | 
View definition:
 SELECT to_date(indiv_sample.transaction_dt::text, 'MMDDYYYY'::text) AS day,
    min(indiv_sample.transaction_amt) AS min,
    avg(indiv_sample.transaction_amt) AS avg,
    stddev(indiv_sample.transaction_amt) AS stddev,
    max(indiv_sample.transaction_amt) AS max
   FROM indiv_sample
  GROUP BY indiv_sample.transaction_dt
  ORDER BY (to_date(indiv_sample.transaction_dt::text, 'MMDDYYYY'::text));

Views are not materialized

Let's create a view of random numbers, and we will even seed the random number generator.

In [76]:
%%sql $postgresql_uri

SELECT setseed(0.3);

DROP VIEW IF EXISTS rando;

CREATE VIEW rando(rownum, rnd) AS
SELECT rownum, round(random())::INTEGER
FROM generate_series(1,50) AS ind(rownum)
1 rows affected.
Done.
Done.
Out[76]:
[]

What is the sum of the rnd column of rando?

In [77]:
%%sql $postgresql_uri

SELECT SUM(rnd) FROM rando;
1 rows affected.
Out[77]:
sum
19

What was that value again?

In [78]:
%%sql $postgresql_uri

SELECT SUM(rnd) FROM rando;
1 rows affected.
Out[78]:
sum
26


The value changes with each invocation.

Too Many Views

Views can help:

  • Simplify queries
  • Make complex queries more readable
  • Share "sql programs" with others

Problem:

  • Creating a new view for each (exploratory) query will result in a lot of views!
  • views like: temp1, temp1_joey, temp1_joey_fixed, ...

We need a mechanism to decompose a query into named subqueries that exist only for the scope of a single query.

Common Table Expressions (WITH)

Think of these as a view that exists only during the query.

If we're only going to use a view within a single query, it is a little inelegant to CREATE it, and then have to DROP it later to recycle the view name.

Common Table Expressions (CTEs) are like views that we use on-the-fly. (If you know about lambdas in Python, you can think of CTEs as lambda views.) The syntax for CTEs is to use a WITH clause in front of the query:

WITH <name> [(renamed columns)] AS (<SELECT statement>) [, <name2> AS (<SELECT statement>)...]

If you need multiple CTEs, you separate them with commas. We can rewrite our query above without a view as follows:

In [79]:
%%sql $postgresql_uri

WITH per_day_stats AS (
    SELECT 
        to_date(transaction_dt, 'MMDDYYYY') as day, -- Date Parsing
        min(transaction_amt), 
        avg(transaction_amt), 
        stddev(transaction_amt),
        max(transaction_amt)
    FROM indiv_sample
    GROUP BY transaction_dt
)    
SELECT day, stddev
FROM per_day_stats
WHERE stddev IS NOT NULL
ORDER by stddev DESC
LIMIT 1;
1 rows affected.
Out[79]:
day stddev
2015-01-29 156888.68706687

Percentile Queries on States

In [80]:
%%sql $postgresql_uri

SELECT state, 
       percentile_cont(0.25) WITHIN GROUP (ORDER BY transaction_amt) as lower_quartile,
       percentile_cont(0.5) WITHIN GROUP (ORDER BY transaction_amt) as median,
       percentile_cont(0.75) WITHIN GROUP (ORDER BY transaction_amt) as upper_quartile
FROM indiv_sample
GROUP BY state
ORDER BY upper_quartile DESC
LIMIT 10;
10 rows affected.
Out[80]:
state lower_quartile median upper_quartile
LA 25.0 50.0 250.0
WY 25.0 50.0 237.5
DC 25.0 56.0 200.0
MD 25.0 50.0 200.0
SD 26.5 100.0 162.5
NV 25.0 50.0 150.0
AL 25.0 50.0 120.5
PA 19.0 41.0 104.0
TN 20.0 45.0 103.5
AZ 15.0 27.0 100.0