SQL for Data Scientists, Part 2

In [ ]:
%load_ext sql
In [ ]:
%sql postgresql://jmh:@localhost:5432/jmh

Joins

One of the fundamental operations in the relational algebra is the general-purpose theta-join and its more specific variants, equijoin and natural join. Joins are endemic to the relational model.

Consider our FEC schema. Each committee has a single row in the cm table. But each committee is associated with many donations in the individual table. Some models like XML or JSON would let us "nest" many individual donations for a committee into each cm record, but the relational model only allows us to store a single atomic value (i.e. a basic integer, float, string, etc.) in a field of a record. A table with only atomic values in its cells is said to be in first normal form. Nested data is sometimes said to be denormalized, and when you convert it back to flat tables it is normalized.

So how do we combine individual rows and cm rows in SQL? Via an equijoin on the cmte_id columns in each; the result is still a normalized table.

Recall that $R \Join_\theta S = \sigma_\theta(R \times S)$: i.e. theta joins are simple cartesian products followed by a theta selection. SQL's standard syntax is analogous:

SELECT ...
  FROM R, S
 WHERE <theta>;

The same applies to multi-table queries. $R \Join_{\theta_1} S \Join_{\theta_2} T \Join_{\theta_3} U = \sigma_{\theta_1 \wedge \theta_2 \wedge \theta_3} (R \times S \times T \times U)$. Or equivalently, in SQL:

SELECT ...
  FROM R, S, T, U
 WHERE <theta1>
   AND <theta2>
   AND <theta3>;

Most join queries you'll want to write in SQL will have this form; you simply need to match up the columns properly in the WHERE clause.

Basic joins in SQL

To begin, let's join up the indiv_sample and cm tables on cmte_id. We should get one row out for each row of indiv_sample, because cmte_id is a primary key of cm.

In [ ]:
%%sql select *
        from indiv_sample I, cm
       where I.cmte_id = cm.cmte_id
       limit 2;

Note some issues with column naming:

  • The * in the SELECT list returns all the columns of both tables, in the order the tables appear in the FROM clause.
  • In the FROM clause, we introduce the variable I as an alias for indiv_sample, and reuse it in the WHERE clause. (As an alternative syntax, we could also have said from indiv_sample AS I; the AS is optional.)
  • The column name cmte_id appears in more than one table in the FROM clause, so we always need to preface it with a table variable to disambiguate it. (Try and see what happens if you remove, say, the I from I.cmte_id.)

We can put an aggregate over a join query to count rows in the output:

In [ ]:
%%sql select count(*)
       from indiv_sample I, cm
      where I.cmte_id = cm.cmte_id;

Let's compare to indiv_sample without the join to validate our assumption about the output size:

In [ ]:
%sql SELECT count(*) FROM indiv_sample;

If you look at the FEC schema, you'll see that some committees in cm are associated with candidates in cn via the cand_id field. So we can figure out which individuals donated to which candidates by joining the three tables:

In [ ]:
%%sql
SELECT name, cand_name, transaction_amt
  FROM indiv_sample I, cm, cn
 WHERE I.cmte_id = cm.cmte_id
   AND cm.cand_id = cn.cand_id
 ORDER BY transaction_amt desc
 LIMIT 5;

Some notes here:

  • The three columns in the SELECT list happen to each appear in only one table, so we don't need to disambiguate them via table variables like I.name, cn.cand_name, I.transaction_amt.
  • The usual ORDER BY/LIMIT logic applies here as well.

As we learned in class, joins can be reordered (somewhat). It's interesting to see what the query optimizer chooses for a join order. We can view this by using EXPLAIN in front of our query:

In [ ]:
%%sql
EXPLAIN 
SELECT name, cand_name, transaction_amt
  FROM indiv_sample I, cm, cn
 WHERE I.cmte_id = cm.cmte_id
   AND cm.cand_id = cn.cand_id;

On my machine, it appears that it chose to do indiv_sample $\Join$ (cm $\Join$ cn).

Self-Joins

Sometimes you want to join a table with itself. For example, we may want to pair up candidate records for the same person running for two different offices under two different party affiliations:

In [ ]:
%%sql
SELECT cn1.cand_name,
       cn1.cand_election_yr, cn1.cand_pty_affiliation, cn1.cand_office, cn1.cand_office_st,
       cn2.cand_election_yr, cn2.cand_pty_affiliation, cn2.cand_office, cn2.cand_office_st
  FROM cn AS cn1, cn AS cn2
 WHERE cn1.cand_name = cn2.cand_name
   AND cn1.cand_office != cn2.cand_office
   AND cn1.cand_pty_affiliation != cn2.cand_pty_affiliation
   AND cn1.cand_election_yr <= cn2.cand_election_yr;

A few notes on the query above:

  • Note the table renaming in the FROM clause to cn1 and cn2. This is required: without distinct aliases there would be no way to tell which copy of cn a column reference refers to.
  • In essence, cn1 and cn2 are table-valued variables; the fact that they're variables over the same table is basically irrelevant to the query semantics.
  • Obviously every column name is ambiguous and requires cn1. or cn2. out front.
  • Why do you suppose the last AND clause is useful?

Outer joins

As discussed, the standard theta join $R \Join_\theta S$ of the relational algebra is a selection over the cross-product, i.e. $\sigma_\theta(R \times S)$. But it's often more intuitive to think of a join as a "lookup", or a nested loop: for each row in one table $R$ (the outer loop), look for theta-matches in another table $S$ (the inner loop).

What if we want to "preserve" rows from the outer table that have no match? This is called an outer join of $R$ and $S$. If an $R$ tuple has no matches in $S$, we preserve it by returning the $R$ tuple padded with NULL values for all the columns in $S$.

The standard version of this is called a left outer join (or LEFT JOIN in SQL), reflecting the idea that the table on the left side of the (infix) join operator is the "outer loop" -- i.e. the one we preserve. This requires introducing an infix notation for joins in SQL, like so:

In [ ]:
%%sql select cm.cmte_id, cm.cmte_nm, cn.cand_id, cn.cand_name
        from cm LEFT JOIN cn 
                  ON cm.cand_id = cn.cand_id
       limit 5;

For convenience, we can also have a RIGHT JOIN that does the same thing, but treats the right-hand-side of the join operator as the table to preserve:

In [ ]:
%%sql select cm.cmte_id, cm.cmte_nm, cn.cand_id, cn.cand_name
        from cn RIGHT JOIN cm
                   ON cm.cand_id = cn.cand_id
       limit 5;

In fact, we can think about joins symmetrically, and preserve non-matching rows from both sides! This is called a FULL JOIN in SQL:

In [ ]:
%%sql select cm.cmte_id, cm.cmte_nm, cn.cand_id, cn.cand_name
        from cn FULL JOIN cm
                  ON cm.cand_id = cn.cand_id
       limit 5;

Having introduced the binary syntax for outer joins, SQL also adds binary syntax for theta joins (which it calls just JOIN, or confusingly, INNER JOIN) and NATURAL JOIN (which doesn't need an ON clause for obvious reasons!) So the following three queries are equivalent:

In [ ]:
%%sql select *
       from indiv_sample I JOIN cm ON I.cmte_id = cm.cmte_id
       limit 3;
In [ ]:
%%sql select *
       from indiv_sample I INNER JOIN cm ON I.cmte_id = cm.cmte_id
       limit 3;
In [ ]:
%%sql select *
       from indiv_sample I NATURAL JOIN cm
       limit 3;

As a matter of software engineering I would discourage you from using NATURAL JOIN ... it's not self-contained to read (you have to go consult the schema to see what a query says), and it's not robust to changes in schema over time.

Putting it all together

We've seen a bunch of different query clauses now, and done some mixing and matching. How do they fit together? The order of evaluation should be thought of like this:

  1. The FROM and WHERE clauses are evaluated to compute selections and joins.
  2. The GROUP BY clause forms groups from the result of the previous step, and the HAVING clause filters those groups.
  3. The SELECT clause is evaluated, including any aggregates.
  4. The ORDER BY clause is evaluated.
  5. The LIMIT clause is used to cut off output production.
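
As an illustration, here's a sketch over indiv_sample annotated with this conceptual order:

In [ ]:
%%sql
SELECT state, count(*) AS cnt   -- 3. evaluate the SELECT list, including aggregates
  FROM indiv_sample             -- 1. evaluate FROM (and WHERE, if present)
 GROUP BY state                 -- 2. form groups ...
HAVING count(*) > 100           --    ... and filter them with HAVING
 ORDER BY cnt DESC              -- 4. sort the result
 LIMIT 3;                       -- 5. cut off output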

Named Queries: Views and CTEs

Up to now we've looked at a single query at a time. SQL also allows us to nest queries in various ways. In this section we look at the cleaner examples of how to do this in SQL: views and Common Table Expressions (CTEs).

Views

In earlier examples, we created new tables and populated them from the result of queries over stored tables. There are two main drawbacks of that approach that may concern us in some cases:

  1. The new table uses up storage, even though it is recomputable from other tables.
  2. If the input tables change, the stored output does not reflect the new state of the input.

For this reason, SQL provides a notion of logical views: these are basically named queries that are re-evaluated upon each reference. They are rather like "macros" if you're familiar with that term.

The syntax is straightforward:

CREATE VIEW <name> AS
<SELECT statement>;

The resulting view <name> can be used in a SELECT query, but not in an INSERT, DELETE or UPDATE query!

As an example, we might want a view that stores just some summary statistics of transaction_amts for each date:

In [ ]:
%%sql
DROP VIEW IF EXISTS date_stats;

CREATE VIEW date_stats AS
SELECT transaction_dt, min(transaction_amt), avg(transaction_amt), stddev(transaction_amt),
       max(transaction_amt)
  FROM indiv_sample
 GROUP BY transaction_dt;

SELECT * from date_stats limit 5;

One of the nice things about views is modularity: if we have a complex query, we can break it up into smaller views and then run queries over the views.

For example, now we can ask for the date with the highest standard deviation of donations:

In [ ]:
%%sql
SELECT transaction_dt, stddev
  FROM date_stats
 WHERE stddev IS NOT NULL
 ORDER BY stddev DESC
 LIMIT 1;

Common Table Expressions (WITH)

If we're only going to use a view within a single query, it is a little inelegant to CREATE it, and then have to DROP it later to recycle the view name.

Common Table Expressions (CTEs) are like views that we use on-the-fly. (If you know about lambdas in Python, you can think of CTEs as lambda views.) The syntax for CTEs is to use a WITH clause in front of the query:

WITH <name> [(renamed columns)] AS (<SELECT statement>) [, <name2> AS (<SELECT statement>)...]

If you need multiple CTEs, you separate them with commas. We can rewrite our query above without a view as follows:

In [ ]:
%%sql
WITH perdate AS
  (SELECT transaction_dt AS date, stddev(transaction_amt) AS std
     FROM indiv_sample
    GROUP BY transaction_dt)
    
SELECT date, std
  FROM perdate
 WHERE std IS NOT NULL
 ORDER by std DESC
 LIMIT 1;
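
Purely for illustration, here's a sketch of the same query split across two comma-separated CTEs, with the second referring to the first:

In [ ]:
%%sql
WITH perdate AS
  (SELECT transaction_dt AS date, stddev(transaction_amt) AS std
     FROM indiv_sample
    GROUP BY transaction_dt),
nonnull AS
  (SELECT date, std FROM perdate WHERE std IS NOT NULL)

SELECT date, std
  FROM nonnull
 ORDER BY std DESC
 LIMIT 1;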

We can of course use views or CTEs in join queries as well, just as if they were tables. For example, we can compute the "argmax" of transaction_amt for indiv_sample: those rows that have the maximum transaction_amt:

In [ ]:
%%sql
WITH biggest_gifts AS
  (SELECT max(transaction_amt) AS max
     FROM indiv_sample)
    
SELECT I.transaction_dt, I.name, I.state, I.transaction_amt
  FROM indiv_sample I, biggest_gifts B
 WHERE I.transaction_amt = B.max;

Nested Queries

It is also possible to nest a SQL query within the WHERE clause of another SQL query: this is usually called a "subquery" or "nested query". Time prevents us from covering subqueries here. It's best if you can avoid them anyhow: they are relatively confusing, they often lead to poor performance, and in most cases there is some way to achieve the same effect without using them.

If you'd like to learn more, you can read the relevant material in the PostgreSQL manual or look at slides from CS186 (slides 35-41).
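
Just so you can recognize the syntax when you encounter it, here is a small sketch over our FEC tables of an IN subquery in the WHERE clause; it could equally be written as a join:

In [ ]:
%%sql
SELECT name, transaction_amt
  FROM indiv_sample
 WHERE cmte_id IN (SELECT cmte_id
                     FROM cm
                    WHERE cand_id IS NOT NULL)
 LIMIT 5;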

Set and Multiset Operators (skip in class)

Like the relational algebra, SQL supports operators for union, intersection, and difference of relations. Because SQL is a multiset (i.e. duplicate-aware) language, it distinguishes between the set-based versions of these operators (which remove duplicates) and the multiset versions (which have rules about the number of duplicates in the output).

The syntax is simple:

<SELECT query>
<set operator>
<SELECT query>;

where the two queries have compatible schemas (the same number of columns, with compatible types), and the set operator is one of:

  • Union: UNION (set) or UNION ALL (multiset)
  • Intersection: INTERSECT (set) or INTERSECT ALL (multiset)
  • Difference: EXCEPT (set) or EXCEPT ALL (multiset).

The set-based versions of these operations are straightforward. Rather than teach the number of duplicates formed for each multiset operator, I'd encourage you to think about what's intuitive, and then test it out yourself!

As an example, you can run the query below to find the individual records that did not make it into our sample. (This query will run slowly).

In [ ]:
%%sql
SELECT * FROM individual
EXCEPT ALL
SELECT * FROM indiv_sample
LIMIT 5;
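
As a starting point for such experiments, here's a sketch of a multiset union over two overlapping selections; it assumes the usual FEC code 'P' for presidential candidates. Try swapping UNION ALL for UNION and watch the duplicates disappear:

In [ ]:
%%sql
SELECT cand_id FROM cn WHERE cand_pty_affiliation = 'DEM'
UNION ALL
SELECT cand_id FROM cn WHERE cand_office = 'P'
LIMIT 5;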

Getting Fancy with SQL Aggregates: Statistics, Windows and UDAs

Simple Descriptive Statistics in SQL

Statistics doesn't deal with individuals; it deals with groups: distributions, populations, samples and the like. As such, computing statistics in SQL focuses heavily on aggregation functions.

All SQL systems have simple descriptive statistics built in as aggregation functions:

  • min, max
  • count
  • sum
  • avg
  • stddev and variance, the sample standard deviation and variance.

PostgreSQL offers many more. Some handy ones include:

  • stddev_pop and var_pop: the population standard deviation and variance, which you should use rather than stddev and variance if you know your data is the full population, not a sample.
  • covar_samp and covar_pop: sample and population covariance
  • corr, Pearson's correlation coefficient
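
For example, here's a sketch computing a few of these over indiv_sample in a single pass:

In [ ]:
%%sql
SELECT count(transaction_amt),
       avg(transaction_amt),
       stddev(transaction_amt),  -- sample standard deviation
       var_pop(transaction_amt)  -- population variance
  FROM indiv_sample;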

Order Statistics: Aggregates requiring ordered input

You'll notice that a number of handy statistics are missing from this list, including the median and quartiles. That's because those are order statistics: they are defined based on an ordering of the values in a column.

SQL provides for this by allowing what it calls "ordered set functions", which require a WITHIN GROUP (ORDER BY <columns>) clause to accompany the order-statistic aggregate. For example, to compute the 25th percentile, 50th percentile (median) and 75th percentile in SQL, we can use the following:

In [ ]:
%%sql
select percentile_cont(0.25) within group (order by transaction_amt) as lower_quartile,
       percentile_cont(0.5) within group (order by transaction_amt) as median,
       percentile_cont(0.75) within group (order by transaction_amt) as upper_quartile
  from indiv_sample;

GROUP BY vs. WITHIN GROUP

Note the difference between WITHIN GROUP and GROUP BY:

  • WITHIN GROUP appears in the SELECT clause, as part of an aggregate expression (not in the FROM or GROUP BY clauses)
  • WITHIN GROUP is associated with a single aggregate function
  • WITHIN GROUP does not affect the number of groups

Side note for database aficionados: If you're clever, you can express order statistics like median in more "purely relational" SQL without resorting to WITHIN GROUP (ORDER BY ...), but (a) it's hard for people to understand, (b) it's very messy to get more than one order statistic in a single query, and (c) it's quite difficult for a query optimizer to understand and make it go fast.

Of course you can combine WITHIN GROUP and GROUP BY to compute order statistics within groups:

In [ ]:
%%sql
select state, 
       percentile_cont(0.25) within group (order by transaction_amt) as lower_quartile,
       percentile_cont(0.5) within group (order by transaction_amt) as median,
       percentile_cont(0.75) within group (order by transaction_amt) as upper_quartile
  from indiv_sample
 group by state
 limit 5;

Window Functions

Sometimes, for each row in the output of a query, you want to perform a calculation on a related set of rows in the output -- often a "window" of rows that precede or follow in some order. Again, this is not very "set-oriented", but SQL provides a mechanism to do it, called a window function. The most common window functions are row_number in some order, rank in some order (where equivalent values in the ordering get the same rank), and ntile(n) in some order, which reports which n-tile the row is in:

In [ ]:
%%sql
select row_number() over (order by transaction_amt desc),
       rank() over (order by transaction_amt desc) as desc_rank,
       ntile(4) over (order by transaction_amt desc) as desc_ntile,
       transaction_amt
  from indiv_sample
 order by row_number
limit 10;

Try changing around the ORDER BY clauses in that query -- switch one or more from DESC to ASC, or change the columns around. What happens? Now put the word EXPLAIN in front of your query: do you see what the database system has to do to satisfy your query?

Sometimes you want to compute rankings within groups. Note that we can't use a normal GROUP BY for this, since we don't want one row per group; we want to know a property of each individual output row with respect to its group. We do this via a PARTITION BY clause in the windowed aggregate:

In [ ]:
%%sql
select rank() over(partition by state order by transaction_amt desc),
       state,
       transaction_amt
  from indiv_sample
 limit 10;

We can also use standard aggregates over these partitions to append grouped aggregates to each row:

In [ ]:
%%sql
select rank() over(partition by state order by transaction_amt desc),
       state,
       avg(transaction_amt) over(partition by state),
       transaction_amt
  from indiv_sample
 order by state
 limit 10;

As a final example, let's reconsider the humble argmax:

In [ ]:
%%sql
with ranked as (
select rank() over(partition by state order by transaction_amt desc),
       transaction_dt, name, state, transaction_amt
  from indiv_sample
)
select * from ranked 
 where rank = 1;

Does this give the same answer as the CTE-based implementation earlier in the notebook? Run EXPLAIN on each to see how the database system executes the different forms of the query.

PARTITION BY vs. GROUP BY vs. WITHIN GROUP

When do we use PARTITION BY? Here are some key things to remember about PARTITION BY:

  • PARTITION BY is used within an OVER clause
  • Each OVER clause is associated with a single window function in the SELECT clause
  • A window function is not an aggregate -- it does not affect the number of rows in the output. Instead it allows an aggregate function to be calculated over a "window" of rows associated with each row.

By contrast:

  • WITHIN GROUP is used to modify a single aggregate function (not a window function) -- specifically an ordered-set aggregate function
  • GROUP BY partitions the input set before aggregation begins, and affects all aggregate functions in the SELECT clause.

More general Windowing

Note that PARTITION BY is a specific way to define a window "frame": based on distinct values.

Sometimes you want to define window frames based on position rather than value: e.g. the running 7-day total of transaction_amt, or the rank of the current transaction_amt among 2 days preceding or following, etc.

This can be done in SQL as well, but we won't go into the details here. You can look at the PostgreSQL manual for more information.
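
For a taste of the syntax, though, here's a sketch of a positional frame: a running sum over the current row and the 6 preceding rows. (This is a row-based stand-in for a true 7-day total, which would require a RANGE frame over a proper date-typed column; transaction_dt here is simply ordered as stored.)

In [ ]:
%%sql
select transaction_dt, transaction_amt,
       sum(transaction_amt) over (order by transaction_dt
                                  rows between 6 preceding and current row)
  from indiv_sample
 limit 10;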

User-Defined Aggregates (UDAs)

Earlier we saw how to register Python code as "user-defined functions" (UDFs) in PostgreSQL. Those were scalar functions: they produce one value of output for each input. We can also register "user-defined aggregates" (UDAs), which produce one value of output for a group of inputs.

A user-defined aggregate is based on managing running state per group. This state is determined by registering three things:

  1. An initial condition of the aggregate state before any values are examined
  2. A state transition function, which is a standard UDF that takes each input value and modifies the state accordingly
  3. A final function that computes the aggregate value based on the final state. If the final function is the identity function, it can be omitted.

As an example, we can define a user-defined aggregate for computing GPAs based on letter grades. The initial condition of the state is the pair (0,0), representing the count and sum of grades so far. The state transition function increments the count, and adds to the sum based on decoding the letter grade into a number between 0 and 4. The final function returns the sum divided by the count, unless the count is 0, in which case it returns NULL.

In [ ]:
%%sql
DROP AGGREGATE IF EXISTS gpa(char);
DROP FUNCTION IF EXISTS final_gpa(avg_state);
DROP FUNCTION IF EXISTS transition_gpa(avg_state, char);
DROP TYPE IF EXISTS avg_state;

CREATE TYPE avg_state AS (cnt integer, total integer);

CREATE FUNCTION transition_gpa(state avg_state, grade char) returns avg_state
AS
$$
# decode a letter grade into a number between 0 and 4
def grade_to_num(g):
    lookup = {
        'A': 4,
        'B': 3,
        'C': 2,
        'D': 1,
        'F': 0
    }
    return lookup.get(g, None)

# update the running (count, total) state with this grade
state["cnt"] += 1
state["total"] += grade_to_num(grade)
return state
$$ LANGUAGE plpythonu;

CREATE FUNCTION final_gpa(state avg_state) returns float
AS
$$
# return the average, or NULL for an empty group
if state["cnt"] == 0:
    return None
else:
    return state["total"] * 1.0 / state["cnt"]
$$ LANGUAGE plpythonu;

CREATE AGGREGATE gpa(char) 
(
    initcond = '(0,0)',
    stype = avg_state,
    sfunc = transition_gpa,
    finalfunc = final_gpa
);
In [ ]:
%%sql
DROP TABLE IF EXISTS gradebook;
CREATE TABLE gradebook(name text, course text, grade char);
INSERT INTO gradebook values
 ('Joe', 'DS100', 'B'),
 ('Mary', 'DS100', 'A'),
 ('Sam', 'DS100', 'D');
select gpa(grade) from gradebook;

Now you can use your UDA wherever you would use a standard aggregate function, including with GROUP BY and other clauses.
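
For example, a quick sketch over our toy gradebook, grouping by student:

In [ ]:
%%sql
select name, gpa(grade)
  from gradebook
 group by name;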

Note that there's a more involved interface for (efficient) implementation of windowed aggregates.

Details on UDAs can be found in the PostgreSQL manual.

Exercises: Statistical Computations in SQL

Given what you've learned so far, here are some things you should be able to do with a table too big to fit into memory in Python:

  1. Compute histogram bins in SQL to feed to a plotting package (see the sketch after this list)
  2. Compute 2-dimensional bins in SQL to feed to a heatmap plot
  3. Compute the summary statistics for box plots
  4. Represent a sparse matrix as a table
  5. Implement vector/matrix operations in SQL on sparse matrices
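
To get you started on the first exercise, here is a hedged sketch using PostgreSQL's width_bucket; the range [0, 10000] and the bin count 20 are arbitrary choices for illustration:

In [ ]:
%%sql
-- 20 equi-width bins over [0, 10000); out-of-range rows land in buckets 0 and 21
SELECT width_bucket(transaction_amt, 0, 10000, 20) AS bucket,
       count(*) AS cnt
  FROM indiv_sample
 GROUP BY bucket
 ORDER BY bucket;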

Indexes

Up to this point we have left performance entirely in the hands of the database system. As it turns out, most relational databases still leave a large degree of performance tuning to the user (or "database administrator").

The main way you can influence performance is to ask the database to build "indexes" on columns. This is particularly useful for columns that you use in your WHERE clauses (theta expressions for selection and join).

Here is an example.

In [ ]:
%%sql 
drop index if exists ind_sample_name_ix;

explain analyze 
select *
  from indiv_sample
 where name = 'HATHAWAY, RANDOLPH';
In [ ]:
%%sql 
create index ind_sample_name_ix on indiv_sample (name); 
In [ ]:
%sql explain analyze select * \
       from indiv_sample \
      where name = 'HATHAWAY, RANDOLPH';

See how the execution time went down by 1-2 orders of magnitude? The index makes lookups (both exact-match lookups and small range lookups) very fast compared to scanning through all the records in the table. Of course it also made query planning ("planning time") slower, since the optimizer had more options to consider.

For a large table like individual this can be an enormous win at query time. Recall that individual contains nearly 40,000,000 rows. I previously built an index on individual.name, and look at the runtime for a lookup query on name now:

In [ ]:
%sql explain analyze select * \
       from individual \
      where name = 'HATHAWAY, RANDOLPH';

Indexes can also improve join performance. For example:

In [ ]:
%%sql 
drop index if exists indiv_sample_cmte;

explain analyze select *
       from indiv_sample I, cm, cn
      where I.cmte_id = cm.cmte_id
        and cm.cand_id = cn.cand_id
      limit 10;
In [ ]:
%%sql 
create index indiv_sample_cmte on indiv_sample(cmte_id);
explain analyze select *
       from indiv_sample I, cm, cn
      where I.cmte_id = cm.cmte_id
        and cm.cand_id = cn.cand_id
      limit 10;

Again, the performance win here would be much more dramatic for the individual table, but I did not take the time and space to build an index on individual.cmte_id yet.

Why not use indexes on all your columns?

  • Indexes take time to build
  • Indexes slow down insertions/deletions/updates
  • Indexes take up disk space
  • In principle you could index any subset of columns -- that's too many indexes!

So the typical practice is to add indexes as you find your work getting too slow.