Syllabus

This syllabus is still under development and is subject to change.

Week Lecture Date Topic
1 1 8/24/2017

Course Overview and The Data Science Life Cycle [Gonzalez]

In this lecture we provide an overview of the class, discuss what it means to be a data scientist by examining recent surveys of data scientists, and then introduce the data science lifecycle spanning question formation, data acquisition and cleaning, exploratory data analysis and visualization, and finally prediction and inference.

[ pptx | pdf | handout | exercise materials | web notebook | optional reading (Data Science and Science) | screencast ]

Homework 1 Released: Python Refresher

2 2 8/29/2017

Data Generation [Nolan]

Fundamentally, (data) science is the study of using data to learn about the world and solve problems. However, how and what data is collected can have a profound impact on what we can learn and the problems we can solve. In this lecture, we will begin to explore various mechanisms for data collection and their implications on our ability to generalize. In particular, we will discuss the differences between censuses, surveys, controlled experiments, and observational studies. We will highlight the power of simple randomization and the fallacies of data scale.

[ pptx | pdf | handout | random sampling notes | random sampling notes (typed) | screencast ]

3 8/31/2017

Data Tables, Indexes, and Pandas [Lau]

While data comes in many forms, most data analyses are performed on tabular data. Mastering the skills of constructing, cleaning, joining, aggregating, and manipulating tabular data is essential to data science. In this lecture, we will introduce Pandas, the open-source Python data manipulation library widely used by data scientists. In addition to introducing new syntax, we will introduce new concepts including indexes, the role of column operations on system performance, and basic tools to begin visualizing data.

[ slides | notebook html | optional reading (Pandas in 10 Minutes) | notebook file | screencast | discussion slides ]
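As a small illustration of the concepts named above (indexes, column operations, and aggregation), here is a sketch using an invented table; the names and scores below are made up for demonstration, not course data:

```python
import pandas as pd

# A small, made-up table of restaurant inspection scores
df = pd.DataFrame({
    "name": ["Taqueria A", "Cafe B", "Taqueria A", "Diner C"],
    "score": [92, 85, 88, 97],
    "year": [2016, 2016, 2017, 2017],
})

# Setting an index enables fast label-based lookup
by_name = df.set_index("name")

# Column (vectorized) operations act on a whole Series at once
df["passed"] = df["score"] >= 90

# Grouped aggregation: mean score per year
mean_by_year = df.groupby("year")["score"].mean()
print(mean_by_year)
```

Vectorized column operations like `df["score"] >= 90` are what make Pandas fast relative to looping over rows one at a time.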

Vitamin 1 Released

3 4 9/5/2017

EDA and Data Cleaning [Gonzalez]

Whether collected by you or obtained from someone else, raw data is seldom ready for immediate analysis. Through exploratory data analysis we can often discover important anomalies, identify limitations in the collection process, and better inform subsequent goal-oriented analysis. In this lecture we will discuss how to identify and correct common data anomalies and their implications on future analysis. We will also discuss key properties of data including structure, granularity, faithfulness, temporality, and scope, and how these properties can inform how we prepare, analyze, and visualize data.

[ pptx | pdf | handout | optional reading | optional interesting tutorial | web notebook | python | data | screencast ]

Lab 2 Released: pandas

Homework 2 Released: Food Safety Data Cleaning and EDA

5 9/7/2017

EDA Continued [Gonzalez]

Continue exploratory data analysis by digging into police records and basic data visualization.

[ pptx | pdf | handout | Joins notebook (html) | Joins notebook (ipynb) | EDA notebook (html) | EDA notebook (ipynb) | data | screencast | discussion notebook ]

Vitamin 2 Released

4 6 9/12/2017

Visualization and Data Transformations [Nolan]

A large fraction of the human brain is devoted to visual perception. As a consequence, visualization is a critical tool in both exploratory data analysis and communicating complex relationships in data. However, making informative and clear visualizations of complex concepts can be challenging. In this lecture, we explore good and bad visualizations and describe how to choose visualizations for various kinds of data and goals.

[ pptx | pdf | handout | optional reading (Seaborn Tutorial) | optional reading (Plotting with Pandas) | screencast ]

Lab 3 Released: Data cleaning and seaborn

Homework 3 Released: Bike Sharing and Multivariate Visualization

7 9/14/2017

Visualization, Ctd. [Nolan]

Directly visualizing data can produce uninformative plots for several reasons: curvilinear relationships can be difficult to assess; large numbers of observations can hide core features; and it can be difficult to visualize large numbers of variables. In this lecture, we discuss techniques of data transformation, smoothing, and dimensionality reduction to address these challenges in creating informative visualizations. With these additional analytics we can often reveal important and informative patterns in data. We pick up with transformations.

[ pptx | pdf | handout | screencast ]
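As a minimal sketch of the transformation idea, consider synthetic data following a power law (the data here is invented for illustration): a log-log transformation turns the curved relationship into a straight line whose slope recovers the exponent.

```python
import numpy as np

# Synthetic data with a power-law relationship y = 2 * x^3,
# which looks strongly curved on the raw scale
x = np.linspace(1, 100, 50)
y = 2 * x ** 3

# After a log-log transformation the relationship is linear:
# log y = log 2 + 3 * log x
slope, intercept = np.polyfit(np.log(x), np.log(y), 1)
print(slope, intercept)  # slope ~ 3, intercept ~ log(2)
```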

Vitamin 3 Released

5 8 9/19/2017

Working with text [Nolan]

Whether in documents, tweets, or records in a table, text data is ubiquitous and presents a unique set of challenges for data scientists. How do you extract key phrases from text? What are meaningful aggregate summaries of text? How do you visualize textual data? In this lecture we will introduce a set of techniques (e.g., bag-of-words) to transform text into numerical data for subsequent tabular analysis. We will also introduce regular expressions as a mechanism for cleaning and transforming text data.

[ pptx | pdf | handout | RegEx notebook (html) | RegEx notebook (ipynb) | data | screencast ]
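A minimal sketch of the two ideas above, using two made-up sentences: a regular expression tokenizes the text, and the bag-of-words representation turns each document into a vector of word counts.

```python
import re
from collections import Counter

docs = [
    "The quick brown fox!",
    "The lazy dog... the quick dog.",
]

# Lowercase and extract word tokens with a regular expression
tokenized = [re.findall(r"[a-z]+", d.lower()) for d in docs]

# Bag-of-words: each document becomes a vector of word counts
vocab = sorted(set(w for doc in tokenized for w in doc))
counts = [Counter(doc) for doc in tokenized]
matrix = [[c[w] for w in vocab] for c in counts]
print(vocab)
print(matrix)
```

The resulting matrix is ordinary tabular data, so all of the earlier tools for tabular analysis apply to it.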

Project 1 Released: Twitter Analysis

Lab 4 Released: Plotting, smoothing, transformation

9 9/21/2017

Modeling and Estimation [Gonzalez]

How do we pick a number to represent a dataset? What makes a number a good representation? The statistician (and data scientist ahead of his time) George Box once wrote “Essentially, all models are wrong, but some are useful.” A key step in data science is developing models that capture the essential signal in data while providing insight into the phenomena that govern the data and enable effective prediction. In this lecture we address the fundamental question of how to choose a number, and more generally a model, that reflects the data. We will introduce the concept of loss functions and begin to develop basic models.

[ pptx | pdf | handout | Estimation Notebook (html) | Estimation Notebook (ipynb) | screencast ]
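The "pick a number" question above has a concrete numerical answer: the constant that minimizes average squared loss is the mean, while the constant that minimizes average absolute loss is the median. A short sketch with an invented dataset (a grid search stands in for the calculus):

```python
import numpy as np

data = np.array([1.0, 2.0, 2.0, 3.0, 10.0])

# Average loss of a constant prediction t under two loss functions,
# evaluated on a fine grid of candidate values
thetas = np.linspace(0, 11, 1101)
sq_loss = [np.mean((data - t) ** 2) for t in thetas]
abs_loss = [np.mean(np.abs(data - t)) for t in thetas]

best_sq = thetas[np.argmin(sq_loss)]
best_abs = thetas[np.argmin(abs_loss)]
print(best_sq, data.mean())       # squared loss is minimized near the mean (3.6)
print(best_abs, np.median(data))  # absolute loss is minimized near the median (2.0)
```

Note how the outlier (10.0) pulls the squared-loss minimizer upward but barely moves the absolute-loss minimizer, one reason the choice of loss function matters.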

Vitamin 4 Released

6 10 9/26/2017

Modeling and Estimation Continued [Gonzalez]

In this lecture we will continue our development of models within the framework of loss minimization. In the process we will begin to study how loss minimization on the data relates to the broader goal of loss minimization on future predictions. Along the way we will review basic concepts in probability (distributions and expectations) and then introduce the concept of empirical risk minimization.

[ pptx | pdf | handout | screencast ]

Project 1 Checkpoint Due

Lab 5 Released: Regular Expression

11 9/28/2017

Population, Sample, and their connection [Nolan]

In this lecture we will study the connections between the population and the sample. We use probability theory, simulation and the bootstrap to make statements about these connections.

[ pptx | pdf | handout | notes on probability and expectation | notes on probability and expectation (typed) | Estimation Notebook (html) | Estimation Notebook (ipynb) | screencast ]

7 12 10/3/2017

Probability theory, Monte Carlo Simulation, and Bootstrapping [Nolan]

We saw in the last lecture that we can study parameter estimators using theoretical and computational approaches. In this lecture, we will delve deeper into the bootstrap to study the behavior of the empirical 75th percentile as an estimator for its population counterpart. We will derive the empirical quantile through optimization of a loss function, show that the population parameter minimizes the expected loss, bootstrap the sampling distribution of the empirical 75th percentile, and use the bootstrapped distribution to provide interval estimates for the population parameter. In addition, we will provide a more comprehensive review of basic probability.

[ pptx | pdf | handout | screencast | Bootstrap Reading (Inferential Thinking) | Confidence Intervals Reading (Inferential Thinking) | Solving Loss Functions Notes ]
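The bootstrap procedure described above can be sketched in a few lines; the sample here is simulated rather than real data, and the interval is the simple percentile interval (one of several bootstrap interval constructions):

```python
import numpy as np

rng = np.random.default_rng(42)
sample = rng.normal(loc=10, scale=2, size=500)  # simulated "observed" sample

# Bootstrap the sampling distribution of the 75th percentile:
# resample with replacement, recompute the statistic each time
boot_stats = np.array([
    np.percentile(rng.choice(sample, size=sample.size, replace=True), 75)
    for _ in range(2000)
])

# A 95% bootstrap percentile confidence interval
lo, hi = np.percentile(boot_stats, [2.5, 97.5])
print(lo, hi)
```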

Project 1 Due

Lab 6 Released: Modeling and Estimation

13 10/5/2017

Hypothesis Testing [Nolan]

A key step in inference is often answering a question about the world. We will consider three such questions to varying degrees of detail:
1) Is there enough evidence to bring someone to trial? 2) Is there evidence of an earth-like planet orbiting a star? 3) Do female TAs get lower teaching evaluations than male TAs? We use hypothesis testing to answer these questions.
In particular, we examine a collection of non-parametric hypothesis tests. These powerful procedures build on the basic idea of random simulation to help quantify the rarity of a particular phenomenon. In the process of using these procedures we will also touch on the challenges of false discovery and multiple testing.

[ pptx | pdf | handout | screencast ]
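One such simulation-based procedure is the permutation test: under the null hypothesis, group labels are exchangeable, so shuffling them tells us how rare the observed difference is. A sketch on simulated data (the groups here are generated with a deliberately large difference):

```python
import numpy as np

rng = np.random.default_rng(0)
group_a = rng.normal(0, 1, 20)
group_b = rng.normal(0, 1, 20) + 3  # simulate a large, real difference

observed = group_b.mean() - group_a.mean()

# Under the null hypothesis the labels are exchangeable,
# so shuffle them and recompute the statistic many times
combined = np.concatenate([group_a, group_b])
perm_diffs = []
for _ in range(1000):
    rng.shuffle(combined)
    perm_diffs.append(combined[20:].mean() - combined[:20].mean())

# p-value: how often a shuffled difference is as extreme as the observed one
p_value = (1 + np.sum(np.abs(perm_diffs) >= abs(observed))) / 1001
print(p_value)
```

Because the difference was simulated to be large, the shuffled differences almost never reach it and the p-value is small.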

8 14 10/10/2017

Midterm Review [Gonzalez]

This lecture will review key topics from the course that will be covered on the midterm.

[ pptx | pdf | handout | practice midterm | practice midterm with solutions | screencast ]

15 10/12/2017

Midterm [Gonzalez and Nolan]

There is NO LECTURE today. The midterm will be held at 7:00 in Dwinelle 145 and 155.

[ midterm | midterm solutions ]

9 16 10/17/2017

Introduction to SQL and the Relational Model [Gonzalez]

SQL is the most widely used language for accessing and manipulating data. In this lecture we introduce the SQL language and the relational model of data. We will describe some of the basic SQL operations and provide some motivation behind the relational model of tabular data. (Lecture updated 10/18/2017.)

[ pptx | pdf | handout | Notebook HTML | Notebook ipynb | PostgreSQL for Mac | PostgreSQL for Windows | PostgreSQL for Linux | W3C Tutorial | screencast ]
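A minimal taste of the SQL covered here, run through Python's built-in sqlite3 module rather than the PostgreSQL setup used in lecture (the core SELECT syntax is the same in both; the sailors table and values are invented):

```python
import sqlite3

# An in-memory database with one small relation
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sailors (sid INTEGER, name TEXT, rating INTEGER)")
conn.executemany(
    "INSERT INTO sailors VALUES (?, ?, ?)",
    [(1, "Dustin", 7), (2, "Brutus", 1), (3, "Lubber", 8)],
)

# A basic SELECT with a WHERE filter and ORDER BY
rows = conn.execute(
    "SELECT name, rating FROM sailors WHERE rating > 5 ORDER BY rating DESC"
).fetchall()
print(rows)  # [('Lubber', 8), ('Dustin', 7)]
```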

Lab 7 Released: Bootstrap

Homework 4 Released: The Bootstrap

17 10/19/2017

More Advanced SQL [Gonzalez]

In this lecture we continue to dig into more advanced SQL expressions. We will introduce common table expressions, group by, and join operations and explore FEC data.

[ pptx | pdf | handout | Notebook HTML | Notebook ipynb | The FEC SQL Data (Students and Sailors included) | Loading the SQL Dump | screencast ]

Vitamin 7 Released

10 18 10/24/2017

Big Data Part 1 [Gonzalez]

Data management at the level of big organizations can be confusing and often relies on many different technologies. In this lecture we will provide an overview of organizational data management and introduce some of the key technologies used to store and compute on data at scale. Time permitting we will dive into some basic Spark programming.

[ pptx | pdf | handout | Spark Notebook HTML (tentative) | Spark Notebook ipynb (tentative) | screencast ]
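Spark itself requires a cluster setup, but the map/shuffle/reduce pattern it builds on can be sketched in plain Python; this stand-in word count illustrates the three stages without any Spark API (the input lines are invented):

```python
from collections import defaultdict

lines = ["big data", "big ideas", "data at scale"]

# Map: emit a (word, 1) pair for every word in every line
pairs = [(w, 1) for line in lines for w in line.split()]

# Shuffle: group the emitted values by key
groups = defaultdict(list)
for word, n in pairs:
    groups[word].append(n)

# Reduce: sum the counts for each word
counts = {word: sum(ns) for word, ns in groups.items()}
print(counts)
```

In a real cluster the map and reduce stages run in parallel across machines, and the shuffle moves data over the network; that distribution is what makes the pattern scale.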

Lab 8 Released: SQL and Database Setup

19 10/26/2017

P-Hacking [Guest Lecturer Aaditya Ramdas]

Almost daily, internet companies make decisions to change and update their website or app. As simple examples, Google may run tests to see if users prefer their ads on the top or on the side of the page, where the Facebook like button should be placed or what its size should be, whether the default setting when you open the Uber/Lyft app should be “Pool” or “Solo” or “Luxury”, and so on.

After running “A/B” tests at some prescribed level of confidence, companies sometimes make decisions to alter their current website or app, in the hope of more money or clicks. What proportion of these decisions (to change some feature, based on an A/B test) are “mistakes”? What might cause these mistakes in the first place? Is it even possible to track the number or fraction of bad decisions? We will gently introduce ideas in sequential testing, false discovery rate, and discuss why some of the answers can be quite subtle.

[ pdf | screencast ]

Homework 5 Released: SQL, FEC Data, and Small Donors

11 20 10/31/2017

Finish P-Hacking and Spark [Ramdas, Gonzalez]

Wrap-up: Our guest lecturer will finish discussing p-hacking (despite the very cool name, this is actually a bad thing). We will also finish covering Spark and big-data processing.

[ previous pdf | Previous Spark Notebook HTML (tentative) | Previous Spark Notebook ipynb (tentative) | screencast ]

Lab 9 Released: Spark

21 11/2/2017

Linear Models [Nolan]

Linear regression is at the foundation of most machine learning and statistical methods. In this lecture we introduce least squares linear regression and derive the normal equations from a geometric perspective.

[ Linear models pdf | Linear models pptx | chalkboard | Handwritten notes | screencast ]
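The normal equations derived in this lecture can be solved directly with a few lines of NumPy; this sketch uses simulated data with a known intercept and slope and checks the answer against NumPy's built-in least-squares solver:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
X = np.column_stack([np.ones(n), rng.uniform(0, 10, n)])  # intercept + one feature
y = X @ np.array([2.0, 0.5]) + rng.normal(0, 0.1, n)      # true model plus noise

# Normal equations: solve (X^T X) theta = X^T y
theta = np.linalg.solve(X.T @ X, X.T @ y)

# The library least-squares routine gives the same answer
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(theta)
```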

12 22 11/7/2017

Linear Regression and Feature Engineering [Gonzalez]

How do we fit linear models to non-linear and often categorical data?
In this lecture we will continue our discussion of least squares linear regression. We will revisit the normal equations from the perspective of optimization and discuss some of the computational issues around solving the normal equations. We will then transition to the task of feature engineering and describe a range of techniques for transforming data to enable linear models to fit complex relationships.

[ pptx | pdf | handout | Linear Regression Notebook HTML | Feature Engineering Notebook HTML | Raw ipynb Files and Data | screencast ]
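Two of the most common feature-engineering moves are one-hot encoding a categorical variable and adding polynomial terms; a small sketch with an invented weather table (the columns and values are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "temp": [50.0, 65.0, 80.0],
    "day": ["Sat", "Sun", "Mon"],
})

# One-hot encode the categorical feature so a linear model can use it
one_hot = pd.get_dummies(df["day"])

# Add a squared term so a linear model can fit a curved relationship
features = pd.concat([df[["temp"]], df["temp"] ** 2, one_hot], axis=1)
features.columns = ["temp", "temp_sq", *one_hot.columns]
print(features)
```

The model is still linear in the parameters; only the features are nonlinear functions of the raw data, which is what makes this trick work.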

Lab 10 Released: Least Squares Regression

23 11/9/2017

Overfitting, Cross-Validation, and the Data Science Life Cycle [Nolan]

In this lecture we will follow the data science life cycle as we address a problem through fitting a linear model. Along the way, we touch on the following topics: how to select a loss function; the test-train split to avoid overfitting; selecting a Box-Cox transformation; best subset regression; coding, collapsing, and fitting dummy variables; and cross-validation for model selection. In a later lecture, we will continue the topic of model selection and address issues that arise when there are a large number of variables in the model.

[ pptx | pdf | screencast | Data Science Life Cycle Notebook HTML | Raw ipynb Files and Data ]
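Cross-validation for model selection can be sketched with NumPy alone; here we compare polynomial degrees on simulated data whose true relationship is quadratic, using 5-fold validation error as the selection criterion:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(-3, 3, 120)
y = 1 + 2 * x - 0.5 * x**2 + rng.normal(0, 0.5, 120)  # quadratic truth + noise

def cv_mse(degree, k=5):
    """Mean validation MSE of a degree-`degree` polynomial fit over k folds."""
    idx = np.arange(x.size)
    errs = []
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)           # hold one fold out
        coefs = np.polyfit(x[train], y[train], degree)
        pred = np.polyval(coefs, x[fold])
        errs.append(np.mean((y[fold] - pred) ** 2))
    return np.mean(errs)

scores = {d: cv_mse(d) for d in [1, 2, 8]}
print(scores)
```

The underfit linear model has much higher validation error than the quadratic; the degree-8 model typically pays a small overfitting penalty relative to the true degree.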

Homework 6 Released: Spark and Least Squares Linear Regression

13 24 11/14/2017

The Bias Variance Tradeoff and Regularization [Gonzalez]

There is a fundamental tension in predictive modeling between our ability to fit the data and to generalize to the world. In this lecture we characterize this tension through the tradeoff between bias and variance. We will derive the bias and variance decomposition of the least squares objective. We then discuss how to manage this tradeoff by augmenting our objective with a regularization penalty.

[ pptx | pdf | handout | Feature Engineering Part 2 (HTML) | Bias, Variance, and Regularization (HTML) | Raw Notebooks (ipynb.zip) | screencast ]
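The regularization penalty discussed above leads to ridge regression, which modifies the normal equations by adding a penalty term; this sketch on simulated data shows the characteristic shrinkage of the coefficients as the penalty grows (`lam` denotes the regularization strength, often written as lambda):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 50, 10
X = rng.normal(size=(n, p))
y = X[:, 0] + rng.normal(0, 1, n)  # only the first feature matters

def ridge(X, y, lam):
    """Ridge estimate: minimizes ||y - X w||^2 + lam * ||w||^2."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# Increasing the penalty shrinks the coefficients toward zero,
# trading a little bias for reduced variance
norms = [np.linalg.norm(ridge(X, y, lam)) for lam in [0.0, 1.0, 100.0]]
print(norms)
```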

Lab 11 Released: Feature Engineering & Cross Validation

25 11/16/2017

Classification and Logistic Regression [Nolan]

We consider the case where our response is categorical; in particular, we focus on the simple case where the response has two categories. We begin by using least squares to fit the binary response to categorical explanatory variables and find that the predictions are proportions. Next, we consider a more complex model that is linear in quantitative explanatory variables, called the linear probability model, and we uncover limitations of this model. We motivate an alternative model, the logistic, by examining a local linear fit and matching its shape. We also draw connections between the logistic and the log odds. Lastly, we introduce an alternative loss function that is more appropriate for working with probabilities, the Kullback-Leibler divergence. We derive a representation of the K-L divergence for binary response variables.

[ pptx | pdf | screencast ]

Project 2 Released: Spam/Ham Prediction

14 26 11/21/2017

Logistic Regression and Gradient Descent [Gonzalez]

Logistic regression is perhaps one of the most widely used models for classification and a key building block in neural networks. In this lecture we review the logistic regression model and the KL-divergence loss function. We then introduce the gradient descent and stochastic gradient descent algorithms and discuss how these can be used to minimize the KL-divergence loss function. Finally, we extend the logistic regression model to the multi-class classification setting through the softmax function and briefly discuss the connection to deep learning.

[ pptx | pdf | handout | Logistic Regression Notebook Part 1 (html) | Logistic Regression Notebook Part 2 (html) | Raw Notebooks | screencast ]
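Logistic regression fit by gradient descent fits in a few lines of NumPy; this sketch uses simulated, well-separated classes and a fixed step size chosen for illustration (the lecture's notebooks use their own data and tooling):

```python
import numpy as np

rng = np.random.default_rng(4)

# Two well-separated classes in one feature, plus an intercept column
n = 200
x = np.concatenate([rng.normal(-2, 1, n // 2), rng.normal(2, 1, n // 2)])
y = np.concatenate([np.zeros(n // 2), np.ones(n // 2)])
X = np.column_stack([np.ones(n), x])

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Gradient descent on the average cross-entropy loss;
# the gradient is X^T (sigmoid(X theta) - y) / n
theta = np.zeros(2)
for _ in range(2000):
    grad = X.T @ (sigmoid(X @ theta) - y) / n
    theta -= 0.5 * grad

accuracy = np.mean((sigmoid(X @ theta) > 0.5) == y)
print(theta, accuracy)
```

Stochastic gradient descent replaces the full-data gradient with the gradient on a random mini-batch at each step, trading noisier updates for much cheaper iterations on large datasets.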

27 11/23/2017

Holiday!!

15 28 11/28/2017

HTTP requests, REST, and XPath [Nolan]

Data are available on the Internet - they can be embedded in web pages, provided after a form submission, and accessible through a REST API. We can access data from these various sources using HTTP - the HyperText Transfer Protocol. We will cover HTTP at a high level and provide examples of how to use it to access web pages, submit forms, and create REST requests. Furthermore, when the data are embedded in an HTML document, we can use XPath to locate and extract them. XPath is a powerful tool developed for locating elements and content in an XML document (much better than regular expressions!). We cover the basics of XPath, which is enough to write complex expressions.

[ pptx | pdf | handout | screencast ]
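A small taste of XPath using the standard-library ElementTree module, which supports a useful subset of the language (dedicated tools like lxml support much more); the HTML snippet here is invented for illustration:

```python
import xml.etree.ElementTree as ET

html = """
<html><body>
  <ul>
    <li class="course">Data 100</li>
    <li class="course">Stat 133</li>
    <li>Office hours</li>
  </ul>
</body></html>
"""
root = ET.fromstring(html)

# Locate every <li> with class="course", anywhere in the document
courses = [li.text for li in root.findall(".//li[@class='course']")]
print(courses)  # ['Data 100', 'Stat 133']
```

The expression reads structurally: `.//li` means "any `li` descendant," and the `[@class='course']` predicate filters by attribute, something regular expressions cannot do reliably against nested markup.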

Lab 12 Released: TensorFlow & Logistic/Softmax Regression

29 11/30/2017

Ethics in Data Science [Guest Speaker Joshua Kroll]

Data science is being used in a growing number of settings to make decisions that impact people's lives. In this lecture we will discuss just a few of the many ethical and legal considerations in the application of data science to real-world problems.

Our guest speaker Joshua Kroll is a computer scientist and researcher interested in the governance of automated decision-making systems, especially those built with machine learning. He is currently a Postdoctoral Research Scholar at UC Berkeley’s School of Information, working with Deirdre Mulligan. Before that he received a PhD in Computer Science in the Security Group at Princeton University. His dissertation on Accountable Algorithms was advised by Edward W. Felten and supported by the Center for Information Technology Policy, where he studied topics in security, privacy, and how technology informs policy decisions. Joshua was the program chair of this year’s edition of the successful workshop series “Fairness, Accountability, and Transparency in Machine Learning (FAT/ML)”.

We will also have a short presentation from Jake Soloff, one of our distinguished TAs, who recently won the Citadel Data Science Competition studying the impact of charter schools on education.

[ pptx | pdf | handout | DS Major Info. | screencast ]

16 30 12/5/2017

RRR Review [Gonzalez]

[ pptx (v2) | pdf (v2) | handout (v2) | screencast ]

Lab 13 Released: XPath

31 12/7/2017

RRR Review [Gonzalez]

[ practice questions | practice questions with solutions | pptx | pdf | handout | screencast ]

17 32 12/12/2017

Study!

33 12/14/2017

Final Exam

The final exam is currently in exam group 13 and therefore will be from 8:00AM to 11:00AM on Thursday 12/14/2017.

  • When: 8:00 AM to 11:00 AM – SET ALL YOUR ALARMS
  • Where: Valley Life Sciences 2050