This schedule is still under development and is subject to change.
Week  Lecture  Date  Topic 


1  01/17/2017 
Course Overview [Gonzalez] In this lecture we define and motivate the study of data science and outline the key ideas covered throughout the class. Lecture Notes
Homework 1 Released 
2  01/19/2017 
The Data Science Lifecycle [Gonzalez] In this lecture we introduce the data science lifecycle and explore each stage by analyzing tweets from the 2016 presidential election. Lecture Notes


3  01/24/2017 
Problem Formulation and Experimental Design [Yu] In this lecture we provide an overview of how to formulate hypotheses, identify sources of data, and construct basic experiments to collect data. Lecture Notes
Homework 2 Released; Homework 1 Due
4  01/26/2017 
Data Wrangling [Hellerstein] In this lecture we explore the challenges of data preparation (e.g., assessing, structuring, cleaning, and rolling up data) and the kinds of errors commonly found in the real world. Lecture Notes



5  01/31/2017 
Exploratory Data Analysis [Nolan] In this lecture we provide an overview of exploratory data analysis (EDA). Lecture Notes
6  02/02/2017 
Visualization and Communication [Nolan] This lecture covers how to effectively visualize and communicate complex results to a broader audience. Lecture Notes:


7  02/07/2017 
Advanced Python Data Science Tools [Gonzalez] In this lecture we will introduce Pandas, dataframe manipulation, Python visualization, and some of the batch-oriented philosophy of scalable data processing. Lecture Notes:
Homework 3 Released; Homework 2 Due
8  02/09/2017 
Prediction and Inference [Yu] In this lecture we will explore the key types and challenges of inference and prediction. We will provide an overview of the categories of prediction problems and introduce some of the key machine learning tools in Python. Lecture Notes:


9  02/14/2017 
Relational Algebra and SQL [Hellerstein] In this lecture we introduce SQL and the relational model. Lecture Notes:
10  02/16/2017 
SQL Continued [Hellerstein] In this lecture we will introduce data analysis techniques with a focus on aggregation and summary statistics. Lecture Notes:



11  02/21/2017 
Advanced SQL [Hellerstein] In this lecture we will cover SQL joins, views, and CTEs, as well as advanced aggregation including order statistics, window functions, and user-defined aggregates.
Homework 4 Released; Homework 3 Due
12  02/23/2017 
Basic Modeling using Statistical Distributions [Nolan] In this lecture we provide an overview of several basic distributions and discuss some of the challenges of working with skewed data. Lecture Notes:


13  02/28/2017 
Maximum Likelihood Estimation [Nolan] In this lecture we fit basic models to data by applying the method of maximum likelihood estimation.
14  03/02/2017 
Maximum Likelihood Estimation Continued [Nolan] This lecture will continue the discussion of the method of maximum likelihood.


15  03/07/2017 
Midterm Review [Gonzalez] 
16  03/09/2017 
Midterm: This may change in the weeks before class starts as we adjust the schedule.


17  03/14/2017 
Least Squares Regression and Hypothesis Testing [Yu] This lecture dives into the details of least squares regression through the lens of empirical risk minimization while discussing some of the key modeling assumptions.
Homework 4 Due; Homework 5 Released
18  03/16/2017 
Least Squares Regression and Hypothesis Testing [Yu] This lecture dives into the details of least squares regression through the lens of empirical risk minimization while discussing some of the key modeling assumptions.



19  03/21/2017 
Feature Engineering, Overfitting, and Cross Validation [Gonzalez] In this lecture we will begin to do some machine learning. We will explore how simple linear techniques can be used to address complex nonlinear relationships on a wide range of data types. We will start to use scikit-learn to build and visualize models in higher-dimensional spaces. We will address a key challenge in machine learning, overfitting, and discuss how cross-validation can be used to address it. The following interactive (HTML) notebooks walk through the concepts we use in lecture and are suggested reading materials.
20  03/23/2017 
Feature Engineering, Overfitting, and Cross Validation Continued [Gonzalez] In this lecture we continue the discussion from the last lecture, pushing further into feature engineering.


21  03/28/2017 
Spring Break 
22  03/30/2017 
Spring Break 


23  04/04/2017 
Regularization and the Bias-Variance Tradeoff [Gonzalez] In this lecture we will continue our exploration of overfitting and derive the fundamental bias-variance tradeoff for the least squares model. We will then introduce the concept of regularization and explore the commonly used L1 and L2 regularization functions.
Homework 5 Due; Homework 6 Released
24  04/06/2017 
Logistic Regression [Gonzalez] In this lecture we will finish our discussion on regularization and begin to study how linear models can be used to build classifiers through logistic regression.


25  04/11/2017 
Finish Logistic Regression and Start K-Means [Gonzalez and Yu] In this lecture we will finish our discussion on logistic regression and begin to explore unsupervised learning techniques. In particular, we will start with K-means and work towards the more general EM algorithm.
Additional Reading:

26  04/13/2017 
Clustering and Expectation Maximization (EM) [Yu] This lecture will continue to cover EM and more general mixed membership clustering techniques. Optional Reading:



27  04/18/2017 
MapReduce, Spark, and Big Data [Gonzalez] In this lecture we will introduce the MapReduce model of distributed computation and then dive into the Apache Spark MapReduce system developed at Berkeley. We will talk about how to use these computational frameworks to scale data processing.
Additional Reading:
Homework 6 Due; Homework 7 Released
28  04/20/2017 
Guest Lecturer on Data Science and Ethics [Charis Thompson] 


29  04/25/2017 
Finish Discussion on Spark and Classification: In the previous lectures we moved quickly through some important concepts in distributed data processing and classification. Because both of these ideas are critical in many data science applications, we will return to the discussion on Spark and review how the relational operators we learned earlier in the class enable scalable distributed computing. We will then return to the topic of classification and review logistic regression and how it can be made to run in a distributed computing environment. Time permitting, we will touch on deep learning as a generalization of the ideas in logistic regression.
30  04/27/2017 
PCA and the Berkeley Data Science Major [Nolan and Cathryn Carson] In this lecture we will provide an overview of dimensionality reduction and discuss the PCA method. We will conclude with a discussion from Cathryn Carson on the development and status of the Berkeley Data Science Major.
Homework 7 Due 


31  05/02/2017 
RRR Review [Hellerstein and Yu] This will be part one of a two-part exam review lecture to be held during the regular lecture slot.
Homework 7 Due (optional extension)
32  05/04/2017 
RRR Review [Gonzalez and Nolan] This will be part two of a two-part exam review lecture to be held during the regular lecture slot.


33  05/11/2017 
Final Exam: The final exam will be held from 3:00 to 6:00 PM on Thursday, May 11, in 100 GPB (Genetics and Plant Biology). For details about exam scheduling, visit the Berkeley Exam Calendar.