This schedule is still under development and is subject to change.

Week Lecture Date Topic
1
1 01/17/2017

Course Overview [Gonzalez]

In this lecture we define and motivate the study of data science and outline the key ideas covered throughout the class.

Lecture Notes

Homework 1 Released

2 01/19/2017

The Data Science Lifecycle [Gonzalez]

In this lecture we introduce the data-science life-cycle and explore each stage by analyzing tweets from the 2016 presidential election.

Lecture Notes
2
3 01/24/2017

Problem Formulation and Experimental Design [Yu]

In this lecture we provide an overview of how to formulate hypotheses, identify sources of data, and construct basic experiments to collect data.

Lecture Notes

Homework 2 Released

Homework 1 Due

4 01/26/2017

Data Wrangling [Hellerstein]

In this lecture we explore the challenges of data preparation (e.g., assessing, structuring, cleaning, and rolling up data) and the kinds of errors commonly found in the real world.

Lecture Notes

3
5 01/31/2017

Exploratory Data Analysis [Nolan]

In this lecture we provide an overview of exploratory data analysis (EDA).

Lecture Notes
6 02/02/2017

Visualization and Communication [Nolan]

This lecture covers how to effectively visualize and communicate complex results to a broader audience.

Lecture Notes:
4
7 02/07/2017

Advanced Python Data Science Tools [Gonzalez]

In this lecture we will introduce Pandas, dataframe manipulation, Python visualization, and some of the batch-oriented philosophy of scalable data processing.

Lecture Notes:

Homework 3 Released

Homework 2 Due

8 02/09/2017

Prediction and Inference [Yu]

In this lecture we will explore the key types and challenges of inference and prediction. We will provide an overview of the categories of prediction problems and introduce some of the key machine learning tools in Python.

Lecture Notes:

5
9 02/14/2017

Relational Algebra and SQL [Hellerstein]

In this lecture we introduce SQL and the relational model.

Lecture Notes:

10 02/16/2017

SQL Continued [Hellerstein]

In this lecture we will introduce data analysis techniques with a focus on aggregation and summary statistics.

Lecture Notes:

6
11 02/21/2017

Advanced SQL [Hellerstein]

In this lecture we will cover SQL joins, views, and CTEs, as well as advanced aggregation including order statistics, window functions and user-defined aggregates.

Homework 4 Released

Homework 3 Due

12 02/23/2017

Basic Modeling using Statistical Distributions [Nolan]

In this lecture we provide an overview of several basic distributions and discuss some of the challenges of working with skewed data.

Lecture Notes:

7
13 02/28/2017

Maximum Likelihood Estimation [Nolan]

In this lecture we fit basic models to data by applying the method of maximum likelihood estimation.

14 03/02/2017

Maximum Likelihood Estimation Continued [Nolan]

This lecture continues the discussion of the method of maximum likelihood.

8
15 03/07/2017

Midterm Review [Gonzalez]

16 03/09/2017

Midterm

This may change in the weeks before class starts as we adjust the schedule.

9
17 03/14/2017

Least Squares Regression and Hypothesis Testing [Yu]

This lecture dives into the details of least squares regression through the lens of empirical risk minimization while discussing some of the key modeling assumptions.

Homework 4 Due

Homework 5 Released

18 03/16/2017

Least Squares Regression and Hypothesis Testing [Yu]

This lecture dives into the details of least squares regression through the lens of empirical risk minimization while discussing some of the key modeling assumptions.

10
19 03/21/2017

Feature Engineering, Over-fitting, and Cross Validation [Gonzalez]

In this lecture we will begin to do some machine learning. We will explore how simple linear techniques can be used to address complex non-linear relationships on a wide range of data types. We will start to use scikit-learn to build and visualize models in higher-dimensional spaces. We will also address a key challenge in machine learning, over-fitting, and discuss how cross-validation can be used to detect and reduce it.

The following interactive (html) notebooks walk through the concepts we use in lecture and are suggested reading materials.

  • Least-Squares Linear Regression: (html, ipynb)

  • Feature Engineering Part 1: (html, ipynb, data)

  • An archive zip file of all notebooks, data, and figures for regression and subsequent over-fitting lectures.

  • Optional reading: Chapter 3.1, 3.2.
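As a rough illustration of the cross-validation idea described for this lecture, the sketch below estimates held-out error for a one-feature least-squares line using k folds. It is a minimal sketch in plain Python rather than the scikit-learn workflow used in the course notebooks; the helper names `fit_line` and `k_fold_mse` and the synthetic data are assumptions made for illustration only.

```python
import random


def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x for a single feature (hypothetical helper)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def k_fold_mse(xs, ys, k=5, seed=0):
    """Average held-out mean squared error over k folds (hypothetical helper)."""
    indices = list(range(len(xs)))
    random.Random(seed).shuffle(indices)
    # Every k-th shuffled index goes to the same fold, so folds are disjoint.
    folds = [indices[i::k] for i in range(k)]
    fold_errors = []
    for held_out in folds:
        held_out_set = set(held_out)
        train = [i for i in indices if i not in held_out_set]
        a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
        mse = sum((ys[i] - (a + b * xs[i])) ** 2 for i in held_out) / len(held_out)
        fold_errors.append(mse)
    return sum(fold_errors) / k


# Synthetic data (an assumption for illustration): y = 2 + 3x plus small noise.
rng = random.Random(42)
xs = [i / 10 for i in range(50)]
ys = [2.0 + 3.0 * x + rng.gauss(0, 0.1) for x in xs]
cv_error = k_fold_mse(xs, ys, k=5)
print("5-fold cross-validated MSE:", cv_error)
```

Because each point is scored only by a model that never saw it during fitting, the averaged error is a fairer estimate of how the model generalizes than the training error.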

20 03/23/2017

Feature Engineering, Over-fitting, and Cross Validation Continued [Gonzalez]

In this lecture we continue the discussion from the last lecture pushing further into feature engineering.

  • Feature Engineering Part 1: (html, ipynb)

  • Feature Engineering Part 2: (html, ipynb, data)

  • An archive zip file of all notebooks, data, and figures for regression and subsequent over-fitting lectures.

  • Optional reading: Chapter 2.1, 2.2.

11
21 03/28/2017

Spring Break

22 03/30/2017

Spring Break

12
23 04/04/2017

Regularization and the Bias-Variance Tradeoff [Gonzalez]

In this lecture we will continue our exploration of over-fitting and derive the fundamental bias-variance tradeoff for the least squares model. We will then introduce the concept of regularization and explore the commonly used L1 and L2 regularization functions.

  • Slides: (pptx, pdf, handout)

  • Interactive Notebook on Cross Validation and the Bias Variance Tradeoff: (html, ipynb)

  • An archive zip file of all notebooks, data, and figures for regression and subsequent over-fitting lectures.

  • An alternative derivation of the Bias Variance Trade-Off provided by Professor Yu (pdf)
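To make the difference between the L1 and L2 penalties mentioned above concrete, here is a minimal sketch for the simplest possible case: a single feature with no intercept, where both penalized least-squares problems have closed-form solutions. The function names and the toy data are assumptions for illustration and are not taken from the lecture materials.

```python
def ridge_weight(xs, ys, lam):
    """Minimize sum((y - w*x)^2) + lam * w^2 for one feature, no intercept."""
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    # L2 penalty: the weight is shrunk smoothly toward zero but never reaches it.
    return sxy / (sxx + lam)


def lasso_weight(xs, ys, lam):
    """Minimize sum((y - w*x)^2) + lam * |w| for one feature, no intercept."""
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    # L1 penalty: soft-thresholding, which sets the weight exactly to zero
    # once lam / 2 exceeds |sum(x * y)|.
    shrunk = max(abs(sxy) - lam / 2.0, 0.0)
    sign = 1.0 if sxy >= 0 else -1.0
    return sign * shrunk / sxx


# Toy data (an assumption for illustration): y = 2x exactly.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]
for lam in (0.0, 14.0, 28.0, 56.0):
    print(lam, ridge_weight(xs, ys, lam), lasso_weight(xs, ys, lam))
```

With a large enough penalty the L1 solution becomes exactly zero while the L2 solution only shrinks, which is one common motivation for using L1 regularization when sparse models are desired.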

Homework 5 Due

Homework 6 Released

24 04/06/2017

Logistic Regression [Gonzalez]

In this lecture we will finish our discussion on regularization and begin to study how linear models can be used to build classifiers through logistic regression.

13
25 04/11/2017

Finish Logistic Regression and Start K-Means [Gonzalez and Yu]

In this lecture we will finish our discussion on logistic regression and begin to explore unsupervised learning techniques. In particular, we will start with K-means and work towards the more general EM algorithm.

  • Part 2 of Logistic Regression Slides: (pptx, pdf, handout)

  • We will continue to use the previous notebook on logistic regression.

  • K-Means Slides: (pptx, pdf, handout)
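For readers who want a feel for K-means before the lecture, the following is a minimal sketch of the standard alternating procedure (assign each point to its nearest center, then move each center to the mean of its points) on one-dimensional data. The function name `k_means_1d`, the fixed iteration count, and the toy points are assumptions for illustration and do not come from the course slides.

```python
def k_means_1d(points, initial_centers, iterations=10):
    """Run a fixed number of K-means iterations on 1-D data (hypothetical helper)."""
    centers = list(initial_centers)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda j: abs(p - centers[j]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its assigned points.
        # A center with no points keeps its previous position.
        centers = [
            sum(cluster) / len(cluster) if cluster else centers[j]
            for j, cluster in enumerate(clusters)
        ]
    return centers


# Toy data (an assumption for illustration): two well-separated groups.
points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centers = k_means_1d(points, initial_centers=[0.0, 5.0])
print(centers)
```

The result depends on the initial centers; practical implementations usually try several initializations and stop when the assignments no longer change.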

Additional Reading:

26 04/13/2017

Clustering and Expectation Maximization (EM) [Yu]

This lecture will continue to cover EM and more general mixed membership clustering techniques.

Optional Reading:

14
27 04/18/2017

Map-Reduce, Spark, and Big Data [Gonzalez]

In this lecture we will introduce the Map-Reduce model of distributed computation and then dive into the Apache Spark Map-Reduce system developed at Berkeley. We will talk about how to use these computational frameworks to scale data processing.
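The Map-Reduce model can be sketched on a single machine as three phases: map each record to key-value pairs, group the pairs by key, and reduce each group to one result. The example below is a plain-Python simulation of that pattern for word counting. It does not use the Spark API, and every function name in it is an assumption chosen for illustration.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every record and yield its (key, value) pairs."""
    for record in records:
        yield from mapper(record)


def shuffle_phase(pairs):
    """Group values by key, as the framework would do between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Collapse each key's list of values into a single result."""
    return {key: reducer(key, values) for key, values in groups.items()}


def word_mapper(line):
    # Emit (word, 1) for every whitespace-separated word in the line.
    for word in line.lower().split():
        yield word, 1


def sum_reducer(key, values):
    # Total the counts emitted for one word.
    return sum(values)


lines = ["spark runs map reduce", "map then reduce"]
counts = reduce_phase(shuffle_phase(map_phase(lines, word_mapper)), sum_reducer)
print(counts)
```

In a distributed system the map and reduce calls run in parallel on different machines and the grouping step moves data across the network, but the programmer still supplies only the mapper and the reducer.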

Additional Reading:

Homework 6 Due

Homework 7 Released

28 04/20/2017

Guest Lecturer on Data Science and Ethics [Charis Thompson]

15
29 04/25/2017

Finish Discussion on Spark and Classification

In the previous lectures we moved quickly through some important concepts in distributed data processing and classification. Because both of these ideas are critical in many data science applications, we will return to the discussion on Spark and review how the relational operators we learned earlier in the class enable scalable distributed computing. We will then return to the topic of classification and review logistic regression and how it can be made to run in a distributed computing environment. Time permitting, we will touch on deep learning as a generalization of the ideas in logistic regression.

30 04/27/2017

PCA and the Berkeley Data Science Major [Nolan and Cathryn Carson]

In this lecture we will provide an overview of dimensionality reduction and discuss the PCA method. We will conclude with a discussion from Cathryn Carson on the development and status of the Berkeley Data Science Major.

Homework 7 Due

16
31 05/02/2017

RRR Review [Hellerstein and Yu]

This will be part one of a two-part exam review lecture to be held during the regular lecture slot.

Homework 7 Due (optional extension)

32 05/04/2017

RRR Review [Gonzalez and Nolan]

This will be part two of a two-part exam review lecture to be held during the regular lecture slot.

17
33 05/11/2017

Final Exam

The final exam will be from 3:00 to 6:00 PM on Thursday, May 11, in 100 GPB (Genetics and Plant Biology). For details about exam scheduling, visit the Berkeley Exam Calendar.