Syllabus
This syllabus is still under development and is subject to change.
Week  Lecture  Date  Topic  Lab  Discussion  Homework 

1  1  8/23/18 
Course Overview, Data Design and Sources of Bias [slides]
In this lecture we provide an overview of what data science is at its root and the components that make data science a large field with endless possibilities. Fundamentally, (data) science is the study of using data to learn about the world and solve problems. However, how and what data is collected can have a profound impact on what we can learn and the problems we can solve. Along the way, we will also touch on what it means to be a data scientist by examining recent surveys of data scientists. We will begin to explore various mechanisms for data collection and their implications on our ability to generalize. In particular, we will discuss differences between censuses, surveys, controlled experiments, and observational studies and will also highlight the power of simple randomization and the fallacies of data at scale. Welcome to Data 100!




2  2  8/28/18 
Data Manipulation with Pandas I [slides]
While data comes in many forms, most data analyses are performed on tabular data. Mastering the skills of constructing, cleaning, joining, aggregating, and manipulating tabular data is essential to data science. In this lecture we will introduce Pandas, the open-source Python data manipulation and analysis library widely used by data scientists. As we introduce useful Pandas operations and paradigms, we will also bring to light new concepts including indices, column operations (and their effect on system performance), grouping operations, and basic data visualization tools built to accompany Pandas.
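A minimal sketch of the kinds of operations this lecture covers (the table and column names here are hypothetical, not the lecture's actual dataset):

```python
import pandas as pd

# A small hypothetical table of restaurant inspection scores.
df = pd.DataFrame({
    "name": ["Cafe A", "Cafe A", "Diner B", "Diner B", "Diner B"],
    "year": [2017, 2018, 2017, 2018, 2018],
    "score": [90, 94, 78, 85, 88],
})

# Boolean filtering selects a subset of rows.
recent = df[df["year"] == 2018]

# Grouping and aggregation: the split-apply-combine paradigm.
mean_scores = df.groupby("name")["score"].mean()
print(mean_scores)
```

The `groupby` call is the workhorse of tabular analysis: it splits the table by a key, applies an aggregate to each group, and combines the results into a new indexed series.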



3  8/30/18 
Data Manipulation with Pandas II [slides]
Continued discussion of material in the previous lecture.


3  4  9/4/18 
Data Cleaning & EDA [slides]
Whether collected by you or obtained from someone else, raw data is seldom ready for immediate analysis. Data cleaning is an important skill every data scientist should master and it starts with understanding key aspects of the data. Through exploratory data analysis we can often discover important anomalies, identify limitations in the collection process, and better inform subsequent goal-oriented analysis. In this lecture we will discuss how to identify and correct common data anomalies and analyze their implications on future analysis. We will also discuss key properties of data including structure, granularity, faithfulness, temporality, and scope; these properties can inform how we prepare, analyze, and visualize data.
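As a small illustration of the kinds of anomalies the lecture discusses, here is a hypothetical raw table with a duplicated row, a missing value, and a sentinel code standing in for missing data:

```python
import numpy as np
import pandas as pd

# Hypothetical raw data with common anomalies.
raw = pd.DataFrame({
    "city": ["Berkeley", "Berkeley", "Oakland", "Albany"],
    "population": [121000, 121000, 433000, -999],  # -999 used as a missing-data code
    "area": [10.5, 10.5, 78.0, np.nan],
})

# Remove exact duplicate rows, then recode the sentinel value as missing.
clean = raw.drop_duplicates()
clean = clean.replace({"population": {-999: np.nan}})

# A first EDA step: count missing values per column.
print(clean.isna().sum())
```

Spotting sentinel codes like `-999` requires looking at the data's distribution first; blindly averaging the raw column would silently bias the result downward.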


Lab2 

5  9/6/18 
EDA and Visualization [slides]
In this lecture we will continue our discussion of EDA and important features we should be identifying when given a dataset. Along the way, we will start to work through a real-world exercise in EDA using public crime data for the city of Berkeley. Through this, we will also introduce tools for data visualization using Pandas, Seaborn, and Matplotlib.


4  6  9/11/18 
Visualization and Data Transformations [slides]
A large fraction of the human brain is devoted to visual perception. As a consequence, visualization is a critical tool in both exploratory data analysis and the communication of complex relationships in data. However, making informative and clear visualizations of complex concepts can be challenging. In this lecture we explore good and bad visualizations and describe how to choose visualizations for various kinds of data and goals. We will also go into detail on how to identify issues with certain visualizations and ways to fix these issues to properly convey the message you are trying to show.

However, in some cases, directly visualizing data can be uninformative. Some examples of these cases include plots with curvilinear relationships, large numbers of similar observations hiding core trends in the data, and visualizing data with a large number of variables. In this lecture we discuss data transformations, smoothing, and dimensionality reduction to address the challenges in creating informative visualizations. The Tukey-Mosteller Bulge Diagram will come in handy when talking about transformations and is a great tool for identifying when data needs to be transformed. With these additional analytics we can often reveal important and informative patterns in data.
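A minimal numeric sketch of why transformations help: with synthetic data following an exponential relationship, the raw correlation between x and y understates the association, while a log transform of y (the direction the Tukey-Mosteller bulge diagram suggests for this shape) makes the relationship exactly linear:

```python
import numpy as np

# Hypothetical data with a curvilinear (exponential) relationship.
x = np.arange(1, 51, dtype=float)
y = np.exp(0.1 * x)

# Correlation measures linear association, so the curved raw data
# scores lower than the log-transformed version.
r_raw = np.corrcoef(x, y)[0, 1]
r_log = np.corrcoef(x, np.log(y))[0, 1]
print(r_raw, r_log)
```

Since log(y) = 0.1x here, the transformed correlation is exactly 1, while the raw correlation is noticeably smaller.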



7  9/13/18 
Working with Text [slides]
Whether in documents, tweets, or records in a table, text data is ubiquitous and presents a unique set of challenges for data scientists. How do you extract key phrases from text? What are meaningful aggregate summaries of text? How do you visualize textual data? In this lecture we will introduce a set of techniques (e.g., bag-of-words) to transform text into numerical data for subsequent tabular analysis. We will also introduce regular expressions as a mechanism for cleaning and transforming text data.
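A bag-of-words representation in a few lines, using only the standard library (the documents here are made up for illustration):

```python
import re
from collections import Counter

docs = [
    "Data science is the study of data.",
    "Text data presents unique challenges!",
]

def bag_of_words(text):
    # Lowercase, strip punctuation with a regular expression,
    # then count word occurrences.
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)

counts = [bag_of_words(d) for d in docs]
print(counts[0]["data"])  # -> 2
```

The resulting word counts discard word order but turn each document into numerical data suitable for the tabular techniques covered earlier in the course.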



5  8  9/18/18 
Modeling and Estimation [slides]
How do we pick a number to represent a dataset? A key step in data science is developing models that capture the essential signal in data while providing insight into the phenomena that govern the data and enable effective prediction. In this lecture we address the fundamental question of choosing a number and more generally a model that reflects the data. We will introduce the concept of loss functions and begin to develop basic models. We will explore how calculus can be used to analytically minimize loss functions.
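As a tiny worked example of the analytic approach: for the average squared loss, setting the derivative to zero shows the minimizing constant is the mean. (The data values below are arbitrary.)

```python
# Average squared loss for a single-number summary theta of a dataset.
def mse_loss(theta, data):
    return sum((y - theta) ** 2 for y in data) / len(data)

data = [2.0, 3.0, 7.0]

# Calculus: d/dtheta (1/n) sum (y - theta)^2 = -(2/n) sum (y - theta) = 0
# implies theta equals the mean of the data.
mean = sum(data) / len(data)
print(mean)  # 4.0

# Sanity check: nudging theta away from the mean increases the loss.
assert mse_loss(mean, data) <= mse_loss(mean + 0.1, data)
```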



9  9/20/18 
Modeling and Estimation II [slides]
In this lecture we will continue our development of models within the framework of loss minimization. In particular, we will explore how to numerically minimize loss functions. We will also introduce multidimensional models and define the notion of the gradient of a function. To minimize functions, we will introduce the widely used gradient descent algorithm.
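A minimal gradient descent sketch on the one-dimensional squared loss (the data, learning rate, and step count are illustrative choices, not the lecture's exact parameters):

```python
# Gradient of the average squared loss L(theta) = (1/n) sum (y - theta)^2,
# namely dL/dtheta = -(2/n) sum (y - theta).
def gradient(theta, data):
    n = len(data)
    return -2 / n * sum(y - theta for y in data)

def gradient_descent(data, theta=0.0, lr=0.1, steps=100):
    # Repeatedly step opposite the gradient to decrease the loss.
    for _ in range(steps):
        theta = theta - lr * gradient(theta, data)
    return theta

data = [2.0, 3.0, 7.0]
theta_hat = gradient_descent(data)
print(theta_hat)  # converges to the mean of the data, 4.0
```

Because the loss is convex, the iterates contract toward the analytic minimizer (the mean) at a rate set by the learning rate.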


6  10  9/25/18 
Generalization and Empirical Risk Minimization [slides]
So far, we have focused on how we can estimate a descriptive statistic or more generally the parameters of a model that reflects our data. What does this say about the population? How can we generalize beyond what we observe? In this lecture we recast our loss minimization approach in the context of empirical risk minimization. In the process we will review basic probability concepts including expectation, bias, and variance.




11  9/27/18 
Linear Regression and Feature Engineering [slides]
Linear regression is at the foundation of most machine learning and statistical methods. We have already introduced linear models in an informal way; in this lecture we formalize the setup of a linear model as a parametric description of a dataset whose parameters can be estimated computationally. We study the normal equations from the perspective of optimization and discuss some of the computational issues around solving the normal equations. We will then transition to the task of feature engineering and describe a range of techniques for transforming data to enable linear models to fit complex relationships.
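A minimal sketch of solving the normal equations with NumPy, on noiseless synthetic data so the recovered parameters can be checked exactly:

```python
import numpy as np

# Hypothetical data generated from y = 1 + 2x (no noise, for a clean check).
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 1.0 + 2.0 * x

# Design matrix with an intercept column — a simple feature engineering step.
X = np.column_stack([np.ones_like(x), x])

# Normal equations: (X^T X) theta = X^T y, solved without forming an explicit
# inverse. (In practice np.linalg.lstsq is preferred for numerical stability,
# one of the computational issues the lecture discusses.)
theta = np.linalg.solve(X.T @ X, X.T @ y)
print(theta)  # approximately [1.0, 2.0]
```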


7  12  10/2/18 
Bias-Variance Tradeoff and Regularization [slides]
There is a fundamental tension in predictive modeling between our ability to fit the data and to generalize to the world. In this lecture we characterize this tension through the tradeoff between bias and variance. We will derive the bias and variance decomposition of the least squares objective. We then discuss how to manage this tradeoff by augmenting our objective with a regularization penalty.
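A small sketch of what the regularization penalty does to the fitted parameters, using the closed-form ridge solution on synthetic data (the coefficients and noise level here are made up):

```python
import numpy as np

# Ridge regression in closed form: theta = (X^T X + lam I)^{-1} X^T y.
# Growing the penalty lam shrinks the coefficients toward zero,
# trading increased bias for reduced variance.
def ridge(X, y, lam):
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)

small = ridge(X, y, lam=0.01)
large = ridge(X, y, lam=100.0)
print(np.linalg.norm(small), np.linalg.norm(large))  # norm shrinks as lam grows
```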


13  10/4/18 
Cross-Validation and Regularization [slides]
In this lecture we will recap our discussion of linear regression by reviewing how to use the scikit-learn regression package. We will then explore the challenges of overfitting and review how regularization can be used to address overfitting. We will introduce cross-validation as a mechanism to estimate the test error and to select the regularization parameters.
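A hand-rolled k-fold cross-validation sketch for comparing ridge penalties; the lecture uses scikit-learn for this, but the same logic is spelled out here in plain NumPy so each step is visible (the data is synthetic):

```python
import numpy as np

# Closed-form ridge fit: theta = (X^T X + lam I)^{-1} X^T y.
def ridge(X, y, lam):
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def cv_error(X, y, lam, k=5):
    # k-fold cross-validation: fit on k-1 folds, score on the held-out fold,
    # and average the held-out errors to estimate the test error.
    n = len(y)
    folds = np.array_split(np.arange(n), k)
    errors = []
    for fold in folds:
        train = np.setdiff1d(np.arange(n), fold)
        theta = ridge(X[train], y[train], lam)
        resid = y[fold] - X[fold] @ theta
        errors.append(np.mean(resid ** 2))
    return np.mean(errors)

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 4))
y = X @ np.array([1.0, 0.0, -1.0, 2.0]) + rng.normal(scale=0.5, size=100)

errs = {lam: cv_error(X, y, lam) for lam in [0.01, 1.0, 100.0]}
print(errs)  # pick the lam with the smallest estimated test error
```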


8  14  10/9/18 
Ethics [slides]
Data science is being used in a growing number of settings to make decisions that impact people's lives. In this lecture we will discuss just a few of the many ethical and legal considerations in the application of data science to real-world problems. Our guest speaker Joshua Kroll is a computer scientist and researcher interested in the governance of automated decision-making systems, especially those built with machine learning. He is currently a Postdoctoral Research Scholar at UC Berkeley’s School of Information, working with Deirdre Mulligan. Before that he received a PhD in Computer Science in the Security Group at Princeton University. His dissertation on Accountable Algorithms was advised by Edward W. Felten and supported by the Center for Information Technology Policy, where he studied topics in security, privacy, and technology’s impact on policy decisions. Joshua was the program chair of this year’s edition of the successful workshop series “Fairness, Accountability, and Transparency in Machine Learning (FAT/ML)”.




15  10/11/18 
Midterm Review Part 1 [slides]
This lecture will review key topics from the course that will be covered on the midterm.


9  16  10/16/18 
Midterm Review Part 2 [slides]
The midterm will take place on 10/17 from 8 to 10 PM.

Midterm Review (Lab8) 
Midterm OH 

17  10/18/18 
Classification and Logistic Regression I [slides]
We consider the case in which our response is categorical; in particular, we focus on the simple case in which the response has two categories. We begin by using least squares to fit the binary response to categorical explanatory variables and find that the predictions are proportions. Next, we consider a more complex model (the linear probability model) that is linear in quantitative explanatory variables, and we uncover the limitations of this model. We motivate an alternative model, the logistic, by examining a local linear fit and matching its shape. We also draw connections between the logistic and log odds. Lastly, we introduce an alternative loss function (the Kullback-Leibler divergence) that is more appropriate for working with probabilities. We derive a representation of the KL divergence for binary response variables.
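A minimal sketch of the two building blocks named above: the logistic (sigmoid) function, and the cross-entropy loss, whose minimization over binary labels is equivalent to minimizing the KL divergence between the empirical and predicted distributions:

```python
import math

# The logistic (sigmoid) function maps a linear score to a probability.
def sigmoid(t):
    return 1 / (1 + math.exp(-t))

# Cross-entropy loss for a binary label y in {0, 1} and predicted probability p.
def cross_entropy(y, p):
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

print(sigmoid(0.0))            # 0.5 — a score of zero gives even odds
print(cross_entropy(1, 0.9))   # small loss: confident and correct
print(cross_entropy(1, 0.1))   # large loss: confident and wrong
```

Note how the loss penalizes confident wrong predictions far more heavily than hesitant ones, which is exactly the behavior squared loss fails to deliver for probabilities.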


10  18  10/23/18 
Classification and Logistic Regression II [slides]
Continued discussion of material in the previous lecture.

Project 1 OH 

19  10/25/18 
Probability Theory, Monte Carlo, and Bootstrapping [slides]
We saw previously that we can study parameter estimators using theoretical and computational approaches. In this lecture we will delve deeper into the bootstrap to study the behavior of the empirical 75th percentile as an estimator for its population counterpart. We will derive the empirical quantile through the optimization of a loss function, show that the population parameter minimizes the expected loss, bootstrap the sampling distribution of the empirical 75th percentile, and use the bootstrapped distribution to provide interval estimates for the population parameter.
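A standard-library sketch of bootstrapping the empirical 75th percentile (the data, sample size, and quantile convention below are illustrative choices, not the lecture's exact setup):

```python
import random

# Hypothetical sample from a population we want to learn about.
random.seed(100)
sample = [random.gauss(50, 10) for _ in range(200)]

def percentile_75(xs):
    # One simple convention for the empirical 75th percentile.
    xs = sorted(xs)
    return xs[int(0.75 * len(xs))]

# Bootstrap: resample the sample with replacement, recompute the statistic,
# and collect the results to approximate its sampling distribution.
boot_stats = []
for _ in range(1000):
    resample = random.choices(sample, k=len(sample))
    boot_stats.append(percentile_75(resample))

boot_stats.sort()
lo, hi = boot_stats[25], boot_stats[974]  # ~95% bootstrap percentile interval
print(lo, hi)
```

The interval endpoints are just empirical quantiles of the bootstrapped statistics, which is what makes the method usable even when no closed-form sampling distribution exists.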



11  20  10/30/18 
Hypothesis Testing I [slides]
A key step in inference is often answering a question about the world. We will consider two such questions to varying degrees of detail. 1) Is there enough evidence to bring someone to trial? 2) Do female TAs get lower teaching evaluations than male TAs? We use hypothesis testing to answer these questions. In particular, we examine a collection of nonparametric hypothesis tests. These powerful procedures build on the basic idea of random simulation to help quantify the rarity of a particular phenomenon. In the process of using these procedures we will also touch on the challenges of false discovery and multiple testing.
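A simulation-based permutation test in the spirit of the nonparametric procedures described above (the evaluation scores below are fabricated for illustration only):

```python
import random

# Hypothetical evaluation scores for two groups.
random.seed(0)
group_a = [4.2, 4.5, 4.8, 4.1, 4.6, 4.4]
group_b = [4.0, 3.9, 4.3, 4.1, 3.8, 4.0]

def mean(xs):
    return sum(xs) / len(xs)

observed = mean(group_a) - mean(group_b)

# Under the null hypothesis the group labels are arbitrary, so shuffling
# them simulates the distribution of the difference in means by chance.
pooled = group_a + group_b
trials = 10000
count = 0
for _ in range(trials):
    random.shuffle(pooled)
    diff = mean(pooled[:6]) - mean(pooled[6:])
    if diff >= observed:
        count += 1

p_value = count / trials  # fraction of shuffles at least as extreme
print(observed, p_value)
```

A small p-value says the observed gap would rarely arise from random labeling alone; it does not by itself explain why the gap exists.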



21  11/1/18 
Numerical Issues, Condition Numbers, and Higher Dimensions
This is a new lecture for this semester.
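As a one-glance illustration of the condition number idea: it measures how much a matrix amplifies relative error, so nearly collinear columns (common in poorly designed feature matrices) make least squares numerically unstable. The matrices below are toy examples:

```python
import numpy as np

well = np.array([[1.0, 0.0],
                 [0.0, 1.0]])     # orthonormal columns: perfectly conditioned
ill = np.array([[1.0, 1.0],
                [1.0, 1.0001]])   # nearly dependent columns

print(np.linalg.cond(well))  # 1.0, the best possible value
print(np.linalg.cond(ill))   # very large: small input errors are amplified
```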


12  22  11/6/18 
SQL [slides]
Much of the important data in the world is stored in relational database management systems. In this lecture we will introduce the key concepts in relational databases including the relational data model, basic schema design, and data independence. We will then begin to dig into the SQL language for accessing and manipulating relational data.
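A minimal end-to-end example using Python's built-in sqlite3 module (the lecture may use a different SQL engine; the table and rows here are hypothetical):

```python
import sqlite3

# An in-memory relational database: schema, inserts, then a query.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (name TEXT, major TEXT, gpa REAL)")
conn.executemany(
    "INSERT INTO students VALUES (?, ?, ?)",
    [("Ada", "Data Science", 3.9), ("Ben", "History", 3.5),
     ("Cam", "Data Science", 3.7)],
)

# Grouping and aggregation in SQL, the relational analogue of pandas groupby.
rows = conn.execute(
    "SELECT major, AVG(gpa) FROM students GROUP BY major ORDER BY major"
).fetchall()
print(rows)
```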



23  11/8/18 
Advanced SQL [slides]
In this lecture we review more advanced SQL queries including joins and common table expressions, and we discuss how we can combine computation in a database with Python.
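A sketch combining both topics above, again with the built-in sqlite3 module: a common table expression computes per-student averages, and a join attaches names from a second table (the schema and rows are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE students (sid INTEGER, name TEXT);
    CREATE TABLE grades (sid INTEGER, course TEXT, grade REAL);
    INSERT INTO students VALUES (1, 'Ada'), (2, 'Ben');
    INSERT INTO grades VALUES (1, 'Data100', 4.0), (1, 'CS61A', 3.7),
                              (2, 'Data100', 3.3);
""")

# The CTE (WITH clause) names an intermediate result; the outer query
# then joins it against the students table on the shared key.
rows = conn.execute("""
    WITH avgs AS (
        SELECT sid, AVG(grade) AS avg_grade FROM grades GROUP BY sid
    )
    SELECT s.name, a.avg_grade
    FROM students s JOIN avgs a ON s.sid = a.sid
    ORDER BY s.name
""").fetchall()
print(rows)
```

CTEs keep multi-step queries readable by replacing deeply nested subqueries with named intermediate tables.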



13  24  11/13/18 
Big Data [slides]
Data management at the level of big organizations can be confusing and often relies on many different technologies. In this lecture we will provide an overview of organizational data management and introduce some of the key technologies used to store large amounts of data. We will introduce various data representation techniques for database design, and we will discuss the tradeoffs between different methods of enterprise data management.


Lab11 

25  11/15/18 
Distributed Computing [slides]
Distributed computing is the process in which multiple computers work together to accomplish a computational task. In this lecture we will discuss various distributed computing methods that we can use to work with data at scale. In particular, we will introduce programming with Spark, a parallel execution engine for big data processing.
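Spark itself is not assumed installed here, so the sketch below mimics in plain Python the map-then-reduce pattern that Spark's RDD API distributes across machines, using the classic word-count example:

```python
from functools import reduce

lines = ["big data", "distributed computing", "big ideas"]

# "Map" step: each line is turned into partial word counts. In Spark,
# these partial results would be computed in parallel on different workers.
def count_words(line):
    counts = {}
    for word in line.split():
        counts[word] = counts.get(word, 0) + 1
    return counts

# "Reduce" step: merge partial counts pairwise into a single result.
def merge(a, b):
    out = dict(a)
    for word, n in b.items():
        out[word] = out.get(word, 0) + n
    return out

word_counts = reduce(merge, map(count_words, lines))
print(word_counts)  # {'big': 2, 'data': 1, 'distributed': 1, 'computing': 1, 'ideas': 1}
```

The key design point is that `merge` is associative, so partial results can be combined in any grouping — exactly the property that lets a cluster aggregate results without coordinating the order of operations.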


14  26  11/20/18 
A/B Testing [slides]
It is now commonplace for organizations with websites or mobile apps to run randomized controlled experiments, or “A/B tests” as they’re often called in industry. Such experiments provide a reliable way to determine which product changes lead to the most successful user interactions. In this lecture we will discuss why randomized experiments are so important, talk about some of the key design choices that go into A/B tests, and get a brief introduction to sequential monitoring of experimental results.

Project 2 OH 
Break 

27  11/22/18 
Thanksgiving Break
Enjoy your break!

15  28  11/27/18 
Data Commons [slides]
There will be a guest lecturer on this day.


Lab12 
HW5 Due 
29  11/29/18 
Conclusion [slides]
This is the last lecture.

Proj2A Due 

16  30  12/4/18 
RRR Week
This review lecture goes over material in the second half of the course.

Proj2B Due 

31  12/6/18 
RRR Week
This review lecture goes over problems in the Spring 2018 Final.

HW6 Due, Grad Project Due 

17  32  12/11/18 


33  12/13/18 
Final Exam (11:30am-2:30pm)
