Anyone can choose to complete the final project. Students enrolled in C200A, the graduate version of the course, are required to complete it. For students enrolled in C100, the final project is optional but allows for an alternate grading option. See the grading page for details. The final project is in addition to the required course projects released during the semester.
Due Date: Your final project must be submitted by 11:59pm Wednesday, May 8.
Checkpoint: You must fill out this form describing your project by 11:59pm Wednesday, April 24 (@berkeley.edu login required). If you are working with a partner, you should both fill out the form.
Project Report: Your project submission should be a single notebook that has the format of a research paper. It should include a title, a list of authors, an abstract, an introduction, a description of the data, a description of the methods, a summary of results, and a discussion. The notebook should also include all code and visualizations. Make sure to number figures and tables and include informative captions.
Partners: You may complete the project on your own or with one other classmate.
Presentations: The Data 100/200 Project Fair will be held 2pm-4pm on Thursday, May 9 in Wozniak Lounge, 430 Soda Hall. (Details to be confirmed.) Presenting your project is not required but is strongly encouraged.
Scoring: Your project will be scored based on the submitted report. If you present your project, the person who will score your project will also attend your presentation for additional context.
There are two options for the final project: pick your own question and dataset, or follow the recommendations we have provided.
Option 1: Design a Project
The purpose of this project is to carry through a data science workflow and put into practice what you have learned in this course in a more open-ended setting than the assignments. Specifically, the project should involve the following steps.
- Frame a question of your choice that can be addressed by identifying, collecting, and analyzing relevant data.
- Describe and obtain the data.
- Perform exploratory data analysis (EDA) and include in your report at least two (but probably many more) data visualizations.
- Describe any data cleaning or transformations that you perform and why they are motivated by your EDA.
- Apply relevant inference or prediction methods (e.g., linear regression, logistic regression, or classification and regression trees), including, if appropriate, feature engineering and regularization. Use cross-validation or test data as appropriate for model selection and evaluation. Make sure to carefully describe the methods you are using and why they are appropriate for the question to be answered.
- Summarize and interpret your results (including visualization). Provide an evaluation of your approach and discuss any limitations of the methods you used.
- Describe any surprising discoveries that you made and future work.
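As one illustration of the model selection step above, the following is a minimal sketch of k-fold cross-validation used to choose a regularization strength. It is not a required approach. The data are synthetic, the helper names (fit_ridge_1d, k_fold_mse) and the candidate penalties are assumptions made for this example, and it uses only the Python standard library so that it runs anywhere. In a real project you would more likely use a library such as scikit-learn.

```python
import random

def fit_ridge_1d(xs, ys, lam):
    """Fit y = a + b*x, shrinking the slope b with an L2 penalty lam."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / (sxx + lam)          # lam = 0 gives ordinary least squares
    intercept = y_bar - slope * x_bar
    return intercept, slope

def k_fold_mse(xs, ys, lam, k=5):
    """Average held-out mean squared error over k folds."""
    n = len(xs)
    fold_errors = []
    for fold in range(k):
        held_out = set(range(fold, n, k))  # every k-th point forms one fold
        train_x = [xs[i] for i in range(n) if i not in held_out]
        train_y = [ys[i] for i in range(n) if i not in held_out]
        a, b = fit_ridge_1d(train_x, train_y, lam)
        errors = [(ys[i] - (a + b * xs[i])) ** 2 for i in held_out]
        fold_errors.append(sum(errors) / len(errors))
    return sum(fold_errors) / k

# Synthetic data (an assumption for illustration): y = 2 + 3x + noise.
random.seed(0)
xs = [random.uniform(0, 10) for _ in range(60)]
ys = [2 + 3 * x + random.gauss(0, 1) for x in xs]

# Hypothetical candidate penalties; pick the one with the lowest CV error.
candidates = [0.0, 1.0, 10.0, 100.0]
cv_errors = {lam: k_fold_mse(xs, ys, lam) for lam in candidates}
best_lam = min(cv_errors, key=cv_errors.get)
print("CV error by penalty:", cv_errors)
print("Selected penalty:", best_lam)
```

Because each fold is scored only on points that were held out of fitting, the cross-validation error is a fairer basis for choosing the penalty than the training error. Your report should still evaluate the final model on data that played no role in that choice.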
In order to ensure that you have applied the course materials in sufficient scope, we impose the following two additional requirements.
- The analysis should involve at least one of the inference or prediction methods presented in this course.
- The dataset should have at least six distinct variables (i.e., columns) and a sample size (i.e., rows) of 50 or more. Much larger datasets are encouraged. Smaller datasets must be approved by the instructors via e-mail.
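A quick way to confirm that a dataset meets the size requirement is to count its rows and distinct columns before you start. The sketch below assumes the data are stored as a CSV file with a header row. The function name check_dataset_size and the in-memory sample are hypothetical and are used only for illustration.

```python
import csv
import io

MIN_COLUMNS = 6   # at least six distinct variables
MIN_ROWS = 50     # sample size of 50 or more

def check_dataset_size(csv_file):
    """Return (n_rows, n_columns, meets_requirement) for an open CSV file."""
    reader = csv.reader(csv_file)
    header = next(reader)
    n_columns = len(set(header))               # count distinct column names
    n_rows = sum(1 for row in reader if row)   # skip blank lines
    meets = n_columns >= MIN_COLUMNS and n_rows >= MIN_ROWS
    return n_rows, n_columns, meets

# Hypothetical in-memory dataset with 6 columns and 60 rows.
header = "a,b,c,d,e,f\n"
body = "".join(f"{i},{i+1},{i+2},{i+3},{i+4},{i+5}\n" for i in range(60))
print(check_dataset_size(io.StringIO(header + body)))  # (60, 6, True)
```

For a file on disk, you could pass open("your_data.csv", newline="") instead of the in-memory sample, where the file name is a placeholder.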