Graduate Project

Introduction

The graduate project is offered only to students enrolled in Data C200, CS C200A, or Data 200S. Other students are welcome to explore the questions and datasets in the project for personal learning, but their work will not be graded or counted towards their final grades.

The purpose of the project is to give students experience in both open-ended data science analysis and research in general. In this project, you will work with one or more of the datasets provided below (in any combination) to explore research questions that you define.

Deliverables

There are six deliverables in the graduate project element of the course.

  • Group Formation + Research Proposal: You will form a project group and submit a Google Form stating your research proposal. Please see below for more information.
  • Checkpoint 1: EDA + Internal Peer Review: You will need to submit a write-up + code for Exploratory Data Analysis on your dataset. You will also have to submit an internal peer review. Please see below for more information.
  • Checkpoint 2: Mandatory Check-In: You will need to write a one-pager of your progress (with a focus on modeling approaches your team explored) and review it with a course staff member. Please see below for more information.
  • Checkpoint 3: Project Report First Draft + Internal Peer Review: This will be your first draft; you will be required to submit a report of your EDA and modeling along with any code necessary to reproduce your results. You will also have to submit an internal peer review. Please see below for more information.
  • External Peer-Review: You will need to provide other project teams with feedback on their projects.
  • Final Project Report: You will submit the final project report. You will need to submit a report (as well as all necessary code), ensuring you incorporate all relevant feedback from the first draft and external peer review. You will also be required to make a brief 5-7 minute YouTube video recording of the project. Please see below for more information.

Teamwork

You must work in a group with one or two other students. In order to give everyone experience in collaborating on a data science project, individual projects are not allowed. We have an Ed post for teammate search. Everyone in the same group will receive the same grade (except in exceptional circumstances).

Timeline and Grading Breakdown

Deadline (at 11:59pm Pacific) | Event / Deliverable                                             | Link                                        | Grading Weight
------------------------------|-----------------------------------------------------------------|---------------------------------------------|---------------
10/06                         | Group Formation + Research Proposal                             | Google Form                                 | 5%
10/22                         | Checkpoint 1: EDA + Internal Peer Review                        | Checkpoint 1, Internal Peer Review          | 10%
Week of 11/6                  | Checkpoint 2: Mandatory Check-In                                | Ed Post, Gradescope Submission              | 7.5%
11/27                         | Checkpoint 3: Project Report First Draft + Internal Peer Review | Gradescope Submission, Internal Peer Review | 20%
12/01                         | External Peer-Review                                            | Gradescope Submission                       | 7.5%
12/08                         | Final Project Report (including the final YouTube video)        | Gradescope Submission                       | 50%

Late Policy

  • You may submit the first draft, final report, and the presentation video late with a 10% penalty (applying only to that portion of your project grade) for each day it is late. You may submit up to two days late. Submission times are rounded up to the next day. That is, 2 minutes late = 1 day late.
  • Internal and external peer reviews as well as other project deliverables must be completed on time (there is no grace period).

Datasets

This section contains the datasets we will provide to you to explore your research questions.

  • You must incorporate at least one of the provided datasets.
  • You are welcome to bring in additional datasets to complement the datasets provided here, but you must cite the sources and clearly describe the content of any additional data you use in the final report.

In general, if you’re drawing any conclusions regarding causality, please be sure to consult the extra resources on causal inference.

Accessing Datasets

All the datasets provided by us can be found inside the following link on Google Drive:

Graduate Project Datasets Google Drive

If you wish to work on Datahub, we've provided some instructions below on how to move the data from Google Drive onto Datahub. However, the Datahub kernel is typically limited to 2 GB of memory. Given this limitation (and the size of most datasets), we recommend using Google Drive + Google Colaboratory instead. If you would rather work on the project locally, you can also download the files containing the datasets for each topic.

How to Pull Data from Google Drive directly onto Datahub

  1. Get the Google Drive ID of the file. To do this, first get the URL of the file. You can do this by right-clicking on the file in Google Drive and pressing ‘Share -> Copy Link’. Once you have the URL, you can find the ID by looking for the set of characters after the /d/ in the URL. For example, in the following URL: https://drive.google.com/file/d/16-4O_lJGioPC5G9il4vR_XrCgJ3J9_zK/view?usp=sharing, the Google Drive ID would be 16-4O_lJGioPC5G9il4vR_XrCgJ3J9_zK.
  2. Download the data. Once you have the Google Drive ID of the file, you can use the utils.py file inside the grad_proj directory on your Datahub, which provides a number of useful functions for downloading data. You'll want fetch_and_cache_gdrive, which you call from a notebook. The function takes two arguments: (1) the Google Drive ID from the previous step, and (2) a name for the file. Calling the function creates a data folder and places the file into that folder under the name you supplied as the second argument (see the example call below).
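
For example, a call might look like the following minimal sketch. The Drive ID is the one from the example URL above; the file name example.csv and the pandas read afterwards are placeholders of your choosing:

    # Run inside a notebook in the grad_proj directory, where the
    # course-provided utils.py lives.
    import pandas as pd
    from utils import fetch_and_cache_gdrive

    # Arguments: (1) the Google Drive ID, (2) a file name of your choosing.
    fetch_and_cache_gdrive("16-4O_lJGioPC5G9il4vR_XrCgJ3J9_zK", "example.csv")

    # The function places the file into a data/ folder under that name.
    df = pd.read_csv("data/example.csv")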

Hopefully, the above steps help you access the data on Google Drive. There are other ways to move the data onto Datahub: consider looking into gdown (a short sketch follows below), or simply download the data from Google Drive and upload it to Datahub manually.
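
As one illustration, a gdown call (after pip install gdown) might look like this; the output file name is a placeholder:

    import gdown

    # Download a Drive file by ID to a local path of your choosing.
    gdown.download(
        "https://drive.google.com/uc?id=16-4O_lJGioPC5G9il4vR_XrCgJ3J9_zK",
        "example.csv",
        quiet=False,
    )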

Take a look at the other functions in utils.py if you’d like to use other data sources to supplement your project.

Topic 1: COVID-19

Dataset A: Testing and Mortality Statistics

This dataset contains US reports on COVID-19 testing and cases from the COVID-19 Data Repository maintained by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University, as well as data from the CDC (Centers for Disease Control and Prevention). You can access all the data within the Topic 1/Dataset A directory on Google Drive:

  • csse_covid_19_daily_reports_us.csv contains US daily reports (documentation)
  • cdc_death_counts_by_sex_age_state.csv contains US reports on deaths involving COVID-19, pneumonia, and influenza reported to NCHS, broken down by sex, age group, and state. (documentation)
  • cdc_death_counts_by_conditons.csv contains US weekly reports on health conditions and contributing causes mentioned in conjunction with deaths involving COVID-19. (documentation)

You must work with at least two of the reports above in your analysis.

Topic 2: Climate and the Environment

Dataset A: General Measurements and Statistics

This dataset contains some general statistics and measurements of various aspects of the climate and the environment. You can access all the data within the Topic 2/Dataset A directory on Google Drive. It includes the following reports:

  • daily_global_weather_2020.csv contains data on daily temperature and precipitation measurements. To learn how to use the data from this file, please read the following section on the first report.
  • us_greenhouse_gas_emissions_direct_emitter_facilities.csv and us_greenhouse_gas_emission_direct_emitter_gas_type.csv contain data reported by the EPA (Environmental Protection Agency) on greenhouse gas emissions, detailing the specific types of gas reported by facilities and general information about the facilities themselves. The dataset is made available through the EPA's GHGRP (Greenhouse Gas Reporting Program).
  • us_air_quality_measures.csv contains data from the EPA’s AQS (Air Quality System) that measures air quality on a county level from approximately 4000 monitoring stations around the country. (source)
  • aqi_data contains additional EPA data from a number of sites, covering a multitude of different metrics. (source)

The following subsection contains more details on how to work with the first report on global daily temperature and precipitation:

The first report contains daily temperature and precipitation measurements taken by weather stations in the Global Historical Climatology Network (GHCN) from January to December 2020.

The data in daily_global_weather_2020.csv is derived from the source file at https://www1.ncdc.noaa.gov/pub/data/ghcn/daily/by_year/2020.csv.gz.

To help you get started with a dataset of manageable size, we have preprocessed the GHCN dataset to include only the average temperature and precipitation measurements from stations that have both measurements. Each row in the preprocessed dataset contains both the average temperature and precipitation measurements for a given station on a given date.

If you wish to explore the climate data for a different year, you can use the GHCN_data_preprocessing.ipynb notebook to download and perform the preprocessing described above. Please be advised that depending on the dataset size for a given year, GHCN_data_preprocessing.ipynb may not run on DataHub.

The data contains only the (latitude, longitude) coordinates for the weather stations. To map the coordinates to geographical locations, the reverse-geocoder package mentioned in the References section might be helpful.
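
As a sketch of that mapping, assuming the reverse_geocoder package is installed; the latitude/longitude column names below are assumptions about the preprocessed file and may differ:

    import pandas as pd
    import reverse_geocoder as rg

    weather = pd.read_csv("data/daily_global_weather_2020.csv")

    # rg.search takes a list of (lat, lon) tuples and returns one match per pair.
    coords = list(zip(weather["latitude"], weather["longitude"]))  # hypothetical column names
    matches = rg.search(coords)

    # Attach the country code and region of the nearest known place.
    weather["country"] = [m["cc"] for m in matches]
    weather["region"] = [m["admin1"] for m in matches]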

Dataset B: Biodiversity in the Ecosystem

This dataset contains studies focused specifically on the impact of environmental and climate changes on biodiversity and the local ecosystems. You can access all the data within the Topic 2/Dataset B directory on Google Drive. It includes the following reports:

  • bioCON_plant_diversity.csv contains data collected as part of an ecological experiment, BioCON (Biodiversity, CO2, and Nitrogen), that started in 1997 and focused on studying biodiversity within the plant species at Cedar Creek Ecosystem Science Preserve. (documentation)
  • plant_pollinator_diversity_set1.csv and plant_pollinator_diversity_set2.csv contain ecological data collected from a long-term observation study from 2011 to 2018 that focuses on plant-pollinator interaction and its impact on local biodiversity. (documentation)
  • national_parks_biodiversity_parks.csv and national_parks_biodiversity_species.csv contain data published by the National Park Service on animal and plant species identified in individual national parks.

Topic 3: Recommender Systems

A recommender system is an information filtering system that focuses on predicting the preference a user would give to an item, for example by predicting its rating or rank; recommender systems are used in a variety of areas, such as search engines and online shopping platforms. This topic contains a set of reports on various tools that use recommender systems.
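
To make the idea concrete, here is a toy sketch of item-based collaborative filtering on a fabricated ratings matrix; it is purely illustrative, not a prescribed approach for these datasets:

    import numpy as np

    # Rows = users, columns = items; 0 marks an unobserved rating.
    R = np.array([[5, 3, 0],
                  [4, 0, 4],
                  [1, 5, 5]], dtype=float)

    def cosine_sim(a, b):
        mask = (a > 0) & (b > 0)  # compare only co-rated entries
        if not mask.any():
            return 0.0
        return a[mask] @ b[mask] / (np.linalg.norm(a[mask]) * np.linalg.norm(b[mask]))

    # Predict user 0's missing rating for item 2 as a similarity-weighted
    # average of that user's ratings on the other items.
    sims = np.array([cosine_sim(R[:, 2], R[:, j]) for j in (0, 1)])
    pred = sims @ R[0, [0, 1]] / sims.sum()
    print(round(pred, 2))  # roughly 3.9 on this toy matrix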

You can access all the data within the Topic 3/Dataset A through Topic 3/Dataset C directories on Google Drive.

Dataset A: Fitness Recommendations

These datasets consist of user sports records collected from Endomondo. They include a rich variety of sequential sensor data, such as heart rate, speed, and GPS coordinates. Additionally, the datasets contain information about the type of sport, user gender, and weather conditions, including temperature and humidity.

  • Relevant data can be found in Topic 3/Dataset A on the Google Drive. You may also visit the main page for documentation and links to download the dataset.

Dataset B: Amazon Recommendations

These datasets comprise Amazon reviews, which encompass ratings, textual content, and helpfulness votes for a wide variety of Amazon categories, such as fashion, electronics, and pet supplies. They also contain product metadata, including descriptions, category information, price, brand, and image features.

  • Instructions on how to access the data are located in Topic 3/Dataset B on the Google Drive. You may also directly visit the main page, which includes general information about the dataset, such as metadata and categories, as well as the dataset request process.

Dataset C: Application Usage Recommendation

The frappe dataset contains a context-aware app usage log consisting of 96,203 entries by 957 users for 4,082 apps used in various contexts. These contexts include factors such as time of day, country, number of downloads, and cost.

  • frappe.csv and meta.csv contain data on mobile app usage for users in various contexts. For general information about the dataset, please refer to frappe_README.txt and stats.ipynb.

Group Formation + Research Proposal

The first deliverable of your group project is to form your group, choose a dataset, and submit your research proposal to this Google Form by 11:59 pm on 10/06. Along with your research proposal, you are required to briefly explore your chosen dataset and describe it in one paragraph. You may form groups of 2 or 3 people with any Data C200, CS C200A, or Data 200S student. If you are having trouble finding a group, we can assign you to one if you fill out this form by 11:59 pm on 9/30.

Checkpoint 1: EDA + Internal Peer Review

The checkpoint is intended to keep you on track to meet your project goals. You will need to submit exploratory data analysis results on Gradescope. This will include submitting both a report of your results so far as well as all code necessary to replicate your results. Your submission should include:

  • Project Introduction and Goals: Please briefly introduce your project. Think about introducing your project to someone who has a background in data science but does not know the dataset and your research question. This part should not exceed 500 words. Here are some components to help you get started:
    • What is the dataset about? How was the data collected? What are the available features and information? What is the size of the dataset?
    • What questions do you plan to ask about the dataset? Why do we care about such a problem?
    • What is your workflow for the project? Your first step, second step…
    • What are the models you plan to use? Why would the model be a good fit for your project? What are potential pitfalls you could run into?
    • What is your goal for the project? What are the expected deliverables?
  • EDA: Show the results from your EDA work (a minimal code sketch follows this list). You should include:
    • Data Sampling and Collection
      • How was the data collected?
      • Was there any potential bias introduced in the sampling process?
    • Data Cleaning
      • What type of data are you currently exploring?
      • What is the granularity of the data?
      • What does the distribution of the data look like? Are there any outliers? Are there any missing or invalid entries?
    • Exploratory Data Analysis
      • Is there any correlation between the variables you are interested in exploring?
      • How would you cleanly and accurately visualize the relationship among variables?
      • What are your EDA questions? (For example, are there any relationships between A and B? What is the distribution of A?).
      • Do you need to perform data transformations?
    • Figures (tables, plots, etc.)
      • Descriptions of your figures. Takeaways from the figures.
      • These figures must be of good quality (i.e., they must include axes, titles, labels, etc.) and they must be relevant to your proposed analysis.
  • Other Preliminary Results (optional): Please optionally post any other preliminary results here for our information.
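
As a concrete starting point for the data cleaning and EDA items above, here is a minimal sketch; the file and column names are hypothetical stand-ins for your chosen dataset:

    import pandas as pd
    import matplotlib.pyplot as plt

    df = pd.read_csv("data/example.csv")  # hypothetical file

    # Granularity and size: what does one row represent, and how many are there?
    print(df.shape)
    print(df.head())

    # Missing or invalid entries per column.
    print(df.isna().sum())

    # Distribution and outliers for a numeric column of interest.
    print(df["value"].describe())        # hypothetical column
    df["value"].plot.hist(bins=50)
    plt.xlabel("value")
    plt.title("Distribution of value")
    plt.show()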

Checkpoint 2: Mandatory Check-In

The purpose of this checkpoint is to ensure you are making progress and are on schedule to submit the first draft of the project in two weeks' time. You will be required to prepare a one-page document summarizing all of your progress so far, and you will have to bring the document to a one-on-one meeting with a staff member. Please look at the rubric for the checkpoint and at what you need to include in the Final Project Report when deciding what to include in your one-page document; the document should be a brief summary of all your progress so far. The staff member will quickly skim the document and give you guidance on the project as a whole. More details about submitting the one-page document and signing up for the staff member meeting will be announced on Ed soon.

Checkpoint 3: Project Report First Draft + Internal Peer Review

This is the first draft of your final report; please see the Final Project Report section below for more information on what you should aim to submit. You do not need to submit the video component for Checkpoint 3, but you are expected to submit a comprehensive written report that summarizes your analysis.

Final Project Report

The project submission should include the following three components: the analysis notebooks, the project write-up, and the YouTube video recording (see Component 3 below).

[Component 1] Analysis Notebooks

This component includes all the Jupyter Notebook(s) containing all the analyses that you performed on the datasets to support your claims in your write-up. Make sure that all references to datasets are done as data/[path to data files]. By running these notebooks, we should be able to replicate all the analysis/figures done in your write-up.

Your analysis notebook(s) should address all of the following components of the data science lifecycle. Please note that a thorough explanation of your thought process and approach is as important as the work itself; unreadable or uncommented code will lose points. Along with the code for the EDA portion (which must also be included), we have provided a few additional preliminary questions/tips you can consider for the modeling portion of the project, followed by a short illustrative sketch:

  • What are the research questions that you are answering through your analysis? What type of machine learning problem are you investigating?
  • Which model(s) do you use and why?
  • How do you use your data for training and testing?
  • Does your model require hyperparameter tuning? If so, how do you approach it?
  • How do you engineer the features for your model? What are the rationales behind selecting these features?
  • How do you perform cross-validation on your model?
  • What loss metrics are you using to evaluate your model? Why?
  • From a bias-variance tradeoff standpoint, how do you assess the performance of your model? How do you check if it is overfitting?
  • How would you improve your model based on the outcome?
  • Are there any further extensions to your model that would be worth exploring?
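
As one illustration of the train/test, cross-validation, and hyperparameter-tuning questions above, the sketch below tunes a ridge regression with scikit-learn; X and y are synthetic stand-ins for your engineered features and target:

    import numpy as np
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import GridSearchCV, train_test_split
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    # Synthetic stand-ins for engineered features and target.
    rng = np.random.default_rng(42)
    X = rng.normal(size=(200, 3))
    y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

    # Hold out a test set that is never touched during model selection.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )

    # Standardization lives inside the pipeline so it is refit on each CV
    # fold, avoiding leakage from the validation fold into the scaler.
    pipeline = make_pipeline(StandardScaler(), Ridge())
    search = GridSearchCV(
        pipeline,
        param_grid={"ridge__alpha": [0.01, 0.1, 1.0, 10.0]},
        cv=5,
        scoring="neg_mean_squared_error",
    )
    search.fit(X_train, y_train)

    print("best alpha:", search.best_params_)
    print("held-out neg. MSE:", search.score(X_test, y_test))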

[Component 2] Project Write-Up

This is a single PDF that summarizes your workflow and what you have learned. It should be structured as a research paper and include a title, list of authors, abstract, introduction, description of data, methodology, summary of results, discussion, conclusion, and references. Make sure to number figures and tables, include informative captions, and state the provenance of the figures in the main narrative. We encourage you to render the PDF using LaTeX, but we will not be able to provide assistance with LaTeX-related issues.

Specifically, you should ensure you address the following in the narrative:

  • Clearly state the research questions and why they are interesting and important.
  • Introduction: ensure you include a brief survey of related work on the topic(s) of your analysis. Be sure to reference current approaches/research in the context of your project, as well as how your project differs from or complements existing research. You must cite all the references you discuss in this section.
  • Description of data: ensure you outline the summary of the data and how the data was prepared for the modeling phase (summarizing your EDA work). If applicable, descriptions of additional datasets that you gathered to support your analysis may also be included.
  • Methodology: carefully describe the methods/models you use and why they are appropriate for answering your research questions. You must include a detailed description of how modeling is done in your project, including inference or prediction methods used, feature engineering and regularization if applicable, and cross-validation or test data as appropriate for model selection and evaluation.
  • Summary of results: analyze your findings in relation to your research question(s). Include/reference visualizations and specific results. Discuss any interesting findings from your analysis. You are encouraged to compare the results using different inference or prediction methods (e.g. linear regression, logistic regression, or classification and regression trees). Can you explain why some methods performed better than others?
  • Discussion: evaluate your approach and discuss any limitations of the methods you used. Also, briefly describe any surprising discoveries and whether there are any interesting extensions to your analysis.

The narrative PDF should include figures sparingly to support specific claims. It can include a few runnable code components, but it should not have large amounts of code. The length of the report should be 8 ± 2 pages when it is printed as a PDF, excluding figures and code.

Tip: if you need to write a large amount of LaTeX in Markdown cells, you may want to use the %%latex cell magic. However, we also encourage you to explore Overleaf for easily writing clean LaTeX documents.
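
For example, a notebook cell whose first line is the %%latex magic renders its body as LaTeX; the formula here is just an illustration:

    %%latex
    \[
    \hat{\theta} = \arg\min_{\theta} \frac{1}{n} \sum_{i=1}^{n}
        \left( y_i - f_{\theta}(x_i) \right)^2
    \]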

Please submit everything as a zip file to the final report submission portal on Gradescope. Please make sure the folder in the zip file has the following structure:

[your studentIDs joined by _]/
    data/[all datasets used]
    analysis/[analysis notebooks]
    narrative/[narrative PDF]
    figures/[figures included in the narrative PDF]

Please use student IDs joined by _ as the name for the top-level directory. The analysis notebooks must be runnable within this directory structure. If the narrative PDF includes any figures that are created in the analysis notebooks, the figures should be saved to figures/ by the analysis notebooks.
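
For instance, a notebook cell along these lines would satisfy both requirements, assuming the notebooks are executed with the top-level submission directory as the working directory; the file and column names are hypothetical:

    import pandas as pd
    import matplotlib.pyplot as plt

    df = pd.read_csv("data/example.csv")

    fig, ax = plt.subplots()
    df["value"].plot.hist(ax=ax, bins=50)      # hypothetical column
    ax.set_xlabel("value")
    ax.set_title("Distribution of value")

    # Save into figures/ so the narrative PDF can include the exact file.
    fig.savefig("figures/value_distribution.png", dpi=300, bbox_inches="tight")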

[Component 3] Presentation Video

The presentation video should provide an overview of your project, highlighting the main points outlined in the write-up. The video should be approximately 5-7 minutes long but can extend up to 10 minutes. You should upload your video to YouTube and include the YouTube link in your final project write-up.

Rubrics

This section includes a rubric for how different project deliverables are going to be graded. This section will be updated as we get further along the project timeline.

Group formation + Research Proposal (5%)

  • One paragraph description of the data that will be used in the project (1.5%).
  • List of research questions and their alignment with the given datasets (1.5%).
  • Forming teams by the deadline (2%).

Checkpoint 1: EDA + Internal Peer Review (10%)

  • Project Introduction and Goals (0.5%).
  • Data Sampling and Collection (0.5%).
  • Data Cleaning (3%).
  • Exploratory Data Analysis (3%).
  • Figures (tables, plots, etc.) (2.5%).
  • Internal Peer Review (0.5%).

Checkpoint 2: Mandatory Check-In (7.5%)

  • Research Questions (1.5%).
  • Feature Engineering (2%).
  • Modeling Approaches (3%).
  • Preliminary Results (1%).

Checkpoint 3: Project Report First Draft + Internal Peer Review (20%)

Please refer to the section on the Final Project Report for more information on how your first draft will roughly be graded. Your first draft will be graded more leniently than your final submission, but we’re still looking for largely the same elements. You do not need to submit the video component for checkpoint 3, but you are expected to submit a comprehensive written report that summarizes your analysis.

External Peer Review (7.5%)

Each group will peer review the project of another group. The review will be graded by staff out of a total of 7.5 points. Each review should include the following components:

  • (1.5%) A summary of the report. The summary should address at least the following:
    • What research questions does the group propose? Why are they important?
    • How does the dataset relate to the research question?
    • What data modeling/inference techniques do the group primarily use to gain insights into their research question? Why are these techniques suitable for the task?
    • What are the next steps a researcher can take if they want to investigate the question further based on the work in the project?
  • (6%, 1% for each component) An evaluation of the report based on the Data Science Lifecycle. The review should include at least one strong point and one suggestion for improvement for each of the following components in the project:
    • Research Questions
    • Introduction
    • Description of Data
    • Methodologies
    • Summary of Results
    • Discussion

The external peer review is also a great chance to learn from other people's work and reflect on your own.

Final Project Report (50%)

  • Analysis Notebooks (10%).
  • Project Write-up (30%).
  • Presentation Video (10%).

Extra Resources: Causal Inference

When studying relationships between variables in the datasets, you might want to consult the following references on causality vs. correlation. It is often tempting to make claims about causal relationships when there is not enough evidence from the data to support such claims. Please review the following references, or other reputable references that you find on the topic, to familiarize yourself with relevant concepts and methods.