Applied Data Analysis (ADA)

The project is an important grading item of the course (30% of the grade). It will allow you to choose a dataset and a question of interest, run analyses, and communicate your results. Project Milestone P1 is submitted by filling out a Google Form, while project Milestones P2 and P3 are submitted through a GitHub repository that contains the required deliverables at the time of the deadline. The repositories will be collected automatically.

Schedule

The schedule for the projects is as follows:

Milestone P1, due 23:59 CET, 4 Oct 2024 (10% of the project): To be done individually, where each student submits an outline of project ideas of up to 500 words by filling a Google Form. We will grade the creativity and clarity of the proposed ideas.
Milestone P2, due 23:59 CET, 15 Nov 2024 (20% of the project): To be done as a team, where the team submits a GitHub repository that includes: (1) a well-organized README containing the detailed project proposal (up to 1000 words) and (2) code containing initial analyses and data handling pipelines. We will grade the correctness, quality of code, and quality of textual descriptions.
Milestone P3, due 23:59 CET, 20 Dec 2024 (70% of the project): To be done as a team, where the team submits a data story using a platform of its choice, and the project GitHub repository containing the final code. We will grade the overall data story and the associated code for correctness, quality, and quality of textual descriptions.
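To make the weights above concrete, here is a minimal sketch of the arithmetic. It assumes that the milestone scores are combined as a plain weighted average on a common scale; the schedule does not spell this out, and the example scores are purely hypothetical.

```python
# Milestone weights within the project, taken from the schedule above.
MILESTONE_WEIGHTS = {"P1": 0.10, "P2": 0.20, "P3": 0.70}
# Share of the project in the final course grade.
PROJECT_SHARE = 0.30


def project_grade(scores):
    """Weighted average of milestone scores (assumption: linear combination)."""
    return sum(MILESTONE_WEIGHTS[m] * scores[m] for m in MILESTONE_WEIGHTS)


def course_share(milestone):
    """Fraction of the final course grade carried by a single milestone."""
    return MILESTONE_WEIGHTS[milestone] * PROJECT_SHARE


# Hypothetical scores on a 0-100 scale, for illustration only.
example_scores = {"P1": 80, "P2": 70, "P3": 90}
print(round(project_grade(example_scores), 2))
for m in MILESTONE_WEIGHTS:
    print(m, f"{course_share(m):.0%} of the course grade")
```

Under these assumptions, P1, P2, and P3 carry 3%, 6%, and 21% of the course grade, respectively.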

The bulk of your work should be over before Christmas, in order for you to focus on the exam (and exams of other classes). Note: Additional details about each project milestone are available below.

P1: First glimpse at the data

For Milestone P1, the first task for each team will be to select a dataset. We provide a variety of datasets that you can choose from. After selecting a dataset, each team member will individually perform the following tasks:

Read the paper(s) relevant to the chosen dataset. Please see Column G of the dataset Google sheet. If you don’t fully grasp the technical details of the proposed methods, that’s totally fine. What matters is that you understand what the dataset is and how it was derived.
Familiarize yourself with the chosen dataset. The best way to do this is by playing around with it, for example, by extracting summary statistics and going through different small samples of the dataset. Note that there is no need to load and perform an in-depth analysis of the entire dataset for Milestone P1.
Once you have explored the dataset, propose exactly three bold and creative project ideas that could be realized with your chosen dataset. At this stage, it does not matter whether the ideas are easily feasible or not, but you should still consider the data you would (potentially) need to realize the proposed ideas. Also, the ideas proposed in Milestone P1 may not necessarily turn out to be the project you will eventually do. The idea of this first milestone is to get the juices flowing, get you in a creative mode, and, at the same time, get your hands dirty! For each idea, it is important to clearly state: (1) the overall goal (title) of the project, (2) high-level research questions, and (3) high-level steps toward a solution for each research question (no need for precise method implementations).
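As an illustration of the "play around with it" step above, the following sketch reads only the first few records of a CSV file and computes basic summary statistics for one column. It uses only the Python standard library; the in-memory data and the column names (id, year, rating) are hypothetical stand-ins for whichever dataset you choose. With pandas, `read_csv` with its `nrows` parameter would be a common alternative.

```python
import csv
import io
import statistics
from itertools import islice

# Tiny in-memory stand-in for a dataset file; the columns are hypothetical.
RAW = io.StringIO(
    "id,year,rating\n"
    "1,1999,7.5\n"
    "2,2004,6.0\n"
    "3,2010,\n"
    "4,2015,8.5\n"
    "5,2020,9.0\n"
)


def head(handle, n):
    """Read only the first n records, without loading the whole file."""
    return list(islice(csv.DictReader(handle), n))


def summarize(rows, column):
    """Summary statistics for one numeric column, skipping empty cells."""
    values = [float(r[column]) for r in rows if r[column] != ""]
    return {
        "count": len(values),
        "missing": len(rows) - len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "min": min(values),
        "max": max(values),
    }


sample = head(RAW, 4)
print(sample[0])
print(summarize(sample, "rating"))
```

For a real file, you would replace `RAW` with an opened file handle, for example `open(path, newline="")`.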

P1 deliverable: An outline of project ideas of up to 500 words (done individually). The outline of project ideas is submitted by filling a Google Form. We will grade the creativity and clarity of the proposed ideas. Note that for this first milestone we are not going to grade any code.

P2: Project proposal and initial analyses

In Milestone P2, together with your team members, you will agree on and refine your project proposal. Your first task is to select a project. Even though we provide the datasets for you to use, at this juncture, it is your responsibility to perform initial analyses and verify that what you propose is feasible given the data (including any additional data you might bring in yourself), which is crucial for the success of the project.

The goal of this milestone is to intimately acquaint yourself with the data, preprocess it, and complete all the necessary descriptive statistics tasks. We expect you to have a pipeline in place, fully documented in a notebook, and show us that you have clear project goals.

When describing the relevant aspects of the data, and any other datasets you may intend to use, you should in particular show (non-exhaustive list):

That you can handle the data at its full size.
That you understand what’s in the data (formats, distributions, missing values, correlations, etc.).
That you considered ways to enrich, filter, and transform the data according to your needs.
That you have a reasonable plan and ideas for methods you’re going to use, giving their essential mathematical details in the notebook.
That your plan for analysis and communication is reasonable and sound, potentially discussing alternatives to your choices that you considered but dropped.
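One possible way to address the first two points (size and content of the data) is to stream the file once and keep only running counts in memory. The sketch below does this with the standard library; the function name `profile`, the sample data, and the column names are hypothetical. With pandas, passing `chunksize` to `read_csv` is a common alternative for the same purpose.

```python
import csv
import io
from collections import Counter


def profile(handle, categorical=()):
    """Stream a CSV once; return row count, missing shares, value counts."""
    reader = csv.DictReader(handle)
    n_rows = 0
    missing = Counter()
    distributions = {col: Counter() for col in categorical}
    for row in reader:
        n_rows += 1
        for col in reader.fieldnames:
            value = row.get(col)
            if value is None or value.strip() == "":
                missing[col] += 1  # empty or absent cell
            elif col in distributions:
                distributions[col][value] += 1
    if n_rows == 0:
        return 0, {}, distributions
    missing_share = {col: missing[col] / n_rows for col in reader.fieldnames}
    return n_rows, missing_share, distributions


# Hypothetical data standing in for a file too large to load at once.
RAW = io.StringIO(
    "genre,country,revenue\n"
    "drama,US,10.5\n"
    "comedy,,3.2\n"
    "drama,FR,\n"
    "action,US,\n"
)

n_rows, missing_share, distributions = profile(RAW, categorical=("genre",))
print(n_rows)
print(missing_share)
print(distributions["genre"].most_common())
```

Because only counters are kept, memory use depends on the number of columns and distinct categorical values rather than on the number of rows.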

We will evaluate this milestone according to how well these steps have been done and documented, the quality of the code and its documentation, and the feasibility and critical awareness of the project. We will also evaluate this milestone according to how clear, reasonable, and well thought-out the project idea is. Please use the second milestone to really check with us that everything is in order with your project (idea, feasibility, etc.) before you advance too much with the final Milestone P3! There will be project office hours dedicated to helping you.

You will work in a public GitHub repository dedicated to your project, which can be created by following this link. The repository will automatically be named ada-2025-project-. By the Milestone P2 deadline, each team should have a single public GitHub repo under the epfl-ada GitHub organization (https://github.com/epfl-ada), containing the project proposal and initial analysis code.

P2 deliverable (done as a team): GitHub repository with the following:

README.md file containing the detailed project proposal (up to 1000 words). Your README.md should contain:
- Title
- Abstract: A 150-word description of the project idea and goals. What’s the motivation behind your project? What story would you like to tell, and why?
- Research Questions: A list of research questions you would like to address during the project.
- Proposed additional datasets (if any): List the additional dataset(s) you want to use (if any), and some ideas on how you expect to get, manage, process, and enrich it/them. Show us that you’ve read the docs and some examples, and that you have a clear idea on what to expect. Discuss data size and format if relevant. It is your responsibility to check that what you propose is feasible.
- Methods
- Proposed timeline
- Organization within the team: A list of internal milestones up until project Milestone P3.
- Questions for TAs (optional): Add here any questions you have for us related to the proposed project.
The GitHub repository should be well structured and contain all the code for the initial analyses and data handling pipelines. For structure, please use this repository as a template.
Notebook presenting the initial results to us. We will grade the correctness, quality of code, and quality of textual descriptions. There should be a single Jupyter notebook containing the main results. The implementation of the main logic should be contained in external scripts/modules that will be called from the notebook.
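The split between notebook and external modules described above could look roughly like the following sketch. The module path `src/cleaning.py`, the function names, and the sample records are hypothetical and are not taken from the template repository; everything is shown in one block here so that it runs on its own.

```python
# --- Contents of a hypothetical module, e.g. src/cleaning.py ---------------

def drop_incomplete(records, required):
    """Keep only records with a non-empty value for every required key."""
    return [
        r for r in records
        if all(r.get(k) not in (None, "") for k in required)
    ]


def add_decade(records, year_key="year"):
    """Return copies of the records with a derived 'decade' field."""
    return [{**r, "decade": (int(r[year_key]) // 10) * 10} for r in records]


# --- Notebook cell ----------------------------------------------------------
# In the real layout, this cell would start with an import such as:
#     from src.cleaning import drop_incomplete, add_decade
# and would contain only the calls and the presentation of the results.

raw = [
    {"title": "A", "year": "1994"},
    {"title": "", "year": "2001"},   # incomplete record, will be dropped
    {"title": "C", "year": "2017"},
]

clean = add_decade(drop_incomplete(raw, required=("title", "year")))
print(clean)
```

Keeping the logic in modules leaves the notebook short and readable, and lets the same functions be reused when the notebook is extended later.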

P3: Final project and the datastory

In Milestone P3, you will execute the project you proposed and describe it in a data story.

A data story is a blog post or short article with a strong visual component that uses data to tell a story and illustrate it effectively. You can be less formal here (although methods and math should then appear in the notebook), but more visual. Please look at the top-k projects from 2023, 2022, 2021, 2020, and 2019 as a reference for designing good data stories. You can pick your preferred platform, but we encourage you to use Jekyll. We prepared a short tutorial for creating a website with GitHub Pages, which outlines how you can host the data story. You submit the story by providing a URL to it in your README file.

A Jupyter notebook extending the one delivered for Milestone P2 is also expected and will be graded. The README in Milestone P3 shall be updated. It should also detail the contributions of all group members.

Example:

John: Plotting graphs during data analysis, crawling the data, preliminary data analysis
Mary: Problem formulation, coming up with the algorithm
Chris: Coding up the algorithm, running tests, tabulating final results
Eve: Writing up the report or the data story, preparing the final presentation

There will be project office hours dedicated to helping you finish the project successfully. The bulk of your work will be over before the winter break, so you can focus on the exam (and exams of other classes) during that time.

P3 deliverables (done as a team):

The final project repository containing your final code and results notebook. We will grade the correctness, quality of code, and quality of textual descriptions. There should be a single Jupyter notebook presenting the results. The implementation of the main logic should be contained in external scripts/modules that will be called from the notebook.
Data story