Your hometown mayor just created a new data analysis team to give policy advice, and the administration recruited you via LinkedIn to join it. Unfortunately, due to budget constraints, for now the "team" is just you...
The mayor wants to start a new initiative to move the needle on one of two separate issues: high school education outcomes, or drug abuse in the community.
Also unfortunately, that is the entirety of what you've been told. And the mayor just went on a lobbyist-funded fact-finding trip in the Bahamas. In the meantime, you got your hands on two national datasets: one on SAT scores by state, and one on drug use by age. Start exploring these to look for useful patterns and possible hypotheses!
This project is focused on exploratory data analysis, aka "EDA". EDA is an essential part of the data science analysis pipeline. Failure to perform EDA before modeling is almost guaranteed to lead to bad models and faulty conclusions. The habits you build in this project are good practice for all projects going forward, especially those after this course!
Spend your time trying to understand your data, through both summary statistics and visualization. By the end, you will want to be familiar enough with the datasets that you can think of testable hypotheses that could point in specific policy directions.
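As a starting point for the summary-statistics side of this, here is a minimal sketch using pandas. The column names (`participation_rate`, `verbal`, `math`) and every value in the frame are hypothetical placeholders, not taken from the actual SAT dataset; swap in the real file once you have loaded it.

```python
import pandas as pd

# Placeholder values for illustration only; NOT the real SAT data.
# Column names are assumptions about what such a dataset might contain.
scores = pd.DataFrame(
    {
        "state": ["AA", "BB", "CC", "DD", "EE"],
        "participation_rate": [10, 35, 60, 80, 95],
        "verbal": [580, 550, 520, 505, 500],
        "math": [590, 545, 525, 500, 495],
    }
)

# Keep only the numeric columns so the summaries below are well defined.
numeric = scores.select_dtypes(include="number")

summary = numeric.describe()   # count, mean, std, min, quartiles, max
medians = numeric.median()     # per-column median
correlations = numeric.corr()  # pairwise Pearson correlations

print(summary)
print(medians)
print(correlations)
```

Looking at the mean next to the median, and at the correlation table, is often enough to suggest which columns deserve a plot next.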
We will be looking for the following things:
- For statistics questions, Python code -- using pandas, numpy, scipy, and/or other libraries -- to calculate correct answers, with Markdown explaining your results
- For plotting questions, labeled seaborn or matplotlib plots displayed within your notebook, with Markdown interpreting the results
- Materials must be in a clearly commented Jupyter notebook.
- Students should demonstrate the ability to:
  - Analyze diverse datasets and explicitly state their assumptions.
  - Form hypotheses and justify them with solid statistical testing in NumPy.
  - Visualize data and interpret the resulting plots using Matplotlib and Seaborn.
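To illustrate what a "labeled" plot means in the requirements above, here is a minimal matplotlib sketch. The data and column names are hypothetical placeholders rather than values from either project dataset, and a seaborn plot with the same title and axis labels would serve equally well.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Placeholder values for illustration only; NOT the real SAT data.
scores = pd.DataFrame(
    {
        "participation_rate": [10, 35, 60, 80, 95],
        "math": [590, 545, 525, 500, 495],
    }
)

fig, ax = plt.subplots(figsize=(6, 4))
ax.scatter(scores["participation_rate"], scores["math"])

# Every plot should carry a title and axis labels with units where relevant.
ax.set_title("Math score vs. participation rate (placeholder data)")
ax.set_xlabel("Participation rate (%)")
ax.set_ylabel("Mean math score")
fig.tight_layout()
```

In a Jupyter notebook the figure renders at the end of the cell; follow it with a Markdown cell that interprets what the plot shows.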
For all projects, students will be evaluated on a simple 3-point scale (0, 1, or 2). Instructors will use this rubric when scoring student performance on each of the core project requirements:
Score | Expectations |
---|---|
0 | Incomplete |
1 | Does not meet expectations |
2 | Meets expectations, good job! |
- Here's a cheatsheet of descriptive statistics methods in Pandas.
- Making good plots can take a lot of trial-and-error (especially with matplotlib). The seaborn example gallery may help you find the right code, and decide what you want to do in the first place.
- Inferential statistics and hypothesis testing can get very nuanced. It is okay to violate some of the assumptions underlying the methods you've learned. But be explicit about why you've chosen a particular method, and what the drawbacks may be.
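One way to be explicit about method choice is to write the assumptions next to the test itself. The sketch below uses SciPy's two-sample t-test in its Welch form on synthetic samples; the group names, means, and sizes are invented for illustration and say nothing about the real datasets. Welch's test is just one reasonable option here, not a required method.

```python
import numpy as np
from scipy import stats

# Synthetic samples for illustration only; NOT drawn from the project data.
rng = np.random.default_rng(seed=0)
group_a = rng.normal(loc=520, scale=30, size=25)
group_b = rng.normal(loc=500, scale=45, size=25)

# Assumptions stated up front:
#   - observations are independent within and between groups
#   - each group is roughly normally distributed
#   - equal variances are NOT assumed (equal_var=False gives Welch's t-test)
# Drawback: with small or heavily skewed samples the p-value can mislead.
result = stats.ttest_ind(group_a, group_b, equal_var=False)
t_stat, p_value = result.statistic, result.pvalue

print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
```

If the normality assumption looks doubtful after plotting the data, say so in Markdown and consider noting a nonparametric alternative as a cross-check.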