- Course: Data Science Project 1 COMP-4447-1
- Class time: Mon, Wed 07:00 PM - 08:50 PM | Engineering & Computer Science | Room 410
- Instructor: Pooran Singh Negi, pooran.negi@du.edu webpage
- Office: 470
- Office Hours: Tue, Thu, 3.30 p.m. - 5.30 p.m. Email for 1-on-1 help.
- GTA: Mitchell Wright, GTA office hours ECS 126, Mon 4-6 p.m., Fri 3-5 p.m.
- Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython, 2nd Edition, by Wes McKinney. It is available online from the library.
- Mastering Python Regular Expressions by Félix López and Víctor Romero
- Think Stats: Exploratory Data Analysis in Python
- w3school for python, html, sql
- python warrior
- regular expression 101
- debuggex
We recommend consulting this GitHub page often for material related to this course. Check your e-mail periodically for messages. Assignments will be uploaded here and to Canvas.
The main objective of data science tools 1 is to learn various tools for data analysis. The focus in tools 1 is data cleanup, summarization, and visualization. It is more like a hacking skill set, but our primary focus will be on the scientific Python and Linux ecosystem. We’ll use Jupyter notebook/lab in class and for homework, which should make our learning interactive.
For the final project, students will work through individual or team projects applying course-work to the data lifecycle within a particular domain. The focus will also be on best data science/software engineering practices and reproducible work.
Please select a project by January 20th as per your preference. You may work in a group of 2 to 3 students, but the project work must justify the team size. There will be a homework asking for the details of your final project, and we’ll provide feedback on its feasibility. The final project may be based on initial capstone work; please let us know if this is the case so that we can go over the details.
This syllabus is subject to change at the discretion of the instructor.
- Jupyter Notebook for reproducible workflow.
- Data science and EDA.
- Git tools work flow.
- Data science at command prompt. Linux command line, bash, basic awk and sed.
- Data collection and ingestion (web scraping and reading datasets + pandas).
- Data cleanup and imputation + pandas.
- Data summarization and visualization + pandas (groupby, apply, aggregate etc.).
- Go over some topics as per students’ demands.
- more to come
Linux command line and scientific Python (primarily numpy, matplotlib, requests, seaborn, basic pandas) will be used throughout the course.
There will be coding/analysis homework assignments, a midterm, and a final project. We’ll drop your lowest assignment grade.
There will be a final presentation of the final project. You will be required to submit a final project report in Jupyter notebook format.
Component | Weight |
---|---|
Coding homework | 50% |
Midterm, 13 Feb in class | 15% |
Comprehensive final, 13 March. We’ll use the better of your midterm or final marks | |
Final project presentation, 10 minutes, 18 March in class | 15% |
Final project report, due 18 March; please refer to the final report format above for submission guidelines | 20% |
Grade range: [(‘A’, >=93), (‘A_minus’, >=89), (‘B_plus’, >=85), (‘B’, >=81), (‘B_minus’, >=77), (‘C_plus’, >=73), (‘C’, >=69), (‘C_minus’, >=65), (‘D_plus’, >=61), (‘D’, >=57), (‘D_minus’, >=53), (‘F’, <53)]
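As a rough illustration of how the weights and grade range above combine, here is a minimal Python sketch. It is not the official grading script. The function names are hypothetical, all scores are assumed to be on a 0-100 scale, and the D_plus cutoff is assumed to be inclusive (>= 61) like the other cutoffs.

```python
# Weights from the grading table above (percent of the course grade).
# The midterm and the final share one 15% slot: the better mark is used.
WEIGHTS = {"homework": 50, "exam": 15, "presentation": 15, "report": 20}

# Lower bounds from the grade range above, highest first.
# Assumption: D_plus is inclusive (>= 61) like every other cutoff.
CUTOFFS = [
    ("A", 93), ("A_minus", 89), ("B_plus", 85), ("B", 81),
    ("B_minus", 77), ("C_plus", 73), ("C", 69), ("C_minus", 65),
    ("D_plus", 61), ("D", 57), ("D_minus", 53),
]


def homework_average(scores):
    """Average the homework scores after dropping the single lowest one."""
    if len(scores) < 2:
        raise ValueError("need at least two homework scores to drop one")
    kept = sorted(scores)[1:]  # sorted ascending, so index 0 is the lowest
    return sum(kept) / len(kept)


def course_score(homeworks, midterm, final, presentation, report):
    """Weighted course score on a 0-100 scale."""
    parts = {
        "homework": homework_average(homeworks),
        "exam": max(midterm, final),  # better of midterm or final
        "presentation": presentation,
        "report": report,
    }
    return sum(WEIGHTS[name] * parts[name] for name in WEIGHTS) / 100


def letter_grade(score):
    """Map a 0-100 score to the letter labels used in the grade range."""
    for letter, cutoff in CUTOFFS:
        if score >= cutoff:
            return letter
    return "F"


if __name__ == "__main__":
    score = course_score([100, 50, 90], midterm=60, final=80,
                         presentation=90, report=70)
    print(score, letter_grade(score))
```

The cutoffs are kept in a list ordered from highest to lowest so that the first matching threshold wins.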
All members of the University of Denver community are expected to uphold the values of Integrity, Respect, and Responsibility. These values embody the standards of conduct for students, faculty, staff, and administrators as members of the University community. Our institutional values are defined as:
Integrity: acting in an honest and ethical manner;
Respect: honoring differences in people, ideas, experiences, and opinions;
Responsibility: accepting ownership for one’s own behavior and conduct.
Please respect DU’s Honor Yourself, Honor the Code.
Students with recognized disabilities will be provided reasonable accommodations, appropriate to the course, upon documentation of the disability with a Student Accommodation Form from the Disability Services Program. To receive these accommodations, you must request the specific accommodations by submitting them to the instructor in writing by the end of the first week of classes. Visit the Campus Life & Inclusive Excellence webpage for details.
Please see the registrar calendar for academic deadlines. We’ll strictly follow the deadlines.
- You can collect the dataset for your project.
- Web scraping, web API (for natural language processing one can use the New York Times, Twitter etc.)
- I am looking for noisy datasets for practice.
- See Datasets for data cleaning practice by Rachael Tatman
- Datasets for Data Mining and Data Science
- The EU Open Data Portal
- World Bank Open Data
- The home of the U.S. Government’s open data
We need to know your project and dataset before we approve it as a final project.
More to come.
We want everybody to have the same experience using computational tools in data science tools 1. Please follow the steps for your operating system.
Please install Windows Subsystem for Linux (WSL) on Windows 10. Follow the instructions in the post Using Windows Subsystem for Linux for Data Science by Hugo Ferreira to install Linux. **Ignore the install Anaconda part.**
You can also watch this video to see the installation of Windows 10 Bash & Linux Subsystem Setup.
You can run echo $0 to check the current shell. Change to the bash shell using chsh -s /bin/bash
Once you are at the Linux/Mac bash command prompt, please follow the instructions below.
Please follow the instructions here to install python3 if it is not installed on your system. This link also covers Windows Subsystem for Linux (WSL) for Windows 10 (Creators or Anniversary Update). I am using Python 3.5.2. Hopefully any version of Python 3 should work.
Run the following commands from the command prompt.
- apt-get install python3-venv
- Using the command line (cd command), go to the folder where you want to keep the Python files and notebooks related to this course.
- run **python3 -m venv /path/to/new/virtual/environment**
- e.g. I ran python3 -m venv dst1_env
- To activate your environment run source /path/to/new/virtual/environment/bin/activate
- e.g. from this course directory I run source dst1_env/bin/activate
- run python3 -m pip install --upgrade pip. Note that there are 2 dashes in the upgrade option.
- run wget https://raw.githubusercontent.com/psnegi/data_science_tools1/master/requirements.txt
- run pip install -r requirements.txt
- run jupyter notebook or jupyter lab.
- In the browser you should see your current files.
- Click on the notebook you want to run.
- click on RISE slideshow extension in notebook, if you want to see notebook as slideshow.
To deactivate python virtual environment, run deactivate
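After following the steps above, a quick check like the one below can confirm that the virtual environment is active and that the main packages are importable. This is only a sketch: the helper names are hypothetical, and the package list is an assumption based on the libraries named in this syllabus, not the actual contents of requirements.txt.

```python
import importlib.util
import sys


def in_virtualenv():
    """Return True when the interpreter runs inside a venv-style environment."""
    # In a venv, sys.prefix points at the environment while
    # sys.base_prefix still points at the system installation.
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


def missing_packages(names):
    """Return the top-level module names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Assumption: import names of the libraries mentioned in this syllabus;
    # adjust the list to match requirements.txt.
    wanted = ["numpy", "pandas", "matplotlib", "seaborn", "requests"]
    print("virtual environment active:", in_virtualenv())
    print("missing packages:", missing_packages(wanted) or "none")
```

Save it to a file inside the course folder and run it with python3 after activating the environment. If any packages are reported missing, rerun pip install -r requirements.txt.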
You can also go to my python for reproducible research GitHub repository and start by running the pythonBasic.ipynb notebook. I will go over the basics of Python and Jupyter notebook.
- try python notebook online without installing anything
- Runs and visualizes your python code
- The Python Tutorial
- more to come
No late homework will be accepted.
Due date | HW no | description and links | solution |
---|---|---|---|
Monday 21 Jan 11.59 p.m. | 1 | Complete questions in these notebooks | |
Friday 25 Jan 11.59 p.m. | 2 | Complete questions in this notebook | |
Thursday 31 Jan 11.59 p.m. | 3 | Complete questions in this notebook | |
Friday 8 Feb 11.59 p.m. | 4 | Complete questions in this bash file | |
Friday 15 Feb 11.59 p.m. | 5 | Complete questions in this notebook | |
Friday 23 Feb 11.59 p.m. | 6 | Complete questions in this notebook | |
Friday 1 March 11.59 p.m. | 7 | Complete questions in this notebook | |
Monday 11 March 11.59 p.m. | 8 | Complete this hw notebook | |
Date | Reading/Coding Assignments | class activity |
---|---|---|
7 Jan | Install jupyter environment | Mitchell covered Jupyter introduction notebook |
also helped with installation | ||
Python Virtual Environments | Covered jupyter introduction and data science notebook. | |
9 Jan | Resources to learn git | It may be time consuming to wait for the notebook to start via Binder every time. |
We’ll also go over data science | Go to the folder for this course in your computer and run git clone https://github.com/psnegi/data_science_tools1.git. | |
Run command ls. You should see data_science_tools1 folder. Activate your virtual environment. | ||
Navigate to the course directory using cd data_science_tools1. Change to the notebook directory using the command cd notebooks. | ||
Now run jupyter notebook. You should see all the notebooks in a browser window. Click on the notebook you want to run. | ||
To run a cell in the notebook press Alt+Enter or Ctrl+Enter. | ||
Note that whenever new content is posted, you must run git pull origin master from the data_science_tools1 directory to make sure you have the latest | ||
content. Don’t worry about the above git commands. We’ll start git in the next class. Please start with the git notebook. | ||
I don’t like notebooks.- Joel Grus video provided by Laura Atkinson | ||
14 Jan | Covered git for managing local project and git work flow in team. | |
If you are using Mac, you may need to install Xcode Command Line Tools or install git. | ||
If you haven’t set up Windows Subsystem for Linux and want to use git in Windows, see How to Install GIT client on Windows | ||
I use emacs, but use any editor you like for coding Python. Atom is a good choice. | ||
16 Jan | Will work on git tool part 2 | Covered work flow in a team, when to push a branch to the remote (you don’t have integration set up, other team members want to |
look at the feature code for review etc.), merge conflicts, tagging. Started with “forget to work on a feature branch”. | ||
23 Jan | Data science at command prompt | Finished how to move changes to a feature branch. Note that when cleaning the master branch using a soft or mixed reset, the master branch |
will still contain your changes. If you use a hard reset, the changes will be lost in master. **HEAD detached** will contain the changes if required. | ||
Finished Linux overview, basic commands, redirection and pipe. | ||
28 Jan | Practice posted notebooks | Finished regular expressions. Using basic Linux commands and regular expressions (curl, grep, sort, uniq), found the top k words in a Gutenberg book. |
See notebooks in notebooks section | Finished basic awk and sed. | |
30 Jan | See notebooks in notebooks section | Finished positional parameters and command substitution in bash scripting. Note that you need to use the bc command to do floating-point arithmetic. |
numpy library for scientific computation. | ||
In the jupyter notebook use ? or ?? to read about a function (like np.array?). Press Shift+Tab to get a tooltip for function arguments (e.g. type np.ones( and press Shift+Tab). | ||
Started with REST API. Please install Chrome so that we have the same options to click when inspecting HTTPS messages. | ||
See 4 Feb notebooks | Covered REST API. Will cover how to create a REST API in tools 2 using AWS API Gateway and Lambda functions. | |
4 Feb | Web Scraping in class version | Finished scraping the Fry’s Electronics website for telescopes. |
6 Feb | Pandas basic see notebook section | |
11 Feb | Data ingestion and cleaning | Covered basic data ingestion API and cleanup functionality. See also pd.qcut (quantile-based discretization). |
13 Feb | in class midterm | |
18 Feb | python re library and data wrangling | |
20 Feb | Basics of NLP and normalization of text data | |
25 Feb | Text cleanup, contractions, using WordNet for synonyms, antonyms, hypernyms, hyponyms, and edit distance | |
There will be a comprehensive in-class final exam. We’ll use the better of your midterm or final marks (15% weight). | ||
27 Feb | Extracting text and tables from PDF files. Concept of split-apply-combine. Pandas group by. | |
If you had issues installing pdf miner on Mac, it can be Java related. | ||
Install JDK using this link https://www.oracle.com/technetwork/java/javase/downloads/jdk11-downloads-5066655.html | ||
and also run: sudo R CMD javareconf, otherwise other packages that use Java will fail | ||
(provided by Chris Haddad) | ||
4 March | Covered matplotlib theory, hierarchical organization (tree structure) of figure components. | |
Started seaborn. | ||
6 March | Seaborn when some variables are categorical: scatter, swarm (concept of hue, jitter). For big data, plotting statistical summaries: | |
distplot, jointplot, pairplot, boxplot, bar plot (uni/bivariate). Linear relationships using regplot. | ||
Touched upon geo plot (choropleth map) using folium. | ||
11 March | Time series, Timestamp and Period concepts. Feature engineering (shift, rolling, weighted feature summary) and started time series analysis. | |
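The shift and rolling features mentioned for the 11 March session can be illustrated without any libraries. The sketch below is plain Python written for illustration only. It is not the class notebook code, and the function names are hypothetical. In pandas the corresponding operations are Series.shift and Series.rolling(...).mean(), which mark unavailable positions with NaN where this sketch uses None.

```python
def shift(values, periods=1):
    """Lag a sequence by `periods` steps, padding the start with None."""
    if periods < 0:
        raise ValueError("this sketch only supports non-negative lags")
    values = list(values)
    # Prepend the padding, then cut back to the original length.
    return ([None] * periods + values)[:len(values)]


def rolling_mean(values, window):
    """Mean over a trailing window; None until the window is full."""
    if window < 1:
        raise ValueError("window must be at least 1")
    values = list(values)
    result = []
    for i in range(len(values)):
        if i + 1 < window:
            result.append(None)  # not enough history yet
        else:
            chunk = values[i + 1 - window:i + 1]
            result.append(sum(chunk) / window)
    return result


if __name__ == "__main__":
    daily_sales = [10, 12, 11, 15, 14]  # made-up numbers for illustration
    print(shift(daily_sales, 1))         # yesterday's value as a feature
    print(rolling_mean(daily_sales, 3))  # 3-day trailing average
```

Both helpers use only past values, which is what keeps such features usable for forecasting: the feature at time t never looks at data after t.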