This repository hosts the teaching materials for processing text data in the UvA Data Science course.
All the content in this repository is licensed under CC BY 4.0.
Credit: this teaching material was created by Robert van Straten and revised by Alejandro Monroy under the supervision of Yen-Chia Hsu.
When contributing code to this repository, please follow the guidelines below:
- The primary language for this repository is set to English. Please use English when writing comments and docstrings in the code. Please also use English when writing git issues, pull requests, wiki pages, commit messages, and the README file.
- Follow the Git Feature Branch Workflow. The main branch preserves the development history with no broken code. When working on a system feature, create a separate feature branch.
- Always create a pull request before merging the feature branch into the main branch. Doing so helps keep track of the project history and manage git issues.
- NEVER perform git rebasing on public branches, which means that you should not run "git rebase [FEATURE-BRANCH]" while you are on a public branch (e.g., the main branch). Rebasing rewrites the git history, so doing this on a public branch will badly confuse other developers whose work may be based on that branch. Check this tutorial for details.
- NEVER push credentials to the repository, such as database passwords or private keys used for digital signatures (e.g., user tokens).
- Request a code review when you are not sure if the feature branch can be safely merged into the main branch.
- Use the functional programming style (check this Python document for the concept). This means that each function is self-contained and does NOT depend on state that may change outside the function (e.g., global variables). Avoid the object-oriented programming style unless necessary. In this way, we can speed up development while keeping the code reusable.
- Minimize the use of global variables unless they are necessary (e.g., system configuration variables). For each function, avoid modifying its input parameters. In this way, each function stays independent, which makes it easier to debug the code and to assign coding tasks to a specific collaborator.
- Use a consistent coding style.
  - For Python, follow the PEP 8 style guide, for example, putting two blank lines between functions and using the lower_snake_case naming convention for variable and function names. Please use double quotes (not single quotes) for strings.
- Document functions and script files using docstrings.
  - For Python, follow the numpydoc style guide. Here is an example. A more detailed description of the numpydoc style can be found in LSST's docstring guide.
- Always comment the code, which helps others read the code and reduces our pain in the future when debugging or adding new features.
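As an illustration of the style guidelines above, here is a minimal sketch of two self-contained functions that do not modify their inputs, use double quotes and lower_snake_case names, and carry numpydoc-style docstrings. The function names (remove_stop_words, count_tokens) and the sample data are hypothetical and are not part of this repository.

```python
def remove_stop_words(tokens, stop_words):
    """
    Return the tokens that are not stop words.

    Parameters
    ----------
    tokens : list of str
        The tokens to filter. This list is not modified.
    stop_words : set of str
        Lowercase words that should be removed.

    Returns
    -------
    list of str
        A new list with the remaining tokens in their original order.
    """
    # Build a new list instead of deleting items from the input list
    return [token for token in tokens if token.lower() not in stop_words]


def count_tokens(tokens, lowercase=True):
    """
    Count how often each token appears.

    Parameters
    ----------
    tokens : list of str
        The tokens to count. This list is not modified.
    lowercase : bool
        Whether to convert tokens to lowercase before counting.

    Returns
    -------
    dict
        A new dictionary that maps each token to its number of occurrences.
    """
    # The result depends only on the arguments, not on any global state
    counts = {}
    for token in tokens:
        key = token.lower() if lowercase else token
        counts[key] = counts.get(key, 0) + 1
    return counts


# Example usage with made-up data
raw_tokens = ["The", "cat", "and", "the", "dog"]
filtered_tokens = remove_stop_words(raw_tokens, {"the", "and"})
print(count_tokens(filtered_tokens))
```

Because neither function changes raw_tokens or relies on global variables, each one can be tested and debugged on its own.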
Below are the steps to update and build this book. The steps assume that you already have miniconda installed. First, install the jupyter-book package:
$ conda create -n jupyterbook
$ conda activate jupyterbook
$ conda install python
$ pip install -U jupyter-book
$ pip install -U ghp-import
$ jupyter-book --help
Then, clone this repository and build the book:
$ git clone https://github.com/MultiX-Amsterdam/text-data-module
$ cd text-data-module
$ jupyter-book build .
To rebuild the entire book, use the following:
$ jupyter-book build --all .
Once the build is done, you can view the book by opening the files in the _build/html folder with a web browser. To update the book online in this GitHub repository (in the gh-pages branch), run the following:
$ ghp-import -n -p -f _build/html
The above command updates the gh-pages branch, which hosts the website. Finally, follow the normal git workflow to commit the changes and push the code to the main branch in this repository.