Analysis of the ICPSR 36404 dataset using descriptive machine learning. This work was produced as our final project for the Descriptive Learning discipline in Universidade Federal de Minas Gerais.
The paper (portuguese) for this work can be found under the paper
directory.
Author: Gabriel Bastos gabriel.s.b@live.com
First, download the delimited
version of the dataset. It is a tsv file, which is used as
input for the analysis program.
Then, install the Rust stable toolchain.
Compile this project with cargo build --release
. No additional steps should be necessary
in order to compile.
The produced program provides the following usage:
analyzer 0.1.0
gahag <gabriel.s.b@live.com>
USAGE:
icpsr-36404-analysis [SUBCOMMAND]
FLAGS:
-h, --help Prints help information
-V, --version Prints version information
SUBCOMMANDS:
distribution load the original dataset from stdin and display the data distribution
help Prints this message or the help of the given subcommand(s)
load load the serialized matrix from stdin and run the algorithm
run runs the entire pipeline
save load the original dataset from stdin and output the serialized matrix to stdout
icpsr-36404-analysis-run
runs the entire pipeline
USAGE:
icpsr-36404-analysis run [FLAGS] [OPTIONS] <min_sup>
FLAGS:
-h, --help Prints help information
--recidivists whether to include only recidivists
-V, --version Prints version information
OPTIONS:
--admission-type <admission_type> include only the given admission type [possible values: parole, new, other]
--race <race> include only the given race [possible values: black, white, hispanic,
other]
--sex <sex> include only the given sex [possible values: male, female]
ARGS:
<min_sup> the minimum support ratio ([0, 1.0])
Author: Fernanda fernandaguimaraes28@gmail.com
First, install the necessary python packages to run the notebook:
pip install scikit-learn
pip install pandas
pip install datetime
pip install numpy
pip install pysugbroup
After, it is necessary to put the data file on the same directory, or update the path in the notebook:
data_path = "36404-0001-Data.tsv"
That's it. Now just run the notebook with Jupyter. You can also select the subgroup
max_size
by altering the depth
parameter in the Subgroup Discovery
section.
The following Rust crates were developed in order to support this work:
This project is licenced under the MIT Licence.