- Introduction
- Requirements
- Feature Extraction
- Classifier Usage
- Repository Contents
- Credits
- License
This repository contains the feature extraction tools and the classifiers used for CookieBlock. Note that this repository contains the Python variant of the feature extraction, which differs from the JavaScript version.
Its outputs should not be used with the CookieBlock extension directly!
The feature extractor takes as input a JSON file containing the extracted cookie data, and outputs a sparse matrix of numerical features, where each row is a training sample and each column represents a single feature of the cookie.
The resulting sparse matrix can be used as input for training the classifiers. Currently implemented are XGBoost, LightGBM and CatBoost. There is also an attempt at an RNN classifier, but it is incomplete and was not used for the final report and paper.
The required libraries are listed in requirements.txt, located in the base folder. Tensorflow is also required for the RNN classifier file, but has been excluded from the requirements due to version compatibility issues.
No special setup is needed beyond installing these libraries.
In order to perform the feature extraction, some external resource files are used.
These files are provided within the folder resources/, but can also be recomputed as desired through the scripts located in the folder resource_construction/.
For more information on the contents of the resource folder, see the README contained within.
Each cookie should be stored as a JSON object, structured as follows:
{
  "cookie_id": {
    "name": "<name>",
    "domain": "<domain>",
    "path": "/path",
    "first_party_domain": "http://first-party-domain",
    "label": 0,
    "cmp_origin": 0,
    "variable_data": [
      {
        "value": "<cookie content>",
        "expiry": "<expiration in seconds>",
        "session": "<true/false>",
        "http_only": true,
        "host_only": true,
        "secure": true,
        "same_site": "<no restriction/lax/strict>"
      }
    ]
  }
}
Each object uniquely identifies a cookie.
The first_party_domain field is optional and may be left empty.
The variable_data attribute contains a list of cookie properties that may change with each update.
The label attribute is the cookie's purpose, and is only required if the data is to be used for training a classifier. For prediction purposes it is not needed.
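As a rough illustration of the format above, the following sketch parses such a JSON document and checks that each cookie carries the expected fields. The helper name `check_cookie` and the choice of which keys count as required are assumptions made for this example; they are not taken from the repository's code.

```python
import json

# Keys assumed to be required for every cookie (illustrative choice).
REQUIRED_KEYS = ("name", "domain", "path", "variable_data")
# Keys assumed to be required for every entry of "variable_data".
UPDATE_KEYS = ("value", "expiry", "session", "http_only",
               "host_only", "secure", "same_site")


def check_cookie(cookie, training=False):
    """Return a list of problems found in one cookie object (empty if none)."""
    problems = [f"missing key: {k}" for k in REQUIRED_KEYS if k not in cookie]
    # The label is only needed when the data is used for training.
    if training and "label" not in cookie:
        problems.append("missing key: label")
    for i, update in enumerate(cookie.get("variable_data", [])):
        for k in UPDATE_KEYS:
            if k not in update:
                problems.append(f"update {i}: missing key: {k}")
    return problems


# Made-up example cookie without a label, as it could be used for prediction.
example = json.loads("""
{
  "cookie_id": {
    "name": "session_token",
    "domain": "example.com",
    "path": "/",
    "first_party_domain": "",
    "variable_data": [
      {"value": "abc123", "expiry": 3600, "session": false,
       "http_only": true, "host_only": true, "secure": true,
       "same_site": "lax"}
    ]
  }
}
""")

for cookie_id, cookie in example.items():
    print(cookie_id, check_cookie(cookie), check_cookie(cookie, training=True))
```

The same cookie passes the check for prediction but is flagged when `training=True`, since it has no label.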
Scripts to gather the training data, as well as to generate the above JSON format from collected cookies can be found in the Consent Crawler repository:
https://github.com/dibollinger/CookieBlock-Consent-Crawler
The feature extraction takes as input the cookie information in JSON format, and computes from it a sparse matrix of numerical data. A large selection of feature extraction steps is supported. For the full list of them, refer to the features.json file, which contains a short description for each feature.
From each cookie, we extract three distinct categories of features:
- Features extracted once per cookie: Only computed once for each unique cookie, over all updates. These features use information that cannot be altered through any changes to that cookie. This includes name, domain, path and other properties like the host_only flag.
- Features extracted once per update: These features are based on variable cookie data, such as the payload of the cookie itself. Since these values may change, they are hence extracted once for each observed update to a cookie.
- Features stemming from update differences: These features are computed when at least 2 updates are present, and are computed from the changes that were made between updates. Examples for this are the difference in expiration date, or the edit distance between the values of two cookie updates.
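The three categories can be sketched as follows. The concrete features shown here (name length, value length, expiry difference, edit distance) are simplified stand-ins chosen for illustration, and all function names are hypothetical; the actual feature set is the one defined in features.json.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def per_cookie_features(cookie):
    # Computed once: uses only properties that stay fixed for the cookie.
    return [len(cookie["name"]), cookie["path"].count("/")]


def per_update_features(update):
    # Computed for every observed update of the cookie.
    return [len(update["value"]), int(update["expiry"])]


def per_diff_features(older, newer):
    # Computed for each pair of consecutive updates.
    return [int(newer["expiry"]) - int(older["expiry"]),
            edit_distance(older["value"], newer["value"])]


def extract(cookie):
    updates = cookie["variable_data"]
    return {
        "per_cookie": per_cookie_features(cookie),
        "per_update": [per_update_features(u) for u in updates],
        # Empty when fewer than two updates are present.
        "per_diff": [per_diff_features(a, b)
                     for a, b in zip(updates, updates[1:])],
    }


cookie = {
    "name": "sid",
    "path": "/",
    "variable_data": [
        {"value": "abc", "expiry": 100},
        {"value": "abd", "expiry": 160},
    ],
}
print(extract(cookie))
```

With a single update, the `per_diff` list is simply empty, which matches the requirement that at least two updates are needed for difference features.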
The extracted features are entirely numerical, i.e. no text or categorical data remains. However, the feature vector may contain boolean, ordinal or even missing data. Missing data is represented as zero entries in the per-update and per-diff features.
The features.json file acts as a config file to the extractor. It defines which features exist, what functions are used to extract them, what arguments are to be provided, whether they require any resources to be loaded in advance, and whether they are enabled or disabled for the next feature extraction run. Note that the "vector_size" key indicates how many feature columns in the sparse matrix a single function produces.
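To illustrate how such a configuration could determine the matrix layout, the sketch below assigns column ranges to the enabled features based on their "vector_size". Only the "vector_size" key is taken from the description above; the other keys ("name", "function", "enabled"), the flat list structure and the feature names are assumptions for this example and may not match the real features.json.

```python
import json

# Hypothetical configuration fragment; the real file may be structured differently.
config_text = """
[
  {"name": "name_length", "function": "feature_name_length",
   "vector_size": 1, "enabled": true},
  {"name": "top_names", "function": "feature_top_names",
   "vector_size": 5, "enabled": false},
  {"name": "same_site", "function": "feature_same_site",
   "vector_size": 3, "enabled": true}
]
"""


def column_layout(features):
    """Map each enabled feature to its (first_column, end_column_exclusive)."""
    layout, offset = {}, 0
    for feat in features:
        if not feat["enabled"]:
            continue  # disabled features occupy no columns
        layout[feat["name"]] = (offset, offset + feat["vector_size"])
        offset += feat["vector_size"]
    return layout, offset


layout, total_columns = column_layout(json.loads(config_text))
print(layout, total_columns)
```

In this sketch, disabling a feature shifts the columns of all later features, which is one reason to store the feature names alongside the matrix.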
Each run of the feature extraction will also produce statistics on the duration of the extraction process.
To extract data with labels for the purpose of training a classifier, run the script prepare_training_data.py with the desired inputs. The resulting data matrix will be stored in the subfolder processed_features/.
prepare_training_data.py <tr_data>... [--format <FORMAT>] [--out <OFPATH>]
Options:
-f --format <FORMAT> Output format. Options: {libsvm, sparse, debug, xgb} [default: sparse]
-o --out <OFPATH> Filename for the output. If not specified, will reuse input filename.
The results can be output in libsvm text format, as a pickled sparse matrix, or in the form of an XGB data matrix. In each case, the script also produces a list of labels, weights and feature names. For the XGB output these are already integrated into the binary, while for the other formats they are output as separate files.
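For readers unfamiliar with the libsvm text format, the following sketch shows how one sparse row can be written as a line of `<label> <index>:<value>` pairs, with zero entries omitted. The function name `to_libsvm_line` is hypothetical, and the indices are written as given; whether the repository's output uses zero-based or one-based column indices is not stated here.

```python
def to_libsvm_line(label, row):
    """Format one sample as '<label> <index>:<value> ...', skipping zeros.

    `row` maps column index -> numeric value (a simple sparse row).
    """
    entries = [f"{idx}:{val}" for idx, val in sorted(row.items()) if val != 0]
    return " ".join([str(label)] + entries)


# Two made-up samples with a handful of non-zero features each.
samples = [
    (2, {0: 1.0, 3: 0, 5: 0.5}),
    (0, {7: 3600, 1: 12}),
]
for label, row in samples:
    print(to_libsvm_line(label, row))
```

Because only non-zero entries are written, the text file stays compact even when the feature matrix has many columns.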
The basic usage of each classifier script is as follows:
Usage:
train_xgb.py <train_data> <mode>
train_lightgbm.py <train_data> <mode>
train_catboost.py <train_data> <mode>
Inside the subfolder classifiers/ one will find a number of scripts that train the corresponding type of classifier. Generally, these scripts support the following modes:
- simple_train: Trains the classifier without any performance validation. This mode is useful to prepare a model trained on the full data, which will then predict purposes for unseen cookies (no validation required).
- split: Performs a simple 80/20 train/test split, training the classifier on 80% of the data while validating on the remaining 20%. In addition to producing validation statistics during training, this will also output a confusion matrix and an accuracy score, taking the purpose with the greatest probability as the prediction.
- cross_validate: Performs 5-fold stratified cross-validation on the training data. Unlike the train-test split, this will not output a confusion matrix, and it will not output a final model, but it does provide added guarantees on the performance of a classifier, such as mean and standard deviation over 5 folds.
- grid_search: Performs hyperparameter search using the grid search approach.
- random_search: Performs hyperparameter search using random combinations of parameters.
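Two of the ideas behind these modes can be sketched in plain Python: assigning samples to stratified folds, and turning predicted class probabilities into a confusion matrix and accuracy score by taking the most probable class. This is a minimal illustration with hypothetical function names; the training scripts themselves presumably rely on the respective libraries' own utilities.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=5, seed=0):
    """Assign each sample index to one of k folds, balancing every class."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the samples of each class round-robin across the folds.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds


def evaluate(probabilities, true_labels, num_classes):
    """Confusion matrix (rows: true, columns: predicted) and accuracy."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for probs, truth in zip(probabilities, true_labels):
        # The class with the greatest probability is the prediction.
        predicted = max(range(num_classes), key=lambda c: probs[c])
        matrix[truth][predicted] += 1
    correct = sum(matrix[c][c] for c in range(num_classes))
    return matrix, correct / len(true_labels)


labels = [0] * 10 + [1] * 5
folds = stratified_folds(labels)
print([len(fold) for fold in folds])

matrix, accuracy = evaluate([[0.7, 0.3], [0.4, 0.6], [0.9, 0.1]], [0, 1, 1], 2)
print(matrix, accuracy)
```

Stratification keeps the class proportions of each fold close to those of the full data set, which matters when some cookie purposes are much rarer than others.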
For XGBoost, the classifiers/ directory contains a number of additional scripts to produce feature importance and other similar statistics. Please refer to the documentation in the individual script files for more information.
Finally, the base folder contains two scripts, predict_class.py and feature_matrix_statistics.py. The former is a simple wrapper to compute predictions for given cookie data, while the latter can be used to output the most and least commonly extracted features in the cookie data.
To produce the model used for the CookieBlock extension, one first needs to run the feature extraction found at:
https://github.com/dibollinger/CookieBlock
The JavaScript feature extractor works in the same way as the Python variant. It exists to prevent minor differences in the cookie feature extraction process between Python and JavaScript from negatively affecting the quality of predictions.
Once the sparse matrix has been produced, provide it as an input to the XGBoost script at:
classifiers/xgboost/train_xgb.py
Then, with the resulting model as input, run the script
classifiers/xgboost/create_small_dump.py
This produces four tree model files in JSON format, named forest_classX.json. Each file contains a compressed representation of the tree model that the browser extension can read and use to make decisions. Each file corresponds to a specific class and should not be renamed.
- ./classifiers/: Contains Python scripts to train the various classifier types.
- ./feature_extraction/: Contains all Python code that handles feature extraction.
- ./feature_extraction/features.json: Configuration where one can define, configure, enable or disable individual features.
- ./resource_construction/: Contains the Python scripts that were used to construct the feature extraction resources.
- ./processed_features/: Directory where the extracted feature matrices are stored.
- ./resources/: Contains external resource files used for the feature extraction.
- ./training_data/: Contains some examples of the JSON-formatted training data.
- ./feature_matrix_statistics.py: Produces the most and least commonly used features in a tree model, sorted by occurrence.
- ./predict_class.py: Using a previously constructed classifier model, and given JSON cookie data as input, predicts labels for each cookie.
- ./prepare_training_data.py: Main script to transform input cookie data (in JSON format) into a sparse feature matrix. The feature selection and parameters are configured by feature_extraction/features.json.
This repository was created as part of the master thesis "Analyzing Cookies Compliance with the GDPR", which can be found at:
https://www.research-collection.ethz.ch/handle/20.500.11850/477333
as well as the paper "Automating Cookie Consent and GDPR Violation Detection", which can be found at:
https://karelkubicek.github.io/post/cookieblock.html
Thesis supervision and co-authors:
- Karel Kubicek
- Dr. Carlos Cotrini
- Prof. Dr. David Basin
- Information Security Group at ETH Zürich
See also the following repositories for other components that were developed as part of the thesis:
- CookieBlock Browser Extension
- OpenWPM-based Consent Crawler
- Violation Detection
- Prototype Consent Crawler
- Collected Data
This repository uses the XGBoost, LightGBM and CatBoost algorithms, as well as Tensorflow.
They can be found at:
- XGBoost: https://github.com/dmlc/xgboost/
- LightGBM: https://github.com/microsoft/LightGBM
- CatBoost: https://github.com/catboost
- Tensorflow: https://www.tensorflow.org/
Copyright © 2021-2022 Dino Bollinger, Department of Computer Science at ETH Zürich, Information Security Group
MIT License, see included LICENSE file