sanash43 committed Apr 12, 2024
2 parents 8acfa22 + 5c20767 commit adae5ff
Showing 7 changed files with 206 additions and 28 deletions.
91 changes: 90 additions & 1 deletion CHANGELOG.md
@@ -1,6 +1,94 @@
# CHANGELOG



## v0.1.3 (2024-04-12)

### Fix

* fix: added missing documentation to test_preprocessing.py ([`afb208f`](https://github.com/DSCI-310-2024/py_predpurchase/commit/afb208f4b609dedf0872d3a03d09e30534511222))


## v0.1.2 (2024-04-12)

### Fix

* fix: fixing syntax error in metrics function ([`d65e84e`](https://github.com/DSCI-310-2024/py_predpurchase/commit/d65e84e9ecb5395696ed09b6cbbfd8a1f7b1cc59))


## v0.1.1 (2024-04-11)

### Fix

* fix: updating docstrings in functions ([`ad585b7`](https://github.com/DSCI-310-2024/py_predpurchase/commit/ad585b7aa7ae829f50f6a92d799caf06fa40bc1b))

### Unknown

* other packages docs added ([`550f39f`](https://github.com/DSCI-310-2024/py_predpurchase/commit/550f39feba0b52085b5155a4dc1b95eaacb9613e))

* minor update in packages for usage example ([`4cb880e`](https://github.com/DSCI-310-2024/py_predpurchase/commit/4cb880ea85e09bbecaade2ae05ea0e78d947eefa))

* added python to fenced code block ([`ee4f9d9`](https://github.com/DSCI-310-2024/py_predpurchase/commit/ee4f9d90aef244e9093db78e5e4be42475ecabaa))

* included usage example of calculate_classification_metrics function ([`edd382f`](https://github.com/DSCI-310-2024/py_predpurchase/commit/edd382f681dd95ddd3b5f72f062a68a979057f94))

* Reran all code blocks to fix import errors ([`9f864ee`](https://github.com/DSCI-310-2024/py_predpurchase/commit/9f864eed7dd43501912636b0080250b84e470741))

* Added the Read the Docs link for full documentation. ([`51f00c3`](https://github.com/DSCI-310-2024/py_predpurchase/commit/51f00c3a17143e4b2dbbcb4b3fe1861969e73f5c))

* change python version to string, removed requirements.txt for now ([`1a2cafa`](https://github.com/DSCI-310-2024/py_predpurchase/commit/1a2cafa10ecd5c2f6b2c17baaf63b68357f58a5e))

* change python version to non string ([`d21b3d3`](https://github.com/DSCI-310-2024/py_predpurchase/commit/d21b3d31e262ae0f881f6e596dd0cad198907ad4))

* includeded installation for requirements.txt ([`7b9b664`](https://github.com/DSCI-310-2024/py_predpurchase/commit/7b9b664941798a21e001bb68cd9ef8e715af6db5))

* Merge pull request #9 from DSCI-310-2024/documentation

Documentation ([`0a0b5bb`](https://github.com/DSCI-310-2024/py_predpurchase/commit/0a0b5bb83b9717e42a9abc07c6dba633b65270cc))

* Fixed italicization ([`cbc2247`](https://github.com/DSCI-310-2024/py_predpurchase/commit/cbc224795c7bedc186e819958ad44fc6b788adb1))

* Updated usage section to redirect to the Example usage page. ([`4bd1007`](https://github.com/DSCI-310-2024/py_predpurchase/commit/4bd1007e772592cf9ba281bdc0e21c2e4b2de277))

* Fixed header ordering ([`ffbd248`](https://github.com/DSCI-310-2024/py_predpurchase/commit/ffbd24881d6a47c5dc450e472ec7836773e4185d))

* Deleted some irrelevant imports for usage example ([`6f1cf06`](https://github.com/DSCI-310-2024/py_predpurchase/commit/6f1cf06100a520850ab79114502c36412cb961fc))

* Outlined usage examples for all functions. ([`81f881d`](https://github.com/DSCI-310-2024/py_predpurchase/commit/81f881d53f866ad5008771ad6ce15358892f84b4))

* added badge markdown code to README ([`944d049`](https://github.com/DSCI-310-2024/py_predpurchase/commit/944d049537292a878d5f4669f45c962ef16a1bd8))

* fixing badge error in ci-cd.yml file ([`077dddb`](https://github.com/DSCI-310-2024/py_predpurchase/commit/077dddb005080737798f54020a805feeed2aecf4))

* fixed spelling error ([`0133f4f`](https://github.com/DSCI-310-2024/py_predpurchase/commit/0133f4fae9665b5a74aa6145c44646ebaf8c9976))

* updated function paths ([`86315ca`](https://github.com/DSCI-310-2024/py_predpurchase/commit/86315ca90b7950c09f75b38a958a3be7b41c0352))

* adjusting from statment to reference package name in all test files ([`669bebc`](https://github.com/DSCI-310-2024/py_predpurchase/commit/669bebc330ac0a8bc9a490c3632fd649fe579c0a))

* Merge pull request #7 from DSCI-310-2024/documentation

Documentation; changes in README.md for example usage, added documentation to all functions and test files. ([`1c5d281`](https://github.com/DSCI-310-2024/py_predpurchase/commit/1c5d281e9eabaf019a395b9d556401495bcb0ca3))

* Merge branch 'main' into documentation ([`c3f6617`](https://github.com/DSCI-310-2024/py_predpurchase/commit/c3f6617e3c3d73ff562b531c87ba008f2e9cd347))

* Merge pull request #8 from DSCI-310-2024/badge-cov

Badge cov ([`5ba7eae`](https://github.com/DSCI-310-2024/py_predpurchase/commit/5ba7eae87ff490c6078ec791b87f9bbc05eddcec))

* adding code for badge into ci-cd.yml file ([`5241c98`](https://github.com/DSCI-310-2024/py_predpurchase/commit/5241c98b8042cde96bfa1bbfaac47937a156c929))

* adding code for badge into ci-cd.yml file ([`6e5d4de`](https://github.com/DSCI-310-2024/py_predpurchase/commit/6e5d4dec0fca3838c03948bdc3c03215537b999c))

* Updating with documentation for each test ([`eb5f7c2`](https://github.com/DSCI-310-2024/py_predpurchase/commit/eb5f7c2c8c98f12a6d53c83170e4f5c84983924b))

* Changed for @calvinyhchoi's updates in analysis repository ([`2e56bf3`](https://github.com/DSCI-310-2024/py_predpurchase/commit/2e56bf3c7abfe16a6d11a5bdae2df813fa065752))

* Added in updated function from the analysis repository ([`3d88b29`](https://github.com/DSCI-310-2024/py_predpurchase/commit/3d88b2957ce424d37121726791093362aec5c47a))

* Updated Usage section to include all functions in the package ([`f778959`](https://github.com/DSCI-310-2024/py_predpurchase/commit/f77895969e1a2e8f2cc2115438b40204403e1f89))


## v0.1.0 (2024-04-11)

### Build
@@ -23,6 +111,8 @@ Model cross val test updated ([`1a4de5f`](https://github.com/DSCI-310-2024/py_pr

* adding @calvinyhchoi 's changes to model_cross_val function and test files ([`76ebe27`](https://github.com/DSCI-310-2024/py_predpurchase/commit/76ebe27402f71ff2ee14cd604afbf8ea65761918))

* Updated to include feature writeup ([`d2933e8`](https://github.com/DSCI-310-2024/py_predpurchase/commit/d2933e86b2388d987e30646e4a4880d4d6f783fa))

* fixing description ([`f108b0e`](https://github.com/DSCI-310-2024/py_predpurchase/commit/f108b0ed2d98781918d90b64471dc450616cf636))

* Merge branch 'main' of https://github.com/DSCI-310-2024/py_predpurchase ([`7b58fbb`](https://github.com/DSCI-310-2024/py_predpurchase/commit/7b58fbba609a4a04c03354d4590d0f2c08fc1e78))
@@ -46,4 +136,3 @@ Clarified purchasing intentions in the introduction ([`bbb2231`](https://github.
* removing ci/cd yml file for now ([`95dbb01`](https://github.com/DSCI-310-2024/py_predpurchase/commit/95dbb01f2147361fb3c187d50ffd2ad85d109a97))

* initial package setup ([`6ea0b59`](https://github.com/DSCI-310-2024/py_predpurchase/commit/6ea0b59d454aa997515ef49452f152e4edb50045))

16 changes: 11 additions & 5 deletions README.md
@@ -2,7 +2,7 @@

[![codecov](https://codecov.io/gh/DSCI-310-2024/py_predpurchase/graph/badge.svg?token=ykj5GDrW0K)](https://codecov.io/gh/DSCI-310-2024/py_predpurchase)

```py_predpurchase``` is a package for predicting online shopper purchasing intentions, whether an online shopper will make a purchase from their current browsing session or not. This package contains functions to aid with the data analysis processes including conducting data preprocessing as well as calculating classification metrics, cross validation scores and feature importances.
```py_predpurchase``` is a package for predicting online shopper purchasing intentions, that is, whether an online shopper will make a purchase during their current browsing session. This package contains functions to aid with the data analysis process, including data preprocessing and the calculation of classification metrics, cross-validation scores and feature importances.

**Full Documentation hosted on Read the Docs**: https://py-predpurchase.readthedocs.io/en/latest/index.html

@@ -17,11 +17,11 @@ $ pip install py_predpurchase
```py_predpurchase``` can be used to:

* Apply preprocessing transformations to the data, including scaling, encoding, and passing through features as specified.
* Calculate the cross validation results for a four common off-the-shelf models (Dummy, KNN, SVM and RandomForests)
* Fit a given model, and extract feature importances, sorted in descending order, and returns them as a DataFrame.
* Calculate the cross-validation results for four common off-the-shelf models (Dummy, KNN, SVM and RandomForests).
* Fit a given model, extract its feature importances, sort them in descending order, and return them as a DataFrame.
* Calculate the classification metrics for model predictions including precision, recall, accuracy and F1 scores.

*Please refer to the 'Example usage' page on the [Read the Docs](https://py-predpurchase.readthedocs.io/en/latest/index.html) package documentation for a step by step, demonstration of each function in this package.*
*Please refer to the 'Example usage' page on the [Read the Docs](https://py-predpurchase.readthedocs.io/en/latest/index.html) package documentation for a step-by-step demonstration of each function in this package.*

Below is an example usage for one of our functions, `calculate_classification_metrics`:

@@ -41,12 +41,18 @@ calculate_classification_metrics(y_true, y_pred)

## Contributing

Interested in contributing? Check out the contributing guidelines. Please note that this project is released with a Code of Conduct. By contributing to this project, you agree to abide by its terms.
Interested in contributing? Check out the contributing guidelines. Please note that this project is released with a [Code of Conduct](https://github.com/DSCI-310-2024/py_predpurchase/blob/main/CONDUCT.md). By contributing to this project, you agree to abide by its terms.

## License

`py_predpurchase` was created by Nour Abdelfattah, Sana Shams, Calvin Choi, Sai Pusuluri. It is licensed under the terms of the MIT license.

## Other packages

`pandas`: Pandas is an extensive tool for data manipulation, whereas py_predpurchase focuses on applying machine learning with basic data manipulation, offering functions that use off-the-shelf machine learning models. Compared with something like the [E-Commerce Tools Package](https://pypi.org/project/ecommercetools/0.42.9/), our use of `pandas` along with `sklearn` lets us manipulate and analyze data at a lower level. The E-Commerce Tools Package caters more to transactional data, with tools and functions for stock management and ledger items. `pandas` provides a simpler solution suited to the dataset used in py_predpurchase, which covers consumer behaviour and e-commerce marketing metrics that are less complex.

`scikit-learn`: Scikit-learn excels at model building; py_predpurchase builds on it with tools for interpreting model outcomes. Beyond scikit-learn's general-purpose API, our package includes specific methods for detailing the impact of each predictor on the purchasing decision, which supports a deeper understanding of model behaviour alongside the cross-validation scores. These specialized insights can help you improve your model's predictive performance in the context of online shopping.

## Credits

`py_predpurchase` was created with [`cookiecutter`](https://cookiecutter.readthedocs.io/en/latest/) and the `py-pkgs-cookiecutter` [template](https://github.com/py-pkgs/py-pkgs-cookiecutter).
2 changes: 1 addition & 1 deletion pyproject.toml
@@ -1,6 +1,6 @@
[tool.poetry]
name = "py_predpurchase"
version = "0.1.0"
version = "0.1.3"
description = "```py_predpurchase``` is a package for predicting online shopper purchasing intentions, containing functions to aid with data analysis processes including conducting data preprocessing as well as calculating classification metrics, cross validation scores and feature importances. The package features functions that focus mainly on analyzing the data and evaluating model performance."
authors = ["Nour Abdelfattah, Sana Shams, Calvin Choi, Sai Pusuluri"]
license = "MIT"
23 changes: 19 additions & 4 deletions src/py_predpurchase/function_classification_metrics.py
@@ -4,14 +4,29 @@
def calculate_classification_metrics(y_true, y_pred):
    """
    Calculates classification metrics for model predictions including precision,
    recall, accuracy and F1 scores.
    recall, accuracy, and F1 scores.
    Parameters:
    - y_true: pd.Series, true target values in a dataset
    - y_pred: pd.Series, predicted target values by the model.
    ----------
    y_true : array-like or pd.Series
        True target values in a dataset.
    y_pred : array-like or pd.Series
        Predicted target values by the model.
    Returns:
    - dict, containing precision, recall, accuracy, and F1 score.
    ----------
    dict
        Contains precision, recall, accuracy, and F1 score.
    Examples:
    --------
    Assume `y_true` and `y_pred` are as follows:
    >>> y_true = [0, 1, 2, 0, 1]
    >>> y_pred = [0, 2, 1, 0, 0]
    >>> calculate_classification_metrics(y_true, y_pred)
    """

    if not all(isinstance(y, (int, float, np.number)) for y in np.concatenate([y_true, y_pred])):
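The docstring above names four metrics without defining them. As a minimal sketch of what they measure, assuming binary labels with `1` as the positive class, the following pure-Python function computes them from raw counts. The name `binary_classification_metrics`, its `positive` parameter, and the dictionary keys are hypothetical; they are not part of `py_predpurchase`, whose own implementation is not shown in this diff.

```python
def binary_classification_metrics(y_true, y_pred, positive=1):
    """Hypothetical helper (not the package's function): precision, recall,
    accuracy, and F1 for a single positive class, computed from raw counts."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("y_true and y_pred must not be empty")

    pairs = list(zip(y_true, y_pred))
    true_pos = sum(1 for t, p in pairs if t == positive and p == positive)
    false_pos = sum(1 for t, p in pairs if t != positive and p == positive)
    false_neg = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    # Fall back to 0.0 when a denominator is zero (an assumption of this sketch).
    precision = true_pos / (true_pos + false_pos) if (true_pos + false_pos) else 0.0
    recall = true_pos / (true_pos + false_neg) if (true_pos + false_neg) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0

    return {
        "precision": precision,
        "recall": recall,
        "accuracy": correct / len(pairs),
        "f1": f1,
    }


metrics = binary_classification_metrics([1, 0, 1, 1, 0], [1, 0, 0, 1, 1])
print(metrics)
```

Multi-class labels, such as those in the docstring example, would need an averaging strategy on top of this; the sketch deliberately leaves that out.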
39 changes: 30 additions & 9 deletions src/py_predpurchase/function_model_cross_val.py
@@ -12,15 +12,36 @@ def model_cross_validation(preprocessed_training_data, preprocessed_testing_data
    using preprocessed and cleaned training and testing datasets. Random forests and Dummy hyperparameters are fixed for simplicity's sake.
    Parameters:
    - preprocessed_training_data: DataFrame, cleaned and preprocessed training data
    - preprocessed_testing_data: DataFrame, cleaned and preprocessed testing data
    - target: str target column name
    - k: k value hyperparameter for KNearestNeighbours Int
    - gamma: gamma value hyperparameter for SVM
    Returns:
    - dictionary, containing cross validation results (mean and std of scores) from specified model
    """
    ----------
    preprocessed_training_data : DataFrame
        Cleaned and preprocessed training data.
    preprocessed_testing_data : DataFrame
        Cleaned and preprocessed testing data.
    target : str
        Target column name in the dataset.
    k : int
        Hyperparameter 'k' value for KNearestNeighbours.
    gamma : float
        Hyperparameter 'gamma' value for SVM.
    Returns:
    ----------
    dict
        Contains cross-validation results (mean and std of scores) for each specified model.
    Examples:
    --------
    Assuming dataset is preprocessed and split into training and testing sets, with 'target' as the target column:
    >>> results = model_cross_validation(preprocessed_training_data, preprocessed_testing_data, 'target', k=5, gamma=0.1)
    >>> pd.DataFrame(results)
    This will output the cross-validation results for each model, displaying the mean and standard deviation of the scores (also includes train scores).
    Notes:
    -------
    The function assumes that the input data is already scaled and encoded.
    """

    train_data = preprocessed_training_data
    test_data = preprocessed_testing_data
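The docstring describes the return value as the mean and standard deviation of cross-validation scores for each model. As a rough sketch of that summarising step only, assuming the per-fold scores are already available, the helper below reduces them to `<metric>_mean` and `<metric>_std` entries. The name `summarize_cv_scores`, the key naming, and the sample scores are all hypothetical, and the choice of population standard deviation (NumPy's default for `std`) is an assumption rather than something this diff confirms.

```python
from statistics import mean, pstdev


def summarize_cv_scores(fold_scores):
    """Hypothetical helper: collapse per-fold scores into mean and std
    for every model and metric."""
    summary = {}
    for model_name, metrics in fold_scores.items():
        summary[model_name] = {}
        for metric_name, scores in metrics.items():
            summary[model_name][f"{metric_name}_mean"] = mean(scores)
            # Population standard deviation; an assumption of this sketch.
            summary[model_name][f"{metric_name}_std"] = pstdev(scores)
    return summary


# Made-up fold scores for two of the four model families named in the README.
fold_scores = {
    "dummy": {"test_score": [0.5, 0.5, 0.5, 0.5], "train_score": [0.5, 0.5, 0.5, 0.5]},
    "knn": {"test_score": [0.8, 0.9, 0.7, 0.8], "train_score": [0.9, 1.0, 0.9, 1.0]},
}
summary = summarize_cv_scores(fold_scores)
print(summary["knn"])
```

A nested dictionary of this shape can be passed to `pd.DataFrame` to get one column per model, which matches the usage shown in the docstring example.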
40 changes: 32 additions & 8 deletions src/py_predpurchase/function_preprocessing.py
@@ -8,16 +8,40 @@ def numerical_categorical_preprocess(X_train, X_test, y_train, y_test, numeric_f
    This function requires target data to be provided and includes it in the output DataFrames.
    Parameters:
    - X_train: DataFrame, training feature data
    - X_test: DataFrame, testing feature data
    - y_train: DataFrame, training target data
    - y_test: DataFrame, testing target data
    - numeric_features: list, names of numeric features to scale
    - categorical_features: list, names of categorical features to encode
    ----------
    X_train : DataFrame
        Training feature data.
    X_test : DataFrame
        Testing feature data.
    y_train : DataFrame or Series
        Training target data.
    y_test : DataFrame or Series
        Testing target data.
    numeric_features : list
        Names of numeric features to scale.
    categorical_features : list
        Names of categorical features to encode.
    Returns:
    - Tuple containing preprocessed training and testing DataFrames including target data, and transformed column names
    ----------
    Tuple
        Contains preprocessed training and testing DataFrames including target data,
        and transformed column names.
    Examples:
    --------
    Assume you want to transform the following features and your data set has already been split
    into train and test
    >>> numeric_features = ['feature1', 'feature2']
    >>> categorical_features = ['feature3', 'feature4']
    >>> train_transformed, test_transformed, transformed_columns = numerical_categorical_preprocess(
            X_train, X_test, y_train, y_test, numeric_features, categorical_features)
    The function will transform feature1,2,3,4 accordingly, carrying out scaling and one-hot encoding and
    storing the preprocessed data in 'train_transformed' and 'test_transformed'. Column names will also be stored in
    'transformed_columns'.
    """


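The docstring says numeric features are scaled and categorical features are one-hot encoded, with the transformed column names returned alongside the data. The pure-Python sketch below illustrates that idea on plain dictionaries rather than DataFrames. The function `fit_transform_sketch` is hypothetical and is not how the package is implemented; the `numeric__` prefix mirrors the one checked in `tests/test_preprocessing.py`, while the `categorical__` prefix and the all-zero encoding of unseen categories are assumptions.

```python
from statistics import mean, pstdev


def fit_transform_sketch(train_rows, test_rows, numeric_features, categorical_features):
    """Hypothetical helper: standard-scale numeric columns and one-hot encode
    categorical ones, learning all statistics from the training rows only."""
    # Learn mean and standard deviation per numeric feature from the training data.
    stats = {}
    for feature in numeric_features:
        values = [row[feature] for row in train_rows]
        stats[feature] = (mean(values), pstdev(values) or 1.0)  # avoid dividing by zero

    # Learn the category list per categorical feature from the training data.
    categories = {f: sorted({row[f] for row in train_rows}) for f in categorical_features}

    columns = [f"numeric__{f}" for f in numeric_features]
    columns += [f"categorical__{f}_{c}" for f in categorical_features for c in categories[f]]

    def transform(rows):
        transformed = []
        for row in rows:
            new_row = {}
            for f in numeric_features:
                mu, sigma = stats[f]
                new_row[f"numeric__{f}"] = (row[f] - mu) / sigma
            for f in categorical_features:
                for c in categories[f]:
                    # Categories unseen during fitting end up as all zeros.
                    new_row[f"categorical__{f}_{c}"] = 1.0 if row[f] == c else 0.0
            transformed.append(new_row)
        return transformed

    return transform(train_rows), transform(test_rows), columns


train = [{"pages": 2.0, "visitor": "new"}, {"pages": 4.0, "visitor": "returning"}]
test = [{"pages": 3.0, "visitor": "other"}]
train_t, test_t, cols = fit_transform_sketch(train, test, ["pages"], ["visitor"])
print(cols)
```

Fitting on the training rows and reusing those statistics for the test rows keeps information from the test set out of the transformation.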
23 changes: 23 additions & 0 deletions tests/test_preprocessing.py
@@ -53,34 +53,57 @@ def sample_data():
    return X_train, X_test, y_train, y_test, numeric_features, categorical_features

def test_shape(sample_data):
    """
    Tests that the transformed training and testing data retain the same number of rows
    to ensure no data loss occurred during preprocessing.
    """
    X_train, X_test, y_train, y_test, numeric_features, categorical_features = sample_data
    train_transformed, test_transformed, _ = numerical_categorical_preprocess(X_train, X_test, y_train, y_test, numeric_features, categorical_features)

    assert train_transformed.shape[0] == X_train.shape[0], "Train data row count mismatch after transformation."
    assert test_transformed.shape[0] == X_test.shape[0], "Test data row count mismatch after transformation."

def test_null_values(sample_data):
    """
    Ensures there are no null values in the transformed datasets
    """
    X_train, X_test, _, _, numeric_features, categorical_features = sample_data
    train_transformed, test_transformed, _ = numerical_categorical_preprocess(X_train, X_test, None, None, numeric_features, categorical_features)
    assert not train_transformed.isnull().any().any(), "Null values found in transformed training data"
    assert not test_transformed.isnull().any().any(), "Null values found in transformed test data"


def test_revenue_preservation(sample_data):
    """
    Tests that the 'Revenue' target column is kept unaltered after preprocessing.
    """
    X_train, X_test, y_train, y_test, numeric_features, categorical_features = sample_data
    train_transformed, test_transformed, _ = numerical_categorical_preprocess(X_train, X_test, y_train, y_test, numeric_features, categorical_features)

    assert np.array_equal(train_transformed['Revenue'], y_train), "Revenue data altered in training set."
    assert np.array_equal(test_transformed['Revenue'], y_test), "Revenue data altered in testing set."

def test_numerical_features_transformation(sample_data):
    """
    Tests that all specified numeric features are included in the transformed data
    with the correct application of scaling.
    """
    X_train, X_test, y_train, y_test, numeric_features, _ = sample_data
    train_transformed, test_transformed, transformed_columns = numerical_categorical_preprocess(X_train, X_test, y_train, y_test, numeric_features, [])

    for feature in numeric_features:
        assert any(col.startswith(f'numeric__{feature}') for col in transformed_columns), f"Numeric feature '{feature}' not found in transformed columns."

def test_categorical_features_transformation(sample_data):
    """
    Tests that categorical features are correctly one-hot encoded and included in the transformed
    data, which will be indicated by the presence of transformed column names.
    """
    X_train, X_test, y_train, y_test, _, categorical_features = sample_data
    train_transformed, test_transformed, transformed_columns = numerical_categorical_preprocess(X_train, X_test, y_train, y_test, [], categorical_features)
