Merge branch 'main' of https://github.com/DSCI-310-2024/py_predpurchase

DSCI-310-2024 · Apr 12, 2024 · commit adae5ff
2 parents 8acfa22 + 5c20767

Showing 7 changed files with 206 additions and 28 deletions.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -1,6 +1,94 @@
 # CHANGELOG
 
 
+
+## v0.1.3 (2024-04-12)
+
+### Fix
+
+* fix: added missing documentation to test_preprocessing.py ([`afb208f`](https://github.com/DSCI-310-2024/py_predpurchase/commit/afb208f4b609dedf0872d3a03d09e30534511222))
+
+
+## v0.1.2 (2024-04-12)
+
+### Fix
+
+* fix: fixing syntax error in metrics function ([`d65e84e`](https://github.com/DSCI-310-2024/py_predpurchase/commit/d65e84e9ecb5395696ed09b6cbbfd8a1f7b1cc59))
+
+
+## v0.1.1 (2024-04-11)
+
+### Fix
+
+* fix: updating docstrings in functions ([`ad585b7`](https://github.com/DSCI-310-2024/py_predpurchase/commit/ad585b7aa7ae829f50f6a92d799caf06fa40bc1b))
+
+### Unknown
+
+* other packages docs added ([`550f39f`](https://github.com/DSCI-310-2024/py_predpurchase/commit/550f39feba0b52085b5155a4dc1b95eaacb9613e))
+
+* minor update in packages for usage example ([`4cb880e`](https://github.com/DSCI-310-2024/py_predpurchase/commit/4cb880ea85e09bbecaade2ae05ea0e78d947eefa))
+
+* added python to fenced code block ([`ee4f9d9`](https://github.com/DSCI-310-2024/py_predpurchase/commit/ee4f9d90aef244e9093db78e5e4be42475ecabaa))
+
+* included usage example of calculate_classification_metrics function ([`edd382f`](https://github.com/DSCI-310-2024/py_predpurchase/commit/edd382f681dd95ddd3b5f72f062a68a979057f94))
+
+* Reran all code blocks to fix import errors ([`9f864ee`](https://github.com/DSCI-310-2024/py_predpurchase/commit/9f864eed7dd43501912636b0080250b84e470741))
+
+* Added the Read the Docs link for full documentation. ([`51f00c3`](https://github.com/DSCI-310-2024/py_predpurchase/commit/51f00c3a17143e4b2dbbcb4b3fe1861969e73f5c))
+
+* change python version to string, removed requirements.txt for now ([`1a2cafa`](https://github.com/DSCI-310-2024/py_predpurchase/commit/1a2cafa10ecd5c2f6b2c17baaf63b68357f58a5e))
+
+* change python version to non string ([`d21b3d3`](https://github.com/DSCI-310-2024/py_predpurchase/commit/d21b3d31e262ae0f881f6e596dd0cad198907ad4))
+
+* includeded installation for requirements.txt ([`7b9b664`](https://github.com/DSCI-310-2024/py_predpurchase/commit/7b9b664941798a21e001bb68cd9ef8e715af6db5))
+
+* Merge pull request #9 from DSCI-310-2024/documentation
+
+Documentation ([`0a0b5bb`](https://github.com/DSCI-310-2024/py_predpurchase/commit/0a0b5bb83b9717e42a9abc07c6dba633b65270cc))
+
+* Fixed italicization ([`cbc2247`](https://github.com/DSCI-310-2024/py_predpurchase/commit/cbc224795c7bedc186e819958ad44fc6b788adb1))
+
+* Updated usage section to redirect to the Example usage page. ([`4bd1007`](https://github.com/DSCI-310-2024/py_predpurchase/commit/4bd1007e772592cf9ba281bdc0e21c2e4b2de277))
+
+* Fixed header ordering ([`ffbd248`](https://github.com/DSCI-310-2024/py_predpurchase/commit/ffbd24881d6a47c5dc450e472ec7836773e4185d))
+
+* Deleted some irrelevant imports for usage example ([`6f1cf06`](https://github.com/DSCI-310-2024/py_predpurchase/commit/6f1cf06100a520850ab79114502c36412cb961fc))
+
+* Outlined usage examples for all functions. ([`81f881d`](https://github.com/DSCI-310-2024/py_predpurchase/commit/81f881d53f866ad5008771ad6ce15358892f84b4))
+
+* added badge markdown code to README ([`944d049`](https://github.com/DSCI-310-2024/py_predpurchase/commit/944d049537292a878d5f4669f45c962ef16a1bd8))
+
+* fixing badge error in ci-cd.yml file ([`077dddb`](https://github.com/DSCI-310-2024/py_predpurchase/commit/077dddb005080737798f54020a805feeed2aecf4))
+
+* fixed spelling error ([`0133f4f`](https://github.com/DSCI-310-2024/py_predpurchase/commit/0133f4fae9665b5a74aa6145c44646ebaf8c9976))
+
+* updated function paths ([`86315ca`](https://github.com/DSCI-310-2024/py_predpurchase/commit/86315ca90b7950c09f75b38a958a3be7b41c0352))
+
+* adjusting from statment to reference package name in all test files ([`669bebc`](https://github.com/DSCI-310-2024/py_predpurchase/commit/669bebc330ac0a8bc9a490c3632fd649fe579c0a))
+
+* Merge pull request #7 from DSCI-310-2024/documentation
+
+Documentation; changes in README.md for example usage, added documentation to all functions and test files. ([`1c5d281`](https://github.com/DSCI-310-2024/py_predpurchase/commit/1c5d281e9eabaf019a395b9d556401495bcb0ca3))
+
+* Merge branch 'main' into documentation ([`c3f6617`](https://github.com/DSCI-310-2024/py_predpurchase/commit/c3f6617e3c3d73ff562b531c87ba008f2e9cd347))
+
+* Merge pull request #8 from DSCI-310-2024/badge-cov
+
+Badge cov ([`5ba7eae`](https://github.com/DSCI-310-2024/py_predpurchase/commit/5ba7eae87ff490c6078ec791b87f9bbc05eddcec))
+
+* adding code for badge into ci-cd.yml file ([`5241c98`](https://github.com/DSCI-310-2024/py_predpurchase/commit/5241c98b8042cde96bfa1bbfaac47937a156c929))
+
+* adding code for badge into ci-cd.yml file ([`6e5d4de`](https://github.com/DSCI-310-2024/py_predpurchase/commit/6e5d4dec0fca3838c03948bdc3c03215537b999c))
+
+* Updating with documentation for each test ([`eb5f7c2`](https://github.com/DSCI-310-2024/py_predpurchase/commit/eb5f7c2c8c98f12a6d53c83170e4f5c84983924b))
+
+* Changed for @calvinyhchoi's updates in analysis repository ([`2e56bf3`](https://github.com/DSCI-310-2024/py_predpurchase/commit/2e56bf3c7abfe16a6d11a5bdae2df813fa065752))
+
+* Added in updated function from the analysis repository ([`3d88b29`](https://github.com/DSCI-310-2024/py_predpurchase/commit/3d88b2957ce424d37121726791093362aec5c47a))
+
+* Updated Usage section to include all functions in the package ([`f778959`](https://github.com/DSCI-310-2024/py_predpurchase/commit/f77895969e1a2e8f2cc2115438b40204403e1f89))
+
+
 ## v0.1.0 (2024-04-11)
 
 ### Build
@@ -23,6 +111,8 @@ Model cross val test updated ([`1a4de5f`](https://github.com/DSCI-310-2024/py_pr
 
 * adding @calvinyhchoi's changes to model_cross_val function and test files ([`76ebe27`](https://github.com/DSCI-310-2024/py_predpurchase/commit/76ebe27402f71ff2ee14cd604afbf8ea65761918))
 
+* Updated to include feature writeup ([`d2933e8`](https://github.com/DSCI-310-2024/py_predpurchase/commit/d2933e86b2388d987e30646e4a4880d4d6f783fa))
+
 * fixing description ([`f108b0e`](https://github.com/DSCI-310-2024/py_predpurchase/commit/f108b0ed2d98781918d90b64471dc450616cf636))
 
 * Merge branch 'main' of https://github.com/DSCI-310-2024/py_predpurchase ([`7b58fbb`](https://github.com/DSCI-310-2024/py_predpurchase/commit/7b58fbba609a4a04c03354d4590d0f2c08fc1e78))
@@ -46,4 +136,3 @@ Clarified purchasing intentions in the introduction ([`bbb2231`](https://github.
 * removing ci/cd yml file for now ([`95dbb01`](https://github.com/DSCI-310-2024/py_predpurchase/commit/95dbb01f2147361fb3c187d50ffd2ad85d109a97))
 
 * initial package setup ([`6ea0b59`](https://github.com/DSCI-310-2024/py_predpurchase/commit/6ea0b59d454aa997515ef49452f152e4edb50045))
-
diff --git a/README.md b/README.md
@@ -2,7 +2,7 @@
 
 [![codecov](https://codecov.io/gh/DSCI-310-2024/py_predpurchase/graph/badge.svg?token=ykj5GDrW0K)](https://codecov.io/gh/DSCI-310-2024/py_predpurchase)
 
-```py_predpurchase``` is a package for predicting online shopper purchasing intentions, whether an online shopper will make a purchase from their current browsing session or not. This package contains functions to aid with the data analysis processes including conducting data preprocessing as well as calculating classification metrics, cross validation scores and feature importances.
+```py_predpurchase``` is a package for predicting online shopper purchasing intentions: whether or not an online shopper will make a purchase during their current browsing session. This package contains functions to aid with the data analysis process, including data preprocessing and calculating classification metrics, cross-validation scores and feature importances.
 
 **Full Documentation hosted on Read the Docs**: https://py-predpurchase.readthedocs.io/en/latest/index.html
 
@@ -17,11 +17,11 @@ $ pip install py_predpurchase
 ```py_predpurchase``` can be used to:
 
 * Apply preprocessing transformations to the data, including scaling, encoding, and passing through features as specified.
-* Calculate the cross validation results for a four common off-the-shelf models (Dummy, KNN, SVM and RandomForests)
-* Fit a given model, and extract feature importances, sorted in descending order, and returns them as a DataFrame.
+* Calculate the cross-validation results for four common off-the-shelf models (Dummy, KNN, SVM and RandomForests)
+* Fit a given model, extract feature importances, sort in descending order, and return them as a DataFrame.
 * Calculate the classification metrics for model predictions including precision, recall, accuracy and F1 scores.
 
-*Please refer to the 'Example usage' page on the [Read the Docs](https://py-predpurchase.readthedocs.io/en/latest/index.html) package documentation for a step by step, demonstration of each function in this package.*
+*Please refer to the 'Example usage' page on the [Read the Docs](https://py-predpurchase.readthedocs.io/en/latest/index.html) package documentation for a step-by-step demonstration of each function in this package.*
 
 Below is an example usage for one of our functions, `calculate_classification_metrics` 
 
@@ -41,12 +41,18 @@ calculate_classification_metrics(y_true, y_pred)
 
 ## Contributing
 
-Interested in contributing? Check out the contributing guidelines. Please note that this project is released with a Code of Conduct. By contributing to this project, you agree to abide by its terms.
+Interested in contributing? Check out the contributing guidelines. Please note that this project is released with a [Code of Conduct](https://github.com/DSCI-310-2024/py_predpurchase/blob/main/CONDUCT.md). By contributing to this project, you agree to abide by its terms.
 
 ## License
 
 `py_predpurchase` was created by Nour Abdelfattah, Sana Shams, Calvin Choi, Sai Pusuluri. It is licensed under the terms of the MIT license.
 
+## Other packages
+
+`pandas`: Pandas is an extensive tool for data manipulation, whereas py_predpurchase specializes in applying machine learning on top of basic data manipulation, offering functions that make use of off-the-shelf machine learning models. Compared with something like the [E-Commerce Tools Package](https://pypi.org/project/ecommercetools/0.42.9/), our use of `pandas` along with `sklearn` lets us manipulate and analyze data in a more basic setting. The E-Commerce Tools Package caters more to transactional data, with tools and functions for stock management and ledger items. `pandas` provides a simpler solution suited to the dataset used in py_predpurchase, which pertains to consumer behaviour and e-commerce marketing metrics that are less sophisticated.
+
+`scikit-learn`: Scikit-learn excels at model building, and py_predpurchase builds on it by providing tools for interpreting model outcomes. Unlike scikit-learn's broader approach, our package includes specific functions for detailing the impact of each predictor on the purchasing decision, allowing for a deeper understanding of model dynamics alongside the cross-validation scores. These specialized insights can help you improve your model's predictive performance in the context of online shopping.
+
 ## Credits
 
 `py_predpurchase` was created with [`cookiecutter`](https://cookiecutter.readthedocs.io/en/latest/) and the `py-pkgs-cookiecutter` [template](https://github.com/py-pkgs/py-pkgs-cookiecutter).
diff --git a/pyproject.toml b/pyproject.toml
@@ -1,6 +1,6 @@
 [tool.poetry]
 name = "py_predpurchase"
-version = "0.1.0"
+version = "0.1.3"
 description = "```py_predpurchase``` is a package for predicting online shopper purchasing intentions, containing functions to aid with data analysis processes including conducting data preprocessing as well as calculating classification metrics, cross-validation scores and feature importances. The package features functions that focus mainly on analyzing the data and evaluating model performance."
 authors = ["Nour Abdelfattah, Sana Shams, Calvin Choi, Sai Pusuluri"]
 license = "MIT"

diff --git a/src/py_predpurchase/function_classification_metrics.py b/src/py_predpurchase/function_classification_metrics.py
@@ -4,14 +4,29 @@
 def calculate_classification_metrics(y_true, y_pred):
     """
     Calculates classification metrics for model predictions including precision, 
-    recall, accuracy and F1 scores. 
+    recall, accuracy, and F1 scores. 
     
     Parameters:
-    - y_true: pd.Series, true target values in a dataset
-    - y_pred: pd.Series, predicted target values by the model.
+    ----------
+    y_true : array-like or pd.Series
+        True target values in a dataset.
+    y_pred : array-like or pd.Series
+        Predicted target values by the model.
     
     Returns:
-    - dict, containing precision, recall, accuracy, and F1 score.
+    ----------
+    dict
+        Contains precision, recall, accuracy, and F1 score.
+    
+    Examples:
+    --------
+
+    Assume `y_true` and `y_pred` are as follows:
+    
+    >>> y_true = [0, 1, 2, 0, 1]
+    >>> y_pred = [0, 2, 1, 0, 0]
+    >>> calculate_classification_metrics(y_true, y_pred)
+
     """
 
     if not all(isinstance(y, (int, float, np.number)) for y in np.concatenate([y_true, y_pred])):
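
A minimal sketch of how the four metrics named in the docstring relate to one another, written in plain Python for binary labels. It is an illustration only, not the package's implementation, which this diff does not show: the helper name `binary_classification_metrics` and the sample labels are hypothetical.

```python
def binary_classification_metrics(y_true, y_pred, positive=1):
    """Return precision, recall, accuracy and F1 for binary labels.

    Hypothetical helper for illustration; not part of py_predpurchase.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("y_true and y_pred must not be empty")

    pairs = list(zip(y_true, y_pred))
    # Count the confusion-matrix cells that the metrics depend on.
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    # Fall back to 0.0 when a denominator is zero (a choice made for this sketch).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)

    return {
        "precision": precision,
        "recall": recall,
        "accuracy": correct / len(pairs),
        "f1": f1,
    }


# Hypothetical labels: one purchase (1) is missed by the predictions.
metrics = binary_classification_metrics([0, 1, 1, 0, 1], [0, 1, 0, 0, 1])
print(metrics)
```

Here precision is perfect because no non-purchase is predicted as a purchase, while recall drops because one real purchase is missed; F1 is the harmonic mean of the two.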

diff --git a/src/py_predpurchase/function_model_cross_val.py b/src/py_predpurchase/function_model_cross_val.py
@@ -12,15 +12,36 @@ def model_cross_validation(preprocessed_training_data, preprocessed_testing_data
 	using preprocessed and cleaned training and testing datasets. Random forests and Dummy hyperparameters are fixed for simplicity's sake.
 	
 	Parameters:
-	- preprocessed_training_data: DataFrame, cleaned and preprocessed training data 
-	- preprocessed_testing_data: DataFrame, cleaned and preprocessed testing data
-	- target: str target column name
-	- k: k value hyperparameter for KNearestNeighbours Int
-	- gamma: gamma value hyperparameter for SVM
-	
-	Returns:
-	- dictionary, containing cross validation results (mean and std of scores) from specified model
-	"""
+    ----------
+    preprocessed_training_data : DataFrame
+        Cleaned and preprocessed training data.
+    preprocessed_testing_data : DataFrame
+        Cleaned and preprocessed testing data.
+    target : str
+        Target column name in the dataset.
+    k : int
+        Hyperparameter 'k' value for KNearestNeighbours.
+    gamma : float
+        Hyperparameter 'gamma' value for SVM.
+
+    Returns:
+    ----------
+    dict
+        Contains cross-validation results (mean and std of scores) for each specified model.
+
+    Examples:
+    --------
+    Assuming the dataset is preprocessed and split into training and testing sets, with 'target' as the target column:
+
+    >>> results = model_cross_validation(preprocessed_training_data, preprocessed_testing_data, 'target', k=5, gamma=0.1)
+    >>> pd.DataFrame(results)
+
+    This will output the cross-validation results for each model, displaying the mean and standard deviation of the scores (also includes train scores).
+
+    Notes:
+    -------
+    The function assumes that the input data is already scaled and encoded.
+    """
 
 	train_data = preprocessed_training_data
 	test_data = preprocessed_testing_data
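
The "mean and std of scores" in the Returns section can be illustrated with a small, dependency-free sketch. It assumes unshuffled contiguous folds and a most-frequent-class baseline standing in for the Dummy model; the helper names and the toy labels are hypothetical and do not reflect the package's internals, which this diff does not show.

```python
from collections import Counter
from statistics import mean, pstdev


def k_fold_indices(n_samples, n_folds):
    """Yield (train_idx, val_idx) pairs for contiguous, unshuffled folds."""
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)
        val_idx = list(range(start, start + size))
        train_idx = [i for i in range(n_samples)
                     if i < start or i >= start + size]
        yield train_idx, val_idx
        start += size


def dummy_cross_validation(y, n_folds=5):
    """Cross-validate a most-frequent-class baseline (hypothetical helper).

    Returns the mean and population standard deviation of fold accuracies.
    """
    scores = []
    for train_idx, val_idx in k_fold_indices(len(y), n_folds):
        # "Fit": remember the majority class seen in the training folds.
        majority = Counter(y[i] for i in train_idx).most_common(1)[0][0]
        # "Score": accuracy of always predicting that class on the held-out fold.
        correct = sum(1 for i in val_idx if y[i] == majority)
        scores.append(correct / len(val_idx))
    return {"mean_test_score": mean(scores), "std_test_score": pstdev(scores)}


# Hypothetical target column: 1 marks a session that ended in a purchase.
y = [0, 0, 1, 0, 0, 1, 0, 0, 1, 0]
results = dummy_cross_validation(y, n_folds=5)
print(results)
```

A baseline like this gives a floor against which the KNN, SVM and random forest scores can be compared.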

diff --git a/src/py_predpurchase/function_preprocessing.py b/src/py_predpurchase/function_preprocessing.py
@@ -8,16 +8,40 @@ def numerical_categorical_preprocess(X_train, X_test, y_train, y_test, numeric_f
     This function requires target data to be provided and includes it in the output DataFrames.
 
     Parameters:
-    - X_train: DataFrame, training feature data
-    - X_test: DataFrame, testing feature data
-    - y_train: DataFrame, training target data
-    - y_test: DataFrame, testing target data
-    - numeric_features: list, names of numeric features to scale
-    - categorical_features: list, names of categorical features to encode
-    
+    ----------
+    X_train : DataFrame
+        Training feature data.
+    X_test : DataFrame
+        Testing feature data.
+    y_train : DataFrame or Series
+        Training target data.
+    y_test : DataFrame or Series
+        Testing target data.
+    numeric_features : list
+        Names of numeric features to scale.
+    categorical_features : list
+        Names of categorical features to encode.
     
     Returns:
-    - Tuple containing preprocessed training and testing DataFrames including target data, and transformed column names
+    ----------
+    Tuple
+        Contains preprocessed training and testing DataFrames including target data, 
+        and transformed column names.
+   
+    Examples:
+    --------
+    Assume you want to transform the following features and your data set has already been split
+    into train and test sets:
+
+    >>> numeric_features = ['feature1', 'feature2']
+    >>> categorical_features = ['feature3', 'feature4']
+    >>> train_transformed, test_transformed, transformed_columns = numerical_categorical_preprocess(
+    ...     X_train, X_test, y_train, y_test, numeric_features, categorical_features)
+    
+    The function will scale the numeric features (feature1, feature2) and one-hot encode the
+    categorical features (feature3, feature4), storing the preprocessed data in 'train_transformed'
+    and 'test_transformed'. The transformed column names are returned in 'transformed_columns'.
+    
     """
 
 
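
A minimal sketch of the two transformations the docstring describes, scaling and one-hot encoding, in plain Python. It assumes that the scaling statistics and the category list are learned from the training split only and then reused on the test split; the helper names and sample values are hypothetical, and the package's own implementation is not shown in this diff.

```python
from statistics import mean, pstdev


def fit_scaler(values):
    """Learn the mean and standard deviation from training values."""
    mu = mean(values)
    sigma = pstdev(values) or 1.0  # avoid dividing by zero for constant columns
    return mu, sigma


def scale(values, mu, sigma):
    """Standardize values using statistics learned elsewhere."""
    return [(v - mu) / sigma for v in values]


def one_hot(values, categories):
    """Encode each value as a 0/1 row over the known categories.

    A category not seen during fitting becomes an all-zero row
    (a choice made for this sketch).
    """
    return [[1 if v == c else 0 for c in categories] for v in values]


# Hypothetical numeric feature: fit on train, apply to both splits.
train_pages = [1.0, 2.0, 3.0]
test_pages = [2.0, 4.0]
mu, sigma = fit_scaler(train_pages)
train_scaled = scale(train_pages, mu, sigma)
test_scaled = scale(test_pages, mu, sigma)

# Hypothetical categorical feature: categories come from train only.
train_visitor = ["new", "returning", "new"]
test_visitor = ["returning", "other"]
categories = sorted(set(train_visitor))
train_encoded = one_hot(train_visitor, categories)
test_encoded = one_hot(test_visitor, categories)

print(train_scaled, test_scaled)
print(categories, train_encoded, test_encoded)
```

Fitting on the training split alone keeps information from the test split out of the transformation, which is why the test values are scaled with the training mean and standard deviation.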

diff --git a/tests/test_preprocessing.py b/tests/test_preprocessing.py
@@ -53,34 +53,57 @@ def sample_data():
     return X_train, X_test, y_train, y_test, numeric_features, categorical_features
 
 def test_shape(sample_data):
+    """
+    Tests that the transformed training and testing data retain the same number of rows
+    to ensure no data loss occurred during preprocessing.
+    
+    """
     X_train, X_test, y_train, y_test, numeric_features, categorical_features = sample_data
     train_transformed, test_transformed, _ = numerical_categorical_preprocess(X_train, X_test, y_train, y_test, numeric_features, categorical_features)
 
     assert train_transformed.shape[0] == X_train.shape[0], "Train data row count mismatch after transformation."
     assert test_transformed.shape[0] == X_test.shape[0], "Test data row count mismatch after transformation."
 
 def test_null_values(sample_data):
+    """
+    Ensures there are no null values in the transformed datasets.
+
+    """
     X_train, X_test, _, _, numeric_features, categorical_features = sample_data
     train_transformed, test_transformed, _ = numerical_categorical_preprocess(X_train, X_test, None, None, numeric_features, categorical_features)
     assert not train_transformed.isnull().any().any(), "Null values found in transformed training data"
     assert not test_transformed.isnull().any().any(), "Null values found in transformed test data"
 
 
 def test_revenue_preservation(sample_data):
+    """
+    Tests that the 'Revenue' target column is kept unaltered after preprocessing.
+
+    """
     X_train, X_test, y_train, y_test, numeric_features, categorical_features = sample_data
     train_transformed, test_transformed, _ = numerical_categorical_preprocess(X_train, X_test, y_train, y_test, numeric_features, categorical_features)
 
     assert np.array_equal(train_transformed['Revenue'], y_train), "Revenue data altered in training set."
     assert np.array_equal(test_transformed['Revenue'], y_test), "Revenue data altered in testing set."
 
 def test_numerical_features_transformation(sample_data):
+    """
+    Tests that all specified numeric features are included in the transformed data
+    with the correct application of scaling.
+
+    """
     X_train, X_test, y_train, y_test, numeric_features, _ = sample_data
     train_transformed, test_transformed, transformed_columns = numerical_categorical_preprocess(X_train, X_test, y_train, y_test, numeric_features, [])
 
     for feature in numeric_features:
         assert any(col.startswith(f'numeric__{feature}') for col in transformed_columns), f"Numeric feature '{feature}' not found in transformed columns."
 
 def test_categorical_features_transformation(sample_data):
+    """
+    Tests that categorical features are correctly one-hot encoded and included in the transformed
+    data, which will be indicated by the presence of transformed column names.
+    
+    """
     X_train, X_test, y_train, y_test, _, categorical_features = sample_data
     train_transformed, test_transformed, transformed_columns = numerical_categorical_preprocess(X_train, X_test, y_train, y_test, [], categorical_features)