Merge pull request #364 from dyvenia/dev
Release 0.4.0
m-paz authored Apr 7, 2022
2 parents 8026c4f + df4270a commit 1e281f6
Showing 45 changed files with 2,374 additions and 244 deletions.
68 changes: 64 additions & 4 deletions CHANGELOG.md
@@ -3,17 +3,73 @@ All notable changes to this project will be documented in this file.

The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).


## [Unreleased]

## [0.4.0] - 2022-04-07
### Added
- Added `custom_mail_state_handler` function that sends a mail notification using a custom SMTP server.
- Added new function `df_clean_column` that cleans DataFrame columns of special characters
- Added `df_clean_column` util task that removes special characters from a pandas DataFrame
- Added `MultipleFlows` flow class which enables running multiple flows in a given order (see the sketch after this list).
- Added `GetFlowNewDateRange` task to change date range based on Prefect flows
- Added `check_col_order` parameter in `ADLSToAzureSQL`
- Added new source `ASElite`
- Added KeyVault support in `CloudForCustomers` tasks
- Added `SQLServer` source
- Added `DuckDBToDF` task
- Added `DuckDBTransform` flow
- Added `SQLServerCreateTable` task
- Added `credentials` param to `BCPTask`
- Added `get_sql_dtypes_from_df` and `update_dict` util tasks
- Added `DuckDBToSQLServer` flow
- Added `if_exists="append"` option to `DuckDB.create_table_from_parquet()`
- Added `get_flow_last_run_date` util function
- Added `df_to_dataset` task util for writing DataFrames to data lakes using `pyarrow`
- Added retries to Cloud for Customers tasks
- Added `chunksize` parameter to `C4CToDF` task to allow pulling data in chunks
- Added `chunksize` parameter to `BCPTask` task to allow more control over the load process
- Added support for SQL Server's custom `datetimeoffset` type
- Added `AzureSQLToDF` task
- Added `AzureSQLUpsert` task
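
For illustration, a minimal sketch of running several existing flows in order with the new `MultipleFlows` class, based on the test added in this release; the flow names and the `dev` project are placeholders, and reading each entry as a `[flow name, project name]` pair is an assumption:
```
from viadot.flows import MultipleFlows

# Hypothetical flow names; each entry is assumed to be a [flow name, project name] pair.
flows_in_order = [
    ["Extract flow", "dev"],
    ["Load flow", "dev"],
]

flow_of_flows = MultipleFlows(name="example flow of flows", flows_list=flows_in_order)
flow_of_flows.run()  # runs the listed flows one after another, in the given order
```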

### Changed
- Changed the base class of `AzureSQL` to `SQLServer`
- `df_to_parquet()` task now creates directories if needed (see the sketch after this list)
- Added several more separators to check for automatically in `SAPRFC.to_df()`
- Upgraded `duckdb` version to 0.3.2
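
A minimal sketch of the new directory-creation behaviour of the `df_to_parquet()` util task; calling `.run()` directly and the `path` keyword follow the pattern of the other util tasks in this repository, while any further parameters are assumptions:
```
import pandas as pd

from viadot.task_utils import df_to_parquet

df = pd.DataFrame({"col1": [1, 2], "col2": ["a", "b"]})

# As of 0.4.0, missing intermediate directories in the target path are created automatically.
df_to_parquet.run(df=df, path="data/nested/subdir/output.parquet")
```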

### Fixed
- Fixed bug with `CheckColumnOrder` task
- Fixed OpenSSL config for old SQL Servers still using TLS < 1.2
- `BCPTask` now correctly handles custom SQL Server port
- Fixed `SAPRFC.to_df()` ignoring user-specified separator
- Fixed temporary CSV generated by the `DuckDBToSQLServer` flow not being cleaned up
- Fixed some mappings in `get_sql_dtypes_from_df()` and optimized performance
- Fixed `BCPTask` handling of file paths that contain spaces
- Fixed credential evaluation logic (`credentials` is now evaluated before `config_key`)
- Fixed "$top" and "$skip" values being ignored by `C4CToDF` task if provided in the `params` parameter
- Fixed `SQL.to_df()` incorrectly handling queries that begin with whitespace

### Removed
- Removed `autopick_sep` parameter from `SAPRFC` functions. The separator is now always picked automatically if not provided (see the sketch after this list).
- Removed `dtypes_to_json` task from `task_utils.py`
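
A hedged sketch of the resulting `SAPRFC` separator behaviour; the bare constructor call and the `query()` method with a `sep` argument are assumptions based on the entries above, not something this diff confirms:
```
from viadot.sources import SAPRFC

sap = SAPRFC()  # credentials assumed to come from the local viadot config / Key Vault

# With an explicit separator, it is now respected (previously it could be ignored).
sap.query("SELECT MATNR, MAKTX FROM MAKT WHERE SPRAS = 'E'", sep="|")
df = sap.to_df()

# Without a separator, a suitable one is always picked automatically
# (the former `autopick_sep` switch is gone).
sap.query("SELECT MATNR, MAKTX FROM MAKT WHERE SPRAS = 'E'")
df_auto = sap.to_df()
```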


## [0.3.2] - 2022-02-17
### Fixed
- Fixed an issue with schema info within the `CheckColumnOrder` class.


## [0.3.1] - 2022-02-17
### Changed
- `ADLSToAzureSQL` - added `remove_tab` parameter to remove unnecessary tab separators from data.

### Fixed
- Fixed an issue with the returned DataFrame within the `CheckColumnOrder` class.


## [0.3.0] - 2022-02-16
### Added
- new source `SAPRFC` for connecting with SAP using the `pyRFC` library (requires pyrfc as well as the SAP NW RFC library that can be downloaded [here](https://support.sap.com/en/product/connectors/nwrfcsdk.html))
@@ -37,6 +93,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- C4C connection with url and report_url optimization
- column mapper in C4C source


## [0.2.15] - 2022-01-12
### Added
- new option to `ADLSToAzureSQL` Flow - `if_exists="delete"`
@@ -50,10 +107,10 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0


## [0.2.14] - 2021-12-01

### Fixed
- authorization issue within `CloudForCustomers` source


## [0.2.13] - 2021-11-30
### Added
- Added support for file path to `CloudForCustomersReportToADLS` flow
@@ -67,6 +124,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- `Sharepoint` and `CloudForCustomers` sources will now provide an informative `CredentialError` which is also raised early. This will make issues with input credentials immediately clear to the user.
- Removed set_key_value from `CloudForCustomersReportToADLS` flow


## [0.2.12] - 2021-11-25
### Added
- Added `Sharepoint` source
@@ -80,18 +138,21 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- Added `df_to_parquet` task to task_utils.py
- Added `dtypes_to_json` task to task_utils.py


## [0.2.11] - 2021-10-30
### Fixed
- `ADLSToAzureSQL` - fixed an issue with the path to the CSV file.
- `SupermetricsToADLS` - fixed an issue with the local JSON path.


## [0.2.10] - 2021-10-29
### Release due to CI/CD error


## [0.2.9] - 2021-10-29
### Release due to CI/CD error


## [0.2.8] - 2021-10-29
### Changed
- CI/CD: `dev` image is now only published on push to the `dev` branch
@@ -124,6 +185,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- Fixed `ADLSToAzureSQL` breaking in `"append"` mode if the table didn't exist (#145).
- Fixed `ADLSToAzureSQL` breaking in promotion path for csv files.


## [0.2.6] - 2021-09-22
### Added
- Added flows library docs to the references page
@@ -249,14 +311,12 @@ specified in the `SUPERMETRICS_DEFAULT_USER` secret
- Tasks now use secrets for credential management (azure tasks use Azure Key Vault secrets)
- SQL source now has a default query timeout of 1 hour


### Fixed
- Fix `SQLite` tests
- Multiple stability improvements with retries and timeouts


## [0.1.12] - 2021-05-08

### Changed
- Moved from poetry to pip

40 changes: 36 additions & 4 deletions README.md
@@ -108,10 +108,42 @@ However, when developing, the easiest way is to use the provided Jupyter Lab con
2. Set up locally
3. Test your changes with `pytest`
4. Submit a PR. The PR should contain the following:
- new/changed functionality
- tests for the changes
- changes added to `CHANGELOG.md`
- any other relevant resources updated (esp. `viadot/docs`)

The general flow of working with this repository when contributing from a fork:
1. Pull before making any changes
2. Create a new branch with
```
git checkout -b <name>
```
3. Make your changes in the repository
4. Stage changes with
```
git add <files>
```
5. Commit the changes with
```
git commit -m <message>
```
__Note__: See our Style Guidelines for more information about commit messages and PR names.

6. Fetch and pull any changes that may have been made while you were working, using
```
git fetch <remote> <branch>
git checkout <remote>/<branch>
```
7. Push your changes to the repository using
```
git push origin <name>
```
8. Merge your branch to finish your contribution
```
git checkout <where_merging_to>
git merge <branch_to_merge>
```

Please follow the standards and best practices used within the library (e.g. when adding tasks, see how other tasks are constructed). For any questions, please reach out to us here on GitHub.

5 changes: 4 additions & 1 deletion docker/Dockerfile
@@ -18,14 +18,17 @@ RUN echo "Acquire::Check-Valid-Until \"false\";\nAcquire::Check-Date \"false\";"


# System packages
RUN apt update -q && yes | apt install -q vim unixodbc-dev build-essential \
curl python3-dev libboost-all-dev libpq-dev graphviz python3-gi sudo git
RUN pip install --upgrade cffi

RUN curl http://archive.ubuntu.com/ubuntu/pool/main/g/glibc/multiarch-support_2.27-3ubuntu1_amd64.deb \
-o multiarch-support_2.27-3ubuntu1_amd64.deb && \
apt install ./multiarch-support_2.27-3ubuntu1_amd64.deb

# Fix for old SQL Servers still using TLS < 1.2
RUN chmod +rwx /usr/lib/ssl/openssl.cnf && \
sed -i 's/SECLEVEL=2/SECLEVEL=1/g' /usr/lib/ssl/openssl.cnf

# ODBC -- make sure to pin driver version as it's reflected in odbcinst.ini
RUN curl https://packages.microsoft.com/keys/microsoft.asc | apt-key add - && \
8 changes: 6 additions & 2 deletions requirements.txt
@@ -1,5 +1,6 @@
azure-core==1.20.1
azure-storage-blob==12.9.0
click==8.0.1
black==21.11b1
mkdocs-autorefs==0.3.0
mkdocs-material-extensions==1.0.3
@@ -17,7 +18,7 @@ openpyxl==3.0.9
jupyterlab==3.2.4
azure-keyvault==4.1.0
azure-identity==1.7.1
great-expectations==0.14.12
matplotlib
adlfs==2021.10.0
PyGithub==1.55
@@ -26,4 +27,7 @@ imagehash==4.2.1
visions==0.7.4
sharepy==1.3.0
sql-metadata==2.3.0
duckdb==0.3.2
google-cloud==0.34.0
google-auth==2.6.2
sendgrid==6.9.7
12 changes: 11 additions & 1 deletion tests/integration/flows/test_adls_to_azure_sql.py
@@ -1,4 +1,5 @@
import pandas as pd
import os
from viadot.flows import ADLSToAzureSQL
from viadot.flows.adls_to_azure_sql import df_to_csv_task

@@ -53,5 +54,14 @@ def test_df_to_csv_task():
df = pd.DataFrame(data=d)
assert df["col1"].astype(str).str.contains("\t")[1] == True
task = df_to_csv_task
task.run(df, path="result.csv", remove_tab=True)
assert df["col1"].astype(str).str.contains("\t")[1] != True


def test_df_to_csv_task_none(caplog):
df = None
task = df_to_csv_task
path = "result_none.csv"
task.run(df, path=path, remove_tab=False)
assert "DataFrame is None" in caplog.text
assert os.path.isfile(path) == False
64 changes: 64 additions & 0 deletions tests/integration/flows/test_aselite_to_adls.py
@@ -0,0 +1,64 @@
import logging
import pandas as pd
import os
from typing import Any, Dict, List, Literal
from prefect import Flow
from prefect.tasks.secrets import PrefectSecret
from prefect.run_configs import DockerRun
from viadot.task_utils import df_to_csv, df_converts_bytes_to_int
from viadot.tasks.aselite import ASELiteToDF
from viadot.tasks import AzureDataLakeUpload
from viadot.flows.aselite_to_adls import ASELiteToADLS


TMP_FILE_NAME = "test_flow.csv"
MAIN_DF = None

df_task = ASELiteToDF()
file_to_adls_task = AzureDataLakeUpload()


def test_aselite_to_adls():

credentials_secret = PrefectSecret("aselite").run()
vault_name = PrefectSecret("AZURE_DEFAULT_KEYVAULT").run()

query_designer = """SELECT TOP 10 [ID]
,[SpracheText]
,[SpracheKat]
,[SpracheMM]
,[KatSprache]
,[KatBasisSprache]
,[CodePage]
,[Font]
,[Neu]
,[Upd]
,[UpdL]
,[LosKZ]
,[AstNr]
,[KomKz]
,[RKZ]
,[ParentLanguageNo]
,[UPD_FIELD]
FROM [UCRMDEV].[dbo].[CRM_00]"""

flow = ASELiteToADLS(
"Test flow",
query=query_designer,
sqldb_credentials_secret=credentials_secret,
vault_name=vault_name,
file_path=TMP_FILE_NAME,
to_path="raw/supermetrics/mp/result_df_flow_at_des_m.csv",
run_config=None,
)

result = flow.run()
assert result.is_successful()

MAIN_DF = pd.read_csv(TMP_FILE_NAME, delimiter="\t")

assert isinstance(MAIN_DF, pd.DataFrame) == True

assert MAIN_DF.shape == (10, 17)

os.remove(TMP_FILE_NAME)
25 changes: 25 additions & 0 deletions tests/integration/flows/test_multiple_flows.py
@@ -0,0 +1,25 @@
from viadot.flows import MultipleFlows
import logging


def test_multiple_flows_working(caplog):
list = [
["Flow of flows 1 test", "dev"],
["Flow of flows 2 - working", "dev"],
["Flow of flows 3", "dev"],
]
flow = MultipleFlows(name="test", flows_list=list)
with caplog.at_level(logging.INFO):
flow.run()
assert "All of the tasks succeeded." in caplog.text


def test_multiple_flows_not_working(caplog):
list = [
["Flow of flows 1 test", "dev"],
["Flow of flows 2 test - not working", "dev"],
["Flow of flows 3", "dev"],
]
flow = MultipleFlows(name="test", flows_list=list)
flow.run()
assert "One of the flows has failed!" in caplog.text
15 changes: 15 additions & 0 deletions tests/integration/tasks/test_aselite.py
@@ -0,0 +1,15 @@
from viadot.tasks import ASELiteToDF
import pandas as pd


def test_aselite_to_df():
query = """SELECT TOP (10) [usageid]
,[configid]
,[verticalid]
,[textgroupid]
,[nr]
,[storedate]
FROM [UCRMDEV_DESIGNER].[dbo].[PORTAL_APPLICATION_TEXTUSAGE]"""
task = ASELiteToDF()
df = task.run(query=query)
assert isinstance(df, pd.DataFrame)
12 changes: 11 additions & 1 deletion tests/integration/tasks/test_azure_sql.py
@@ -111,7 +111,7 @@ def test_check_column_order_append_diff_col_number(caplog):
ValidationError,
match=r"Detected discrepancies in number of columns or different column names between the CSV file and the SQL table!",
):
check_column_order.run(table=TABLE, schema=SCHEMA, if_exists="append", df=df)


def test_check_column_order_replace(caplog):
@@ -132,3 +132,13 @@ def test_check_column_order_replace(caplog):
with caplog.at_level(logging.INFO):
check_column_order.run(table=TABLE, if_exists="replace", df=df)
assert "The table will be replaced." in caplog.text


def test_check_column_order_append_not_exists(caplog):
check_column_order = CheckColumnOrder()
data = {"id": [1], "street": ["Green"], "name": ["Tom"]}
df = pd.DataFrame(data)
check_column_order.run(
table="non_existing_table_123", schema="sandbox", if_exists="append", df=df
)
assert "table doesn't exists" in caplog.text