Merge pull request #364 from dyvenia/dev
Release 0.4.0
m-paz authored Apr 7, 2022
2 parents 8026c4f + df4270a commit 1e281f6
Showing 45 changed files with 2,374 additions and 244 deletions.
68 changes: 64 additions & 4 deletions CHANGELOG.md
@@ -3,17 +3,73 @@ All notable changes to this project will be documented in this file.

The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).


## [Unreleased]

## [0.4.0] - 2022-04-07
### Added
- Added `custom_mail_state_handler` function that sends a mail notification using a custom SMTP server.
- Added new function `df_clean_column` that cleans DataFrame columns of special characters
- Added `df_clean_column` util task that removes special characters from a pandas DataFrame
- Added `MultipleFlows` flow class which enables running multiple flows in a given order (see the sketch after this list).
- Added `GetFlowNewDateRange` task to change date range based on Prefect flows
- Added `check_col_order` parameter in `ADLSToAzureSQL`
- Added new source `ASElite`
- Added KeyVault support in `CloudForCustomers` tasks
- Added `SQLServer` source
- Added `DuckDBToDF` task
- Added `DuckDBTransform` flow
- Added `SQLServerCreateTable` task
- Added `credentials` param to `BCPTask`
- Added `get_sql_dtypes_from_df` and `update_dict` util tasks
- Added `DuckDBToSQLServer` flow
- Added `if_exists="append"` option to `DuckDB.create_table_from_parquet()`
- Added `get_flow_last_run_date` util function
- Added `df_to_dataset` task util for writing DataFrames to data lakes using `pyarrow`
- Added retries to Cloud for Customers tasks
- Added `chunksize` parameter to `C4CToDF` task to allow pulling data in chunks
- Added `chunksize` parameter to `BCPTask` task to allow more control over the load process
- Added support for SQL Server's custom `datetimeoffset` type
- Added `AzureSQLToDF` task
- Added `AzureSQLUpsert` task
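
For illustration, a minimal sketch of running several existing flows in order with the new `MultipleFlows` class, based on the test added in this release; the flow names and the `dev` project are placeholders, and reading each entry as a `[flow name, project name]` pair is an assumption:
```
from viadot.flows import MultipleFlows

# Hypothetical flow names; each entry is assumed to be a [flow name, project name] pair.
flows_in_order = [
    ["Extract flow", "dev"],
    ["Load flow", "dev"],
]

flow_of_flows = MultipleFlows(name="example flow of flows", flows_list=flows_in_order)
flow_of_flows.run()  # runs the listed flows one after another, in the given order
```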

### Changed
- Changed the base class of `AzureSQL` to `SQLServer`
- `df_to_parquet()` task now creates directories if needed (see the sketch after this list)
- Added several more separators to check for automatically in `SAPRFC.to_df()`
- Upgraded `duckdb` version to 0.3.2
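
A minimal sketch of the new directory-creation behaviour of the `df_to_parquet()` util task; calling `.run()` directly and the `path` keyword follow the pattern of the other util tasks in this repository, while any further parameters are assumptions:
```
import pandas as pd

from viadot.task_utils import df_to_parquet

df = pd.DataFrame({"col1": [1, 2], "col2": ["a", "b"]})

# As of 0.4.0, missing intermediate directories in the target path are created automatically.
df_to_parquet.run(df=df, path="data/nested/subdir/output.parquet")
```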

### Fixed
- Fixed bug with `CheckColumnOrder` task
- Fixed OpenSSL config for old SQL Servers still using TLS < 1.2
- `BCPTask` now correctly handles custom SQL Server port
- Fixed `SAPRFC.to_df()` ignoring user-specified separator
- Fixed temporary CSV generated by the `DuckDBToSQLServer` flow not being cleaned up
- Fixed some mappings in `get_sql_dtypes_from_df()` and optimized performance
- Fixed `BCPTask` handling of file paths that contain spaces
- Fixed credential evaluation logic (`credentials` is now evaluated before `config_key`)
- Fixed "$top" and "$skip" values being ignored by `C4CToDF` task if provided in the `params` parameter
- Fixed `SQL.to_df()` incorrectly handling queries that begin with whitespace

### Removed
- Removed `autopick_sep` parameter from `SAPRFC` functions. The separator is now always picked automatically if not provided (see the sketch after this list).
- Removed `dtypes_to_json` task from `task_utils.py`
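
A hedged sketch of the resulting `SAPRFC` separator behaviour; the bare constructor call and the `query()` method with a `sep` argument are assumptions based on the entries above, not something this diff confirms:
```
from viadot.sources import SAPRFC

sap = SAPRFC()  # credentials assumed to come from the local viadot config / Key Vault

# With an explicit separator, it is now respected (previously it could be ignored).
sap.query("SELECT MATNR, MAKTX FROM MAKT WHERE SPRAS = 'E'", sep="|")
df = sap.to_df()

# Without a separator, a suitable one is always picked automatically
# (the former `autopick_sep` switch is gone).
sap.query("SELECT MATNR, MAKTX FROM MAKT WHERE SPRAS = 'E'")
df_auto = sap.to_df()
```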


## [0.3.2] - 2022-02-17
### Fixed
- Fixed an issue with schema info within the `CheckColumnOrder` class.


## [0.3.1] - 2022-02-17
### Changed
- `ADLSToAzureSQL` - added `remove_tab` parameter to remove unnecessary tab separators from data.

### Fixed
- Fixed an issue with the returned DataFrame within the `CheckColumnOrder` class.


## [0.3.0] - 2022-02-16
### Added
- new source `SAPRFC` for connecting with SAP using the `pyRFC` library (requires pyrfc as well as the SAP NW RFC library that can be downloaded [here](https://support.sap.com/en/product/connectors/nwrfcsdk.html))
@@ -37,6 +93,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- C4C connection with url and report_url optimization
- column mapper in C4C source


## [0.2.15] - 2022-01-12
### Added
- new option to `ADLSToAzureSQL` Flow - `if_exists="delete"`
@@ -50,10 +107,10 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0


## [0.2.14] - 2021-12-01

### Fixed
- authorization issue within `CloudForCustomers` source


## [0.2.13] - 2021-11-30
### Added
- Added support for file path to `CloudForCustomersReportToADLS` flow
@@ -67,6 +124,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- `Sharepoint` and `CloudForCustomers` sources will now provide an informative `CredentialError` which is also raised early. This will make issues with input credentials immediately clear to the user.
- Removed set_key_value from `CloudForCustomersReportToADLS` flow


## [0.2.12] - 2021-11-25
### Added
- Added `Sharepoint` source
@@ -80,18 +138,21 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- Added `df_to_parquet` task to task_utils.py
- Added `dtypes_to_json` task to task_utils.py


## [0.2.11] - 2021-10-30
### Fixed
- `ADLSToAzureSQL` - fixed an issue with the path to the CSV file.
- `SupermetricsToADLS` - fixed an issue with the local JSON path.


## [0.2.10] - 2021-10-29
### Release due to CI/CD error


## [0.2.9] - 2021-10-29
### Release due to CI/CD error


## [0.2.8] - 2021-10-29
### Changed
- CI/CD: `dev` image is now only published on push to the `dev` branch
@@ -124,6 +185,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- Fixed `ADLSToAzureSQL` breaking in `"append"` mode if the table didn't exist (#145).
- Fixed `ADLSToAzureSQL` breaking in promotion path for csv files.


## [0.2.6] - 2021-09-22
### Added
- Added flows library docs to the references page
@@ -249,14 +311,12 @@ specified in the `SUPERMETRICS_DEFAULT_USER` secret
- Tasks now use secrets for credential management (azure tasks use Azure Key Vault secrets)
- SQL source now has a default query timeout of 1 hour


### Fixed
- Fix `SQLite` tests
- Multiple stability improvements with retries and timeouts


## [0.1.12] - 2021-05-08

### Changed
- Moved from poetry to pip

40 changes: 36 additions & 4 deletions README.md
@@ -108,10 +108,42 @@ However, when developing, the easiest way is to use the provided Jupyter Lab con
2. Set up locally
3. Test your changes with `pytest`
4. Submit a PR. The PR should contain the following:
- new/changed functionality
- tests for the changes
- changes added to `CHANGELOG.md`
- any other relevant resources updated (esp. `viadot/docs`)

The general flow of working with this repository when contributing from a fork:
1. Pull before making any changes
2. Create a new branch with
```
git checkout -b <name>
```
3. Make your changes in the repository
4. Stage changes with
```
git add <files>
```
5. Commit the changes with
```
git commit -m <message>
```
__Note__: See our Style Guidelines for more information about commit messages and PR names.

6. Fetch and pull any changes that may have been made while you were working, using
```
git fetch <remote> <branch>
git checkout <remote>/<branch>
```
7. Push your changes to the repository using
```
git push origin <name>
```
8. Merge your branch to finish your contribution
```
git checkout <where_merging_to>
git merge <branch_to_merge>
```

Please follow the standards and best practices used within the library (e.g. when adding tasks, see how other tasks are constructed). For any questions, please reach out to us here on GitHub.

5 changes: 4 additions & 1 deletion docker/Dockerfile
@@ -18,14 +18,17 @@ RUN echo "Acquire::Check-Valid-Until \"false\";\nAcquire::Check-Date \"false\";"


# System packages
RUN apt update -q && yes | apt install -q vim unixodbc-dev build-essential \
curl python3-dev libboost-all-dev libpq-dev graphviz python3-gi sudo git
RUN pip install --upgrade cffi

RUN curl http://archive.ubuntu.com/ubuntu/pool/main/g/glibc/multiarch-support_2.27-3ubuntu1_amd64.deb \
-o multiarch-support_2.27-3ubuntu1_amd64.deb && \
apt install ./multiarch-support_2.27-3ubuntu1_amd64.deb

# Fix for old SQL Servers still using TLS < 1.2
RUN chmod +rwx /usr/lib/ssl/openssl.cnf && \
sed -i 's/SECLEVEL=2/SECLEVEL=1/g' /usr/lib/ssl/openssl.cnf

# ODBC -- make sure to pin driver version as it's reflected in odbcinst.ini
RUN curl https://packages.microsoft.com/keys/microsoft.asc | apt-key add - && \
8 changes: 6 additions & 2 deletions requirements.txt
@@ -1,5 +1,6 @@
azure-core==1.20.1
azure-storage-blob==12.9.0
click==8.0.1
black==21.11b1
mkdocs-autorefs==0.3.0
mkdocs-material-extensions==1.0.3
@@ -17,7 +18,7 @@ openpyxl==3.0.9
jupyterlab==3.2.4
azure-keyvault==4.1.0
azure-identity==1.7.1
great-expectations==0.14.12
matplotlib
adlfs==2021.10.0
PyGithub==1.55
@@ -26,4 +27,7 @@ imagehash==4.2.1
visions==0.7.4
sharepy==1.3.0
sql-metadata==2.3.0
duckdb==0.3.2
google-cloud==0.34.0
google-auth==2.6.2
sendgrid==6.9.7
12 changes: 11 additions & 1 deletion tests/integration/flows/test_adls_to_azure_sql.py
@@ -1,4 +1,5 @@
import pandas as pd
import os
from viadot.flows import ADLSToAzureSQL
from viadot.flows.adls_to_azure_sql import df_to_csv_task

@@ -53,5 +54,14 @@ def test_df_to_csv_task():
df = pd.DataFrame(data=d)
assert df["col1"].astype(str).str.contains("\t")[1] == True
task = df_to_csv_task
task.run(df, path="result.csv", remove_tab=True)
assert df["col1"].astype(str).str.contains("\t")[1] != True


def test_df_to_csv_task_none(caplog):
df = None
task = df_to_csv_task
path = "result_none.csv"
task.run(df, path=path, remove_tab=False)
assert "DataFrame is None" in caplog.text
assert os.path.isfile(path) == False
64 changes: 64 additions & 0 deletions tests/integration/flows/test_aselite_to_adls.py
@@ -0,0 +1,64 @@
import logging
import pandas as pd
import os
from typing import Any, Dict, List, Literal
from prefect import Flow
from prefect.tasks.secrets import PrefectSecret
from prefect.run_configs import DockerRun
from viadot.task_utils import df_to_csv, df_converts_bytes_to_int
from viadot.tasks.aselite import ASELiteToDF
from viadot.tasks import AzureDataLakeUpload
from viadot.flows.aselite_to_adls import ASELiteToADLS


TMP_FILE_NAME = "test_flow.csv"
MAIN_DF = None

df_task = ASELiteToDF()
file_to_adls_task = AzureDataLakeUpload()


def test_aselite_to_adls():

credentials_secret = PrefectSecret("aselite").run()
vault_name = PrefectSecret("AZURE_DEFAULT_KEYVAULT").run()

query_designer = """SELECT TOP 10 [ID]
,[SpracheText]
,[SpracheKat]
,[SpracheMM]
,[KatSprache]
,[KatBasisSprache]
,[CodePage]
,[Font]
,[Neu]
,[Upd]
,[UpdL]
,[LosKZ]
,[AstNr]
,[KomKz]
,[RKZ]
,[ParentLanguageNo]
,[UPD_FIELD]
FROM [UCRMDEV].[dbo].[CRM_00]"""

flow = ASELiteToADLS(
"Test flow",
query=query_designer,
sqldb_credentials_secret=credentials_secret,
vault_name=vault_name,
file_path=TMP_FILE_NAME,
to_path="raw/supermetrics/mp/result_df_flow_at_des_m.csv",
run_config=None,
)

result = flow.run()
assert result.is_successful()

MAIN_DF = pd.read_csv(TMP_FILE_NAME, delimiter="\t")

assert isinstance(MAIN_DF, pd.DataFrame) == True

assert MAIN_DF.shape == (10, 17)

os.remove(TMP_FILE_NAME)
25 changes: 25 additions & 0 deletions tests/integration/flows/test_multiple_flows.py
@@ -0,0 +1,25 @@
from viadot.flows import MultipleFlows
import logging


def test_multiple_flows_working(caplog):
list = [
["Flow of flows 1 test", "dev"],
["Flow of flows 2 - working", "dev"],
["Flow of flows 3", "dev"],
]
flow = MultipleFlows(name="test", flows_list=list)
with caplog.at_level(logging.INFO):
flow.run()
assert "All of the tasks succeeded." in caplog.text


def test_multiple_flows_not_working(caplog):
list = [
["Flow of flows 1 test", "dev"],
["Flow of flows 2 test - not working", "dev"],
["Flow of flows 3", "dev"],
]
flow = MultipleFlows(name="test", flows_list=list)
flow.run()
assert "One of the flows has failed!" in caplog.text
15 changes: 15 additions & 0 deletions tests/integration/tasks/test_aselite.py
@@ -0,0 +1,15 @@
from viadot.tasks import ASELiteToDF
import pandas as pd


def test_aselite_to_df():
query = """SELECT TOP (10) [usageid]
,[configid]
,[verticalid]
,[textgroupid]
,[nr]
,[storedate]
FROM [UCRMDEV_DESIGNER].[dbo].[PORTAL_APPLICATION_TEXTUSAGE]"""
task = ASELiteToDF()
df = task.run(query=query)
assert isinstance(df, pd.DataFrame)
12 changes: 11 additions & 1 deletion tests/integration/tasks/test_azure_sql.py
@@ -111,7 +111,7 @@ def test_check_column_order_append_diff_col_number(caplog):
ValidationError,
match=r"Detected discrepancies in number of columns or different column names between the CSV file and the SQL table!",
):
check_column_order.run(table=TABLE, schema=SCHEMA, if_exists="append", df=df)


def test_check_column_order_replace(caplog):
@@ -132,3 +132,13 @@ def test_check_column_order_replace(caplog):
with caplog.at_level(logging.INFO):
check_column_order.run(table=TABLE, if_exists="replace", df=df)
assert "The table will be replaced." in caplog.text


def test_check_column_order_append_not_exists(caplog):
check_column_order = CheckColumnOrder()
data = {"id": [1], "street": ["Green"], "name": ["Tom"]}
df = pd.DataFrame(data)
check_column_order.run(
table="non_existing_table_123", schema="sandbox", if_exists="append", df=df
)
assert "table doesn't exists" in caplog.text