IO Package refactor #172

Open: wants to merge 63 commits into base: develop
Changes from 58 commits

Commits (63)
3664415
Draft code for read_sdmx.
javihern98 Dec 18, 2024
bdf8145
Merge branch 'develop' into 156-add-read_sdmx-convenience-method
javihern98 Dec 18, 2024
2d87af7
Refactored code on Message class to move submission and ActionType to…
javihern98 Dec 19, 2024
617aa4c
Linting and mypy changes
javihern98 Dec 19, 2024
2c2cab3
Refactored code to ensure we use the Short URN as keys of the message…
javihern98 Dec 19, 2024
ffb2a8e
Added tests for read_sdmx with csv files.
javihern98 Dec 19, 2024
01f5489
Linting and mypy changes.
javihern98 Dec 19, 2024
ab63f01
Adapted structures tests to reach max code coverage
javihern98 Dec 19, 2024
8b960d2
Removed unique_id method and adapted code to use Reference.
javihern98 Dec 19, 2024
41e7a8d
Added utils methods and classes to all.
javihern98 Dec 19, 2024
b140de0
Adapted tests for input processor to max code coverage.
javihern98 Dec 19, 2024
0c03933
Adapted tests for read_sdmx to max code coverage.
javihern98 Dec 19, 2024
f567c50
Added tag to concept and itemReference to avoid serialization issues.
javihern98 Dec 19, 2024
85076af
Merge remote-tracking branch 'refs/remotes/origin/develop' into 156-a…
javihern98 Dec 19, 2024
fe1fbb4
Fixed message imports to prevent circular imports. Ruff automatic fix…
javihern98 Dec 19, 2024
8afd68b
Read and write XML methods added (as requested, the write method has …
mla2001 Jan 9, 2025
5417e73
Fixed ruff errors.
mla2001 Jan 9, 2025
6e9ec54
Fixed mypy errors.
mla2001 Jan 9, 2025
386af2f
Fixed read Path input error.
mla2001 Jan 9, 2025
6b6886f
Adapted ReadSDMX to infer the format automatically.
javihern98 Jan 9, 2025
b6e6200
Linting changes.
javihern98 Jan 9, 2025
03dd8ad
Renamed ReadFormat to SDMXFormat. Added Submission and Error formats …
javihern98 Jan 9, 2025
bd27711
Ignored mypy error.
javihern98 Jan 10, 2025
5818c40
Added to_schema method to DataStructureDefinition. Draft code for get…
javihern98 Jan 10, 2025
6a4bac7
Replaced whole URN to Short URN in Dataset. Added tests to match cove…
javihern98 Jan 10, 2025
d38ed7f
Added get_datasets to io
javihern98 Jan 10, 2025
e2b2946
Added URL parsing support on read_sdmx. Made httpx a mandatory depend…
javihern98 Jan 10, 2025
4eca815
Merge branch 'develop' into 156-add-read_sdmx-convenience-method
javihern98 Jan 10, 2025
183fe4d
Updated poetry lock.
javihern98 Jan 10, 2025
d8fa929
Merge remote-tracking branch 'origin/156-add-read_sdmx-convenience-me…
javihern98 Jan 10, 2025
4ab5896
Updated XML reader and writers with the new signature. Refactored tes…
mla2001 Jan 10, 2025
d197d44
Fixed XML read return typing.
mla2001 Jan 10, 2025
4be7535
Fixed most ruff errors.
mla2001 Jan 10, 2025
a78dca6
Fixed all ruff errors.
mla2001 Jan 10, 2025
0a79372
Fixed all mypy errors.
mla2001 Jan 10, 2025
3b48f9e
Updated poetry lock and used correct version of httpx. Updated tests …
javihern98 Jan 10, 2025
3ea8431
100% code coverage.
mla2001 Jan 10, 2025
1544afb
Fixed all ruff errors.
mla2001 Jan 10, 2025
ed6303a
Added validate flag to get_datasets.
javihern98 Jan 10, 2025
eb7038a
Fixed mypy errors. Fixed all errors.
mla2001 Jan 10, 2025
e26e1ee
Merge branch 'develop' into 164-checking-for-inconsistencies-in-parse…
mla2001 Jan 10, 2025
2a4ceb3
Update test_reader.py
mla2001 Jan 10, 2025
c7f3d14
Fixed all errors.
mla2001 Jan 10, 2025
8341e65
Restored fixes for reading at best effort basis.
javihern98 Jan 8, 2025
85bc5ea
Refactor on method names to make it private, signature of structures …
javihern98 Jan 13, 2025
f6df667
Merge remote-tracking branch 'refs/remotes/origin/164-checking-for-in…
javihern98 Jan 13, 2025
53ac3e6
Refactor on message to use Sequence[...] objects.
javihern98 Jan 13, 2025
f78463c
Draft code on SDMX-ML 2.1 readers refactor to be consistent with writ…
javihern98 Jan 13, 2025
a527de0
Draft code on input processing. Added dataflow support to get_datasets.
javihern98 Jan 13, 2025
09fbb8c
Removed inconsistencies on read signature. Fixed input data on each X…
javihern98 Jan 13, 2025
af4ecc7
Fixed mypy errors before testing.
javihern98 Jan 13, 2025
f84d234
Changed short_urn to property in MaintainableArtefact.
javihern98 Jan 13, 2025
a9af618
Added property short_urn to DataStructureDefinition (sdmxtype must be…
javihern98 Jan 13, 2025
4cfcab7
Fixed tests on xml.
javihern98 Jan 13, 2025
2363a61
Fixed message tests.
javihern98 Jan 13, 2025
885f1af
Removed inconsistencies on SDMX-CSV writers by adding the datasets as…
javihern98 Jan 13, 2025
3a18798
Cleaned up code. Added tests for Coverage.
javihern98 Jan 13, 2025
3cbde9f
Replaced MessageType to SDMXFormat. Fixed tests.
javihern98 Jan 13, 2025
8105528
Mypy fix.
javihern98 Jan 13, 2025
b16b2c6
Docs updated
javihern98 Jan 13, 2025
fd1c3f3
Updated signature on input processor
javihern98 Jan 13, 2025
d14d5db
Fixed typo on write function in csv.
javihern98 Jan 16, 2025
f8e2815
Separated enumeration of SDMX JSON into Data and Structures
javihern98 Jan 16, 2025
1,371 changes: 716 additions & 655 deletions poetry.lock

Large diffs are not rendered by default.

3 changes: 1 addition & 2 deletions pyproject.toml
@@ -23,7 +23,7 @@ classifiers = [

 [tool.poetry.dependencies]
 python = "^3.9"
-httpx = {version = "0.*", optional = true}
+httpx = "^0.27.0"
 msgspec = "0.*"
 lxml = {version = "5.*", optional = true}
 xmltodict = {version = "0.*", optional = true}
@@ -34,7 +34,6 @@ pandas = {version = "^2.2.2", optional = true}

 [tool.poetry.extras]
 dc = ["python-dateutil"]
-fmr = ["httpx"]
 xml = ["lxml", "xmltodict", "sdmxschemas"]
 data = ["pandas"]

4 changes: 4 additions & 0 deletions src/pysdmx/io/__init__.py
@@ -1 +1,5 @@
 """IO module for SDMX data."""
+
+from pysdmx.io.reader import get_datasets, read_sdmx
+
+__all__ = ["read_sdmx", "get_datasets"]
17 changes: 7 additions & 10 deletions src/pysdmx/io/csv/sdmx10/reader/__init__.py
@@ -1,7 +1,7 @@
 """SDMX 1.0 CSV reader module."""

 from io import StringIO
-from typing import Dict
+from typing import Sequence

 import pandas as pd

@@ -14,10 +14,7 @@ def __generate_dataset_from_sdmx_csv(data: pd.DataFrame) -> PandasDataset:
     structure_id = data["DATAFLOW"].iloc[0]
     # Drop 'DATAFLOW' column from DataFrame
     df_csv = data.drop(["DATAFLOW"], axis=1)
-    urn = (
-        f"urn:sdmx:org.sdmx.infomodel.datastructure."
-        f"DataFlow={structure_id}"
-    )
+    urn = f"Dataflow={structure_id}"

     # Extract dataset attributes from sdmx-csv (all values are the same)
     attributes = {
@@ -36,11 +33,11 @@ def __generate_dataset_from_sdmx_csv(data: pd.DataFrame) -> PandasDataset:
     )


-def read(infile: str) -> Dict[str, PandasDataset]:
+def read(input_str: str) -> Sequence[PandasDataset]:
     """Reads csv file and returns a payload dictionary.

     Args:
-        infile: Path to file, str.
+        input_str: Path to file, str.

     Returns:
         payload: dict.
@@ -49,7 +46,7 @@ def read(infile: str) -> Dict[str, PandasDataset]:
         Invalid: If it is an invalid CSV file.
     """
     # Get Dataframe from CSV file
-    df_csv = pd.read_csv(StringIO(infile))
+    df_csv = pd.read_csv(StringIO(input_str))
     # Drop empty columns
     df_csv = df_csv.dropna(axis=1, how="all")

@@ -88,13 +85,13 @@ def read(infile: str) -> Dict[str, PandasDataset]:

     # Create a payload dictionary to store datasets with the
     # different unique_ids as keys
-    payload = {}
+    payload = []
     for df in list_df:
         # Generate a dataset from each subset of the DataFrame
         dataset = __generate_dataset_from_sdmx_csv(data=df)

         # Add the dataset to the payload dictionary
-        payload[dataset.short_urn] = dataset
+        payload.append(dataset)

     # Return the payload generated
     return payload
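
The reader above now returns a sequence of datasets instead of a dictionary keyed by short URN. For callers that still want the keyed lookup, a minimal sketch of rebuilding it could look like the following. `StubDataset` and `index_by_short_urn` are hypothetical names used only for illustration; the only attribute taken from the diff is `short_urn`.

```python
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass
class StubDataset:
    """Hypothetical stand-in for PandasDataset; only short_urn is modelled."""

    short_urn: str


def index_by_short_urn(
    datasets: Sequence[StubDataset],
) -> Dict[str, StubDataset]:
    """Rebuild a mapping keyed by each dataset's short URN."""
    indexed: Dict[str, StubDataset] = {}
    for dataset in datasets:
        # Assumption: duplicates are treated as an error here, whereas a
        # plain dict assignment would silently keep only the last one.
        if dataset.short_urn in indexed:
            raise ValueError(f"Duplicate short URN: {dataset.short_urn}")
        indexed[dataset.short_urn] = dataset
    return indexed


datasets = [
    StubDataset("Dataflow=MD:TEST_DF(1.0)"),
    StubDataset("Dataflow=MD:OTHER_DF(1.0)"),
]
by_urn = index_by_short_urn(datasets)
```

Raising on duplicates is a design choice of this sketch, not something the PR specifies.
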
35 changes: 21 additions & 14 deletions src/pysdmx/io/csv/sdmx10/writer/__init__.py
@@ -1,35 +1,42 @@
 """SDMX 1.0 CSV writer module."""

 from copy import copy
-from typing import Optional
+from typing import Optional, Sequence

 import pandas as pd

 from pysdmx.io.pd import PandasDataset


 def writer(
javihern98 marked this conversation as resolved.
-    dataset: PandasDataset, output_path: Optional[str] = None
+    datasets: Sequence[PandasDataset], output_path: Optional[str] = None
 ) -> Optional[str]:
-    """Converts a dataset to an SDMX CSV format.
+    """Write data to SDMX-CSV 1.0 format.

     Args:
-        dataset: dataset
-        output_path: output_path
+        datasets: List of datasets to write.
+            Must have the same components.
+        output_path: Path to write the data to.
+            If None, the data is returned as a string.

     Returns:
-        SDMX CSV data as a string
+        SDMX CSV data as a string, if output_path is None.
     """
     # Link to pandas.to_csv documentation on sphinx:
     # https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_csv.html

-    # Create a copy of the dataset
-    df: pd.DataFrame = copy(dataset.data)
-    df.insert(0, "DATAFLOW", dataset.short_urn.split("=")[1])

-    # Add additional attributes to the dataset
-    for k, v in dataset.attributes.items():
-        df[k] = v

+    dataframes = []
+    for dataset in datasets:
+        df: pd.DataFrame = copy(dataset.data)
+        df.insert(0, "DATAFLOW", dataset.short_urn.split("=")[1])

+        # Add additional attributes to the dataset
+        for k, v in dataset.attributes.items():
+            df[k] = v
+        dataframes.append(df)

+    # Concatenate the dataframes
+    all_data = pd.concat(dataframes, ignore_index=True, axis=0)
     # Return the SDMX CSV data as a string
-    return df.to_csv(output_path, index=False, header=True)
+    return all_data.to_csv(output_path, index=False, header=True)
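
To illustrate what writing several datasets into one SDMX-CSV 1.0 output amounts to, here is a standard-library sketch that mirrors the shape of the new writer without pandas. It is an approximation under stated assumptions: `write_sdmx_csv_10` is a hypothetical name, each dataset is modelled as a `(short_urn, rows)` pair, and dataset-level attributes are left out.

```python
import csv
import io
from typing import Dict, List, Sequence, Tuple

Rows = List[Dict[str, str]]


def write_sdmx_csv_10(datasets: Sequence[Tuple[str, Rows]]) -> str:
    """Write (short_urn, rows) pairs as one CSV with a DATAFLOW column."""
    columns = ["DATAFLOW"]
    out_rows = []
    for short_urn, rows in datasets:
        # Same split as in the diff: keep the part after "=".
        dataflow_id = short_urn.split("=")[1]
        for row in rows:
            # Collect column names in order of first appearance.
            for name in row:
                if name not in columns:
                    columns.append(name)
            out_rows.append({"DATAFLOW": dataflow_id, **row})

    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=columns, restval="", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(out_rows)
    return buffer.getvalue()


output = write_sdmx_csv_10(
    [
        ("Dataflow=MD:DF1(1.0)", [{"FREQ": "A", "OBS_VALUE": "1"}]),
        ("Dataflow=MD:DF2(1.0)", [{"FREQ": "M", "OBS_VALUE": "2"}]),
    ]
)
```

The writer's docstring states that the datasets must have the same components; the sketch fills missing columns with empty strings only so that it does not fail on mismatched input.
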
2 changes: 1 addition & 1 deletion src/pysdmx/io/csv/sdmx20/__init__.py
@@ -1,6 +1,6 @@
 """SDMX 2.0 CSV reader and writer."""

-from pysdmx.model.message import ActionType
+from pysdmx.model.dataset import ActionType

 SDMX_CSV_ACTION_MAPPER = {
     ActionType.Append: "A",
31 changes: 11 additions & 20 deletions src/pysdmx/io/csv/sdmx20/reader/__init__.py
@@ -1,13 +1,13 @@
 """SDMX 2.0 CSV reader module."""

 from io import StringIO
-from typing import Dict
+from typing import Sequence

 import pandas as pd

 from pysdmx.errors import Invalid
 from pysdmx.io.pd import PandasDataset
-from pysdmx.model.message import ActionType
+from pysdmx.model.dataset import ActionType

 ACTION_SDMX_CSV_MAPPER_READING = {
     "A": ActionType.Append,
@@ -49,20 +49,11 @@ def __generate_dataset_from_sdmx_csv(data: pd.DataFrame) -> PandasDataset:
     df_csv = data.drop(["STRUCTURE", "STRUCTURE_ID"], axis=1)

     if structure_type == "DataStructure".lower():
-        urn = (
-            "urn:sdmx:org.sdmx.infomodel.datastructure."
-            f"DataStructure={structure_id}"
-        )
-    elif structure_type == "DataFlow".lower():
-        urn = (
-            "urn:sdmx:org.sdmx.infomodel.datastructure."
-            f"DataFlow={structure_id}"
-        )
+        urn = f"DataStructure={structure_id}"
+    elif structure_type == "Dataflow".lower():
+        urn = f"Dataflow={structure_id}"
     elif structure_type == "dataprovision":
-        urn = (
-            f"urn:sdmx:org.sdmx.infomodel.registry."
-            f"ProvisionAgreement={structure_id}"
-        )
+        urn = f"ProvisionAgreement={structure_id}"
     else:
         raise Invalid(
             "Invalid value on STRUCTURE column",
@@ -87,11 +78,11 @@ def __generate_dataset_from_sdmx_csv(data: pd.DataFrame) -> PandasDataset:
     )


-def read(infile: str) -> Dict[str, PandasDataset]:
+def read(input_str: str) -> Sequence[PandasDataset]:
     """Reads csv file and returns a payload dictionary.

     Args:
-        infile: Path to file, str.
+        input_str: Path to file, str.

     Returns:
         payload: dict.
@@ -100,7 +91,7 @@ def read(infile: str) -> Dict[str, PandasDataset]:
         Invalid: If it is an invalid CSV file.
     """
     # Get Dataframe from CSV file
-    df_csv = pd.read_csv(StringIO(infile))
+    df_csv = pd.read_csv(StringIO(input_str))
     # Drop empty columns
     df_csv = df_csv.dropna(axis=1, how="all")

@@ -142,13 +133,13 @@ def read(infile: str) -> Dict[str, PandasDataset]:

     # Create a payload dictionary to store datasets with the
     # different unique_ids as keys
-    payload = {}
+    payload = []
     for df in list_df:
         # Generate a dataset from each subset of the DataFrame
         dataset = __generate_dataset_from_sdmx_csv(data=df)

         # Add the dataset to the payload dictionary
-        payload[dataset.short_urn] = dataset
+        payload.append(dataset)

     # Return the payload generated
     return payload
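
The branch on the STRUCTURE column in this reader maps each accepted value to a short URN prefix. The same mapping can be sketched as a lookup table, shown here as a self-contained illustration. `SHORT_URN_PREFIXES` and `build_short_urn` are hypothetical names, `ValueError` stands in for the library's `Invalid` error, and the sketch assumes the STRUCTURE value arrives already lower-cased, as the comparisons in the diff suggest.

```python
# Lower-case STRUCTURE values from the diff, mapped to short URN prefixes.
SHORT_URN_PREFIXES = {
    "datastructure": "DataStructure",
    "dataflow": "Dataflow",
    "dataprovision": "ProvisionAgreement",
}


def build_short_urn(structure_type: str, structure_id: str) -> str:
    """Return '<Prefix>=<structure_id>' for an accepted STRUCTURE value."""
    try:
        prefix = SHORT_URN_PREFIXES[structure_type]
    except KeyError:
        # Stand-in for the Invalid error raised by the reader.
        raise ValueError(
            f"Invalid value on STRUCTURE column: {structure_type!r}"
        ) from None
    return f"{prefix}={structure_id}"


short_urn = build_short_urn("dataflow", "MD:TEST_DF(1.0)")
```
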
52 changes: 30 additions & 22 deletions src/pysdmx/io/csv/sdmx20/writer/__init__.py
@@ -1,7 +1,7 @@
 """SDMX 2.0 CSV writer module."""

 from copy import copy
-from typing import Optional
+from typing import Optional, Sequence

 import pandas as pd

@@ -10,38 +10,46 @@


 def writer(
-    dataset: PandasDataset, output_path: Optional[str] = None
+    datasets: Sequence[PandasDataset], output_path: Optional[str] = None
 ) -> Optional[str]:
-    """Converts a dataset to an SDMX CSV format.
+    """Write data to SDMX-CSV 2.0 format.

     Args:
-        dataset: dataset
-        output_path: output_path
+        datasets: List of datasets to write.
+            Must have the same components.
+        output_path: Path to write the data to.
+            If None, the data is returned as a string.

     Returns:
-        SDMX CSV data as a string
+        SDMX CSV data as a string, if output_path is None.
     """
     # Link to pandas.to_csv documentation on sphinx:
     # https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_csv.html

-    # Create a copy of the dataset
-    df: pd.DataFrame = copy(dataset.data)
+    dataframes = []
+    for dataset in datasets:
+        # Create a copy of the dataset
+        df: pd.DataFrame = copy(dataset.data)

-    # Add additional attributes to the dataset
-    for k, v in dataset.attributes.items():
-        df[k] = v
+        # Add additional attributes to the dataset
+        for k, v in dataset.attributes.items():
+            df[k] = v

-    structure_ref, unique_id = dataset.short_urn.split("=", maxsplit=1)
-    if structure_ref in ["DataStructure", "DataFlow"]:
-        structure_ref = structure_ref.lower()
-    else:
-        structure_ref = "dataprovision"
+        structure_ref, unique_id = dataset.short_urn.split("=", maxsplit=1)
+        if structure_ref in ["DataStructure", "Dataflow"]:
+            structure_ref = structure_ref.lower()
+        else:
+            structure_ref = "dataprovision"

-    # Insert two columns at the beginning of the data set
-    df.insert(0, "STRUCTURE", structure_ref)
-    df.insert(1, "STRUCTURE_ID", unique_id)
-    action_value = SDMX_CSV_ACTION_MAPPER[dataset.action]
-    df.insert(2, "ACTION", action_value)
+        # Insert two columns at the beginning of the data set
+        df.insert(0, "STRUCTURE", structure_ref)
+        df.insert(1, "STRUCTURE_ID", unique_id)
+        action_value = SDMX_CSV_ACTION_MAPPER[dataset.action]
+        df.insert(2, "ACTION", action_value)

+        dataframes.append(df)

+    all_data = pd.concat(dataframes, ignore_index=True, axis=0)

     # Convert the dataset into a csv file
-    return df.to_csv(output_path, index=False, header=True)
+    return all_data.to_csv(output_path, index=False, header=True)
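
The STRUCTURE and STRUCTURE_ID columns in this writer are derived by splitting the short URN. The following isolated sketch of that step is meant to show why the prefix casing matters; `split_short_urn` is a hypothetical helper name, and the branch logic is copied from the new side of the diff.

```python
from typing import Tuple


def split_short_urn(short_urn: str) -> Tuple[str, str]:
    """Derive (STRUCTURE, STRUCTURE_ID) values from a short URN."""
    # maxsplit=1 keeps any later "=" inside the identifier part.
    structure_ref, unique_id = short_urn.split("=", maxsplit=1)
    if structure_ref in ["DataStructure", "Dataflow"]:
        structure_ref = structure_ref.lower()
    else:
        # Every other prefix is written as "dataprovision".
        structure_ref = "dataprovision"
    return structure_ref, unique_id


dataflow_columns = split_short_urn("Dataflow=MD:TEST_DF(1.0)")
provision_columns = split_short_urn("ProvisionAgreement=MD:PA(1.0)")
# A prefix spelled "DataFlow" no longer matches and falls through to the
# else branch, so the reader and the writer need to agree on "Dataflow".
old_casing_columns = split_short_urn("DataFlow=MD:TEST_DF(1.0)")
```
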
21 changes: 21 additions & 0 deletions src/pysdmx/io/enums.py
@@ -0,0 +1,21 @@
+"""IO Enumerations for SDMX files."""
+
+from enum import Enum
+
+
+class SDMXFormat(Enum):
javihern98 marked this conversation as resolved.
+    """Enumeration of supported SDMX read formats."""
+
+    SDMX_ML_2_1_STRUCTURE = "SDMX-ML 2.1 Structure"
+    SDMX_ML_2_1_DATA_STRUCTURE_SPECIFIC = "SDMX-ML 2.1 StructureSpecific"
+    SDMX_ML_2_1_DATA_GENERIC = "SDMX-ML 2.1 Generic"
+    SDMX_ML_2_1_REGISTRY_INTERFACE = "SDMX-ML 2.1 Registry Interface"
+    SDMX_ML_2_1_ERROR = "SDMX-ML 2.1 Error"
+    SDMX_JSON_2 = "SDMX-JSON 2.0.0"

Collaborator:
I think we should also distinguish between Structure and Data, as we do with XML?

Contributor Author:
Sure, let me check the Standard, in any case this has no impact in the rest of the code as we have a NotImplemented

Contributor Author (javihern98, Jan 16, 2025):
Agreed to differentiate between SDMX_JSON_2_DATA and SDMX_JSON_2_STRUCTURE

+    FUSION_JSON = "FusionJSON"
+    SDMX_CSV_1_0 = "SDMX-CSV 1.0"
+    SDMX_CSV_2_0 = "SDMX-CSV 2.0"
+
+    def __str__(self) -> str:
+        """Return the string representation of the format."""
+        return self.value
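
As a rough illustration of the agreed follow-up (separate members for SDMX-JSON data and structures) and of what the `__str__` override gives callers, consider the sketch below. `FormatSketch` is a hypothetical class, not the library's enum, and the two JSON string values are assumptions because the thread names only the members.

```python
from enum import Enum


class FormatSketch(Enum):
    """Illustrative enum; the JSON values below are assumed, not sourced."""

    SDMX_JSON_2_DATA = "SDMX-JSON 2.0.0 Data"  # hypothetical value
    SDMX_JSON_2_STRUCTURE = "SDMX-JSON 2.0.0 Structure"  # hypothetical value
    SDMX_CSV_2_0 = "SDMX-CSV 2.0"

    def __str__(self) -> str:
        """Return the string representation of the format."""
        return self.value


# str() yields the human-readable value rather than "FormatSketch.SDMX_CSV_2_0".
label = str(FormatSketch.SDMX_CSV_2_0)
# Lookup by value still resolves to the member.
member = FormatSketch("SDMX-CSV 2.0")
```

With the override, error messages and logs can interpolate a format member directly and get the readable label.
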