
Implement the Dataset class #72

Open
javihern98 opened this issue Jul 19, 2024 · 3 comments

Comments

@javihern98
Contributor

This is a summary of the topics discussed. The Dataset class attributes must be (a sketch follows the list below):

  • Data: for version 1.0, use Pandas; subsequent versions will include streaming capabilities
  • Attached attributes: attributes at the dataset level (a simple dictionary for now; this will be changed later)
  • Relationship to the Schema class or a URN
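
A minimal sketch of what these three attributes could look like in 1.0 (a hypothetical shape for illustration, not the final pysdmx API; the class name and fields are placeholders):

from dataclasses import dataclass, field
from typing import Any, Dict, Union

import pandas as pd

from pysdmx.model import Schema


@dataclass
class PandasDataset:  # Placeholder name; illustrates the three attributes above only
    data: pd.DataFrame  # Data: held in Pandas for version 1.0
    structure: Union[Schema, str]  # The Schema object or its SDMX URN
    attributes: Dict[str, Any] = field(default_factory=dict)  # Dataset-level attached attributes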

@sosna and @stratosn proposed the following code, which was accepted by everyone and will be implemented after 1.0, as some changes in parsers and writers are required.

from dataclasses import dataclass
from datetime import datetime
from typing import Any, Generator, Optional, Sequence, Union

from pysdmx.model import MetadataReport, DataProvider, Schema


@dataclass
class _Component:
    # Base class for dimensions, data attributes and measures: an id paired with a value
    id: str
    value: Any


@dataclass
class Dimension(_Component):
    pass


@dataclass
class DataAttribute(_Component):
    pass


@dataclass
class Measure(_Component):
    pass


@dataclass
class _Package:
    key: str  # Full key (cf. MEDAL), e.g. A.F.G.M.*
    dimensions: Sequence[Dimension]
    attributes: Optional[Sequence[DataAttribute]]
    name: Optional[str]
    metadata: Optional[Sequence[Union[MetadataReport, str]]]  # Reports or their URNs


@dataclass
class Observation(_Package):
    measures: Sequence[Measure]


@dataclass
class _ObsPackage(_Package):
    observations: Generator[Observation, None, None]
    obs_count: Optional[int]
    start_period: Optional[str]
    end_period: Optional[str]
    last_updated: Optional[datetime]


@dataclass
class Series(_ObsPackage):
    pass


@dataclass
class Group(_Package):
    pass


@dataclass
class Dataset(_ObsPackage):
    packages: Generator[Union[Group, Series, Observation], None, None]
    provider: Optional[DataProvider]
    structure: Union[Schema, str]  # Schema or the SDMX URN of the structure

    @property
    def groups(self):  # A view on the packages of type Group
        return (p for p in self.packages if isinstance(p, Group))

    @property
    def series(self):  # A view on the packages of type Series
        return (p for p in self.packages if isinstance(p, Series))


@dataclass
class PandasDataset(Dataset):
    def to_pandas(self):
        # To be implemented: return the dataset contents as a pandas DataFrame
        pass
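
A hypothetical usage sketch of the proposed classes (the keys, component ids and the structure URN below are made up for illustration):

obs = Observation(
    key="A.FR.GDP",
    dimensions=[Dimension("FREQ", "A"), Dimension("REF_AREA", "FR")],
    attributes=None,
    name=None,
    metadata=None,
    measures=[Measure("OBS_VALUE", 42.0)],
)

series = Series(
    key="A.FR.GDP",
    dimensions=[Dimension("FREQ", "A"), Dimension("REF_AREA", "FR")],
    attributes=None,
    name=None,
    metadata=None,
    observations=(o for o in [obs]),
    obs_count=1,
    start_period="2020",
    end_period="2020",
    last_updated=None,
)

dataset = Dataset(
    key="A.FR.*",
    dimensions=[Dimension("FREQ", "A")],
    attributes=None,
    name="Example dataset",
    metadata=None,
    observations=(o for o in []),
    obs_count=1,
    start_period="2020",
    end_period="2020",
    last_updated=None,
    packages=(p for p in [series]),
    provider=None,
    structure="urn:sdmx:org.sdmx.infomodel.datastructure.DataStructure=XX:EXAMPLE_DSD(1.0)",
)

# groups and series are lazy views that filter the packages generator
for s in dataset.series:
    print(s.key, s.obs_count)  # A.FR.GDP 1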

@javihern98 self-assigned this Jul 19, 2024
@javihern98 added this to the 1.0 milestone Sep 20, 2024
@sosna removed this from the 1.0 milestone Sep 26, 2024
@gabrielgellner

Do we think it might be possible to do this with Narwhals to make this dataframe-agnostic? (I am a huge lover of pysdmx and am moving all my sdmx code over to it, but I am also a huge Polars user :))

@sosna
Collaborator

sosna commented Oct 25, 2024

Thanks for the suggestion and kind words, @gabrielgellner, we'll definitely have a look at it! Indeed, some of us in the Dev team cannot use Pandas (as there is no guarantee the dataset would fit in memory) and so we will soon need to look at adding more options.

@javihern98
Contributor Author

Hi @gabrielgellner. This is scheduled for next year. I would like to first investigate all possible libraries (including Dask, Modin, etc.). I will certainly look into Narwhals as well; it seems quite straightforward!
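
For illustration only, a dataframe-agnostic helper could look roughly like this with Narwhals (a sketch, not part of pysdmx; the OBS_VALUE column name is an assumption):

import narwhals as nw


@nw.narwhalify
def drop_missing_observations(df):
    # Works unchanged on pandas, Polars and other Narwhals-supported backends:
    # the decorator wraps the native frame and returns the result in its native type.
    return df.filter(~nw.col("OBS_VALUE").is_null())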

The main goal is to keep all existing functionality while adding the ability to load datasets larger than memory. As most of that functionality is already in place, the remaining goals are to add compatibility with SDMX-ML 3.0 and to gather more use cases from data consumers and producers in the early months of 2025.

Low-memory data loading is tricky: we also need to find a common interface between lxml and these "low-memory libraries", avoid reading the whole XML file at once, and add the necessary "pointers" so data is read only when required. That takes time and effort to turn into good-quality software that anyone can use in a production environment. After talking with some potential users, this has become a growing need, so we will prioritize it.
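
To give an idea of the kind of streaming parse this implies, here is a minimal sketch using lxml.etree.iterparse (illustrative only: "Obs" is a placeholder element name, and real SDMX-ML handling needs namespaces and structure-specific mappings):

from lxml import etree


def iter_observations(path):
    # Yield one dict of XML attributes per observation element, without
    # loading the whole document into memory.
    for _, elem in etree.iterparse(path, events=("end",)):
        if etree.QName(elem).localname == "Obs":  # placeholder element name
            yield dict(elem.attrib)
            elem.clear()  # free the element's content
            # Drop already-processed siblings to keep memory usage flat
            while elem.getprevious() is not None:
                del elem.getparent()[0]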

We also need a common solution for handling data efficiently throughout the library, bearing in mind that it should be something users easily recognize (like Pandas) and that it does not add much cognitive load when interacting with the actual data.

@sosna added the "new feature" and "major" labels Jan 17, 2025