
Implement the Dataset class #72

Open
javihern98 opened this issue Jul 19, 2024 · 3 comments

Comments

@javihern98
Contributor

This is a summary of the topics discussed. The Dataset class attributes must be (a sketch follows the list below):

  • Data: for version 1.0, use Pandas; subsequent versions will include streaming capabilities
  • Attached attributes: attributes at the dataset level (a simple dictionary for now; this will be changed later)
  • Relationship to the Schema class or a URN
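
A minimal sketch of what these three attributes could look like in 1.0 (a hypothetical shape for illustration, not the final pysdmx API; the class name and fields are placeholders):

from dataclasses import dataclass, field
from typing import Any, Dict, Union

import pandas as pd

from pysdmx.model import Schema


@dataclass
class PandasDataset:  # Placeholder name; illustrates the three attributes above only
    data: pd.DataFrame  # Data: held in Pandas for version 1.0
    structure: Union[Schema, str]  # The Schema object or its SDMX URN
    attributes: Dict[str, Any] = field(default_factory=dict)  # Dataset-level attached attributes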

@sosna and @stratosn proposed the following code, which was accepted by everyone and will be implemented after 1.0, as some changes in parsers and writers are required.

from dataclasses import dataclass
from datetime import datetime
from typing import Any, Generator, Optional, Sequence, Union

from pysdmx.model import MetadataReport, DataProvider, Schema


@dataclass
class _Component:
    # Base class for dimensions, data attributes and measures: an id paired with a value
    id: str
    value: Any


@dataclass
class Dimension(_Component):
    pass


@dataclass
class DataAttribute(_Component):
    pass


@dataclass
class Measure(_Component):
    pass


@dataclass
class _Package:
    key: str  # Full key (cf. MEDAL), e.g. A.F.G.M.*
    dimensions: Sequence[Dimension]
    attributes: Optional[Sequence[DataAttribute]]
    name: Optional[str]
    metadata: Optional[Sequence[Union[MetadataReport, str]]]  # Reports or their URNs


@dataclass
class Observation(_Package):
    measures: Sequence[Measure]


@dataclass
class _ObsPackage(_Package):
    observations: Generator[Observation, None, None]
    obs_count: Optional[int]
    start_period: Optional[str]
    end_period: Optional[str]
    last_updated: Optional[datetime]


@dataclass
class Series(_ObsPackage):
    pass


@dataclass
class Group(_Package):
    pass


@dataclass
class Dataset(_ObsPackage):
    packages: Generator[Union[Group, Series, Observation], None, None]
    provider: Optional[DataProvider]
    structure: Union[Schema, str]  # Schema or the SDMX URN of the structure

    @property
    def groups(self):  # A view on the packages of type Group
        return (p for p in self.packages if isinstance(p, Group))

    @property
    def series(self):  # A view on the packages of type Series
        return (p for p in self.packages if isinstance(p, Series))


@dataclass
class PandasDataset(Dataset):
    def to_pandas(self):
        # To be implemented: return the dataset contents as a pandas DataFrame
        pass
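
A hypothetical usage sketch of the proposed classes (the keys, component ids and the structure URN below are made up for illustration):

obs = Observation(
    key="A.FR.GDP",
    dimensions=[Dimension("FREQ", "A"), Dimension("REF_AREA", "FR")],
    attributes=None,
    name=None,
    metadata=None,
    measures=[Measure("OBS_VALUE", 42.0)],
)

series = Series(
    key="A.FR.GDP",
    dimensions=[Dimension("FREQ", "A"), Dimension("REF_AREA", "FR")],
    attributes=None,
    name=None,
    metadata=None,
    observations=(o for o in [obs]),
    obs_count=1,
    start_period="2020",
    end_period="2020",
    last_updated=None,
)

dataset = Dataset(
    key="A.FR.*",
    dimensions=[Dimension("FREQ", "A")],
    attributes=None,
    name="Example dataset",
    metadata=None,
    observations=(o for o in []),
    obs_count=1,
    start_period="2020",
    end_period="2020",
    last_updated=None,
    packages=(p for p in [series]),
    provider=None,
    structure="urn:sdmx:org.sdmx.infomodel.datastructure.DataStructure=XX:EXAMPLE_DSD(1.0)",
)

# groups and series are lazy views that filter the packages generator
for s in dataset.series:
    print(s.key, s.obs_count)  # A.FR.GDP 1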

@javihern98 self-assigned this Jul 19, 2024
@javihern98 added this to the 1.0 milestone Sep 20, 2024
@sosna removed this from the 1.0 milestone Sep 26, 2024
@gabrielgellner

Do we think it might be possible to do this with Narwhals to make this dataframe-agnostic? (I am a huge lover of pysdmx and am moving all my sdmx code over to it, but I am also a huge Polars user :))

@sosna
Collaborator

sosna commented Oct 25, 2024

Thanks for the suggestion and kind words, @gabrielgellner, we'll definitely have a look at it! Indeed, some of us in the Dev team cannot use Pandas (as there is no guarantee the dataset would fit in memory) and so we will soon need to look at adding more options.

@javihern98
Contributor Author

Hi @gabrielgellner. This is scheduled for next year. I would like to first investigate all possible libraries (including Dask, Modin, etc.). I will certainly look into Narwhals as well; it seems quite straightforward!
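
For illustration only, a dataframe-agnostic helper could look roughly like this with Narwhals (a sketch, not part of pysdmx; the OBS_VALUE column name is an assumption):

import narwhals as nw


@nw.narwhalify
def drop_missing_observations(df):
    # Works unchanged on pandas, Polars and other Narwhals-supported backends:
    # the decorator wraps the native frame and returns the result in its native type.
    return df.filter(~nw.col("OBS_VALUE").is_null())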

The main goal is to keep all existing functionality while adding the ability to load datasets larger than memory. As most of that functionality is already in place, the remaining goals are to add compatibility with SDMX-ML 3.0 and to gather more use cases from data consumers and producers in the early months of 2025.

Low-memory data loading is tricky: we also need to find a common interface between lxml and these "low-memory libraries", avoid reading the whole XML file at once, and add the necessary "pointers" so data is read only when required. That takes time and effort to turn into good-quality software that anyone can use in a production environment. After talking with some potential users, this has become a growing need, so we will prioritize it.
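
To give an idea of the kind of streaming parse this implies, here is a minimal sketch using lxml.etree.iterparse (illustrative only: "Obs" is a placeholder element name, and real SDMX-ML handling needs namespaces and structure-specific mappings):

from lxml import etree


def iter_observations(path):
    # Yield one dict of XML attributes per observation element, without
    # loading the whole document into memory.
    for _, elem in etree.iterparse(path, events=("end",)):
        if etree.QName(elem).localname == "Obs":  # placeholder element name
            yield dict(elem.attrib)
            elem.clear()  # free the element's content
            # Drop already-processed siblings to keep memory usage flat
            while elem.getprevious() is not None:
                del elem.getparent()[0]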

We also need a common solution for handling data efficiently throughout the library, bearing in mind that it should be something users easily recognize (like Pandas) and that it does not add much cognitive load when interacting with the actual data.

@sosna added the "new feature" and "major" labels Jan 17, 2025