DataFrame generic over column names for pl.col convenience #20761

Open
iliya-malecki opened this issue Jan 16, 2025 · 1 comment
Labels
enhancement New feature or an improvement of an existing feature

Comments

@iliya-malecki
Contributor

iliya-malecki commented Jan 16, 2025

Description

I feel like the biggest slowdown when working with polars is that pyright cannot infer column names in pl.col(''). That is a good sign, in the sense that everything else is quite streamlined and natural, but it makes typing the lowest-hanging fruit in terms of usability. I would like to start a conversation about a simple and unobtrusive way to implement column name hinting. I spent a minute fooling around and came up with a way, not necessarily a good way, to achieve that. Of course, this should be viewed with a great deal of skepticism, since the more logical way to do typing is pydantic models, but I decided to start dumb and unobtrusive.

import polars as pl
from typing import Generic, TypeVar, LiteralString, Literal, Iterable
from decimal import Decimal
from datetime import date, time, timedelta, datetime
from polars.functions.col import ColumnFactory, ColumnFactoryMeta

# T stands for the union of column names, e.g. Literal['a', 'bbb']
T = TypeVar("T", bound=LiteralString, covariant=True)

# an expression that carries the allowed column names in its type
class GExpr(pl.Expr, Generic[T]): ...

IntoExpr = int | float | Decimal | date | time | datetime | timedelta | T | bool | bytes | list | GExpr[T] | pl.Series | None

# a DataFrame whose select() only accepts expressions over its own column names
class DF(pl.DataFrame, Generic[T]):
    def select(
        self,
        *exprs: IntoExpr[T] | Iterable[IntoExpr[T]],
        **named_exprs: IntoExpr[T],
    ) -> pl.DataFrame:
        return super().select(*exprs, **named_exprs)

# adds col['name'] so that the name can be checked against T
class GColumnFactoryMeta(ColumnFactoryMeta):
    def __getitem__(self, item: T) -> GExpr[T]:
        return getattr(self, item)

class col(ColumnFactory, metaclass=GColumnFactoryMeta):
    ...


# pyright infers the possible keys, so vscode shows a completion popup
# for col[...] and the literal is type checked
DF[Literal['a', 'bbb']](
    {'a': [1, 2, 3], 'bbb': [323, 2, 42]}
).select(col['bbb'])

Image
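For readers who want to try the idea without touching polars internals, here is a minimal polars-free sketch of the same mechanism. `Frame`, `Column` and `Name` are hypothetical stand-ins and not part of polars, and the TypeVar is bound to `str` rather than `LiteralString` only so that the snippet also runs on Python versions before 3.11.

```python
from typing import Generic, Literal, TypeVar

# Hypothetical stand-ins; none of these names exist in polars.
Name = TypeVar("Name", bound=str)


class Column(Generic[Name]):
    """A column reference that remembers its name in its type."""

    def __init__(self, name: Name) -> None:
        self.name = name


class Frame(Generic[Name]):
    """A toy frame whose type parameter is the union of its column names."""

    def __init__(self, data: dict[str, list[int]]) -> None:
        self.data = data

    def select(self, *columns: Column[Name]) -> dict[str, list[int]]:
        # At runtime this is an ordinary lookup; the restriction on names
        # lives only in the annotations.
        return {c.name: self.data[c.name] for c in columns}


frame = Frame[Literal["a", "bbb"]]({"a": [1, 2, 3], "bbb": [323, 2, 42]})
selected = frame.select(Column("bbb"))
print(selected)  # {'bbb': [323, 2, 42]}

# A type checker should reject the next line because "c" is not one of the
# declared names; at runtime it would raise KeyError.
# frame.select(Column("c"))
```

The runtime behaviour is unchanged by the generic parameter, which is why the approach stays unobtrusive: all of the checking happens in the type checker.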

Obviously, the dataframe init would have to change to accommodate the generic, possibly using the schema parameter.
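As long as the names are written out by hand, nothing ties the declared Literal to the actual data. A rough sketch of a runtime check that a constructor could run, assuming the declared names are available as a Literal type; `Columns` and `check_columns` are hypothetical helpers, not polars API.

```python
from typing import Literal, get_args

# The names a caller would pass as the generic parameter (hypothetical).
Columns = Literal["a", "bbb"]


def check_columns(declared: object, schema: dict[str, object]) -> None:
    """Raise ValueError if the Literal's names and the schema's keys differ."""
    declared_names = set(get_args(declared))
    actual_names = set(schema)
    if declared_names != actual_names:
        missing = sorted(declared_names - actual_names)
        extra = sorted(actual_names - declared_names)
        raise ValueError(f"missing columns {missing}, undeclared columns {extra}")


check_columns(Columns, {"a": int, "bbb": int})  # passes silently

try:
    check_columns(Columns, {"a": int, "b": int})
except ValueError as err:
    message = str(err)
    print(message)
```

This only catches a mismatch when the frame is built; it does not remove the need to spell the names out twice.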

Some of my own criticism of this approach includes the need for an explicit schema in the init, and the fact that I'm not addressing pydantic well :)

@iliya-malecki iliya-malecki added the enhancement New feature or an improvement of an existing feature label Jan 16, 2025
@iliya-malecki iliya-malecki changed the title DataFrame generic over columns DataFrame generic over column names Jan 16, 2025
@iliya-malecki
Contributor Author

I understand this is not a priority, but if this approach looks roughly right, I can implement the typing changes and open a PR for a closer and more comprehensive review in situ.

@iliya-malecki iliya-malecki changed the title DataFrame generic over column names DataFrame generic over column names for pl.col convenience Jan 16, 2025