is_in fails with custom Collection #20784

Open
2 tasks done
harrymconner opened this issue Jan 18, 2025 · 1 comment
Labels
bug Something isn't working P-low Priority: low python Related to Python Polars

Comments

@harrymconner

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

This MRE shows is_in failing for a custom Collection but succeeding for a custom Sequence.

from collections.abc import Collection, Iterator, Sequence
from typing import overload

import polars as pl

df = pl.DataFrame({"A": [1, 2, 3]})


class CustomCollection(Collection[int]):
    def __init__(self, vals: Collection[int]) -> None:
        super().__init__()
        self.vals = vals

    def __contains__(self, x: object) -> bool:
        return x in self.vals

    def __iter__(self) -> Iterator[int]:
        yield from self.vals

    def __len__(self) -> int:
        return len(self.vals)


coll = CustomCollection([2, 3, 4])
print(f"CustomCollection is Collection: {isinstance(coll, Collection)}")  # True


try:
    df.filter(pl.col("A").is_in(coll))
except TypeError as e:
    # TypeError: Series constructor called with unsupported type 'CustomCollection' for the `values` parameter
    print(e, "\n")


class CustomSequence(Sequence[int]):
    def __init__(self, vals: Sequence[int]) -> None:
        super().__init__()
        self.vals = vals

    def __len__(self) -> int:
        return len(self.vals)

    @overload
    def __getitem__(self, index: slice) -> Sequence[int]: ...

    @overload
    def __getitem__(self, index: int) -> int: ...

    def __getitem__(self, index: int | slice) -> int | Sequence[int]:
        return self.vals[index]


seq = CustomSequence([2, 3, 4])
print(f"CustomSequence is Sequence: {isinstance(seq, Sequence)}")  # True

print(df.filter(pl.col("A").is_in(seq)))

Log output

Issue description

The other parameter of the is_in method is annotated as Expr | Collection[Any] | Series. However, passing a custom Collection raises a TypeError when is_in tries to convert it to a pl.Series. The isinstance checks in the pl.Series constructor look for Sequences but not Collections, which causes the TypeError. Perhaps the type annotation for the other parameter of is_in should be updated to reflect that pl.Series expects a Sequence rather than a Collection?
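To make the Collection vs. Sequence distinction concrete, here is a minimal standard-library sketch. MinimalCollection is a hypothetical stand-in for the CustomCollection in the example above, not anything from Polars. It shows that such an object passes an isinstance check against Collection but not against Sequence, and that materializing it with list() yields a plain list.

```python
from collections.abc import Collection, Iterator, Sequence


class MinimalCollection(Collection):
    """Hypothetical stand-in for CustomCollection from the example above."""

    def __init__(self, vals) -> None:
        self._vals = list(vals)

    def __contains__(self, x: object) -> bool:
        return x in self._vals

    def __iter__(self) -> Iterator:
        return iter(self._vals)

    def __len__(self) -> int:
        return len(self._vals)


coll = MinimalCollection([2, 3, 4])

# A Collection only promises __contains__, __iter__ and __len__.
is_collection = isinstance(coll, Collection)

# A Sequence additionally requires __getitem__ and explicit subclassing
# or registration, so this check does not match.
is_sequence = isinstance(coll, Sequence)

# Possible workaround until the check is relaxed: materialize to a list
# before handing the values to is_in, e.g. pl.col("A").is_in(list(coll)).
as_list = list(coll)

print(is_collection, is_sequence, as_list)
```

Passing list(coll) instead of coll should sidestep the TypeError, since a native list is already accepted, at the cost of one extra copy.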

Expected behavior

Passing a custom Collection to is_in behaves the same as passing a native list, tuple, etc.

Installed versions

--------Version info---------
Polars:              1.20.0
Index type:          UInt32
Platform:            Windows-11-10.0.26100-SP0
Python:              3.12.8 (main, Dec 19 2024, 14:41:01) [MSC v.1942 64 bit (AMD64)]
LTS CPU:             False

----Optional dependencies----
Azure CLI            'az' is not recognized as an internal or external command,
operable program or batch file.
<not installed>
adbc_driver_manager  <not installed>
altair               <not installed>
azure.identity       <not installed>
boto3                <not installed>
cloudpickle          <not installed>
connectorx           <not installed>
deltalake            <not installed>
fastexcel            0.12.1
fsspec               2024.12.0
gevent               <not installed>
google.auth          <not installed>
great_tables         <not installed>
matplotlib           <not installed>
nest_asyncio         <not installed>
numpy                2.2.1
openpyxl             <not installed>
pandas               <not installed>
pyarrow              19.0.0
pydantic             2.10.5
pyiceberg            <not installed>
sqlalchemy           2.0.37
torch                <not installed>
xlsx2csv             <not installed>
xlsxwriter           3.2.0
@harrymconner harrymconner added bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars labels Jan 18, 2025
@MarcoGorelli MarcoGorelli added P-low Priority: low and removed needs triage Awaiting prioritization by a maintainer labels Jan 18, 2025
@github-project-automation github-project-automation bot moved this to Ready in Backlog Jan 18, 2025
@MarcoGorelli
Collaborator

Thanks @harrymconner for the report

I think the current check

            if isinstance(other, (set, frozenset)):
                other = list(other)

might be too tight
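For illustration only, a looser version of that check could look something like the sketch below. The helper name normalize_is_in_values is hypothetical and this is not the actual Polars code; it assumes that Expr and Series arguments have already been handled before this point, and it uses only the standard library.

```python
from collections.abc import Collection, Sequence
from typing import Any


def normalize_is_in_values(other: Any) -> Any:
    """Hypothetical helper: turn non-Sequence Collections into lists.

    Assumption: Expr and Series inputs were dispatched earlier, so they
    never reach this function.
    """
    # set and frozenset are Collections but not Sequences, so they are
    # still covered, along with user-defined Collection subclasses.
    # str, bytes, list and tuple are Sequences and pass through untouched.
    if isinstance(other, Collection) and not isinstance(other, Sequence):
        return list(other)
    return other


class OnlyACollection(Collection):
    """Hypothetical custom Collection used to exercise the helper."""

    def __init__(self, vals) -> None:
        self._vals = list(vals)

    def __contains__(self, x: object) -> bool:
        return x in self._vals

    def __iter__(self):
        return iter(self._vals)

    def __len__(self) -> int:
        return len(self._vals)


from_custom = normalize_is_in_values(OnlyACollection([2, 3, 4]))
from_set = normalize_is_in_values({2, 3, 4})
original_list = [2, 3, 4]
from_list = normalize_is_in_values(original_list)

print(from_custom, sorted(from_set), from_list is original_list)
```

One caveat with this approach: Collection matches structurally, so any object that defines __contains__, __iter__ and __len__ would be materialized into a list (a dict would become a list of its keys, for example). A real fix would probably need to exclude types the Series constructor already handles natively.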
