get_historical_features does not work on join keys with field mappings #4889

aloysius-lim opened this issue Jan 3, 2025 · 0 comments

Expected Behavior

Given an Entity whose join key column has a different name in the data source, a field_mapping can be set on the data source to map the source column name to the join key. get_historical_features should then recognize that the join key has a field mapping and generate the correct alias in the query.

For example:

from feast import Entity, Field, FeatureStore, FeatureView
from feast.types import Float32, String
from feast.infra.offline_stores.contrib.spark_offline_store.spark_source import SparkSource

# Initialize Feature Store.
store = FeatureStore(...)

# "driver_id" is used in the Feature Store.
driver = Entity(name="driver", join_keys=["driver_id"])

# Using SparkSource as an example, but this applies to other sources.
# Source data contains a primary key called "id". This is mapped to the join key "driver_id".
driver_stats_src = SparkSource(
    name="driver_stats",
    field_mapping={"id": "driver_id"},
    path=...,
    file_format=...,
)
driver_stats_fv = FeatureView(
    name="driver_stats",
    source=driver_stats_src,
    entities=[driver],
    schema=[
        # The join key must be specified in the schema; otherwise it is not
        # included in driver_stats_fv.entity_columns.
        Field(name="driver_id", dtype=String),
        Field(name="stat1", dtype=Float32),
        Field(name="stat2", dtype=Float32),
    ],
)

# Get historical features.
store.get_historical_features(
    entity_df=...,
    features=[
        "driver_stats:stat1",
        "driver_stats:stat2",
    ],
)

When get_historical_features is run, the alias id AS driver_id should be included in the generated query. For the Spark offline store, for example, the subquery should look like this:

driver_stats__subquery AS (
    SELECT
        event_timestamp as event_timestamp,
        created as created_timestamp,

        id AS driver_id,
        
        stat1 as stat1, stat2 as stat2
    FROM `feast_entity_df_677a1a6fd13443c6b0e8ccc059b25f01`
    WHERE event_timestamp <= '2025-01-05T14:00:00'
)

Current Behavior

This is what currently happens (Spark example):

pyspark.errors.exceptions.captured.AnalysisException: [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or function parameter with name `driver_id` cannot be resolved. Did you mean one of the following? [`id`, `stat1`, `stat2`]

Underlying Spark query:

driver_stats__subquery AS (
    SELECT
        event_timestamp as event_timestamp,
        created as created_timestamp,

        -- Here is the problem.
        driver_id AS driver_id,

        stat1 as stat1, stat2 as stat2
    FROM `feast_entity_df_677a1a6fd13443c6b0e8ccc059b25f01`
    WHERE event_timestamp <= '2025-01-05T14:00:00'
)

Steps to reproduce

See example above.

Specifications

  • Version: 0.42.0
  • Platform: macOS 14.6.1
  • Subsystem:

Possible Solution

See PR #4886
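
For reference, the core idea is to apply the data source's field_mapping in reverse when the offline store builds the SELECT list for a feature view subquery, so that each join key is aliased from its source column name. Below is a minimal, hypothetical sketch of that logic (the helper name and structure are made up for illustration and are not the actual code in the PR):

# Hypothetical sketch: reverse the field_mapping so join keys can be aliased
# from their source column names when the subquery SELECT list is built.
def join_key_select_exprs(join_keys, field_mapping):
    # field_mapping maps source column -> feature store name, e.g. {"id": "driver_id"}.
    reverse_mapping = {dest: src for src, dest in field_mapping.items()}
    # For each join key, select the underlying source column (if mapped) and
    # alias it to the join key name expected by the feature store.
    return [f"{reverse_mapping.get(key, key)} AS {key}" for key in join_keys]

# Example: join_key_select_exprs(["driver_id"], {"id": "driver_id"})
# returns ["id AS driver_id"], which is the alias missing today.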
