Gap-filling #109

Open
gdkrmr opened this issue Jan 9, 2025 · 1 comment

gdkrmr commented Jan 9, 2025

We are interested in gap-filling and I tried an approach along these lines:

def splitcol(data, col):
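    """Split `data` on column `col` for gap-filling.

    - Separate `data` into the predictors (all columns except `col`) and the
      target column `col`.

    - Then split the predictors into rows where `col` is observed (training)
      and rows where `col` is missing (to be predicted).
    """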

    data_no_col = data.drop(col, axis = 1)
    data_col = data[[col]]
    data_col_non_na_idx = data_col.dropna().index

    x_train = data_no_col.loc[data_col_non_na_idx]
    y_train = data_col.loc[data_col_non_na_idx]

    x_pred = data_no_col.drop(data_col_non_na_idx)

    return x_train, y_train, x_pred


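from tabpfn import TabPFNRegressor

for col in var_cols:
    # Fit on the rows where `col` is observed, then predict the rows where it is missing.
    x_train, y_train, x_pred = splitcol(wdi_data, col)

    clf = TabPFNRegressor()
    clf.fit(x_train, y_train)
    prediction = clf.predict(x_pred)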

but this fails in clf.predict because of #108. Is there a plan to add gap-filling support?
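For illustration, here is a minimal sketch of the column-wise split described above on a toy DataFrame, including the step of writing the predictions back into the gaps. The toy data, the `fill_column` helper, and the `LeastSquaresStandIn` class are assumptions made up for this example; the stand-in is a plain least-squares fit, not TabPFN, and the sketch does not address the failure tracked in #108.

```python
import numpy as np
import pandas as pd


def splitcol(data, col):
    """Split `data` into training rows (col observed) and prediction rows (col missing)."""
    data_no_col = data.drop(col, axis=1)
    observed_idx = data[col].dropna().index
    x_train = data_no_col.loc[observed_idx]
    y_train = data.loc[observed_idx, col]
    x_pred = data_no_col.drop(observed_idx)  # rows where `col` is missing
    return x_train, y_train, x_pred


class LeastSquaresStandIn:
    """Hypothetical stand-in for any regressor with fit/predict (NOT TabPFN)."""

    def fit(self, x, y):
        design = np.column_stack([np.ones(len(x)), np.asarray(x, dtype=float)])
        self.coef_, *_ = np.linalg.lstsq(design, np.asarray(y, dtype=float), rcond=None)
        return self

    def predict(self, x):
        design = np.column_stack([np.ones(len(x)), np.asarray(x, dtype=float)])
        return design @ self.coef_


def fill_column(data, col, model):
    """Return a copy of `data` with the gaps in `col` replaced by model predictions."""
    x_train, y_train, x_pred = splitcol(data, col)
    filled = data.copy()
    if len(x_pred) > 0:
        model.fit(x_train, y_train)
        filled.loc[x_pred.index, col] = model.predict(x_pred)
    return filled


# Toy data: b = 2 * a + 1, with two gaps in b
toy = pd.DataFrame({"a": [0.0, 1.0, 2.0, 3.0, 4.0],
                    "b": [1.0, 3.0, np.nan, 7.0, np.nan]})
filled = fill_column(toy, "b", LeastSquaresStandIn())
```

In this toy case only the target column has gaps. When several columns have gaps at once, the predictor frames passed to `fit` and `predict` also contain missing values, and whether that works depends on the model being used. The split also assumes the DataFrame index is unique, since rows are selected and dropped by index label.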

noahho (Collaborator) commented Jan 9, 2025

This functionality may already be available via tabpfn-extensions (experimental). Take a look and see if this does the job:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

from tabpfn import TabPFNClassifier, TabPFNRegressor
from tabpfn_extensions import unsupervised

import numpy as np
import torch

# Load the breast cancer dataset and split it into train and test sets
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

# Initialize the unsupervised model, which wraps a TabPFN classifier and regressor
model_unsupervised = unsupervised.TabPFNUnsupervisedModel(
    tabpfn_clf=TabPFNClassifier(),
    tabpfn_reg=TabPFNRegressor()
)

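# Mask some test values so there is something to impute / infill
X_test[30:40, 0:5] = np.nan

# Fit on the complete training data, then impute the masked test entries
model_unsupervised.fit(torch.tensor(X_train).float(), torch.tensor(y_train).float())
X_imputed = model_unsupervised.impute(torch.tensor(X_test).float())

# Show the imputed values at the positions that were masked
X_imputed[np.isnan(X_test)]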
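
Note that the snippet above masks X_test in place, so the original values at the masked positions are no longer available for comparison. One way to judge the imputation quality is to keep a copy of the true values before masking and compute the error on the masked cells only. The sketch below shows that pattern with synthetic data and a column-mean imputer; both are assumptions for illustration, and `mean_impute` is a hypothetical stand-in, not the tabpfn-extensions imputer.

```python
import numpy as np

rng = np.random.default_rng(0)
X_true = rng.normal(size=(50, 6))  # stands in for X_test before masking

# Keep the ground truth untouched and mask a block of cells in a copy
X_masked = X_true.copy()
X_masked[30:40, 0:5] = np.nan
mask = np.isnan(X_masked)


def mean_impute(X):
    """Hypothetical stand-in imputer: fill each gap with its column mean."""
    col_means = np.nanmean(X, axis=0)
    return np.where(np.isnan(X), col_means, X)


X_imputed = mean_impute(X_masked)

# Root-mean-square error on the masked cells only
rmse = float(np.sqrt(np.mean((X_imputed[mask] - X_true[mask]) ** 2)))
print(f"RMSE on {int(mask.sum())} masked cells: {rmse:.3f}")
```

With the real model, `X_imputed` would come from the impute call instead. If that call returns a torch tensor, converting it to a NumPy array first keeps the masked comparison the same.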
