Speed up #79

Open
andeElliott opened this issue Jul 12, 2022 · 0 comments
Assignees
Labels
enhancement New feature or request

Comments

@andeElliott
Collaborator

andeElliott commented Jul 12, 2022

The one-hot implementation can be sped up by replacing the current implementation with the code below, which vectorises the lookups and lets numpy do the heavy lifting.

This speeds up the groundhog example code by 25%, with speed-ups of the individual function of between 20% and 75% (depending on the number of categories) and a small regression in some cases.

However, this implementation is more complex and possibly more fragile, so it should perhaps be thought through in more depth.

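import numpy as np

def one_hot(col_data, categories):
    # One row per value in col_data, one column per category.
    res = np.zeros((len(col_data), len(categories)))
    col_data_np = col_data.to_numpy()
    # Compare the whole column against each category in one vectorised step.
    for idx, x in enumerate(categories):
        res[:, idx] = col_data_np == x
    # Every row must match exactly one category.
    if not (res.sum(axis=1) == 1).all():
        raise ValueError("A value has been discovered which is not in the list of categories")
    return res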
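
The per-category loop above can also be collapsed into a single broadcast comparison. The sketch below is only a variant for discussion: the name `one_hot_broadcast` is hypothetical, casting the categories to an object array is an assumption made to keep mixed types comparable, and it has not been timed against the groundhog example.

```python
import numpy as np
import pandas as pd

def one_hot_broadcast(col_data, categories):
    # Hypothetical variant: compare all values against all categories at once.
    values = col_data.to_numpy()
    # Assumption: an object array keeps string and numeric categories comparable.
    cats = np.asarray(categories, dtype=object)
    # (n, 1) == (1, k) broadcasts to an (n, k) boolean matrix.
    res = (values[:, None] == cats[None, :]).astype(float)
    # Every row must match exactly one category.
    if not (res.sum(axis=1) == 1).all():
        raise ValueError("A value has been discovered which is not in the list of categories")
    return res

example = one_hot_broadcast(pd.Series(["a", "b", "a", "c"]), ["a", "b", "c"])
```

The intermediate boolean matrix has the same shape as the result, so memory use grows with the number of rows times the number of categories, just as in the loop version.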

Alternative formulations using pandas get_dummies or using pandas map appear to be slower on this dataset.
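
For reference, a `get_dummies` formulation might look like the sketch below. The name `one_hot_get_dummies` is hypothetical, and wrapping the column in `pd.Categorical` is an assumption made here so that the columns follow the order of `categories` and unobserved categories still get a column; the variant that was timed may have been written differently.

```python
import pandas as pd

def one_hot_get_dummies(col_data, categories):
    # Assumption: a Categorical with explicit categories fixes the column order
    # and turns any value outside `categories` into NaN.
    cat_col = pd.Series(pd.Categorical(col_data, categories=categories))
    if cat_col.isna().any():
        raise ValueError("A value has been discovered which is not in the list of categories")
    # get_dummies emits one column per category, including unobserved ones.
    return pd.get_dummies(cat_col).to_numpy(dtype=float)

example = one_hot_get_dummies(pd.Series(["a", "b", "a", "c"]), ["a", "b", "c", "d"])
```

Note that a genuine missing value in `col_data` is rejected in the same way as an unknown category in this sketch.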

However, the map formulation is worth showing, as it should scale better for a large number of categories, a case that does not occur in this dataset.

# Map each category to its column index, then look up every value at once.
c1d = {d: idx for idx, d in enumerate(categories)}
cidx = col_data.map(c1d)

This formulation appears to save 20-40%, but it should in general scale better with a large number of categories and thus might be preferred.
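
The map snippet only produces the column index for each value; one way to turn those indices into a one-hot matrix is sketched below. The name `one_hot_map` is hypothetical, and filling the matrix with integer fancy indexing is an assumption about how the formulation would be completed, not necessarily the version that was timed.

```python
import numpy as np
import pandas as pd

def one_hot_map(col_data, categories):
    # Dictionary lookup: category -> column index.
    c1d = {d: idx for idx, d in enumerate(categories)}
    cidx = col_data.map(c1d)
    # Values missing from the dictionary map to NaN.
    if cidx.isna().any():
        raise ValueError("A value has been discovered which is not in the list of categories")
    res = np.zeros((len(col_data), len(categories)))
    # Set exactly one entry per row using (row, column) index pairs.
    res[np.arange(len(col_data)), cidx.to_numpy(dtype=int)] = 1
    return res

example = one_hot_map(pd.Series(["a", "b", "a", "c"]), ["a", "b", "c"])
```

The work here is one dictionary lookup per row rather than one full-column comparison per category, which is why the cost should depend less on the number of categories.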

@andeElliott andeElliott added the enhancement New feature or request label Jul 12, 2022