The one-hot implementation can be sped up a bit by replacing the current implementation with the code below, which vectorises the lookups and lets numpy do the heavy lifting.
This speeds up the groundhog example code by about 25%, with speed-ups of the individual function between 20% and 75% (depending on the number of categories), and a small regression in some cases.
However, this implementation is more complex and possibly more fragile, so it should perhaps be thought through in more depth.
import numpy as np

def one_hot(col_data, categories):
    # One column per category; fill each column with a vectorised comparison.
    res = np.zeros((len(col_data), len(categories)))
    col_data_np = col_data.to_numpy()
    for idx, x in enumerate(categories):
        res[:, idx] = col_data_np == x
    # Every row must match exactly one category.
    if not (res.sum(axis=1) == 1).all():
        raise ValueError("A value has been discovered which is not in the list of categories")
    return res
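For reference, a minimal usage sketch; the column values and category list here are made up purely for illustration:

import pandas as pd

col_data = pd.Series(["red", "green", "red", "blue"])
categories = ["red", "green", "blue"]
encoded = one_hot(col_data, categories)  # shape (4, 3), one column per category in the given order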
Alternative formulations using pandas get_dummies or using pandas map appear to be slower on this dataset.
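For comparison, one possible get_dummies formulation is sketched below; this is an assumption about how such a variant could look, not necessarily the exact version that was benchmarked.

import pandas as pd

def one_hot_get_dummies(col_data, categories):
    # Casting to Categorical fixes the column order and turns unknown values into NaN.
    cat = pd.Categorical(col_data, categories=categories)
    if pd.isna(cat).any():
        raise ValueError("A value has been discovered which is not in the list of categories")
    return pd.get_dummies(cat).to_numpy(dtype=float)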
However, the map formulation is worth giving, as it should scale better for a large number of categories, a case not present in this dataset.
# Build a category -> column index lookup once, then map the whole column in one call.
c1d = {d: idx for idx, d in enumerate(categories)}
cidx = col_data.map(c1d)
This formulation appears to save 20-40 percent over the current implementation, which is less than the vectorised version above achieves on this dataset, but it should in general scale better for a large number of categories and thus might be preferred.
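For completeness, a full map-based variant might look as follows; the validation and the final fill step are assumptions added to mirror the vectorised implementation above, not necessarily the exact version benchmarked.

import numpy as np

def one_hot_map(col_data, categories):
    # Category -> column index lookup, built once.
    c1d = {d: idx for idx, d in enumerate(categories)}
    cidx = col_data.map(c1d)
    # Values missing from `categories` map to NaN.
    if cidx.isna().any():
        raise ValueError("A value has been discovered which is not in the list of categories")
    res = np.zeros((len(col_data), len(categories)))
    # One assignment per row via fancy indexing.
    res[np.arange(len(col_data)), cidx.to_numpy(dtype=int)] = 1
    return res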