The one-hot implementation can be sped up a bit by replacing the current implementation with the code below, which vectorises the lookups and lets numpy do the heavy lifting.
This speeds up the groundhog example code by about 25%, with speed-ups of the individual function between 20% and 75% (depending on the number of categories), and a small regression in some cases.
However, this implementation is more complex and possibly more fragile, so it should perhaps be thought through in more depth.
import numpy as np

def one_hot(col_data, categories):
    # One column per category; fill each column with a vectorised comparison.
    res = np.zeros((len(col_data), len(categories)))
    col_data_np = col_data.to_numpy()
    for idx, x in enumerate(categories):
        res[:, idx] = col_data_np == x
    # Every row must match exactly one category.
    if not (res.sum(axis=1) == 1).all():
        raise ValueError("A value has been discovered which is not in the list of categories")
    return res
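For reference, a minimal usage sketch; the column values and category list here are made up purely for illustration:

import pandas as pd

col_data = pd.Series(["red", "green", "red", "blue"])
categories = ["red", "green", "blue"]
encoded = one_hot(col_data, categories)  # shape (4, 3), one column per category in the given order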
Alternative formulations using pandas get_dummies or using pandas map appear to be slower on this dataset.
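For comparison, one possible get_dummies formulation is sketched below; this is an assumption about how such a variant could look, not necessarily the exact version that was benchmarked.

import pandas as pd

def one_hot_get_dummies(col_data, categories):
    # Casting to Categorical fixes the column order and turns unknown values into NaN.
    cat = pd.Categorical(col_data, categories=categories)
    if pd.isna(cat).any():
        raise ValueError("A value has been discovered which is not in the list of categories")
    return pd.get_dummies(cat).to_numpy(dtype=float)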
However, the map formulation is worth giving, as it should scale better for a large number of categories, a case not present in this dataset.
# Build a category -> column index lookup once, then map the whole column in one call.
c1d = {d: idx for idx, d in enumerate(categories)}
cidx = col_data.map(c1d)
This formulation appears to save 20-40 percent over the current implementation, which is less than the vectorised version above achieves on this dataset, but it should in general scale better for a large number of categories and thus might be preferred.
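For completeness, a full map-based variant might look as follows; the validation and the final fill step are assumptions added to mirror the vectorised implementation above, not necessarily the exact version benchmarked.

import numpy as np

def one_hot_map(col_data, categories):
    # Category -> column index lookup, built once.
    c1d = {d: idx for idx, d in enumerate(categories)}
    cidx = col_data.map(c1d)
    # Values missing from `categories` map to NaN.
    if cidx.isna().any():
        raise ValueError("A value has been discovered which is not in the list of categories")
    res = np.zeros((len(col_data), len(categories)))
    # One assignment per row via fancy indexing.
    res[np.arange(len(col_data)), cidx.to_numpy(dtype=int)] = 1
    return res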