---
title: "Auto ML vs Feature Engineering"
author: "Masanao Yajima"
date: "2023-01-05"
output: html_document
---
```{css,echo=FALSE}
.btn {
border-width: 0 0px 0px 0px;
font-weight: normal;
text-transform: none;
}
.btn-default {
color: #2ecc71;
background-color: #ffffff;
border-color: #ffffff;
}
```
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE,message=FALSE,fig.align="center",fig.width=7,fig.height=7)
```
# Class Workbook {.tabset .tabset-fade .tabset-pills}
```{python}
import numpy as np
import pandas as pd
import math
from matplotlib.pyplot import subplots
#import statsmodels.api as sm
from plotnine import *
import plotly.express as px
import statsmodels.formula.api as sm
#import ISLP as islp
```
## Ames Housing data
We will look at the Ames Housing data. The task is to predict sale prices for houses sold in 2008 and later using data from before 2008.
```{python}
ames_raw=pd.read_csv("ames_raw.csv")
ames_raw_2009, ames_raw_2008= ames_raw.query('`Yr Sold`>=2008').copy(), ames_raw.query('`Yr Sold` <2008').copy()
```
The loss will be the same as before. Your company bids the predicted price: if the prediction is above the actual price, your company buys the house; if it is below, your company fails to buy.
- If you bought for more than the actual value, you've overpaid.
- If you bid less and lost, you missed out on a profit of 10% of the house price.
```{python}
def calc_loss(prediction, actual):
    difpred = actual - prediction
    # root mean squared error of the predictions
    RMSE = pow(difpred.pow(2).mean(), 1/2)
    # overpayment on houses bought plus 10% lost profit on houses not bought
    operation_loss = abs(sum(difpred[difpred < 0])) + sum(0.1 * actual[difpred > 0])
    return RMSE, operation_loss
```
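As a quick sanity check, the loss can be computed for a naive baseline that bids the average pre-2008 sale price on every later house (purely illustrative; `naive_pred` is a name introduced here):
```{python}
# naive baseline: bid the mean pre-2008 sale price on every held-out house
naive_pred = pd.Series(ames_raw_2008["SalePrice"].mean(), index=ames_raw_2009.index)
rmse, op_loss = calc_loss(naive_pred, ames_raw_2009["SalePrice"])
print(rmse, op_loss)
```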
## Feature engineering
### Types of Feature engineering
There are several categories of feature engineering.
1. Adding information from other sources
2. Missing Data Handling
3. Dealing with problematic values (outliers, inliers, etc)
4. Making variables that make sense for the context
5. Transformation
6. Scaling
7. Discretization
### 1. Adding information from other sources
When handed a dataset, it's easy to jump right into the analysis. This is typical behavior, especially for a novice. However, there is often information that could be explored if you know what you are looking for. There are a few categories of such information.
a) Information that was not given to you but someone has access to.
When you are not the data creator, you are sometimes not given access to certain information. The most common case is information that pertains to privacy or protected attributes. This information is often withheld for reasons external to the project you are working on. However, in certain circumstances, if you know what you are looking for, you might be able to negotiate access to information that could save you some headaches down the line. Think outside the box and be creative. The important caveat is that obtaining some information can have legal consequences. Web scraping and other means of data collection should be done with care. Some industries, such as pharmacies, have strict rules that prohibit the use of pharmacy information for retail purposes.
b) Information that is public but you need to obtain.
There is information about places and things on the internet that is easy to incorporate.
For example, in housing data, geographic information can be tied to census information. Financial information might require adjusting for inflation, which again can be found on the internet. Other survey information might be available if you care to look for it. One thing to be careful about is that not all the information you can find will be useful. You need to balance the time required against the benefit of the information.
c) Information that is confusing for machines
Coded variables without a key are meaningless, but to a computer they look like numeric variables. If you are not careful, you might include them as numeric. Take `MS SubClass`, which codes the building class.
```{python}
ames_raw.loc[:,"MS SubClass"].value_counts()
```
Unfortunately, the help file does not contain detailed information on the codes. But with some research you will be able to [find](https://github.com/zzeniale/Ames-housing-price-prediction) that the codes have no inherent ordering. Therefore, you need to think carefully about what matters in context and then recode the variable in some way, for example by grouping the codes as sketched after the list below.
- 20 1-STORY 1946 & NEWER ALL STYLES
- 30 1-STORY 1945 & OLDER
- 40 1-STORY W/FINISHED ATTIC ALL AGES
- 45 1-1/2 STORY - UNFINISHED ALL AGES
- 50 1-1/2 STORY FINISHED ALL AGES
- 60 2-STORY 1946 & NEWER
- 70 2-STORY 1945 & OLDER
- 75 2-1/2 STORY ALL AGES
- 80 SPLIT OR MULTI-LEVEL
- 85 SPLIT FOYER
- 90 DUPLEX - ALL STYLES AND AGES
- 120 1-STORY PUD (Planned Unit Development) - 1946 & NEWER
- 150 1-1/2 STORY PUD - ALL AGES
- 160 2-STORY PUD - 1946 & NEWER
- 180 PUD - MULTILEVEL - INCL SPLIT LEV/FOYER
- 190 2 FAMILY CONVERSION - ALL STYLES AND AGES
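One possible recoding groups the codes into broader unordered categories (the grouping itself is a judgment call, and the column name `MS SubClass grp` is introduced here for illustration):
```{python}
# group the MS SubClass codes into broader, unordered categories (illustrative grouping)
subclass_groups = {
    20: "1story", 30: "1story", 40: "1story", 120: "1story",
    45: "1.5story", 50: "1.5story", 150: "1.5story",
    60: "2story", 70: "2story", 75: "2story", 160: "2story",
    80: "split", 85: "split", 180: "split",
    90: "multifamily", 190: "multifamily",
}
ames_raw["MS SubClass grp"] = ames_raw["MS SubClass"].map(subclass_groups).astype("category")
ames_raw["MS SubClass grp"].value_counts()
```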
### 2. Missing Data Handling
To handle missing data, it's always essential to consider the context. Data that is missing is not by itself a problem. The fundamental problem is the bias that these variables might introduce down the line if incorporated carelessly. Doing a careful imputation takes effort. When time is a concern, deleting variables with a high rate of missingness should be considered.
a) Missing data that is not really missing
A variable such as `Garage Yr Blt` has 159 missing observations. But if you look carefully, you will realize that the houses missing this information are the ones that have no garage. This is not missing data but a coding problem. One must decide what to do with such information based on the context; you should not fill this kind of missingness with an arbitrary number (one option is sketched after the check below).
```{python}
ames_raw.loc[np.isnan(ames_raw.loc[:,"Garage Yr Blt"]),"Garage Cars"]
```
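One context-appropriate option is to flag "no garage" explicitly rather than invent a year (a sketch; the column name `Has Garage` and the `"None"` level are assumptions introduced here):
```{python}
# encode "no garage" explicitly instead of imputing a fake year
no_garage = ames_raw["Garage Yr Blt"].isna()
ames_raw["Has Garage"] = (~no_garage).astype(int)
# the categorical garage fields can likewise be given an explicit "None" level
ames_raw["Garage Type"] = ames_raw["Garage Type"].fillna("None")
```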
b) Missing data that is too big
Some variables might have too much missing data, and there may be a good reason for that. If there is a way to craft a variable that could serve as a proxy for such information, one should try. But if that effort introduces additional uncertainty, one might remove the variable altogether.
```{python}
missing_data_proportion = ames_raw.isna().mean()
```
```{python}
# variables with more than 10% missingness
missing_data_proportion[missing_data_proportion > 0.1]
import matplotlib.pyplot as plt
# plot the missingness proportions in increasing order
plt.scatter(range(len(missing_data_proportion)), np.sort(missing_data_proportion), marker='o')
# naming the x axis
plt.xlabel("order")
# naming the y axis
plt.ylabel('proportion missing')
# giving a title to my graph
plt.title('Missing data proportion')
# function to show the plot
plt.show()
```
c) Missing data that could carry additional information
If missingness is intentional, one might add an indicator variable to signify the missingness. You will still need to fill the missing values with something, and what that is depends on the variable.
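For example, a missingness indicator can be paired with a simple fill (a sketch using `Lot Frontage`, one of the variables with substantial missingness; the median fill is only an illustration):
```{python}
# keep the fact of missingness as its own variable, then fill the original
ames_raw["Lot Frontage missing"] = ames_raw["Lot Frontage"].isna().astype(int)
ames_raw["Lot Frontage"] = ames_raw["Lot Frontage"].fillna(ames_raw["Lot Frontage"].median())
```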
d) Missing completely at random (MCAR)
If MCAR, one could remove the rows with missingness without introducing bias. However, this is a strong assumption that is often not met in practice.
e) Missing at Random (MAR)
For MAR, regression-based imputation is often used. Many packages allow you to do these imputations reasonably easily. One thing to keep in mind is that some imputation methods work better after transformation than before; this depends on the model being used to impute. See the R packages `mi`, `mice`, etc. for details.
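In Python, a regression-based imputation along these lines can be sketched with scikit-learn's `IterativeImputer` (still marked experimental), applied here to the numeric columns only:
```{python}
# regression-based (iterative) imputation sketch for the numeric columns
from sklearn.experimental import enable_iterative_imputer  # required to expose IterativeImputer
from sklearn.impute import IterativeImputer

num_ames = ames_raw.select_dtypes(include=np.number)
ames_imputed = pd.DataFrame(IterativeImputer(random_state=0).fit_transform(num_ames),
                            columns=num_ames.columns, index=num_ames.index)
```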
f) Missing not at random (MNAR)
MNAR variables are hard to deal with. One needs to weigh the cost and benefit of including such variables. An example is a variable like income: if all the low-income people are not responding, one might use a small number as a proxy. But if there are reasons to believe there are multiple causes of missingness, and no way to tell them apart, then you might be better off removing the variable.
### 3. Dealing with problematic values (outliers, inliers, etc)
Problematic observations such as outliers are hard to find and often require you to revisit this step a few times. This is important because you must deal with them before applying transformations. For example, outliers distort statistics such as means, which is problematic if you plan to use a z-score transformation. When you have a lot of zeros, this can affect how you want to transform a variable. EDA often finds outliers, but some may not surface until the modeling phase. Truncating or removing observations with outliers should be done with caution, since doing so can introduce unwanted features into the data.
Here is an illustration of two types of outliers that are harder and easier to find.
```{python, fig.width=12, fig.height=4}
from scipy.stats import multivariate_normal
import matplotlib.pyplot as plt

# 100 correlated points plus two hand-placed outliers (shown in magenta)
dat = multivariate_normal([0, 0], [[3.0, 2.0], [2.0, 3.0]]).rvs(100)
dat = np.vstack([dat, [7.0, 7.0], [-3.0, 4.0]])

col = ["blue"] * 100 + ["magenta"] * 2
for i in range(102):
    plt.scatter(dat[i, 0], dat[i, 1], marker='o', color=col[i])
plt.show()
```
```{python}
xx = range(0, 102, 1)
for i in range(102):
    plt.scatter(xx[i], dat[i, 0], marker='o', color=col[i])
plt.show()
```
```{python}
for i in range(102):
    plt.scatter(xx[i], dat[i, 1], marker='o', color=col[i])
plt.show()
```
Looking at the unfinished basement and second-floor square footage below, you can see bimodality as well as properties that are outliers. This should make you cautious about applying scaling to these variables.
```{python}
import matplotlib.pyplot as plt
plt.scatter(ames_raw.loc[:,"Bsmt Unf SF"], ames_raw.loc[:,"2nd Flr SF"],marker='o')
# naming the x axis
plt.xlabel("Bsmt Unf SF")
# naming the y axis
plt.ylabel('2nd Flr SF')
# giving a title to my graph
plt.title('Bsmt Unf SF vs 2nd Flr SF')
# function to show the plot
plt.show()
# plot(ames_raw$`Bsmt Unf SF`,ames_raw$`2nd Flr SF`)
```
### 4. Making variables that make sense for the context
Context matters when doing feature engineering. Take, for example, the Ames housing data. Ames is a university town where many people have some tie to Iowa State University. Therefore, a variable like distance from the university might make sense to include in the analysis. Another thing to think about is something like the year built. The impact of the year built is not absolute and shifts over the years, so one might want to create a variable for the age of the house at the time of sale.
```{python}
# handling Year features
ames_raw["yrs_since_remod"]= ames_raw.loc[:,"Yr Sold"] - ames_raw.loc[:,"Year Remod/Add"]
# Total Living Area
ames_raw["TotalArea"] =ames_raw.loc[:,"Gr Liv Area"] + ames_raw.loc[:,"Total Bsmt SF"]
# TotalBath
ames_raw["TotalBath"] =ames_raw.loc[:,"Bsmt Full Bath"] + 0.5 * ames_raw.loc[:,"Bsmt Half Bath"]+ames_raw.loc[:,"Full Bath"] + 0.5 * ames_raw.loc[:,"Half Bath"]
```
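The house-age variable described above can be added in the same way (`HouseAge` is a name introduced here for illustration):
```{python}
# age of the house at the time of sale
ames_raw["HouseAge"] = ames_raw.loc[:, "Yr Sold"] - ames_raw.loc[:, "Year Built"]
```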
### 5. Transformation
When a predictor is right-skewed, it tends to distort a linear model by creating leverage points. Taking a log often resolves this problem.
```{python,fig.width=7,fig.height=4}
(
ggplot(ames_raw)+geom_point()+aes(x="Gr Liv Area",y="SalePrice")+xlab("Above grade (ground) living area square feet")+ylab("Sale Price")+geom_smooth(method="lm",se=False)
)
(
ggplot(ames_raw)+geom_point()+aes(x="Gr Liv Area",y="SalePrice")+xlab("Above grade (ground) living area square feet")+ylab("Sale Price")+geom_smooth(method="lm",se=False)+scale_y_log10()+scale_x_log10()
)
```
### 6. Scaling, centering and normalizing.
For linear regression models, centering and scaling do not change the model itself, but they change the interpretability of the model coefficients. Putting all the predictors on a similar scale has the advantage that the sizes of the coefficients directly indicate the influence of the predictors. For some hierarchical models, scaling also helps with convergence problems. And scaling is critical for all the distance-based methods you will encounter later in the semester.
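A minimal sketch of centering and scaling the numeric predictors with scikit-learn follows (which columns to scale, and fitting the scaler on training data only, are modeling choices):
```{python}
from sklearn.preprocessing import StandardScaler

# center and scale the numeric columns; StandardScaler ignores NaNs when fitting
num_cols = ames_raw.select_dtypes(include=np.number).columns
ames_scaled = pd.DataFrame(StandardScaler().fit_transform(ames_raw[num_cols]),
                           columns=num_cols, index=ames_raw.index)
```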
### 7. Discretization
Categorical variables need to be coded appropriately. Dummy coding (one-hot encoding) is one option when the information is nominal. Take, for example, the building type variable: by default it is a character variable with five values.
```{python}
ames_raw.loc[:,"Bldg Type"].value_counts()
```
One can use contextual information to convert them into meaningful variables, such as single family versus multiple family or a shared building, or simply use dummy coding.
```{python}
pd.get_dummies(ames_raw.loc[:,"Bldg Type"])
```
It's essential to know how dummy variables are included in a model, as this is model-specific.
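A contextual conversion of the building type, as mentioned above, might look like the following sketch (it assumes the standard Ames codes `1Fam`, `2fmCon`, `Duplex`, `TwnhsE`, and `Twnhs`; the grouping is illustrative):
```{python}
# collapse building type into single-family / multi-family / townhouse
bldg_map = {"1Fam": "single_family", "2fmCon": "multi_family", "Duplex": "multi_family",
            "TwnhsE": "townhouse", "Twnhs": "townhouse"}
ames_raw["Bldg Type grp"] = ames_raw["Bldg Type"].map(bldg_map)
ames_raw["Bldg Type grp"].value_counts()
```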
### 8. Grouping
Not every categorical variable needs a unique category for every level. One might consider grouping some categories so that there are fewer categories to model.
For example, the overall condition is rated from 1 to 10, as shown below.
```{python}
(
ggplot(ames_raw)+geom_histogram()+aes(x="Overall Cond")
)
```
It's important to know which coding works better. For the Ames data the scale is in fact:
- 10 Very Excellent
- 9 Excellent
- 8 Very Good
- 7 Good
- 6 Above Average
- 5 Average
- 4 Below Average
- 3 Fair
- 2 Poor
- 1 Very Poor
One could convert these into integers since there is an explicit ordering. However, the distribution of the variable is uneven, with many observations at 5 and very few below 5. In such a case, combining the categories into three may be better, since the data do not seem to have the resolution to support all ten levels.
```{python}
def if_this_else_that(x, thresh, low, mid, high):
    if x < thresh:
        res = low
    elif x > thresh:
        res = high
    else:
        res = mid
    return res
ames_raw["OverallCond3"]= ames_raw["Overall Cond"].apply(lambda x: if_this_else_that(x, 5, 1, 2, 3))
(
ggplot(ames_raw)+geom_histogram()+aes(x="OverallCond3")
)
```
### 9. Selecting and compressing
There are various reasons why you need to be selective about what to include: a variable may carry little information given the sample size, there may be contextual reasons, or its information may overlap with other variables.
- If there is very little variability in a variable, it is unlikely to provide much differentiating information.
- For highly correlated variables, you might select a subset so that the correlation does not distort the model building (a selection sketch follows the correlation plot below).
```{python}
# Correlation matrix
import seaborn as sns
import matplotlib.pyplot as plt
g = sns.clustermap(ames_raw.select_dtypes(include=np.number).corr(),
method = 'complete',
cmap = 'RdBu',
annot = True,
annot_kws = {'size': 8})
plt.setp(g.ax_heatmap.get_xticklabels(), rotation=60);
plt.show()
```
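One simple, if crude, way to act on high correlations is to drop one variable from each highly correlated pair (a sketch; the 0.9 cutoff is an arbitrary choice):
```{python}
# find variables to drop: any column correlated above the cutoff with an earlier column
num = ames_raw.select_dtypes(include=np.number)
corr = num.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
to_drop
```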
Alternatively, you could compress the correlated variables using dimension reduction. However, there is no free lunch: you need to do the scaling and missing-data processing before you can apply PCA, and you need to decide how many components to keep. (In R, the pcaMethods package offers a way to fit PCA even in the presence of missing data; in Python, a simple approach is to impute first, as below.)
```{python}
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
# mean-impute the numeric columns, then standardize them before PCA
imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')
ames_raw_fill = imp_mean.fit_transform(ames_raw.select_dtypes(include=np.number))
ames_raw_scaled = StandardScaler().fit_transform(ames_raw_fill)
pca = PCA(n_components=20)
pca.fit(ames_raw_scaled)
print(pca.explained_variance_ratio_)
```
## In class work
### Model fitting
Since you've worked on it in MA679, please copy and paste your best model here.
```{python,echo=TRUE}
##
##
```
Your answer:
~~~
Please write your answer in full sentences.
~~~
Please perform feature engineering on the Ames housing data that you think will help with the prediction.
```{python,echo=TRUE}
##
##
```
Your answer:
~~~
Please write your answer in full sentences.
~~~
Compare the result before and after the feature engineering step.
```{python,echo=TRUE}
##
##
```
Your answer:
~~~
Please write your answer in full sentences.
~~~
## AutoML
Feature engineering is mostly about context. But does it matter when prediction is the only interest? Are there automatic ways to do all of this that work better? Let's find out.
Include all the variables you used above, as well as anything else you want to add to the model.
```{python}
vars = ["SalePrice","Lot Area","Gr Liv Area","Full Bath"]
```
```{python}
train_set = ames_raw_2008.loc[:, vars]
test_set = ames_raw_2009.loc[:, vars]
train_set["SalePrice"] = np.log(train_set["SalePrice"])
test_set["SalePrice"] = np.log(test_set["SalePrice"])
```
### Pycaret
```{python}
# https://michael-fuchs-python.netlify.app/2022/01/15/automl-using-pycaret-regression/
import pycaret.regression as pycr
summary_preprocess = pycr.setup(train_set,
target = 'SalePrice',
numeric_features = ["Lot Area","Gr Liv Area","Full Bath"],
normalize = True,
feature_selection = True,
remove_multicollinearity = True)
available_models = pycr.models()
best_reg = pycr.compare_models()
evaluation_best_clf = pycr.evaluate_model(best_reg)
```
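To compare against the feature-engineered model on the same footing, the selected model can be scored on the held-out data (2008 and later). This is a sketch: the prediction column is named `prediction_label` in PyCaret 3.x and `Label` in 2.x, and the log transform has to be undone before computing the loss.
```{python}
# score the held-out data and evaluate with the course loss function
pred_holdout = pycr.predict_model(best_reg, data=test_set)
pred_col = "prediction_label" if "prediction_label" in pred_holdout.columns else "Label"
rmse, op_loss = calc_loss(np.exp(pred_holdout[pred_col]), np.exp(test_set["SalePrice"]))
print(rmse, op_loss)
```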
### H2O autoML
H2O AutoML is well known in the field as a powerful tool.
https://docs.h2o.ai/h2o/latest-stable/h2o-docs/automl.html
```{python,eval=FALSE}
import h2o
from h2o.automl import H2OAutoML
localH2O = h2o.init(nthreads = -1, max_mem_size="4g")
# upload the data
train_hf = h2o.H2OFrame(train_set)
test_hf = h2o.H2OFrame(test_set)
aml = H2OAutoML(max_models=10, max_runtime_secs=10,seed=1)
# Train the model
aml.train(x=["Lot Area","Gr Liv Area","Full Bath"], y="SalePrice", training_frame=train_hf)
# Look at the result
lb = aml.leaderboard
lb.head(rows=lb.nrows) # Print all rows instead of default (10 rows)
# The leader model is stored here
aml.leader
preds = aml.leader.predict(test_hf)
# Get the best performing model
m = aml.get_best_model()
# Get the best model using a non-default metric
m = aml.get_best_model(criterion="rmse")
# Get the best XGBoost model using default sort metric
xgb = aml.get_best_model(algorithm="xgboost")
h2o.cluster().shutdown(prompt =False)
```
### autosklearn
I was not able to make autosklearn work on a Mac.
```{python}
# import autosklearn.regression
# automl = autosklearn.regression.AutoSklearnRegressor(
#     time_left_for_this_task=120,
#     per_run_time_limit=30,
#     tmp_folder="/tmp/autosklearn_regression_example_tmp",
# )
# # X_train / y_train would be the Ames predictors and SalePrice here
# automl.fit(X_train, y_train, dataset_name="ames")
```
### autoxgboost
XGBoost is a popular implementation of the gradient boosting method, which we will talk about in MA679. Leaving the details aside, it is another widely used ML method with a lot of tuning parameters. AutoXGB (used below) is a tool that searches for a good choice of these parameters automatically.
I was not able to make autoxgb work on a Mac.
```{python,eval=FALSE}
# load library
# from autoxgb import AutoXGB
#
# train_set.to_csv('train_set.csv', index=False)
# test_set.to_csv('test_set.csv', index=False)
#
# axgb = AutoXGB(
# train_filename="train_set.csv",
# output="output",
# test_filename="test_set.csv",
# task=None,
# idx=None,
# targets=["SalePrice"],
# features=None,
# categorical_features=None,
# use_gpu=False,
# num_folds=5,
# seed=123,
# num_trials=100,
# time_limit=30,
# fast=False,
# )
# axgb.train()
```