---
title: "R Notebook"
output:
  word_document: default
  html_notebook: default
  pdf_document: default
---
This is an [R Markdown](http://rmarkdown.rstudio.com) Notebook. When you execute code within the notebook, the results appear beneath the code.
Try executing this chunk by clicking the *Run* button within the chunk or by placing your cursor inside it and pressing *Ctrl+Shift+Enter*.
Add a new chunk by clicking the *Insert Chunk* button on the toolbar or by pressing *Ctrl+Alt+I*.
When you save the notebook, an HTML file containing the code and output will be saved alongside it (click the *Preview* button or press *Ctrl+Shift+K* to preview the HTML file).
The preview shows you a rendered HTML copy of the contents of the editor. Consequently, unlike *Knit*, *Preview* does not run any R code chunks. Instead, the output of the chunk when it was last run in the editor is displayed.
```{r}
#Install Remote Package if not already available
#install.packages("remotes")
#Install Palmer Penguin Data
#remotes::install_github("allisonhorst/palmerpenguins")
```
To cite palmerpenguins in publications (from `citation("palmerpenguins")`):

Horst AM, Hill AP, Gorman KB (2020). palmerpenguins: Palmer Archipelago (Antarctica) penguin data. R package version 0.1.0. https://allisonhorst.github.io/palmerpenguins/. doi: 10.5281/zenodo.3960218.

A BibTeX entry for LaTeX users is

    @Manual{,
      title = {palmerpenguins: Palmer Archipelago (Antarctica) penguin data},
      author = {Allison Marie Horst and Alison Presmanes Hill and Kristen B Gorman},
      year = {2020},
      note = {R package version 0.1.0},
      doi = {10.5281/zenodo.3960218},
      url = {https://allisonhorst.github.io/palmerpenguins/},
    }

Also, help came from Jason Brownlee's article on machine learning in R (February 3, 2016).
```{r}
#Load Palmer Penguin library and data set
library(palmerpenguins)
data("penguins", package = "palmerpenguins")
```
```{r}
#Load libraries we need for dataframe manipulation and plotting
library(tidyr)
library(dplyr)
library(ggplot2)
library(gridExtra)
library(caret)
library(ellipse)
library(cluster)
library(fpc)
library(factoextra)
library(knitr)
```
```{r}
#Initial look at the reduced data set
head(penguins)
```
### Some columns, such as "sex" and "year", will likely not be necessary for prediction purposes.
```{r}
#Look at the data set columns and data types.
str(penguins)
```
### Essentially, we have three species: "Adelie", "Chinstrap", and "Gentoo".
### The penguins are located across three islands: "Biscoe", "Dream", and "Torgersen"
```{r}
#Look at number of species in data set
table(penguins$species)
```
```{r}
#Taking a look at the dimensions of the reduced data set
dim(penguins)
```
### There are 344 rows, one per penguin, and 8 columns.
### However, the data does include some "NA" values for penguins that were not fully measured. Those rows will need to be dropped for the machine learning analysis.
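A quick way to see how many rows would be dropped is to count the missing values per column and the number of complete rows. The sketch below uses base R only and a small made-up data frame (`toy_penguins` is a hypothetical stand-in, not the real data); the same calls should work on `penguins`.

```{r}
# Toy stand-in for the penguins data; values are illustrative, not real measurements
toy_penguins <- data.frame(
  species = c("Adelie", "Adelie", "Gentoo", "Chinstrap"),
  bill_length_mm = c(39.1, NA, 46.1, 49.5),
  body_mass_g = c(3750, NA, 4500, 3800),
  sex = c("male", NA, NA, "female")
)

# Number of NA values in each column
na_per_column <- colSums(is.na(toy_penguins))

# Number of rows that survive dropping every row with at least one NA
complete_rows <- sum(complete.cases(toy_penguins))

na_per_column
complete_rows
```

Running the same two lines on `penguins` before calling `drop_na()` shows which columns drive the loss of rows.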
```{r}
#Copy data into new dataframe and look at Column names for calculations
df <- penguins
names(df)
```
```{r}
#Initial plot to see which parameters might yield clues about species types
#Species vs body mass
bp<-ggplot(aes(x = species, y = body_mass_g, fill=species), data = df) + geom_boxplot() + coord_flip()
bp + scale_fill_hue(l=40, c=35)
```
### It appears for body mass that the Chinstrap and Adelie penguins share some
### similarities, while the Gentoo penguins are heavier.
```{r}
#Initial plot to see which parameters might yield clues about species types
#Species vs bill length
bp<-ggplot(aes(x = species, y = bill_length_mm, fill=species), data = df) + geom_boxplot() + coord_flip()
bp + scale_fill_hue(l=40, c=35)
```
### On the other hand, the Chinstrap and Gentoo penguins share similar
### bill length.
```{r}
#Initial plot to see which parameters might yield clues about species types
#Species vs bill depth
bp<-ggplot(aes(x = species, y = bill_depth_mm, fill=species), data = df) + geom_boxplot() + coord_flip()
bp + scale_fill_hue(l=40, c=35)
```
### Chinstrap and Adelie penguins have overlap in bill depth.
```{r}
#Initial plot to see which parameters might yield clues about species types
#Species vs flipper length
bp<-ggplot(aes(x = species, y = flipper_length_mm, fill=species), data = df) + geom_boxplot() + coord_flip()
bp + scale_fill_hue(l=40, c=35)
```
### And lastly, Chinstrap and Adelie penguins have some overlap in flipper length
```{r}
#Grid of histograms of the four measurements: bill length, bill depth,
#flipper length, and body mass.
# Columns available: "species" "island" "bill_length_mm" "bill_depth_mm" "flipper_length_mm"
# "body_mass_g" "sex" "year"
p1 <- ggplot(aes(x = bill_length_mm, fill = species), data = df) +
  geom_histogram()
p2 <- ggplot(aes(x = bill_depth_mm, fill = species), data = df) +
  geom_histogram()
p3 <- ggplot(aes(x = flipper_length_mm, fill = species), data = df) +
  geom_histogram()
p4 <- ggplot(aes(x = body_mass_g, fill = species), data = df) +
  geom_histogram()
grid.arrange(p1, p2, p3, p4, ncol = 2)
```
### Again, using the histograms, we see the same overlaps in a more colorful fashion.
```{r}
#Looking at which penguins were measured on which islands
ggplot(penguins, aes(x = island, fill = species)) +
  geom_bar(alpha = 0.8) +
  scale_fill_manual(values = c("darkorange","purple","cyan4"),
                    guide = "none") +
  theme_minimal() +
  facet_wrap(~species, ncol = 1) +
  coord_flip()
```
### It would appear for the study that Adelie penguins can be found on all
### three islands, while Chinstraps were found only on Dream and
### Gentoos only on Biscoe.
```{r}
penguins %>%
select(species, body_mass_g, ends_with("_mm")) %>%
GGally::ggpairs(aes(color = species)) +
scale_colour_manual(values = c("darkorange","purple","cyan4")) +
scale_fill_manual(values = c("darkorange","purple","cyan4"))
```
### The pairwise graph sums up our previous findings.
```{r}
#Before modeling, it is good to look to see if there is any clustering in
#the data set
#Variables used in clustering: "bill_length_mm", "bill_depth_mm",
#"flipper_length_mm", and "body_mass_g" (columns 3 to 6)
#Seed for starting number used to generate a sequence of random numbers
set.seed(20)
#Creating new dataframe for clustering.
data_for_clustering <- df
#Dropping "NA" values
data_for_clustering_no_na <- drop_na(data_for_clustering)
#kmeans data clustering partitioning (assuming 3 centers or clusters).
#Species is the target for comparison; the features are "bill_length_mm",
#"bill_depth_mm", "flipper_length_mm", and "body_mass_g"
clusters_penguins <- kmeans(data_for_clustering_no_na[,3:6], centers = 3)
plotcluster(data_for_clustering_no_na[,3:6], clusters_penguins$cluster, color = TRUE, shade = TRUE, xlab="", ylab="")
```
### Clustering does indicate that there is some separation in the penguin data
### that might lend itself to modeling.
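One thing that may be worth checking is feature scale: k-means uses Euclidean distance, and body mass in grams spans a much larger numeric range than the millimetre measurements. The sketch below is an assumption about a possible refinement, not part of the analysis above. It standardizes toy measurements (`toy_measurements` holds hypothetical values) with `scale()` and clusters both versions so the assignments can be compared.

```{r}
# Toy measurements (hypothetical values, not real penguins)
toy_measurements <- data.frame(
  bill_length_mm = c(38, 39, 40, 48, 49, 50),
  body_mass_g    = c(3500, 4500, 4000, 3600, 4400, 4000)
)

# scale() centers each column to mean 0 and rescales it to standard deviation 1
toy_scaled <- scale(toy_measurements)

set.seed(20)
# nstart = 10 repeats the random initialization and keeps the best solution
toy_clusters_raw    <- kmeans(toy_measurements, centers = 2, nstart = 10)
toy_clusters_scaled <- kmeans(toy_scaled, centers = 2, nstart = 10)

# Cross-tabulate the two assignments to see whether scaling changes the grouping
table(raw = toy_clusters_raw$cluster, scaled = toy_clusters_scaled$cluster)
```

On the real data, the equivalent would be to pass `scale(data_for_clustering_no_na[,3:6])` to `kmeans()` and compare the resulting table against species.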
```{r}
#Creating clustering table to examine if there is proper separation
penguins_no_na <- drop_na(penguins)
table(clusters_penguins$cluster, penguins_no_na$species)
```
### It appears that some of the penguins aren't properly categorized due to the
### feature overlaps.
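To put a number on how well the clusters line up with species, one option is cluster purity: the share of observations that belong to the majority species of their cluster. The sketch below uses a hypothetical cluster-by-species table (`toy_crosstab` is made up and is not the output of the chunk above).

```{r}
# Hypothetical cluster-by-species counts (not the real output of the chunk above)
toy_crosstab <- matrix(
  c(50,  5,  0,
     3, 20,  2,
     0,  1, 40),
  nrow = 3, byrow = TRUE,
  dimnames = list(cluster = c("1", "2", "3"),
                  species = c("Adelie", "Chinstrap", "Gentoo"))
)

# For each cluster (row), take the count of its most common species
majority_counts <- apply(toy_crosstab, 1, max)

# Purity: share of observations that belong to their cluster's majority species
purity <- sum(majority_counts) / sum(toy_crosstab)
purity
```

The same two lines can be applied to the result of `table(clusters_penguins$cluster, penguins_no_na$species)`.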
### Modeling Work
### The goal is to determine whether a model can be built to predict the species of penguin
### based on measurements.
### The data will be split into a training set and a validation set.
```{r}
#Create a Validation Dataset
#Starting with clean dataframe and removing "NA" values
data_for_model <- df
#Dropping "NA" values
data_for_model_no_na <- drop_na(data_for_model)
set.seed(300)
validation_index <- createDataPartition(data_for_model_no_na$species, p=0.80, list=FALSE)
# select 20% of the data for validation
validation <- data_for_model_no_na[-validation_index,]
# use the remaining 80% of the data for training and testing the models
data_for_model_no_na <- data_for_model_no_na[validation_index,]
```
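The split above relies on caret's `createDataPartition`, which for a factor outcome samples within each class so that class proportions are roughly preserved in both parts. A minimal base R sketch of that idea follows; `stratified_index` and the toy labels are hypothetical names for illustration, and caret's actual implementation may differ in detail.

```{r}
# Hypothetical helper: sample a fraction p of the row indices within each class
stratified_index <- function(labels, p) {
  rows_by_class <- split(seq_along(labels), labels)
  picked <- lapply(rows_by_class, function(rows) {
    # sample.int avoids the surprise of sample() on a length-one vector
    rows[sample.int(length(rows), size = round(p * length(rows)))]
  })
  sort(unlist(picked, use.names = FALSE))
}

# Toy class labels with unequal class sizes (10, 5, and 5)
toy_labels <- factor(rep(c("Adelie", "Chinstrap", "Gentoo"), times = c(10, 5, 5)))

set.seed(300)
toy_train_index <- stratified_index(toy_labels, p = 0.80)

# Class counts in the modeling part and in the held-out validation part
table(toy_labels[toy_train_index])
table(toy_labels[-toy_train_index])
```

Because the sampling happens per class, each species keeps about the same share in both parts, which matters when one class is much smaller than the others.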
```{r}
#Summary of the data set
dim(data_for_model_no_na)
dim(validation)
```
### Validation and modeling dataframes have the correct proportions.
```{r}
# list types for each attribute
sapply(data_for_model_no_na, class)
```
```{r}
# take a peek at the first 6 rows of the data
head(data_for_model_no_na)
```
```{r}
# list the levels for the class
levels(data_for_model_no_na$species)
```
### Looking at the output above, the modeling dataframe checks out.
```{r}
# summarize the class distribution
percentage <- prop.table(table(data_for_model_no_na$species)) * 100
cbind(freq=table(data_for_model_no_na$species), percentage=round(percentage,1))
```
```{r}
# summarize attribute distributions
summary(data_for_model_no_na)
```
```{r}
#Visualize Data set
# split input and output (not using "Island" feature)
x <- data_for_model_no_na[,3:6]
y <- data_for_model_no_na[,1]
```
```{r}
# barplot for class breakdown
plot(y)
```
```{r}
# scatterplot matrix
featurePlot(x=x, y=y$species, plot="ellipse")
```
```{r}
# box and whisker plots for each attribute
featurePlot(x=x, y=y$species, plot="box")
```
```{r}
# density plots for each attribute by class value
scales <- list(x=list(relation="free"), y=list(relation="free"))
featurePlot(x=x, y=y$species, plot="density", scales=scales)
```
```{r}
#Evaluate 5 different Algorithms
# Run algorithms using 10-fold cross validation
control <- trainControl(method="cv", number=10, savePredictions = TRUE)
metric <- "Accuracy"
```
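To make the 10-fold cross-validation setting concrete: the rows are divided into 10 folds, each fold is held out once for scoring, and the model is fit on the other nine. The base R sketch below shows only the fold assignment, under the assumption of plain random folds; `assign_folds` is a hypothetical helper, and caret builds its own folds internally.

```{r}
# Hypothetical helper: randomly assign each of n rows to one of k folds
assign_folds <- function(n, k) {
  # rep() gives near-equal fold sizes; sample() shuffles the assignment
  sample(rep(seq_len(k), length.out = n))
}

set.seed(7)
toy_folds <- assign_folds(n = 25, k = 10)

# Each fold is held out once; the remaining rows would be used for fitting
for (fold in seq_len(3)) {
  held_out <- which(toy_folds == fold)
  cat("fold", fold, "holds out", length(held_out), "rows\n")
}

# Fold sizes differ by at most one row
table(toy_folds)
```

The accuracy reported by `train()` is the average over the held-out folds, which is why it is a fairer estimate than accuracy on the rows the model was fit to.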
```{r}
# a) linear algorithms
set.seed(7)
fit.lda <- train(species~., data=data_for_model_no_na, method="lda", metric=metric, trControl=control)
# b) nonlinear algorithms
# CART
set.seed(7)
fit.cart <- train(species~., data=data_for_model_no_na, method="rpart", metric=metric, trControl=control)
# kNN
set.seed(7)
fit.knn <- train(species~., data=data_for_model_no_na, method="knn", metric=metric, trControl=control)
# c) advanced algorithms
# SVM
set.seed(7)
fit.svm <- train(species~., data=data_for_model_no_na, method="svmRadial", metric=metric, trControl=control)
# Random Forest
set.seed(7)
fit.rf <- train(species~., data=data_for_model_no_na, method="rf", metric=metric, trControl=control)
```
```{r}
# summarize accuracy of models
results <- resamples(list(lda=fit.lda, cart=fit.cart, knn=fit.knn, svm=fit.svm, rf=fit.rf))
summary(results)
```
### At first blush, lda has the highest accuracy, with SVM and rf following closely.
```{r}
# Here we visually compare accuracy of the models
dotplot(results)
```
```{r}
# Summarize Best Model
print(fit.lda)
```
```{r}
# estimate skill of best model on the validation dataset
predictions <- predict(fit.lda, validation)
confusionMatrix(predictions, validation$species)
```
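The headline numbers reported by `confusionMatrix` can be reproduced by hand from a table of predicted versus actual labels. The sketch below does this for overall accuracy and per-class recall using made-up labels (`toy_actual` and `toy_predicted` are hypothetical and are not results from the model above).

```{r}
# Hypothetical predicted and actual labels (not results from the model above)
toy_actual <- factor(c("Adelie", "Adelie", "Chinstrap",
                       "Gentoo", "Gentoo", "Chinstrap"))
toy_predicted <- factor(c("Adelie", "Chinstrap", "Chinstrap",
                          "Gentoo", "Gentoo", "Chinstrap"),
                        levels = levels(toy_actual))

# Rows are predictions, columns are the true classes
toy_confusion <- table(predicted = toy_predicted, actual = toy_actual)

# Accuracy: correctly classified observations (the diagonal) over all observations
toy_accuracy <- sum(diag(toy_confusion)) / sum(toy_confusion)

# Per-class recall (sensitivity): correct predictions over actual members of each class
toy_recall <- diag(toy_confusion) / colSums(toy_confusion)

toy_confusion
toy_accuracy
toy_recall
```

With only a few dozen validation rows, a single misclassified penguin moves the accuracy noticeably, so the per-class view is worth reading alongside the overall figure.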
### In summary, the data set is rather small, with only 344 measurements.
### However, there appears to be enough variance in the features to be able
### to use modeling to predict the penguin species based on the measurements.
### A possible next step would be to gather more penguin data.