Palmer Penguins.Rmd

---
title: "R Notebook"
output:
  word_document: default
  html_notebook: default
  pdf_document: default
---

This is an [R Markdown](http://rmarkdown.rstudio.com) Notebook. When you execute code within the notebook, the results appear beneath the code. 

Try executing this chunk by clicking the *Run* button within the chunk or by placing your cursor inside it and pressing *Ctrl+Shift+Enter*. 

Add a new chunk by clicking the *Insert Chunk* button on the toolbar or by pressing *Ctrl+Alt+I*.

When you save the notebook, an HTML file containing the code and output will be saved alongside it (click the *Preview* button or press *Ctrl+Shift+K* to preview the HTML file).

The preview shows you a rendered HTML copy of the contents of the editor. Consequently, unlike *Knit*, *Preview* does not run any R code chunks. Instead, the output of the chunk when it was last run in the editor is displayed.

```{r}
#Install Remote Package if not already available

#install.packages("remotes")

#Install Palmer Penguin Data
#remotes::install_github("allisonhorst/palmerpenguins")

```

#citation("palmerpenguins")
#> 
#> To cite palmerpenguins in publications use:
#> 
#>   Horst AM, Hill AP, Gorman KB (2020). palmerpenguins: Palmer
#>   Archipelago (Antarctica) penguin data. R package version 0.1.0.
#>   https://allisonhorst.github.io/palmerpenguins/. doi:
#>   10.5281/zenodo.3960218.
#> 
#> A BibTeX entry for LaTeX users is
#> 
#>   @Manual{,
#>     title = {palmerpenguins: Palmer Archipelago (Antarctica) penguin data},
#>     author = {Allison Marie Horst and Alison Presmanes Hill and Kristen B Gorman},
#>     year = {2020},
#>     note = {R package version 0.1.0},
#>     doi = {10.5281/zenodo.3960218},
#>     url = {https://allisonhorst.github.io/palmerpenguins/},
#>   }
#> Also help from Jason Brownlee on machine learning in R February 3, 2016 article


```{r}
#Load Palmer Penguin library and data set
library(palmerpenguins)
data(package = 'palmerpenguins')
```

```{r}
#Load libraries we need for dataframe manipulation and plotting
library(tidyr)
library(dplyr)
library(ggplot2)
library(gridExtra)
library(caret)
library(ellipse)
library(cluster) 
library(fpc)
library(factoextra)
library(knitr)
```


```{r}
#Initial look at the reduced data set
head(penguins)
```
### Some data will likely not be necesary for prediction purposes like "sex" and "year". 


```{r}
#Look at the data set columns and data types.
str(penguins)
```
### Essentially, we have three species: "Adelie", "Chinstrap", and "Gentoo". 
### The penguins are located across three islands: "Biscoe", "Dream", and "Torgersen"

```{r}
#Look at number of species in data set
table(penguins$species)
```
```{r}
#Taking a looka tht dimensions of the reduced data set
dim(penguins)
```
### There are 344 rows, matching the count of penguins with 8 columns.
### However, the data does include some "NA" or non measured penguins. Those rows will need ## to be dropped for the machine learning analysis.


```{r}
#Copy data into new dataframe and look at Column names for calculations

df <- penguins
names(df)
```


```{r}
#Initial plot to see which parameters might yield clues about species types
#Species vs body mass

bp<-ggplot(aes(x = species, y = body_mass_g, fill=species), data = df) + geom_boxplot() + coord_flip()

bp + scale_fill_hue(l=40, c=35)
```
### It appears for body mass that the Chinstrap and Adelie penguins share some
### similarities, while the Gentoo penguins are heavier.


```{r}
#Initial plot to see which parameters might yield clues about species types
#Species vs bill length
bp<-ggplot(aes(x = species, y = bill_length_mm, fill=species), data = df) + geom_boxplot() + coord_flip()

bp + scale_fill_hue(l=40, c=35)
```
### On the other hand, the Chinstrap and Gentoo penguins share similar
### bill length.


```{r}
#Initial plot to see which parameters might yield clues about species types
#Species vs bill depth
bp<-ggplot(aes(x = species, y = bill_depth_mm, fill=species), data = df) + geom_boxplot() + coord_flip()

bp + scale_fill_hue(l=40, c=35)
```

### Chinstrap and Adelie penguins have overlap in bill depth.

```{r}
#Initial plot to see which parameters might yield clues about species types
#Species vs flipper length
bp<-ggplot(aes(x = species, y = flipper_length_mm, fill=species), data = df) + geom_boxplot() + coord_flip()

bp + scale_fill_hue(l=40, c=35)

```
### And lastly, Chinstrap and Adelie penguins have some overlap in flipper length


```{r}
#Grid analysis of histograms of the parameters: Length and Width and the Petal Length and Width.
# "species"           "island"            "bill_length_mm"    "bill_depth_mm"     "flipper_length_mm"
# "body_mass_g"       "sex"               "year"     

p1 <- ggplot(aes(x = bill_length_mm, fill = species), data = df) +
  geom_histogram()

p2 <- ggplot(aes(x = bill_depth_mm, fill = species), data = df) +
   geom_histogram()

p3 <- ggplot(aes(x = flipper_length_mm, fill = species), data = df) + geom_histogram()

p4 <- ggplot(aes(x = body_mass_g, fill = species), data = df) +
   geom_histogram()


grid.arrange(p1, p2, p3, p4, ncol = 2)
```

### Again, using the histograms, we the same overlaps in a more colorful fashion.


```{r}

#Looking at which penguins were measure on which islands

ggplot(penguins, aes(x = island, fill = species)) +
  geom_bar(alpha = 0.8) +
  scale_fill_manual(values = c("darkorange","purple","cyan4"),
                    guide = FALSE) +
  theme_minimal() +
  facet_wrap(~species, ncol = 1) +
  coord_flip()


```
### It would appear for the study that Adelie penguins can be found on all
### three islands, which Chinstraps were exclusive to only Dream and
### Gentoo exclusive only to Biscoe.

```{r}
penguins %>%
  select(species, body_mass_g, ends_with("_mm")) %>%
  GGally::ggpairs(aes(color = species)) +
  scale_colour_manual(values = c("darkorange","purple","cyan4")) +
  scale_fill_manual(values = c("darkorange","purple","cyan4"))
```
### The pairwise graph sums up our previous findings.


```{r}

#Before modeling, it is good to look to see if there is any clustering in
#the data set

#Variables used in clustering "Island", bill_length_mm", "bill_depth_mm",
#"flipper_length_mm", and "body_mass_g"

#Seed for starting number used to generate a sequence of random numbers
set.seed(20)

#Creating new dataframe for clustering.
data_for_clustering <- df

#Dropping "NA" values
data_for_clustering_no_na <- drop_na(data_for_clustering)

#kmeans data clustering partitioning (assuming 3 centers or clusters).

#Using Species as the target, with features "bill_length_mm" "bill_depth_mm"     #"flipper_length_mm" and "body_mass_g" 

clusters_penguins <- kmeans(data_for_clustering_no_na[,4:6], centers = 3)

plotcluster(data_for_clustering_no_na[,3:6], clusters_penguins$cluster, color = TRUE, shade = TRUE, xlab="", ylab="")

```

#Clustering does indicate that there is some separation in the penguin data
#that might lend itself for modeling. 

```{r}
#Creating clustering table to exam if there is proper separation
penguins_no_na <- drop_na(penguins)

table(clusters_penguins$cluster, penguins_no_na$species)
```
#It appears that some of the penguins aren't properly categorized due to the
#feature overlaps


#Modeling Work
#Goal is to determine if a model can be built to predict the species of penguin
#based on measurements

#Data will be split into a test set and a validation set.

```{r}
#Create a Validation Dataset

#Starting with clean dataframe and removing "NA" values


data_for_model <- df

#Dropping "NA" values
data_for_model_no_na <- drop_na(data_for_model)


set.seed(300)

validation_index <- createDataPartition(data_for_model_no_na$species, p=0.80, list=FALSE)

# select 20% of the data for validation
validation <- data_for_model_no_na[-validation_index,]


# use the remaining 80% of data to training and testing the models
data_for_model_no_na <- data_for_model_no_na[validation_index,]

```

```{r}
#Summary of the data set
dim(data_for_model_no_na)
dim(validation)
```
###Validation and Modeling dataframes have correct proportions

```{r}
# list types for each attribute
sapply(data_for_model_no_na, class)
```

```{r}
# take a peek at the first 5 rows of the data
head(data_for_model_no_na)
```


```{r}

# list the levels for the class
levels(data_for_model_no_na$species)
```
#Looking at the above data reads, the modeling dataframe checks out.

```{r}

# summarize the class distribution
percentage <- prop.table(table(data_for_model_no_na$species)) * 100
cbind(freq=table(data_for_model_no_na$species), percentage=round(percentage,1))

```

```{r}
# summarize attribute distributions
summary(data_for_model_no_na)
```

```{r}
#Visualize Data set


# split input and output (not using "Island" feature)
x <- data_for_model_no_na[,3:6]
y <- data_for_model_no_na[,1]


```

```{r}
# barplot for class breakdown
plot(y)
```

```{r}

# scatterplot matrix
featurePlot(x=x, y=y$species, plot="ellipse")
```

```{r}
# box and whisker plots for each attribute
featurePlot(x=x, y=y$species, plot="box")
```

```{r}

# density plots for each attribute by class value
scales <- list(x=list(relation="free"), y=list(relation="free"))
featurePlot(x=x, y=y$species, plot="density", scales=scales)
```

```{r}
#Evaluate 5 different Algorithms

# Run algorithms using 10-fold cross validation
control <- trainControl(method="cv", number=10, savePredictions = TRUE)
metric <- "Accuracy"


```

```{r}

# a) linear algorithms
set.seed(7)
fit.lda <- train(species~., data=data_for_model_no_na, method="lda", metric=metric, trControl=control)


# b) nonlinear algorithms
# CART
set.seed(7)
fit.cart <- train(species~., data=data_for_model_no_na, method="rpart", metric=metric, trControl=control)

# kNN
set.seed(7)
fit.knn <- train(species~., data=data_for_model_no_na, method="knn", metric=metric, trControl=control)

# c) advanced algorithms
# SVM
set.seed(7)
fit.svm <- train(species~., data=data_for_model_no_na, method="svmRadial", metric=metric, trControl=control)

# Random Forest
set.seed(7)
fit.rf <- train(species~., data=data_for_model_no_na, method="rf", metric=metric, trControl=control)
```

```{r}
# summarize accuracy of models
results <- resamples(list(lda=fit.lda, cart=fit.cart, knn=fit.knn, svm=fit.svm, rf=fit.rf))
summary(results)
```
#At first blush, lda has the highest accuracy with SVM and rf following closely


```{r}

# Here we visually compare accuracy of the models
dotplot(results)
```

```{r}
# Summarize Best Model
print(fit.lda)

```

```{r}

# estimate skill of best model on the validation dataset
predictions <- predict(fit.lda, validation)
confusionMatrix(predictions, validation$species)

```

###In Summary, the data set is rather small with only 344 measurements.
###However, there appears to be enough variance in the features to be able
###to use modeling to predict the penguin species based on the measurements.
###Possible next steps would be to gather more penguin data.