---
title: "Shelter Animal Outcomes"
author: "Gabriel Lapointe"
date: "May 13, 2016"
output:
  pdf_document:
    toc: yes
  html_document:
    highlight: pygments
    number_sections: yes
    toc: yes
variant: markdown_github
---
# Data Acquisition
Every year, approximately 7.6 million companion animals end up in US shelters. Many animals are given up as unwanted by their owners, while others are picked up after getting lost or taken out of cruelty situations. Many of these animals find forever families to take them home, but just as many are not so lucky. Approximately 2.7 million dogs and cats are euthanized in the US every year.
## Objective
The objective is to help improve outcomes for shelter animals. Using a dataset of intake information including breed, color, sex, and age from the Austin Animal Center, we have to predict the outcome for each animal.
To support our analysis, we are given the train and test sets in CSV files. From this, we have written a code book to help understand each feature and its values.
## Data Source
The data comes from the [Austin Animal Center](http://www.austintexas.gov/department/animal-services) and covers October 1st, 2013 to March 2016. Outcomes represent the status of animals as they leave the Animal Center. All animals receive a unique Animal ID during intake. Source of the competition: [Shelter Animal Outcomes](https://www.kaggle.com/c/shelter-animal-outcomes).
## Dataset Questions
Before we start the exploration of the dataset, we need to write a list of questions about this dataset considering the problem we have to solve.
* How big is the dataset? (Described in the Code Book)
* Does the dataset contain `NA` or missing values? Can we replace them with a value? Why?
* Is the data coherent (dates in the same format, no out-of-bound values, no misspelled words, etc.)?
* What does the data look like and what are the relationships between features if they exist?
* What are the measures used?
* Can we solve the problem with this dataset?
Questions on features:
* Are the features in the dataset sufficient to explain each outcome?
* What is the proportion of animals, and how many cats and dogs are there for each outcome?
* Are there features that can be split into several other features? If yes, do they improve the score? Why and how?
* For each outcome, what features have the most importance and why?
## Evaluation Metrics
Since we have 5 outcome types and need to calculate a predicted probability for each class (outcome type), we have to manage 5 classes using a multi-class algorithm. The `Log Loss` function quantifies the accuracy of a classifier by penalising false classifications. Minimising the `Log Loss` function is essentially equivalent to maximising the accuracy of the classifier, which is what we need.
Thus, Kaggle provides us the evaluation metric we need:
$$LogLoss = -\frac{1}{N} \sum_{i = 1}^{N} \sum_{j = 1}^{M} y_{i,j} \log{\mathbb{P}_{i,j}}$$
where
* $N$ is the total number of animals,
* $M$ is the number of outcomes,
* $y_{i,j}$ is 1 if observation $i$ is in outcome $j$ and 0 otherwise,
* $\mathbb{P}_{i,j}$ is the predicted probability that observation $i$ belongs to outcome $j$.
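As a sanity check, this metric can be computed locally. A minimal sketch (the function name and the clipping constant are our own; the columns of `probs` are assumed to be in the same order as the outcome levels):

```{r echo = TRUE, eval = FALSE}
## Multi-class log loss: probs is an N x M matrix of predicted probabilities,
## actual is a factor of length N whose levels match the columns of probs.
MultiLogLoss <- function(probs, actual) {
  ## Clip probabilities away from 0 and 1 to avoid log(0).
  probs <- pmin(pmax(probs, 1e-15), 1 - 1e-15)
  ## For each observation, pick the probability assigned to its true outcome.
  picked <- probs[cbind(seq_along(actual), as.integer(actual))]
  -mean(log(picked))
}
```

Clipping the probabilities mirrors what Kaggle does on submission, so predictions of exactly 0 or 1 do not produce an infinite loss.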
## Methodology
To evaluate a model, we first train it on the training set. Once a model is chosen, we test it on the test set. We will also use cross-validation since we are limited by the amount of data given.
Since the output to predict is known and well defined, we will use a supervised algorithm to solve the problem. This is a multi-class problem where we have to give a probability for each class.
In this document, we start by exploring the dataset and building the data story behind it. This will give us important insights which will answer our questions about this dataset. The next step is feature engineering, which consists of creating, removing, or replacing features based on the insights gained while exploring the dataset. We will ensure our new dataset is a valid input for our prediction model. After applying our model to the test set, we will visualize the calculated predictions and explain the results.
<!------------------------------------------------------------EXPLORATORY ANALYSIS------------------------------------------------------------------------------>
# Data Preparation and Exploratory Analysis
In this section, we explore the dataset and test hypotheses on features. The objective is to visualize and understand the relationships between the features of the dataset we have to solve the problem with. We will also evaluate the changes we make to this dataset to validate whether or not they have a significant influence on the outcomes.
We will also add and remove features from the dataset depending on the importance of the insights we find through data visualization. We will transform each feature into categories represented by a positive integer. The objective is to clean and prepare the dataset for our prediction model and to answer the questions: `Are there features that can be split into several other features? If yes, do they improve the score? Why and how?` These questions will be answered throughout this section and in the data story presented at the conclusion of this document.
We will look at each feature of the dataset and split it if it carries more information than the feature's name suggests. For example, the feature `SexuponOutcome` tells us more than just the sex of the animal, given its possible values: we can extract the sex and whether the animal is sterile or not.
Each feature having a limited number of unique values (categories) will be numbered with a positive integer starting at 0. The zero-based ID will not hold for new features describing a count of elements (e.g. month or day number); those will be one-based instead.
We first load the test and train datasets and set the seed. Let's answer the question: `Does the dataset contain NA or missing values? Can we replace them with a value? Why?` All blank, space, and unknown values are replaced by the value `0`. After writing the code book, we have seen that the unique values in all features except the ID can be interpreted as categories which can be grouped by another feature. In this way, the unknown, space, or blank values can be treated as a category ID as well. Thus, we decide to set their value to 0.
```{r echo = TRUE, message = FALSE, warning = FALSE}
source("Utilities.R")
strings.toNA <- c("", " ", "Unknown", "0 years")
train <- read.csv("train.csv",
header = TRUE,
na.strings = strings.toNA,
stringsAsFactors = FALSE)
test <- read.csv("test.csv",
header = TRUE,
na.strings = strings.toNA,
stringsAsFactors = FALSE)
set.seed(1234)
## Remove scientific notation (e.g. E-005).
options(scipen = 999)
## Remove the AnimalID column, which is not needed for the analysis.
train$AnimalID <- NULL
test.id <- test$ID
test$ID <- NULL
train[is.na(train)] <- 0
test[is.na(test)] <- 0
## Samples and population sizes.
train.n <- nrow(train)
test.n <- nrow(test)
N <- train.n + test.n
```
Since the number of possible values for each feature is generally greater than 4 or 5, we will use bar charts to visualize our analysis of the dataset.
We also know that the population is $N = `r N`$ animals. The samples are $n_{test} = `r test.n`$ animals which is `r test.n / N * 100` % of the population and $n_{train} = `r train.n`$ animals the other `r train.n / N * 100`% of the population.
## Data Coherence
We answer the question: `Is the data coherent (dates in the same format, no out-of-bound values, no misspelled words, etc.)?` Most features are coherent, but we found some incoherence in the `AgeuponOutcome` feature. One incoherent value is `0 years`, which is impossible because an age cannot be 0. There is also a grammatical error in the value `1 weeks`, which should not take an 's'. Since the correctly written value `1 week` also exists, we need to find (with `grep`) values containing the string "week", which will automatically match both forms.
## Outcomes Visualization
The objective is to answer the question: `What is the proportion of animals, and how many cats and dogs are there for each outcome?` Let's visualize how the outcomes are split in the train set with the following bar chart.
```{r echo = FALSE, message = FALSE, warning = FALSE}
ggplot(train, aes(x = OutcomeType)) +
geom_bar(aes(y = ..count.. / sum(..count..))) +
facet_wrap(~AnimalType, ncol = 2) +
scale_y_continuous(labels = percent) +
geom_text(aes(y = ..count.. / sum(..count..) ,
label = scales::percent(..count.. / sum(..count..))),
stat = "count",
vjust = -0.25) +
ggtitle("Percentage of cats and dogs by outcome") +
labs(y = "Percentage of animals", x = "Outcome Types") +
theme(axis.text.x = element_text(angle = 90))
```
We can see from this bar chart that dogs are returned to their owner almost 8 times more often than cats (16% for dogs against 1.9% for cats). Cats are mostly transferred and dogs are mostly returned to their owner, with a significant difference. The percentages for the other outcomes are roughly the same.
We replace the possible values of the `AnimalType` feature by 0 = Dog and 1 = Cat.
```{r echo = TRUE, message = FALSE, warning = FALSE}
train.copy <- train
animal.types <- unique(c(train$AnimalType, test$AnimalType))
train$AnimalType <- GetIntegerFeatureFromGroups(train$AnimalType, animal.types) - 1
test$AnimalType <- GetIntegerFeatureFromGroups(test$AnimalType, animal.types) - 1
```
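The helper `GetIntegerFeatureFromGroups` comes from `Utilities.R`, which is not shown in this document. Judging from its usage, it presumably maps each value to its 1-based position in the vector of groups, along the lines of this sketch (values absent from `groups` would need extra handling):

```{r echo = TRUE, eval = FALSE}
## Hypothetical sketch: return the 1-based index of each value in `groups`.
GetIntegerFeatureFromGroups <- function(values, groups) {
  match(values, groups)
}
```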
## Animal's Age
The age of animals should be an important factor in outcomes. People want to adopt a young animal (between 1 month and 1 year old) because they can raise it more easily and keep it longer. Older animals (over 1 year old) may have more difficulty adjusting to a new owner. Baby animals may be subject to a transfer for experiments or for clinical reasons. Adopting a baby animal (less than a month old) requires more attention from the owner; thus, adoptions should not be frequent for them. We show a bar chart of the ages of the animals.
```{r echo = FALSE, message = FALSE, warning = FALSE}
train.copy$AgeuponOutcome[grep("1 week", train.copy$AgeuponOutcome)] <- "1 week"
ageuponoutcome <- train.copy %>%
select(AgeuponOutcome, AnimalType, OutcomeType) %>%
group_by(AgeuponOutcome)
ageuponoutcome %>%
ggplot(aes(x = AgeuponOutcome, fill = OutcomeType)) +
geom_bar(stat = "count") +
facet_wrap(~AnimalType, nrow = 2) +
ggtitle("Age of animals by outcome ordered by number of animals") +
labs(y = "Number of animals", x = "Age of animals") +
scale_fill_brewer(palette = "Set1") +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))
```
We can see that between 1 day and 4 weeks old, animals are mostly transferred, including unknowns identified by 0. Animals are mostly adopted between 1 month and 1 year old. Between 1 and 10 years old, animal adoptions decrease as age increases. Finally, animals are euthanized or returned to their owner especially between 11 and 20 years old.
We transform the feature `AgeuponOutcome` into integer values counted in days. For example, `2 years` is replaced by the value 2 * 365 = 730. For weeks, the formula is `age * 7`. For months, the formula is `age * 30`. For years, the formula is `age * 365`.
```{r echo = TRUE, message = FALSE, warning = FALSE}
## Get the list of integers extracted from the feature AgeuponOutcome.
train.age <- as.integer(regmatches(train$AgeuponOutcome,
regexpr("[[:digit:]]+", train$AgeuponOutcome)))
test.age <- as.integer(regmatches(test$AgeuponOutcome,
regexpr("[[:digit:]]+", test$AgeuponOutcome)))
## Get row index list where AgeuponOutcome contains "year", "month", "week" or "day".
## Get the correspondant integer from each row index and
## apply the formula to get all ages in days.
train.year.list <- grep("year", train$AgeuponOutcome)
train$AgeuponOutcome[train.year.list] <- train.age[train.year.list] * 365
train.month.list <- grep("month", train$AgeuponOutcome)
train$AgeuponOutcome[train.month.list] <- train.age[train.month.list] * 30
train.week.list <- grep("week", train$AgeuponOutcome)
train$AgeuponOutcome[train.week.list] <- train.age[train.week.list] * 7
train.day.list <- grep("day", train$AgeuponOutcome)
train$AgeuponOutcome[train.day.list] <- train.age[train.day.list]
test.year.list <- grep("year", test$AgeuponOutcome)
test$AgeuponOutcome[test.year.list] <- test.age[test.year.list] * 365
test.month.list <- grep("month", test$AgeuponOutcome)
test$AgeuponOutcome[test.month.list] <- test.age[test.month.list] * 30
test.week.list <- grep("week", test$AgeuponOutcome)
test$AgeuponOutcome[test.week.list] <- test.age[test.week.list] * 7
test.day.list <- grep("day", test$AgeuponOutcome)
test$AgeuponOutcome[test.day.list] <- test.age[test.day.list]
train$AgeuponOutcome <- as.integer(train$AgeuponOutcome)
test$AgeuponOutcome <- as.integer(test$AgeuponOutcome)
```
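The repetitive conversion above can also be written more compactly with a unit-to-days lookup table. A sketch that is equivalent under the same assumptions (30-day months, 365-day years); `AgeToDays` is our own name, not part of the original code:

```{r echo = TRUE, eval = FALSE}
## Convert strings like "2 years" or "3 weeks" to an age in days; unmatched
## values (e.g. the 0 placeholder for unknown ages) become 0.
AgeToDays <- function(age) {
  n <- suppressWarnings(as.integer(sub("^([[:digit:]]+).*", "\\1", age)))
  unit <- sub("s$", "", sub("^[[:digit:]]+ *", "", age))
  unit.days <- c(day = 1, week = 7, month = 30, year = 365)
  days <- n * unname(unit.days[unit])
  ifelse(is.na(days), 0L, as.integer(days))
}
```

Applied as `train$AgeuponOutcome <- AgeToDays(train$AgeuponOutcome)`, this avoids the separate `grep` pass per unit.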
Let's now see how the ages are distributed as a function of the number of animals and outcomes.
```{r echo = FALSE, message = FALSE, warning = FALSE}
train.copy$Age <- train$AgeuponOutcome
ageuponoutcome %>%
filter(grepl("year", AgeuponOutcome)) %>%
mutate(AgeYear = as.integer(gsub(" year[s]?", "", AgeuponOutcome))) %>%
ggplot(aes(x = AgeYear, fill = AnimalType)) +
geom_bar(stat = "count") +
facet_wrap(~OutcomeType, nrow = 2) +
ggtitle("1 year or older animals by outcome and animal type") +
labs(y = "Number of animals", x = "Age of animals") +
scale_fill_brewer(palette = "Set1")
ageuponoutcome %>%
filter(grepl("month", AgeuponOutcome)) %>%
mutate(AgeMonth = as.integer(gsub(" month[s]?", "", AgeuponOutcome))) %>%
ggplot(aes(x = AgeMonth, fill = AnimalType)) +
geom_bar(stat = "count") +
facet_wrap(~OutcomeType, nrow = 2) +
ggtitle("Animals between 1 and 11 months old by outcome and animal types") +
labs(y = "Number of animals", x = "Age of animals") +
scale_fill_brewer(palette = "Set1")
ageuponoutcome %>%
filter(grepl("week", AgeuponOutcome)) %>%
mutate(AgeWeek = as.integer(gsub(" week[s]?", "", AgeuponOutcome))) %>%
ggplot(aes(x = AgeWeek, fill = AnimalType)) +
geom_bar(stat = "count") +
facet_wrap(~OutcomeType, nrow = 2) +
ggtitle("Animals between 1 and 5 weeks old by outcome and animal types") +
labs(y = "Number of animals", x = "Age of animals") +
scale_fill_brewer(palette = "Set1")
ageuponoutcome %>%
filter(grepl("day", AgeuponOutcome)) %>%
mutate(AgeDay = as.integer(gsub(" day[s]?", "", AgeuponOutcome))) %>%
ggplot(aes(x = AgeDay, fill = AnimalType)) +
geom_bar(stat = "count") +
facet_wrap(~OutcomeType, nrow = 2) +
ggtitle("Animals between 1 and 6 days old by outcome and animal types") +
labs(y = "Number of animals", x = "Age of animals") +
scale_fill_brewer(palette = "Set1")
```
We can see 3 distinct age groups: days and weeks, months, and years. The days and weeks group contains no adoptions; mostly transfers occur. We can state the following hypotheses from this group.
1. The Animal Center makes the adoption possible only when an animal is 1 month or older.
2. People do not want to adopt baby animals because they would need to house-train them.
3. Animals get neutered or spayed starting at 1 month old.
The months group, representing animals from 1 month to 11 months, clearly shows that 2-month-old animals are mostly adopted.
The years group, representing animals 1 year old or more, is described by the following density bar chart.
```{r echo = FALSE, message = FALSE, warning = FALSE}
# SELECT AgeuponOutcome, OutcomeType, AnimalType, COUNT(AgeuponOutcome) AS Frequency,
# CAST(SUBSTRING(AgeuponOutcome, 1, PATINDEX(' year[s]?', AgeuponOutcome)) AS INT) AS AgeYear
# FROM Animal
# WHERE OutcomeType = 'Adoption'
# GROUP BY AgeuponOutcome
# HAVING AgeuponOutcome LIKE '%year%'
age.filtered <- ageuponoutcome %>%
filter(grepl("year", AgeuponOutcome) & OutcomeType == "Adoption") %>%
mutate(AgeYear = as.integer(gsub(" year[s]?", "", AgeuponOutcome))) %>%
mutate(Frequency = n()) %>%
ungroup() %>%
arrange(AgeYear) %>%
distinct(AgeYear)
# If there are no animals for particular ages, we set the density to 0.
age.sequence <- seq(min(age.filtered$AgeYear), max(age.filtered$AgeYear), by = 1)
age.notfound <- age.sequence[!age.sequence %in% age.filtered$AgeYear]
newyear <- data.frame(AgeuponOutcome = paste(age.notfound, "years"),
AnimalType = "Dog",
OutcomeType = "Adoption",
AgeYear = age.notfound,
Frequency = 0)
age.filtered <- bind_rows(age.filtered, newyear) %>%
arrange(AgeYear)
age.filtered %>%
ggplot(aes(x = AgeYear)) +
geom_bar(aes(y = Frequency / sum(Frequency)), stat = "identity") +
ggtitle("1 year or older adopted animals by animal type") +
labs(y = "Density", x = "Age of animals") +
scale_x_continuous(breaks = seq(min(age.filtered$AgeYear), max(age.filtered$AgeYear), by = 1)) +
scale_fill_brewer(palette = "Set1")
age.density <- data.frame(year_density = age.filtered$Frequency / sum(age.filtered$Frequency),
poisson_density = dpois(1:18, lambda = 1.6))
print(age.density)
plot(age.density$year_density,
type = "l",
main = "Age and Poisson densities comparison.",
xlab = "Age",
ylab = "Density")
lines(age.density$poisson_density, col = 2)
legend("topright", c("Age Density", "Poisson Density"), col = 1:2, cex = 0.8, fill = 1:2)
```
The discrete random variable Age is greater than 0, and ages are independent of each other; that is, age X cannot influence age Y. Given the bar chart, it is reasonable to say that the age approximately follows a Poisson distribution with lambda (mean) 1.6.
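This visual impression can be checked more formally. A sketch using a chi-squared goodness-of-fit test against Poisson(1.6); note that the expected counts for the oldest ages are small, so the p-value should be read with caution:

```{r echo = TRUE, eval = FALSE}
## Compare the observed yearly-age frequencies with Poisson(1.6) expectations.
observed <- age.filtered$Frequency
expected.p <- dpois(seq_along(observed), lambda = 1.6)
chisq.test(observed, p = expected.p, rescale.p = TRUE)
```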
## Animal's Name
The feature `Name` is transformed into a boolean value: 0 when the animal has no name and 1 otherwise. Logically, the name of an animal should not have any impact on the outcomes, but knowing whether an animal has a name may influence them.
```{r echo = FALSE, message = FALSE, warning = FALSE}
train.copy$NameLength <- ifelse(train.copy$Name != 0, str_length(train.copy$Name), 0)
train.copy %>%
select(NameLength, OutcomeType, AnimalType) %>%
group_by(NameLength) %>%
ggplot(aes(x = NameLength, fill = OutcomeType)) +
geom_bar(stat = "count") +
facet_wrap(~AnimalType, ncol = 2) +
ggtitle("Name length of animals by Outcome") +
scale_x_continuous(breaks = seq(min(train.copy$NameLength), max(train.copy$NameLength), by = 1)) +
scale_fill_brewer(palette = "Set1") +
labs(y = "Number of animals", x = "Name Length")
train.cats.noname <- train.copy %>%
select(AnimalType, Name, OutcomeType) %>%
filter(AnimalType == "Cat" & Name == 0 & OutcomeType == "Transfer")
train.transfers <- train.copy %>%
select(OutcomeType) %>%
filter(OutcomeType == "Transfer")
name.summary <- summary(train.copy$NameLength[train.copy$NameLength > 0])
print(name.summary)
```
From the bar chart, we can see that the name length follows a normal distribution if we exclude lengths of 0. In this context, it is possible for an animal to have no name, so a length of 0 is not considered an anomaly.
```{r echo = FALSE, message = FALSE, warning = FALSE}
name.length <- train.copy %>%
select(NameLength, AnimalType) %>%
group_by(NameLength) %>%
filter(NameLength > 0) %>%
mutate(Frequency = n()) %>%
ungroup() %>%
arrange(NameLength) %>%
distinct(NameLength)
name.length %>%
ggplot(aes(x = NameLength)) +
geom_bar(aes(y = Frequency / sum(Frequency)), stat = "identity") +
scale_x_continuous(breaks = seq(min(name.length$NameLength), max(name.length$NameLength), by = 1)) +
ggtitle("Density of the name length of animals") +
labs(y = "Density", x = "Name length") +
scale_fill_brewer(palette = "Set1")
names.length.density <- data.frame(length_density = name.length$Frequency / sum(name.length$Frequency),
normal_density = dnorm(1:12, mean = 5, sd = 1.35))
print(names.length.density)
plot(names.length.density$length_density,
type = "l",
main = "Name length and Normal density comparison.",
xlab = "Name length",
ylab = "Density")
lines(names.length.density$normal_density, col = 2)
legend("topright", c("Name length Distribution", "Normal Distribution"), col = 1:2, cex = 0.8, fill = 1:2)
```
The density of the name length distribution is approximated by the Normal distribution. Mathematically, we define the random variable $L$ as the name length of an animal such that $L \sim N(5, 1.35^2)$.
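The parameters $\mu = 5$ and $\sigma = 1.35$ were chosen by eye; they can also be estimated directly from the data with a sketch like:

```{r echo = TRUE, eval = FALSE}
## Estimate the mean and standard deviation of the name length (lengths > 0).
lengths <- train.copy$NameLength[train.copy$NameLength > 0]
c(mean = mean(lengths), sd = sd(lengths))
```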
Animals having no name are mostly transferred, and for cats this is very significant. Transferred cats having no name represent `r nrow(train.cats.noname) / train.n * 100`% of the train set and `r nrow(train.cats.noname) / nrow(train.transfers) * 100`% of all the transferred animals of the train set. This is clearly an insight to consider. Therefore, we transform the string values of the `Name` feature into boolean values telling whether the animal has a name (1) or not (0).
```{r echo = TRUE, message = FALSE, warning = FALSE}
train$Name[train$Name != 0] <- 1
test$Name[test$Name != 0] <- 1
```
## Animal's Sterility and Sex
We want to see whether extracting information about the animal's sterility will influence the outcomes. Generally, people who want to adopt a dog or a cat want to know if the animal is sterile or intact. Let's see if this holds in our dataset.
```{r echo = FALSE, message = FALSE, warning = FALSE}
train.copy %>%
select(SexuponOutcome, OutcomeType, AnimalType) %>%
mutate(SexuponOutcome = ifelse(SexuponOutcome == 0, "Unknown", SexuponOutcome)) %>%
group_by(SexuponOutcome) %>%
ggplot(aes(x = SexuponOutcome, fill = OutcomeType)) +
geom_bar(stat = "count") +
facet_wrap(~AnimalType, ncol = 2) +
ggtitle("Sex and sterility by outcome") +
scale_fill_brewer(palette = "Set1") +
labs(y = "Number of animals", x = "Sex and sterility") +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))
```
We can see that sterile animals are mostly adopted, which confirms our hypothesis. Unknown or intact animals are mostly transferred. This makes sense for the unknown ones, since they may need to be transferred to the clinic to clearly identify their sex and whether they are sterile. Note that animals for which we have no information on sterility and sex are not adopted. Thus, knowing whether the animal is sterile influences the outcomes.
From this bar chart, we distinguish 3 groups: Unknown, Intact (not neutered or not spayed), Sterile (neutered or spayed).
We replace the feature `SexuponOutcome` by `Sterility`, where Unknown = 0, Intact = 1 and Sterile = 2. Since there is no significant difference between sexes, we won't create a `Sex` feature.
```{r echo = TRUE, message = FALSE, warning = FALSE}
train$Sterility <- 0
train$Sterility[grep("Intact", train$SexuponOutcome)] <- 1
train$Sterility[grep("Spayed|Neutered", train$SexuponOutcome)] <- 2
test$Sterility <- 0
test$Sterility[grep("Intact", test$SexuponOutcome)] <- 1
test$Sterility[grep("Spayed|Neutered", test$SexuponOutcome)] <- 2
train$SexuponOutcome <- NULL
test$SexuponOutcome <- NULL
```
## Date & Time
The date and time may have a big influence on the outcomes. We have seen that adoption is the most frequent outcome in the dataset. Our hypothesis is that people mostly adopt an animal on the weekend.
```{r echo = FALSE, message = FALSE, warning = FALSE}
train.copy$Weekday <- weekdays(as.Date(train$DateTime))
train.copy$Weekday <- factor(train.copy$Weekday,
levels = c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"))
#train.copy <- train.copy[order(train.copy$Weekday), ]
train.date <- ymd_hms(train$DateTime)
train.copy %>%
select(Weekday, OutcomeType, AnimalType) %>%
group_by(Weekday) %>%
ggplot(aes(x = Weekday, fill = OutcomeType)) +
geom_bar(stat = "count") +
ggtitle("Weekdays by outcome") +
facet_wrap(~AnimalType, ncol = 2) +
scale_fill_brewer(palette = "Set1") +
labs(y = "Number of animals", x = "Weekdays") +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))
```
From the bar chart, we can see that animals are mostly adopted on the weekend (Saturday and Sunday), which confirms our hypothesis. This makes sense since most people work from Monday to Friday. Looking at the [Austin Animal Center](http://www.austintexas.gov/department/animal-services) website, animal adoption hours are 11am - 7pm every day. From this information, we suppose that extracting the hour from the `DateTime` feature could influence the outcomes.
```{r echo = FALSE, message = FALSE, warning = FALSE}
train.copy$Hour <- hour(train.date)
train.copy %>%
select(Hour, OutcomeType, AnimalType) %>%
group_by(Hour) %>%
ggplot(aes(x = Hour, fill = OutcomeType)) +
geom_bar(stat = "count") +
ggtitle("Hours by outcome") +
facet_wrap(~AnimalType, ncol = 2) +
labs(y = "Number of animals", x = "Hours") +
scale_fill_brewer(palette = "Set1")
```
From the bar chart, we see that adoptions and returns to owner occur mostly at 5pm and 6pm. Generally, people finish working around 4:30pm - 5pm, so this makes sense. There are no outcomes between 1am and 4am inclusively. There are transfers at midnight, and at 9am mostly for cats, but most are done between 11am and 7pm, which corresponds to the opening hours of the center. For adoptions and transfers done outside opening hours, we consider the following hypotheses:
1. They were done under exceptional circumstances.
2. Errors were made when entering the hours in the system.
```{r echo = FALSE, message = FALSE, warning = FALSE}
train.copy$Minute <- minute(train.date)
train.copy %>%
select(Minute, OutcomeType, AnimalType) %>%
group_by(Minute) %>%
ggplot(aes(x = Minute, fill = OutcomeType)) +
geom_bar(stat = "count") +
ggtitle("Minutes by outcome") +
labs(y = "Number of animals", x = "Minutes") +
scale_fill_brewer(palette = "Set1")
```
Looking at minutes seems useless since 1 minute is very short and most of us look only at the hour, unless the software also records the minutes and even the seconds. But from this bar chart, we can see that at minute 0 there are a lot of transfers compared to the other minutes. From minute 1 to 59, the differences are negligible since the counts approximately follow a uniform distribution, as expected. The spike at minute 0 could be explained by an automatic setting: if the user does not enter the minutes when recording a new outcome, the software converts the empty minutes (and seconds) to "00". Since this is recurrent for the transfers at 9:00am, it could be the same employee recording them. The same applies if the user enters only the date: the time may be set automatically to 00:00:00. Therefore, it is possible that some of the hour and minute entries are not correct.
Looking at the days, we should see more adoptions on the weekend since the majority of people are not working then.
```{r echo = FALSE, message = FALSE, warning = FALSE}
train.copy$Day <- day(train.date)
train.copy %>%
select(Day, OutcomeType, AnimalType) %>%
group_by(Day) %>%
ggplot(aes(x = Day, fill = OutcomeType)) +
geom_bar(stat = "count") +
facet_wrap(~AnimalType, ncol = 2) +
ggtitle("Days in months by outcome") +
labs(y = "Number of animals", x = "Days in months") +
scale_x_continuous(breaks = seq(min(train.copy$Day), max(train.copy$Day), by = 5)) +
scale_fill_brewer(palette = "Set1")
```
There are 4 days (12, 13, 18, 19) where the number of adopted cats is higher than on the other days. Also, note that there are fewer animals on day 31. Indeed, since only 7 months out of 12 have 31 days, this result makes sense.
We look now at the months to check if some of them can reveal important insights.
```{r echo = FALSE, message = FALSE, warning = FALSE}
## Ordering by month names for ggplot.
train.copy$Month <- months(train.date)
train.copy$Month <- factor(train.copy$Month,
levels = format(ISOdate(2004, 1:12, 1),"%B"))
#train.copy <- train.copy[order(train.copy$Month), ]
train.copy %>%
select(Month, OutcomeType, AnimalType) %>%
group_by(Month) %>%
ggplot(aes(x = Month, fill = OutcomeType)) +
geom_bar(stat = "count") +
facet_wrap(~AnimalType, ncol = 2) +
ggtitle("Months by outcome") +
labs(y = "Number of animals", x = "Months") +
scale_fill_brewer(palette = "Set1") +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))
```
From the bar chart, we can see that December and January are the months with the highest number of dog adoptions. For cats, July and December have the highest number of adoptions. This can be explained by holidays like Christmas and New Year's Day. One possible hypothesis is that children want a dog or a cat for Christmas, so they ask their parents to adopt an animal as a Christmas gift.
```{r echo = FALSE, message = FALSE, warning = FALSE}
train.copy$Year <- year(train.date)
train.copy %>%
select(Year, OutcomeType, AnimalType) %>%
group_by(Year) %>%
ggplot(aes(x = Year, fill = OutcomeType)) +
geom_bar(stat = "count") +
facet_wrap(~AnimalType, ncol = 2) +
ggtitle("Years by outcome") +
labs(y = "Number of animals", x = "Years") +
scale_fill_brewer(palette = "Set1")
```
Years 2013 and 2016 have fewer outcomes than 2014 and 2015 since the dataset starts on `r min(train.date)` and ends on `r max(train.date)`.
We did not mention the seconds since there is only `r nrow(filter(train, second(train.date) > 0))` animal whose seconds differ from 0.
From the `DateTime` feature of the dataset, we extract several useful features, all transformed to non-negative integers. The extracted features are the following:
* Day: integer from 1 to 31, depending on the month
* Month: integer from 1 to 12
* Year: integer from 2013 to 2016
* Weekday: integer from 0 to 6, representing Sunday to Saturday
* DateInDays: number of days since the oldest date of the dataset
* Hour: integer from 0 to 23, where 0 is midnight
* Time: the time in minutes since midnight, given by $60h + m$ where $h$ is the hour and $m$ the minutes (e.g., 2:35pm gives $60 \times 14 + 35 = 875$)
```{r echo = TRUE, message = FALSE, warning = FALSE}
train$Weekday <- as.POSIXlt(train$DateTime)$wday
test$Weekday <- as.POSIXlt(test$DateTime)$wday
train.date <- ymd_hms(train$DateTime)
test.date <- ymd_hms(test$DateTime)
## The year starts at 0 which is the minimum year of the dataset.
train$Year <- year(train.date)
train$Month <- month(train.date)
train$Day <- day(train.date)
train$Hour <- hour(train.date)
train$DateInDays <- difftime(as.Date(train.date,'%Y/%m/%d'),
as.Date(min(train.date),'%Y/%m/%d'),
units = c("days"))
train$Time <- hour(train.date) * 60 + minute(train.date)
test$Year <- year(test.date)
test$Month <- month(test.date)
test$Day <- day(test.date)
test$Hour <- hour(test.date)
test$DateInDays <- difftime(as.Date(test.date,'%Y/%m/%d'),
as.Date(min(test.date),'%Y/%m/%d'),
units = c("days"))
test$Time <- hour(test.date) * 60 + minute(test.date)
train$DateTime <- NULL
test$DateTime <- NULL
```
## Animal's Breed
Since we have many breeds, we create a breed class named 'Others' that groups all breeds with fewer than 100 animals. The breed should be important since, for example, people may not want to adopt an animal perceived as aggressive.
```{r echo = FALSE, message = FALSE, warning = FALSE}
## SELECT Breed1, OutcomeType, COUNT(Breed1) AS Breed1Count
## FROM Animal
## GROUP BY Breed1
## HAVING Breed1Count >= 100
breed.nomix <- gsub(" Mix", "", train.copy$Breed)
train.copy$Breed1 <- strsplit(x = breed.nomix, split = "/") %>% sapply(function(x){x[1]})
breeds.common <- train.copy %>%
select(Breed1) %>%
group_by(Breed1) %>%
filter(n() >= 100)
train.copy$Breed1[!(train.copy$Breed1 %in% breeds.common$Breed1)] <- "Others"
## Note that we cannot use 'facet_wrap' because cats don't have the same breeds as the dogs.
train.copy %>%
filter(AnimalType == "Dog") %>%
ggplot(aes(x = Breed1, fill = OutcomeType)) +
geom_bar(stat = "count") +
ggtitle("Dog breeds by outcome") +
scale_fill_brewer(palette = "Set1") +
labs(y = "Number of animals", x = "Breeds") +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))
train.copy %>%
filter(AnimalType == "Cat") %>%
ggplot(aes(x = Breed1, fill = OutcomeType)) +
geom_bar(stat = "count") +
ggtitle("Cat breeds by outcome") +
scale_fill_brewer(palette = "Set1") +
labs(y = "Number of animals", x = "Breeds") +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))
```
For the dogs, Shih Tzus are mostly transferred and rarely adopted. Pit Bulls, Labrador Retrievers and Chihuahua Shorthairs are the most common dog breeds among the outcomes. For cats, the Domestic Shorthair is the most frequent breed and represents `r nrow(filter(train.copy, grepl("Domestic Shorthair", Breed) & AnimalType == "Cat")) / nrow(filter(train.copy, AnimalType == "Cat")) * 100` % of the cats in the train set.
Let's summarize the number of adopted animals and their percentage grouped by breed. This way, we will see which breeds have the lowest adoption ratio.
```{r echo = FALSE, message = FALSE, warning = FALSE}
train.adopted.list <- train.copy %>%
group_by(Breed1, OutcomeType) %>%
summarise(Total = n()) %>%
mutate(Percentage = Total / sum(Total)) %>%
filter(OutcomeType == "Adoption")
train.adopted.list <- train.adopted.list[with(train.adopted.list, order(Percentage)), ]
print(train.adopted.list)
train.adopted.list %>%
ggplot(aes(x = Breed1, y = Percentage)) +
geom_bar(stat = "identity") +
ggtitle("Percentage of adopted animal by breed") +
scale_y_continuous(labels = percent) +
labs(y = "Percentage of animals", x = "Breeds") +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))
```
From the bar chart and summary, we see that `r paste0(head(train.adopted.list$Breed1, 3), collapse = ", ")` are the breeds with the lowest percentage of adoptions.
We also check whether the breed types (Mix, Cross or Pure) have a significant impact on the outcomes.
```{r echo = FALSE, message = FALSE, warning = FALSE}
train.breed.mix.list <- grep(" Mix", train$Breed)
train.breed.cross.list <- grep("/", train$Breed)
train.copy$BreedType <- "Pure"
train.copy$BreedType[train.breed.cross.list] <- "Cross"
train.copy$BreedType[train.breed.mix.list] <- "Mix"
train.copy %>%
ggplot(aes(x = BreedType, fill = OutcomeType)) +
geom_bar(stat = "count") +
ggtitle("Animal breed types by outcome") +
facet_wrap(~AnimalType, ncol = 2) +
scale_fill_brewer(palette = "Set1") +
labs(y = "Number of animals", x = "Breed Types") +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))
```
We extract this information from the `Breed` feature and add a new feature `BreedType`, where purebred animals are encoded as 0, crossed breeds as 1 and mixed breeds as 2.
```{r echo = TRUE, message = FALSE, warning = FALSE}
train$BreedType <- 0
train$BreedType[train.breed.cross.list] <- 1
train$BreedType[train.breed.mix.list] <- 2
test.breed.mix.list <- grep(" Mix", test$Breed)
test.breed.cross.list <- grep("/", test$Breed)
test$BreedType <- 0
test$BreedType[test.breed.cross.list] <- 1
test$BreedType[test.breed.mix.list] <- 2
```
Special cases for Pit Bull, Shih Tzu and Pug are considered since they are adopted less often than the others.
```{r echo = TRUE, message = FALSE, warning = FALSE}
breeds <- c("Pit Bull", "Shih Tzu", "Pug")
train$CommonBreeds <- GetIntegerFeatureFromGroups(train$Breed, breeds)
test$CommonBreeds <- GetIntegerFeatureFromGroups(test$Breed, breeds)
train$Breed <- NULL
test$Breed <- NULL
```
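The helper `GetIntegerFeatureFromGroups` is defined earlier in this report; as a reminder, here is a minimal sketch of the behavior assumed at this step (each value maps to the 1-based index of the first group it matches, 0 otherwise; the sketch function name is hypothetical):

```{r echo = TRUE, message = FALSE, warning = FALSE}
## Hypothetical sketch of the assumed behavior of GetIntegerFeatureFromGroups.
GetIntegerFeatureFromGroupsSketch <- function(values, groups) {
  result <- integer(length(values))
  for (i in seq_along(groups)) {
    ## Mark values matching this group, unless already assigned to one.
    matched <- grepl(groups[i], values) & result == 0
    result[matched] <- i
  }
  result
}
GetIntegerFeatureFromGroupsSketch(c("Pit Bull Mix", "Beagle", "Shih Tzu"),
                                  c("Pit Bull", "Shih Tzu", "Pug"))
## Under this assumption: 1 0 2
```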
## Animal's Color
For each outcome type, we want to know which animal colors have the greatest and lowest counts. Colors with fewer than 100 animals are categorized as `Others`, meaning that these colors are not common.
```{r echo = FALSE, message = FALSE, warning = FALSE}
## SELECT Color1, OutcomeType, COUNT(Color1) AS Color1Count
## FROM Animal
## GROUP BY Color1
## HAVING Color1Count >= 100
train.copy$Color1 <- strsplit(x = train.copy$Color, split = "/") %>% sapply(function(x){x[1]})
colors.common <- train.copy %>%
select(Color1) %>%
group_by(Color1) %>%
filter(n() >= 100)
train.copy$Color1[!(train.copy$Color1 %in% colors.common$Color1)] <- "Others"
train.copy %>%
filter(AnimalType == "Dog") %>%
ggplot(aes(x = Color1, fill = OutcomeType)) +
geom_bar(stat = "count") +
ggtitle("Dog colors by outcome") +
scale_fill_brewer(palette = "Set1") +
labs(y = "Number of animals", x = "Colors") +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))
train.copy %>%
filter(AnimalType == "Cat") %>%
ggplot(aes(x = Color1, fill = OutcomeType)) +
geom_bar(stat = "count") +
ggtitle("Cat colors by outcome") +
scale_fill_brewer(palette = "Set1") +
labs(y = "Number of animals", x = "Colors") +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))
```
For the cats, we can see that tan cats are always adopted and tricolor cats are always transferred. Another notable insight is that no brown cat died, was euthanized or was returned to its owner. However, the number of cats of those colors is small. According to this bar chart, black is the most common color for both cats and dogs.
Some colors have a qualifier, but these should not have a big influence on the outcomes. We take the feature `Color1` to determine the qualifiers and see what happens.
```{r echo = FALSE, message = FALSE, warning = FALSE}
train.copy$ColorQualifier1 <- strsplit(x = train.copy$Color1, split = "\\s") %>%
sapply(function(x){ifelse(length(x) > 1, x[2], "None")})
color.qualifier <- train.copy %>%
select(ColorQualifier1, OutcomeType, AnimalType) %>%
group_by(ColorQualifier1)
color.qualifier %>%
filter(AnimalType == "Dog") %>%
ggplot(aes(x = ColorQualifier1, fill = OutcomeType)) +
geom_bar(stat = "count") +
ggtitle("Dog color qualifiers by outcome") +
scale_fill_brewer(palette = "Set1") +
labs(y = "Number of animals", x = "Color Qualifiers") +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))
color.qualifier %>%
filter(AnimalType == "Cat") %>%
ggplot(aes(x = ColorQualifier1, fill = OutcomeType)) +
geom_bar(stat = "count") +
ggtitle("Cat color qualifiers by outcome") +
scale_fill_brewer(palette = "Set1") +
labs(y = "Number of animals", x = "Color Qualifiers") +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))
```
The bar chart above shows that qualifiers have no significant influence on the outcomes for dogs. For cats, tabby is frequent.
From the color feature, we can extract the number of colors. If we find a slash (`/`), which we treat as a separator, then the animal has 2 colors. If we find `Tricolor`, then the animal has 3 colors.
```{r echo = FALSE, message = FALSE, warning = FALSE}
train.copy$NumberOfColors <- str_count(train.copy$Color, "/") + 1
train.copy$NumberOfColors[grep("Tricolor", train.copy$Color)] <- 3
train.copy %>%
ggplot(aes(x = NumberOfColors, fill = OutcomeType)) +
geom_bar(stat = "count") +
ggtitle("Number of colors for an animal by outcome") +
facet_wrap(~AnimalType, ncol = 2) +
scale_fill_brewer(palette = "Set1") +
labs(y = "Number of animals", x = "Number of colors")
train$NumberOfColors <- train.copy$NumberOfColors
test$NumberOfColors <- str_count(test$Color, "/") + 1
test$NumberOfColors[grep("Tricolor", test$Color)] <- 3
train$Color <- NULL
test$Color <- NULL
```
<!----------------------------------------------------------------------TRAINING MODEL------------------------------------------------------------------------->
# Modeling and Interpretation
In this section, we present which algorithm is used, why we use it and how it is used. The objective is to build the final submission file containing the probabilities of each outcome type for each animal in the test set.
Extreme Gradient Boosting trees are a good choice to model our data. Each tree is a decision tree whose leaves group animals with approximately the same outcome probabilities. For example, the first split could be whether the animal is sterile, intact or unknown, and the second split whether the animal has a name.
## Fine Tuning Parameters
We prepare the parameters and matrices for the cross-validation and final prediction. Since we need the probabilities for each class, the objective used is `multi:softprob`. This objective requires the number of classes `num_class` to predict, which is 5 in our case. The evaluation metric is the multi-class log-loss function, named `mlogloss` in `XGBoost`.
We use the `subsample` parameter, which is the subsample ratio of the training instances. The `eta` parameter controls the learning rate: each new tree's contribution is shrunk by this factor.
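For reference, with $N$ animals and $K$ outcome classes, the `mlogloss` metric minimized here is

$$-\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{K} y_{ij}\log(p_{ij})$$

where $y_{ij}$ is 1 if animal $i$ has outcome $j$ (0 otherwise) and $p_{ij}$ is the predicted probability of that outcome, clipped away from 0 so the logarithm stays finite.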
```{r echo = TRUE, message = FALSE, warning = FALSE}
outcomes <- unique(train$OutcomeType)
outcome.type <- CategoryToInteger(train$OutcomeType) - 1
train$OutcomeSubtype <- NULL
train$OutcomeType <- NULL
outcome.class.num <- length(outcomes)
param <- list(objective = "multi:softprob",
num_class = outcome.class.num,
eta = 0.12, # Control the learning rate
subsample = 0.8,
max_depth = 8,
colsample_bytree = 0.9,
eval_metric = "mlogloss")
```
## Cross-Validation
We proceed to a 10-fold cross-validation to get the optimal number of trees and the multi-class log-loss score. We use random subsamples representing 80% of the training set. Since we do not have too many observations (animals) or features, we can use 10 folds instead of 5. Thus, the training set is split into 10 test samples, each with `r train.n / 10` observations.
For each number of trees, we average the 10 error estimates to obtain a more robust estimate of the true prediction error. This is done for all trees and gives the optimal number of trees to use on the test set.
We also display 2 curves showing the progression of the test and train multi-class log-loss means. The vertical dotted line marks the optimal number of trees. This plot shows whether the model overfits or underfits.
```{r echo = FALSE, message = FALSE, warning = FALSE}
cv.nfolds <- 10
cv.nrounds <- 230
train.matrix <- xgb.DMatrix(data.matrix(train), label = outcome.type)
model.cv <- xgb.cv(data = train.matrix,
nfold = cv.nfolds,
param = param,
nrounds = cv.nrounds,
verbose = 0)
model.cv$names <- as.integer(rownames(model.cv))
best <- model.cv[model.cv$test.mlogloss.mean == min(model.cv$test.mlogloss.mean), ]
cv.plot.title <- paste("Training log-loss using", cv.nfolds, "folds CV")
print(ggplot(model.cv, aes(x = names)) +
geom_line(aes(y = test.mlogloss.mean, colour = "test")) +
geom_line(aes(y = train.mlogloss.mean, colour = "train")) +
geom_vline(xintercept = best$names, linetype="dotted") +
ggtitle(cv.plot.title) +
xlab("Number of trees") +
ylab("log-loss"))
print(model.cv)
cat("\nOptimal testing set Log-Loss score:", best$test.mlogloss.mean)
cat("\nAssociated training set Log-Loss score:", best$train.mlogloss.mean)
cat("\nInterval testing set Log-Loss score: [", best$test.mlogloss.mean - best$test.mlogloss.std, ", ", best$test.mlogloss.mean + best$test.mlogloss.std, "].")
cat("\nDifference between optimal training and testing sets Log-Loss:", abs(best$train.mlogloss.mean - best$test.mlogloss.mean))
cat("\nOptimal number of trees:", best$names)
```
## Prediction
We proceed to the predictions of the test set and show the features importance.
```{r echo = FALSE, message = FALSE, warning = FALSE}
nrounds <- as.integer(best$names)
model = xgboost(param = param,
train.matrix,
nrounds = nrounds,
verbose = 0,
probability = TRUE)
test.matrix <- xgb.DMatrix(data.matrix(test))
prediction.test <- predict(model, test.matrix)
prediction.train <- predict(model, train.matrix)
#Check which features are the most important.
names <- dimnames(train)[[2]]
importance.matrix <- xgb.importance(names, model = model)
print(importance.matrix)
# Display the features importance.
print(xgb.plot.importance(importance.matrix))
```
Let's try the Random Forest algorithm.
```{r echo = FALSE, message = FALSE, warning = FALSE}
rf.model <- randomForest(x = train,
y = as.factor(outcome.type),
ntree = 100,
importance = TRUE,
do.trace = 5)
colnames(rf.model$err.rate) <- c("OOB", outcomes)
plot(rf.model, ylim = c(0, 1))
legend("topright", colnames(rf.model$err.rate), col = 1:6, cex = 0.8, fill = 1:6)
print(rf.model)
varImpPlot(rf.model)
par(mfrow = c(2, 2))
for (i in 1:outcome.class.num)
{
  ## Rotate the x-axis labels by 90 degrees and reduce their font size.
  barplot(sort(rf.model$importance[, i], decreasing = TRUE),
          main = outcomes[i],
          xlab = "",
          ylab = "Gain",
          las = 2,
          cex.names = 0.7)
}
rf.prediction.test <- predict(rf.model, test, type = "prob")
rf.prediction.train <- predict(rf.model, train, type = "prob")
```
Note that normally we would use a threshold to assign each animal to a single outcome. Since this Kaggle competition asks for probabilities, this is not required.
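As an illustration, a minimal sketch of such a thresholding rule (the function `AssignOutcome` and its cutoff are hypothetical, not part of the analysis above):

```{r echo = TRUE, message = FALSE, warning = FALSE}
## Hypothetical sketch: assign an outcome only when the top predicted
## probability exceeds a chosen cutoff, otherwise leave it undecided (NA).
AssignOutcome <- function(probs, threshold = 0.5) {
  if (max(probs) >= threshold) names(probs)[which.max(probs)] else NA
}
AssignOutcome(c(Adoption = 0.72, Transfer = 0.28))               # "Adoption"
AssignOutcome(c(Adoption = 0.40, Transfer = 0.35, Died = 0.25))  # NA
```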
## Multi-class Log-Loss
We can verify how our predictions score under the multi-class log-loss function. We apply our model's predictions to the train set and compare them to the real `OutcomeType` values of the train set.
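For clarity, here is a minimal sketch of the multi-class log-loss computation on a toy example (a simplified stand-in for the `MultiLogLoss` helper; the clipping constant is an assumption):

```{r echo = TRUE, message = FALSE, warning = FALSE}
## Sketch of the multi-class log loss: p holds predicted probabilities,
## y the one-hot encoded true outcomes. Probabilities are clipped to
## avoid log(0).
MultiLogLossSketch <- function(p, y, eps = 1e-15) {
  p <- pmin(pmax(p, eps), 1 - eps)
  -sum(y * log(p)) / nrow(p)
}
p <- matrix(c(0.7, 0.2, 0.1,
              0.1, 0.8, 0.1), nrow = 2, byrow = TRUE)
y <- matrix(c(1, 0, 0,
              0, 1, 0), nrow = 2, byrow = TRUE)
MultiLogLossSketch(p, y)  # -(log(0.7) + log(0.8)) / 2, about 0.29
```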
```{r echo = FALSE, message = FALSE, warning = FALSE}
prediction.train.matrix <- matrix(prediction.train, ncol = outcome.class.num, byrow = TRUE)
rows <- nrow(train)
outcomes.train <- matrix(0, nrow = rows, ncol = outcome.class.num)
for(i in 1:rows)
{
outcomes.train[i ,outcome.type[i] + 1] <- 1
}
multiclass.logloss <- MultiLogLoss(prediction.train.matrix, outcomes.train)
print(multiclass.logloss)
## Confusion matrix
predicted.train <- data.frame(prediction.train.matrix)
colnames(predicted.train) <- outcomes
predicted <- predicted.train / rowSums(predicted.train)
predicted$Class <- apply(predicted.train, 1, function(x) names(predicted.train)[which.max(x)])
confusionMatrix(as.factor(predicted$Class), as.factor(train.copy$OutcomeType))
```
## Submission
We write the `ID` and the predicted outcome classes in the submission file.
```{r echo = TRUE, message = FALSE, warning = FALSE}
prediction.test <- data.frame(matrix(prediction.test,
ncol = outcome.class.num,
byrow = TRUE))[, order(outcomes)]
colnames(prediction.test) <- sort(outcomes)
submission <- cbind(data.frame(ID = test.id), prediction.test)
write.csv(submission, "Submission.csv", row.names = FALSE)
```
<!---------------------------------------------------------------RECOMMENDATIONS------------------------------------------------------------------------------->
# Recommendations
The objective is to understand trends in animal outcomes. These insights could help shelters focus their energy on specific animals that need a little extra help finding a new home. Therefore, we need insights that help increase the number of adopted animals.
An important insight found in the dataset is whether the animal is sterile, not sterile or unknown.
```{r echo = FALSE, message = FALSE, warning = FALSE}
train.copy %>%
select(SexuponOutcome, OutcomeType, AnimalType) %>%
group_by(SexuponOutcome) %>%
ggplot(aes(x = SexuponOutcome, fill = OutcomeType)) +
geom_bar(stat = "count") +
facet_wrap(~AnimalType, ncol = 2) +
ggtitle("Sex and sterility by outcome") +
scale_fill_brewer(palette = "Set1") +
labs(y = "Number of animals", x = "Sex and sterility") +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))
```
The bar chart shows that animals with no information about their sterility are never adopted. They represent `r nrow(filter(train.copy, SexuponOutcome == 0)) / train.n * 100` % of the dataset for which we already know the outcomes. Our predictions agree: the predicted adoption probabilities are very low, as the following summary shows.
```{r echo = FALSE, message = FALSE, warning = FALSE}
indices <- which(test$Sterility == 0)
summary(prediction.test$Adoption[indices])
intacts <- nrow(filter(train.copy, grepl("Intact", train.copy$SexuponOutcome)))
intacts.adoption <- nrow(filter(train.copy, grepl("Intact", train.copy$SexuponOutcome), OutcomeType == "Adoption")) / intacts * 100
steriles <- nrow(filter(train.copy, grepl("Spayed|Neutered", train.copy$SexuponOutcome)))