---
title: "Shelter Animal Outcomes"
author: "Gabriel Lapointe"
date: "May 13, 2016"
output:
  pdf_document:
    toc: yes
  html_document:
    highlight: pygments
    number_sections: yes
    toc: yes
variant: markdown_github
---
# Data Acquisition
Every year, approximately 7.6 million companion animals end up in US shelters. Many animals are given up as unwanted by their owners, while others are picked up after getting lost or taken out of cruelty situations. Many of these animals find forever families to take them home, but just as many are not so lucky. Approximately 2.7 million dogs and cats are euthanized in the US every year.
## Objective
The objective is to help improve outcomes for shelter animals. Using a dataset of intake information including breed, color, sex, and age from the Austin Animal Center, we have to predict the outcome for each animal.
To support our analysis, we are given the train and test sets in CSV files. From this, we have written a code book to help understand each feature and its values.
## Data Source
The data comes from the [Austin Animal Center](http://www.austintexas.gov/department/animal-services) and covers October 1st, 2013 to March 2016. Outcomes represent the status of animals as they leave the Animal Center. All animals receive a unique Animal ID during intake. Source of the competition: [Shelter Animal Outcomes](https://www.kaggle.com/c/shelter-animal-outcomes).
## Dataset Questions
Before we start the exploration of the dataset, we need to write a list of questions about this dataset considering the problem we have to solve.
* How big is the dataset? (Described in the Code Book)
* Does the dataset contain `NA` or missing values? Can we replace them with a value? Why?
* Is the data coherent (dates in the same format, no out-of-bound values, no misspelled words, etc.)?
* What does the data look like and what are the relationships between features if they exist?
* What are the measures used?
* Can we solve the problem with this dataset?
Questions on features:
* Are the features in the dataset sufficient to explain each outcome?
* What is the proportion of animals, and how many cats and dogs are there for each outcome?
* Are there features that can be split into several other features? If yes, do they improve the score? Why and how?
* For each outcome, what features have the most importance and why?
## Evaluation Metrics
Since we have 5 outcome types and need to calculate a predicted probability for each class (outcome type), we have to manage 5 classes using a multi-class algorithm. The `Log Loss` function quantifies the accuracy of a classifier by penalising false classifications. Minimising the `Log Loss` function is essentially equivalent to maximising the accuracy of the classifier, which is what we need.
Thus, Kaggle provides us the evaluation metric we need:
$$LogLoss = -\frac{1}{N} \sum_{i = 1}^{N} \sum_{j = 1}^{M} y_{i,j} \log{\mathbb{P}_{i,j}}$$
where
* $N$ is the total number of animals,
* $M$ is the number of outcomes,
* $y_{i,j}$ is 1 if observation $i$ is in outcome $j$ and 0 otherwise,
* $\mathbb{P}_{i,j}$ is the predicted probability that observation $i$ belongs to outcome $j$.
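As a sanity check, this metric can be computed locally. A minimal sketch (the function name and the clipping constant are our own; the columns of `probs` are assumed to be in the same order as the outcome levels):

```{r echo = TRUE, eval = FALSE}
## Multi-class log loss: probs is an N x M matrix of predicted probabilities,
## actual is a factor of length N whose levels match the columns of probs.
MultiLogLoss <- function(probs, actual) {
  ## Clip probabilities away from 0 and 1 to avoid log(0).
  probs <- pmin(pmax(probs, 1e-15), 1 - 1e-15)
  ## For each observation, pick the probability assigned to its true outcome.
  picked <- probs[cbind(seq_along(actual), as.integer(actual))]
  -mean(log(picked))
}
```

Clipping the probabilities mirrors what Kaggle does on submission, so predictions of exactly 0 or 1 do not produce an infinite loss.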
## Methodology
To evaluate a model, we first train it on the training set. Once a model is chosen, we test it on the test set. We will also use cross-validation since we are limited by the amount of data given.
Since the output to predict is known and well defined, we will use a supervised algorithm to solve the problem. This is a multi-class problem where we have to give a probability for each class.
In this document, we start by exploring the dataset and building the data story behind it. This will give us important insights which will answer our questions about this dataset. The next step is feature engineering, which consists of creating, removing, or replacing features based on the insights gained while exploring the dataset. We will ensure our new dataset is a valid input for our prediction model. After applying our model to the test set, we will visualize the calculated predictions and explain the results.
<!------------------------------------------------------------EXPLORATORY ANALYSIS------------------------------------------------------------------------------>
# Data Preparation and Exploratory Analysis
In this section, we explore the dataset and test hypotheses on features. The objective is to visualize and understand the relationships between the features of the dataset we have to solve the problem with. We will also evaluate the changes we make to this dataset to validate whether or not they have a significant influence on the outcomes.
We will also add and remove features from the dataset depending on the importance of the insights we find through data visualization. We will transform each feature into categories represented by a positive integer. The objective is to clean and prepare the dataset for our prediction model and to answer the questions: `Are there features that can be split into several other features? If yes, do they improve the score? Why and how?` These questions will be answered throughout this section and in the data story presented at the conclusion of this document.
We will look at each feature of the dataset and split it if it carries more information than the feature's name suggests. For example, the feature `SexuponOutcome` tells us more than just the sex of the animal, given its possible values: we can extract the sex and whether the animal is sterile or not.
Each feature having a limited number of unique values (categories) will be numbered with a positive integer starting at 0. The zero-based ID will not hold for new features describing a count of elements (e.g. month or day number); those will be one-based instead.
We first load the test and train datasets and set the seed. Let's answer the question: `Does the dataset contain NA or missing values? Can we replace them with a value? Why?` All blank, space, and unknown values are replaced by the value `0`. After writing the code book, we have seen that the unique values in all features except the ID can be interpreted as categories which can be grouped by another feature. In this way, the unknown, space, or blank values can be treated as a category ID as well. Thus, we decide to set their value to 0.
```{r echo = TRUE, message = FALSE, warning = FALSE}
source("Utilities.R")
strings.toNA <- c("", " ", "Unknown", "0 years")
train <- read.csv("train.csv",
header = TRUE,
na.strings = strings.toNA,
stringsAsFactors = FALSE)
test <- read.csv("test.csv",
header = TRUE,
na.strings = strings.toNA,
stringsAsFactors = FALSE)
set.seed(1234)
## Remove scientific notation (e.g. E-005).
options(scipen = 999)
## Remove the AnimalID column, which is not needed for the analysis.
train$AnimalID <- NULL
test.id <- test$ID
test$ID <- NULL
train[is.na(train)] <- 0
test[is.na(test)] <- 0
## Samples and population sizes.
train.n <- nrow(train)
test.n <- nrow(test)
N <- train.n + test.n
```
Since the number of possible values for each feature is generally greater than 4 or 5, we will use bar charts to visualize our analysis of the dataset.
We also know that the population is $N = `r N`$ animals. The samples are $n_{test} = `r test.n`$ animals which is `r test.n / N * 100` % of the population and $n_{train} = `r train.n`$ animals the other `r train.n / N * 100`% of the population.
## Data Coherence
We answer the question: `Is the data coherent (dates in the same format, no out-of-bound values, no misspelled words, etc.)?` Most features are coherent, but we found some incoherence in the `AgeuponOutcome` feature. One incoherent value is `0 years`, which is impossible because an age cannot be 0. There is also a grammatical error in the value `1 weeks`, which should not take an 's'. Since the correctly written value `1 week` also exists, we need to find (with `grep`) values containing the string "week", which will automatically match both forms.
## Outcomes Visualization
The objective is to answer the question: `What is the proportion of animals, and how many cats and dogs are there for each outcome?` Let's visualize how the outcomes are split in the train set with the following bar chart.
```{r echo = FALSE, message = FALSE, warning = FALSE}
ggplot(train, aes(x = OutcomeType)) +
geom_bar(aes(y = ..count.. / sum(..count..))) +
facet_wrap(~AnimalType, ncol = 2) +
scale_y_continuous(labels = percent) +
geom_text(aes(y = ..count.. / sum(..count..) ,
label = scales::percent(..count.. / sum(..count..))),
stat = "count",
vjust = -0.25) +
ggtitle("Percentage of cats and dogs by outcome") +
labs(y = "Percentage of animals", x = "Outcome Types") +
theme(axis.text.x = element_text(angle = 90))
```
We can see from this bar chart that dogs are returned to their owner almost 8 times more often than cats (16% for dogs against 1.9% for cats). Cats are mostly transferred and dogs are mostly returned to their owner, with a significant difference. The percentages for the other outcomes are roughly the same.
We replace the possible values of the `AnimalType` feature by 0 = Dog and 1 = Cat.
```{r echo = TRUE, message = FALSE, warning = FALSE}
train.copy <- train
animal.types <- unique(c(train$AnimalType, test$AnimalType))
train$AnimalType <- GetIntegerFeatureFromGroups(train$AnimalType, animal.types) - 1
test$AnimalType <- GetIntegerFeatureFromGroups(test$AnimalType, animal.types) - 1
```
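The helper `GetIntegerFeatureFromGroups` comes from `Utilities.R`, which is not shown in this document. Judging from its usage, it presumably maps each value to its 1-based position in the vector of groups, along the lines of this sketch (values absent from `groups` would need extra handling):

```{r echo = TRUE, eval = FALSE}
## Hypothetical sketch: return the 1-based index of each value in `groups`.
GetIntegerFeatureFromGroups <- function(values, groups) {
  match(values, groups)
}
```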
## Animal's Age
The age of animals should be an important factor in outcomes. People want to adopt a young animal (between 1 month and 1 year old) because they can raise it more easily and keep it longer. Older animals (over 1 year old) may have more difficulty adjusting to a new owner. Baby animals may be subject to a transfer for experiments or for clinical reasons. Adopting a baby animal (less than a month old) requires more attention from the owner; thus, adoptions should not be frequent for them. We show a bar chart of the ages of the animals.
```{r echo = FALSE, message = FALSE, warning = FALSE}
train.copy$AgeuponOutcome[grep("1 week", train.copy$AgeuponOutcome)] <- "1 week"
ageuponoutcome <- train.copy %>%
select(AgeuponOutcome, AnimalType, OutcomeType) %>%
group_by(AgeuponOutcome)
ageuponoutcome %>%
ggplot(aes(x = AgeuponOutcome, fill = OutcomeType)) +
geom_bar(stat = "count") +
facet_wrap(~AnimalType, nrow = 2) +
ggtitle("Age of animals by outcome ordered by number of animals") +
labs(y = "Number of animals", x = "Age of animals") +
scale_fill_brewer(palette = "Set1") +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))
```
We can see that between 1 day and 4 weeks old, animals are mostly transferred, including unknowns identified by 0. Animals are mostly adopted between 1 month and 1 year old. Between 1 and 10 years old, animal adoptions decrease as age increases. Finally, animals are euthanized or returned to their owner especially between 11 and 20 years old.
We transform the feature `AgeuponOutcome` into integer values counted in days. For example, `2 years` is replaced by the value 2 * 365 = 730. For weeks, the formula is `age * 7`. For months, the formula is `age * 30`. For years, the formula is `age * 365`.
```{r echo = TRUE, message = FALSE, warning = FALSE}
## Get the list of integers extracted from the feature AgeuponOutcome.
train.age <- as.integer(regmatches(train$AgeuponOutcome,
regexpr("[[:digit:]]+", train$AgeuponOutcome)))
test.age <- as.integer(regmatches(test$AgeuponOutcome,
regexpr("[[:digit:]]+", test$AgeuponOutcome)))
## Get row index list where AgeuponOutcome contains "year", "month", "week" or "day".
## Get the correspondant integer from each row index and
## apply the formula to get all ages in days.
train.year.list <- grep("year", train$AgeuponOutcome)
train$AgeuponOutcome[train.year.list] <- train.age[train.year.list] * 365
train.month.list <- grep("month", train$AgeuponOutcome)
train$AgeuponOutcome[train.month.list] <- train.age[train.month.list] * 30
train.week.list <- grep("week", train$AgeuponOutcome)
train$AgeuponOutcome[train.week.list] <- train.age[train.week.list] * 7
train.day.list <- grep("day", train$AgeuponOutcome)
train$AgeuponOutcome[train.day.list] <- train.age[train.day.list]
test.year.list <- grep("year", test$AgeuponOutcome)
test$AgeuponOutcome[test.year.list] <- test.age[test.year.list] * 365
test.month.list <- grep("month", test$AgeuponOutcome)
test$AgeuponOutcome[test.month.list] <- test.age[test.month.list] * 30
test.week.list <- grep("week", test$AgeuponOutcome)
test$AgeuponOutcome[test.week.list] <- test.age[test.week.list] * 7
test.day.list <- grep("day", test$AgeuponOutcome)
test$AgeuponOutcome[test.day.list] <- test.age[test.day.list]
train$AgeuponOutcome <- as.integer(train$AgeuponOutcome)
test$AgeuponOutcome <- as.integer(test$AgeuponOutcome)
```
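The repetitive conversion above can also be written more compactly with a unit-to-days lookup table. A sketch that is equivalent under the same assumptions (30-day months, 365-day years); `AgeToDays` is our own name, not part of the original code:

```{r echo = TRUE, eval = FALSE}
## Convert strings like "2 years" or "3 weeks" to an age in days; unmatched
## values (e.g. the 0 placeholder for unknown ages) become 0.
AgeToDays <- function(age) {
  n <- suppressWarnings(as.integer(sub("^([[:digit:]]+).*", "\\1", age)))
  unit <- sub("s$", "", sub("^[[:digit:]]+ *", "", age))
  unit.days <- c(day = 1, week = 7, month = 30, year = 365)
  days <- n * unname(unit.days[unit])
  ifelse(is.na(days), 0L, as.integer(days))
}
```

Applied as `train$AgeuponOutcome <- AgeToDays(train$AgeuponOutcome)`, this avoids the separate `grep` pass per unit.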
Let's now see how the ages are distributed as a function of the number of animals and outcomes.
```{r echo = FALSE, message = FALSE, warning = FALSE}
train.copy$Age <- train$AgeuponOutcome
ageuponoutcome %>%
filter(grepl("year", AgeuponOutcome)) %>%
mutate(AgeYear = as.integer(gsub(" year[s]?", "", AgeuponOutcome))) %>%
ggplot(aes(x = AgeYear, fill = AnimalType)) +
geom_bar(stat = "count") +
facet_wrap(~OutcomeType, nrow = 2) +
ggtitle("1 year or older animals by outcome and animal type") +
labs(y = "Number of animals", x = "Age of animals") +
scale_fill_brewer(palette = "Set1")
ageuponoutcome %>%
filter(grepl("month", AgeuponOutcome)) %>%
mutate(AgeMonth = as.integer(gsub(" month[s]?", "", AgeuponOutcome))) %>%
ggplot(aes(x = AgeMonth, fill = AnimalType)) +
geom_bar(stat = "count") +
facet_wrap(~OutcomeType, nrow = 2) +
ggtitle("Animals between 1 and 11 months old by outcome and animal types") +
labs(y = "Number of animals", x = "Age of animals") +
scale_fill_brewer(palette = "Set1")
ageuponoutcome %>%
filter(grepl("week", AgeuponOutcome)) %>%
mutate(AgeWeek = as.integer(gsub(" week[s]?", "", AgeuponOutcome))) %>%
ggplot(aes(x = AgeWeek, fill = AnimalType)) +
geom_bar(stat = "count") +
facet_wrap(~OutcomeType, nrow = 2) +
ggtitle("Animals between 1 and 5 weeks old by outcome and animal types") +
labs(y = "Number of animals", x = "Age of animals") +
scale_fill_brewer(palette = "Set1")
ageuponoutcome %>%
filter(grepl("day", AgeuponOutcome)) %>%
mutate(AgeDay = as.integer(gsub(" day[s]?", "", AgeuponOutcome))) %>%
ggplot(aes(x = AgeDay, fill = AnimalType)) +
geom_bar(stat = "count") +
facet_wrap(~OutcomeType, nrow = 2) +
ggtitle("Animals between 1 and 6 days old by outcome and animal types") +
labs(y = "Number of animals", x = "Age of animals") +
scale_fill_brewer(palette = "Set1")
```
We can see 3 distinct age groups: days and weeks, months, and years. The days and weeks group contains no adoptions; mostly transfers occur. We can state the following hypotheses from this group.
1. The Animal Center makes the adoption possible only when an animal is 1 month or older.
2. People do not want to adopt baby animals because they would need to house-train them.
3. Animals get neutered or spayed starting at 1 month old.
The months group, representing animals from 1 month to 11 months, clearly shows that 2-month-old animals are mostly adopted.
The years group, representing animals 1 year old or more, is described by the following density bar chart.
```{r echo = FALSE, message = FALSE, warning = FALSE}
# SELECT AgeuponOutcome, OutcomeType, AnimalType, COUNT(AgeuponOutcome) AS Frequency,
# CAST(SUBSTRING(AgeuponOutcome, 1, PATINDEX(' year[s]?', AgeuponOutcome)) AS INT) AS AgeYear
# FROM Animal
# WHERE OutcomeType = 'Adoption'
# GROUP BY AgeuponOutcome
# HAVING AgeuponOutcome LIKE '%year%'
age.filtered <- ageuponoutcome %>%
filter(grepl("year", AgeuponOutcome) & OutcomeType == "Adoption") %>%
mutate(AgeYear = as.integer(gsub(" year[s]?", "", AgeuponOutcome))) %>%
mutate(Frequency = n()) %>%
ungroup() %>%
arrange(AgeYear) %>%
distinct(AgeYear)
# If there are no animals for particular ages, we set the density to 0.
age.sequence <- seq(min(age.filtered$AgeYear), max(age.filtered$AgeYear), by = 1)
age.notfound <- age.sequence[!age.sequence %in% age.filtered$AgeYear]
newyear <- data.frame(AgeuponOutcome = paste(age.notfound, "years"),
AnimalType = "Dog",
OutcomeType = "Adoption",
AgeYear = age.notfound,
Frequency = 0)
age.filtered <- bind_rows(age.filtered, newyear) %>%
arrange(AgeYear)
age.filtered %>%
ggplot(aes(x = AgeYear)) +
geom_bar(aes(y = Frequency / sum(Frequency)), stat = "identity") +
ggtitle("1 year or older adopted animals by animal type") +
labs(y = "Density", x = "Age of animals") +
scale_x_continuous(breaks = seq(min(age.filtered$AgeYear), max(age.filtered$AgeYear), by = 1)) +
scale_fill_brewer(palette = "Set1")
age.density <- data.frame(year_density = age.filtered$Frequency / sum(age.filtered$Frequency),
poisson_density = dpois(1:18, lambda = 1.6))
print(age.density)
plot(age.density$year_density,
type = "l",
main = "Age and Poisson densities comparison.",
xlab = "Age",
ylab = "Density")
lines(age.density$poisson_density, col = 2)
legend("topright", c("Age Density", "Poisson Density"), col = 1:2, cex = 0.8, fill = 1:2)
```
The discrete random variable Age is greater than 0, and ages are independent of each other; that is, age X cannot influence age Y. Given the bar chart, it is reasonable to say that the age approximately follows a Poisson distribution with lambda (mean) 1.6.
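This visual impression can be checked more formally. A sketch using a chi-squared goodness-of-fit test against Poisson(1.6); note that the expected counts for the oldest ages are small, so the p-value should be read with caution:

```{r echo = TRUE, eval = FALSE}
## Compare the observed yearly-age frequencies with Poisson(1.6) expectations.
observed <- age.filtered$Frequency
expected.p <- dpois(seq_along(observed), lambda = 1.6)
chisq.test(observed, p = expected.p, rescale.p = TRUE)
```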
## Animal's Name
The feature `Name` is transformed into a boolean value: 0 when the animal has no name and 1 otherwise. Logically, the name of an animal should not have any impact on the outcomes, but knowing whether an animal has a name may influence them.
```{r echo = FALSE, message = FALSE, warning = FALSE}
train.copy$NameLength <- ifelse(train.copy$Name != 0, str_length(train.copy$Name), 0)
train.copy %>%
select(NameLength, OutcomeType, AnimalType) %>%
group_by(NameLength) %>%
ggplot(aes(x = NameLength, fill = OutcomeType)) +
geom_bar(stat = "count") +
facet_wrap(~AnimalType, ncol = 2) +
ggtitle("Name length of animals by Outcome") +
scale_x_continuous(breaks = seq(min(train.copy$NameLength), max(train.copy$NameLength), by = 1)) +
scale_fill_brewer(palette = "Set1") +
labs(y = "Number of animals", x = "Name Length")
train.cats.noname <- train.copy %>%
select(AnimalType, Name, OutcomeType) %>%
filter(AnimalType == "Cat" & Name == 0 & OutcomeType == "Transfer")
train.transfers <- train.copy %>%
select(OutcomeType) %>%
filter(OutcomeType == "Transfer")
name.summary <- summary(train.copy$NameLength[train.copy$NameLength > 0])
print(name.summary)
```
From the bar chart, we can see that the name length follows a normal distribution if we exclude lengths of 0. In this context, it is possible for an animal to have no name, so a length of 0 is not considered an anomaly.
```{r echo = FALSE, message = FALSE, warning = FALSE}
name.length <- train.copy %>%
select(NameLength, AnimalType) %>%
group_by(NameLength) %>%
filter(NameLength > 0) %>%
mutate(Frequency = n()) %>%
ungroup() %>%
arrange(NameLength) %>%
distinct(NameLength)
name.length %>%
ggplot(aes(x = NameLength)) +
geom_bar(aes(y = Frequency / sum(Frequency)), stat = "identity") +
scale_x_continuous(breaks = seq(min(name.length$NameLength), max(name.length$NameLength), by = 1)) +
ggtitle("Density of the name length of animals") +
labs(y = "Density", x = "Name length") +
scale_fill_brewer(palette = "Set1")
names.length.density <- data.frame(length_density = name.length$Frequency / sum(name.length$Frequency),
normal_density = dnorm(1:12, mean = 5, sd = 1.35))
print(names.length.density)
plot(names.length.density$length_density,
type = "l",
main = "Name length and Normal density comparison.",
xlab = "Name length",
ylab = "Density")
lines(names.length.density$normal_density, col = 2)
legend("topright", c("Name length Distribution", "Normal Distribution"), col = 1:2, cex = 0.8, fill = 1:2)
```
The density of the name length distribution is approximated by the Normal distribution. Mathematically, we define the random variable $L$ as the name length of an animal such that $L \sim N(5, 1.35^2)$.
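The parameters $\mu = 5$ and $\sigma = 1.35$ were chosen by eye; they can also be estimated directly from the data with a sketch like:

```{r echo = TRUE, eval = FALSE}
## Estimate the mean and standard deviation of the name length (lengths > 0).
lengths <- train.copy$NameLength[train.copy$NameLength > 0]
c(mean = mean(lengths), sd = sd(lengths))
```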
Animals having no name are mostly transferred, and for cats this is very significant. Transferred cats having no name represent `r nrow(train.cats.noname) / train.n * 100`% of the train set and `r nrow(train.cats.noname) / nrow(train.transfers) * 100`% of all the transferred animals of the train set. This is clearly an insight to consider. Therefore, we transform the string values of the `Name` feature into boolean values telling whether the animal has a name (1) or not (0).
```{r echo = TRUE, message = FALSE, warning = FALSE}
train$Name[train$Name != 0] <- 1
test$Name[test$Name != 0] <- 1
```
## Animal's Sterility and Sex
We want to see whether extracting information about the animal's sterility will influence the outcomes. Generally, people who want to adopt a dog or a cat want to know if the animal is sterile or intact. Let's see if this holds in our dataset.
```{r echo = FALSE, message = FALSE, warning = FALSE}
train.copy %>%
select(SexuponOutcome, OutcomeType, AnimalType) %>%
mutate(SexuponOutcome = ifelse(SexuponOutcome == 0, "Unknown", SexuponOutcome)) %>%
group_by(SexuponOutcome) %>%
ggplot(aes(x = SexuponOutcome, fill = OutcomeType)) +
geom_bar(stat = "count") +
facet_wrap(~AnimalType, ncol = 2) +
ggtitle("Sex and sterility by outcome") +
scale_fill_brewer(palette = "Set1") +
labs(y = "Number of animals", x = "Sex and sterility") +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))
```
We can see that sterile animals are mostly adopted, which confirms our hypothesis. Unknown or intact animals are mostly transferred. This makes sense for the unknown ones, since they may need to be transferred to the clinic to clearly identify their sex and whether they are sterile. Note that animals for which we have no information on sterility and sex are not adopted. Thus, knowing whether the animal is sterile influences the outcomes.
From this bar chart, we distinguish 3 groups: Unknown, Intact (not neutered or not spayed), Sterile (neutered or spayed).
We replace the feature `SexuponOutcome` by `Sterility`, where Unknown = 0, Intact = 1 and Sterile = 2. Since there is no significant difference between sexes, we won't create a `Sex` feature.
```{r echo = TRUE, message = FALSE, warning = FALSE}
train$Sterility <- 0
train$Sterility[grep("Intact", train$SexuponOutcome)] <- 1
train$Sterility[grep("Spayed|Neutered", train$SexuponOutcome)] <- 2
test$Sterility <- 0
test$Sterility[grep("Intact", test$SexuponOutcome)] <- 1
test$Sterility[grep("Spayed|Neutered", test$SexuponOutcome)] <- 2
train$SexuponOutcome <- NULL
test$SexuponOutcome <- NULL
```
## Date & Time
The date and time may have a big influence on the outcomes. We have seen that adoption is the most frequent outcome in the dataset. Our hypothesis is that people mostly adopt an animal on the weekend.
```{r echo = FALSE, message = FALSE, warning = FALSE}
train.copy$Weekday <- weekdays(as.Date(train$DateTime))
train.copy$Weekday <- factor(train.copy$Weekday,
levels = c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"))
#train.copy <- train.copy[order(train.copy$Weekday), ]
train.date <- ymd_hms(train$DateTime)
train.copy %>%
select(Weekday, OutcomeType, AnimalType) %>%
group_by(Weekday) %>%
ggplot(aes(x = Weekday, fill = OutcomeType)) +
geom_bar(stat = "count") +
ggtitle("Weekdays by outcome") +
facet_wrap(~AnimalType, ncol = 2) +
scale_fill_brewer(palette = "Set1") +
labs(y = "Number of animals", x = "Weekdays") +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))
```
From the bar chart, we can see that animals are mostly adopted on the weekend (Saturday and Sunday), which confirms our hypothesis. This makes sense since most people work from Monday to Friday. Looking at the [Austin Animal Center](http://www.austintexas.gov/department/animal-services) website, animal adoption hours are 11am - 7pm every day. From this information, we suppose that extracting the hour from the `DateTime` feature could influence the outcomes.
```{r echo = FALSE, message = FALSE, warning = FALSE}
train.copy$Hour <- hour(train.date)
train.copy %>%
select(Hour, OutcomeType, AnimalType) %>%
group_by(Hour) %>%
ggplot(aes(x = Hour, fill = OutcomeType)) +
geom_bar(stat = "count") +
ggtitle("Hours by outcome") +
facet_wrap(~AnimalType, ncol = 2) +
labs(y = "Number of animals", x = "Hours") +
scale_fill_brewer(palette = "Set1")
```
From the bar chart, we see that adoptions and returns to owner occur mostly at 5pm and 6pm. Generally, people finish working around 4:30pm - 5pm, so this makes sense. There are no outcomes between 1am and 4am inclusively. There are transfers at midnight, and at 9am mostly for cats, but most are done between 11am and 7pm, which corresponds to the opening hours of the center. For adoptions and transfers done outside opening hours, we consider the following hypotheses:
1. They were done under exceptional circumstances.
2. Errors were made when entering the hours in the system.
```{r echo = FALSE, message = FALSE, warning = FALSE}
train.copy$Minute <- minute(train.date)
train.copy %>%
select(Minute, OutcomeType, AnimalType) %>%
group_by(Minute) %>%
ggplot(aes(x = Minute, fill = OutcomeType)) +
geom_bar(stat = "count") +
ggtitle("Minutes by outcome") +
labs(y = "Number of animals", x = "Minutes") +
scale_fill_brewer(palette = "Set1")
```
Looking at minutes seems useless since 1 minute is very short and most of us look only at the hour, unless the software also records the minutes and even the seconds. But from this bar chart, we can see that at minute 0 there are a lot of transfers compared to the other minutes. From minute 1 to 59, the differences are negligible since the counts approximately follow a uniform distribution, as expected. The spike at minute 0 could be explained by an automatic setting: if the user does not enter the minutes when recording a new outcome, the software converts the empty minutes (and seconds) to "00". Since this is recurrent for the transfers at 9:00am, it could be the same employee recording them. The same applies if the user enters only the date: the time may be set automatically to 00:00:00. Therefore, it is possible that some of the hour and minute entries are not correct.
Looking at the days, we should see more adoptions on the weekend since the majority of people are not working then.
```{r echo = FALSE, message = FALSE, warning = FALSE}
train.copy$Day <- day(train.date)
train.copy %>%
select(Day, OutcomeType, AnimalType) %>%
group_by(Day) %>%
ggplot(aes(x = Day, fill = OutcomeType)) +
geom_bar(stat = "count") +
facet_wrap(~AnimalType, ncol = 2) +
ggtitle("Days in months by outcome") +
labs(y = "Number of animals", x = "Days in months") +
scale_x_continuous(breaks = seq(min(train.copy$Day), max(train.copy$Day), by = 5)) +
scale_fill_brewer(palette = "Set1")
```
There are 4 days (12, 13, 18, 19) where the number of adopted cats is higher than on the other days. Also, note that there are fewer animals on day 31. Indeed, since only 7 months out of 12 have 31 days, this result makes sense.
We look now at the months to check if some of them can reveal important insights.
```{r echo = FALSE, message = FALSE, warning = FALSE}
## Ordering by month names for ggplot.
train.copy$Month <- months(train.date)
train.copy$Month <- factor(train.copy$Month,
levels = format(ISOdate(2004, 1:12, 1),"%B"))
#train.copy <- train.copy[order(train.copy$Month), ]
train.copy %>%
select(Month, OutcomeType, AnimalType) %>%
group_by(Month) %>%
ggplot(aes(x = Month, fill = OutcomeType)) +
geom_bar(stat = "count") +
facet_wrap(~AnimalType, ncol = 2) +
ggtitle("Months by outcome") +
labs(y = "Number of animals", x = "Months") +
scale_fill_brewer(palette = "Set1") +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))
```
From the bar chart, we can see that December and January are the months with the highest number of dog adoptions. For cats, July and December have the highest number of adoptions. This can be explained by holidays like Christmas and New Year's Day. One possible hypothesis is that children want a dog or a cat for Christmas, so they ask their parents to adopt an animal as a Christmas gift.
```{r echo = FALSE, message = FALSE, warning = FALSE}
train.copy$Year <- year(train.date)
train.copy %>%
select(Year, OutcomeType, AnimalType) %>%
group_by(Year) %>%
ggplot(aes(x = Year, fill = OutcomeType)) +
geom_bar(stat = "count") +
facet_wrap(~AnimalType, ncol = 2) +
ggtitle("Years by outcome") +
labs(y = "Number of animals", x = "Years") +
scale_fill_brewer(palette = "Set1")
```
Years 2013 and 2016 have fewer outcomes than 2014 and 2015 since the dataset starts on `r min(train.date)` and ends on `r max(train.date)`.
We did not mention the seconds since there is only `r nrow(filter(train, second(train.date) > 0))` animal whose seconds differ from 0.
From the `DateTime` feature of the dataset, we extract several useful features, all transformed to non-negative integers. The extracted features are the following:
* Day: integer from 1 to 31, depending on the month
* Month: integer from 1 to 12
* Year: integer from 2013 to 2016
* Weekday: integer from 0 to 6, representing Sunday to Saturday
* DateInDays: number of days since the oldest date of the dataset
* Hour: integer from 0 to 23, where 0 is midnight
* Time: the time in minutes since midnight, given by $60h + m$ where $h$ is the hour and $m$ the minutes (e.g., 2:35pm gives $60 \times 14 + 35 = 875$)
```{r echo = TRUE, message = FALSE, warning = FALSE}
train$Weekday <- as.POSIXlt(train$DateTime)$wday
test$Weekday <- as.POSIXlt(test$DateTime)$wday
train.date <- ymd_hms(train$DateTime)
test.date <- ymd_hms(test$DateTime)
## The year starts at 0 which is the minimum year of the dataset.
train$Year <- year(train.date)
train$Month <- month(train.date)
train$Day <- day(train.date)
train$Hour <- hour(train.date)
train$DateInDays <- difftime(as.Date(train.date,'%Y/%m/%d'),
as.Date(min(train.date),'%Y/%m/%d'),
units = c("days"))
train$Time <- hour(train.date) * 60 + minute(train.date)
test$Year <- year(test.date)
test$Month <- month(test.date)
test$Day <- day(test.date)
test$Hour <- hour(test.date)
test$DateInDays <- difftime(as.Date(test.date,'%Y/%m/%d'),
as.Date(min(test.date),'%Y/%m/%d'),
units = c("days"))
test$Time <- hour(test.date) * 60 + minute(test.date)
train$DateTime <- NULL
test$DateTime <- NULL
```
## Animal's Breed
Since we have many breeds, we create a breed class named 'Others' that groups all breeds with fewer than 100 animals. The breed should be important since, for example, people may not want to adopt an animal perceived as aggressive.
```{r echo = FALSE, message = FALSE, warning = FALSE}
## SELECT Breed1, OutcomeType, COUNT(Breed1) AS Breed1Count
## FROM Animal
## GROUP BY Breed1
## HAVING Breed1Count >= 100
breed.nomix <- gsub(" Mix", "", train.copy$Breed)
train.copy$Breed1 <- strsplit(x = breed.nomix, split = "/") %>% sapply(function(x){x[1]})
breeds.common <- train.copy %>%
select(Breed1) %>%
group_by(Breed1) %>%
filter(n() >= 100)
train.copy$Breed1[!(train.copy$Breed1 %in% breeds.common$Breed1)] <- "Others"
## Note that we cannot use 'facet_wrap' because cats don't have the same breeds as the dogs.
train.copy %>%
filter(AnimalType == "Dog") %>%
ggplot(aes(x = Breed1, fill = OutcomeType)) +
geom_bar(stat = "count") +
ggtitle("Dog breeds by outcome") +
scale_fill_brewer(palette = "Set1") +
labs(y = "Number of animals", x = "Breeds") +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))
train.copy %>%
filter(AnimalType == "Cat") %>%
ggplot(aes(x = Breed1, fill = OutcomeType)) +
geom_bar(stat = "count") +
ggtitle("Cat breeds by outcome") +
scale_fill_brewer(palette = "Set1") +
labs(y = "Number of animals", x = "Breeds") +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))
```
For the dogs, Shih Tzus are mostly transferred and rarely adopted. Pit Bulls, Labrador Retrievers and Chihuahua Shorthairs are the most common dog breeds among the outcomes. For cats, the Domestic Shorthair is the most frequent breed and represents `r nrow(filter(train.copy, grepl("Domestic Shorthair", Breed) & AnimalType == "Cat")) / nrow(filter(train.copy, AnimalType == "Cat")) * 100` % of the cats in the train set.
Let's summarize the number of adopted animals and their percentage grouped by breed. This way, we will see which breeds have the lowest adoption ratio.
```{r echo = FALSE, message = FALSE, warning = FALSE}
train.adopted.list <- train.copy %>%
group_by(Breed1, OutcomeType) %>%
summarise(Total = n()) %>%
mutate(Percentage = Total / sum(Total)) %>%
filter(OutcomeType == "Adoption")
train.adopted.list <- train.adopted.list[with(train.adopted.list, order(Percentage)), ]
print(train.adopted.list)
train.adopted.list %>%
ggplot(aes(x = Breed1, y = Percentage)) +
geom_bar(stat = "identity") +
ggtitle("Percentage of adopted animal by breed") +
scale_y_continuous(labels = percent) +
labs(y = "Percentage of animals", x = "Breeds") +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))
```
From the bar chart and summary, we see that `r paste0(head(train.adopted.list$Breed1, 3), collapse = ", ")` are the breeds with the lowest percentage of adoptions.
We also check whether the breed types (Mix, Cross or Pure) have a significant impact on the outcomes.
```{r echo = FALSE, message = FALSE, warning = FALSE}
train.breed.mix.list <- grep(" Mix", train$Breed)
train.breed.cross.list <- grep("/", train$Breed)
train.copy$BreedType <- "Pure"
train.copy$BreedType[train.breed.cross.list] <- "Cross"
train.copy$BreedType[train.breed.mix.list] <- "Mix"
train.copy %>%
ggplot(aes(x = BreedType, fill = OutcomeType)) +
geom_bar(stat = "count") +
ggtitle("Animal breed types by outcome") +
facet_wrap(~AnimalType, ncol = 2) +
scale_fill_brewer(palette = "Set1") +
labs(y = "Number of animals", x = "Breed Types") +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))
```
We extract this information from the `Breed` feature and add a new feature `BreedType`, where purebred animals are encoded as 0, crossed breeds as 1 and mixed breeds as 2.
```{r echo = TRUE, message = FALSE, warning = FALSE}
train$BreedType <- 0
train$BreedType[train.breed.cross.list] <- 1
train$BreedType[train.breed.mix.list] <- 2
test.breed.mix.list <- grep(" Mix", test$Breed)
test.breed.cross.list <- grep("/", test$Breed)
test$BreedType <- 0
test$BreedType[test.breed.cross.list] <- 1
test$BreedType[test.breed.mix.list] <- 2
```
Special cases for Pit Bull, Shih Tzu and Pug are considered since they are adopted less often than the others.
```{r echo = TRUE, message = FALSE, warning = FALSE}
breeds <- c("Pit Bull", "Shih Tzu", "Pug")
train$CommonBreeds <- GetIntegerFeatureFromGroups(train$Breed, breeds)
test$CommonBreeds <- GetIntegerFeatureFromGroups(test$Breed, breeds)
train$Breed <- NULL
test$Breed <- NULL
```
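The helper `GetIntegerFeatureFromGroups` is defined earlier in this report; as a reminder, here is a minimal sketch of the behavior assumed at this step (each value maps to the 1-based index of the first group it matches, 0 otherwise; the sketch function name is hypothetical):

```{r echo = TRUE, message = FALSE, warning = FALSE}
## Hypothetical sketch of the assumed behavior of GetIntegerFeatureFromGroups.
GetIntegerFeatureFromGroupsSketch <- function(values, groups) {
  result <- integer(length(values))
  for (i in seq_along(groups)) {
    ## Mark values matching this group, unless already assigned to one.
    matched <- grepl(groups[i], values) & result == 0
    result[matched] <- i
  }
  result
}
GetIntegerFeatureFromGroupsSketch(c("Pit Bull Mix", "Beagle", "Shih Tzu"),
                                  c("Pit Bull", "Shih Tzu", "Pug"))
## Under this assumption: 1 0 2
```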
## Animal's Color
For each outcome type, we want to know which animal colors have the greatest and lowest counts. Colors with fewer than 100 animals are categorized as `Others`, meaning that these colors are not common.
```{r echo = FALSE, message = FALSE, warning = FALSE}
## SELECT Color1, OutcomeType, COUNT(Color1) AS Color1Count
## FROM Animal
## GROUP BY Color1
## HAVING Color1Count >= 100
train.copy$Color1 <- strsplit(x = train.copy$Color, split = "/") %>% sapply(function(x){x[1]})
colors.common <- train.copy %>%
select(Color1) %>%
group_by(Color1) %>%
filter(n() >= 100)
train.copy$Color1[!(train.copy$Color1 %in% colors.common$Color1)] <- "Others"
train.copy %>%
filter(AnimalType == "Dog") %>%
ggplot(aes(x = Color1, fill = OutcomeType)) +
geom_bar(stat = "count") +
ggtitle("Dog colors by outcome") +
scale_fill_brewer(palette = "Set1") +
labs(y = "Number of animals", x = "Colors") +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))
train.copy %>%
filter(AnimalType == "Cat") %>%
ggplot(aes(x = Color1, fill = OutcomeType)) +
geom_bar(stat = "count") +
ggtitle("Cat colors by outcome") +
scale_fill_brewer(palette = "Set1") +
labs(y = "Number of animals", x = "Colors") +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))
```
For the cats, we can see that tan cats are always adopted and tricolor cats are always transferred. Another notable insight is that no brown cat died, was euthanized or was returned to its owner. However, the number of cats of those colors is small. According to this bar chart, black is the most common color for both cats and dogs.
Some colors have a qualifier, but these should not have a big influence on the outcomes. We take the feature `Color1` to determine the qualifiers and see what happens.
```{r echo = FALSE, message = FALSE, warning = FALSE}
train.copy$ColorQualifier1 <- strsplit(x = train.copy$Color1, split = "\\s") %>%
sapply(function(x){ifelse(length(x) > 1, x[2], "None")})
color.qualifier <- train.copy %>%
select(ColorQualifier1, OutcomeType, AnimalType) %>%
group_by(ColorQualifier1)
color.qualifier %>%
filter(AnimalType == "Dog") %>%
ggplot(aes(x = ColorQualifier1, fill = OutcomeType)) +
geom_bar(stat = "count") +
ggtitle("Dog color qualifiers by outcome") +
scale_fill_brewer(palette = "Set1") +
labs(y = "Number of animals", x = "Color Qualifiers") +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))
color.qualifier %>%
filter(AnimalType == "Cat") %>%
ggplot(aes(x = ColorQualifier1, fill = OutcomeType)) +
geom_bar(stat = "count") +
ggtitle("Cat color qualifiers by outcome") +
scale_fill_brewer(palette = "Set1") +
labs(y = "Number of animals", x = "Color Qualifiers") +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))
```
The bar chart above shows that qualifiers have no significant influence on the outcomes for dogs. For cats, tabby is frequent.
From the color feature, we can extract the number of colors. If we find a slash (`/`), which we treat as a separator, then the animal has 2 colors. If we find `Tricolor`, then the animal has 3 colors.
```{r echo = FALSE, message = FALSE, warning = FALSE}
train.copy$NumberOfColors <- str_count(train.copy$Color, "/") + 1
train.copy$NumberOfColors[grep("Tricolor", train.copy$Color)] <- 3
train.copy %>%
ggplot(aes(x = NumberOfColors, fill = OutcomeType)) +
geom_bar(stat = "count") +
ggtitle("Number of colors for an animal by outcome") +
facet_wrap(~AnimalType, ncol = 2) +
scale_fill_brewer(palette = "Set1") +
labs(y = "Number of animals", x = "Number of colors")
train$NumberOfColors <- train.copy$NumberOfColors
test$NumberOfColors <- str_count(test$Color, "/") + 1
test$NumberOfColors[grep("Tricolor", test$Color)] <- 3
train$Color <- NULL
test$Color <- NULL
```
<!----------------------------------------------------------------------TRAINING MODEL------------------------------------------------------------------------->
# Modeling and Interpretation
In this section, we present which algorithm is used, why we use it and how it is used. The objective is to build the final submission file containing the probabilities of each outcome type for each animal in the test set.
Extreme Gradient Boosting trees are a good choice to model our data. Each tree is a decision tree whose leaves group animals with approximately the same outcome probabilities. For example, the first split could be whether the animal is sterile, intact or unknown, and the second split whether the animal has a name.
## Fine Tuning Parameters
We prepare the parameters and matrices for the cross-validation and final prediction. Since we need the probabilities for each class, the objective used is `multi:softprob`. This objective requires the number of classes `num_class` to predict, which is 5 in our case. The evaluation metric is the multi-class log-loss function, named `mlogloss` in `XGBoost`.
We use the `subsample` parameter, which is the subsample ratio of the training instances. The `eta` parameter controls the learning rate: each new tree's contribution is shrunk by this factor.
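For reference, with $N$ animals and $K$ outcome classes, the `mlogloss` metric minimized here is

$$-\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{K} y_{ij}\log(p_{ij})$$

where $y_{ij}$ is 1 if animal $i$ has outcome $j$ (0 otherwise) and $p_{ij}$ is the predicted probability of that outcome, clipped away from 0 so the logarithm stays finite.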
```{r echo = TRUE, message = FALSE, warning = FALSE}
outcomes <- unique(train$OutcomeType)
outcome.type <- CategoryToInteger(train$OutcomeType) - 1
train$OutcomeSubtype <- NULL
train$OutcomeType <- NULL
outcome.class.num <- length(outcomes)
param <- list(objective = "multi:softprob",
num_class = outcome.class.num,
eta = 0.12, # Control the learning rate
subsample = 0.8,
max_depth = 8,
colsample_bytree = 0.9,
eval_metric = "mlogloss")
```
## Cross-Validation
We proceed to a 10-fold cross-validation to get the optimal number of trees and the multi-class log-loss score. We use random subsamples representing 80% of the training set. Since we do not have too many observations (animals) or features, we can use 10 folds instead of 5. Thus, the training set is split into 10 test samples, each with `r train.n / 10` observations.
For each number of trees, we average the 10 error estimates to obtain a more robust estimate of the true prediction error. This is done for all trees and gives the optimal number of trees to use on the test set.
We also display 2 curves showing the progression of the test and train multi-class log-loss means. The vertical dotted line marks the optimal number of trees. This plot shows whether the model overfits or underfits.
```{r echo = FALSE, message = FALSE, warning = FALSE}
cv.nfolds <- 10
cv.nrounds <- 230
train.matrix <- xgb.DMatrix(data.matrix(train), label = outcome.type)
model.cv <- xgb.cv(data = train.matrix,
nfold = cv.nfolds,
param = param,
nrounds = cv.nrounds,
verbose = 0)
model.cv$names <- as.integer(rownames(model.cv))
best <- model.cv[model.cv$test.mlogloss.mean == min(model.cv$test.mlogloss.mean), ]
cv.plot.title <- paste("Training log-loss using", cv.nfolds, "folds CV")
print(ggplot(model.cv, aes(x = names)) +
geom_line(aes(y = test.mlogloss.mean, colour = "test")) +
geom_line(aes(y = train.mlogloss.mean, colour = "train")) +
geom_vline(xintercept = best$names, linetype="dotted") +
ggtitle(cv.plot.title) +
xlab("Number of trees") +
ylab("log-loss"))
print(model.cv)
cat("\nOptimal testing set Log-Loss score:", best$test.mlogloss.mean)
cat("\nAssociated training set Log-Loss score:", best$train.mlogloss.mean)
cat("\nInterval testing set Log-Loss score: [", best$test.mlogloss.mean - best$test.mlogloss.std, ", ", best$test.mlogloss.mean + best$test.mlogloss.std, "].")
cat("\nDifference between optimal training and testing sets Log-Loss:", abs(best$train.mlogloss.mean - best$test.mlogloss.mean))
cat("\nOptimal number of trees:", best$names)
```
## Prediction
We proceed to the predictions of the test set and show the features importance.
```{r echo = FALSE, message = FALSE, warning = FALSE}
nrounds <- as.integer(best$names)
model = xgboost(param = param,
train.matrix,
nrounds = nrounds,
verbose = 0,
probability = TRUE)
test.matrix <- xgb.DMatrix(data.matrix(test))
prediction.test <- predict(model, test.matrix)
prediction.train <- predict(model, train.matrix)
#Check which features are the most important.
names <- dimnames(train)[[2]]
importance.matrix <- xgb.importance(names, model = model)
print(importance.matrix)
# Display the features importance.
print(xgb.plot.importance(importance.matrix))
```
Let's try the Random Forest algorithm.
```{r echo = FALSE, message = FALSE, warning = FALSE}
rf.model <- randomForest(x = train,
y = as.factor(outcome.type),
ntree = 100,
importance = TRUE,
do.trace = 5)
colnames(rf.model$err.rate) <- c("OOB", outcomes)
plot(rf.model, ylim = c(0, 1))
legend("topright", colnames(rf.model$err.rate), col = 1:6, cex = 0.8, fill = 1:6)
print(rf.model)
varImpPlot(rf.model)
par(mfrow = c(2, 2))
for (i in 1:outcome.class.num)
{
  ## Rotate the x-axis labels by 90 degrees and reduce their font size.
  barplot(sort(rf.model$importance[, i], decreasing = TRUE),
          main = outcomes[i],
          xlab = "",
          ylab = "Gain",
          las = 2,
          cex.names = 0.7)
}
rf.prediction.test <- predict(rf.model, test, type = "prob")
rf.prediction.train <- predict(rf.model, train, type = "prob")
```
Note that normally we would use a threshold to assign each animal to a single outcome. Since this Kaggle competition asks for probabilities, this is not required.
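As an illustration, a minimal sketch of such a thresholding rule (the function `AssignOutcome` and its cutoff are hypothetical, not part of the analysis above):

```{r echo = TRUE, message = FALSE, warning = FALSE}
## Hypothetical sketch: assign an outcome only when the top predicted
## probability exceeds a chosen cutoff, otherwise leave it undecided (NA).
AssignOutcome <- function(probs, threshold = 0.5) {
  if (max(probs) >= threshold) names(probs)[which.max(probs)] else NA
}
AssignOutcome(c(Adoption = 0.72, Transfer = 0.28))               # "Adoption"
AssignOutcome(c(Adoption = 0.40, Transfer = 0.35, Died = 0.25))  # NA
```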
## Multi-class Log-Loss
We can verify how our predictions score under the multi-class log-loss function. We apply our model's predictions to the train set and compare them to the real `OutcomeType` values of the train set.
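For clarity, here is a minimal sketch of the multi-class log-loss computation on a toy example (a simplified stand-in for the `MultiLogLoss` helper; the clipping constant is an assumption):

```{r echo = TRUE, message = FALSE, warning = FALSE}
## Sketch of the multi-class log loss: p holds predicted probabilities,
## y the one-hot encoded true outcomes. Probabilities are clipped to
## avoid log(0).
MultiLogLossSketch <- function(p, y, eps = 1e-15) {
  p <- pmin(pmax(p, eps), 1 - eps)
  -sum(y * log(p)) / nrow(p)
}
p <- matrix(c(0.7, 0.2, 0.1,
              0.1, 0.8, 0.1), nrow = 2, byrow = TRUE)
y <- matrix(c(1, 0, 0,
              0, 1, 0), nrow = 2, byrow = TRUE)
MultiLogLossSketch(p, y)  # -(log(0.7) + log(0.8)) / 2, about 0.29
```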
```{r echo = FALSE, message = FALSE, warning = FALSE}
prediction.train.matrix <- matrix(prediction.train, ncol = outcome.class.num, byrow = TRUE)
rows <- nrow(train)
outcomes.train <- matrix(0, nrow = rows, ncol = outcome.class.num)
for(i in 1:rows)
{
outcomes.train[i ,outcome.type[i] + 1] <- 1
}
multiclass.logloss <- MultiLogLoss(prediction.train.matrix, outcomes.train)
print(multiclass.logloss)
## Confusion matrix
predicted.train <- data.frame(prediction.train.matrix)
colnames(predicted.train) <- outcomes
predicted <- predicted.train / rowSums(predicted.train)
predicted$Class <- apply(predicted.train, 1, function(x) names(predicted.train)[which.max(x)])
confusionMatrix(as.factor(predicted$Class), as.factor(train.copy$OutcomeType))
```
## Submission
We write the `ID` and the predicted outcome classes in the submission file.
```{r echo = TRUE, message = FALSE, warning = FALSE}
prediction.test <- data.frame(matrix(prediction.test,
ncol = outcome.class.num,
byrow = TRUE))[, order(outcomes)]
colnames(prediction.test) <- sort(outcomes)
submission <- cbind(data.frame(ID = test.id), prediction.test)
write.csv(submission, "Submission.csv", row.names = FALSE)
```
<!---------------------------------------------------------------RECOMMENDATIONS------------------------------------------------------------------------------->
# Recommendations
The objective is to understand trends in animal outcomes. These insights could help shelters focus their energy on specific animals that need a little extra help finding a new home. Therefore, we need insights that help increase the number of adopted animals.
An important insight found in the dataset is whether the animal is sterile, not sterile or unknown.
```{r echo = FALSE, message = FALSE, warning = FALSE}
train.copy %>%
select(SexuponOutcome, OutcomeType, AnimalType) %>%
group_by(SexuponOutcome) %>%
ggplot(aes(x = SexuponOutcome, fill = OutcomeType)) +
geom_bar(stat = "count") +
facet_wrap(~AnimalType, ncol = 2) +
ggtitle("Sex and sterility by outcome") +
scale_fill_brewer(palette = "Set1") +
labs(y = "Number of animals", x = "Sex and sterility") +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))
```
The bar chart shows that animals with no information about their sterility are never adopted. They represent `r nrow(filter(train.copy, SexuponOutcome == 0)) / train.n * 100` % of the dataset for which we already know the outcomes. Our predictions agree: the predicted adoption probabilities are very low, as the following summary shows.
```{r echo = FALSE, message = FALSE, warning = FALSE}
indices <- which(test$Sterility == 0)
summary(prediction.test$Adoption[indices])
intacts <- nrow(filter(train.copy, grepl("Intact", train.copy$SexuponOutcome)))
intacts.adoption <- nrow(filter(train.copy, grepl("Intact", train.copy$SexuponOutcome), OutcomeType == "Adoption")) / intacts * 100
steriles <- nrow(filter(train.copy, grepl("Spayed|Neutered", train.copy$SexuponOutcome)))