-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathCancerMortality.Rmd
1783 lines (1346 loc) · 63.6 KB
/
CancerMortality.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
---
output:
pdf_document: default
html_document: default
md_document:
variant: markdown_github
editor_options:
chunk_output_type: console
---
<!-- README.md is generated from CancerMortality.Rmd. Please edit that file. -->
```{r opts, echo = FALSE}
knitr::opts_chunk$set(
fig.path = "images/"
)
```
# Data Preparation
```{r, echo=FALSE, message=FALSE, warning=FALSE}
library(crayon)
library(car)
library(FactoMineR)
library(chemometrics)
library(corrplot)
library(RColorBrewer)
library(PerformanceAnalytics)
library(mice)
library(dplyr)
library(readr)
library(stringr)
library(MASS)
library(lmtest)
library(sandwich)
library(cluster)
```
First we import the data and save it as the variable "df" for future
modifications.
```{r}
par(mfrow=c(1,1))
df <- read_csv("data/train.csv")
```
Here's a breakdown of all the variables in the dataset.
**Variable Name** | **Num/Fac** | **Description**
---------------------- | ----------- | ----------------------
avganncount | N | Mean number of reported cases of cancer diagnosed annually (2010-2015
avgdeathsperyear | N | Mean number of reported mortalities due to cancer
target_deathrate | N | Response variable. Mean per capita (100,000) cancer mortalities
incidencerate | N | Mean per capita (100,000) cancer diagnoses
medincome | N | Median income per county
popest2015 | N | Population of county
povertypercent | N | Percent of population in poverty
studypercap | N | Per capita number of cancer-related clinical trials per county
binnedinc | F | Median income per capita binned by decile
medianage | N | Median age of county residents
medianagemale | N | Median age of male county residents
medianagefemale | N | Median age of female county residents
geography | F | County name
percentmarried | N | Percent of county residents who are married
pctnohs18_24 | N | Percent of county residents ages 18-24 highest education attained: less than high school
pcths18_24 | N | Percent of county residents ages 18-24 highest education attained: high school diploma
pctsomecol18_24 | N | Percent of county residents ages 18-24 highest education attained: some college
pcths25_over | N | Percent of county residents ages 25 and over highest education attained: high school diploma
pctbachdeg25_over | N | Percent of county residents ages 25 and over highest education attained: bachelor's degree
pctemployed16_over | N | Percent of county residents ages 16 and over employed
pctunemployed16_over | N | Percent of county residents ages 16 and over unemployed
pctprivatecoverage | N | Percent of county residents with private health coverage
pctprivatecoveragealone | N | Percent of county residents with private health coverage alone (no public assistance)
pctempprivcoverage | N | Percent of county residents with employee-provided private health coverage
pctpubliccoverage | N | Percent of county residents with government-provided health coverage
pctpubliccoveragealone | N | Percent of county residents with government-provided health coverage alone
pctwhite | N | Percent of county residents who identify as White
pctblack | N | Percent of county residents who identify as Black
pctasian | N | Percent of county residents who identify as Asian
pctotherrace | N | Percent of county residents who identify in a category which is not White, Black, or Asian
pctmarriedhouseholds | N | Percent of married households
birthrate | N | Number of live births relative to number of women in county
## Variable Types
All variables but `geography` should be numeric. Only one variable requires
any change: `binnedinc` is a string variable right now, but two things can be
created from it: a factor variable (`f.binnedinc`), and a numeric variable
holding the midpoint of each bin, which we'll conviniently call `binnedinc`.
```{r}
df$f.binnedinc <- as.factor(df$binnedinc)
# Use regex to remove the [,],( and ) from the rows:
inc.midpoints.text <- gsub("[\\[\\]()]", "", df$binnedinc, perl = T)
# Separate them into two numbers
inc.midpoints.text.sep <- strsplit(inc.midpoints.text, ",")
# Convert them to numbers and apply a mean between them to find the midpoint
df$binnedinc <- sapply(inc.midpoints.text.sep, function(x) mean(as.numeric(x)))
```
Note how `geography`, altough being non-numeric, it's not a factor variable.
This can be shown by looking at the number of unique values in the column.
```{r}
nrow(df); length(unique(df$geography))
```
Because the number of unique values is the same as the number of rows, we can
safely assume that `geography` is a unique identifier for each row. It won't be
removed yet, because, as proven in a later section, a factor variable can be
derived from it.
## Duplicated Data Points
Before proceeding with the analysis, we should check if there are any duplicated
rows in the data set. If there are, we shall remove them.
```{r}
duplicated_row_count <- sum(duplicated(df))
if (duplicated_row_count > 0) {
print(sprintf("There are %d duplicated rows.", duplicated_row_count))
df <- unique(df)
}
```
No duplicated rows were found in the data set.
## Missing data
Glancing over the data set, one can see that there are some missing values:
```{r}
for (colname in colnames(df)) {
na.count <- sum(is.na(df[[colname]]))
if (na.count > 0) {
cat(sprintf("%s has %s\n", colname, red(sprintf("%d N/As", sum(is.na(df[colname]))))))
}
}
```
* Column `pctsomecol18_24` has 1376 N/As, which makes for more than 75% of the
data. Due to that, it would most likely provide little to no meaningful
information, so a decision was made to remove it from the study.
* Column `pctemployed16_over` has 82 N/As, which is a manageable amount. They
can easily be imputed using the MICE method.
* Column `pctprivatecoveragealone` has 356 N/As, which acounts for less than 20%
of the rows. This is a small enough amount to be imputed using the MICE
method. Nontheless, this column will be removed from the study later on, for
reasons explained in the exploratory data analysis section.
```{r}
# Drop the column with too many missing values
df <- subset(df, select = -c(pctsomecol18_24))
# Impute missing values
res.mice <- mice(df)
complete_df = complete(res.mice, action = 1)
```
Before substituting the missing values, let's check the deciles of the variables
to see if the imputation makes sense.
```{r}
quantile(df$pctemployed16_over, na.rm = TRUE, probs = seq(0, 1, 0.1))
quantile(df$pctprivatecoveragealone, na.rm = TRUE, probs = seq(0, 1, 0.1))
```
Now let's substitute the missing values in the original data set. And check that
the deciles are still roughly the same.
```{r}
df$pctemployed16_over <- complete_df$pctemployed16_over
df$pctprivatecoveragealone <- complete_df$pctprivatecoveragealone
quantile(df$pctemployed16_over, na.rm = TRUE, probs = seq(0, 1, 0.1))
quantile(df$pctprivatecoveragealone, na.rm = TRUE, probs = seq(0, 1, 0.1))
```
Which they are.
## Univariate Outliers
Column-wise, we can count how many univariate outliers each numeric variable
has:
```{r}
for (colname in colnames(Filter(is.numeric, df))) {
col = df[[colname]]
q1 <- quantile(col, 0.25)
q3 <- quantile(col, 0.75)
iqr <- q3 - q1
severe <- list(top = q3 + 3 * iqr, bot = q1 - 3 * iqr)
mild <- list(top = q3 + 1.5 * iqr, bot = q1 - 1.5 * iqr)
severe_out <- sum(col > severe$top | col < severe$bot)
mild_out <- sum((col > mild$top & col < severe$top) | (col < mild$bot & col > severe$bot))
if (mild_out > 0 | severe_out > 0) {
cat(sprintf("Column %s has %d mild outliers and %d severe outliers\n", colname, mild_out, severe_out))
}
}
```
Row-wise, we'll count the numeric variables in which each data point is an
outlier, and create a new object called `univariate_outlier_count`. As a
gut-driven criterion, we shall consider a row to be an outlier if it is an
outlier in 10 or more variables. Based on this criterion, only 9 counties are.
```{r}
count_outliers <- function(data) {
# Function to check for outliers based on IQR
is_outlier <- function(x) {
Q1 <- quantile(x, 0.25, na.rm = TRUE)
Q3 <- quantile(x, 0.75, na.rm = TRUE)
IQR <- Q3 - Q1
lower_bound <- Q1 - 1.5 * IQR
upper_bound <- Q3 + 1.5 * IQR
return(x < lower_bound | x > upper_bound)
}
# Apply the outlier function to each column and sum the results for each row using dplyr
data %>%
mutate(outlier_count = rowSums(sapply(., is_outlier), na.rm = TRUE))
}
univariate_outlier_count <- count_outliers(Filter(is.numeric, df))$outlier_count
df[which(univariate_outlier_count >= 10),]
```
All of them have high percentages of non-white population, both black and asian,
a low median age, a high mortality count and a high bias towards private and
employee health coverage. Of these 9 counties, 7 are wealthy (Low poverty
percent) and 2 have a large poor population (over 20%).
Outliers can sometimes provide valuable information, so they won't be removed
from the data set just yet.
### The case of medianage
`medianage` is a continuous variable which contains some data points that make
no sense, for instance, median ages over 100. Thankfully, we have data for male
and female median age, which allow us to replace outlier points by a mean of
male and female age.
```{r}
boxplot(df$medianage, horizontal = TRUE, main = "Median Age")
out = which(df$medianage > 100)
df$medianage[out] <- (df$medianagemale[out] + df$medianagefemale[out]) / 2
```
As the following boxplot shows, the meaninglessly high values have taken a more
reasonable value.
```{r}
boxplot(df$medianage, horizontal = TRUE, main = "Median Age")
```
## Multivariate Outliers
We will apply Moutlier on the numerical variables in order to find multivariate
outliers. We have to perform the calculation excluding the variable studypercap
because otherwise the method is unable to execute due to multicollinearity
casuing a singularity matrix in the intermediate calculations. An extremely mild
threshold is chosen (0.00005%) because even using this threshold we get a
significant amount of multivariate outliers, 4% of the total sample. Lowering
the threshold even further doesn't change much the amount of outliers and rising
it higher makes the amount of outliers rise too much (10% outliers at 0.1%
significance level).
```{r}
numeric.df <- Filter(is.numeric, df)
numeric.df <- numeric.df[, !colnames(numeric.df) %in% c("studypercap")]
res.out_95 <- Moutlier(numeric.df, quantile = 0.95, plot=F)
multi_outliers_95 = which((res.out_95$md > res.out_95$cutoff)&(res.out_95$rd > res.out_95$cutoff))
length(multi_outliers_95)
res.out <- Moutlier(numeric.df, quantile = 0.9999995, plot=F)
multi_outliers = which((res.out$md > res.out$cutoff)&(res.out$rd > res.out$cutoff))
length(multi_outliers)
par(mfrow = c(1,1))
plot(res.out$rd, res.out$md )
abline(h=res.out$cutoff, col="red")
abline(v=res.out$cutoff, col="red")
```
There are 91 multivariate outliers in the data set (265 if we take a 95%
quantile).
# Exploratory Data Analysis
This section will be devided in two parts: single-variable analysis and
multi-variable analysis.
## Single-Variable analysis
This sub-section presents an analysis for each variable of the data set as a
standalone sample.
We'll be performing lots of discretisation of continuous variables based on
their quartiles, so let's create a function to do that.
```{r}
discretize_quartiles <- function(column, level_name) {
res <- cut(column, breaks = quantile(column, probs = seq(0, 1, 0.25)),
include.lowest = T,
labels=c(
sprintf("Low%s", level_name),
sprintf("LowMid%s", level_name),
sprintf("HighMid%s", level_name),
sprintf("High%s", level_name)
)
)
print(table(res)) # Print the table
return(res)
}
```
### Variable 1 - avganncount
This is a continuous ratio variable. The data does not look normally
distributed, which is confirmed by the near-null p-value of the shapiro
normallity test. A histogram is used to visualize the data.
```{r}
summary(df$avganncount)
hist(df$avganncount, breaks = 30, freq = F)
curve(dnorm(x, mean(df$avganncount), sd(df$avganncount)), add = T)
shapiro.test(df$avganncount)
```
An additional factor `f.avganncount` is created to discretize the data according
to the quartiles.
```{r}
df$f.avganncount <- discretize_quartiles(df$avganncount, "CaseCount")
```
### Variable 2 - avgdeathsperyear
This is also a continuous ratio variable similar to variable 1. The data does
not look normally distributed, which is confirmed by the near-null p-value of
the shapiro normallity test. Again a histogram is used to visualize the data.
```{r}
summary(df$avgdeathsperyear)
hist(df$avgdeathsperyear, breaks = 30, freq = F)
curve(dnorm(x, mean(df$avgdeathsperyear), sd(df$avgdeathsperyear)), add = T)
shapiro.test(df$avgdeathsperyear)
```
An additional factor `f.avgdeathsperyear` is created to discretize the data
according to the quartiles.
```{r}
df$f.avgdeathsperyear <- discretize_quartiles(df$avgdeathsperyear, "MortCount")
```
### Variable 3 - target_deathrate
This is the response variable. This is also a continuous ratio variable similar
to the previous variables. The data looks normally distributed, but it is not
and will be further discussed in the next section.
```{r}
summary(df$target_deathrate)
hist(df$target_deathrate, breaks = 30, freq = F)
curve(dnorm(x, mean(df$target_deathrate), sd(df$target_deathrate)), add = T)
shapiro.test(df$target_deathrate)
```
An additional factor `f.target_deathrate` is created to discretize the data
according to the quartiles.
```{r}
df$f.target_deathrate <- discretize_quartiles(df$target_deathrate, "DeathRate")
```
### Variable 4 - incidencerate
We have another continuous ratio variable similar to the previous variables. It
is not normally distributed according to the Shapiro test.
```{r}
summary(df$incidencerate)
hist(df$incidencerate, breaks = 30, freq = F)
curve(dnorm(x, mean(df$incidencerate), sd(df$incidencerate)), add = T)
shapiro.test(df$incidencerate)
```
An additional factor `f.incidencerate` is created to discretize the data
according to the quartiles.
```{r}
df$f.incidencerate <- discretize_quartiles(df$incidencerate, "DiagnPerCap")
```
### Variable 5 - medincome
Very similar to all the previous variables we have a continuous ratio variable
not normally distributed.
```{r}
summary(df$medincome)
hist(df$medincome, breaks = 30, freq = F)
curve(dnorm(x, mean(df$medincome), sd(df$medincome)), add = T)
shapiro.test(df$medincome)
```
An additional factor `f.medincome` is created to discretize the data according
to the quartiles.
```{r}
df$f.medincome <- discretize_quartiles(df$medincome, "MedianInc")
```
### Variable 6 - popest2015
Another continuous ratio variable not normally distributed.
```{r}
summary(df$popest2015)
hist(df$popest2015, breaks = 30, freq = F)
curve(dnorm(x, mean(df$popest2015), sd(df$popest2015)), add = T)
shapiro.test(df$popest2015)
```
An additional factor `f.popest2015` is created to discretize the data according
to the quartiles.
```{r}
df$f.popest2015 <- discretize_quartiles(df$popest2015, "MidPop")
```
### Variable 7 - povertypercent
Another continuous ratio variable not normally distributed.
```{r}
summary(df$povertypercent)
hist(df$povertypercent, breaks = 30, freq = F)
curve(dnorm(x, mean(df$povertypercent), sd(df$povertypercent)), add = T)
shapiro.test(df$povertypercent)
```
An additional factor `f.povertypercent` is created to discretize the data
according to the quartiles.
```{r}
df$f.povertypercent <- discretize_quartiles(df$povertypercent, "Pov%")
```
### Variable 8 - studypercap
Another continuous ratio variable. This variable has the peculiarity of having a
lot of 0s (median is also 0 so more than half of the counties don't perform
cancer related clinical trials). It is not normally distributed.
```{r}
summary(df$studypercap)
hist(df$studypercap, breaks = 30, freq = F)
curve(dnorm(x, mean(df$studypercap), sd(df$studypercap)), add = T)
shapiro.test(df$studypercap)
```
An additional factor `f.studypercap` is created to discretize the data, this
time groupping the data in only 3 groups: 0, and the two median splits of the
non-zero values.
```{r}
non_zero_studypercap_median <- median(df$studypercap[df$studypercap > 0])
df$f.studypercap <- cut(df$studypercap, breaks = c(-Inf, 0, non_zero_studypercap_median, Inf),
include.lowest = T,
labels=c("NoTrials", "MidTrials", "HighTrials")
)
table(df$f.studypercap)
```
### Variable 9 - binnedinc
After having converted it from a string representation of the bin into a numeric
variable, analyzing its normality with Shapiro Test, we can safely say it's not
a normally-distributed variable.
```{r}
summary(df$binnedinc)
hist(df$binnedinc, breaks = 30, freq = F)
curve(dnorm(x, mean(df$binnedinc), sd(df$binnedinc)), add = T)
shapiro.test(df$binnedinc)
```
No further discretisation is needed for this variable, as it is already
categorised.
### Variable 10 - medianage
After having cleaned it, running it through the Shapiro test shows that it is
most likely not normally distributed, altough the histogram shows a closely
bell-shaped curve.
```{r}
summary(df$medianage)
hist(df$medianage, breaks = 30, freq = F)
curve(dnorm(x, mean(df$medianage), sd(df$medianage)), add = T)
shapiro.test(df$medianage)
```
An additional factor `f.medianage` is created to discretize the data according
to the quartiles.
```{r}
df$f.medianage <- discretize_quartiles(df$medianage, "Age")
```
### Variable 11 - medianagemale
Very similar to the previous variable, this is a continuous interval variable,
but with no apparent erroneous values. It is most likely not normally
distributed, according to the Shapiro test, but, as with the previous variable,
the histogram shows a closely bell-shaped curve.
The summary shows that male median age is slightly lower than median age (we can
assume that it will also be lower than the female median age).
```{r}
summary(df$medianagemale)
hist(df$medianagemale, breaks = 30, freq = F)
curve(dnorm(x, mean(df$medianagemale), sd(df$medianagemale)), add = T)
shapiro.test(df$medianagemale)
```
An additional factor `f.medianagemale` is created to discretize the data
according to the quartiles.
```{r}
df$f.medianagemale <- discretize_quartiles(df$medianagemale, "AgeMale")
```
### Variable 12 - medianagefemale
Repeating the analysis of the previous two variables, it is not normally
distributed according to the Shapiro test, but the histogram, again, shows a
closely bell-shaped curve.
As expected, the female median age is slightly higher than the median age, as
well as the male median age.
```{r}
summary(df$medianagefemale)
hist(df$medianagefemale, breaks = 30, freq = F)
curve(dnorm(x, mean(df$medianagefemale), sd(df$medianagefemale)), add = T)
shapiro.test(df$medianagefemale)
```
An additional factor `f.medianagefemale` is created to discretize the data
according to the quartiles.
```{r}
df$f.medianagefemale <- discretize_quartiles(df$medianagefemale, "AgeFemale")
```
#### A small addendum on the median age variables
Leaving correlation analysis for later, let's check whether one can assume that
the expected value of the median age of a population is the same for male as is
for female populations. We'll use a set of wilcox tests (as we've already
established that the data is not normally distributed) with the null hypothesis
of their means being equal.
```{r}
wilcox.test(df$medianage, df$medianagefemale)
wilcox.test(df$medianage, df$medianagemale)
wilcox.test(df$medianagefemale, df$medianagemale)
```
The p-values are all very low, so we can safely reject the null hypothesis and
assume that the median age of a population is different depending on the gender.
### Variable 13 - geography
This is a string variable that is unique for each row of data. Since it is
unique we could delete it, but it has info on not only the unique county of each
observation, but also on its state. We will take this information and create a
new variable named State that could be beneficial to our analysis. The new
variable is a Nominal variable without missing values. However it has a lot of
levels (50) with a few of them sparsly populated so it's not feasible to convert
it to factor.
```{r}
sample(df$geography, 10)
# Use regex to get the state (everything after the comma and white space):
df$state <- sub(".*,\\s*", "", df$geography)
summary(df$state)
table(df$state)
unique(df$state)
```
### Variable 13 - percentmarried
Another continuous ratio variable not normally distributed.
```{r}
summary(df$percentmarried)
hist(df$percentmarried, breaks = 30, freq = F)
curve(dnorm(x, mean(df$percentmarried), sd(df$percentmarried)), add = T)
shapiro.test(df$percentmarried)
```
An additional factor `f.percentmarried` is created to discretize the data
according to the quartiles.
```{r}
df$f.percentmarried <- discretize_quartiles(df$percentmarried, "Married%")
```
### Variable 14 - pctnohs18_24
Another continuous ratio variable not normally distributed.
```{r}
summary(df$pctnohs18_24)
hist(df$pctnohs18_24, breaks = 30, freq = F)
curve(dnorm(x, mean(df$pctnohs18_24), sd(df$pctnohs18_24)), add = T)
shapiro.test(df$pctnohs18_24)
```
An additional factor `f.pctnohs18_24` is created to discretize the data
according to the quartiles.
```{r}
df$f.pctnohs18_24 <- discretize_quartiles(df$pctnohs18_24, "NoHighsc%")
```
### Variable 15 - pcths18_24
Another continuous ratio variable (related to the previous one) not normally
distributed. There is one really severe outlier with 0 percent of High School
Graduates, Greeley County, Kansas. It also has only 4.8% non High School
Graduates (really low). It seems like those values could be incorrect. For now,
however, we will leave it as is and deal with it later.
```{r}
summary(df$pcths18_24)
hist(df$pcths18_24, breaks = 30, freq = F)
curve(dnorm(x, mean(df$pcths18_24), sd(df$pcths18_24)), add = T)
shapiro.test(df$pcths18_24)
```
An additional factor `f.pcths18_24` is created to discretize the data according
to the quartiles.
```{r}
df$f.pcths18_24 <- discretize_quartiles(df$pcths18_24, "Highsc%")
```
### Variable 16 - pctsomecol18_24
This variable has been removed due to having too many missing values, so
analyzing it is left outside the scope for this project.
### Variable 17 - pcths25_over
Another continuous ratio variable not normally distributed.
```{r}
summary(df$pcths25_over)
hist(df$pcths25_over, breaks = 30, freq = F)
curve(dnorm(x, mean(df$pcths25_over), sd(df$pcths25_over)), add = T)
shapiro.test(df$pcths25_over)
```
An additional factor `f.pcths25_over` is created to discretize the data according
to the quartiles.
```{r}
df$f.pcths25_over <- discretize_quartiles(df$pcths25_over, "25Highsc%")
```
### Variable 18 - pctbachdeg25_over
Another continuous ratio variable (related to the previous one) not normally
distributed.
```{r}
summary(df$pctbachdeg25_over)
hist(df$pctbachdeg25_over, breaks = 30, freq = F)
curve(dnorm(x, mean(df$pctbachdeg25_over), sd(df$pctbachdeg25_over)), add = T)
shapiro.test(df$pctbachdeg25_over)
```
An additional factor `f.pctbachdeg25_over` is created to discretize the data
according to the quartiles.
```{r}
df$f.pctbachdeg25_over <- discretize_quartiles(df$pctbachdeg25_over, "Bach%")
```
### Variable 19 - pctemployed16_over
Another continuous ratio variable not normally distributed.
```{r}
summary(df$pctemployed16_over)
hist(df$pctemployed16_over, breaks = 30, freq = F)
shapiro.test(df$pctemployed16_over)
```
An additional factor `f.pctemployed16_over` is created to discretize the data
according to the quartiles.
```{r}
df$f.pctemployed16_over <- discretize_quartiles(df$pctemployed16_over, "Employ%")
```
### Variable 20 - pctunemployed16_over
One might assume that this variable is 100 minus the previous variable, but
looking at some observations this is proven to not be. It is a continuous ratio
variable not normally distributed.
```{r}
summary(df$pctunemployed16_over)
hist(df$pctunemployed16_over, breaks = 30, freq = F)
curve(dnorm(x, mean(df$pctunemployed16_over), sd(df$pctunemployed16_over)), add = T)
shapiro.test(df$pctunemployed16_over)
```
An additional factor `f.pctunemployed16_over` is created to discretize the data
according to the quartiles.
```{r}
df$f.pctunemployed16_over <- discretize_quartiles(df$pctunemployed16_over, "Unemploy%")
```
### Variable 21 - pctprivatecoverage
Another continuous ratio variable not normally distributed.
```{r}
summary(df$pctprivatecoverage)
hist(df$pctprivatecoverage, breaks = 30, freq = F)
curve(dnorm(x, mean(df$pctprivatecoverage), sd(df$pctprivatecoverage)), add = T)
shapiro.test(df$pctprivatecoverage)
```
An additional factor `f.pctprivatecoverage` is created to discretize the data
according to the quartiles.
```{r}
df$f.pctprivatecoverage <- discretize_quartiles(df$pctprivatecoverage, "Private%")
```
### Variable 22 - pctprivatecoveragealone
This is a continuous ratio variable very closely related with the previous
variable. In the data quality section, this variable was shown to have a high
amount of missing data, but it was imputed nontheless. However, it has a 0.93
correlation with variable `pctprivatecoverage`, which is high enough to consider
removing it for being redundant.
```{r}
summary(df$pctprivatecoveragealone)
cor.test(df$pctprivatecoverage, df$pctprivatecoveragealone)
df <- subset(df, select = -pctprivatecoveragealone)
```
### Variable 23 - pctempprivcoverage
Another continuous ratio variable normally distributed (with a 99% confidence
level for the shapiro test).
```{r}
summary(df$pctempprivcoverage)
hist(df$pctempprivcoverage, breaks = 30, freq = F)
curve(dnorm(x, mean(df$pctempprivcoverage), sd(df$pctempprivcoverage)), add = T)
shapiro.test(df$pctempprivcoverage)
```
An additional factor `f.pctempprivcoverage` is created to discretize the data
according to the quartiles.
```{r}
df$f.pctempprivcoverage <- discretize_quartiles(df$pctempprivcoverage, "EmployeeHealth%")
```
### Variable 24 - pctpubliccoverage
Another continuous ratio variable normally distributed, this time with a very
high p-value for the shapiro test.
```{r}
summary(df$pctpubliccoverage)
hist(df$pctpubliccoverage, breaks = 30, freq = F)
curve(dnorm(x, mean(df$pctpubliccoverage), sd(df$pctpubliccoverage)), add = T)
shapiro.test(df$pctpubliccoverage)
```
An additional factor `f.pctpubliccoverage` is created to discretize the data
according to the quartiles.
```{r}
df$f.pctpubliccoverage <- discretize_quartiles(df$pctpubliccoverage, "GovHealth%")
```
### Variable 25 - pctpubliccoveragealone
Another continuous ratio variable related to the previous variable with a 0.87
correlation. It is not normally distributed.
```{r}
summary(df$pctpubliccoveragealone)
cor.test(df$pctpubliccoverage, df$pctpubliccoveragealone)
hist(df$pctpubliccoveragealone, breaks = 30, freq = F)
curve(dnorm(x, mean(df$pctpubliccoveragealone), sd(df$pctpubliccoveragealone)), add = T)
shapiro.test(df$pctpubliccoveragealone)
```
An additional factor `f.pctpubliccoveragealone` is created to discretize the data
according to the quartiles.
```{r}
df$f.pctpubliccoveragealone <- discretize_quartiles(df$pctpubliccoveragealone, "GovHealthAlone%")
```
### Variable 25 - pctwhite
Another continuous ratio variable clearly not normally distributed.
```{r}
summary(df$pctwhite)
hist(df$pctwhite, breaks = 30, freq = F)
curve(dnorm(x, mean(df$pctwhite), sd(df$pctwhite)), add = T)
shapiro.test(df$pctwhite)
```
An additional factor `f.pctwhite` is created to discretize the data according
to the quartiles.
```{r}
df$f.pctwhite <- discretize_quartiles(df$pctwhite, "White%")
```
### Variable 26 - pctblack
This one is really similar to the previous variable, with a correlation of
-0.84. It is another continuous ratio variable clearly not normally distributed.
```{r}
summary(df$pctblack)
cor.test(df$pctwhite, df$pctblack)
hist(df$pctblack, breaks = 30, freq = F)
curve(dnorm(x, mean(df$pctblack), sd(df$pctblack)), add = T)
shapiro.test(df$pctblack)
```
An additional factor `f.pctblack` is created to discretize the data according
to the quartiles.
```{r}
df$f.pctblack <- discretize_quartiles(df$pctblack, "Black%")
```
### Variable 27 - pctasian
Also related to the previous 2 variables. It is a continuous ratio variable
clearly not normally distributed. Looking at the boxplot, there are some points
with a high asian population percentage (probably those from asian ghetto
counties), but none of them higher than 100%.
```{r}
summary(df$pctasian)
boxplot(df$pctasian, horizontal=T)
hist(df$pctasian, breaks = 30, freq = F)
curve(dnorm(x, mean(df$pctasian), sd(df$pctasian)), add = T)
shapiro.test(df$pctasian)
```
An additional factor `f.pctasian` is created to discretize the data according
to the quartiles.
```{r}
df$f.pctasian <- discretize_quartiles(df$pctasian, "Asian%")
```
### Variable 28 - pctotherrace
This variable should be 100 minus the sum of the three previous variables but
looking at a sample of observations it is clearly not, and also if we check for
multicollinearity using VIF, since the values are lower than 5 we can use the
rule of thumb to say that there is not a severe multicollinearity so we will
keep the variable for now (if it was always equal to 100 we would erase it since
it wouldn't add any new info).
The variable is a continuous ratio variable clearly not normally distributed.
```{r}
summary(df$pctotherrace)
model <- lm(pctotherrace ~ pctwhite + pctblack + pctasian, data=df)
vif(model)
hist(df$pctotherrace, breaks = 30, freq = F)
curve(dnorm(x, mean(df$pctotherrace), sd(df$pctotherrace)), add = T)
shapiro.test(df$pctotherrace)
```
An additional factor `f.pctotherrace` is created to discretize the data according
to the quartiles.
```{r}
df$f.pctotherrace <- discretize_quartiles(df$pctotherrace, "OtherRace%")
```
#### County race clustering
Having discretized the previous race-related variables, we'll define a new
factor variable called `f.race` which will probably come in handy in future
analysis. This variable will have 4 levels: "White", "Black", "Asian" and
"Other", which will be decided based on the maximum value of the 4 columns.
```{r}
getRace <- function (row) {
races = row[c("pctwhite", "pctblack", "pctasian", "pctotherrace")]
max_race = which.max(races)
return(c("White", "Black", "Asian", "Other")[max_race])
}
df$f.race <- as.factor(apply(df, 1, getRace))
table(df$f.race)
```
As expected, the majority of the counties are predominantly white, followed by
those with a black majority. The number of counties with an asian majority is
negligible, and there are no counties with an "other" majority.