CancerMortality.Rmd

---
output:
  pdf_document: default
  html_document: default
  md_document:
    variant: markdown_github
editor_options:
  chunk_output_type: console
---

<!-- README.md is generated from CancerMortality.Rmd. Please edit that file. --> 

```{r opts, echo = FALSE}
knitr::opts_chunk$set(
  fig.path = "images/"
)
```

# Data Preparation

```{r, echo=FALSE, message=FALSE, warning=FALSE}
library(crayon)
library(car)
library(FactoMineR)
library(chemometrics)
library(corrplot)
library(RColorBrewer)
library(PerformanceAnalytics)
library(mice)
library(dplyr)
library(readr)
library(stringr)
library(MASS)
library(lmtest)
library(sandwich)
library(cluster)
```

First we import the data and save it as the variable "df" for future 
modifications. 

```{r}	
par(mfrow=c(1,1))
df <- read_csv("data/train.csv")
```

Here's a breakdown of all the variables in the dataset.

**Variable Name**       | **Num/Fac** | **Description**
 ---------------------- | ----------- | ---------------------- 
avganncount             | N           | Mean number of reported cases of cancer diagnosed annually (2010-2015)
avgdeathsperyear        | N           | Mean number of reported mortalities due to cancer 
target_deathrate        | N           | Response variable. Mean per capita (100,000) cancer mortalities 
incidencerate           | N           | Mean per capita (100,000) cancer diagnoses 
medincome               | N           | Median income per county 
popest2015              | N           | Population of county 
povertypercent          | N           | Percent of population in poverty 
studypercap             | N           | Per capita number of cancer-related clinical trials per county 
binnedinc               | F           | Median income per capita binned by decile 
medianage               | N           | Median age of county residents 
medianagemale           | N           | Median age of male county residents 
medianagefemale         | N           | Median age of female county residents 
geography               | F           | County name 
percentmarried          | N           | Percent of county residents who are married 
pctnohs18_24            | N           | Percent of county residents ages 18-24 highest education attained: less than high school 
pcths18_24              | N           | Percent of county residents ages 18-24 highest education attained: high school diploma 
pctsomecol18_24         | N           | Percent of county residents ages 18-24 highest education attained: some college 
pcths25_over            | N           | Percent of county residents ages 25 and over highest education attained: high school diploma 
pctbachdeg25_over       | N           | Percent of county residents ages 25 and over highest education attained: bachelor's degree 
pctemployed16_over      | N           | Percent of county residents ages 16 and over employed 
pctunemployed16_over    | N           | Percent of county residents ages 16 and over unemployed 
pctprivatecoverage      | N           | Percent of county residents with private health coverage 
pctprivatecoveragealone | N           | Percent of county residents with private health coverage alone (no public assistance) 
pctempprivcoverage      | N           | Percent of county residents with employee-provided private health coverage 
pctpubliccoverage       | N           | Percent of county residents with government-provided health coverage 
pctpubliccoveragealone  | N           | Percent of county residents with government-provided health coverage alone 
pctwhite                | N           | Percent of county residents who identify as White 
pctblack                | N           | Percent of county residents who identify as Black 
pctasian                | N           | Percent of county residents who identify as Asian 
pctotherrace            | N           | Percent of county residents who identify in a category which is not White, Black, or Asian 
pctmarriedhouseholds    | N           | Percent of married households 
birthrate               | N           | Number of live births relative to number of women in county


## Variable Types

All variables but `geography` should be numeric. Only one variable requires
any change: `binnedinc` is a string variable right now, but two things can be
created from it: a factor variable (`f.binnedinc`), and a numeric variable 
holding the midpoint of each bin, which we'll conveniently call `binnedinc`.

```{r}
df$f.binnedinc <- as.factor(df$binnedinc)
# Use regex to remove the [,],( and ) from the rows:
inc.midpoints.text <- gsub("[\\[\\]()]", "", df$binnedinc, perl = T)
# Separate them into two numbers
inc.midpoints.text.sep <- strsplit(inc.midpoints.text, ",")
# Convert them to numbers and apply a mean between them to find the midpoint
df$binnedinc <- sapply(inc.midpoints.text.sep, function(x) mean(as.numeric(x)))
```
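
As a quick illustration of the parsing steps above, the following sketch applies
them to two made-up bin labels. The label format and the values are assumptions
for the example, not rows taken from the data set.

```r
# Hypothetical bin labels in the interval notation the regex expects
example_bins <- c("(45201, 48021.6]", "[22640, 34218.1]")

# Strip brackets and parentheses, then split each label on the comma
example_clean <- gsub("[\\[\\]()]", "", example_bins, perl = TRUE)
example_parts <- strsplit(example_clean, ",")

# The midpoint of a bin is the mean of its two bounds
example_midpoints <- sapply(example_parts, function(x) mean(as.numeric(x)))
example_midpoints
```

Note that `as.numeric` tolerates the leading space left after the comma, so no
extra trimming step is needed.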

Note that `geography`, although non-numeric, is not a factor variable.
This can be shown by looking at the number of unique values in the column.

```{r}
nrow(df); length(unique(df$geography))
```

Because the number of unique values is the same as the number of rows, we can
safely assume that `geography` is a unique identifier for each row. It won't be
removed yet, because, as shown in a later section, a factor variable can be
derived from it.

## Duplicated Data Points

Before proceeding with the analysis, we should check if there are any duplicated
rows in the data set. If there are, we shall remove them.

```{r}
duplicated_row_count <- sum(duplicated(df))
if (duplicated_row_count > 0) {
    print(sprintf("There are %d duplicated rows.", duplicated_row_count))
    df <- unique(df)
}
```

No duplicated rows were found in the data set.

## Missing data

Glancing over the data set, one can see that there are some missing values:

```{r}
for (colname in colnames(df)) {
    na.count <- sum(is.na(df[[colname]]))
    if (na.count > 0) {
        cat(sprintf("%s has %s\n", colname, red(sprintf("%d N/As", na.count))))
    }
}
```

* Column `pctsomecol18_24` has 1376 N/As, which accounts for more than 75% of
  the data. With so few observed values it would most likely provide little to
  no meaningful information, so it is removed from the study.
* Column `pctemployed16_over` has 82 N/As, which is a manageable amount. They
  can easily be imputed using the MICE method.
* Column `pctprivatecoveragealone` has 356 N/As, which accounts for less than
  20% of the rows. This is a small enough amount to be imputed using the MICE 
  method. Nonetheless, this column will be removed from the study later on, for
  reasons explained in the exploratory data analysis section.
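
The percentages quoted above can be computed directly rather than by hand. The
following is a minimal sketch on a made-up data frame (`toy` and its columns are
hypothetical stand-ins for the real data):

```r
# Toy data frame standing in for the real one (values are made up)
toy <- data.frame(
  a = c(1, NA, 3, NA, NA),
  b = c(1, 2, NA, 4, 5),
  c = 1:5
)

# Percentage of missing values per column
na_share <- round(100 * colMeans(is.na(toy)), 1)

# Keep only the columns that actually have missing values
na_share[na_share > 0]
```

Here `a` is 60% missing and `b` is 20% missing, so under the reasoning above `a`
would be a candidate for removal and `b` a candidate for imputation.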

```{r}
# Drop the column with too many missing values
df <- subset(df, select = -c(pctsomecol18_24))

# Impute missing values
res.mice <- mice(df)
complete_df = complete(res.mice, action = 1)
```

Before substituting the missing values, let's check the deciles of the variables
to see if the imputation makes sense.

```{r}
quantile(df$pctemployed16_over, na.rm = TRUE, probs = seq(0, 1, 0.1))
quantile(df$pctprivatecoveragealone, na.rm = TRUE, probs = seq(0, 1, 0.1))
```

Now let's substitute the missing values in the original data set and check that
the deciles are still roughly the same. 

```{r}
df$pctemployed16_over <- complete_df$pctemployed16_over
df$pctprivatecoveragealone <- complete_df$pctprivatecoveragealone

quantile(df$pctemployed16_over, na.rm = TRUE, probs = seq(0, 1, 0.1))
quantile(df$pctprivatecoveragealone, na.rm = TRUE, probs = seq(0, 1, 0.1))
```

They are.
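
The visual comparison of deciles can also be condensed into a single number. The
sketch below does this on made-up data; to stay self-contained it replaces MICE
with a crude stand-in (random draws from the observed values), which is an
assumption of the example and not the imputation used in this study.

```r
set.seed(42)  # arbitrary seed so the toy example is reproducible

# Toy variable with 10% of its values missing
original <- rnorm(500, mean = 55, sd = 8)
original[sample(500, 50)] <- NA

# Stand-in imputation: draw replacements from the observed values
# (a crude substitute for MICE, used only to keep the example self-contained)
imputed <- original
missing <- is.na(imputed)
imputed[missing] <- sample(original[!missing], sum(missing), replace = TRUE)

# Change in each decile after imputation, and the largest absolute change
probs <- seq(0, 1, 0.1)
decile_shift <- quantile(imputed, probs) - quantile(original, probs, na.rm = TRUE)
max(abs(decile_shift))
```

A small maximum shift relative to the spread of the variable suggests that the
imputation did not distort its distribution.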

## Univariate Outliers

Column-wise, we can count how many univariate outliers each numeric variable 
has:

```{r}
for (colname in colnames(Filter(is.numeric, df))) {
  col = df[[colname]]
  q1 <- quantile(col, 0.25)
  q3 <- quantile(col, 0.75)
  iqr <- q3 - q1

  severe <- list(top = q3 + 3 * iqr, bot = q1 - 3 * iqr)
  mild <- list(top = q3 + 1.5 * iqr, bot = q1 - 1.5 * iqr)

  severe_out <- sum(col > severe$top | col < severe$bot)
  # Mild outliers lie beyond the 1.5 * IQR fences but not beyond the 3 * IQR fences
  mild_out <- sum((col > mild$top & col <= severe$top) | (col < mild$bot & col >= severe$bot))
  if (mild_out > 0 | severe_out > 0) {
    cat(sprintf("Column %s has %d mild outliers and %d severe outliers\n", colname, mild_out, severe_out))
  }
}
```
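
To make the two fences concrete, here is a small worked example on made-up
values. Only the upper fences are shown; the lower ones are analogous.

```r
x <- c(10, 12, 13, 14, 15, 16, 17, 18, 30, 60)  # made-up values

q1 <- unname(quantile(x, 0.25))  # 13.25
q3 <- unname(quantile(x, 0.75))  # 17.75
iqr <- q3 - q1                   # 4.5

mild_top <- q3 + 1.5 * iqr       # 24.5
severe_top <- q3 + 3 * iqr       # 31.25

# Mild: beyond the 1.5 * IQR fence but not beyond the 3 * IQR fence
mild <- x[x > mild_top & x <= severe_top]
# Severe: beyond the 3 * IQR fence
severe <- x[x > severe_top]
mild; severe
```

In this example 30 is a mild outlier and 60 is a severe one.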

Row-wise, we'll count the numeric variables in which each data point is an 
outlier, and create a new object called `univariate_outlier_count`. As a 
gut-driven criterion, we shall consider a row to be an outlier if it is an
outlier in 10 or more variables. Based on this criterion, only 9 counties are.

```{r}
count_outliers <- function(data) {
  # Function to check for outliers based on IQR
  is_outlier <- function(x) {
    Q1 <- quantile(x, 0.25, na.rm = TRUE)
    Q3 <- quantile(x, 0.75, na.rm = TRUE)
    IQR <- Q3 - Q1
    lower_bound <- Q1 - 1.5 * IQR
    upper_bound <- Q3 + 1.5 * IQR
    return(x < lower_bound | x > upper_bound)
  }
  
  # Apply the outlier function to each column and sum the results for each row using dplyr
  data %>%
    mutate(outlier_count = rowSums(sapply(., is_outlier), na.rm = TRUE))
}

univariate_outlier_count <- count_outliers(Filter(is.numeric, df))$outlier_count
df[which(univariate_outlier_count >= 10),]
```

All of them have high percentages of non-white population, both Black and Asian,
a low median age, a high mortality count and a strong bias towards private and
employee-provided health coverage. Of these 9 counties, 7 are wealthy (low
poverty percentage) and 2 have a large poor population (over 20%).

Outliers can sometimes provide valuable information, so they won't be removed
from the data set just yet.

### The case of medianage

`medianage` is a continuous variable which contains some data points that make 
no sense, for instance, median ages over 100. Thankfully, we have data for male
and female median age, which allow us to replace outlier points by a mean of 
male and female age.

```{r}
boxplot(df$medianage, horizontal = TRUE, main = "Median Age") 
out = which(df$medianage > 100)
df$medianage[out] <- (df$medianagemale[out] + df$medianagefemale[out]) / 2
```

As the following boxplot shows, the meaninglessly high values have taken a more
reasonable value.

```{r}
boxplot(df$medianage, horizontal = TRUE, main = "Median Age") 
```
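
The replacement rule can be checked on a tiny made-up example (the data frame
`toy_age` and its values are hypothetical):

```r
toy_age <- data.frame(
  medianage       = c(35.2, 420.0, 41.0),
  medianagemale   = c(34.0, 38.0, 40.0),
  medianagefemale = c(36.0, 42.0, 42.0)
)

# Rows whose overall median age is implausible
implausible <- which(toy_age$medianage > 100)

# Replace them with the mean of the male and female median ages
toy_age$medianage[implausible] <-
  (toy_age$medianagemale[implausible] + toy_age$medianagefemale[implausible]) / 2
toy_age$medianage
```

Only the second row is touched: its value of 420 becomes 40, while the other
rows keep their original values.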

## Multivariate Outliers

We apply `Moutlier` (from the `chemometrics` package) to the numeric variables
in order to find multivariate outliers. The variable `studypercap` has to be
excluded, because otherwise multicollinearity produces a singular matrix in the
intermediate calculations and the method fails. An extremely strict significance
level is chosen (0.00005%, i.e. a quantile of 0.9999995) because even then a
considerable share of the sample, 4%, is flagged as multivariate outliers.
Lowering the significance level even further barely changes the number of
outliers, while raising it makes the number of outliers grow too much (10%
outliers at a 0.1% significance level).

```{r}
numeric.df <- Filter(is.numeric, df)
numeric.df <- numeric.df[, !colnames(numeric.df) %in% c("studypercap")]

res.out_95 <- Moutlier(numeric.df, quantile = 0.95, plot=F)
multi_outliers_95 = which((res.out_95$md > res.out_95$cutoff)&(res.out_95$rd > res.out_95$cutoff))
length(multi_outliers_95)

res.out <- Moutlier(numeric.df, quantile = 0.9999995, plot=F)
multi_outliers = which((res.out$md > res.out$cutoff)&(res.out$rd > res.out$cutoff))
length(multi_outliers)

par(mfrow = c(1,1))
plot(res.out$rd, res.out$md )
abline(h=res.out$cutoff, col="red")
abline(v=res.out$cutoff, col="red")
```

There are 91 multivariate outliers in the data set (265 if we take a 95% 
quantile).
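
To give some intuition for how the quantile drives the number of flagged points,
here is a minimal base R sketch on made-up data. It uses only the classical
Mahalanobis distance with a chi-squared cutoff, which is the usual convention;
that this matches the internals of `Moutlier` exactly is an assumption, and the
robust distance that `Moutlier` also reports is not reproduced here.

```r
set.seed(1)

# 200 well-behaved points in 3 dimensions plus one planted outlier (row 201)
X <- rbind(matrix(rnorm(200 * 3), ncol = 3), c(10, 10, 10))

# mahalanobis() returns squared distances, so the cutoff is a chi-squared
# quantile with as many degrees of freedom as there are variables
md2 <- mahalanobis(X, center = colMeans(X), cov = cov(X))

cutoff_95 <- qchisq(0.95, df = ncol(X))
cutoff_strict <- qchisq(0.9999995, df = ncol(X))

flagged_95 <- which(md2 > cutoff_95)
flagged_strict <- which(md2 > cutoff_strict)
length(flagged_95); length(flagged_strict)
```

The stricter quantile raises the cutoff, so every point it flags is also flagged
at 95%, and only the most extreme observations remain.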

# Exploratory Data Analysis

This section will be divided in two parts: single-variable analysis and
multi-variable analysis.

## Single-Variable analysis

This sub-section presents an analysis for each variable of the data set as a 
standalone sample.

We'll be performing lots of discretisation of continuous variables based on 
their quartiles, so let's create a function to do that.

```{r}
discretize_quartiles <- function(column, level_name) {
  res <- cut(column, breaks = quantile(column, probs = seq(0, 1, 0.25)), 
    include.lowest = T,
    labels=c(
      sprintf("Low%s", level_name),
      sprintf("LowMid%s", level_name),
      sprintf("HighMid%s", level_name),
      sprintf("High%s", level_name)
    )
  )
  print(table(res)) # Print the table
  return(res)
}
```
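
The core of the helper is a call to `cut` with quartile breaks. The sketch below
shows that step on made-up values, together with one caveat worth keeping in
mind: if two quartiles coincide, the breaks are not unique and `cut` raises an
error. The vectors `x` and `tied` are hypothetical.

```r
x <- 1:8  # made-up values with distinct quartiles

breaks <- quantile(x, probs = seq(0, 1, 0.25))
f <- cut(x, breaks = breaks, include.lowest = TRUE,
         labels = c("Low", "LowMid", "HighMid", "High"))
table(f)

# With many repeated values (e.g. lots of zeros) some quartiles coincide,
# and cut() would fail because the breaks are not unique
tied <- c(0, 0, 0, 0, 0, 0, 1, 2)
tied_breaks <- quantile(tied, probs = seq(0, 1, 0.25))
any(duplicated(tied_breaks))
```

Each level of `f` receives two observations here. A variable dominated by a
single value needs a different binning strategy.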

### Variable 1 - avganncount

This is a continuous ratio variable. The data does not look normally 
distributed, which is confirmed by the near-null p-value of the Shapiro-Wilk 
normality test. A histogram is used to visualize the data. 

```{r}
summary(df$avganncount)

hist(df$avganncount, breaks = 30, freq = F)
curve(dnorm(x, mean(df$avganncount), sd(df$avganncount)), add = T)

shapiro.test(df$avganncount)
```
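
The same normality check is repeated for most variables below, so it may help to
see how the test behaves on data that is known to be skewed. This is a sketch on
simulated values, not on the data set itself.

```r
set.seed(7)

# Right-skewed toy sample, loosely mimicking a count-like variable
skewed <- rexp(500, rate = 1 / 600)

res <- shapiro.test(skewed)
res$statistic  # W: values closer to 1 are closer to normal
res$p.value    # very small for strongly skewed data

# shapiro.test() only accepts between 3 and 5000 observations
length(skewed) <= 5000
```

With samples of this size the test is very sensitive, so a small p-value alone
does not say how far from normal the data is. The histogram remains a useful
complement.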

An additional factor `f.avganncount` is created to discretize the data according
to the quartiles.

```{r}
df$f.avganncount <- discretize_quartiles(df$avganncount, "CaseCount")
```

### Variable 2 - avgdeathsperyear

This is also a continuous ratio variable similar to variable 1. The data does 
not look normally distributed, which is confirmed by the near-null p-value of 
the Shapiro-Wilk normality test. Again a histogram is used to visualize the data.


```{r}
summary(df$avgdeathsperyear)

hist(df$avgdeathsperyear, breaks = 30, freq = F)
curve(dnorm(x, mean(df$avgdeathsperyear), sd(df$avgdeathsperyear)), add = T)

shapiro.test(df$avgdeathsperyear)
```

An additional factor `f.avgdeathsperyear` is created to discretize the data
according to the quartiles.

```{r}
df$f.avgdeathsperyear <- discretize_quartiles(df$avgdeathsperyear, "MortCount")
```

### Variable 3 - target_deathrate

This is the response variable. It is also a continuous ratio variable similar 
to the previous variables. The histogram looks roughly bell-shaped, but the 
data is not normally distributed; this will be further discussed in the next
section. 

```{r}
summary(df$target_deathrate)

hist(df$target_deathrate, breaks = 30, freq = F)
curve(dnorm(x, mean(df$target_deathrate), sd(df$target_deathrate)), add = T)

shapiro.test(df$target_deathrate)
```

An additional factor `f.target_deathrate` is created to discretize the data
according to the quartiles.

```{r}
df$f.target_deathrate <- discretize_quartiles(df$target_deathrate, "DeathRate")
```

### Variable 4 - incidencerate

We have another continuous ratio variable similar to the previous variables. It 
is not normally distributed according to the Shapiro test.

```{r}
summary(df$incidencerate)

hist(df$incidencerate, breaks = 30, freq = F)
curve(dnorm(x, mean(df$incidencerate), sd(df$incidencerate)), add = T)

shapiro.test(df$incidencerate)
```

An additional factor `f.incidencerate` is created to discretize the data
according to the quartiles.

```{r}
df$f.incidencerate <- discretize_quartiles(df$incidencerate, "DiagnPerCap")
```

### Variable 5 - medincome

Very similar to all the previous variables we have a continuous ratio variable 
not normally distributed.

```{r}
summary(df$medincome)

hist(df$medincome, breaks = 30, freq = F)
curve(dnorm(x, mean(df$medincome), sd(df$medincome)), add = T)

shapiro.test(df$medincome)
```

An additional factor `f.medincome` is created to discretize the data according
to the quartiles.

```{r}
df$f.medincome <- discretize_quartiles(df$medincome, "MedianInc")
```

### Variable 6 - popest2015

Another continuous ratio variable not normally distributed.

```{r}
summary(df$popest2015)

hist(df$popest2015, breaks = 30, freq = F)
curve(dnorm(x, mean(df$popest2015), sd(df$popest2015)), add = T)

shapiro.test(df$popest2015)
```

An additional factor `f.popest2015` is created to discretize the data according
to the quartiles.

```{r}
df$f.popest2015 <- discretize_quartiles(df$popest2015, "MidPop")
```

### Variable 7 - povertypercent

Another continuous ratio variable not normally distributed. 

```{r}
summary(df$povertypercent)

hist(df$povertypercent, breaks = 30, freq = F)
curve(dnorm(x, mean(df$povertypercent), sd(df$povertypercent)), add = T)

shapiro.test(df$povertypercent)
```

An additional factor `f.povertypercent` is created to discretize the data
according to the quartiles.

```{r}
df$f.povertypercent <- discretize_quartiles(df$povertypercent, "Pov%")
```

### Variable 8 - studypercap

Another continuous ratio variable. This variable has the peculiarity of having a 
lot of 0s (median is also 0 so more than half of the counties don't perform 
cancer related clinical trials). It is not normally distributed.

```{r}
summary(df$studypercap)

hist(df$studypercap, breaks = 30, freq = F)
curve(dnorm(x, mean(df$studypercap), sd(df$studypercap)), add = T)

shapiro.test(df$studypercap)
```

An additional factor `f.studypercap` is created to discretize the data, this 
time grouping the data in only 3 groups: the zeros, and the non-zero values
split at their median.

```{r}
non_zero_studypercap_median <- median(df$studypercap[df$studypercap > 0])

df$f.studypercap <- cut(df$studypercap, breaks = c(-Inf, 0, non_zero_studypercap_median, Inf),
    include.lowest = T,
    labels=c("NoTrials", "MidTrials", "HighTrials")
  )
table(df$f.studypercap)
```
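
A small example on made-up values shows where each observation ends up (the
vector `trials` is hypothetical):

```r
trials <- c(0, 0, 0, 10, 20, 30, 40)  # made-up per-capita trial counts

# Median of the non-zero values only
non_zero_median <- median(trials[trials > 0])  # 25

# Intervals are (-Inf, 0], (0, 25] and (25, Inf)
f_trials <- cut(trials,
  breaks = c(-Inf, 0, non_zero_median, Inf),
  labels = c("NoTrials", "MidTrials", "HighTrials")
)
table(f_trials)
```

The three zeros form their own level, and the non-zero values are split evenly
around their median.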

### Variable 9 - binnedinc

After converting it from a string representation of the bin into a numeric
variable, the Shapiro-Wilk test shows that it is not a normally distributed
variable.

```{r}
summary(df$binnedinc)

hist(df$binnedinc, breaks = 30, freq = F)
curve(dnorm(x, mean(df$binnedinc), sd(df$binnedinc)), add = T)

shapiro.test(df$binnedinc)
```

No further discretisation is needed for this variable, as it is already
categorised.

### Variable 10 - medianage

After having cleaned it, running it through the Shapiro test shows that it is
most likely not normally distributed, although the histogram shows a closely 
bell-shaped curve.

```{r}
summary(df$medianage)

hist(df$medianage, breaks = 30, freq = F)
curve(dnorm(x, mean(df$medianage), sd(df$medianage)), add = T)

shapiro.test(df$medianage)
```

An additional factor `f.medianage` is created to discretize the data according
to the quartiles.

```{r}
df$f.medianage <- discretize_quartiles(df$medianage, "Age")
```

### Variable 11 - medianagemale

Very similar to the previous variable, this is a continuous interval variable, 
but with no apparent erroneous values. It is most likely not normally 
distributed, according to the Shapiro test, but, as with the previous variable,
the histogram shows a closely bell-shaped curve.

The summary shows that male median age is slightly lower than median age (we can
assume that it will also be lower than the female median age). 

```{r}
summary(df$medianagemale)

hist(df$medianagemale, breaks = 30, freq = F)
curve(dnorm(x, mean(df$medianagemale), sd(df$medianagemale)), add = T)

shapiro.test(df$medianagemale)
```

An additional factor `f.medianagemale` is created to discretize the data
according to the quartiles.

```{r}
df$f.medianagemale <- discretize_quartiles(df$medianagemale, "AgeMale")
```

### Variable 12 - medianagefemale

Repeating the analysis of the previous two variables, it is not normally 
distributed according to the Shapiro test, but the histogram, again, shows a
closely bell-shaped curve.

As expected, the female median age is slightly higher than both the overall
median age and the male median age.

```{r}
summary(df$medianagefemale)

hist(df$medianagefemale, breaks = 30, freq = F)
curve(dnorm(x, mean(df$medianagefemale), sd(df$medianagefemale)), add = T)

shapiro.test(df$medianagefemale)
```

An additional factor `f.medianagefemale` is created to discretize the data
according to the quartiles.

```{r}
df$f.medianagefemale <- discretize_quartiles(df$medianagefemale, "AgeFemale")
```

#### A small addendum on the median age variables

Leaving correlation analysis for later, let's check whether the median age can
be assumed to be the same for male and female populations. Since we've already
established that the data is not normally distributed, we'll use a set of
Wilcoxon rank-sum tests (`wilcox.test`), whose null hypothesis is that the two
samples come from the same distribution, with no location shift between them.

```{r}
wilcox.test(df$medianage, df$medianagefemale)
wilcox.test(df$medianage, df$medianagemale)
wilcox.test(df$medianagefemale, df$medianagemale)
```

The p-values are all very low, so we can safely reject the null hypothesis and
assume that the median age of a population is different depending on the gender.
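
Since every county contributes both a male and a female value, a paired
(signed-rank) version of the test is a possible alternative. The sketch below
illustrates it on simulated ages; the data and the choice of a paired test are
assumptions of this example, not part of the analysis above.

```r
set.seed(3)

# Made-up county-level median ages: each county has a male and a female value
male <- rnorm(50, mean = 39, sd = 4)
female <- male + rnorm(50, mean = 2.5, sd = 1)

# Paired (signed-rank) test on the within-county differences
paired_res <- wilcox.test(female, male, paired = TRUE)
paired_res$p.value
```

The paired test works on the differences within each county, which removes the
between-county variation from the comparison.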

### Variable 13 - geography

This is a string variable that is unique for each row of data. Since it is 
unique we could delete it, but it holds information not only on the county of 
each observation, but also on its state. We will extract this information into a 
new variable named `state` that could be beneficial to our analysis. The new 
variable is a nominal variable without missing values. However, it has a lot of 
levels (50), a few of them sparsely populated, so it's not feasible to convert
it to a factor. 

```{r}
sample(df$geography, 10)

# Use regex to get the state (everything after the comma and white space):
df$state <- sub(".*,\\s*", "", df$geography)

summary(df$state)

table(df$state)

unique(df$state)
```
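
The regular expression can be tried out on a couple of labels in the
"County, State" format. The two strings below are only examples of that format.

```r
geo <- c("Kitsap County, Washington", "Greeley County, Kansas")  # example labels

# Drop everything up to the last comma and any whitespace after it
state <- sub(".*,\\s*", "", geo)
state

# Number of counties per state
table(state)
```

Because `.*` is greedy, the match runs up to the last comma in the string, so a
county name that itself contains a comma would still yield the state.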

### Variable 13 - percentmarried

Another continuous ratio variable not normally distributed.

```{r}
summary(df$percentmarried)

hist(df$percentmarried, breaks = 30, freq = F)
curve(dnorm(x, mean(df$percentmarried), sd(df$percentmarried)), add = T)

shapiro.test(df$percentmarried)
```

An additional factor `f.percentmarried` is created to discretize the data
according to the quartiles.

```{r}
df$f.percentmarried <- discretize_quartiles(df$percentmarried, "Married%")
```

### Variable 14 - pctnohs18_24

Another continuous ratio variable not normally distributed.

```{r}
summary(df$pctnohs18_24)

hist(df$pctnohs18_24, breaks = 30, freq = F)
curve(dnorm(x, mean(df$pctnohs18_24), sd(df$pctnohs18_24)), add = T)

shapiro.test(df$pctnohs18_24)
```

An additional factor `f.pctnohs18_24` is created to discretize the data
according to the quartiles.

```{r}
df$f.pctnohs18_24 <- discretize_quartiles(df$pctnohs18_24, "NoHighsc%")
```

### Variable 15 - pcths18_24

Another continuous ratio variable (related to the previous one) not normally 
distributed. There is one really severe outlier, Greeley County, Kansas, with 
0% high school graduates. It also has only 4.8% of residents without a high 
school diploma (really low). These values could be incorrect. For now,
however, we will leave them as they are and deal with them later.

```{r}
summary(df$pcths18_24)

hist(df$pcths18_24, breaks = 30, freq = F)
curve(dnorm(x, mean(df$pcths18_24), sd(df$pcths18_24)), add = T)

shapiro.test(df$pcths18_24)
```

An additional factor `f.pcths18_24` is created to discretize the data according
to the quartiles.

```{r}
df$f.pcths18_24 <- discretize_quartiles(df$pcths18_24, "Highsc%")
```

### Variable 16 - pctsomecol18_24

This variable has been removed due to having too many missing values, so 
analyzing it is left outside the scope for this project.

### Variable 17 - pcths25_over

Another continuous ratio variable not normally distributed.

```{r}
summary(df$pcths25_over)

hist(df$pcths25_over, breaks = 30, freq = F)
curve(dnorm(x, mean(df$pcths25_over), sd(df$pcths25_over)), add = T)

shapiro.test(df$pcths25_over)
```

An additional factor `f.pcths25_over` is created to discretize the data according
to the quartiles.

```{r}
df$f.pcths25_over <- discretize_quartiles(df$pcths25_over, "25Highsc%")
```

### Variable 18 - pctbachdeg25_over

Another continuous ratio variable (related to the previous one) not normally 
distributed. 

```{r}
summary(df$pctbachdeg25_over)

hist(df$pctbachdeg25_over, breaks = 30, freq = F)
curve(dnorm(x, mean(df$pctbachdeg25_over), sd(df$pctbachdeg25_over)), add = T)

shapiro.test(df$pctbachdeg25_over)
```

An additional factor `f.pctbachdeg25_over` is created to discretize the data
according to the quartiles.

```{r}
df$f.pctbachdeg25_over <- discretize_quartiles(df$pctbachdeg25_over, "Bach%")
```

### Variable 19 - pctemployed16_over

Another continuous ratio variable not normally distributed. 

```{r}
summary(df$pctemployed16_over)

hist(df$pctemployed16_over, breaks = 30, freq = F)

shapiro.test(df$pctemployed16_over)
```

An additional factor `f.pctemployed16_over` is created to discretize the data
according to the quartiles.

```{r}
df$f.pctemployed16_over <- discretize_quartiles(df$pctemployed16_over, "Employ%")
```

### Variable 20 - pctunemployed16_over

One might assume that this variable is 100 minus the previous variable, but 
looking at some observations shows that this is not the case. It is a continuous
ratio variable not normally distributed.

```{r}
summary(df$pctunemployed16_over)

hist(df$pctunemployed16_over, breaks = 30, freq = F)
curve(dnorm(x, mean(df$pctunemployed16_over), sd(df$pctunemployed16_over)), add = T)

shapiro.test(df$pctunemployed16_over)
```

An additional factor `f.pctunemployed16_over` is created to discretize the data
according to the quartiles.

```{r}
df$f.pctunemployed16_over <- discretize_quartiles(df$pctunemployed16_over, "Unemploy%")
```

### Variable 21 - pctprivatecoverage

Another continuous ratio variable not normally distributed. 

```{r}
summary(df$pctprivatecoverage)

hist(df$pctprivatecoverage, breaks = 30, freq = F)
curve(dnorm(x, mean(df$pctprivatecoverage), sd(df$pctprivatecoverage)), add = T)

shapiro.test(df$pctprivatecoverage)
```

An additional factor `f.pctprivatecoverage` is created to discretize the data
according to the quartiles.

```{r}
df$f.pctprivatecoverage <- discretize_quartiles(df$pctprivatecoverage, "Private%")
```

### Variable 22 - pctprivatecoveragealone

This is a continuous ratio variable very closely related to the previous 
variable. In the data quality section, this variable was shown to have a high 
amount of missing data, but it was imputed nonetheless. However, it has a 0.93
correlation with variable `pctprivatecoverage`, which is high enough to consider
removing it for being redundant.

```{r}
summary(df$pctprivatecoveragealone)

cor.test(df$pctprivatecoverage, df$pctprivatecoveragealone)

df <- subset(df, select = -pctprivatecoveragealone)
```

### Variable 23 - pctempprivcoverage

Another continuous ratio variable, this time compatible with a normal
distribution (the Shapiro-Wilk test does not reject normality at the 1%
significance level). 

```{r}
summary(df$pctempprivcoverage)

hist(df$pctempprivcoverage, breaks = 30, freq = F)
curve(dnorm(x, mean(df$pctempprivcoverage), sd(df$pctempprivcoverage)), add = T)

shapiro.test(df$pctempprivcoverage)
```

An additional factor `f.pctempprivcoverage` is created to discretize the data
according to the quartiles.

```{r}
df$f.pctempprivcoverage <- discretize_quartiles(df$pctempprivcoverage, "EmployeeHealth%")
```

### Variable 24 - pctpubliccoverage

Another continuous ratio variable that appears normally distributed, this time
with a very high p-value for the Shapiro-Wilk test.

```{r}
summary(df$pctpubliccoverage)

hist(df$pctpubliccoverage, breaks = 30, freq = F)
curve(dnorm(x, mean(df$pctpubliccoverage), sd(df$pctpubliccoverage)), add = T)

shapiro.test(df$pctpubliccoverage)
```

An additional factor `f.pctpubliccoverage` is created to discretize the data
according to the quartiles.

```{r}
df$f.pctpubliccoverage <- discretize_quartiles(df$pctpubliccoverage, "GovHealth%")
```

### Variable 25 - pctpubliccoveragealone

Another continuous ratio variable related to the previous variable with a 0.87
correlation. It is not normally distributed. 

```{r}
summary(df$pctpubliccoveragealone)

cor.test(df$pctpubliccoverage, df$pctpubliccoveragealone)

hist(df$pctpubliccoveragealone, breaks = 30, freq = F)
curve(dnorm(x, mean(df$pctpubliccoveragealone), sd(df$pctpubliccoveragealone)), add = T)

shapiro.test(df$pctpubliccoveragealone)
```

An additional factor `f.pctpubliccoveragealone` is created to discretize the data
according to the quartiles.

```{r}
df$f.pctpubliccoveragealone <- discretize_quartiles(df$pctpubliccoveragealone, "GovHealthAlone%")
```

### Variable 25 - pctwhite

Another continuous ratio variable clearly not normally distributed. 

```{r}
summary(df$pctwhite)

hist(df$pctwhite, breaks = 30, freq = F)
curve(dnorm(x, mean(df$pctwhite), sd(df$pctwhite)), add = T)

shapiro.test(df$pctwhite)
```

An additional factor `f.pctwhite` is created to discretize the data according
to the quartiles.

```{r}
df$f.pctwhite <- discretize_quartiles(df$pctwhite, "White%")
```

### Variable 26 - pctblack

This one is closely (and inversely) related to the previous variable, with a
correlation of -0.84. It is another continuous ratio variable clearly not
normally distributed.

```{r}
summary(df$pctblack)

cor.test(df$pctwhite, df$pctblack)

hist(df$pctblack, breaks = 30, freq = F)
curve(dnorm(x, mean(df$pctblack), sd(df$pctblack)), add = T)

shapiro.test(df$pctblack)
```

An additional factor `f.pctblack` is created to discretize the data according
to the quartiles.

```{r}
df$f.pctblack <- discretize_quartiles(df$pctblack, "Black%")
```

### Variable 27 - pctasian

Also related to the previous two variables. It is a continuous ratio variable 
clearly not normally distributed. Looking at the boxplot, there are some
counties with a high Asian population percentage, but none of them exceeds 100%.

```{r}
summary(df$pctasian)
boxplot(df$pctasian, horizontal=T)

hist(df$pctasian, breaks = 30, freq = F)
curve(dnorm(x, mean(df$pctasian), sd(df$pctasian)), add = T)

shapiro.test(df$pctasian)
```

An additional factor `f.pctasian` is created to discretize the data according
to the quartiles.

```{r}
df$f.pctasian <- discretize_quartiles(df$pctasian, "Asian%")
```

### Variable 28 - pctotherrace

One might expect this variable to be 100 minus the sum of the three previous 
variables, but a sample of observations shows that it is not. We also check for 
multicollinearity using VIF: since the values are lower than 5, the usual rule 
of thumb says there is no severe multicollinearity, so we will keep the variable 
for now (if the four percentages always added up to 100 we would drop it, since 
it wouldn't add any new information). 

The variable is a continuous ratio variable clearly not normally distributed.

```{r}
summary(df$pctotherrace)

model <- lm(pctotherrace ~ pctwhite + pctblack + pctasian, data=df)
vif(model)

hist(df$pctotherrace, breaks = 30, freq = F)
curve(dnorm(x, mean(df$pctotherrace), sd(df$pctotherrace)), add = T)

shapiro.test(df$pctotherrace)
```
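
For reference, the VIF of a predictor is 1 / (1 - R²), where R² comes from
regressing that predictor on the remaining ones. The sketch below computes it by
hand on simulated predictors; the helper `vif_by_hand` and the variables `x1`,
`x2` and `x3` are hypothetical and only serve to illustrate the rule of thumb.

```r
set.seed(11)

# Made-up predictors: x2 is strongly related to x1, x3 is independent noise
x1 <- rnorm(200)
x2 <- 0.9 * x1 + rnorm(200, sd = 0.3)
x3 <- rnorm(200)

# VIF of a predictor = 1 / (1 - R^2) from regressing it on the other predictors
vif_by_hand <- function(target, others) {
  r2 <- summary(lm(target ~ ., data = others))$r.squared
  1 / (1 - r2)
}

vif_x1 <- vif_by_hand(x1, data.frame(x2, x3))
vif_x3 <- vif_by_hand(x3, data.frame(x1, x2))
c(x1 = vif_x1, x3 = vif_x3)
```

The predictor that is almost a copy of another one gets a VIF well above 5,
while the independent one stays close to 1.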

An additional factor `f.pctotherrace` is created to discretize the data according
to the quartiles.

```{r}
df$f.pctotherrace <- discretize_quartiles(df$pctotherrace, "OtherRace%")
```

#### County race clustering

Having discretized the previous race-related variables, we'll define a new
factor variable called `f.race` which will probably come in handy in future 
analysis. This variable will have 4 levels: "White", "Black", "Asian" and 
"Other", which will be decided based on the maximum value of the 4 columns.

```{r}
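# Pick the race label whose percentage is the largest in the row
race_cols <- c("pctwhite", "pctblack", "pctasian", "pctotherrace")
getRace <- function (row) {
  max_race = which.max(row)
  return(c("White", "Black", "Asian", "Other")[max_race])
}

# Apply over the numeric race columns only, so the rows are not coerced to text
df$f.race <- as.factor(apply(df[, race_cols], 1, getRace))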
table(df$f.race)
```

As expected, the majority of the counties are predominantly white, followed by
those with a Black majority. The number of counties with an Asian majority is
negligible, and there are no counties with an "other" majority.
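
The same row-wise maximum can be obtained without `apply`, using base R's
`max.col`. This is a sketch on a made-up data frame (`toy_race` is
hypothetical), offered as an alternative rather than the method used above.

```r
toy_race <- data.frame(
  pctwhite     = c(80, 20, 10),
  pctblack     = c(10, 70,  5),
  pctasian     = c( 5,  5, 60),
  pctotherrace = c( 5,  5, 25)
)
race_labels <- c("White", "Black", "Asian", "Other")

# max.col() returns, for each row, the index of the column holding the maximum
dominant <- race_labels[max.col(as.matrix(toy_race), ties.method = "first")]
factor(dominant, levels = race_labels)
```

Setting `ties.method = "first"` makes the result deterministic when two
percentages are equal; the default breaks ties at random.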

### Variable 29 - pctmarriedhouseholds

Another continuous ratio variable not normally distributed.

```{r}
summary(df$pctmarriedhouseholds)

hist(df$pctmarriedhouseholds, breaks = 30, freq = F)
curve(dnorm(x, mean(df$pctmarriedhouseholds), sd(df$pctmarriedhouseholds)), add = T)

shapiro.test(df$pctmarriedhouseholds)
```

An additional factor `f.pctmarriedhouseholds` is created to discretize the data
according to the quartiles.

```{r}
df$f.pctmarriedhouseholds <- discretize_quartiles(df$pctmarriedhouseholds, "Married%")
```

### Variable 30 - birthrate

The last variable is yet another continuous ratio variable not normally 
distributed.

```{r}
summary(df$birthrate)

hist(df$birthrate, breaks = 30, freq = F)
curve(dnorm(x, mean(df$birthrate), sd(df$birthrate)), add = T)

shapiro.test(df$birthrate)
```

An additional factor `f.birthrate` is created to discretize the data according
to the quartiles.

```{r}
df$f.birthrate <- discretize_quartiles(df$birthrate, "Birth%")
```

## Autocorrelation

Before proceeding with the multivariate analysis, let's check for 
autocorrelation in the target variable. We'll use the `acf` function to plot the
correlation of the target variable with itself at different lags.

```{r}
acf(df$target_deathrate, type="correlation", plot=T, main="Autocorrelation of Target Death Rate")
```

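The plot shows a slight positive correlation for lags below 22, although none
of the values exceeds 0.35. Since the rows are counties rather than a time
series, the lags only reflect the order of the rows in the file, so we do not
treat this as a relevant autocorrelation.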
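
To judge whether an autocorrelation value is large, it can be compared with the
approximate 95% band that `acf` draws for white noise, $\pm 1.96/\sqrt{n}$. The
sketch below does this on simulated data (`toy_y` is a hypothetical stand-in,
not the target variable) and counts how many lags fall outside the band.

```{r}
set.seed(1)
# Hypothetical stand-in for a response column: 500 independent draws
toy_y <- rnorm(500, mean = 180, sd = 25)

# With plot = FALSE, acf() only returns the estimates; the first one is lag 0
toy_acf <- acf(toy_y, lag.max = 30, plot = FALSE)
toy_r <- as.numeric(toy_acf$acf)[-1]

# Approximate 95% band for white noise: +/- 1.96 / sqrt(n)
toy_band <- qnorm(0.975) / sqrt(length(toy_y))
sum(abs(toy_r) > toy_band)   # number of lags outside the band
```

For independent data, roughly 5% of the lags are expected to fall outside the
band by chance alone.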

## Profiling

Let us start by profiling the target variable with respect to the others, using
the function `condes` from the `FactoMineR` package.

```{r}
num.df = Filter(is.numeric, df)
res.con = condes(num.df,num.var = which(colnames(num.df) == "target_deathrate"))

correlation_df = as.data.frame(res.con$quanti)
high_correlation_df <- correlation_df[abs(correlation_df$correlation) > 0.35, ]
sorted_correlation_df <- high_correlation_df[order(abs(high_correlation_df$correlation), decreasing = TRUE), ]
sorted_correlation_df
```

Using an arbitrary threshold of 0.35 on the absolute correlation, the variables
most strongly correlated with the target variable are `pctbachdeg25_over`, 
`medincome`, `pctemployed16_over`, `binnedinc` and `pctprivatecoverage` 
(negatively), and `pctpubliccoveragealone`, `povertypercent`, `incidencerate`, 
`pctpubliccoverage`, `pcths25_over` and `pctunemployed16_over` (positively). All
of them have p-values well below 0.01.

## Multicollinearity Analysis

Before proceeding, we must identify those variables that are highly correlated
and either combine or remove them. We do so because high correlations (close to
1 or -1) between two or more predictors indicate potential multicollinearity.

```{r}
cor_matrix <- cor(Filter(is.numeric, df))
corrplot(cor_matrix, method = "circle")
```

Initially, let us combine or eliminate those variables with a correlation above
0.9 or below -0.9.

```{r}
high_corr_vars <- data.frame(l=character(), r=character())

for (i in seq_len(ncol(cor_matrix))) {
  for (j in i:ncol(cor_matrix)) {
    # Skip the diagonal; starting j at i keeps each pair only once
    if (i != j && abs(cor_matrix[i, j]) > 0.9) {
      high_corr_vars <- rbind(high_corr_vars, data.frame(l=colnames(cor_matrix)[i], r=colnames(cor_matrix)[j]))
    }
  }
}

high_corr_vars
```

We can see that the following variables have a high likelihood of 
multicollinearity:

```{r}
unique(c(high_corr_vars$l, high_corr_vars$r))
```
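
The same pairs can be extracted without explicit loops. This is a minimal
sketch on simulated data (`toy_num` and its columns are hypothetical), using
`upper.tri` to keep each pair once and `which(..., arr.ind = TRUE)` to recover
the row and column positions.

```{r}
# Hypothetical toy data: x2 is nearly a copy of x1, x3 is unrelated
set.seed(2)
toy_x1 <- rnorm(200)
toy_num <- data.frame(
  x1 = toy_x1,
  x2 = toy_x1 + rnorm(200, sd = 0.05),
  x3 = rnorm(200)
)
toy_cor <- cor(toy_num)

# The upper triangle excludes the diagonal and lists every pair only once
toy_idx <- which(abs(toy_cor) > 0.9 & upper.tri(toy_cor), arr.ind = TRUE)
toy_pairs <- data.frame(
  l = unname(rownames(toy_cor)[toy_idx[, "row"]]),
  r = unname(colnames(toy_cor)[toy_idx[, "col"]]),
  correlation = toy_cor[toy_idx]
)
toy_pairs
```

Keeping the correlation value next to each pair makes it easier to decide which
pairs to treat first.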

Let us treat each high correlation pair individually:

* `binnedinc` is highly correlated with `medincome`, as expected. We won't
  remove either of them, as one is presumably a binned version of the other.

* `medianage`, `medianagemale` and `medianagefemale` are highly correlated with
  each other. They are almost the same variable. Also, all three of them are
  only weakly correlated with the target variable. We will remove the two
  gender-specific variables.

```{r}
par(mfrow = c(1, 2))
plot(df$medianage, df$medianagemale)
plot(df$medianage, df$medianagefemale)
```

```{r}
par(mfrow = c(1, 3))
plot(df$medianage, df$target_deathrate)
plot(df$medianagemale, df$target_deathrate)
plot(df$medianagefemale, df$target_deathrate)
```

```{r}
df <- subset(df, select = -c(medianagemale, medianagefemale))
```

* `avganncount`, `avgdeathsperyear` and `popest2015` are highly correlated with
  each other. This is expected, as the number of cases and deaths is directly
  related to the population. However, we won't be removing any of them, as there
  might be some information in the ratio of cases to population.

```{r}	
par(mfrow = c(1, 2))
plot(df$popest2015, df$avgdeathsperyear)
plot(df$popest2015, df$avganncount)
```

```{r, echo=FALSE}
# Roll back to the original layout
par(mfrow = c(1, 1))
```

## Combining Variables

Finally, we will combine those variables that are semantically related. For
example, the percentage of people with a high school education in the 18-24 age
range and the percentage of people with a high school education in the 25 and
over age range are added into a single variable representing high school
education. Since the two percentages refer to different age groups, the sum is
an index rather than a true percentage of the population.

```{r}
# Education-related variables
df$pcths <- df$pcths18_24 + df$pcths25_over
df$pctbach <- df$pctbachdeg18_24 + df$pctbachdeg25_over

# Race and Ethnicity-related Variables
df$racindex <- df$pctblack + df$pctasian + df$pctotherrace

# Public Coverage and Poverty-related Variables
df$social_welfare <- df$pctpubliccoverage + df$povertypercent

new_cor_matrix <- cor(Filter(is.numeric, df))
corrplot(new_cor_matrix, method = "circle")
```

# Model Fitting

Let's start by exploring a linear model with all the variables, with the 
intention of getting a first glance at the most significant predictors.

```{r}
base_df = Filter(is.numeric, df)
base_df$f.race = df$f.race
model <- lm(target_deathrate ~ ., data = base_df)
summary(model)
```

According to this first model, the following variables seem to be very 
significant (p-value < 0.01):

```{r}
coefs <- summary(model)$coefficients
# drop = FALSE keeps the matrix shape even if only one coefficient qualifies
significant_vars <- coefs[coefs[,'Pr(>|t|)'] < 0.01, , drop = FALSE]
significant_vars
```

## Analyzing the behaviour of the main predictors

The boxplots and tests below relate the target death rate to the main
socioeconomic predictors. Marriage rate, incidence rate and median income all
show strong associations: higher marriage and income levels are linked to lower
death rates, while higher incidence rates go with higher death rates. These
results are supported by both pairwise Wilcoxon tests and one-way ANOVA
(`oneway.test`, which by default does not assume equal variances).

```{r}
# Boxplot for most relevant predictors with appropriate labels
boxplot(target_deathrate ~ f.incidencerate, data = df, main = "Death Rate vs. Incidence Rate", 
        xlab = "Incidence Rate (Factor)", ylab = "Death Rate")
boxplot(target_deathrate ~ f.medincome, data = df, main = "Death Rate vs. Median Income", 
        xlab = "Median Income (Factor)", ylab = "Death Rate")
boxplot(target_deathrate ~ f.percentmarried, data = df, main = "Death Rate vs. Percent Married", 
        xlab = "Percent Married (Factor)", ylab = "Death Rate")
# pcths is continuous, so a scatter plot is used instead of a boxplot
plot(target_deathrate ~ pcths, data = df, main = "Death Rate vs. Percent High School", 
     xlab = "Percent High School", ylab = "Death Rate")
boxplot(target_deathrate ~ f.povertypercent, data = df, main = "Death Rate vs. Poverty Percent", 
        xlab = "Poverty Percent (Factor)", ylab = "Death Rate")
boxplot(target_deathrate ~ f.pctpubliccoverage, data = df, main = "Death Rate vs. Public Coverage", 
        xlab = "Public Coverage (Factor)", ylab = "Death Rate")
boxplot(target_deathrate ~ f.pctpubliccoveragealone, data = df, main = "Death Rate vs. Public Coverage Alone", 
        xlab = "Public Coverage Alone (Factor)", ylab = "Death Rate")

# Pairwise tests and ANOVA for corresponding variables
# Percent Married
pairwise.wilcox.test(df$target_deathrate, df$f.percentmarried, p.adjust.method = "bonferroni")
oneway.test(target_deathrate ~ f.percentmarried, data = df)

# Incidence Rate
pairwise.wilcox.test(df$target_deathrate, df$f.incidencerate, p.adjust.method = "bonferroni")
oneway.test(target_deathrate ~ f.incidencerate, data = df)

# Median Income
pairwise.wilcox.test(df$target_deathrate, df$f.medincome, p.adjust.method = "bonferroni")
oneway.test(target_deathrate ~ f.medincome, data = df)

# Poverty Percent
pairwise.wilcox.test(df$target_deathrate, df$f.povertypercent, p.adjust.method = "bonferroni")
oneway.test(target_deathrate ~ f.povertypercent, data = df)

# Public Coverage
pairwise.wilcox.test(df$target_deathrate, df$f.pctpubliccoverage, p.adjust.method = "bonferroni")
oneway.test(target_deathrate ~ f.pctpubliccoverage, data = df)

# Public Coverage Alone
pairwise.wilcox.test(df$target_deathrate, df$f.pctpubliccoveragealone, p.adjust.method = "bonferroni")
oneway.test(target_deathrate ~ f.pctpubliccoveragealone, data = df)

```

## Models without variable interactions

This preliminary analysis fits simple linear regression models to explore how
individual predictors (poverty rate, marriage rate, incidence rate, median
income, high school education and public coverage) relate to the target
variable, `target_deathrate`. Each model regresses `target_deathrate` on a
single predictor, so the estimates do not control for the other factors.

All of these socio-economic and health-related factors show a significant
relationship with `target_deathrate`. Poverty percentage has a strong positive
association with death rates, where a 1% increase in poverty corresponds to an
increase of 23.42 in the death rate. Marriage rates show a negative
relationship, with higher marriage percentages linked to lower death rates,
particularly in the linear and quadratic terms. Incidence rates exhibit a
positive effect, suggesting that higher disease rates contribute to higher death
rates. Median income also has a negative relationship, with higher income
associated with lower death rates, though the effect becomes more complex with
higher-order terms. Interestingly, high school graduation rates are positively
correlated with death rates, which may reflect underlying socio-economic factors
not captured in the model. Public health coverage, both overall and alone, shows
a positive association with death rates. This does not imply that coverage
raises mortality: the association may reflect disparities in healthcare access
or reporting, or other factors that these single-predictor models leave out.
Overall, these variables highlight key socio-economic and health influences on
mortality, providing insights for further analysis.

Only the `pcths` variable is transformed, based on the optimal lambda returned
by `boxcox`. For the other variables, the optimal lambda was close to 1, which
indicates that no transformation is needed. For `pcths`, the optimal lambda was
0.66, which lies between a square root (0.5) and no transformation (1) and
therefore suggests a mild power transformation. Note that `boxcox` estimates the
power for the response of the formula it is given, so raising the predictor
`pcths` to that power, as the code does, is a heuristic rather than a direct
Box-Cox transformation.

```{r}
# Linear model: target_deathrate ~ percentmarried 
lm_percentmarried <- lm(target_deathrate ~ percentmarried, data = df)
summary(lm_percentmarried)
lambda_percentmarried <- boxcox(target_deathrate ~ percentmarried, data = df)
optimal_lambda_percentmarried <- lambda_percentmarried$x[which.max(lambda_percentmarried$y)]
optimal_lambda_percentmarried
#df$percentmarried <- df$percentmarried^optimal_lambda_percentmarried

# Linear model: target_deathrate ~ incidencerate
lm_incidencerate <- lm(target_deathrate ~ incidencerate, data = df)
summary(lm_incidencerate)
lambda_incidencerate <- boxcox(target_deathrate ~ incidencerate, data = df)
optimal_lambda_incidencerate <- lambda_incidencerate$x[which.max(lambda_incidencerate$y)]
optimal_lambda_incidencerate
#df$incidencerate <- df$incidencerate^optimal_lambda_incidencerate

# Linear model: target_deathrate ~ medincome
lm_medincome <- lm(target_deathrate ~ medincome, data = df)
summary(lm_medincome)
lambda_medincome <- boxcox(target_deathrate ~ medincome, data = df)
optimal_lambda_medincome <- lambda_medincome$x[which.max(lambda_medincome$y)]
optimal_lambda_medincome
#df$medincome <- df$medincome^optimal_lambda_medincome

# Linear model: target_deathrate ~ pcths
lm_pcths <- lm(target_deathrate ~ pcths, data = df)
summary(lm_pcths)
lambda_pcths <- boxcox(target_deathrate ~ pcths, data = df)
optimal_lambda_pcths <- lambda_pcths$x[which.max(lambda_pcths$y)]
optimal_lambda_pcths
df$pcths_raised <- df$pcths^optimal_lambda_pcths

# Linear model: target_deathrate ~ pctpubliccoverage
lm_pctpubliccoverage <- lm(target_deathrate ~ pctpubliccoverage, data = df)
summary(lm_pctpubliccoverage)
lambda_pctpubliccoverage <- boxcox(target_deathrate ~ pctpubliccoverage, data = df)
optimal_lambda_pctpubliccoverage <- lambda_pctpubliccoverage$x[which.max(lambda_pctpubliccoverage$y)]
optimal_lambda_pctpubliccoverage
#df$pctpubliccoverage <- df$pctpubliccoverage^optimal_lambda_pctpubliccoverage

# Linear model: target_deathrate ~ pctpubliccoveragealone
lm_pctpubliccoveragealone <- lm(target_deathrate ~ pctpubliccoveragealone, data = df)
summary(lm_pctpubliccoveragealone)
lambda_pctpubliccoveragealone <- boxcox(target_deathrate ~ pctpubliccoveragealone, data = df)
optimal_lambda_pctpubliccoveragealone <- lambda_pctpubliccoveragealone$x[which.max(lambda_pctpubliccoveragealone$y)]
optimal_lambda_pctpubliccoveragealone
#df$pctpubliccoveragealone <- df$pctpubliccoveragealone^optimal_lambda_pctpubliccoveragealone

# Linear model: target_deathrate ~ povertypercent
lm_povertypercent <- lm(target_deathrate ~ povertypercent, data = df)
summary(lm_povertypercent)
lambda_povertypercent <- boxcox(target_deathrate ~ povertypercent, data = df)
optimal_lambda_povertypercent <- lambda_povertypercent$x[which.max(lambda_povertypercent$y)]
optimal_lambda_povertypercent
#df$povertypercent <- df$povertypercent^optimal_lambda_povertypercent


```
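
The repeated lambda extraction above can be wrapped in a small helper. The
sketch below is only an illustration: `optimal_boxcox_lambda` and the simulated
`toy_bc` data are hypothetical names that are not part of the analysis. The
helper fits the `lm` itself with `y = TRUE` so that `boxcox` does not need to
re-evaluate the call, and uses `plotit = FALSE` to skip the plot.

```{r}
library(MASS)

# Hypothetical helper: lambda that maximises the Box-Cox profile
# log-likelihood for the response of the given formula
optimal_boxcox_lambda <- function(formula, data, lambda = seq(-2, 2, 0.01)) {
  fit <- lm(formula, data = data, y = TRUE, qr = TRUE)
  bc <- boxcox(fit, lambda = lambda, plotit = FALSE)
  bc$x[which.max(bc$y)]
}

# Toy data: sqrt(y) is linear in x, so the estimate should be close to 0.5
set.seed(3)
toy_bc <- data.frame(x = runif(300, 1, 10))
toy_bc$y <- (2 + 3 * toy_bc$x + rnorm(300, sd = 0.5))^2

toy_lambda <- optimal_boxcox_lambda(y ~ x, toy_bc)
toy_lambda
```

The estimated power applies to the response `y`; a value near 0.5 says that
`sqrt(y)`, not a transformed `x`, is approximately linear in the predictor.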

## Models with variable interactions 

We began by building a linear model to predict the target death rate from a set
of selected variables: `percentmarried`, `incidencerate`, `medincome`, `pcths`,
`pctpubliccoverage`, `pctpubliccoveragealone` and `povertypercent`. The idea is
to examine the influence of socioeconomic factors and health coverage on the
target variable. We then applied stepwise regression (in both directions) to
simplify the model, retaining the most influential variables and removing those
that are redundant or contribute little explanatory power. The procedure adds
and removes terms according to an information criterion. Because we pass
`k = log(nrow(df))` to `step`, the penalty is that of the Bayesian Information
Criterion (BIC), even though the output of `step` still labels it "AIC". BIC
penalizes additional terms more heavily than the default AIC (`k = 2`), which
favors a more parsimonious model that is easier to interpret.

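Then, we repeat the process adding `pctbach` and all pairwise interactions to
obtain a more complete model. We obtain a model with an $R^2$ of 56.7%, which
is quite high. However, stepwise selection still keeps all eight main effects
plus six interaction terms, which makes the model considerably harder to
interpret.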

```{r}
# Create a new model with all selected variables
full_selected_model <- lm(target_deathrate ~ percentmarried + incidencerate + medincome + pcths + pctpubliccoverage + pctpubliccoveragealone + povertypercent, 
                          data = df)
summary(full_selected_model)

# Perform stepwise selection (both directions)
lm_stepwise <- step(full_selected_model, direction = "both", k=log(nrow(df))); lm_stepwise

## BEST MODEL PREDICTED BY FUNCTION STEP
#Step:  AIC=11068.72
#target_deathrate ~ incidencerate + pcths + pctpubliccoveragealone + 
#    povertypercent

# Plot the stepwise regression result
plot(lm_stepwise)

#Repeat the process adding more variables and accounting for interactions

full_selected_model <- lm(target_deathrate ~ (percentmarried + incidencerate + medincome + pcths + povertypercent + pctpubliccoverage + pctpubliccoveragealone +  pctbach)^2, data = df)
full_selected_model_step <- step(full_selected_model, k=log(nrow(df)))

#Last model: 

#Step:  AIC=10996.07
#target_deathrate ~ percentmarried + incidencerate + medincome + 
#    pcths + povertypercent + pctpubliccoverage + pctpubliccoveragealone + 
#    pctbach + percentmarried:medincome + incidencerate:pctpubliccoverage + 
#    incidencerate:pctpubliccoveragealone + medincome:pctpubliccoveragealone + 
#    povertypercent:pctpubliccoverage + povertypercent:pctpubliccoveragealone

summary(full_selected_model)

```
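
The effect of the `k` argument can be seen on a small simulated example. This
is a minimal sketch with hypothetical data (`toy_step`), in which only `x1`
truly drives the response; `trace = 0` silences the step-by-step log.

```{r}
# Toy data: y depends on x1 only; x2 and x3 are pure noise
set.seed(4)
toy_step <- data.frame(x1 = rnorm(200), x2 = rnorm(200), x3 = rnorm(200))
toy_step$y <- 1 + 2 * toy_step$x1 + rnorm(200)

toy_full <- lm(y ~ x1 + x2 + x3, data = toy_step)

# k = 2 is the AIC penalty (the default); k = log(n) is the BIC penalty
toy_bic <- step(toy_full, direction = "both", k = log(nrow(toy_step)), trace = 0)

formula(toy_bic)
c(full = BIC(toy_full), selected = BIC(toy_bic))
```

Since `step` only moves to models with a lower criterion, the selected model
never has a higher BIC than the starting one.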

In this section, we build a series of linear regression models to predict the
target death rate. We start with a simple model (m1) that uses only the
incidence rate as a predictor, which provides some insight into its
relationship with the death rate. Next, we expand the model (m2) to include the
poverty percentage (`povertypercent`) as an additional predictor, and confirm
with the Box-Cox procedure that no transformation of the target variable is
needed. The inclusion of the poverty percentage significantly improves the
model, as confirmed by an ANOVA comparison between m1 and m2.

We continue to refine the model by adding the high school education rate
(`pcths`) in model (m3), which further improves the fit, showing a significant
reduction in the residual sum of squares (RSS) in the ANOVA comparison between
m2 and m3. In model (m4), we replace `pcths` with its transformed version
`pcths_raised` and add `pctpubliccoveragealone`. The ANOVA comparisons of m4
against m1 and m2 again show a significant reduction in RSS. The comparison
between m3 and m4 should be read with care, because these two models are not
nested (`pcths` and `pcths_raised` are different terms).

Throughout this process, we monitor potential issues such as multicollinearity
and heteroscedasticity. The Variance Inflation Factor (VIF) values for all
predictors are low (well below 5), indicating that multicollinearity is not a
concern. However, the Breusch-Pagan test for heteroscedasticity shows evidence
of non-constant variance in the residuals of m4, which could violate one of the
assumptions of linear regression.

Overall, the model selection process allowed us to identify the most relevant
socio-economic and health-related variables influencing the target death rate.
The final model (m4) includes incidence rate, the transformed high school
education rate, public coverage alone and poverty percentage. Although the
model fits the data reasonably well, we note the presence of
heteroscedasticity, which we may need to address in future iterations. This
approach retains the most statistically significant variables while limiting
the risk of overfitting. Note that prior to this code we tested other
arrangements and other variables in order to find the best model. For the sake
of simplicity, the code only shows the final results.

```{r}
# target_deathrate ~ incidencerate
m1 <- lm(target_deathrate ~ incidencerate, data = df)
summary(m1)
# plot(m1)
boxcox(target_deathrate ~ incidencerate , data=df) #No transformation of the target needed

# target_deathrate ~ povertypercent + incidencerate
m2 <- lm(target_deathrate ~ povertypercent + incidencerate, data= df)
boxcox(target_deathrate ~ povertypercent + incidencerate, data=df) #No transformation of the target needed.
summary(m2)
# plot(m2)
t <- summary(m2)
vif(m2)
# Reference value 1 / (1 - R^2) of the whole model, to compare against the VIFs
1/(1-t$r.squared)

anova(m1,m2)

# target_deathrate ~ pcths + incidencerate + povertypercent
m3 <- lm(target_deathrate ~  pcths + incidencerate + povertypercent, data= df)
summary(m3)
plot(m3)

anova(m1,m3)
anova(m2,m3)

# target_deathrate ~ incidencerate + pcths + pctpubliccoveragealone + povertypercent
m4 <- lm(target_deathrate ~ incidencerate + pcths_raised + pctpubliccoveragealone + povertypercent, data= df)
summary(m4)
plot(m4)

# The ANOVA results show that m4 significantly improves on the previous models
# in all three comparisons. In each case the p-value is extremely low
# (< 2.2e-16), i.e. the terms added in m4 (pcths_raised and
# pctpubliccoveragealone) lead to a statistically significant reduction in the
# residual sum of squares.
anova(m1,m4)
anova(m2,m4)
anova(m3,m4)

# The Breusch-Pagan test was conducted to check for heteroscedasticity 
# (non-constant variance of residuals). With a p-value of 0.0086, the test
# indicates evidence of heteroscedasticity in the model. This suggests that the
# residuals' variance is not constant, potentially violating one of the
# assumptions of linear regression
bptest(m4)

# The VIF values for all predictors in the model (incidencerate, pcths_raised,
# pctpubliccoveragealone and povertypercent) are well below the common
# thresholds of 5 or 10 for multicollinearity concerns. This indicates that the
# predictors do not pose a multicollinearity issue in the regression
vif(m4)
t <- summary(m4)
1/(1-t$r.squared)

model <- m4
```
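
One common way to deal with the heteroscedasticity detected above is to keep the
OLS coefficients and replace their standard errors with heteroscedasticity-robust
ones. The sketch below computes White's (HC0) sandwich estimator with base R on
simulated data (`toy_het` and the other `toy_` names are hypothetical), so that
the arithmetic is visible. The `sandwich` package loaded at the top of this
document offers `vcovHC` for the same purpose.

```{r}
# Toy data in which the error variance grows with x
set.seed(5)
toy_het <- data.frame(x = runif(400, 1, 10))
toy_het$y <- 5 + 2 * toy_het$x + rnorm(400, sd = toy_het$x)

toy_het_fit <- lm(y ~ x, data = toy_het)

# White (HC0) estimator: (X'X)^-1 X' diag(e^2) X (X'X)^-1
toy_X <- model.matrix(toy_het_fit)
toy_e <- residuals(toy_het_fit)
toy_bread <- solve(crossprod(toy_X))
toy_meat <- crossprod(toy_X * toy_e)   # each row of X scaled by its residual
toy_vcov_hc0 <- toy_bread %*% toy_meat %*% toy_bread

cbind(
  estimate   = coef(toy_het_fit),
  se_classic = sqrt(diag(vcov(toy_het_fit))),
  se_robust  = sqrt(diag(toy_vcov_hc0))
)
```

The coefficient estimates are unchanged; only the standard errors, and hence
the t-tests built on them, differ between the two columns.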

## Visualization and model diagnostics of the chosen model

```{r}
lmBest <- model
plot(model)
crPlots(model)
marginalModelPlots(model)
```

## Influential data

Before proceeding with the current model, let us determine whether there are any
data points that are particularly influential on the regression coefficients.

### A-priori influential data

These are data points that are considerably far from the rest of the cloud of
points. They tend to have a high leverage and can significantly affect the
regression coefficients.

A common measure of leverage is the hat value, which is the diagonal element of
the hat matrix. As a rule of thumb, observations with a hat value greater than
2p/n, where p is the number of estimated coefficients (including the intercept)
and n is the number of observations, are considered to have high leverage.

```{r}
hat_values = hatvalues(model)
hat_threshold = 2 * length(coefficients(model)) / nrow(df)
influential_data = which(hat_values > hat_threshold)
length(influential_data)
```

144 data points are found to have high leverage according to the hat value
criterion. We can see them visually via a simple Multidimensional Scaling (MDS)
plot.

```{r}
par(mfrow = c(1, 1))
used_variables = attr(model$terms, "term.labels")
mds <- cmdscale(daisy(df[, used_variables]), k = 2) # Use daisy for mixed data types
plot(mds, col = ifelse(1:nrow(df) %in% influential_data, "red", "black"))
```

As we can see, the influential data points are scattered throughout the plot,
indicating that they are not clustered in any particular region of the feature
space. However, as expected, most of them are far from the center of the cloud
of points.

Let us remove these influential data points and re-fit the model to see if the
results change significantly.

```{r}
model_no_priori = update(model, data = df[-influential_data, ])
summary(model_no_priori)
```

Surprisingly, the model's $R^2$ value has decreased slightly after removing the
influential data points. This suggests that the influential data points were
actually contributing to the model's fit. We will keep the original model for
now.
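
The 2p/n rule follows from the fact that the hat values sum to the number of
estimated coefficients, so their average is p/n. The sketch below checks this on
simulated data (`toy_lev` is hypothetical) with one point placed far away in
predictor space.

```{r}
# Toy data: 50 ordinary points plus one far out in predictor space
set.seed(6)
toy_lev <- data.frame(x = c(rnorm(50), 10))
toy_lev$y <- 1 + 0.5 * toy_lev$x + rnorm(51)

toy_lev_fit <- lm(y ~ x, data = toy_lev)
toy_hat <- hatvalues(toy_lev_fit)

# The hat values sum to the number of coefficients (intercept and slope),
# so the rule of thumb flags values above twice their mean
toy_p <- length(coef(toy_lev_fit))
toy_threshold <- 2 * toy_p / nrow(toy_lev)
sum(toy_hat)
which(toy_hat > toy_threshold)
```

Leverage depends only on the predictors, which is why the isolated point is
flagged regardless of its response value.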

### A-posteriori influential data

Now that a model is defined, we can search for data points that have actually
altered the regression coefficients significantly. Cook's distance measures the
influence of each observation on the regression coefficients. Observations with
a high Cook's distance are considered influential.

```{r}
cooks_distance = cooks.distance(model)
Boxplot(cooks_distance, id=list(labels=df$geography))
```

Only one data point seems to have a Cook's distance significantly higher than
the rest: the county of "Williamsburg city, Virginia". This can be further 
visualized using an influence plot.

```{r}
influencePlot(model, id=list(labels=df$geography),
  main="Influence Plot", sub="Circle size is proportional to Cook's distance")
```

This plot also shows the points that were identified as a-priori influential in
the previous step. 

As a rule of thumb, observations with a Cook's distance greater than 4/n are
considered influential. Let us remove the a-posteriori influential points and
re-fit the model.

```{r}
influential_data = which(cooks_distance > 4/nrow(df)); length(influential_data)
model_no_posteriori = update(model, data = df[-influential_data, ])
summary(model_no_posteriori)
```

135 data points are found to be highly influential according to the Cook's
distance criterion. We can see them visually via a simple Multidimensional
Scaling (MDS) plot.

```{r}
par(mfrow = c(1, 1))
used_variables = attr(model$terms, "term.labels")
mds <- cmdscale(daisy(df[, used_variables]), k = 2) # Use daisy for mixed data types
plot(mds, col = ifelse(1:nrow(df) %in% influential_data, "red", "black"))
```

As before, the influential data points are scattered throughout the plot, not
clustered, and most of them are far from the center of the cloud of points. When
these points are removed, the model's $R^2$ value increases considerably (to
0.56), suggesting that the influential data points were indeed distorting the
model's fit. We will keep the model without the a-posteriori influential data
points.

```{r}
model <- model_no_posteriori
```
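
The effect of a single influential observation is easier to see on a small
example. The following sketch uses simulated data (`toy_cook` is hypothetical)
with one planted outlier at the edge of the predictor range, applies the 4/n
rule on Cook's distance and compares the coefficients before and after removing
the flagged rows.

```{r}
# Toy data: a clean linear trend with one planted outlier
set.seed(7)
toy_cook <- data.frame(x = seq(1, 20, length.out = 40))
toy_cook$y <- 3 + 2 * toy_cook$x + rnorm(40)
toy_cook$y[40] <- toy_cook$y[40] + 30   # high leverage and large residual

toy_cook_fit <- lm(y ~ x, data = toy_cook)
toy_cd <- cooks.distance(toy_cook_fit)

# Same rule of thumb as above: flag Cook's distance > 4/n
toy_flagged <- which(toy_cd > 4 / nrow(toy_cook))

# Refit without the flagged rows and compare the coefficients
toy_cook_refit <- update(toy_cook_fit, data = toy_cook[-toy_flagged, ])
rbind(all_rows = coef(toy_cook_fit), without_flagged = coef(toy_cook_refit))
```

Unlike leverage, Cook's distance combines the position of a point in predictor
space with the size of its residual.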

# Model Validation

All the validation steps will be performed using the final model on the test
dataset. Let's start by loading the test dataset and applying the same
preprocessing steps as we did for the training dataset.

```{r}
test <- read_csv("data/test.csv")

# Type conversion
test$f.binnedinc <- as.factor(test$binnedinc)
test$binnedinc <- sapply(
  strsplit(gsub("[\\[\\]()]", "", test$binnedinc, perl = T), ","),
  function(x) mean(as.numeric(x))
)
test$state <- sub(".*,\\s*", "", test$geography)
test$f.race <- as.factor(apply(test, 1, getRace))

# Missing values
test <- subset(test, select = -c(pctprivatecoveragealone, pctsomecol18_24))
test$pctemployed16_over <- complete(mice(test), action = 1)$pctemployed16_over

# Outliers
test_out = which(test$medianage > 100)
test$medianage[test_out] <- (test$medianagemale[test_out] + test$medianagefemale[test_out]) / 2
test <- subset(test, select = -c(medianagemale, medianagefemale))

# Variable discretization
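discretize_based_on <- function(col, base_col, level_name) {
  # Discretize the column based on the quartiles of another column
  breaks <- quantile(base_col, probs = seq(0, 1, 0.25))
  # Open the outer bins so that values outside the range of base_col fall into
  # the lowest/highest level instead of becoming NA
  breaks[c(1, length(breaks))] <- c(-Inf, Inf)
  res <- cut(col, breaks = breaks,
    include.lowest = T,
    labels=c(
      sprintf("Low%s", level_name),
      sprintf("LowMid%s", level_name),
      sprintf("HighMid%s", level_name),
      sprintf("High%s", level_name)
    )
  )
  return(res)
}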

test$f.avganncount <- discretize_based_on(test$avganncount, df$avganncount, "CaseCount")
test$f.avgdeathsperyear <- discretize_based_on(test$avgdeathsperyear, df$avgdeathsperyear, "MortCount")
test$f.target_deathrate <- discretize_based_on(test$target_deathrate, df$target_deathrate, "DeathRate")
test$f.incidencerate <- discretize_based_on(test$incidencerate, df$incidencerate, "DiagnPerCap")
test$f.medincome <- discretize_based_on(test$medincome, df$medincome, "MedianInc")
test$f.popest2015 <- discretize_based_on(test$popest2015, df$popest2015, "MidPop")
test$f.povertypercent <- discretize_based_on(test$povertypercent, df$povertypercent, "Pov%")
test$f.studypercap <- cut(
  test$studypercap, breaks = c(-Inf, 0, non_zero_studypercap_median, Inf), # Use the breakpoints from training data
  include.lowest = T,
  labels=c("NoTrials", "MidTrials", "HighTrials")
)
test$f.medianage <- discretize_based_on(test$medianage, df$medianage, "Age")
test$f.percentmarried <- discretize_based_on(test$percentmarried, df$percentmarried, "Married%")
test$f.pctnohs18_24 <- discretize_based_on(test$pctnohs18_24, df$pctnohs18_24, "NoHighsc%")
test$f.pcths18_24 <- discretize_based_on(test$pcths18_24, df$pcths18_24, "Highsc%")
test$f.pcths25_over <- discretize_based_on(test$pcths25_over, df$pcths25_over, "25Highsc%")
test$f.pctbachdeg25_over <- discretize_based_on(test$pctbachdeg25_over, df$pctbachdeg25_over, "Bach%")
test$f.pctemployed16_over <- discretize_based_on(test$pctemployed16_over, df$pctemployed16_over, "Employ%")
test$f.pctunemployed16_over <- discretize_based_on(test$pctunemployed16_over, df$pctunemployed16_over, "Unemploy%")
test$f.pctprivatecoverage <- discretize_based_on(test$pctprivatecoverage, df$pctprivatecoverage, "Private%")
test$f.pctempprivcoverage <- discretize_based_on(test$pctempprivcoverage, df$pctempprivcoverage, "EmployeeHealth%")
test$f.pctpubliccoverage <- discretize_based_on(test$pctpubliccoverage, df$pctpubliccoverage, "GovHealth%")
test$f.pctwhite <- discretize_based_on(test$pctwhite, df$pctwhite, "White%")
test$f.pctblack <- discretize_based_on(test$pctblack, df$pctblack, "Black%")
test$f.pctasian <- discretize_based_on(test$pctasian, df$pctasian, "Asian%")
test$f.pctotherrace <- discretize_based_on(test$pctotherrace, df$pctotherrace, "OtherRace%")
test$f.pctmarriedhouseholds <- discretize_based_on(test$pctmarriedhouseholds, df$pctmarriedhouseholds, "Married%")
test$f.birthrate <- discretize_based_on(test$birthrate, df$birthrate, "Birth%")

# Combining variables
test$pcths <- test$pcths18_24 + test$pcths25_over
test$pctbach <- test$pctbachdeg18_24 + test$pctbachdeg25_over
test$racindex <- test$pctblack + test$pctasian + test$pctotherrace
test$social_welfare <- test$pctpubliccoverage + test$povertypercent

# Raising to the optimal lambda
test$pcths_raised <- test$pcths^optimal_lambda_pcths
```

## Model Evaluation

Let's evaluate the model on the test dataset. We will start by predicting the
target variable using the model and calculating the mean squared error (MSE) and
the R-squared value.

```{r}
test$predicted_deathrate <- predict(model, newdata = test)

test$predicted_residuals <- test$target_deathrate - test$predicted_deathrate
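mse <- mean((test$predicted_residuals)^2)
# R-squared as 1 - SSE/SST (var() divides by n - 1 while the MSE divides by n)
sst <- sum((test$target_deathrate - mean(test$target_deathrate))^2)
r_squared <- 1 - sum(test$predicted_residuals^2) / sst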
cat("Mean Squared Error:", mse, "\n")
cat("R-squared:", r_squared, "\n")
```

The model has an R-squared value of 0.46 on the test dataset, indicating that it
can account for 46% of the variance in the target death rate.

Another interesting metric for the model's performance is the mean absolute
error (MAE), which is expressed in the same units as the target variable and is
less sensitive to a few large errors than the MSE.

```{r}
mae <- mean(abs(test$predicted_residuals))
cat("Mean Absolute Error:", mae, "\n")
```

The model has a mean absolute error of 15.2 on the test dataset, which means 
that, on average, the model's predictions are off by 15.2 deaths per 100,000
inhabitants from the actual death rate.
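
The metrics used in this section can be gathered in one small function. This is
a minimal sketch: `regression_metrics` is a hypothetical helper, and the values
passed to it are made up only to show the call.

```{r}
# Hypothetical helper: error metrics from actual and predicted values
regression_metrics <- function(actual, predicted) {
  res <- actual - predicted
  sse <- sum(res^2)                        # residual sum of squares
  sst <- sum((actual - mean(actual))^2)    # total sum of squares
  c(
    mse = mean(res^2),
    rmse = sqrt(mean(res^2)),
    mae = mean(abs(res)),
    r_squared = 1 - sse / sst
  )
}

# Made-up values, only to show the call
toy_metrics <- regression_metrics(actual = c(1, 2, 3, 4), predicted = c(1, 2, 3, 6))
toy_metrics
```

The RMSE is included because, like the MAE, it is expressed in the units of the
target variable.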

This becomes more evident when plotting the predicted death rate against the
actual death rate, as well as the residuals against the actual death rate.

```{r}
par(mfrow = c(2, 2))
plot(test$target_deathrate, test$predicted_deathrate,
  main = "Predicted vs. Actual Death Rate",
  xlab = "Actual Death Rate", ylab = "Predicted Death Rate"
)
abline(0, 1, col = "red")

plot(test$predicted_residuals, test$predicted_deathrate,
  main = "Predicted Death Rate vs. Residuals",
  ylab = "Predicted Death Rate", xlab = "Residuals"
)
abline(v = 0, col = "red")

plot(test$target_deathrate, test$predicted_residuals,
  main = "Residuals vs. Actual Death Rate",
  xlab = "Actual Death Rate", ylab = "Residuals"
)
abline(h = 0, col = "red")

hist_data = hist(test$predicted_residuals, plot = FALSE)
barplot(hist_data$counts,
  names.arg = hist_data$breaks[-length(hist_data$breaks)],
  axes= TRUE,
  space = 0,
  xlab = "Residuals", main = "Histogram of Residuals"
)


# Check whether the residuals are centered around 0 (if p > 0.01, we cannot
# reject the null hypothesis that the mean of the residuals is 0)
t.test(test$predicted_residuals)
```

As expected from an Ordinary Least Squares (OLS) model, the residuals are 
centered around 0, and the predicted death rate is reasonably close to the
actual death rate. However, there is a clear trend in the residuals when
plotting them against the actual death rate: the residuals are higher for
higher death rates. Part of this trend is expected, since residuals are
positively correlated with the observed response whenever the fit is not
perfect, but it also indicates that the model is not capturing all the variance
in the target variable and may need further refinement.
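
How much of a residuals-versus-actual trend is built in can be checked on
simulated data. In the sketch below (`toy_res` is hypothetical), the model is
correctly specified, yet the in-sample correlation between residuals and
observed values equals $\sqrt{1 - R^2}$, while the correlation between
residuals and fitted values is zero.

```{r}
# Toy data: a correctly specified linear model with noise
set.seed(8)
toy_res <- data.frame(x = rnorm(300))
toy_res$y <- 10 + 3 * toy_res$x + rnorm(300, sd = 4)

toy_res_fit <- lm(y ~ x, data = toy_res)
toy_r2 <- summary(toy_res_fit)$r.squared

# For OLS with an intercept, cor(residuals, observed) = sqrt(1 - R^2),
# whereas residuals and fitted values are uncorrelated by construction
toy_cors <- c(
  cor_resid_observed = cor(residuals(toy_res_fit), toy_res$y),
  sqrt_one_minus_r2 = sqrt(1 - toy_r2),
  cor_resid_fitted = cor(residuals(toy_res_fit), fitted(toy_res_fit))
)
toy_cors
```

For this reason, a plot of residuals against predicted values is usually the
more informative diagnostic for misspecification.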

# A word from the authors

In this analysis, we have explored the relationship between various 
socio-economic and health-related factors and cancer death rates in US
counties. We have built a linear regression model that predicts the death rate
based on these factors and evaluated its performance on a test dataset. 

This work was an interesting exercise in data analysis and modeling, and 
possibly our first glance at the complexity of finding the best techniques to
model a real-world problem. We have learned a lot about the importance of data
preprocessing, feature selection, and model evaluation in building a predictive
model.

Pretty much the entirety of the analysis was done in collaboration between the
three of us, although some parts were more heavily influenced by one of us. For
instance, Dani Reverter led the way with the preliminary data analysis, whilst 
Albert Puiggròs centered their efforts on model discovery and fitting, and Marc
Parcerisa focused on model validation and influence analysis.