---
title: "The effect of education on income"
author: Marc Parcerisa
date: November 17, 2024
output:
  md_document:
    variant: markdown_github
---
<!-- README.md is generated from IncomeStudy.Rmd. Please edit that document. -->
```{r opts, echo = FALSE}
knitr::opts_chunk$set(fig.path = "images/")
```
Today I saw some kid on *TikTok* saying that there's no point in studying anything
because doing so wouldn't increase your income. I thought that was a pretty bold
statement, so I did what any reasonable person would do: I went to Kaggle to find
a dataset that would help me prove him wrong.

I found a dataset called `income.csv` that contains information about the income
and education of people in the United States. You can find it here:
https://www.kaggle.com/datasets/amirhosseinmirzaie/americancitizenincome?resource=download

Here's an explanation of the columns:

**Column** | **Description**
:-------------:|:-------------:
age | Age
workclass | A general term indicating the employment status of an individual.
fnlwgt | Final weight, representing the number of individuals that this row represents (a representative sample).
education | Highest level of education achieved by an individual.
education.num | Highest level of education achieved by an individual in numerical form.
marital.status | Marital status of an individual. Note that Married-civ-spouse refers to a civilian spouse, and Married-AF-spouse refers to a spouse in the Armed Forces.
occupation | General type of occupation of an individual.
relationship | Relationship of this individual with others, for example, spouse (Husband). Each data point has only one relationship.
race | Race
sex | Biological sex of an individual.
capital.gain | Capital gains of an individual.
capital.loss | Capital losses of an individual.
hours.per.week | Number of hours the individual reported working per week.
native.country | Country of origin.
income | Income, less than or equal to $50,000 (<=50K) or more than that (>50K).
For this study I'll be using the following libraries:
```{r}
library(readr)
library(FactoMineR)
```
### Loading and preparing the data
```{r}
df <- read_csv("income.csv")
summary(df)
```
Let's go column by column to clean the data and ensure that it's ready for analysis.
```{r}
# workclass should be a factor. "?" marks a missing value, so turn it into NA
# before converting; that way "?" doesn't linger as an empty factor level.
df$workclass[df$workclass == "?"] <- NA
df$workclass <- as.factor(df$workclass)
# Education is also a factor.
df$education <- as.factor(df$education)
# education.num is an ordinal variable that encodes the same information as
# education. To check that this holds throughout the dataset, we count the
# unique values of "education" and the unique ("education", "education.num")
# pairs; if the two counts match, every education label has a single code.
nrow(unique(df[, c("education", "education.num")])) # 16
length(unique(df$education)) # 16
# Marital status, occupation, relationship, race and sex are all factors
df$marital.status <- as.factor(df$marital.status)
df$occupation <- as.factor(df$occupation)
df$relationship <- as.factor(df$relationship)
df$race <- as.factor(df$race)
df$sex <- as.factor(df$sex)
# Capital gain is numeric, but it looks like there may be some values that should
# be N/A's (Those that are 99999)
df$capital.gain[df$capital.gain == 99999] <- NA
# Same with hours.per.week, 99 hours per week seems like a lot
df$hours.per.week[df$hours.per.week == 99] <- NA
# Native country is a factor
df$native.country <- as.factor(df$native.country)
# Income is recorded as a two-level category (<=50K / >50K), so it's a factor too.
df$income <- as.factor(df$income)
```
Here's what the data looks like now:
```{r}
summary(df)
```
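The consistency check between `education` and `education.num` in the cleaning step confirms that each label has a single code, but it would not notice two labels sharing one code. A slightly stricter sketch, run on a made-up data frame (the rows are hypothetical, not read from `income.csv`):

```{r}
# Toy data with hypothetical rows, not taken from income.csv
toy_edu <- data.frame(
  education = c("HS-grad", "Bachelors", "HS-grad", "Masters"),
  education.num = c(9, 13, 9, 14)
)

# Distinct (label, code) pairs that occur in the data
toy_pairs <- unique(toy_edu[, c("education", "education.num")])

# One-to-one means no label appears with two codes
# and no code appears with two labels
toy_one_to_one <- anyDuplicated(toy_pairs$education) == 0 &&
  anyDuplicated(toy_pairs$education.num) == 0
toy_one_to_one
```

The same idea should carry over to the real data by swapping `toy_edu` for `df`.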
### Exploratory Data Analysis
Now that the data is clean, let's do some exploratory data analysis to see if we can
find any interesting patterns.

Let's start by cross-tabulating income against education level:
```{r}
table(df$income, df$education.num)
```
There's a clear pattern here: the higher the education level, the larger the share
of people earning more than 50K a year.
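
Raw counts are hard to compare because the education levels have very different sizes. A minimal sketch of how the counts could be turned into per-level shares, using a small made-up table (the numbers and the three levels are hypothetical, not taken from `income.csv`):

```{r}
# Toy counts (hypothetical, not from income.csv):
# rows = income bracket, columns = education level
toy_counts <- matrix(
  c(90, 10,
    70, 30,
    40, 60),
  nrow = 2,
  dimnames = list(income = c("<=50K", ">50K"),
                  education.num = c("9", "13", "16"))
)

# margin = 2 divides each column by its own total, so every column sums to 1
toy_shares <- prop.table(toy_counts, margin = 2)

# Share of people above 50K within each education level
round(toy_shares[">50K", ], 2)
```

On the real data, the same call should work on `table(df$income, df$education.num)`.
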
There's a handy tool to check whether two categorical variables are independent:
the chi-squared test of independence. Let's use it to check if there's a
statistically significant relationship between income and education.
```{r}
chisq.test(table(df$income, df$education.num))
```
The test's null hypothesis is that there is no association between the two
variables. The p-value is virtually zero, so we can reject the null hypothesis and
conclude that there is a significant association between income and education.
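
To make the test less of a black box, here is a rough sketch of what the statistic measures: the gap between the observed counts and the counts we would expect if the two variables were independent. The 2x2 table is made up for illustration (hypothetical values, not from `income.csv`):

```{r}
# Toy 2x2 table of counts (hypothetical values, not from income.csv)
toy_obs <- matrix(
  c(30, 10, 20, 40),
  nrow = 2,
  dimnames = list(income = c("<=50K", ">50K"),
                  education = c("low", "high"))
)

# Expected counts under independence: row total * column total / grand total
toy_expected <- outer(rowSums(toy_obs), colSums(toy_obs)) / sum(toy_obs)

# Pearson's statistic: squared deviations scaled by the expected counts
toy_stat <- sum((toy_obs - toy_expected)^2 / toy_expected)

# chisq.test() applies a continuity correction to 2x2 tables by default,
# so it is switched off here to compare like with like
toy_test <- chisq.test(toy_obs, correct = FALSE)
all.equal(unname(toy_test$statistic), toy_stat)
```

The larger the statistic relative to its degrees of freedom, the smaller the p-value.
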
Having established that income and education are associated, we also want to check
that the association is positive. We'll check this in two ways.
First, we'll convert income to a numeric variable that takes values `0` and `1` and
then we'll check the correlation between this new variable and `education.num`.
```{r}
# 0/1 indicator: 1 if the person earns more than 50K, 0 otherwise
income <- as.numeric(df$income == ">50K")
cor.test(df$education.num, income, method = "spearman")
```
The p-value is virtually zero, so we can reject the hypothesis that there is no
correlation between these two numeric variables. Again we conclude that there is
a correlation, and Spearman's coefficient tells us that it is positive (0.33).
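
For intuition about what Spearman's method computes: it is Pearson's correlation applied to the ranks of the values rather than the values themselves, which is why it suits an ordinal variable like `education.num`. A small sketch with made-up vectors (`toy_level` and `toy_high` are hypothetical, not from the dataset):

```{r}
# Toy vectors (hypothetical values): an ordinal score and a 0/1 outcome
toy_level <- c(1, 2, 2, 3, 4, 4, 5, 6)
toy_high  <- c(0, 0, 1, 0, 1, 0, 1, 1)

# Spearman's rho is Pearson's correlation computed on the ranks;
# rank() gives tied values their average rank
toy_rho_direct <- cor(toy_level, toy_high, method = "spearman")
toy_rho_ranks  <- cor(rank(toy_level), rank(toy_high))

all.equal(toy_rho_direct, toy_rho_ranks)
```

A 0/1 variable has many tied values, so `cor.test()` with `method = "spearman"` typically warns that it cannot compute an exact p-value and uses an approximation instead.
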
Finally, we'll use a logistic regression to check if we can predict income based on
education.
```{r}
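# `income` is looked up in df first, so the response is the factor df$income;
# glm() treats its first level (here "<=50K") as 0 and the other level as 1.
model <- glm(income ~ education.num, data = df, family = "binomial")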
summary(model)
plot(table(df$education.num, income))
```
From this output, we really only care about the coefficient $\beta_1$, which is
the change in the log-odds of earning more than 50K for each one-level increase
in education. Its value is 0.37, with a p-value of virtually zero (so the result
is statistically significant). This means that each additional education level
multiplies the odds of earning more than 50K a year by about 1.44, a 44% increase
(since $e^{\beta_1} \approx 1.44$, which is the odds ratio).
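
The step from coefficient to odds ratio follows from the model's form, $\log\frac{p}{1-p} = \beta_0 + \beta_1 x$: raising $x$ by one adds $\beta_1$ to the log-odds, which multiplies the odds by $e^{\beta_1}$. Here's a minimal sketch of that conversion on simulated data (not `income.csv`); the intercept of -5, the sample size and the `toy_` names are assumptions chosen for illustration:

```{r}
set.seed(42)  # reproducible toy data

# Simulated data (not income.csv): true slope on the log-odds scale is 0.37
toy_x <- sample(1:16, size = 2000, replace = TRUE)
toy_p <- plogis(-5 + 0.37 * toy_x)
toy_y <- rbinom(length(toy_x), size = 1, prob = toy_p)

toy_model <- glm(toy_y ~ toy_x, family = binomial)

# exp() turns the log-odds coefficient into an odds ratio per one-unit increase
toy_or <- exp(coef(toy_model)[["toy_x"]])

# Wald 95% confidence interval, also moved to the odds-ratio scale
toy_ci <- exp(confint.default(toy_model)["toy_x", ])

toy_or
toy_ci
```

Keep in mind that an odds ratio acts on the odds, not on the probability: 44% higher odds is not the same as being 44 percentage points more likely to earn over 50K.
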
### Conclusion
So, in conclusion, the kid on *TikTok* was wrong. In this dataset, more education
clearly goes hand in hand with a higher chance of earning more than 50K a year.
Although, I must admit, I was expecting a bigger effect.