# Visualizing Data {#data_viz}
Visualizing your data is hands down the most important thing you can learn to do. There are links to additional resources at the end of this document.
There are two audiences in mind when creating data visualizations:
1. For your eyes only (FYEO). These are quick and dirty plots, without annotation. Meant to be looked at once or twice.
2. To share with others. These need to completely stand on their own. Axes labels, titles, colors as needed, possibly captions.
You will see, and slowly learn, how to add these annotations and how to clean up your graphics to make them sharable. `ggplot2` already does a lot of this work for you.
We will use the two most common methods for creating plots: 1) base graphics and 2) the `ggplot2` package. Each has its own advantages and disadvantages. If you have not done so already, go ahead and install the `ggplot2` package now.
For **almost** every plot discussed we will create two types of plots:
1. FYEO - using base graphics. (Base == comes with R.) Very powerful, but can be technical.
2. FYEO - using `ggplot2`.
As time permits I will update each section with a third type of plot:
3. Sharable - Contains all the bells and whistles needed to make it presentable to others.
Your task, should you choose to accept, is to follow along through this tutorial and at each step try to reproduce the plot shown. You can accomplish this by simply copying and pasting the syntax into a new R code (or R Markdown) document.
## The syntax of `ggplot`
The reason we use the functions in `ggplot2` is for consistency in the structure
of its arguments. Here is a bare bones generic plotting function:
```{r, eval=FALSE}
ggplot(data, aes(x=x, y=y, col=col, fill=fill, group=group)) + geom_THING()
```
### Required arguments
* `data`: What data set is this plot using? This is ALWAYS the first argument.
* `aes()`: This is the _aesthetics_ of the plot. What variable is on the x, what is on
the y? Do you want to color by another variable, perhaps fill some box by the value
of another variable, or group by a variable?
* `geom_THING()`: Every plot has to have a geometry. What is the shape of the thing you
want to plot? Do you want to plot points? Use `geom_point()`. Want to connect those
points with a line? Use `geom_line()`. We will see many varieties in this lab.
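To make the template concrete, here is a minimal sketch that fills in each required piece. It uses the built-in `mtcars` data only so that it runs on its own (that choice, and the name `p`, are my assumptions; the rest of this tutorial uses `diamonds`).

```{r}
library(ggplot2)

# data = mtcars; aes() maps weight to x, mpg to y, and cylinder count to color
p <- ggplot(mtcars, aes(x = wt, y = mpg, col = factor(cyl))) +
  geom_point()   # the geometry: draw one point per row of the data
p
```

Swapping `geom_point()` for another geom changes the type of plot without touching the data or the aesthetics.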
### Optional but helpful arguments
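* `ggtitle()`: the overall plot title.
* `xlab()` and `ylab()`: axis titles.
* `scale_x_*()` and `scale_y_*()` functions (such as `scale_y_continuous()`): control an axis, for example to extend its limits.
* `scale_fill_*()` functions (such as `scale_fill_manual()`): specify fixed fill colors and change the automatic legend title.
* `theme_*()` functions: change the overall look of the plot.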
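Here is a rough sketch of how these optional pieces stack onto a plot. The data (`mtcars`), the titles, the colors and the legend labels are all placeholders I picked for illustration, not anything prescribed by `ggplot2`.

```{r}
library(ggplot2)

p_opts <- ggplot(mtcars, aes(x = factor(cyl), fill = factor(am))) +
  geom_bar(position = "dodge") +
  ggtitle("Cars by number of cylinders") +        # overall plot title
  xlab("Cylinders") + ylab("Number of cars") +    # axis titles
  scale_y_continuous(limits = c(0, 15)) +         # extend the y-axis limits
  scale_fill_manual(name = "Transmission",        # legend title
                    values = c("0" = "grey40", "1" = "steelblue"),
                    labels = c("Automatic", "Manual")) +
  theme_bw()                                      # a cleaner built-in theme
p_opts
```

Each piece is just another layer added with `+`, so you can drop or reorder them freely.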
My favorite full, comprehensive tutorial and reference guide on how to do nearly anything in ggplot is the graphs chapter of the Cookbook for R (http://www.cookbook-r.com/Graphs/). I reference things in there (like how to remove or change the title of a legend) constantly.
## The Data
We will use a subset of the `diamonds` dataset that comes with the `ggplot2` package. This dataset contains the prices and other attributes of almost 54,000 diamonds. Review `?diamonds` to learn about the variables we will be using.
```{r}
library(ggplot2) # Provides the diamonds data and all ggplot functions
data("diamonds")
set.seed(1410) # Make the sample reproducible
dsmall <- diamonds[sample(nrow(diamonds), 1000), ]
```
## Univariate Visualizations
### Categorical variables
Both Nominal and Ordinal data types can be visualized using the same methods: tables, barcharts and pie charts.
#### Tables
Tables are the most common way to get summary statistics of a categorical variable. The `table()` function produces a frequency table, where each entry represents the number of records in the data set holding the corresponding labeled value.
```{r}
table(dsmall$cut)
```
There are 27 Fair quality diamonds, 83 good quality and 387 Ideal quality diamonds in this sample.
#### Barcharts / Barplots
A Barchart or barplot takes these frequencies, and draws bars along the X-axis where the height of the bars is determined by the frequencies seen in the table.
**base**
Creating a barplot/barchart in base graphics requires the data to be summarized in table form first. Then the result of the table is plotted. The first argument is the table to be plotted, and the `main` argument controls the title.
```{r}
dc <- table(dsmall$cut)
barplot(dc, main="Barchart using base graphics")
```
**ggplot**
The geometry needed to draw a barchart in ggplot is `geom_bar()`.
```{r}
ggplot(dsmall, aes(x=cut)) + geom_bar()
```
**pretty**
The biggest addition to a barchart is the numbers on top of the bars. This isn't mandatory, but it does make it nice.
```{r}
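ggplot(dsmall, aes(x=cut)) + theme_bw() +
  geom_bar(aes(y = after_stat(count))) + ggtitle("Frequency of diamonds by cut type") +
  # after_stat(count) is the per-bar count computed by the stat (needs ggplot2 >= 3.3.0)
  geom_text(aes(y = after_stat(count) + 10, label = after_stat(count)), stat='count', size = 5)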
```
#### Plotting Proportions
Often you don't want to compare counts but percents. To accomplish this, we have to aggregate the data to calculate the proportions first, then plot the aggregated data using `geom_col` to create the columns.
```{r}
cut.props <- data.frame(prop.table(table(dsmall$cut)))
cut.props # what does this data look like?
ggplot(cut.props, aes(x=Var1, y=Freq)) + geom_col() +
ylab("Proportion") + xlab("Cut type") +
ggtitle("Proportion of diamonds by cut type")
```
#### Cleveland Dot Plots
Another way to visualize categorical data that takes up less ink than bars is a Cleveland dot plot. Here again we are plotting summary data instead of the raw data. This uses `geom_segment()` to draw the lines from x=0 to the value of the proportion (named `Freq` because of the way `data.frame` works).
```{r}
ggplot(cut.props, aes(x=Freq, y=Var1)) +
geom_point(size = 3) + xlab("Proportion of diamonds") +
theme_bw() + ylab("Cut Type") +
geom_segment(aes(x=0, xend=Freq, y=Var1, yend=Var1), color='grey50')
```
#### Pie Chart
Just like `barplot()`, `pie()` takes a table object as its argument.
**base**
```{r}
dc <- table(dsmall$cut)
pie(dc)
```
Pie charts are my _least_ favorite plotting type. Human eyeballs can't distinguish between angles as well as they can between heights. A mandatory piece needed to make the wedges readable is to add the percentage of each wedge.
```{r}
pie(dc, labels = paste0(names(dc), ' (', prop.table(dc)*100, "%)"))
```
**ggplot**
And here I thought pie charts couldn't get worse... I'm not a fan at all of the ggplot version, so I'm not even going to show it. Here's a link to another great tutorial that does show you how to make one.
http://www.sthda.com/english/wiki/ggplot2-pie-chart-quick-start-guide-r-software-and-data-visualization
However -- Never say never. Here's an example of a *good* use of pie charts.
http://www.storytellingwithdata.com/blog/2019/8/8/forty-five-pie-charts-never-say-never
#### Waffle Chart
This type of chart is not natively found in the `ggplot2` package, but in its own `waffle` package. These are great for infographics.
Reference: https://www.r-bloggers.com/making-waffle-charts-in-r-with-the-new-waffle-package/
```{r}
library(waffle)
waffle(dc/10, rows=5, size=0.5,
title="Cut quality of diamond",
xlab="1 square == 10 diamonds")
```
### Continuous Measures
Here we can look at the price, carat, and depth of the diamonds.
#### Dotplot
```{r}
plot(dsmall$depth)
```
The base function `plot()` creates a **dotplot** for a continuous variable. The value of the variable is plotted on the y axis, and the index, or row number, is plotted on the x axis. This gives you a nice, quick way to see the values of the data.
Often you are not interested in the individual values of each data point, but the _distribution_ of the data. In other words, where is the majority of the data? Does it look symmetric around some central point? Around what values do the bulk of the data lie?
#### Histograms
Rather than showing the value of each observation, we prefer to think of the value as belonging to a _bin_. **The height of each bar in a histogram displays the frequency of values that fall into that bin.** For example, if we cut the diamond depths into 7 bins of equal width, the frequency table would look like this:
```{r}
table(cut(dsmall$depth, 7))
```
In a histogram, the binned counts are plotted as bars. Note that the x-axis is continuous, so the bars touch. This is unlike the barchart, which has a categorical x-axis and vertical bars that are separated.
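To see that a histogram really is just a picture of a binned frequency table, here is a small check. It is a sketch under two assumptions of mine: it uses the full `diamonds` data rather than the sample, and it sets the 7 equal-width bins by hand with `seq()` (which is not exactly how `cut(x, 7)` picks its cut points, since that call pads the range slightly).

```{r}
library(ggplot2)   # only needed here for the diamonds data

depth <- diamonds$depth
# 8 break points = 7 equal-width bins spanning the full range of the data
breaks <- seq(min(depth), max(depth), length.out = 8)

binned <- table(cut(depth, breaks = breaks, include.lowest = TRUE))
h <- hist(depth, breaks = breaks, plot = FALSE)

binned     # the frequency table
h$counts   # the bar heights hist() would draw from the same breaks
```

The two sets of counts match, bin for bin.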
**base**
You can make a histogram in base graphics super easily.
```{r}
hist(dsmall$depth)
```
And it doesn't take too much to clean it up. Here you can specify the number of bins by specifying how many `breaks` should be made in the data (the number of breaks controls the number of bins, and bin width) and use `col` for the fill color.
```{r}
hist(dsmall$depth, xlab="depth", main="Histogram of diamond depth", col="cyan", breaks=20)
```
**ggplot**
```{r}
ggplot(dsmall, aes(x=depth)) + geom_histogram(binwidth = 2.2)
```
The binwidth here is set by looking at the cut points above that were used to create 7 bins. Notice that darkgrey is the default fill color, but makes it hard to differentiate between the bars. So we'll make the outline black using `colour`, and `fill` the bars with white.
```{r}
ggplot(dsmall, aes(x=depth)) + geom_histogram(colour="black", fill="white") +
ggtitle("Distribution of diamond depth")
```
Note I did **not** specify the `binwidth` argument here. The size of the bins can hide features of your graph. By default ggplot2 uses 30 bins (a bin width of roughly range/30), which is usually a reasonable starting point.
#### Density plots
To get a better idea of the true shape of the distribution we can "smooth" out the bins and create what's called a `density` plot or curve. Notice that the shape of this distribution curve is much more... "wigglier" than the histogram may have implied.
**base**
```{r}
plot(density(dsmall$depth))
```
Awesome title huh? (NOT)
**ggplot2**
```{r}
ggplot(dsmall, aes(x=depth)) + geom_density()
```
#### Histograms + density
Often it is more helpful to have the density (or kernel density) plot _on top of_ a histogram plot.
**Base**
Since the height of the bars in a histogram defaults to showing the frequency of records in the data set within that bin, we need to 1) rescale the heights to the density scale (using `prob=TRUE`) so that the bars and the curve share a y-axis, and then 2) use the `lines()` function to add a `density()` line on top.
```{r}
hist(dsmall$depth, prob=TRUE)
lines(density(dsmall$depth), col="blue")
```
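Why does the rescaling make the two layers line up? On the density scale the total area of the bars is 1, the same as the area under a density curve. A quick check, using the full `diamonds` data as a stand-in (my choice, so the chunk runs on its own):

```{r}
library(ggplot2)   # only needed here for the diamonds data

h <- hist(diamonds$depth, plot = FALSE)

# On the density scale, bar height = (count / n) / bin width,
# so height * width is each bin's share of the data.
bar_area <- h$density * diff(h$breaks)
sum(bar_area)      # total area of all the bars
```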
**ggplot**
The syntax starts the same, we'll add a new geom, `geom_density` and color the line blue. Then we add the histogram geom using `geom_histogram` but must specify that the y axis should be on the density, not frequency, scale. Note that this has to go inside the aesthetic statement `aes()`. I'm also going to get rid of the fill by using `NA` so it doesn't plot over the density line.
```{r}
ggplot(dsmall, aes(x=depth)) + geom_density(col="blue") +
  geom_histogram(aes(y=after_stat(density)), colour="black", fill=NA)
```
#### Boxplots
Another very common way to visualize the distribution of a continuous variable is using a boxplot. Boxplots are useful for quickly identifying where the bulk of your data lie. R specifically draws a "modified" boxplot where values that are considered outliers are plotted as dots.
**base**
```{r}
boxplot(dsmall$depth)
```
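If you want the numbers behind the picture, `boxplot.stats()` returns them. The sketch below (on the full `diamonds` data, my choice) also reconstructs the fences: by default a point is drawn as a dot when it lies more than 1.5 box-lengths beyond the box.

```{r}
library(ggplot2)   # only needed here for the diamonds data

s <- boxplot.stats(diamonds$depth)
s$stats   # lower whisker, lower hinge, median, upper hinge, upper whisker

# Fences: 1.5 box-lengths beyond the hinges (the default coef = 1.5)
box_len <- s$stats[4] - s$stats[2]
fences <- c(s$stats[2] - 1.5 * box_len, s$stats[4] + 1.5 * box_len)
fences

length(s$out)   # how many values fall outside the fences and get drawn as dots
```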
Notice that the only axis labeled is the y-axis. Like a dotplot, the x axis, or "width", of the boxplot is meaningless here. We can make the axis more readable by flipping the plot on its side.
```{r}
# The variable plotted is depth (a percentage), so label it as such
boxplot(dsmall$depth, horizontal = TRUE, main="Distribution of diamond depth", xlab="Depth (%)")
```
Horizontal is a bit easier to read in my opinion.
**ggplot**
What about ggplot? ggplot doesn't really like to do univariate boxplots. We can get around that by specifying that we want the box placed at a specific x value.
```{r}
ggplot(dsmall, aes(x=1, y=depth)) + geom_boxplot()
```
To flip it horizontal you may think to simply swap x and y? Good thinking. Of course it wouldn't be that easy. So let's just flip the whole darned plot on its coordinate axis.
```{r}
ggplot(dsmall, aes(x=1, y=depth)) + geom_boxplot() + coord_flip()
```
#### Violin plots
```{r}
ggplot(dsmall, aes(x=1, y=depth)) + geom_violin()
```
#### Boxplot + Violin plots
Overlaying a boxplot and a violin plot serves a similar purpose to Histograms + Density plots.
```{r}
ggplot(dsmall, aes(x=1, y=depth)) + geom_violin() + geom_boxplot()
```
Better appearance - different levels of transparency of the box and violin.
```{r}
ggplot(dsmall, aes(x=1, y=depth)) + xlab("") + theme_bw() +
geom_violin(fill="blue", alpha=.1) +
geom_boxplot(fill="blue", alpha=.5, width=.2) +
theme(axis.title.x=element_blank(),
axis.text.x=element_blank(),
axis.ticks.x=element_blank())
```
#### Normal QQ plots
The last useful plot that we will do on a single continuous variable is one that assesses the _normality_ of the distribution, that is, how closely the data follow a normal distribution.
**base**
```{r}
qqnorm(dsmall$price)
qqline(dsmall$price, col="red")
```
The line I make red because it is a reference line. The closer the points are to following this line, the more "normal" the shape of the distribution is. Price has some pretty strong deviation away from that line. Below I have plotted what a normal distribution looks like as an example of a "perfect" fit.
```{r}
z <- rnorm(1000)
qqnorm(z)
qqline(z, col="blue")
```
**ggplot**
qq (or qnorm) plots specifically plot the data against a theoretical distribution. That means in the `aes()` aesthetic argument we don't specify either x or y, but instead the `sample=` is the variable we want to plot.
```{r}
ggplot(dsmall, aes(sample=price)) + stat_qq()
```
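The base version drew a reference line with `qqline()`. A possible ggplot counterpart is `stat_qq_line()`; the sketch below assumes a ggplot2 version recent enough to include it, and uses the full `diamonds` data so that it runs on its own.

```{r}
library(ggplot2)

p_qq <- ggplot(diamonds, aes(sample = price)) +
  stat_qq() +                      # sample quantiles vs. theoretical normal quantiles
  stat_qq_line(colour = "red")     # reference line, playing the role of qqline()
p_qq
```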
Additional references on making qqplots in ggplot: http://www.sthda.com/english/wiki/ggplot2-qq-plot-quantile-quantile-graph-quick-start-guide-r-software-and-data-visualization
## Bivariate Visualizations
### Categorical v. Categorical
#### Two-way Frequency tables
Cross-tabs, cross-tabulations and two-way tables (all the same thing, different names) can be created by using the `table()` function with two variables.
```{r}
table(dsmall$cut, dsmall$color)
```
There are 4 Fair diamonds with color D, and 21 Ideal quality diamonds with color J.
#### Two-way Proportion tables
Choose your percentages depending on your research question. What are you wanting to compare?
Best practices:
* Explanatory variable on the rows
* Response variable on the columns
* Calculate row %'s as the % of the response for each explanatory group.
Here are demonstrations of how the interpretation of the percents change depending on what the denominator is.
**Cell proportions**
Wrapping `prop.table()` around a table gives you the **cell** proportions.
```{r}
prop.table(table(dsmall$cut, dsmall$color))
```
0.4% of all diamonds are D color and Fair cut, 2.1% are J color and Ideal cut.
**Row proportions**
To get the **row** proportions, you specify `margin=1`. The percentages now add up to 1 across the rows.
```{r}
round(prop.table(table(dsmall$cut, dsmall$color), margin=1),3)
```
14.8% of all Fair quality diamonds are color D. 5.4% of all Ideal quality diamonds have color J.
**Column proportions**
To get the **column** proportions, you specify `margin=2`. The percentages now add up to 1 down the columns.
```{r}
round(prop.table(table(dsmall$cut, dsmall$color), margin=2),3)
```
2.7% of all D color diamonds are of Fair quality. 44.7% of all J color diamonds are of Ideal quality.
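A quick way to confirm which denominator you are using is to add the proportions back up. This sketch uses the full `diamonds` data (my choice, so the chunk runs on its own); the same check works on `dsmall`.

```{r}
library(ggplot2)   # only needed here for the diamonds data

tab <- table(diamonds$cut, diamonds$color)

sum(prop.table(tab))                    # cell proportions: the whole table sums to 1
rowSums(prop.table(tab, margin = 1))    # row proportions: each row (cut) sums to 1
colSums(prop.table(tab, margin = 2))    # column proportions: each column (color) sums to 1
```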
#### Grouped bar charts
To compare proportions of one categorical variable within the levels of another, use grouped barcharts.
**base**
As before, the object to be plotted needs to be the result of a table.
```{r}
cc <- table(dsmall$cut, dsmall$color)
barplot(cc)
```
Stacked bars can be difficult to interpret, and very difficult to compare values between groups. A side by side barchart is preferable.
The `beside=TRUE` is what controls the placement of the bars.
```{r}
barplot(cc, main="quick side by side barchart using base graphics", beside=TRUE)
```
**ggplot**
Again plot the cut on the x axis, but then `fill` using the second categorical variable. This has the effect of visualizing the **row** percents from the table above. The percent of color, within each type of cut.
```{r}
ggplot(dsmall, aes(x=cut, fill=color)) + geom_bar()
```
Again the default is a stacked barchart. So we just specify `position="dodge"` to put the bars side by side.
```{r}
ggplot(dsmall, aes(x=cut, fill=color)) + geom_bar(position = "dodge")
```
And look, an automatic legend. What if I wanted to better compare cut within color group? This is the **column** percentages. Just switch which variable is the x axis and which one is used to fill the colors!
```{r}
ggplot(dsmall, aes(x=color, fill=cut)) + geom_bar(position = "dodge")
```
For more than 2 colors I do not recommend choosing the colors yourself. I know little about color theory so I use the built-in color palettes. Here is a [great cheatsheet](https://www.nceas.ucsb.edu/~frazier/RSpatialGuides/colorPaletteCheatsheet.pdf) about using color palettes.
And this easy change is why we love `ggplot2`.
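As one example of leaning on a built-in palette, here is a sketch using the ColorBrewer scales that ship with `ggplot2`. The palette name `"Blues"` and the legend title are just my picks; a sequential palette seemed reasonable because `cut` is an ordered factor.

```{r}
library(ggplot2)

p_pal <- ggplot(diamonds, aes(x = color, fill = cut)) +
  geom_bar(position = "dodge") +
  scale_fill_brewer(palette = "Blues", name = "Cut")   # ColorBrewer sequential palette
p_pal
```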
#### Grouped bar charts with percentages
Not as easy as one would hope, but the solution is to calculate the desired percentages first and then plot the summary data using either `geom_bar(stat='identity')` or `geom_col()`.
```{r}
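library(dplyr) # provides %>%, group_by(), summarise() and mutate()
calc.props <- diamonds %>% group_by(color, cut) %>%
  summarise(count=n()) %>%                # result is still grouped by color
  mutate(pct=round(count/sum(count),3))   # so pct is the share of each cut within a color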
calc.props
```
Since we're plotting summary data, the height of the bars is specified using `y=pct`.
```{r}
ggplot(calc.props, aes(x=color, fill=cut, y=pct)) +
geom_col(position="dodge") + theme_bw()
```
Now set some options to the y axis using `scale_y_continuous()` to make the graph more accurate and readable. The `labels=percent` comes from the `scales` package.
```{r}
library(scales)
ggplot(calc.props, aes(x=color, fill=cut, y=pct)) +
geom_col(position="dodge") + theme_bw() +
scale_y_continuous(limits=c(0,1), labels=percent)
```
#### `sjPlot`
sjPlot does a very nice job of being able to cleanly show not only n's but percents.
```{r}
library(sjPlot)
plot_xtab(dsmall$color, dsmall$cut, margin="row", coord.flip = TRUE)
```
#### Mosaic plots
But what if you want to know how two categorical variables are related and you don't want to look at two different barplots? Mosaic plots are a way to visualize the proportions in a table. So here's the two-way table we'll be plotting.
```{r}
table(dsmall$cut, dsmall$color)
```
The syntax for a mosaic plot uses _model notation_, which is basically y ~ x where the ~ is read as "twiddle" or "tilde". It's to the left of your **1** key.
```{r}
mosaicplot(cut~color, data=dsmall)
```
Helpful, ish. Here are two very useful options. In reverse obviousness, `color` applies shades of gray to one of the factor levels, and `shade` applies a color gradient scale to the cells in order of what is less than expected (red) to what is more than expected (blue) if these two factors were completely independent.
```{r, fig.width=10}
par(mfrow=c(1,2)) # display the plots in 1 row and 2 columns
mosaicplot(cut~color, data=dsmall, color=TRUE)
mosaicplot(cut~color, data=dsmall, shade=TRUE)
```
For example, there are fewer 'Very Good' cut diamonds that are color 'G', and fewer 'Premium' cut diamonds that are color 'H'. As you can see, knowing what your data means when trying to interpret what the plots are telling you is essential.
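To put numbers on "less than expected" and "more than expected", one option is to look at the expected counts and Pearson residuals from `chisq.test()`. As far as I know these are the kind of residuals the shading is based on, but treat that link as my reading of the help page rather than gospel. The sketch uses the full `diamonds` data so the expected counts are not tiny.

```{r}
library(ggplot2)   # only needed here for the diamonds data

tab <- table(diamonds$cut, diamonds$color)
ct <- chisq.test(tab)

round(ct$expected, 1)    # counts expected if cut and color were independent
round(ct$residuals, 1)   # Pearson residuals: (observed - expected) / sqrt(expected)
```

Negative residuals mark cells with fewer diamonds than independence would predict, positive residuals mark cells with more.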
That's about all the ways you can plot categorical variables.
If you are wondering why there were no 3D barcharts demonstrated see
[here](http://faculty.atu.edu/mfinan/2043/section31.pdf),
[here](http://www.bbc.co.uk/schools/gcsebitesize/maths/statistics/representingdata2rev5.shtml), and
[here](https://en.wikipedia.org/wiki/Misleading_graph) for other ways you can really screw up your visualization.
### Continuous v. Continuous
#### Scatterplot
The most common method of visualizing the relationship between two continuous variables is by using a scatterplot.
**base**
Back to the `plot()` command. Here we use model notation again, so it's `y ~ x`.
```{r}
plot(price~carat, data=dsmall)
```
Looks like for the most part as the carat value increases so does price. That makes sense.
**ggplot**
With ggplot we specify both the x and y variables, and add a point.
```{r}
ggplot(dsmall, aes(x=carat, y=price)) + geom_point()
```
**Other Resources**
* http://www.statmethods.net/graphs/scatterplot.html
* https://www.r-bloggers.com/scatterplot-matrices/
#### Adding lines to the scatterplots
The two most common trend lines added to a scatterplot are the "best fit" straight line and the "lowess" smoother line.
**base**
The best fit line (in blue) gets added by using the `abline()` function wrapped around the linear model function `lm()`. Note it uses the same model notation syntax and the `data=` statement as the `plot()` function does. The lowess line is added using the `lines()` function, but the `lowess()` function itself doesn't allow for the `data=` statement so we have to use `$` sign notation.
```{r}
plot(price~carat, data=dsmall)
abline(lm(price~carat, data=dsmall), col="blue")
lines(lowess(dsmall$price~dsmall$carat), col="red")
```
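The blue line is nothing more than the intercept and slope of the fitted linear model. A small sketch, fitted on the full `diamonds` data (my choice; the name `fit` is also mine):

```{r}
library(ggplot2)   # only needed here for the diamonds data

fit <- lm(price ~ carat, data = diamonds)
coef(fit)   # the intercept and slope that abline(fit) draws

plot(price ~ carat, data = diamonds, pch = ".")   # tiny dots, since there are ~54,000 points
abline(fit, col = "blue")
```

Saving the model as an object first means you can reuse it, for example to print `summary(fit)`.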
**ggplot**
With ggplot, we just add a `geom_smooth()` layer.
```{r}
ggplot(dsmall, aes(x=carat, y=price)) + geom_point() + geom_smooth()
```
Here the point-wise confidence interval for this smoothed line is shown in grey. If you want to turn the confidence interval off, use `se=FALSE`. Also notice that `geom_smooth()` does not use the same smoother as the `lowess()` function in base graphics: by default it picks `loess` for fewer than 1,000 observations and a GAM otherwise, so the two curves will not match exactly.
Here it is again using the `ggplot` plotting function and adding another `geom_smooth()` layer for the `lm` (linear model) line in blue, and the lowess line (by not specifying a method) in red.
```{r}
ggplot(dsmall, aes(x=carat, y=price)) + geom_point() +
geom_smooth(se=FALSE, method="lm", color="blue") +
geom_smooth(se=FALSE, color="red")
```
#### Line plots
Line plots connect each dot with a straight line. This is most often done when measuring trends of the response as the value of x increases (such as a time series).
We saw earlier that `carat` and `price` seemed possibly linear. Let's see how the average price changes with carat.
```{r}
library(dplyr)
price.per.carat <- dsmall %>% group_by(carat) %>% summarise(mean = mean(price))
```
**base**
For base graphics, type='b' means both points and lines, 'l' gives you just lines and 'p' gives you only points. You can find more plotting character options under `?pch`.
```{r}
plot(mean~carat, data=price.per.carat, type='l')
```
**ggplot**
With ggplot we specify that we want a line geometry only.
```{r}
ggplot(price.per.carat, aes(x=carat, y=mean)) + geom_line()
```
How does this relationship change with cut of the diamond? First let's
get the average price per combination of carat and cut.
```{r}
ppc2 <- dsmall %>% group_by(cut, carat) %>% summarise(mean = mean(price))
```
**base**
This plot can be created in base graphics, but it takes an advanced
knowledge of the graphics system to do so. So I do not show it here.
**ggplot**
This is where ggplot starts to excel in its ease of creating more
complex plots. All we have to do is specify that we want the lines
colored by the cut variable.
```{r}
ggplot(ppc2, aes(x=carat, y=mean, col=cut)) + geom_line()
```
And we get one line per cut.
### Continuous v. Categorical
Create an appropriate plot for a continuous variable, and plot it for each
level of the categorical variable.
#### Dotplot/strip chart
Dotplots can be very useful when plotting dots against several categories. They can also be called stripcharts.
**base**
```{r}
stripchart(carat ~ cut, data=dsmall)
```
Doesn't look too pretty, but kinda gets the point across. There are few Fair quality diamonds in the data set, and they are pretty spread out across the carat range except for one high end outlier.
**ggplot**
We can reproduce the same thing by plotting one continuous variable against one categorical variable, and adding a layer of points. I'd argue that horizontal looks better due to the axis-labels.
```{r}
library(gridExtra) # provides grid.arrange()
a <- ggplot(dsmall, aes(y=carat, x=cut)) + geom_point()
b <- ggplot(dsmall, aes(y=cut, x=carat)) + geom_point()
grid.arrange(a, b, ncol=2)
```
#### Grouped boxplots
**base**
Base graphics draws grouped boxplots with just the addition of a twiddle (tilde) `~`.
Another example of where model notation works.
```{r}
boxplot(carat~color, data=dsmall)
```
**ggplot**
A simple addition, just define your x and y accordingly.
```{r}
ggplot(dsmall, aes(x=color, y=carat, fill=color)) + geom_boxplot()
```
**Adding violins**
Violin plots can be overlaid here as well.
```{r}
ggplot(dsmall, aes(x=color, y=carat, fill=color)) +
geom_violin(alpha=.1) +
geom_boxplot(alpha=.5, width=.2)
```
#### Grouped histograms
**base**
There is no easy way to create grouped histograms in base graphics, so we will skip it.
**ggplot**
By default ggplot wants to overlay all plots on the same grid. This doesn't look too good with histograms. Instead you can overlay density plots.
```{r, fig.width=10}
a <- ggplot(dsmall, aes(x=carat, fill=color)) + geom_histogram()
b <- ggplot(dsmall, aes(x=carat, fill=color)) + geom_density()
grid.arrange(a,b, ncol=2)
```
The solid fills are still difficult to read, so we can either turn down the alpha (turn up the transparency) or only color the lines and not the fill.
```{r, fig.width=10}
c <- ggplot(dsmall, aes(x=carat, fill=color)) + geom_density(alpha=.2)
d <- ggplot(dsmall, aes(x=carat, col=color)) + geom_density()
grid.arrange(c,d, ncol=2)
```
#### Joy plots / Ridgelines
Somewhat new (2017), ridgeline plots (originally called joy plots) are not part of `ggplot2` itself. Here they come from the `ggjoy` package. They are a really good way to visualize several density plots without the overlapping issue.
```{r, fig.width=10}
library(ggjoy)
ggplot(dsmall, aes(x=carat, y=color)) + geom_joy()
```
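The `ggjoy` package has since been deprecated in favor of `ggridges`. If `ggjoy` gives you trouble, the following sketch is a likely drop-in replacement; it assumes `ggridges` is installed and uses the full `diamonds` data so that it runs on its own.

```{r, fig.width=10}
library(ggplot2)
library(ggridges)   # assumed to be installed; successor to ggjoy

p_ridge <- ggplot(diamonds, aes(x = carat, y = color)) +
  geom_density_ridges()   # one ridgeline (density curve) per color
p_ridge
```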
## Faceting / paneling
This is a good place to introduce a term called `faceting`. The definition is _a particular aspect or feature of something_, or _one side of something many-sided, especially of a cut gem_. Basically instead of plotting the grouped graphics on the same plotting area, we let each group have its own plot, or facet.
We add a `facet_wrap()` and specify that we want to panel on the color group. Note the twiddle in front of color.
```{r}
ggplot(dsmall, aes(x=carat, fill=color)) + geom_density() + facet_wrap(~color)
```
The grid placement can be semi-controlled by using the `ncol` argument in the `facet_wrap()` statement.
```{r, fig.height=6}
ggplot(dsmall, aes(x=carat, fill=color)) + geom_density() + facet_wrap(~color, ncol=4)
```
It is important to compare distributions across groups on the same scale, and our eyes can compare items vertically better than horizontally. So let's force `ncol=1`.
```{r, fig.height=6}
ggplot(dsmall, aes(x=carat, fill=color)) + geom_density() + facet_wrap(~color, ncol=1)
```
## Multiple plots per window
**base**
I use `par(mfrow=c(r,c))` for base graphics, where `r` is the number of rows and `c` the number of columns.
```{r}
par(mfrow=c(1,3))
plot(dsmall$carat)
plot(dsmall$color)
plot(dsmall$price ~ dsmall$carat)
```
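In an interactive session the `mfrow` setting sticks around until you change it back, so every later plot keeps landing in the grid. One common habit is to save the old settings and restore them when you are done. A small sketch using the built-in `mtcars` data (my choice, so it runs on its own):

```{r}
op <- par(mfrow = c(1, 2))      # par() returns the previous settings; save them
plot(mtcars$wt)                 # left panel
plot(mpg ~ wt, data = mtcars)   # right panel
par(op)                         # restore the original single-panel layout
```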
Other resources, including how to use `layout()`: Multipanel plotting with base graphics http://seananderson.ca/courses/11-multipanel/multipanel.pdf
**ggplot**
Use the `grid.arrange` function in the `gridExtra` package. I've done it several times above. You assign each ggplot to an object (here `a` and `b`). Then you use `grid.arrange()` to arrange them either side by side or top and bottom.
```{r}
a <- ggplot(dsmall, aes(x=carat, fill=color)) + geom_density(alpha=.2)
b <- ggplot(dsmall, aes(x=carat, col=color)) + geom_density()
grid.arrange(a,b, ncol=2)
```
## Multivariate (3+ variables)
This is not much more complicated than taking an appropriate bivariate plot and adding a third variable through paneling, coloring, or changing a shape.
This is trivial to do in ggplot, not trivial in base graphics. So I won't show those examples.
### Three continuous
Continuous variables can also be mapped to the size of the point. Here I set the alpha on the points so we could see the overplotting (many points on a single spot). So the darker the spot the more data points on that spot.
```{r}
ggplot(dsmall, aes(x=carat, y=price, size=depth)) + geom_point(alpha=.2)
```
### Scatterplot matrix
A scatterplot matrix allows you to look at the bivariate comparison of multiple pairs of variables simultaneously. First we need to trim down the data set to only include the variables we want to plot, then we use the `pairs()` function.
```{r}
c.vars <- dsmall[,c('carat', 'depth', 'price', 'x', 'y', 'z')]
pairs(c.vars)
```
We can see that `price` has a non-linear relationship with `x`, `y` and `z`, and that `x` and `y` have a near perfect linear relationship with each other.
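The visual impression from a scatterplot matrix can be backed up with a correlation matrix. This is a minimal sketch that assumes the full `diamonds` data from `ggplot2` rather than `dsmall`; the names `c.full` and `cor.mat` are made up for this illustration.

```{r}
library(ggplot2)  # provides the diamonds data

# Same six continuous variables as in the scatterplot matrix
c.full <- as.data.frame(diamonds[, c('carat', 'depth', 'price', 'x', 'y', 'z')])

# Pearson correlation for every pair of variables
cor.mat <- cor(c.full)
round(cor.mat, 2)
```

Keep in mind that the Pearson correlation only measures linear association, so a high value does not rule out the kind of curvature seen in the plots.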
### Two categorical and one continuous
This is very similar to side by side boxplots, one violin plot per `cut`, within each level of color. This is difficult to really see due to the large number of categories each factor has.
```{r}
ggplot(dsmall, aes(x=color, y=price, fill=cut)) + geom_violin()
```
Best bet here would be to panel on color and change the x axis to cut.
```{r}
ggplot(dsmall, aes(x=cut, y=price, fill=cut)) + geom_violin() + facet_wrap(~color)
```
### Two continuous and one categorical
```{r}
a <- ggplot(dsmall, aes(x=carat, y=price, color=cut)) + geom_point() + ggtitle("Colored by cut")
d <- ggplot(dsmall, aes(x=carat, y=price, color=cut)) + geom_point() +
geom_smooth(se=FALSE) +ggtitle("Lowess line per cut")
grid.arrange(a, d, nrow=1)
```
Change the shape
```{r}
ggplot(dsmall, aes(x=carat, y=price, shape=cut)) + geom_point() + ggtitle("Shape by cut")
```
Or we just panel by the third variable
```{r}
ggplot(dsmall, aes(x=carat, y=price)) + geom_point() + facet_wrap(~cut)
```
## Paneling on two variables
Who says we're stuck with only faceting on one variable? A variant on `facet_wrap` is `facet_grid`. Here we can specify multiple variables to panel on.
```{r, fig.width=10, fig.height=5}
ggplot(dsmall, aes(x=carat, fill=color)) + geom_density() + facet_grid(cut~color)
```
How about plotting price against carat, for all combinations of color and clarity, with the points further separated by cut?
```{r}
ggplot(dsmall, aes(x=carat, y=price, color=cut)) + geom_point() + facet_grid(clarity~color)
```
And lastly, let's look back at how we can play with scatterplots using a third categorical variable (using `ggplot2` only). We can color the points by cut,
```{r}
ggplot(dsmall, aes(x=carat, y=price, color=cut)) + geom_point()
```
We could add a smoothing lowess line for each cut separately,
```{r}
ggplot(dsmall, aes(x=carat, y=price, color=cut)) + geom_point() + geom_smooth(se=FALSE)
```
We could change the color by clarity, and shape by cut.
```{r}
ggplot(dsmall, aes(x=carat, y=price, color=clarity, shape=cut)) + geom_point()
```
That's pretty hard to read. So note that just because you **can** change an aesthetic, doesn't mean you should. And just because you can plot things on the same axis, doesn't mean you have to.
Before you share your plot with any other eyes, always take a step back and try to explain what it is telling you. If you have to take more than a minute to get to the point then it may be too complex and simpler graphics are likely warranted.
-----
## Troubleshooting
**Problem:** Missing data showing up as a category in ggplot?
```{r, echo=FALSE, warning=FALSE, message=FALSE}
NCbirths <- read.csv("https://norcalbiostat.netlify.com/data/NCbirths.csv", header=TRUE)
email <- read.table("https://norcalbiostat.netlify.com/data/email.txt", header=TRUE, sep="\t")
```
Get rid of that far right bar!
```{r}
ggplot(NCbirths, aes(x=marital)) + geom_bar()
```
**Solution:** Use `dplyr` to select only the variables you are going to plot, then pipe in `na.omit()` at the end. This creates a temporary data frame (e.g. `plot.data`) that you then provide to `ggplot()`.
```{r}
plot.data <- NCbirths %>% select(marital) %>% na.omit()
ggplot(plot.data, aes(x=marital)) + geom_bar()
```
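The reason to `select()` before `na.omit()` is that `na.omit()` drops every row that has a missing value in any column. A toy illustration, using a made-up data frame `toy` (not part of the course data):

```{r}
library(dplyr)

# Hypothetical data: one missing marital status, one missing weight
toy <- data.frame(marital = c("married", "not married", NA, "married"),
                  weight  = c(7.1, NA, 6.5, 8.0))

# Selecting first: only rows with missing marital are dropped
plot.toy <- toy %>% select(marital) %>% na.omit()
nrow(plot.toy)

# Without selecting: the row with a missing weight is lost as well
nrow(na.omit(toy))
```

Selecting first keeps as much data as possible for the plot at hand.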
**Problem:** Got numerical binary 0/1 data but want to plot it as categorical?
> Other related error messages:
> * Continuous x aesthetic -- did you forget aes(group=...)?
Consider a continuous variable for the number of characters in an email `num_char`, and a 0/1 binary variable `spam`.
**Solution:** Create a second variable `var_factor` for plotting and keep the binary `var` as 0/1 for analysis.
```{r}
email$spam_cat <- factor(email$spam, labels=c("Ham", "Spam"))
ggplot(email, aes(y=num_char, x=spam_cat)) + geom_boxplot()
```
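To see what `factor()` does with the labels, here is a small sketch on a made-up 0/1 vector (the values are invented for illustration). Passing `levels=` explicitly makes it clear which number gets which label.

```{r}
spam <- c(0, 1, 1, 0, 0)  # hypothetical binary indicator

# 0 is labeled "Ham" and 1 is labeled "Spam", in the order given by levels
spam_cat <- factor(spam, levels = c(0, 1), labels = c("Ham", "Spam"))

# Cross-tabulate to confirm the recode did what we intended
table(spam, spam_cat)
```

Checking a recode with `table()` is a cheap way to catch labels that were assigned in the wrong order.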
**Problem:** You want to change the legend title for a `fill` or `color` scale.
**Solution:** Add the `name=` argument to the scale layer that controls the legend. Here I specified a `fill`, and it was a `discrete` variable, so I use the `scale_fill_discrete()` layer.
```{r}
ggplot(email, aes(y=num_char, x=spam_cat, fill=spam_cat)) + geom_boxplot() +
scale_fill_discrete(name="Ya like Spam?")
```
Here, I `col`ored the points by a discrete variable, so the layer is `scale_color_discrete()`.
```{r}
ggplot(email, aes(x=num_char, y=line_breaks, col=spam_cat)) + geom_point() +
scale_color_discrete(name="Ya like Spam?")
```
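An alternative that does not depend on the type of scale is `labs()`, which sets the title for any aesthetic by name. This is a sketch using the built-in `mtcars` data rather than the `email` data; the object name `p` is made up for this illustration.

```{r}
library(ggplot2)

# labs() takes the aesthetic name (fill, color, x, ...) as the argument name
p <- ggplot(mtcars, aes(x=factor(cyl), y=mpg, fill=factor(cyl))) + geom_boxplot() +
  labs(fill="Cylinders", x="Cylinders")
p
```

This can be handy when you change a variable from discrete to continuous and don't want to rename the scale layer as well.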
**Problem:** You want to add means to boxplots.
Boxplots are great. Even better with violin overlays. Know what makes them even better than butter? Adding a point for the mean. `stat_summary` is the layer you want to add. Check out [this stack overflow post](https://stackoverflow.com/questions/23942959/ggplot2-show-separate-mean-values-in-box-plot-for-grouped-data) for more context.
```{r}
ggplot(email, aes(x=spam_cat, y=num_char, fill=spam_cat)) +
geom_boxplot() +
stat_summary(fun=mean, geom="point", size=3, pch=17, color="blue") # `fun` replaces `fun.y`, deprecated in ggplot2 3.3.0
```
I suggest playing around with `size` and plotting character `pch` to get a feel for how these work. You can also look at `?pch` (and scroll down in the help file) to see the 25 default plotting characters.
## But what about...
* Legend adjustment: remove it, move it to another side, rename it
* Custom specified colors and shapes
Go here http://www.cookbook-r.com/Graphs/ for these.
### Other plots not mentioned
* Heat maps https://www.r-bloggers.com/how-to-make-a-simple-heatmap-in-ggplot2/
* Word clouds https://rpubs.com/brandonkopp/creating-word-clouds-in-r , simpler: http://dangoldin.com/2016/06/06/word-clouds-in-r/
* Interactive plots - Look into the `plotly` package, specifically `plot_ly()` and `ggplotly()`
* the circle type plots
## Additional Resources
For any Google Search - be sure to limit searches to within the past year or so. R packages get updated very frequently, and many functions change or become obsolete.
* R Graphics: https://www.stat.auckland.ac.nz/~paul/RGraphics/rgraphics.html The best book about using base graphics
* R Graphics Cookbook: http://www.cookbook-r.com/Graphs/ or http://amzn.com/1449316956 The best book for using ggplot2
* STHDA: Statistical tools for high-throughput data analysis. http://www.sthda.com/english/
* Quick-R: [Basic Graphs](http://www.statmethods.net/graphs/index.html)
* Quick-R: [ggplot2](http://www.statmethods.net/advgraphs/ggplot2.html)
* Books
- ggplot2 http://ggplot2.org/book/ or http://amzn.com/0387981403
- qplot http://ggplot2.org/book/qplot.pdf
* Help lists
- ggplot2 mailing list http://groups.google.com/group/ggplot2
- stackoverflow http://stackoverflow.com/tags/ggplot2
- Chico R users group