forked from edzer/sdsr
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy path15-Measures.qmd
704 lines (571 loc) · 44.4 KB
/
15-Measures.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
# Measures of Spatial Autocorrelation {#sec-spatautocorr}
\index{spatial autocorrelation!measures of}
```{r setup_sa1, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, paged.print = FALSE)
owidth <- getOption("width")
xargs <- function(x) {
o <- capture.output(args(x))
oo <- gsub(" *", " ", paste(o[-length(o)], collapse=""))
ooo <- strwrap(oo, width=getOption("width"), indent=1, exdent=3)
cat(paste(ooo, collapse="\n"), "\n")
}
```
```{r echo=FALSE}
load("ch14.RData")
```
<!---
might options("width") be set to a smaller value to prevent right margin
overruns in cals to arg(), etc.? Maybe in each chapter? in R 80,
RStudio 82 for this session.
--->
When analysing areal data, it has long been recognised that, if present, spatial autocorrelation changes how we may infer, relative to the default assumption of independent observations. In the presence of spatial autocorrelation, we can predict the values of observation $i$ from the values observed at $j \in N_i$, the set of its proximate neighbours. Early results [@moran48; @geary:54] entered into research practice gradually, for example the social sciences [@duncanetal61]. These results were then collated and extended to yield a set of basic tools of analysis [@cliff+ord:73; @cliff+ord:81].
Cliff and Ord [-@cliff+ord:73] generalised and extended the expression of the spatial weights matrix representation as part of the framework for establishing the distribution theory for join-count, Moran's $I$ and Geary's $C$ statistics. This development of what have become known as global measures, returning a single value of autocorrelation for the total study area, has been supplemented by local measures returning values for each areal unit [@getis+ord:92; @anselin:95].
\index{spatial weights matrix}
The measures offered by the **spdep** package have been written partly to provide implementations, but also to permit the comparative investigation of these measures and their implementation. For this reason, the implementations are written in R rather than compiled code, and they are generally slower but more flexible than implementations in the newly released **rgeoda** package [@rgeoda-package; @https://doi.org/10.1111/gean.12311].
\index{rgeoda}
## Measures and process misspecification {#sec-measprocmisspec}
It is not and has never been the case that Tobler's first law of geography, "Everything is related to everything else, but near things are more related than distant things", always holds absolutely. This is and has always been an oversimplification, disguising possible underlying entitation, support, and other misspecification problems. Are the units of observation appropriate for the scale of the underlying spatial process? Could the spatial patterning of the variable of interest for the chosen entitation be accounted for by another variable?
@10.2307/143141 was published in the same special issue of *Economic Geography* as @10.2307/143140, but Olsson does grasp the important point that spatial autocorrelation is not inherent in spatial phenomena, but often, is engendered by inappropriate entitation, by omitted variables and/or inappropriate functional form. The key quote from Olsson is on p. 228:
> The existence of such autocorrelations makes it tempting to agree with Tobler (1970, 236 [the original refers to the pagination of a conference paper]) that 'everything is related to everything else, but near things are more related than distant things.' On the other hand, the fact that the autocorrelations seem to hide systematic specification errors suggests that the elevation of this statement to the status of 'the first law of geography' is at best premature. At worst, the statement may represent the spatial variant of the post hoc fallacy, which would mean that coincidence has been mistaken for a causal relation.
The status of the "first law" is very similar to the belief that John Snow induced from a map the cause of cholera as water-borne. It may be a good way of selling GIS, but it is inaccurate: Snow had a strong working hypothesis prior to visiting Soho, and the map was prepared after the Broad Street pump was disabled as documentation that his hypothesis held [@BRODY200064].
Measures of spatial autocorrelation unfortunately pick up other misspecifications in the way that we model data [@schabenberger+gotway:2005; @McMillen:2003]. For reference, Moran's $I$ is given as [@cliff+ord:81, page 17]:
\index{Moran's I}
$$
I = \frac{n \sum_{(2)} w_{ij} z_i z_j}{S_0 \sum_{i=1}^{n} z_i^2}
$$
where $x_i, i=1, \ldots, n$ are $n$ observations on the numeric variable of interest, $z_i = x_i - \bar{x}$, $\bar{x} = \sum_{i=1}^{n} x_i / n$, $\sum_{(2)} = \stackrel{\sum_{i=1}^{n} \sum_{j=1}^{n}}{i \neq j}$, $w_{ij}$ are the spatial weights, and $S_0 = \sum_{(2)} w_{ij}$.
First we test a random variable using the Moran test, here under the normality assumption (argument `randomisation=FALSE`, default `TRUE`). Inference is made on the statistic $Z(I) = \frac{I - E(I)}{\sqrt{\mathrm{Var}(I)}}$, the z-value compared with the Normal distribution for $E(I)$ and $\mathrm{Var}(I)$ for the chosen assumptions; this `x` does not show spatial autocorrelation with these spatial weights:
```{r}
library(spdep) |> suppressPackageStartupMessages()
library(parallel)
glance_htest <- function(ht) c(ht$estimate,
"Std deviate" = unname(ht$statistic),
"p.value" = unname(ht$p.value))
set.seed(1)
(pol_pres15 |>
nrow() |>
rnorm() -> x) |>
moran.test(lw_q_B, randomisation = FALSE,
alternative = "two.sided") |>
glance_htest()
```
The test however detects quite strong positive spatial autocorrelation when we insert a gentle trend into the data, but omit to include it in the mean model, thus creating a missing variable problem but finding spatial autocorrelation instead:
```{r}
beta <- 0.0015
coords |>
st_coordinates() |>
subset(select = 1, drop = TRUE) |>
(function(x) x/1000)() -> t
(x + beta * t -> x_t) |>
moran.test(lw_q_B, randomisation = FALSE,
alternative = "two.sided") |>
glance_htest()
```
If we test the residuals of a linear model including the trend, the apparent spatial autocorrelation disappears:
\index{spatial autocorrelation!of residuals}
```{r}
lm(x_t ~ t) |>
lm.morantest(lw_q_B, alternative = "two.sided") |>
glance_htest()
```
A comparison of implementations of measures of spatial autocorrelation shows that a wide range of measures is available in R in a number of packages, chiefly in the **spdep** package [@R-spdep], and that differences from other implementations can be attributed to design decisions [@Bivand2018]. The **spdep** package also includes the only implementations of exact and saddlepoint approximations to global and local Moran's I for regression residuals [@tiefelsdorf:02; @bivandetal:09].
\index{Moran's I!approximations}
\index{Moran's I!global or local}
## Global measures
Global measures consider the average level of spatial autocorrelation across all observations; they can of course be biased (as most spatial statistics) by edge effects where important spatial process components fall outside the study area.
### Join-count tests for categorical data
\index{joint-count test}
We will begin by examining join-count statistics, where `joincount.test` takes a `"factor"` vector of values `fx=` and a `listw` object, and returns a list of `htest` (hypothesis test) objects defined in the **stats** package, one `htest` object for each level of the `fx=` argument. The observed counts are of neighbours with the same factor levels, known as same-colour joins.
\index[function]{joincount.test}
```{r, echo=TRUE, eval=FALSE}
args(joincount.test)
```
```{r, echo=FALSE, eval=TRUE}
xargs(joincount.test)
```
The function takes an `alternative=` argument for hypothesis testing, a `sampling=` argument showing the basis for the construction of the variance of the measure, where the default `"nonfree"` choice corresponds to analytical permutation; the `spChk=` argument is retained for backward compatibility. For reference, the counts of factor levels for the type of municipality or Warsaw borough are:
```{r}
(pol_pres15 |>
st_drop_geometry() |>
subset(select = types, drop = TRUE) -> Types) |>
table()
```
Since there are four levels, we rearrange the list of `htest` objects to give a matrix of estimated results. The observed same-colour join-counts are tabulated with their expectations based on the counts of levels of the input factor, so that few joins would be expected between for example Warsaw boroughs, because there are very few of them. The variance calculation uses the underlying constants of the chosen `listw` object and the counts of levels of the input factor. The z-value is obtained in the usual way by dividing the difference between the observed and expected join-counts by the square root of the variance.
The join-count test was subsequently adapted for multi-colour join-counts [@upton+fingleton:85]. The implementation as `joincount.multi` in **spdep** returns a table based on non-free sampling, and does not report p-values.
\index[function]{joincount.multi}
```{r}
Types |> joincount.multi(listw = lw_q_B)
```
So far, we have used binary weights, so the sum of join-counts multiplied by the weight on that join remains integer. If we change to row standardised weights, where the weights are almost always fractions of 1, the counts, expectations and variances change, but there are few major changes in the z-values.
Using an inverse distance based `listw` object does, however, change the z-values markedly, because closer centroids are up-weighted relatively strongly:
```{r}
Types |> joincount.multi(listw = lw_d183_idw_B)
```
### Moran's $I$
The implementation of Moran's $I$ in **spdep** in the `moran.test` function has similar arguments to those of `joincount.test`, but `sampling=` is replaced by `randomisation=` to indicate the underlying analytical approach used for calculating the variance of the measure. It is also possible to use ranks rather than numerical values [@cliff+ord:81, p. 46]. The `drop.EI2=` argument may be used to reproduce results where the final component of the variance term is omitted as found in some legacy software implementations.
```{r, echo=TRUE, eval=FALSE}
args(moran.test)
```
```{r, echo=FALSE, eval=TRUE}
xargs(moran.test)
```
\index[function]{moran.test}
The default for the `randomisation=` argument is `TRUE`, but here we will simply show that the test under normality is the same as a test of least squares residuals with only the intercept used in the mean model. The analysed variable is first-round turnout proportion of registered voters in municipalities and Warsaw boroughs in the 2015 Polish presidential election. The spelling of randomisation is that of Cliff and Ord [-@cliff+ord:73].
```{r}
pol_pres15 |>
st_drop_geometry() |>
subset(select = I_turnout, drop = TRUE) -> I_turnout
```
```{r}
I_turnout |> moran.test(listw = lw_q_B, randomisation = FALSE) |>
glance_htest()
```
\index[function]{lm.morantest}
The `lm.morantest` function also takes a `resfun=` argument to set the function used to extract the residuals used for testing, and clearly lets us model other salient features of the response variable [@cliff+ord:81, p. 203]. To compare with the standard test, we are only using the intercept here and, as can be seen, the results are the same.
```{r}
lm(I_turnout ~ 1, pol_pres15) |>
lm.morantest(listw = lw_q_B) |>
glance_htest()
```
The only difference between tests under normality and randomisation is that an extra term is added if the kurtosis of the variable of interest indicates a flatter or more peaked distribution, where the measure used is the classical measure of kurtosis. Under the default randomisation assumption of analytical randomisation, the results are largely unchanged.
```{r}
(I_turnout |>
moran.test(listw = lw_q_B) -> mtr) |>
glance_htest()
```
\index[function]{glance\_htest}
From the very beginning in the early 1970s, interest was shown in Monte Carlo tests, also known as Hope-type tests and as permutation bootstrap. By default, `moran.mc` returns a `"htest"` object, but may simply use `boot::boot` internally and return a `"boot"` object when `return_boot=TRUE`. In addition the number of simulations needs to be given as `nsim=`; that is the number of times the values of the observations are shuffled at random.
\index[function]{moran.mc}
```{r}
set.seed(1)
I_turnout |>
moran.mc(listw = lw_q_B, nsim = 999,
return_boot = TRUE) -> mmc
```
The bootstrap permutation retains the outcomes of each of the random permutations, reporting the observed value of the statistic, here Moran's $I$, the difference between this value and the mean of the simulations under randomisation (equivalent to $E(I)$), and the standard deviation of the simulations under randomisation.
If we compare the Monte Carlo and analytical variances of $I$ under randomisation, we typically see few differences, arguably rendering Monte Carlo testing unnecessary.
\index{Geary's C}
\index{Moran's I!empirical Bayes}
\index{Getis-Org G}
\index[function]{geary.test}
```{r}
c("Permutation bootstrap" = var(mmc$t),
"Analytical randomisation" = unname(mtr$estimate[3]))
```
Geary's global $C$ is implemented in `geary.test` largely following the same argument structure as `moran.test`. The Getis-Ord $G$ test includes extra arguments to accommodate differences between implementations, as @Bivand2018 found multiple divergences from the original definitions, often to omit no-neighbour observations generated when using distance band neighbours. It is given by @getis+ord:92, on page 194. For $G^*$, the $i \neq j$ summation constraint is relaxed by including $i$ as a neighbour of itself (thereby also removing the no-neighbour problem, because all observations have at least one neighbour).
Finally, the empirical Bayes Moran's $I$ takes account of the denominator in assessing spatial autocorrelation in rates data [@assuncao+reis:99]. Until now, we have considered the proportion of valid votes cast in relation to the numbers entitled to vote by spatial entity, but using `EBImoran.mc` we can try to accommodate uncertainty in extreme rates in entities with small numbers entitled to vote. There is, however, little impact on the outcome in this case.
Global measures of spatial autocorrelation using spatial weights objects based on graphs of neighbours are, as we have seen, rather blunt tools, which for interpretation depend critically on a reasoned mean model of the variable in question. If the mean model is just the intercept, the global measures will respond to all kinds of misspecification, not only spatial autocorrelation. The choice of entities for aggregation of data will typically be a key source of misspecification.
## Local measures
\index{spatial autocorrelation!local measures of}
Building on insights from the weaknesses of global measures, local indicators of spatial association began to appear in the first half of the 1990s [@anselin:95; @getis+ord:92; @getis+ord:96].
In addition, the Moran plot was introduced, plotting the values of the variable of interest against their spatially lagged values, typically using row-standardised weights to make the axes more directly comparable [@anselin:96]. The `moran.plot` function also returns an influence measures object used to label observations exerting more than proportional influence on the slope of the line representing global Moran's $I$. In @fig-moranplot, we can see that there are many spatial entities exerting such influence. These pairs of observed and lagged observed values make up in aggregate the global measure, but can also be explored in detail. The quadrants of the Moran plot also show low-low pairs in the lower left quadrant, high-high in the upper right quadrant, and fewer low-high and high-low pairs in the upper left and lower right quadrants. In `moran.plot`, the quadrants are split on the means of the variable and its spatial lag; alternative splits are on zero for the centred variable and the spatial lag of the centred variable.
\index[function]{moran.plot}
```{r fig-moranplot, echo=!knitr::is_latex_output()}
#| fig.cap: "Moran plot of I round turnout, row standardised weights"
#| code-fold: true
I_turnout |>
moran.plot(listw = lw_q_W, labels = pol_pres15$TERYT,
cex = 1, pch = ".", xlab = "I round turnout",
ylab = "lagged turnout") -> infl_W
```
If we extract the hat value influence measure from the returned object, @fig-moranhat suggests that some edge entities exert more than proportional influence (perhaps because of row standardisation), as do entities in or near larger urban areas.
```{r fig-moranhat, echo=!knitr::is_latex_output()}
#| fig.cap: "Moran plot hat values, row standardised neighbours"
library(tmap)
pol_pres15$hat_value <- infl_W$hat
tm_shape(pol_pres15) + tm_fill("hat_value")
```
### Local Moran's $I_i$
\index{Moran's I!local}
@Bivand2018 discuss issues impacting the use of local indicators, such as local Moran's $I_i$ and local Getis-Ord $G_i$. Some issues affect the calculation of the local indicators, others inference from their values. Because $n$ statistics may be calculated from the same number of observations, there are multiple comparison problems that need to be addressed. @https://doi.org/10.1111/j.0016-7363.2006.00682.x conclude, based on a typical dataset and a simulation exercise, that the false discovery rate (FDR) adjustment of probability values will certainly give a better picture of interesting clusters than no adjustment. Following this up, @https://doi.org/10.1111/gean.12164 explores the combination of FDR adjustments with the use of redefined "significance" cutoffs [@benjaminetal:18], for example $0.01$, $0.005$, and $0.001$ instead of $0.1$, $0.05$, and $0.01$; the use of the term *interesting* rather than *significant* is also preferred. This is discussed further in @bivand:22. As in the global case, misspecification remains a source of confusion, and, further, interpreting local spatial autocorrelation in the presence of global spatial autocorrelation is challenging [@ord+getis:01; @tiefelsdorf:02; @bivandetal:09].
```{r, echo=TRUE, eval=FALSE}
args(localmoran)
```
```{r, echo=FALSE, eval=TRUE}
xargs(localmoran)
```
\index[function]{localmoran}
In an important clarification, @sauer_oshan_rey_wolf_2021 show that the comparison of standard deviates for local Moran's $I_i$ based on analytical formulae and conditional permutation in @Bivand2018 was based on a misunderstanding. @sokaletal:98 provide alternative analytical formulae for standard deviates of local Moran's $I_i$ based either on total or conditional permutation, but the analytical formulae used in @Bivand2018, based on earlier practice, only use total permutation, and consequently do not match the simulation conditional permutations. Thanks to a timely pull request, `localmoran` now has a `conditional=` argument (default `TRUE`) using alternative formulae from the appendix of @sokaletal:98. The `mlvar=` and `adjust.x=` arguments to `localmoran` are discussed in @Bivand2018, and permit matching with other implementations. Taking `"two.sided"` probability values (the default), we obtain:
```{r}
I_turnout |>
localmoran(listw = lw_q_W) -> locm
```
The $I_i$ local indicators when summed and divided by the sum of the spatial weights equal global Moran's $I$, showing the possible presence of positive and negative local spatial autocorrelation:
```{r}
all.equal(sum(locm[,1])/Szero(lw_q_W),
unname(moran.test(I_turnout, lw_q_W)$estimate[1]))
```
Using `stats::p.adjust` to adjust for multiple comparisons, we see that over 15\% of the 2495 local measures have p-values < 0.005 if no adjustment is applied, but only 1.5\% using Bonferroni adjustment to control the family-wise error rate, with two other choices shown: `"fdr"` is the @fdr-BH false discovery rate (almost 6\%) and `"BY"` [@10.1214/aos/1013699998], another false discovery rate adjustment (about 2.5\%):
```{r}
pva <- function(pv) cbind("none" = pv,
"FDR" = p.adjust(pv, "fdr"), "BY" = p.adjust(pv, "BY"),
"Bonferroni" = p.adjust(pv, "bonferroni"))
locm |>
subset(select = "Pr(z != E(Ii))", drop = TRUE) |>
pva() -> pvsp
f <- function(x) sum(x < 0.005)
apply(pvsp, 2, f)
```
In the global measure case, bootstrap permutations may be used as an alternative to analytical methods for possible inference, where both the theoretical development of the analytical variance of the measure, and the permutation scheme, shuffle all of the observed values. In the local case, conditional permutation should be used, fixing the value at observation $i$ and randomly sampling from the remaining $n-1$ values to find randomised values at neighbours. Conditional permutation is provided as function `localmoran_perm`, which may use multiple compute nodes to sample in parallel if provided, and permits the setting of a seed for the random number generator across the compute nodes. The number of simulations `nsim=` also controls the precision of the ranked estimates of the probability value based on the rank of observed $I_i$ among the simulated values:
\index[function]{localmoran\_perm}
```{r, message=FALSE, cache=TRUE}
library(parallel)
invisible(spdep::set.coresOption(max(detectCores()-1L, 1L)))
I_turnout |>
localmoran_perm(listw = lw_q_W, nsim = 9999,
iseed = 1) -> locm_p
```
The outcome is that over 15\% of observations have two sided p-values < 0.005 without multiple comparison adjustment, and about 1.5\% with Bonferroni adjustment, when the p-values are calculated using the standard deviate of the permutation samples and the normal distribution.
```{r}
locm_p |>
subset(select = "Pr(z != E(Ii))", drop = TRUE) |>
pva() -> pvsp
apply(pvsp, 2, f)
```
Since the variable under analysis may not be normally distributed, the p-values can also be calculated by finding the rank of the observed $I_i$ among the rank-based simulated values, and looking up the probability value from the uniform distribution taking the `alternative=` choice into account:
```{r}
locm_p |>
subset(select = "Pr(z != E(Ii)) Sim", drop = TRUE) |>
pva() -> pvsp
apply(pvsp, 2, f)
```
Now the `"BY"` and Bonferroni counts of *interesting* locations are zero with 9999 samples, but may be recovered by increasing the sample count to 999999 if required; the FDR adjustment and *interesting* cutoff 0.005 yields about 5\% locations.
```{r}
pol_pres15$locm_pv <- p.adjust(locm[, "Pr(z != E(Ii))"], "fdr")
pol_pres15$locm_std_pv <- p.adjust(locm_p[, "Pr(z != E(Ii))"],
"fdr")
pol_pres15$locm_p_pv <- p.adjust(locm_p[, "Pr(z != E(Ii)) Sim"],
"fdr")
```
```{r fig-localmoranPr, echo=!knitr::is_latex_output()}
#| code-fold: true
#| fig.cap: "Local Moran's I FDR probability values: left upper panel: analytical conditional p-values; right upper panel: permutation standard deviate conditional p-values; left lower panel: permutation rank conditional p-values, first-round turnout, row-standardised neighbours"
tm_shape(pol_pres15) +
tm_fill(c("locm_pv", "locm_std_pv", "locm_p_pv"),
breaks=c(0, 0.0005, 0.001, 0.005, 0.01,
0.05, 0.1, 0.2, 0.5, 0.75, 1),
title = "Pseudo p-values\nLocal Moran's I",
palette="-YlOrBr") +
tm_facets(free.scales = FALSE, ncol = 2) +
tm_layout(panel.labels = c("Analytical conditional",
"Permutation std. dev.",
"Permutation rank"))
```
Proceeding using the FDR adjustment and an *interesting* location cutoff of $0.005$, we can see from @fig-localmoranPr that the adjusted probability values for the analytical conditional approach, the approach using the moments of the sampled values from permutation sampling, and the approach using the ranks of observed values among permutation samples all yield similar maps, as the distribution of the input variable is quite close to normal.
\index{hotspots}
\index{Moran's I!hotspots}
In presenting local Moran's $I$, use is often made of "hotspot" maps. Because $I_i$ takes high values both for strong positive autocorrelation of low and high values of the input variable, it is hard to show where "clusters" of similar neighbours with low or high values of the input variable occur. The quadrants of the Moran plot are used, by creating a categorical quadrant variable interacting the input variable and its spatial lag split at their means. The quadrant categories are then set to NA if, for the chosen probability value and adjustment, $I_i$ would not be considered *interesting*. Here, for the FDR adjusted conditional analytical probability values (@fig-localmoranPr, upper left panel), 53 observations belong to `"Low-Low"` cluster cores, and 96 to `"High-High"` cluster cores, similarly for the standard deviate-based permutation p-values (@fig-localmoranPr, upper right panel), but the rank-based permutation p-values reduce the `"High-High"` count and increase the `"Low-Low"` count @fig-localmoranPr lower left panel:
```{r}
quadr <- attr(locm, "quadr")$mean
a <- table(addNA(quadr))
locm |> hotspot(Prname="Pr(z != E(Ii))", cutoff = 0.005,
droplevels=FALSE) -> pol_pres15$hs_an_q
locm_p |> hotspot(Prname="Pr(z != E(Ii))", cutoff = 0.005,
droplevels=FALSE) -> pol_pres15$hs_ac_q
locm_p |> hotspot(Prname="Pr(z != E(Ii)) Sim", cutoff = 0.005,
droplevels = FALSE) -> pol_pres15$hs_cp_q
b <- table(addNA(pol_pres15$hs_an_q))
c <- table(addNA(pol_pres15$hs_ac_q))
d <- table(addNA(pol_pres15$hs_cp_q))
t(rbind("Moran plot quadrants" = a, "Analytical cond." = b,
"Permutation std. cond." = c, "Permutation rank cond." = d))
```
```{r}
pol_pres15$hs_an_q <- droplevels(pol_pres15$hs_an_q)
pol_pres15$hs_ac_q <- droplevels(pol_pres15$hs_ac_q)
pol_pres15$hs_cp_q <- droplevels(pol_pres15$hs_cp_q)
```
```{r fig-Iihotspots, echo=!knitr::is_latex_output()}
#| code-fold: true
#| fig.cap: "Local Moran\\'s I FDR hotspot cluster core maps $\\alpha = 0.005$: left upper panel: analytical conditional p-values; right upper panel: permutation standard deviate conditional p-values; left lower panel: permutation rank conditional p-values, first-round turnout, row-standardised neighbours"
tm_shape(pol_pres15) +
tm_fill(c("hs_an_q", "hs_ac_q", "hs_cp_q"),
colorNA = "grey95", textNA="Not \"interesting\"",
title = "Turnout hotspot status \nLocal Moran's I",
palette = RColorBrewer::brewer.pal(4, "Set3")[-c(2,3)]) +
tm_facets(free.scales = FALSE, ncol = 2) +
tm_layout(panel.labels = c("Analytical conditional",
"Permutation std. cond.",
"Permutation rank cond."))
```
@fig-Iihotspots shows that there is very little difference between the FDR-adjusted *interesting* clusters with a choice of an $\alpha=0.005$ probability value cutoff for the three approaches of analytical conditional standard deviates, permutation-based standard deviates, and rank-based probability values; the `"High-High"` cluster cores are metropolitan areas.
@tiefelsdorf:02 argues that standard approaches to the calculation of the standard deviates of local Moran's $I_i$ should be supplemented by numerical estimates, and shows that saddlepoint approximations are a computationally efficient way of achieving this goal. The `localmoran.sad` function takes a fitted linear model as its first argument, so we first fit a null (intercept only) model, but use case weights because the numbers entitled to vote vary greatly between observations:
\index[function]{localmoran.sad}
\index{Moran's I!saddlepoint approximation}
```{r}
lm(I_turnout ~ 1) -> lm_null
```
Saddlepoint approximation is as computationally intensive as conditional permutation, because, rather than computing a simple measure on many samples, a good deal of numerical calculation is needed for each local approximation:
```{r, cache=TRUE}
lm_null |> localmoran.sad(nb = nb_q, style = "W",
alternative = "two.sided") |>
summary() -> locm_sad_null
```
The chief advantage of the saddlepoint approximation is that it takes a fitted linear model rather than simply a numerical variable, so the residuals are analysed. With an intercept-only model, the results are similar to local Moran's $I_i$, but we can weight the observations, here by the count of those entitled to vote, which should down-weight small units of observation:
```{r, cache=TRUE}
lm(I_turnout ~ 1, weights = pol_pres15$I_entitled_to_vote) ->
lm_null_weights
lm_null_weights |>
localmoran.sad(nb = nb_q, style = "W",
alternative = "two.sided") |>
summary() -> locm_sad_null_weights
```
Next we add the categorical variable distinguishing between rural, urban and other types of observational unit:
```{r, cache=TRUE}
lm(I_turnout ~ Types, weights=pol_pres15$I_entitled_to_vote) ->
lm_types
lm_types |> localmoran.sad(nb = nb_q, style = "W",
alternative = "two.sided") |>
summary() -> locm_sad_types
```
```{r}
locm_sad_null |> hotspot(Prname="Pr. (Sad)",
cutoff=0.005) -> pol_pres15$locm_sad0
locm_sad_null_weights |> hotspot(Prname="Pr. (Sad)",
cutoff = 0.005) -> pol_pres15$locm_sad1
locm_sad_types |> hotspot(Prname="Pr. (Sad)",
cutoff = 0.005) -> pol_pres15$locm_sad2
```
```{r fig-localmoranHsad, echo=!knitr::is_latex_output()}
#| code-fold: true
#| fig.cap: "Local Moran\\'s I FDR hotspot cluster core maps, two-sided, *interesting* cutoff $\\alpha = 0.005$: left upper panel: permutation rank conditional p-values; right upper panel: null (intercept only) model saddlepoint p-values; left lower panel: weighted null (intercept only) model saddlepoint p-values; right lower panel: weighted types model saddlepoint p-values, for first-round turnout, row-standardised neighbours"
tm_shape(pol_pres15) +
tm_fill(c("hs_cp_q", "locm_sad0", "locm_sad1", "locm_sad2"),
colorNA = "grey95", textNA = "Not \"interesting\"",
title = "Turnout hotspot status \nLocal Moran's I",
palette =RColorBrewer::brewer.pal(4, "Set3")[c(1, 4, 2)]) +
tm_facets(free.scales = FALSE, ncol = 2) +
tm_layout(panel.labels = c("Permutation rank",
"saddlepoint null", "saddlepoint weighted null",
"saddlepoint weighted types"))
```
```{r}
rbind(null = append(table(addNA(pol_pres15$locm_sad0)),
c("Low-High" = 0), 1),
weighted = append(table(addNA(pol_pres15$locm_sad1)),
c("Low-High" = 0), 1),
type_weighted = append(table(addNA(pol_pres15$locm_sad2)),
c("Low-High" = 0), 1))
```
\newpage
@fig-localmoranHsad includes the permutation rank cluster cores for comparison (upper left panel). Because saddlepoint approximation permits richer mean models to be used, and possibly because the approximation approach is inherently local, relating regression residual values at $i$ to those of its neighbours, the remaining three panels diverge somewhat. The intercept-only (null) model is fairly similar to standard local Moran's $I_i$, but weighting by counts of eligible voters removes most of the `"Low-Low"` cluster cores. Adding the type categorical variable strengthens the urban `"High-High"` cluster cores but removes the Warsaw boroughs as *interesting* cluster cores. The central boroughs are surrounded by other boroughs, all with high turnout, not driven by autocorrelation but by being metropolitan boroughs. It is also possible to use saddlepoint approximation where the global spatial process has been incorporated, removing the conflation of global and local spatial autocorrelation in standard approaches.
The same can also be accomplished using exact methods, but may require more tuning as numerical integration may fail, returning `NaN` rather than the exact estimate of the standard deviate [@bivandetal:09]:
```{r, cache=TRUE}
lm_types |> localmoran.exact(nb = nb_q, style = "W",
alternative = "two.sided", useTP=TRUE, truncErr=1e-8) |>
as.data.frame() -> locm_ex_types
```
```{r}
locm_ex_types |> hotspot(Prname = "Pr. (exact)",
cutoff = 0.005) -> pol_pres15$locm_ex
```
```{r fig-localmoranHsadex, echo=!knitr::is_latex_output()}
#| code-fold: true
#| fig.cap: "Local Moran's I FDR hotspot cluster core maps, two-sided, *interesting* cutoff $\\alpha = 0.005$: left panel: weighted types model saddlepoint p-values; right panel: weighted types model exact p-values, for first-round turnout, row-standardised neighbours"
tm_shape(pol_pres15) +
tm_fill(c("locm_sad2", "locm_ex"), colorNA = "grey95",
textNA = "Not \"interesting\"",
title = "Turnout hotspot status \nLocal Moran's I",
palette = RColorBrewer::brewer.pal(4, "Set3")[c(1, 4, 2)]) +
tm_facets(free.scales = FALSE, ncol = 2) +
tm_layout(panel.labels = c("saddlepoint weighted types",
"Exact weighted types"))
```
As @fig-localmoranHsadex shows, the exact and saddlepoint approximation methods yield almost identical cluster classifications from the same regression residuals, multiple comparison adjustment method, and cutoff level, with the exact method returning four more *interesting* observations:
```{r}
table(Saddlepoint = addNA(pol_pres15$locm_sad2),
exact = addNA(pol_pres15$locm_ex))
```
### Local Getis-Ord $G_i$
\index{Getis-Ord G!local}
\index[function]{localG}
The local Getis-Ord $G_i$ measure [@getis+ord:92; @getis+ord:96] is reported as a standard deviate, and, may also take the $G^*_i$ form where self-neighbours are inserted into the neighbour object using `include.self`. The observed and expected values of local $G$ with their analytical variances may also be returned if `return_internals=TRUE`.
```{r}
I_turnout |>
localG(lw_q_W, return_internals = TRUE) -> locG
```
Permutation inference is also available for this measure:
```{r, cache=TRUE}
I_turnout |>
localG_perm(lw_q_W, nsim = 9999, iseed = 1) -> locG_p
```
The correlation between the two-sided probability values for analytical and permutation-based standard deviates (first two columns and rows) and permutation rank-based probability values are very strong:
```{r}
cor(cbind(localG=attr(locG, "internals")[, "Pr(z != E(Gi))"],
attr(locG_p, "internals")[, c("Pr(z != E(Gi))",
"Pr(z != E(Gi)) Sim")]))
```
### Local Geary's $C_i$
\index{Geary's C!local}
\index[function]{localC\_perm}
@https://doi.org/10.1111/gean.12164 extends @anselin:95 and has been recently added to **spdep** thanks to contributions by Josiah Parry (pull request https://github.com/r-spatial/spdep/pull/66). The conditional permutation framework used for $I_i$ and $G_i$ is also used for $C_i$:
```{r, cache=TRUE}
I_turnout |>
localC_perm(lw_q_W, nsim=9999, iseed=1) -> locC_p
```
The permutation standard deviate-based and rank-based probability values are not as highly correlated as for $G_i$, in part reflecting the difference in view of autocorrelation in $C_i$ as represented by a function of the differences between values rather than the products of values:
```{r}
cor(attr(locC_p, "pseudo-p")[, c("Pr(z != E(Ci))",
"Pr(z != E(Ci)) Sim")])
```
```{r}
locC_p |> hotspot(Prname = "Pr(z != E(Ci)) Sim",
cutoff = 0.005) -> pol_pres15$hs_C
locG_p |> hotspot(Prname = "Pr(z != E(Gi)) Sim",
cutoff = 0.005) -> pol_pres15$hs_G
```
```{r fig-localIGC, echo=!knitr::is_latex_output(), message=FALSE}
#| code-fold: true
#| fig.cap: "FDR hotspot cluster core maps, two-sided, *interesting* cutoff $\\alpha = 0.005$: left panel: local Moran\\'s $I_i$; centre panel: local Getis-Ord $G_i$; right panel: local Geary\\'s $C_i$; first-round turnout, row-standardised neighbours"
m1 <- tm_shape(pol_pres15) +
tm_fill("hs_cp_q",
palette = RColorBrewer::brewer.pal(4, "Set3")[-c(2,3)],
colorNA = "grey95", textNA = "Not \"interesting\"",
title = "Turnout hotspot status\nLocal Moran I") +
tm_layout(legend.outside=TRUE, legend.outside.position="bottom")
m2 <- tm_shape(pol_pres15) +
tm_fill("hs_G",
palette = RColorBrewer::brewer.pal(4, "Set3")[-c(2,3)],
colorNA = "grey95", textNA="Not \"interesting\"",
title = "Turnout hotspot status\nLocal Getis/Ord G") +
tm_layout(legend.outside=TRUE, legend.outside.position="bottom")
m3 <- tm_shape(pol_pres15) +
tm_fill("hs_C",
palette = RColorBrewer::brewer.pal(4, "Set3")[c(4, 1, 3)],
colorNA = "grey95", textNA="Not \"interesting\"",
title = "Turnout hotspot status\nLocal Geary C") +
tm_layout(legend.outside=TRUE, legend.outside.position="bottom")
tmap_arrange(m1, m2, m3, nrow=1)
```
@fig-localIGC shows that the cluster cores identified as *interesting* using $I_i$, $G_i$ and $C_i$ for the same variable, first-round turnout, and the same spatial weights, for rank-based permutation FDR adjusted probability values and an $\alpha = 0.005$ cutoff, are very similar. In most cases, the `"High-High"` cluster cores are urban areas, and `"Low-Low"` cores are sparsely populated rural areas in the North, in addition to the German national minority areas close to the southern border. The three measures use slightly different strategies for naming cluster cores: $I_i$ uses quadrants of the Moran scatterplot, $G_i$ splits into `"Low"` and `"High"` on the mean of the input variable (which is the same as the first component in the $I_i$ tuple), and univariate $C_i$ on the mean of the input variable and zero for its lag. As before, cluster categories that do not occur are dropped.
For comparison, and before moving to multivariate $C_i$, let us take the univariate $C_i$ for the second (final) round turnout. One would expect that the run-off between the two top candidates from the first-round might mobilise some voters who did not have a clear first-round preference, but that it discourages some of those with strong loyalty to a candidate eliminated after the first round:
```{r, cache=TRUE}
pol_pres15 |>
st_drop_geometry() |>
subset(select = II_turnout) |>
localC_perm(lw_q_W, nsim=9999, iseed=1) -> locC_p_II
```
```{r}
locC_p_II |> hotspot(Prname = "Pr(z != E(Ci)) Sim",
cutoff = 0.005) -> pol_pres15$hs_C_II
```
Multivariate $C_i$ [@https://doi.org/10.1111/gean.12164] is taken as the sum of univariate $C_i$ divided by the number of variables, but permutation is fixed so that the correlation between the variables is unchanged:
```{r, cache=TRUE}
pol_pres15 |>
st_drop_geometry() |>
subset(select = c(I_turnout, II_turnout)) |>
localC_perm(lw_q_W, nsim=9999, iseed=1) -> locMvC_p
```
Let us check that the multivariate $C_i$ is equal to the mean of the univariate $C_i$:
```{r}
all.equal(locMvC_p, (locC_p+locC_p_II)/2,
check.attributes = FALSE)
```
```{r}
locMvC_p |> hotspot(Prname = "Pr(z != E(Ci)) Sim",
cutoff = 0.005) -> pol_pres15$hs_MvC
```
```{r fig-localCmulti, echo=!knitr::is_latex_output(), message=FALSE}
#| code-fold: true
#| fig.cap: 'FDR hotspot cluster core maps, two-sided, *interesting* cutoff $\alpha = 0.005$: left panel: local $C_i$, first-round turnout; centre panel: local $C_i$, second-round turnout; right panel: local multivariate $C_i$, both turnout rounds; row-standardised neighbours'
m3 <- tm_shape(pol_pres15) +
tm_fill("hs_C",
palette = RColorBrewer::brewer.pal(4, "Set3")[c(4, 1, 3, 2)],
colorNA = "grey95", textNA = "Not \"interesting\"",
title = "First round turnout\nLocal Geary C") +
tm_layout(legend.outside=TRUE, legend.outside.position="bottom")
m4 <- tm_shape(pol_pres15) +
tm_fill("hs_C_II",
palette = RColorBrewer::brewer.pal(4, "Set3")[c(4, 1, 3, 2)],
colorNA = "grey95", textNA = "Not \"interesting\"",
title="Second round turnout\nLocal Geary C") +
tm_layout(legend.outside=TRUE, legend.outside.position="bottom")
m5 <- tm_shape(pol_pres15) +
tm_fill("hs_MvC",
palette = RColorBrewer::brewer.pal(4, "Set3")[c(4, 1)],
colorNA = "grey95", textNA = "Not \"interesting\"",
title = "Both rounds turnout\nLocal Multivariate Geary C") +
tm_layout(legend.outside=TRUE, legend.outside.position="bottom")
tmap_arrange(m3, m4, m5, nrow=1)
```
@fig-localCmulti indicates that the multivariate measure picks up aggregated elements of observations found *interesting* in the two univariate measures. We can break this down by interacting the first- and second-round univariate measures, and tabulating against the multivariate measure.
```{r}
table(droplevels(interaction(addNA(pol_pres15$hs_C),
addNA(pol_pres15$hs_C_II), sep=":")),
addNA(pol_pres15$hs_MvC))
```
For these permutation outcomes, 47 observations in the multivariate case are found *interesting* where neither of the univariate $C_i$ were found *interesting* (FDR, cutoff $0.005$). Almost all of the observations found *interesting* in both first and second round, are also interesting in the multivariate case, but outcomes are more mixed when observations were only found interesting in one of the two rounds.
### The **rgeoda** package
\index{rgeoda}
Geoda has been wrapped for R as **rgeoda** [@R-rgeoda], and provides very similar functionalities for the exploration of spatial autocorrelation in areal data as matching parts of **spdep**. The active objects are kept as pointers to a compiled code workspace; using compiled code for all operations (as in Geoda itself) makes **rgeoda** perform fast, but makes it less flexible when modifications or enhancements are desired.
\index[function]{queen\_weights}
```{r, message=FALSE}
library(rgeoda)
Geoda_w <- queen_weights(pol_pres15)
summary(Geoda_w)
```
For comparison, let us take the multivariate $C_i$ measure of turnout in the two rounds of the 2015 Polish presidential election as above:
\index[function]{local\_multigeary}
```{r, message=FALSE}
lisa <- local_multigeary(Geoda_w,
pol_pres15[c("I_turnout", "II_turnout")],
cpu_threads = max(detectCores() - 1, 1),
permutations = 99999, seed = 1)
```
The contiguity neighbours are the same as those found by `poly2nb`:
\index[function]{lisa\_num\_nbrs}
```{r, message=FALSE}
all.equal(card(nb_q), lisa_num_nbrs(lisa),
check.attributes = FALSE)
```
as are the multivariate $C_i$ values the same as those found above:
\index[function]{lisa\_values}
```{r, message=FALSE}
all.equal(lisa_values(lisa), c(locMvC_p),
check.attributes = FALSE)
```
One difference is that the range of the folded two-sided rank-based permutation probability values used by **rgeoda** is $[0, 0.5]$, also reported in **spdep**:
```{r}
apply(attr(locMvC_p, "pseudo-p")[,c("Pr(z != E(Ci)) Sim",
"Pr(folded) Sim")], 2, range)
```
This means that the cutoff corresponding to $0.005$ over $[0, 1]$ is $0.0025$ over $[0, 0.5]$:
```{r}
locMvC_p |> hotspot(Prname = "Pr(folded) Sim",
cutoff = 0.0025) -> pol_pres15$hs_MvCa
```
So although `local_multigeary` used the default cutoff of $0.05$ in setting cluster core classes, we can sharpen the cutoff and apply the FDR adjustment on output components of the `lisa` object in the compiled code workspace:
```{r}
mvc <- factor(lisa_clusters(lisa), levels=0:2,
labels = lisa_labels(lisa)[1:3])
is.na(mvc) <- p.adjust(lisa_pvalues(lisa), "fdr") >= 0.0025
pol_pres15$geoda_mvc <- droplevels(mvc)
```
About 80 more observations are found *interesting* in the **rgeoda** permutation, and further analysis of implementation details is still in progress:
```{r}
addmargins(table(spdep = addNA(pol_pres15$hs_MvCa),
rgeoda = addNA(pol_pres15$geoda_mvc)))
```
```{r fig-localCmultix, echo=!knitr::is_latex_output(), message=FALSE}
#| out.width: 100%
#| code-fold: true
#| fig.cap: 'FDR local multivariate $C_i$ hotspot cluster core maps, two-sided, *interesting* cutoff $\alpha = 0.0025$ over $[0, 0.5]$: left panel: **spdep**, both turnout rounds; right panel: **rgeoda**, both turnout rounds; row-standardised neighbours'
m5 <- tm_shape(pol_pres15) +
tm_fill("hs_MvCa",
palette = RColorBrewer::brewer.pal(4, "Set3")[c(4, 1)],
colorNA = "grey95", textNA = "Not \"interesting\"",
title = "Both rounds turnout spdep\nLocal Multivariate Geary C")
m6 <- tm_shape(pol_pres15) +
tm_fill("geoda_mvc",
palette = RColorBrewer::brewer.pal(4, "Set3")[c(4, 1)],
colorNA = "grey95", textNA = "Not \"interesting\"",
title="Both rounds turnout rgeoda\nLocal Multivariate Geary C")
tmap_arrange(m5, m6, nrow=1)
```
@fig-localCmultix shows that while almost all of the 242 observations found *interesting* in the **spdep** implementation were also *interesting* for **rgeoda**, the latter found a further 86 *interesting*. Of course, permutation outcomes are bound to vary, but it remains to establish whether either or both implementations require revision.
## Exercises
1. Why are join-count measures on a chessboard so different between `rook` and `queen` neighbours?
2. Please repeat the simulation shown in @sec-measprocmisspec using the chessboard polygons and the row-standardised `queen` contiguity neighbours. Why is it important to understand that spatial autocorrelation usually signals (unavoidable) misspecification in our data?
3. Why is false discovery rate adjustment recommended for local measures of spatial autocorrelation?
4. Compare the local Moran's $I_i$ standard deviate values for the simulated data from exercise 2 (above) for the analytical conditional approach, and saddlepoint approximation. Consider the advantages and disadvantages of the saddlepoint approximation approach.
{{< include ga.qmd >}}