```{r include=FALSE, eval = TRUE}
knitr::opts_chunk$set(eval = FALSE)
source("r/render.R")
styles <- "
#padding: 16
#fontSize: 18
#direction: right
#lineWidth:2
#leading:1
#spacing: 20
#.hidden: visual=hidden
"
```
# Connections {#connections}
> They don’t get to choose.
>
> --- Daenerys Targaryen
[Chapter 6](#clusters) presented<!--((("clusters", "connecting to online", id="Conline07")))--> the major cluster computing trends, cluster managers, distributions, and cloud service providers to help you choose the Spark cluster that best suits your needs. In contrast, this chapter presents the internal components of a Spark cluster and how to connect to a particular Spark cluster.<!--((("clusters", "connecting to online", see="also connections")))-->
When reading this chapter, don’t try to execute every line of code; this would be quite hard since you would need to prepare different Spark environments. Instead, if you already have a Spark cluster or if the previous chapter gets you motivated enough to sign up for an on-demand cluster, now is the time to learn how to connect to it. This chapter helps you connect to your cluster, which you should have already chosen by now. Without a cluster, we recommend that you learn the concepts and come back to execute code later on.
In addition, this chapter provides various techniques for troubleshooting connections. While we hope you won’t need them, this chapter prepares you to use them effectively to resolve connectivity issues.
While<!--((("Apache Spark", "architecture")))--> this chapter might feel a bit dry—connecting and troubleshooting connections is definitely not the most exciting part of large-scale computing—it introduces the components of a Spark cluster and how they interact, often known as the _architecture_ of Apache Spark. This chapter, along with Chapters [8](#data) and
[9](#tuning), will provide a detailed view of how Spark works, which will help you move toward becoming an intermediate Spark user who can truly understand the exciting world of distributed computing using Apache Spark.
## Overview {#connections-overview}
The<!--((("connections", "overview of")))((("driver nodes")))((("worker nodes")))((("cluster managers", "purpose of")))--> overall connection architecture for a Spark cluster is composed of three types of compute instances: the _driver node_, the _worker nodes_, and the _cluster manager_. A cluster manager is a service that allows Spark to be executed in the cluster; this was detailed in [Clusters - Managers](#clusters-manager). The<!--((("executors")))--> worker nodes (also referred to as _executors_) execute compute tasks over partitioned data and communicate intermediate results to other workers or back to the driver node. The driver node is tasked with delegating work to the worker nodes, but also with aggregating their results and controlling computation flow. For the most part, aggregation happens in the worker nodes; however, even after the nodes aggregate data, it is often the case that the driver node needs to collect the workers’ results. Therefore, the driver node usually has at least as many compute resources (memory, CPUs, local storage, etc.) as a worker node, and often many more.
Strictly<!--((("Spark context")))((("spark executor")))--> speaking, the driver node and worker nodes are just names assigned to machines with particular roles, while the actual computation in the driver node is performed by the _Spark context_. The Spark context is the main entry point for Spark functionality since it’s tasked with scheduling tasks, managing storage, tracking execution status, specifying access configuration settings, canceling jobs, and so on. In the worker nodes, the actual computation is performed under a _spark executor_, which is a Spark component tasked with executing subtasks against a specific data partition.
Figure \@ref(fig:connections-architecture) illustrates this concept, where the driver node orchestrates a worker’s work through the cluster manager.
```{r connections-architecture, eval=TRUE, echo=FALSE, message=FALSE, fig.align = 'center', fig.cap='Apache Spark connection architecture', out.width = 'auto', out.height = '280pt'}
render_nomnoml("
[Driver | [Spark Context]]
[Driver]-[Cluster Manager]
[Cluster Manager]-[Worker (1) | [Spark Executor(s)]]
[Cluster Manager]-[Worker (2) | [Spark Executor(s)]]
[Cluster Manager]-[Worker (3) | [Spark Executor(s)]]
", "images/connections-spark-architecture.png", "Apache Spark Architecture", styles)
```
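The division of labor described above can be mimicked in plain R, with no Spark involved. The following is only a conceptual sketch: the "workers" are ordinary function calls over list elements, and the names `partitions`, `worker_task`, and `driver_result` are made up for illustration.

```r
# Conceptual sketch in base R; no Spark cluster is involved.
# Split the data into three partitions, one per simulated worker.
data <- 1:100
partitions <- split(data, cut(seq_along(data), 3, labels = FALSE))

# Each simulated worker aggregates only its own partition.
worker_task <- function(partition) sum(partition)
partial_results <- lapply(partitions, worker_task)

# The simulated driver collects the partial results and combines them.
driver_result <- Reduce(`+`, partial_results)
driver_result
```

Here `driver_result` equals `sum(1:100)`; the point is that most of the aggregation happens per partition, and the driver only combines three small partial results.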
If you already have a Spark cluster in your organization, you should request the connection information to this cluster from your cluster administrator, read their usage policies carefully, and follow their advice. Since a cluster can be shared among many users, you want to ensure that you request only the compute resources you need. We cover how to request resources in [Chapter 9](#tuning). Your system administrator will specify whether it’s an _on-premises_ or _cloud_ cluster, the cluster manager being used, supported _connections_, and supported _tools_. You can use this information to jump directly to [Standalone](#connections-standalone), [YARN](#connections-yarn), [Mesos](#connections-mesos), [Livy](#connections-livy), or [Kubernetes](#connections-kubernetes) based on which is appropriate for your situation.
**Note:** After<!--((("commands", "spark_connect()")))--> you've used `spark_connect()` to connect, you can use all the techniques described in previous chapters using the `sc` connection; for instance, you can do data analysis or modeling with the same code previous chapters presented.
### Edge Nodes {#connections-spark-edge-nodes}
Computing clusters<!--((("edge nodes")))((("connections", "to edge nodes")))--> are configured to enable high bandwidth and fast network connectivity between nodes. To optimize network connectivity, the nodes in the cluster are configured to trust one another and to disable security features. This improves performance but requires you to close all external network communication, making the entire cluster secure as a whole except for a few cluster machines that are carefully configured to accept connections from outside the cluster; conceptually, these machines are located in the "edge" of the cluster and are known as _edge nodes_.
Therefore, before connecting to Apache Spark, it is likely that you will first need to connect to an edge node in your cluster. There are two methods to connect:
Terminal
: Using a computer terminal application, you can use a [Secure Shell](http://bit.ly/2TE8cY9) to establish a remote connection into the cluster; after you connect into the cluster, you can launch R and then use `sparklyr`. However, a terminal can be cumbersome for some tasks, like exploratory data analysis, so it’s often used only while configuring the cluster or troubleshooting issues.
Web Browser
: While using `sparklyr` from a terminal is possible, it is usually more productive to install a _web server_ in an edge node that provides access to run R with `sparklyr` from a web browser. Most likely, you will want to consider using RStudio or Jupyter rather than connecting from the terminal.
Figure \@ref(fig:connections-spark-edge) explains these concepts visually. The left block is usually your web browser, and the right block is the edge node. Client and edge nodes communicate over HTTP when using a web browser or Secure Shell (SSH) when using the terminal.
```{r connections-spark-edge, eval=TRUE, echo=FALSE, message=FALSE, fig.align = 'center', fig.cap='Connecting to Spark’s edge node', out.height = '240pt', out.width = 'auto'}
render_nomnoml("
[Client|
[Terminal]
[Web Browser |
[RStudio]
[Jupyter]
]
]-[<hidden> Secure Shell / HTTP]
[Secure Shell / HTTP]-[Edge|
[Secure Shell Server] - [R]
[RStudio Server] - [R]
[Jupyter Server] - [R]
]", "images/connections-spark-edge-node.png", "Connecting to Spark's Edge Node", styles)
```
### Spark Home {#connections-spark-home}
After<!--((("Apache Spark", "home identifier")))((("SPARK_HOME")))--> you connect to an edge node, the next step is to determine where Spark is installed, a location known as the `SPARK_HOME`. In most cases, your cluster administrator will have already set the `SPARK_HOME` environment variable to the correct installation path. If not, you will need to get the correct `SPARK_HOME` path. You must specify the `SPARK_HOME` path as an environment variable or explicitly when running `spark_connect()` using the `spark_home` parameter.
If your cluster provider or cluster administrator already provided `SPARK_HOME` for you, the following code should return a path instead of an empty string:
```{r connection-spark-home-check}
Sys.getenv("SPARK_HOME")
```
If this code returns an empty string, this would mean that the `SPARK_HOME` environment variable is not set in your cluster, so you will need to specify `SPARK_HOME` while using `spark_connect()`, as follows:
```{r connections-spark-home}
sc <- spark_connect(master = "<master>", spark_home = "local/path/to/spark")
```
In this example, `master` is set to the correct cluster manager master for [Spark Standalone](#connections-standalone), [YARN](#connections-yarn), [Mesos](#connections-mesos), [Kubernetes](#connections-kubernetes), or [Livy](#connections-livy).
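The two cases above (environment variable set, or path supplied by hand) can be folded into one small helper. This is a minimal sketch in base R; `resolve_spark_home()` is a hypothetical name and not part of `sparklyr`.

```r
# Hypothetical helper: prefer the SPARK_HOME environment variable and
# fall back to an explicitly supplied path when it is not set.
resolve_spark_home <- function(fallback = NULL) {
  home <- Sys.getenv("SPARK_HOME")
  if (nzchar(home)) {
    return(home)
  }
  if (is.null(fallback)) {
    stop("SPARK_HOME is not set and no fallback path was supplied")
  }
  fallback
}
```

You could then pass the result to the `spark_home` parameter, for instance `spark_connect(master = "<master>", spark_home = resolve_spark_home("local/path/to/spark"))`.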
## Local {#connections-local}
When<!--((("local clusters")))((("connections", "to local clusters")))((("clusters", "connecting to local")))--> you connect to Spark in local mode, Spark starts a single process that runs most of the cluster components like the Spark context and a single executor. This is ideal to learn Spark, work offline, troubleshoot issues, or test code before you run it over a large compute cluster. Figure \@ref(fig:connections-local-diagram) depicts a local connection to Spark.
```{r connections-local-diagram, eval=TRUE, echo=FALSE, message=FALSE, fig.align = 'center', fig.cap='Local connection diagram', out.height='260pt', out.width='auto'}
render_nomnoml("
[Driver|
[R]
[sparklyr]
[spark-submit]
[Spark Context]
[Spark Executor]
]", "images/connections-spark-local.png", "Local Connection Diagram", styles)
```
Notice that there is neither a cluster manager nor worker process since, in local mode, everything runs inside the driver application. It’s<!--((("sparklyr package", "spark-submit script")))((("spark-submit script")))--> also worth noting that `sparklyr` starts the Spark context through `spark-submit`, a script available in every Spark installation to enable users to submit custom applications to Spark. If you're curious, the [Contributing](#contributing) chapter explains the internal processes that take place in `sparklyr` to submit this application and connect properly from R.
To perform this local connection, we can use the following familiar code from previous chapters:
```{r connections-local-connect}
# Connect to local Spark instance
sc <- spark_connect(master = "local")
```
## Standalone {#connections-standalone}
Connecting<!--((("connections", "Spark Standalone")))((("Spark Standalone")))((("Standalone clusters")))((("cluster managers", "Spark Standalone")))--> to a Spark Standalone cluster requires the location of the cluster manager’s master instance, which you can find in the cluster manager web interface as described in the [Clusters - Standalone](#clusters-standalone) section. You can find this location by looking for a URL starting with `spark://`.
A connection in standalone mode starts from `sparklyr`, which launches `spark-submit`, which then submits the `sparklyr` application and creates the Spark Context, which requests executors from the Spark Standalone instance running under the given `master` address.
Figure \@ref(fig:connections-standalone-diagram) illustrates this process, which is quite similar to the overall connection architecture from Figure \@ref(fig:connections-architecture) but with additional details that are particular to standalone clusters and `sparklyr`.
```{r connections-standalone-diagram, eval=TRUE, echo=FALSE, message=FALSE, fig.align = 'center', fig.cap='Spark Standalone connection diagram', out.height='280pt', out.width='auto'}
render_nomnoml("
[Driver |
[R]
[sparklyr]
[spark-submit]
[Spark Context]
]
[Driver]-[Cluster Manager |
[Spark Standalone]]
[Cluster Manager]-[Worker (1) | [Spark Executor(s)]]
[Cluster Manager]-[Worker (2) | [Spark Executor(s)]]
[Cluster Manager]-[Worker (3) | [Spark Executor(s)]]
", "images/connections-spark-standalone.png", "Spark Standalone Connection Diagram", styles)
```
To connect, use `master = "spark://hostname:port"` in `spark_connect()` as follows:
```{r connections-standalone-connect}
sc <- spark_connect(master = "spark://hostname:port")
```
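Since a mistyped master address is a common source of failed connections, it can help to check the string before connecting. The following sketch uses only base R; `parse_standalone_master()` is a hypothetical helper, and it deliberately handles only the single `spark://hostname:port` form shown above. The address in the example call is made up.

```r
# Hypothetical helper: validate a spark://hostname:port master address
# and split it into its host and port components.
parse_standalone_master <- function(master) {
  pattern <- "^spark://([^:/]+):([0-9]+)$"
  if (!grepl(pattern, master)) {
    stop("Expected a master of the form spark://hostname:port")
  }
  list(
    host = sub(pattern, "\\1", master),
    port = as.integer(sub(pattern, "\\2", master))
  )
}

# Example with a made-up address
parse_standalone_master("spark://10.0.0.5:7077")
```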
## YARN {#connections-yarn}
Hadoop YARN<!--((("connections", "YARN")))--> is the cluster manager from the Hadoop project. It’s the cluster manager you are most likely to find in clusters that started out as Hadoop clusters, such as those running the Cloudera, Hortonworks, and MapR distributions, as well as in Amazon EMR. YARN supports two connection modes: YARN client and YARN cluster. However, YARN client mode is much more common than YARN cluster since it’s more efficient and easier to set up.
### YARN Client {#connections-yarn-client}
When<!--((("connections", "YARN")))((("YARN")))((("Hadoop YARN")))--> you connect in YARN client mode, the driver instance runs R, `sparklyr`, and the Spark context, which requests worker nodes from YARN to run Spark executors, as shown in Figure \@ref(fig:connections-yarn-client-diagram).
```{r connections-yarn-client-diagram, eval=TRUE, echo=FALSE, message=FALSE, fig.align = 'center', fig.cap='YARN client connection diagram', out.height='280pt', out.width='auto'}
render_nomnoml("
[Driver |
[R]
[sparklyr]
[spark-submit]
[Spark Context]
]
[Driver]-[Cluster Manager |
[YARN]]
[Cluster Manager]-[Worker (1) | [Spark Executor(s)]]
[Cluster Manager]-[Worker (2) | [Spark Executor(s)]]
[Cluster Manager]-[Worker (3) | [Spark Executor(s)]]
", "images/connections-spark-yarn-client.png", "YARN Client Connection Diagram", styles)
```
To connect, you simply run with `master = "yarn"`, as follows:
```{r connections-yarn-connect}
sc <- spark_connect(master = "yarn")
```
Behind the scenes, when you're running YARN in client mode, the cluster manager will do what you would expect a cluster manager would do: it allocates resources from the cluster and assigns them to your Spark application, which the Spark context will manage for you. The important piece to notice in Figure \@ref(fig:connections-yarn-client-diagram) is that the Spark context resides in the same machine where you run R code; this is different when you're running YARN in cluster mode.
### YARN Cluster {#connections-yarn-cluster}
The<!--((("cluster managers", "Hadoop YARN")))--> main difference between running YARN in cluster mode and running YARN in client mode is that, in cluster mode, the driver node is not required to be the node where R and `sparklyr` were launched; instead, the driver node remains the designated driver node, which is usually a different node than the edge node where R is running. It can be helpful to consider using cluster mode when the edge node has too many concurrent users, when it is lacking computing resources, or when tools (like RStudio or Jupyter) need to be managed independently of other cluster resources.
Figure \@ref(fig:connections-yarn-cluster-diagram) shows how the different components become decoupled when running in cluster mode. Notice there is still a line connecting the client with the cluster manager since, first of all, resources still need to be allocated from the cluster manager; however, after they're allocated, the client communicates directly with the driver node, which communicates with the worker nodes. From Figure \@ref(fig:connections-yarn-cluster-diagram), you might think that cluster mode looks much more complicated than client mode—this would be a correct assessment; therefore, if possible, it’s best to avoid cluster mode due to its additional configuration overhead.
```{r connections-yarn-cluster-diagram, eval=TRUE, echo=FALSE, message=FALSE, fig.align = 'center', fig.cap='YARN cluster connection diagram', out.height='280pt', out.width='auto'}
render_nomnoml("
[Client |
[R]
[sparklyr]
[spark-submit]
]
[Client]-[Cluster Manager]
[Client]-[Driver |
[sparklyr]
[Spark Context]
]
[Cluster Manager |
[YARN]]
[Driver]-[Cluster Manager]
[Cluster Manager]-[Worker (1) | [Spark Executor(s)]]
[Cluster Manager]-[Worker (2) | [Spark Executor(s)]]
[Cluster Manager]-[Worker (3) | [Spark Executor(s)]]
", "images/connections-spark-yarn-cluster.png", "YARN Cluster Connection Diagram", styles)
```
To connect in YARN cluster mode, simply run the following:
```{r connections-yarn-cluster-connect, eval=FALSE}
sc <- spark_connect(master = "yarn-cluster")
```
Cluster mode assumes that the node running `spark_connect()` is properly configured, meaning that `yarn-site.xml` exists and the `YARN_CONF_DIR` environment variable is properly set. When using Hadoop as a file system, you will also need the `HADOOP_CONF_DIR` environment variable properly configured. In addition, you would need to ensure proper network connectivity between the client and the driver node—not just by having both machines reachable, but also by making sure that they have sufficient bandwidth between them. This configuration is usually provided by your system administrator and is not something that you would need to manually configure.
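If you want to confirm these prerequisites yourself before asking your administrator, a quick check from R might look like the following. This is a sketch under the assumptions stated in the paragraph above (that `yarn-site.xml` lives in `YARN_CONF_DIR`); `check_yarn_config()` is a hypothetical helper, not a `sparklyr` function.

```r
# Hypothetical helper: report whether the environment variables mentioned
# above are set and whether yarn-site.xml can be found in YARN_CONF_DIR.
check_yarn_config <- function() {
  yarn_dir <- Sys.getenv("YARN_CONF_DIR")
  hadoop_dir <- Sys.getenv("HADOOP_CONF_DIR")
  c(
    yarn_conf_dir_set = nzchar(yarn_dir),
    yarn_site_found = nzchar(yarn_dir) &&
      file.exists(file.path(yarn_dir, "yarn-site.xml")),
    hadoop_conf_dir_set = nzchar(hadoop_dir)
  )
}

check_yarn_config()
```

Any `FALSE` entry in the result points at the setting to raise with your system administrator.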
## Livy {#connections-livy}
As<!--((("Apache Livy")))((("Livy")))((("connections", "Livy")))--> opposed to other connection methods that require using an edge node in the cluster, [Livy](#clusters-livy) provides a _web API_ that makes the Spark cluster accessible from outside the cluster and does not require a Spark installation in the client. After it's connected through the web API, the _Livy Service_ starts the Spark context by requesting resources from the cluster manager and distributing work as usual. Figure \@ref(fig:connections-livy-diagram) illustrates a Livy connection; notice that the client connects remotely to the driver through a web API.
```{r connections-livy-diagram, eval=TRUE, echo=FALSE, message=FALSE, fig.align = 'center', fig.cap='Livy connection diagram', out.height='280pt', out.width='auto'}
render_nomnoml("
[Client |
[R]
[sparklyr]
]
[Client]-[<hidden> Web API]
[Web API]-[Driver |
[Livy Service]
[Spark Context]
]
[Cluster Manager]
[Driver]-[Cluster Manager]
[Cluster Manager]-[Worker (1) | [Spark Executor(s)]]
[Cluster Manager]-[Worker (2) | [Spark Executor(s)]]
[Cluster Manager]-[Worker (3) | [Spark Executor(s)]]
", "images/connections-spark-livy.png", "Livy Connection Diagram", styles)
```
Connecting through Livy requires the URL to the Livy service, which should be similar to `https://hostname:port/livy`. Since remote connections are allowed, connections usually require, at the very least, basic authentication:
```{r connections-livy-connect}
sc <- spark_connect(
  master = "https://hostname:port/livy",
  method = "livy", config = livy_config(
    spark_version = "2.4.0",
    username = "<username>",
    password = "<password>"
  ))
```
To try out Livy on your local machine, you can install and run a Livy service as described under the [Clusters - Livy](#clusters-livy) section and then connect as follows:
```{r}
sc <- spark_connect(
master = "http://localhost:8998",
method = "livy",
version = "2.4.0")
```
After you're connected through Livy, you can make use of any `sparklyr` feature; however, Livy is not suitable for exploratory data analysis, since executing commands has a significant performance cost. That said, while running long-running computations, this overhead could be considered irrelevant. In general, you should prefer to avoid using Livy and work directly within an edge node in the cluster; when this is not feasible, using Livy could be a reasonable approach.
**Note:** Specifying the Spark version through the `spark_version` parameter is optional; however, when the version is specified, performance is significantly improved by deploying precompiled Java binaries compatible with the given version. Therefore, it is a best practice to specify the Spark version when connecting to Spark using Livy.
## Mesos {#connections-mesos}
Similar<!--((("connections", "Mesos")))((("Apache Mesos")))((("Mesos")))--> to YARN, Mesos supports client mode and a cluster mode; however, `sparklyr` currently supports only client mode under Mesos. Therefore, the diagram shown in Figure \@ref(fig:connections-mesos-diagram) is equivalent to YARN client’s diagram with only the cluster manager changed from YARN to Mesos.
```{r connections-mesos-diagram, eval=TRUE, echo=FALSE, message=FALSE, fig.align = 'center', fig.cap='Mesos connection diagram', out.height='280pt', out.width='auto'}
render_nomnoml("
[Driver |
[R]
[sparklyr]
[spark-submit]
[Spark Context]
]
[Driver]-[Cluster Manager |
[Mesos]]
[Cluster Manager]-[Worker (1) | [Spark Executor(s)]]
[Cluster Manager]-[Worker (2) | [Spark Executor(s)]]
[Cluster Manager]-[Worker (3) | [Spark Executor(s)]]
", "images/connections-spark-mesos.png", "Mesos Connection Diagram", styles)
```
Connecting requires the address to the Mesos master node, usually in the form of `mesos://host:port` or `mesos://zk://host1:2181,host2:2181,host3:2181/mesos` for Mesos using ZooKeeper:
```{r connections-mesos-connect, eval=FALSE}
sc <- spark_connect(master = "mesos://host:port")
```
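When Mesos is coordinated through ZooKeeper, the master string can get long and is easy to mistype. As a small sketch, the address can be assembled from a vector of host names; `mesos_zk_master()` is a hypothetical helper, and its default port and path simply mirror the example address above.

```r
# Hypothetical helper: build a mesos://zk:// master address from a
# vector of ZooKeeper host names.
mesos_zk_master <- function(hosts, port = 2181, path = "/mesos") {
  endpoints <- paste0(hosts, ":", port, collapse = ",")
  paste0("mesos://zk://", endpoints, path)
}

mesos_zk_master(c("host1", "host2", "host3"))
```

The resulting string can then be passed as the `master` parameter of `spark_connect()`.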
The `MESOS_NATIVE_JAVA_LIBRARY` environment variable needs to be set by your system administrator or manually set when you are running Mesos on your local machine. For instance, in macOS, you can install and initialize Mesos from a terminal, followed by manually setting the `mesos` library and connecting with `spark_connect()`:
```{bash}
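brew install mesos
# Start the Mesos master (keep it running; use a separate terminal for the agent)
/usr/local/Cellar/mesos/1.6.1/sbin/mesos-master --registry=in_memory \
  --ip=127.0.0.1
# Start a Mesos agent that registers with the local master
MESOS_WORK_DIR=. /usr/local/Cellar/mesos/1.6.1/sbin/mesos-slave \
  --master=127.0.0.1:5050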
```
```{r}
Sys.setenv(MESOS_NATIVE_JAVA_LIBRARY =
"/usr/local/Cellar/mesos/1.6.1/lib/libmesos.dylib")
sc <- spark_connect(master = "mesos://localhost:5050",
spark_home = spark_home_dir())
```
## Kubernetes {#connections-kubernetes}
Kubernetes clusters<!--((("connections", "Kubernetes")))((("Kubernetes")))--> do not support client modes like Mesos or YARN; instead, the connection model is similar to YARN cluster, where the driver node is assigned by Kubernetes, as illustrated in Figure \@ref(fig:connections-kubernetes-diagram).
```{r connections-kubernetes-diagram, eval=TRUE, echo=FALSE, message=FALSE, fig.align = 'center', fig.cap='Kubernetes connection diagram', out.height='280pt', out.width='auto'}
render_nomnoml("
[Client |
[R]
[sparklyr]
[spark-submit]
]
[Client]-[Cluster Manager]
[Client]-[Driver]
[Driver |
[sparklyr]
[Spark Context]
]
[Driver]-[Cluster Manager |
[Kubernetes]]
[Cluster Manager]-[Worker (1) | [Spark Executor(s)]]
[Cluster Manager]-[Worker (2) | [Spark Executor(s)]]
[Cluster Manager]-[Worker (3) | [Spark Executor(s)]]
", "images/connections-spark-kubernetes.png", "Kubernetes Connection Diagram", styles)
```
To use Kubernetes, you will need to prepare a virtual machine with Spark installed and properly configured; however, it is beyond the scope of this book to present how to create one. Once created, connecting to Kubernetes works as follows:
```{r connections-kubernetes-connect, eval=FALSE}
library(sparklyr)
sc <- spark_connect(config = spark_config_kubernetes(
  "k8s://https://<apiserver-host>:<apiserver-port>",
  account = "default",
  image = "docker.io/owner/repo:version",
  version = "2.3.1"))
```
If your computer is already configured to use a Kubernetes cluster, you can use the following command to find the `apiserver-host` and `apiserver-port`:
```{r connections-kubernetes-info}
system2("kubectl", "cluster-info")
```
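If you capture that output as text, for example with `system2("kubectl", "cluster-info", stdout = TRUE)`, the API server address can be pulled out with a regular expression. The following is a rough sketch: `k8s_master_from_cluster_info()` is a hypothetical helper, the sample line is illustrative rather than captured from a real cluster, and the exact wording printed by `kubectl` varies across versions.

```r
# Hypothetical helper: extract the first http(s) URL from the text printed
# by `kubectl cluster-info` and prefix it with k8s://.
k8s_master_from_cluster_info <- function(lines) {
  # The character class stops at anything that is not part of host:port,
  # such as terminal color codes surrounding the URL.
  match <- regmatches(lines, regexpr("https?://[[:alnum:].:-]+", lines))
  if (length(match) == 0) {
    stop("No API server URL found in the supplied output")
  }
  paste0("k8s://", match[[1]])
}

# Illustrative sample line, not captured from a real cluster
sample_output <- "Kubernetes master is running at https://192.168.99.100:8443"
k8s_master_from_cluster_info(sample_output)
```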
## Cloud
When<!--((("connections", "cloud providers")))((("cloud providers")))--> you are working with cloud providers, there are a few connection differences. For instance, connecting from <!--((("Databricks")))((("cloud computing", "Databricks")))-->Databricks requires the following connection method:
```{r connections-clusters-databricks}
sc <- spark_connect(method = "databricks")
```
Since<!--((("Amazon EMR")))--> Amazon EMR makes use of YARN, you can connect using `master = "yarn"`:
```{r connections-clusters-emr}
sc <- spark_connect(master = "yarn")
```
Connecting<!--((("IBM Cloud")))((("cloud computing", "IBM")))--> to Spark when using IBM’s Watson Studio requires you to retrieve a configuration object through a `load_spark_kernels()` function that IBM provides:
```{r connections-clusters-ibm}
kernels <- load_spark_kernels()
sc <- spark_connect(config = kernels[2])
```
In<!--((("Microsoft Azure")))((("Azure")))((("Microsoft HDInsight")))((("HDInsight")))--> Microsoft Azure HDInsight and when using ML Services (R Server), a Spark connection is initialized as follows:
```{r connections-clusters-azure}
library(RevoScaleR)
cc <- rxSparkConnect(reset = TRUE, interop = "sparklyr")
sc <- rxGetSparklyrConnection(cc)
```
Connecting<!--((("Qubole")))((("cloud computing", "Qubole")))--> from Qubole requires using the `qubole` connection method:
```{r connections-clusters-qubole}
sc <- spark_connect(method = "qubole")
```
Refer to your cloud provider's documentation and support channels if you need help.
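If your scripts need to run against more than one of these providers, one option is to keep the provider-specific arguments in a lookup table. This is only a sketch of that idea: `cloud_connect_args` and `connect_args_for()` are hypothetical names, and the table covers just the providers above whose connection is a single `spark_connect()` call.

```r
# Hypothetical lookup table of spark_connect() arguments per provider,
# taken from the examples in this section.
cloud_connect_args <- list(
  databricks = list(method = "databricks"),
  emr        = list(master = "yarn"),
  qubole     = list(method = "qubole")
)

# Hypothetical helper: fetch the arguments or fail with a clear message.
connect_args_for <- function(provider) {
  args <- cloud_connect_args[[provider]]
  if (is.null(args)) {
    stop("No connection settings recorded for: ", provider)
  }
  args
}

connect_args_for("emr")
```

A connection could then be opened with `sc <- do.call(spark_connect, connect_args_for("emr"))`.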
## Batches
Most<!--((("connections", "batches")))((("batches")))--> of the time, you use `sparklyr` interactively; that is, you explicitly connect with `spark_connect()` and then execute commands to analyze and model large-scale data. However, you can also automate processes by scheduling Spark jobs that use `sparklyr`. Spark does not provide tools to schedule data-processing tasks; instead, you would use other workflow management tools. This can be useful to transform data, prepare a model and score data overnight, or make Spark available to other systems.
As an example, you can create a file named `batch.R` with the following contents:
```{r}
library(sparklyr)
sc <- spark_connect(master = "local")
sdf_len(sc, 10) %>% spark_write_csv("batch.csv")
spark_disconnect(sc)
```
You can then submit this application to Spark in batch mode using `spark_submit()`; the `master` parameter should be set appropriately for your cluster.
```{r}
spark_submit(master = "local", "batch.R")
```
You can also invoke `spark-submit` from the shell directly through the following:
```bash
/spark-home-path/spark-submit \
  --class sparklyr.Shell '/spark-jars-path/sparklyr-2.3-2.11.jar' \
  8880 12345 --batch /path/to/batch.R
```
The last parameters represent the port number `8880` and the session number `12345`, which you can set to any unique numeric identifier. You can use the following R code to get the correct paths:
```{r}
# Retrieve spark-home-path
spark_home_dir()
# Retrieve spark-jars-path
system.file("java", package = "sparklyr")
```
You can customize your script by passing additional command-line arguments to `spark-submit` and then read them back in R using `commandArgs()`.
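As a sketch of the reading side, the following helper turns arguments of the form `--name=value` into a named list. The `--name=value` convention is an assumption of this example; neither `spark-submit` nor `sparklyr` prescribes it, and which arguments are visible through `commandArgs()` depends on how the script is launched. `parse_batch_args()` is a hypothetical name.

```r
# Hypothetical helper: collect --name=value arguments into a named list.
# By default it reads the arguments passed to the running R script.
parse_batch_args <- function(args = commandArgs(trailingOnly = TRUE)) {
  # Keep only arguments of the form --name=value
  named <- grep("^--[^=]+=", args, value = TRUE)
  values <- sub("^--[^=]+=", "", named)
  names(values) <- sub("^--([^=]+)=.*$", "\\1", named)
  as.list(values)
}

# Example with an explicit argument vector
parse_batch_args(c("--rows=10", "--output=batch.csv"))
```

Values come back as character strings, so a script would convert them as needed, for example with `as.integer()`.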
## Tools
When<!--((("connections", "tools for")))((("commands", "spark_web()")))--> connecting to a Spark cluster using tools like Jupyter and RStudio, you can run the same connection parameters presented in this chapter. However, since many cloud providers make use of a web proxy to secure Spark’s web interface, to use `spark_web()` or the RStudio Connections pane extension, you need to properly configure the `sparklyr.web.spark` setting in a `spark_config()` object, which you then pass to `spark_connect()` through the `config` parameter.
For instance, when using Amazon EMR, you can configure `sparklyr.web.spark` and `sparklyr.web.yarn` by dynamically retrieving the YARN application and building the EMR proxy URL:
```{r connections-amazon-emr}
domain <- "http://ec2-12-345-678-9.us-west-2.compute.amazonaws.com"
config <- spark_config()
config$sparklyr.web.spark <- ~paste0(
domain, ":20888/proxy/", invoke(spark_context(sc), "applicationId"))
config$sparklyr.web.yarn <- paste0(domain, ":8088")
sc <- spark_connect(master = "yarn", config = config)
```
## Multiple
It<!--((("connections", "managing multiple")))--> is common to connect once, and only once, to Spark. However, you can also open multiple connections to Spark by connecting to different clusters or by specifying the `app_name` parameter. This can be helpful to compare Spark versions or validate your analysis before submitting to the cluster. The following example opens connections to Spark 1.6.3, 2.3.0 and Spark Standalone:
```{r connections-multiple}
# Connect to local Spark 1.6.3
sc_16 <- spark_connect(master = "local", version = "1.6")
# Connect to local Spark 2.3.0
sc_23 <- spark_connect(master = "local", version = "2.3", app_name = "Spark23")
# Connect to local Spark Standalone
sc_standalone <- spark_connect(master = "spark://host:port")
```
Finally, you can disconnect from each connection:
```{r connections-multiple-disconnect}
spark_disconnect(sc_16)
spark_disconnect(sc_23)
spark_disconnect(sc_standalone)
```
Alternatively, you can disconnect from all connections at once:
```{r connections-multiple-disconnect-all}
spark_disconnect_all()
```
## Troubleshooting {#connections-troubleshooting}
Last<!--((("connections", "troubleshooting")))((("troubleshooting", "first steps")))--> but not least, we introduce the following troubleshooting techniques: _Logging_, _Spark Submit_, and _Windows_. When in doubt about where to begin, start with the Windows section when using Windows systems, followed by Logging and finally Spark Submit. These techniques are useful when running `spark_connect()` fails with an error message.
### Logging
The<!--((("troubleshooting", "logging")))((("logs and logging")))--> first technique to troubleshoot connections is to print Spark logs directly to the console to help you spot additional error messages:
```{r connections-troubleshoot-logging}
sc <- spark_connect(master = "local", log = "console")
```
In addition, you can enable verbose logging by setting the `sparklyr.verbose` option to `TRUE` when connecting:
```{r connections-troubleshoot-verbose}
sc <- spark_connect(master = "local", log = "console",
config = list(sparklyr.verbose = TRUE))
```
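Console logs can be long, so once you have them as text it can help to narrow them down to warnings and errors. The following is a minimal sketch in base R, assuming each log line follows the `date time LEVEL message` layout used in Spark's console output; `filter_log_lines()` is a hypothetical helper, and the `WARN` and `ERROR` sample lines are made up for illustration.

```r
# Hypothetical helper: keeps only log lines with the requested levels.
# Assumes the layout "<date> <time> <LEVEL> <message>".
filter_log_lines <- function(lines, levels = c("WARN", "ERROR")) {
  pattern <- paste0("^[^ ]+ [^ ]+ (", paste(levels, collapse = "|"), ") ")
  lines[grepl(pattern, lines)]
}

# Sample lines; the WARN and ERROR entries are invented for this example
log_lines <- c(
  "18/06/11 12:13:53 INFO sparklyr: Session (42) found port 8880 is available",
  "18/06/11 12:13:54 WARN Utils: example warning message",
  "18/06/11 12:13:55 ERROR SparkContext: example error message"
)

# Returns the WARN and ERROR lines only
filter_log_lines(log_lines)
```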
### Spark Submit {#connections-troubleshoot-spark-submit}
You<!--((("sparklyr package", "spark-submit script")))((("troubleshooting", "spark-submit script")))((("spark-submit script")))--> can diagnose whether a connection issue is specific to R or Spark in general by running an example job through `spark-submit` and validating that no errors are thrown:
```{r connections-troubleshoot-spark-home}
# Find the spark directory using an environment variable
spark_home <- Sys.getenv("SPARK_HOME")
# Or by getting the local spark installation
spark_home <- sparklyr::spark_home_dir()
```
Then, run the sample application that computes Pi, replacing `"local"` with the master parameter that you are troubleshooting:
```{r}
# Launching a sample application to compute Pi
system2(
file.path(spark_home, "bin", "spark-submit"),
c(
"--master", "local",
"--class", "org.apache.spark.examples.SparkPi",
dir(file.path(spark_home, "examples", "jars"),
pattern = "spark-examples", full.names = TRUE),
100),
stderr = FALSE
)
```
```
Pi is roughly 3.1415503141550314
```
If the preceding message is not displayed, you will need to investigate why your Spark cluster is not properly configured, which is beyond the scope of this book. As a start, rerun the Pi example but remove `stderr = FALSE`; this prints errors to the console, which you can then use to investigate what the problem might be. When using a cloud provider or a Spark distribution, you can contact their support team to help you troubleshoot this further; otherwise, Stack Overflow is a good place to start.
If you do see the message, this means that your Spark cluster is properly configured but somehow R is not able to use Spark, so you need to troubleshoot in detail, as we will explain next.
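If you run this check often, you can automate it. The sketch below assumes that you capture the output of `spark-submit` as a character vector (for example, by passing `stdout = TRUE` to `system2()`) and then look for the expected line; `has_pi_result()` is a hypothetical helper, and the failing sample line is made up for illustration.

```r
# Hypothetical helper: TRUE when any captured line reports the Pi estimate.
has_pi_result <- function(output) {
  any(grepl("^Pi is roughly [0-9.]+", output))
}

# Output as it might be captured with system2(..., stdout = TRUE)
has_pi_result("Pi is roughly 3.1415503141550314")   # TRUE

# An invented failure line, for illustration only
has_pi_result("Exception in thread \"main\" ...")   # FALSE
```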
#### Detailed
To<!--((("troubleshooting", "detailed")))--> troubleshoot the connection process in detail, you can manually replicate the two-step connection process, which is often very helpful to diagnose connection issues. First, `spark-submit` is triggered from R, which submits the application to Spark; second, R connects to the running Spark application.
First, [identify the Spark installation directory](#connections-troubleshoot-spark-submit) and the path to the correct `sparklyr*.jar` file by running the following:
```{r connections-manual-submit}
dir(system.file("java", package = "sparklyr"),
pattern = "sparklyr", full.names = TRUE)
```
Ensure that you identify the correct version that matches your Spark cluster—for instance, `sparklyr-2.1-2.11.jar` for Spark 2.1.
Then, from the terminal, run this:
```{r connections-manual-submit-prep, echo=FALSE}
recent_jars <- dir(system.file("java", package = "sparklyr"), pattern = gsub("\\.[0-9]", "", paste("sparklyr", sparklyr::spark_default_version()$spark, sep = "-")), full.names = T)
Sys.setenv(PATH_TO_SPARKLYR_JAR = recent_jars[[length(recent_jars)]])
Sys.setenv(SPARK_HOME = sparklyr::spark_home_dir())
```
```{bash}
$SPARK_HOME/bin/spark-submit --class sparklyr.Shell $PATH_TO_SPARKLYR_JAR 8880 42
```
```
18/06/11 12:13:53 INFO sparklyr: Session (42) found port 8880 is available
18/06/11 12:13:53 INFO sparklyr: Gateway (42) is waiting for sparklyr client
to connect to port 8880
```
The parameter `8880` is the default port used by `sparklyr`, and `42` is the session number. Normally, `sparklyr` generates a cryptographically secure session number, but for troubleshooting purposes a simple value such as `42` is sufficient.
If this first connection step fails, it means that the cluster can’t accept the application. This usually means that there are not enough resources, or there are permission restrictions.
The second step is to connect from R as follows (notice that there is a 60-second timeout, so you’ll need to run the R command after running the terminal command; if needed, you can configure this timeout as described in [Chapter 9](#tuning)):
```{r connections-manual-submit-}
library(sparklyr)
sc <- spark_connect(master = "sparklyr://localhost:8880/42", version = "2.3")
```
```{r connections-manual-submit-disconnect, echo=FALSE}
spark_disconnect_all()
Sys.setenv(SPARK_HOME = "")
```
If this second connection step fails, it usually means that there is a connectivity problem between R and the driver node. You can try using a different connection port, for instance.
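When you try a different port, both the terminal command and the master URL must use the same port and session number. The following base R sketch builds the master URL from its parts and probes whether anything is listening on a given port. Both helpers, `gateway_master()` and `port_is_open()`, are hypothetical and not part of `sparklyr`. The probe opens a real socket and has not been validated against the gateway's handshake, so it is safest to rerun the terminal command after probing and before connecting from R.

```r
# Hypothetical helper: builds the master URL used in the second step.
gateway_master <- function(host = "localhost", port = 8880, session = 42) {
  paste0("sparklyr://", host, ":", port, "/", session)
}

# Hypothetical helper: TRUE if a TCP connection to host:port can be opened.
port_is_open <- function(host, port, timeout = 2) {
  con <- tryCatch(
    suppressWarnings(
      socketConnection(host = host, port = port, open = "r+",
                       blocking = TRUE, timeout = timeout)
    ),
    error = function(e) NULL
  )
  if (is.null(con)) return(FALSE)
  close(con)
  TRUE
}

gateway_master()                     # "sparklyr://localhost:8880/42"
gateway_master(port = 8881)          # "sparklyr://localhost:8881/42"
port_is_open("localhost", 8880)
```

If `port_is_open()` returns `FALSE` while the terminal command reports that the gateway is waiting, a firewall or a network restriction between R and the driver node is a likely place to look.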
### Windows
Connecting <!--((("troubleshooting", "Windows connections")))((("Windows connections")))-->from Windows is, in most cases, as straightforward as connecting from Linux and macOS. However, there are a few common connection issues that you should be aware of:
- Firewalls and antivirus software might block ports for your connection. The default port used by `sparklyr` is `8880`; double-check that this port is not being blocked.
- Long path names can cause issues, especially with older Windows systems like Windows 7. When you're using these systems, try installing Spark in a location where every folder name has at most eight characters and contains no spaces.
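
The folder-name guideline in the last item can be checked with a few lines of base R. This is a minimal sketch; `path_is_short_and_spaceless()` is a hypothetical helper, and the example paths are made up for illustration.

```r
# Hypothetical helper: TRUE when every folder in the path has at most
# `max_chars` characters and contains no spaces.
path_is_short_and_spaceless <- function(path, max_chars = 8) {
  # Split on both forward slashes and backslashes
  parts <- strsplit(path, "[/\\\\]")[[1]]
  # Drop empty parts and a leading drive letter such as "C:"
  parts <- parts[nzchar(parts) & !grepl("^[A-Za-z]:$", parts)]
  all(nchar(parts) <= max_chars & !grepl(" ", parts))
}

path_is_short_and_spaceless("C:\\spark\\spark23")              # TRUE
path_is_short_and_spaceless("C:\\Program Files\\spark-2.3.0")  # FALSE
```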
## Recap
This chapter presented an overview of Spark’s architecture, connection concepts, and examples to connect in local mode, standalone, YARN, Mesos, Kubernetes, and Livy. It also presented edge nodes and their role while connecting to Spark clusters. This should have provided you with enough information to successfully connect to any Apache Spark cluster.
To troubleshoot connection problems beyond the techniques described in this chapter, we recommend that you search for the connection problem in Stack Overflow and the [`sparklyr` issues GitHub page](http://bit.ly/2Z72XWa) and, if needed, open a [new GitHub issue in `sparklyr`](http://bit.ly/2HasCmq) to get further assistance.
In [Chapter 8](#data), we cover how to use Spark to read from and write to a variety of data sources and formats, which allows you to be more agile when adding new data sources for data analysis. What used to take days, weeks, or even months, you now can complete in hours by embracing data lakes.<!--((("", startref="Conline07")))-->