<!DOCTYPE html>
<html lang="en-US">
<head>
<title>ERDDAP™ - Heavy Loads, Grids, Clusters, Federations, and Cloud Computing</title>
<meta charset="UTF-8">
<link rel="shortcut icon" href="https://coastwatch.pfeg.noaa.gov/erddap/images/favicon.ico">
<link href="../images/erddap2.css" rel="stylesheet" type="text/css">
<meta name="viewport" content="width=device-width, initial-scale=1">
</head>
<body>
<table class="compact nowrap" style="width:100%; background-color:#128CB5;">
<tr>
<td style="text-align:center; width:80px;"><a rel="bookmark"
href="https://www.noaa.gov/"><img
title="National Oceanic and Atmospheric Administration"
src="../images/noaab.png" alt="NOAA"
style="vertical-align:middle;"></a></td>
<td style="text-align:left; font-size:x-large; color:#FFFFFF; ">
<strong>ERDDAP™</strong>
<br><small><small><small>Easier access to scientific data</small></small></small>
</td>
<td style="text-align:right; font-size:small;">
<br>Brought to you by
<a title="National Oceanic and Atmospheric Administration" rel="bookmark"
href="https://www.noaa.gov">NOAA</a>
<a title="National Marine Fisheries Service" rel="bookmark"
href="https://www.fisheries.noaa.gov">NMFS</a>
<a title="Southwest Fisheries Science Center" rel="bookmark"
href="https://www.fisheries.noaa.gov/about/southwest-fisheries-science-center">SWFSC</a>
<a title="Environmental Research Division" rel="bookmark"
href="https://www.fisheries.noaa.gov/about/environmental-research-division-southwest-fisheries-science-center">ERD</a>
</td>
</tr>
</table>
<div class="standard_width">
<h1 style="text-align:center">ERDDAP:
<br>
<a rel="chapter" href="#heavyLoads">Heavy Loads</a>,
<a rel="chapter" href="#grids">Grids, Clusters, Federations</a>,
<br>
and
<a rel="chapter" href="#cloudComputing">Cloud Computing</a></h1>
<a rel="help" href="https://coastwatch.pfeg.noaa.gov/erddap/index.html">ERDDAP™</a>
is a web application and a web service that aggregates scientific data from
diverse local and
remote sources and offers a simple, consistent way to download subsets of the
data in common file
formats and make graphs and maps.
This web page discusses issues related to heavy ERDDAP™ usage loads
and explores possibilities for dealing with extremely heavy loads
via grids, clusters, federations, and cloud computing.
<p>The original version was written in June 2009. There have been no significant
changes. This was last updated 2019-04-15.
<h2>Table of Contents</h2>
<ul>
<li><a rel="chapter" href="#DISCLAIMER">DISCLAIMER</a>
<li><a rel="chapter" href="#heavyLoads">Heavy Loads</a>
<li><a rel="chapter" href="#loadBalancingNo">Multiple Identical ERDDAP's with Load Balancing? No</a>
<li><a rel="chapter" href="#grids">Grids, Clusters, and Federations</a>
<li><a rel="chapter" href="#cloudComputing">Cloud Computing</a>
<li><a rel="chapter" href="#RemoteReplicationOfDatasets">Remote Replication of Datasets</a>
<li><a rel="chapter" href="#contact">Contact Information</a>
<br>
</ul>
<hr><h2><a class="selfLink" id="DISCLAIMER" href="#DISCLAIMER" rel="bookmark">DISCLAIMER</a></h2>
The contents of this web page are Bob Simons's personal opinions and
do not necessarily reflect any position of the
Government or the National Oceanic and Atmospheric Administration.
The calculations are simplistic, but I think the conclusions are correct.
Did I use faulty logic or make a mistake in my calculations?
If so, the fault is mine alone.
Please send an email with the correction to <kbd>erd dot data at noaa dot gov</kbd>.
<br>
<!-- ******* -->
<hr><h2><a class="selfLink" id="heavyLoads" href="#heavyLoads" rel="bookmark">Heavy Loads / Constraints</a></h2>
With heavy use, a standalone ERDDAP™ will be constrained (from most to least likely) by:
<ol>
<li>A remote data source's bandwidth —
Even with an efficient connection (e.g., via OPeNDAP),
unless a remote data source has a very high bandwidth
Internet connection, ERDDAP's responses will be constrained by how fast ERDDAP™ can get
data from the data source. A solution is to copy the dataset onto ERDDAP's hard drive,
perhaps with
<a rel="help"
href="https://erddap.github.io/setupDatasetsXml.html#EDDGridCopy">EDDGridCopy</a>
or
<a rel="help"
href="https://erddap.github.io/setupDatasetsXml.html#EDDTableCopy">EDDTableCopy</a>.
<br>
<li>ERDDAP's server's bandwidth — Unless ERDDAP's server has a very high bandwidth Internet
connection, ERDDAP's responses will be constrained by how fast ERDDAP™ can get data from
the data sources and how fast ERDDAP™ can return data to the clients. The only solution
is to get a faster Internet connection.
<br>
<li><a class="selfLink" id="memory" href="#memory" rel="bookmark">Memory</a> —
If there are many simultaneous requests, ERDDAP™ can run out of memory
and temporarily refuse new requests.
(ERDDAP™ has a couple of mechanisms to avoid this and to minimize the
consequences if it does
happen.) So the more memory in the server, the better.
On a 32-bit server, 4+ GB is really good, 2 GB is okay,
less is not recommended.
On a 64-bit server, you can almost entirely avoid the problem by getting
lots of memory.
See the
<a rel="help"
href="https://erddap.github.io/setup.html#initialSetup">-Xmx and -Xms settings</a>
for ERDDAP/Tomcat.
An ERDDAP™ getting heavy usage on a 64-bit server
with 8 GB of memory and -Xmx set to 4000M is rarely, if ever, constrained by memory.
<br>
<li><a class="selfLink" id="hardDriveBandwidth" href="#hardDriveBandwidth" rel="bookmark">Hard drive bandwidth</a> —
Accessing data stored on the server's hard drive
is vastly faster than
accessing remote data. Even so, if the ERDDAP™ server has a very high
bandwidth Internet connection,
it is possible that accessing data on the hard drive will be a bottleneck.
A partial solution
is to use faster (e.g., 10,000 RPM) magnetic hard drives
or SSD drives (if it makes
sense cost-wise). Another solution is to store different datasets
on different drives, so that the cumulative hard drive bandwidth is much higher.
<br>
<li><a class="selfLink" id="tooManyFiles" href="#tooManyFiles" rel="bookmark">Too many files</a>
in a <a rel="help"
href="https://erddap.github.io/setup.html#cachedResponses">cache</a> directory —
ERDDAP™ caches all images, but only caches the
data for certain types of data requests. It is possible for the cache directory for a
dataset to have a large number of files temporarily. This will slow down requests to see
if a file is in the cache (really!). <kbd>&lt;cacheMinutes&gt;</kbd> in
<a rel="help"
href="https://erddap.github.io/setup.html#setup.xml">setup.xml</a>
lets you set how
long a file can be in the cache before it is deleted. Setting a smaller
number would minimize this problem.
<br>
<li><a class="selfLink" id="CPU" href="#CPU" rel="bookmark">CPU</a> —
Only two things take a lot of CPU time:
<ul>
<li>NetCDF 4 and HDF 5 now support internal compression of data.
Decompressing a large compressed NetCDF 4 / HDF 5 data file can take 10
or more seconds. (That's not an implementation fault. It's the nature of compression.)
So, multiple simultaneous requests to datasets with
data stored in compressed files can put a severe strain on any server.
If this is a problem, the solution is to store popular datasets
in uncompressed files, or get a server with a CPU with more cores.
<li>Making graphs (including maps): roughly 0.2 - 1 second per graph.
So if there were many simultaneous unique requests for graphs
(WMS clients often make 6 simultaneous requests!),
there could be a CPU limitation.
When multiple users are running WMS clients, this becomes a problem.
<br>
</ul>
</ol>
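<p>For example, here is a minimal sketch of a datasets.xml entry that makes a
local copy of a remote gridded dataset (the datasetID values and the remote URL
are hypothetical; see the EDDGridCopy documentation for the full range of options):
<pre>
&lt;dataset type="EDDGridCopy" datasetID="myLocalCopy" active="true"&gt;
    &lt;!-- ERDDAP™ copies the source data onto the local hard drive,
         then serves all requests from that local copy. --&gt;
    &lt;dataset type="EDDGridFromErddap" datasetID="myRemoteSource"&gt;
        &lt;!-- hypothetical remote dataset --&gt;
        &lt;sourceUrl&gt;https://remote.example.org/erddap/griddap/someDatasetID&lt;/sourceUrl&gt;
    &lt;/dataset&gt;
&lt;/dataset&gt;
</pre>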
<hr><h2><a class="selfLink" id="loadBalancingNo" href="#loadBalancingNo" rel="bookmark"
><strong>Multiple Identical ERDDAPs with Load Balancing? No</strong></a></h2>
The question often comes up:
"To deal with heavy loads, can I set up multiple identical ERDDAPs with load balancing?"
It's an interesting question because it quickly gets to the core of ERDDAP's design.
The quick answer is "no".
I know that is a disappointing answer,
but there are a couple of direct reasons and some larger fundamental reasons
why I designed ERDDAP™ to use a different approach
(a federation of ERDDAPs, described in the bulk of this document),
which I believe is a better solution.
<p>Some direct reasons why you can't/shouldn't set up multiple identical ERDDAPs are:
<ul>
<li>A given ERDDAP™ reads each data file when it first becomes available
in order to find the ranges of data in the file. It then stores
that information in an index file.
Later, when a user request for data comes in,
ERDDAP™ uses that index to figure out which files to look in for the requested data.
If there were multiple identical ERDDAPs, they would each be doing
this indexing, which is wasted effort.
With the federated system described below, the indexing is only done once, by one of the ERDDAPs.
<li>For some types of user requests (e.g., for .nc, .png, .pdf files)
ERDDAP™ has to make the entire file before the response can be sent.
So ERDDAP™ caches these files for a short time. If an identical request
comes in (as it often does, especially for images where the URL is embedded in a web page),
ERDDAP™ can reuse that cached file.
In a system of multiple identical ERDDAPs, those cached files are not shared,
so each ERDDAP™ would needlessly and wastefully recreate the .nc, .png, or .pdf files.
With the federated system described below, the files are only made once, by one of the ERDDAPs, and reused.
<li>ERDDAP's subscription system is not set up to be shared by multiple ERDDAPs.
For example, if the load balancer sends a user to one ERDDAP™ and the user subscribes to a dataset,
then the other ERDDAPs won't be aware of that subscription. Later,
if the load balancer sends the user to a different ERDDAP™ and asks for
a list of his/her subscriptions, the other ERDDAP™ will say there are none
(leading him/her to make a duplicate subscription on the other ERDDAP™).
With the federated system described below, the subscription system is
simply handled by the main, public, composite ERDDAP.
</ul>
Yes, for each of those problems, I could (with great effort) engineer a solution
(to share the information between ERDDAPs), but I think the
<a rel="chapter" href="#grids">federation-of-ERDDAPs approach</a>
(described in the bulk of this document) is a much better overall solution,
partly because it deals with other problems
that the multiple-identical-ERDDAPs-with-a-load-balancer approach does not even start to address,
notably the decentralized nature of the data sources in the world.
<p>It's best to accept the simple fact that I didn't design ERDDAP™ to be deployed as
multiple identical ERDDAPs with a load balancer. I consciously designed ERDDAP™
to work well within a federation of ERDDAPs, which I believe has many advantages.
Notably, a federation of ERDDAPs is perfectly aligned with the decentralized, distributed system of
data centers that we have in the real world (think of the different IOOS regions,
or the different CoastWatch regions, or the different parts of NCEI,
or the 100 other data centers in NOAA, or the different NASA DAACs,
or the 1000's of data centers throughout the world).
Instead of telling all the data centers
of the world that they need to abandon their efforts and put all their data
in a centralized "data lake" (even if it were possible, it is a horrible idea for numerous reasons
-- see the various analyses showing the numerous advantages of
<a rel="help" href="https://en.wikipedia.org/wiki/Decentralised_system">decentralized systems<img
src="../images/external.png" alt=" (external link)"
title="This link to an external website does not constitute an endorsement."></a>),
ERDDAP's design works with the world as it is.
Each data center which produces data can continue to maintain, curate, and serve their data (as they should),
and yet, with ERDDAP™, the data can also be instantly available from a centralized ERDDAP,
without the need for transmitting the data to the centralized ERDDAP™ or
storing duplicate copies of the data.
Indeed, a given dataset can be simultaneously available
<br>from an ERDDAP™ at the organization that produced and actually stores the data (e.g., GoMOOS),
<br>from an ERDDAP™ at the parent organization
(e.g., IOOS central),
<br>from an all-NOAA ERDDAP™,
<br>from an all-US-federal government ERDDAP™,
<br>from a global ERDDAP™ (GOOS),
<br>and from specialized ERDDAPs (e.g.,
an ERDDAP™ at an institution devoted to HAB research),
<br>all essentially instantaneously,
and efficiently because only the metadata is transferred between ERDDAPs, not the data.
Best of all, after the initial ERDDAP™ at the originating organization, all of the
other ERDDAPs can be set up quickly (a few hours work), with minimal resources
(one server that doesn't need any RAIDs for data storage since it stores no data locally),
and thus at truly minimal cost.
Compare that to the cost of setting up and maintaining a centralized data center with a data lake
(and the need for a truly massive, truly expensive Internet connection),
plus the attendant problem of the centralized data center being a single point of failure.
To me, ERDDAP's decentralized, federated approach is far, far superior.
<p>In situations where a given data center needs multiple ERDDAPs to meet
high demand, ERDDAP's design is fully capable of matching or exceeding the performance
of the multiple-identical-ERDDAPs-with-a-load-balancer approach.
You always have the option of setting up
<a rel="help" href="#multipleCompositeERDDAPs"
>multiple composite ERDDAPs (as discussed below)</a>,
each of which gets all of their data from other ERDDAPs, without load balancing.
In this case, I recommend that you make a point of giving each of the composite
ERDDAPs a different name / identity
and if possible setting them up in different parts of the world
(e.g., different AWS regions),
e.g., ERD_US_East, ERD_US_West, ERD_IE, ERD_FR, ERD_IT,
so that users consciously, repeatedly, work with a specific ERDDAP,
with the added benefit that you have removed the risk from a single point of failure.
<br>
<hr><h2><a class="selfLink" id="grids" href="#grids" rel="bookmark"><strong>Grids, Clusters, and Federations</strong></a></h2>
Under very heavy use, a single standalone ERDDAP™ will run into one or more of the
<a rel="help" href="#heavyLoads">constraints</a> listed
above and even the suggested solutions will be insufficient. For such situations,
ERDDAP™ has
features that make it easy to construct scalable grids (also called clusters or federations)
of ERDDAPs which allow the system to handle very heavy use (e.g., for a large data center).
<p>I'm using
<a rel="help" href="https://en.wikipedia.org/wiki/Grid_computing">grid<img
src="../images/external.png" alt=" (external link)"
title="This link to an external website does not constitute an endorsement."></a>
as a general term to indicate a type of
<a rel="help" href="https://en.wikipedia.org/wiki/Computer_cluster">computer cluster<img
src="../images/external.png" alt=" (external link)"
title="This link to an external website does not constitute an endorsement."></a>
where all of the
parts may or may not be physically located in one facility and may or may not be centrally
administered. An advantage of co-located, centrally owned and administered grids (clusters)
is that they benefit from economies of scale (especially the human workload) and simplify
making the parts of the system work well together. An advantage of non-co-located,
non-centrally owned and administered grids (federations)
is that they distribute the human workload
and the cost, and may provide some additional fault tolerance.
The solution I propose below works well for all grid, cluster, and federation topologies.
<p>The basic idea of designing a scalable system is to identify the potential bottlenecks
and then design the system so that parts of the system can be replicated as needed to
alleviate the bottlenecks. Ideally, each replicated part increases the capacity of that
part of the system linearly (efficiency of scaling). The system isn't scalable unless
there is a scalable solution for every bottleneck.
<a rel="help" href="https://en.wikipedia.org/wiki/Scalability">Scalability<img
src="../images/external.png" alt=" (external link)"
title="This link to an external website does not constitute an endorsement."></a>
is different from efficiency (how quickly a task can be done — efficiency
of the parts). Scalability allows the system to grow to handle any level of demand.
<strong>Efficiency</strong> (of scaling and of the parts) determines how many servers, etc., will be needed
to meet a given level of demand. Efficiency is very important, but always has limits.
Scalability is the only practical solution to building a system that can handle <strong>very</strong>
heavy use. Ideally, the system will be scalable and efficient.
<p><a class="selfLink" id="goals" href="#goals" rel="bookmark">The goals of this design are:</a>
<ul>
<li>To make a scalable architecture
(one that is easily extensible by replicating any part that
becomes over-burdened).
<li>To make an efficient system that maximizes the availability and
throughput of the data given the available computing resources.
(Cost is almost always an issue.)
<li>To balance the capabilities of the parts of the system so that one part
of the system won't overwhelm another part.
<li>To make a simple architecture so that the system is easy to set up and administer.
<li>To make an architecture that works well with all grid topologies.
<li>To make a system that fails gracefully
and in a limited way if any part becomes over-burdened.
(The time required to copy a large dataset will always limit
the system's ability to deal
with sudden increases in the demand for a specific dataset.)
<li>(If possible) To make an architecture that isn't tied to any specific
<a rel="help" href="#cloudComputing">cloud computing</a> service
or other external services (because it doesn't need them).
</ul>
<p><a class="selfLink" id="recommendations" href="#recommendations" rel="bookmark">Our recommendations are:</a>
<br><img src="https://erddap.github.io/cluster.png" alt="grid/cluster diagram" style="vertical-align:middle">
<ul>
<li>Basically, I suggest setting up a Composite ERDDAP™
(<strong>D</strong> in the diagram), which is a
regular ERDDAP™ except that it just serves data from other ERDDAPs.
The grid's architecture
is designed to shift as much work as possible
(CPU usage, memory usage, bandwidth usage)
from the Composite ERDDAP™ to the other ERDDAPs.
<li>ERDDAP™ has two special dataset types,
<a rel="help"
href="https://erddap.github.io/setupDatasetsXml.html#EDDGridFromErddap">EDDGridFromErddap</a>
and
<a rel="help"
href="https://erddap.github.io/setupDatasetsXml.html#EDDTableFromErddap">EDDTableFromErddap</a>,
which refer to datasets on other ERDDAPs
(see the example datasets.xml entry after this list).
<li>When the composite ERDDAP™ receives a request for data or images from
these datasets, the composite ERDDAP™
<a rel="help" href="https://en.wikipedia.org/wiki/URL_redirection">redirects<img
src="../images/external.png" alt=" (external link)"
title="This link to an external website does not constitute an endorsement."></a>
the data request to the other ERDDAP™ server. The result is:
<ul>
<li>This is very efficient (CPU, memory, and bandwidth), because otherwise
<ol>
<li>The composite ERDDAP™ has to send the data request to the other ERDDAP.
<li>The other ERDDAP™ has to get the data, reformat it,
and transmit the data to the composite ERDDAP.
<li>The composite ERDDAP™ has to receive the data (using extra bandwidth),
reformat it (using extra CPU time and memory),
and transmit the data to the user (using extra bandwidth).
</ol>
By redirecting the data request and allowing the other ERDDAP™ to send the
response directly
to the user, the composite ERDDAP™ spends essentially no CPU time, memory,
or bandwidth on data requests.
<li>The redirect is transparent to the user regardless of the client software
(a browser or any other software or command line tool).
</ul>
</ul>
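<p>As a sketch (the datasetID and the child ERDDAP's URL are hypothetical),
a virtual dataset's entry in the composite ERDDAP's datasets.xml can be as simple as:
<pre>
&lt;!-- a virtual dataset: the metadata is held in memory; requests for
     actual data are redirected to the ERDDAP™ that has the data --&gt;
&lt;dataset type="EDDGridFromErddap" datasetID="someDatasetID" active="true"&gt;
    &lt;sourceUrl&gt;https://childErddap.example.org/erddap/griddap/someDatasetID&lt;/sourceUrl&gt;
&lt;/dataset&gt;
</pre>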
<p><a class="selfLink" id="gridParts" href="#gridParts" rel="bookmark">The parts of the grid are:</a>
<p><strong><span style="color:#0000FF;">A</span></strong>) For every remote data source that
has a high-bandwidth OPeNDAP server, you can connect directly
to the remote server.
If the remote server is an ERDDAP™, use EDDGridFromErddap or
EDDTableFromErddap to serve the data in the Composite ERDDAP™.
If the remote server is some other type of DAP server,
e.g., THREDDS, Hyrax, or GrADS, use EDDGridFromDap.
<p><strong><span style="color:#0000FF;">B</span></strong>) For every ERDDAP-able data source
(a data source from which ERDDAP
can read data) that has a high-bandwidth server, set up another ERDDAP™ in
the grid which
is responsible for serving the data from this data source.
<ul>
<li>If several such ERDDAPs aren't getting many requests for data, you can
consolidate them into one ERDDAP.
<li>If the ERDDAP™ dedicated to getting data from one remote source is
getting too many requests,
there is a temptation to add additional ERDDAPs to access the remote
data source. In special cases this may make sense,
but it is more likely that this will overwhelm the remote data
source (which is self-defeating) and also prevent other users
from accessing the remote data source (which isn't nice).
In such a case, consider setting up another ERDDAP™ to serve that
one dataset and copy the dataset onto that ERDDAP's hard drive (see <strong>C</strong>),
perhaps with
<a rel="help"
href="https://erddap.github.io/setupDatasetsXml.html#EDDGridCopy">EDDGridCopy</a>
and/or
<a rel="help"
href="https://erddap.github.io/setupDatasetsXml.html#EDDTableCopy">EDDTableCopy</a>.
<li><strong>B</strong> servers must be publicly accessible.
</ul>
<p><strong><span style="color:#0000FF;">C</span></strong>) For every ERDDAP-able data source
that has a low-bandwidth server
(or is a slow service for other reasons),
consider setting up another ERDDAP™ and storing a copy of the dataset
on that ERDDAP's hard drives, perhaps with
<a rel="help"
href="https://erddap.github.io/setupDatasetsXml.html#EDDGridCopy">EDDGridCopy</a>
and/or
<a rel="help"
href="https://erddap.github.io/setupDatasetsXml.html#EDDTableCopy">EDDTableCopy</a>.
If several such ERDDAPs
aren't getting many requests for data, you can consolidate them into one ERDDAP.
<br><strong>C</strong> servers must be publicly accessible.
<p><a class="selfLink" id="compositeERDDAP" href="#compositeERDDAP"
rel="bookmark"><strong><span style="color:#0000FF;">D</span></strong>)</a>
The composite ERDDAP™ is a regular
ERDDAP™ except that it just serves data from other ERDDAPs.
<ul>
<li>Because the composite ERDDAP™ has information in memory about all of the
datasets, it can
quickly respond to requests for lists of datasets (full text searches, category searches,
the list of all datasets), and requests for an individual dataset's Data Access Form,
Make A Graph form, or WMS info page. These are all small, dynamically generated, HTML
pages based on information which is held in memory. So the responses are very fast.
<li>Because requests for actual data are quickly redirected to the other ERDDAPs,
the composite
ERDDAP™ can quickly respond to requests for actual data without using any CPU time, memory, or bandwidth.
<li>By shifting as much work as possible (CPU, memory, bandwidth)
from the Composite ERDDAP™ to
the other ERDDAPs, the composite ERDDAP™ can appear to serve data
from all of the datasets
and yet still keep up with very large numbers of data requests
from a large number of users.
<li>Preliminary tests indicate that the composite ERDDAP™ can respond to
most requests in ~1ms of
CPU time, or 1000 requests/second. So an 8 core processor should be able
to respond to about 8000 requests/second.
Although it is possible to envision bursts of higher activity
which would cause slowdowns, that is a lot of throughput.
It is likely that data center
bandwidth will be the bottleneck long before the composite ERDDAP™ becomes the bottleneck.
<li><a class="selfLink" id="upToDateMaxTime" href="#upToDateMaxTime"
rel="bookmark">Up-to-date max(time)?</a>
<br>The EDDGrid/TableFromErddap in the composite ERDDAP™ only changes its
stored information about each source dataset
when the source dataset is
<a rel="help"
href="https://erddap.github.io/setupDatasetsXml.html#reloadEveryNMinutes"
>"reload"ed</a>
and some piece of metadata changes (e.g.,
the time variable's actual_range), thereby generating a subscription notification.
If the source dataset has data that changes frequently (for example, new data every second)
and uses the
<a rel="help"
href="https://erddap.github.io/setupDatasetsXml.html#updateEveryNMillis"
>"update"</a>
system to notice frequent changes to the underlying data,
the EDDGrid/TableFromErddap won't be notified about these frequent changes
until the next dataset "reload",
so the EDDGrid/TableFromErddap won't be perfectly up-to-date.
You can minimize this problem by changing the
source dataset's <kbd>&lt;reloadEveryNMinutes&gt;</kbd> to a smaller value
(60? 15?) so that there are more subscription notifications to tell
the EDDGrid/TableFromErddap to update its information about the source dataset.
<p>Or, if your data management system knows when the source dataset has new data
(e.g., via a script that copies a data file into place), and if that isn't
super frequent (e.g., every 5 minutes or less often), there's a better solution:
<ol>
<li>Don't use <kbd>&lt;updateEveryNMillis&gt;</kbd> to keep the source dataset up-to-date.
<li>Set the source dataset's <kbd>&lt;reloadEveryNMinutes&gt;</kbd> to a larger number (1440?).
<li>Have the script contact the source dataset's
<a rel="help"
href="https://erddap.github.io/setup.html#setDatasetFlag">flag URL</a>
right after it copies a new data file into place.
<br>
</ol>
That will lead to the source dataset being perfectly up-to-date
and cause it to generate a subscription notification,
which will be sent to the EDDGrid/TableFromErddap dataset.
That will lead the EDDGrid/TableFromErddap dataset to be perfectly up-to-date
(well, within 5 seconds of new data being added).
And all that will be done efficiently (without unnecessary dataset reloads).
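<p>For example (a sketch; the dataset type, datasetID, and flag URL details
are hypothetical), the source dataset's entry and the script's final step
might look like:
<pre>
&lt;!-- in the source ERDDAP's datasets.xml: no &lt;updateEveryNMillis&gt;,
     and a long reload interval, because the script's contact with the
     flag URL triggers a reload whenever there is actually new data --&gt;
&lt;dataset type="EDDGridFromNcFiles" datasetID="mySourceDataset" active="true"&gt;
    &lt;reloadEveryNMinutes&gt;1440&lt;/reloadEveryNMinutes&gt;
    ...
&lt;/dataset&gt;

&lt;!-- after copying a new data file into place, the script contacts the
     dataset's flag URL, which has the general form:
     https://sourceErddap.example.org/erddap/setDatasetFlag.txt?datasetID=mySourceDataset&amp;flagKey=... --&gt;
</pre>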
<li><a class="selfLink" id="multipleCompositeERDDAPs" href="#multipleCompositeERDDAPs"
rel="bookmark">In very extreme cases,</a> or for fault tolerance,
you may want to set up more than one composite ERDDAP.
It is likely that other parts of the system (notably, the data center's bandwidth)
will become a problem long before the composite ERDDAP™ becomes a bottleneck.
So the solution is probably to set up additional, geographically diverse, data centers
(mirrors), each with one composite ERDDAP™ and servers with ERDDAPs and (at least) mirror
copies of the datasets which are in high demand. Such a setup also provides fault
tolerance and data backup (via copying).
In this case, it is best if the composite ERDDAPs have different URLs.
<p>If you really want all of the composite ERDDAPs to have the same URL,
use a front end system
that assigns a given user to just one of the composite ERDDAPs (based on the IP address),
so that all of the user's requests go to just one of the composite ERDDAPs.
There are two reasons:
<ul>
<li>When an underlying dataset is reloaded and the metadata changes
(e.g., a new data file in a gridded dataset causes the time variable's
actual_range to change),
the composite ERDDAPs will be temporarily slightly out of synch, but with
<a rel="help" href="https://en.wikipedia.org/wiki/Eventual_consistency"
>eventual consistency<img
src="../images/external.png" alt=" (external link)"
title="This link to an external website does not constitute an endorsement."></a>.
Normally, they will re-synch within 5 seconds, but sometimes it will be longer.
If a user makes an automated system that relies on
<a rel="help" href="/erddap/subscriptions/index.html"
>ERDDAP™ subscriptions</a> that trigger actions, these brief synchronization
problems can become significant.
<li>The 2+ composite ERDDAPs each maintain their own set of subscriptions
(because of the synch problem described above).
</ul>
So a given user should be directed to just one of the composite ERDDAPs
to avoid these problems.
If one of the composite ERDDAPs goes down, the front end system can
redirect that ERDDAP's users to another ERDDAP™ that is up.
However, if it is a capacity problem that causes the first composite ERDDAP™ to fail
(an overzealous user? a
<a rel="help" href="https://en.wikipedia.org/wiki/Denial-of-service_attack"
>denial-of-service attack<img
src="../images/external.png" alt=" (external link)"
title="This link to an external website does not constitute an endorsement."></a>?),
this makes it very likely that redirecting its users to other composite ERDDAPs
will cause a
<a rel="help" href="https://en.wikipedia.org/wiki/Cascading_failure"
>cascading failure<img
src="../images/external.png" alt=" (external link)"
title="This link to an external website does not constitute an endorsement."></a>.
Thus, the most robust setup is to have composite ERDDAPs with different URLs.
<p>Or, perhaps better, set up multiple composite ERDDAPs without load balancing.
In this case, you should make a point of giving each of the ERDDAPs a different
name / identity and if possible setting them up in different parts of the world
(e.g., different AWS regions),
e.g., ERD_US_East, ERD_US_West, ERD_IE, ERD_FR, ERD_IT,
so that users consciously, repeatedly work with a specific ERDDAP.
<li>[For a fascinating design of a high performance system running on one server,
see this <a rel="help"
href="https://mailinator.blogspot.com/2007/01/architecture-of-mailinator.html">detailed description of Mailinator<img
src="../images/external.png" alt=" (external link)"
title="This link to an external website does not constitute an endorsement."></a>.]
</ul>
<p><a class="selfLink" id="copy" href="#copy" rel="bookmark">Datasets In Very High Demand</a> —
In the really unusual case that one of the
<strong>A</strong>, <strong>B</strong>, or <strong>C</strong> ERDDAPs
can't keep up with the requests because of bandwidth or hard drive limitations,
it makes sense to copy the data (again) on to another server+hardDrive+ERDDAP,
perhaps with
<a rel="help"
href="https://erddap.github.io/setupDatasetsXml.html#EDDGridCopy">EDDGridCopy</a>
and/or
<a rel="help"
href="https://erddap.github.io/setupDatasetsXml.html#EDDTableCopy">EDDTableCopy</a>.
While it may seem ideal to have the original dataset and the
copied dataset appear seamlessly as one dataset in the composite ERDDAP™, this is difficult
because the two datasets will be in slightly different states at different times (notably,
after the original gets new data, but before the copied dataset gets its copy).
Therefore, I recommend that the datasets be given slightly different titles (e.g.,
"... (copy #1)" and "... (copy #2)", or perhaps "(mirror #<i>n</i>)" or "(server #<i>n</i>)") and
appear as separate datasets in the composite ERDDAP.
Users are used to seeing lists of
<a rel="help" href="https://en.wikipedia.org/wiki/Website#mirror_site">mirror sites<img
src="../images/external.png" alt=" (external link)"
title="This link to an external website does not constitute an endorsement."></a>
at popular file download sites, so this shouldn't surprise or disappoint them.
Because of bandwidth limitations at a given site, it may make sense to have the mirror
located at another site. If the mirror copy is at a different data center, accessed just
by that data center's composite ERDDAP™, the different titles (e.g., "mirror #1") aren't
necessary.
<p><a class="selfLink" id="hardDrives" href="#hardDrives" rel="bookmark">RAIDs versus Regular Hard Drives</a> —
If a large dataset or a group of datasets are not heavily used,
it may make sense to store the data on a RAID since it offers fault tolerance and since
you don't need the processing power or bandwidth of another server. But if a dataset is
heavily used, it may make more sense to copy the data on another server + ERDDAP™ + hard
drive (similar to
<a rel="help" href="https://storagemojo.com/2007/02/19/googles-disk-failure-experience/">what Google does<img
src="../images/external.png" alt=" (external link)"
title="This link to an external website does not constitute an endorsement."></a>)
rather than to use one server and a RAID to store
multiple datasets since you get to use both server+hardDrive+ERDDAPs in the grid until
one of them fails.
<p><a class="selfLink" id="failures" href="#failures" rel="bookmark">Failures</a> — What happens if...
<ul>
<li>There is a burst of requests for one dataset (e.g., all students in a class
simultaneously request similar data)?
<br>Only the ERDDAP™ serving that dataset will be overwhelmed and
slow down or refuse requests. The composite ERDDAP™ and other ERDDAPs won't be
affected. Since the limiting factor for a given dataset within the system is the hard
drive with the data (not ERDDAP), the only solution (not immediate) is to make a copy
of the dataset on a different server+hardDrive+ERDDAP.
<li>An <strong>A</strong>, <strong>B</strong>, or <strong>C</strong> ERDDAP™ fails (e.g., hard drive failure)?
<br>Only the dataset(s) served by that ERDDAP™ are affected.
If the dataset(s) is mirrored on another server+hardDrive+ERDDAP, the effect is minimal.
If the problem is a hard drive failure in a level 5 or 6 RAID, you just replace the
drive and have the RAID rebuild the data on the drive.
<li>The composite ERDDAP™ fails?
<br>If you want to make a system with very
<a rel="help" href="https://en.wikipedia.org/wiki/High_availability">high availability<img
src="../images/external.png" alt=" (external link)"
title="This link to an external website does not constitute an endorsement."></a>,
you can set up
<a rel="help" href="#multipleCompositeERDDAPs"
>multiple composite ERDDAPs (as discussed above)</a>,
using something like
<a rel="help" href="https://www.nginx.com/">NGINX<img
src="../images/external.png" alt=" (external link)"
title="This link to an external website does not constitute an endorsement."></a>
or
<a rel="help" href="https://traefik.io/">Traefik<img
src="../images/external.png" alt=" (external link)"
title="This link to an external website does not constitute an endorsement."></a>
to handle load balancing.
Note that a given composite ERDDAP™ can handle a very large number of requests
from a large number of users because
<br>requests for metadata are small and are handled by information that is in memory,
and
<br>requests for data (which may be large) are redirected to the child ERDDAPs.
</ul>
<p><a class="selfLink" id="simple" href="#simple" rel="bookmark">Simple,</a>
<a class="selfLink" id="scalable" href="#scalable" rel="bookmark">Scalable</a>
— This system is easy to set up and administer,
and easily extensible when
any part of it becomes over-burdened. The only real limitations for a given data center
are the data center's bandwidth and the cost of the system.
<p><a class="selfLink" id="bandwidth" href="#bandwidth" rel="bookmark">Bandwidth</a> —
Note the approximate bandwidth of commonly used components of the system:
<table class="erd commonBGColor">
<tr><th>Component</th><th>Approximate Bandwidth (GBytes/s)</th></tr>
<tr><td>DDR memory</td><td>2.5</td></tr>
<tr><td>SSD drive</td><td>1</td></tr>
<tr><td>SATA hard drive</td><td>0.3</td></tr>
<tr><td>Gigabit Ethernet</td><td>0.1</td></tr>
<!--tr><td>OC-192 (ISP)</td><td>1</td></tr-->
<tr><td>OC-12</td><td>0.06</td></tr>
<tr><td>OC-3</td><td>0.015</td></tr>
<tr><td>T1</td><td>0.0002</td></tr>
</table>
<br>So, one SATA hard drive (0.3GB/s) on one server with one ERDDAP™ can probably saturate a
Gigabit Ethernet LAN (0.1GB/s).
And one Gigabit Ethernet LAN (0.1GB/s) can probably saturate an OC-12 Internet connection
(0.06GB/s).
And at least one source lists OC-12 lines costing about $100,000 per month.
(Yes, these calculations are based on pushing the system to its limits,
which is not good because it leads to very sluggish responses.
But these calculations are useful for planning and for balancing parts of the system.)
<strong>Clearly, a suitably fast Internet connection for your data center is
by far the most expensive part of the system.</strong>
You can easily and relatively cheaply build a grid with a dozen servers
running a dozen ERDDAPs
which is capable of pumping out lots of data quickly,
but a suitably fast Internet connection will be very, very expensive.
The partial solutions are:
<ul>
<li>Encourage clients to request subsets of the data if that is all that is needed.
If the client only needs data for a small region or at a lower resolution,
that is what they should request.
Subsetting is a central focus of the protocols ERDDAP™ supports for
requesting data.
<li>Encourage transmitting compressed data.
ERDDAP™ <a rel="help" href="https://coastwatch.pfeg.noaa.gov/erddap/information.html#compression">compresses</a>
a data transmission if it
finds "accept-encoding" in the HTTP GET request header. All web browsers use
"accept-encoding" and automatically decompress the response. Other clients
(e.g., computer programs) have to use it explicitly.
<li>Colocate your servers at an ISP or other site that offers relatively
inexpensive bandwidth.
<li>Disperse the servers with the ERDDAPs to different institutions so that
the costs are dispersed.
You can then link your composite ERDDAP™ to their ERDDAPs.
</ul>
Note that <a rel="help" href="#cloudComputing">Cloud Computing</a> and web hosting services
offer all the Internet bandwidth
you need, but don't solve the price problem.
<p><a class="selfLink" id="Nygard" href="#Nygard" rel="bookmark"
>For general information on designing scalable,
high capacity, fault-tolerant systems,</a>
see Michael T. Nygard's book
<a rel="help"
href="https://www.amazon.com/Release-Production-Ready-Software-Pragmatic-Programmers/dp/0978739213">Release It<img
src="../images/external.png" alt=" (external link)"
title="This link to an external website does not constitute an endorsement."></a>.
<p><a class="selfLink" id="LikeLegos" href="#LikeLegos" rel="bookmark">Like Legos</a>
— Software designers often try to use good
<a rel="help" href="https://en.wikipedia.org/wiki/Software_design_pattern">software design patterns<img
src="../images/external.png" alt=" (external link)"
title="This link to an external website does not constitute an endorsement."></a>
to solve problems. Good patterns are good because they encapsulate good,
easy to create and work with, general-purpose solutions that lead to systems
with good properties. Pattern names are not standardized, so I'll call
the pattern that ERDDAP™ uses
the Lego Pattern. Each Lego (each ERDDAP™) is a simple, small,
standard, stand-alone brick (data server) with a defined interface
that allows it to be linked to other Legos (ERDDAPs).
The parts of ERDDAP™ that make up this system are:
the subscription and flagURL systems (which allow for
communication between ERDDAPs), the EDD...FromErddap redirect system,
and the system of RESTful requests for data which can be generated
by users or other ERDDAPs.
Thus, given two or more Legos (ERDDAPs),
you can create a huge number of different shapes (network topologies of ERDDAPs).
Sure, the design and features of ERDDAP™ could have been done differently,
not Lego-like, perhaps just to enable and optimize for one specific topology.
But we feel that ERDDAP's Lego-like design offers a good,
general-purpose solution that enables any ERDDAP™ administrator
(or group of administrators)
to create all kinds of different federation topologies. For example, a
single organization could set up three (or more) ERDDAPs
as shown in the
<a rel="help" href="#recommendations">ERDDAP™ Grid/Cluster Diagram above</a>.
Or a distributed group
(IOOS? CoastWatch? NCEI? NWS? NOAA? USGS? DataONE? NEON? LTER? OOI? BODC? ONC? JRC? WMO?)
can set up one ERDDAP™
in each small outpost (so the data can stay close to the source)
and then set up a composite ERDDAP™ in the
central office with virtual datasets (which are always perfectly up-to-date)
from each of the small outpost ERDDAPs.
Indeed, all of the ERDDAPs, installed at various institutions around
the world, which get data from other ERDDAPs and/or provide data to
other ERDDAPs, form a giant network of ERDDAPs. How cool is that?!
So, as with Legos, the possibilities are endless. That's why this is a
good pattern. That's why this is a good design for ERDDAP.
<p><a class="selfLink" id="DifferentTypesOfRequests" href="#DifferentTypesOfRequests" rel="bookmark">Different Types Of Requests</a>
— One of the real-life complications of this discussion of data server topologies
is that there are different types of requests and
different ways to optimize for the different types of requests.
This is mostly a separate issue
(How fast can the ERDDAP™ with the data respond to the request for data?)
from the topology discussion (which deals with the relationships between data servers
and which server has the actual data).
ERDDAP™, of course, tries to deal with all types of requests efficiently,
but handles some better than others.
<ul>
<li>Many requests are simple.
<br>For example: What is the metadata for this dataset?
Or: What are the values of the time dimension for this gridded dataset?
ERDDAP™ is designed to handle
these as quickly as possible (usually in &lt;=2 ms) by keeping this information in memory.
<br>
<li>Some requests are moderately hard.
<br>For example: Give me this subset of a dataset
(which is in one data file). These requests can be handled relatively quickly
because they aren't that difficult.
<br>
<li>Some requests are hard and thus are time consuming.
<br>For example: Give me this subset of a dataset (which might be in any of the 10,000+
data files, or might be from compressed data files that each take 10 seconds to decompress).
ERDDAP™ v2.0 introduced some new, faster ways to deal with these requests, notably by
allowing the request-handling thread to spawn several worker threads
which tackle different subsets of the request. But there is another approach
to this problem which ERDDAP™ does not yet support: subsets of the data files
for a given dataset could be stored
and analyzed on separate computers, and then the results combined on the
original server. This approach is called
<a rel="help" href="https://en.wikipedia.org/wiki/MapReduce">MapReduce<img
src="../images/external.png" alt=" (external link)"
title="This link to an external website does not constitute an endorsement."></a>
and is exemplified by
<a rel="help" href="https://en.wikipedia.org/wiki/Apache_Hadoop">Hadoop<img
src="../images/external.png" alt=" (external link)"
title="This link to an external website does not constitute an endorsement."></a>,
the first (?) open-source MapReduce program,
which was based on ideas from a Google paper. (If you need MapReduce in ERDDAP,
please send an email request to erd.data at noaa.gov.)
Google's
<a rel="help" href="https://cloud.google.com/bigquery/">BigQuery<img
src="../images/external.png" alt=" (external link)"
title="This link to an external website does not constitute an endorsement."></a>
is interesting because it seems to be an implementation of MapReduce applied
to subsetting tabular datasets, which is one of ERDDAP's main goals.
It is likely that you can create an ERDDAP™ dataset from a BigQuery dataset via
<a rel="help"
href="https://erddap.github.io/setupDatasetsXml.html#EDDTableFromDatabase">EDDTableFromDatabase</a>
because BigQuery can be accessed via a JDBC interface
(see the sketch after this list).
</ul>
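<p>As a sketch (all values here are hypothetical, and the connection details
depend on the JDBC driver you use), an EDDTableFromDatabase entry in
datasets.xml has the general form:
<pre>
&lt;dataset type="EDDTableFromDatabase" datasetID="myDatabaseDataset" active="true"&gt;
    &lt;!-- hypothetical JDBC connection information --&gt;
    &lt;sourceUrl&gt;jdbc:someDriver://hostname:port/databaseName&lt;/sourceUrl&gt;
    &lt;driverName&gt;com.example.jdbc.Driver&lt;/driverName&gt;
    &lt;connectionProperty name="user"&gt;myUserName&lt;/connectionProperty&gt;
    &lt;connectionProperty name="password"&gt;myPassword&lt;/connectionProperty&gt;
    &lt;tableName&gt;myTable&lt;/tableName&gt;
    ...
&lt;/dataset&gt;
</pre>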
<h3><a class="selfLink" id="TheseAreMyOpinions" href="#TheseAreMyOpinions" rel="bookmark">These are my opinions.</a></h3>
Yes, the calculations are simplistic (and now slightly dated), but I think the conclusions are correct.
Did I use faulty logic or make a mistake in my calculations? If so, the fault is mine alone.
Please send an email with the correction to erd dot data at noaa dot gov.
<br>
<!-- ******* -->
<hr><h2><a class="selfLink" id="cloudComputing" href="#cloudComputing" rel="bookmark"><strong>Cloud Computing</strong></a></h2>
Several companies offer cloud computing services
(e.g., <a rel="help" href="https://aws.amazon.com/">Amazon Web Services<img
src="../images/external.png" alt=" (external link)"
title="This link to an external website does not constitute an endorsement."></a>
and
<a rel="help" href="https://cloud.google.com/">Google Cloud Platform<img
src="../images/external.png" alt=" (external link)"
title="This link to an external website does not constitute an endorsement."></a>).
<a rel="help" href="https://en.wikipedia.org/wiki/Web_hosting_service">Web hosting companies<img
src="../images/external.png" alt=" (external link)"
title="This link to an external website does not constitute an endorsement."></a>
have offered simpler services since the mid-1990s,
but the "cloud" services have greatly expanded the flexibility
of the systems and the range of services offered.
Since the ERDDAP™ grid just consists of ERDDAPs and
since ERDDAPs are Java web applications that can run in Tomcat (the most common
application server) or other application servers, it should be relatively easy to
set up an ERDDAP™ grid on a cloud service or web hosting site.
The advantages of these services are:
<ul>
<li>They offer access to very high bandwidth Internet connections.
This alone may justify using these services.
<li>They only charge for the services you use.
For example, you get access to a very high
bandwidth Internet connection, but you only pay for actual data transferred.
That lets you build a system that rarely gets overwhelmed (even at peak demand),
without having to pay for capacity that is rarely used.
<li>They are easily extensible. You can change server types or add
as many servers or as much storage as you want, in less than a minute.
This alone may justify using these services.
<li>They free you from many of the administrative duties of running the
servers and networks.
This alone may justify using these services.
</ul>
The disadvantages of these services are:
<ul>
<li>They charge for their services, sometimes a lot
(in absolute terms; not that it isn't a good value).
The prices listed here are for
<a rel="help" href="https://aws.amazon.com/ec2/pricing">Amazon EC2<img
src="../images/external.png" alt=" (external link)"
title="This link to an external website does not constitute an endorsement."></a>.
These prices (as of June 2015) will come down.
<br>In the past, prices were higher,
but data files and the number of requests were smaller.
<br>In the future, prices will be lower,
but data files and the number of requests will be larger.
<br>So the details change, but the situation stays relatively constant.
<br>And it isn't that the service is overpriced,
it is that we are using and buying a lot of the service.
<ul>
<li>Data Transfer — Data transfers into the system are now free (Yea!).
<br>Data transfers out of the system are $0.09/GB.
<br>One SATA hard drive (0.3GB/s) on one server with one ERDDAP™ can probably
saturate a Gigabit Ethernet LAN (0.1GB/s).
<br>One Gigabit Ethernet LAN (0.1GB/s) can probably saturate an OC-12 Internet
connection (0.06GB/s).
<br>If one OC-12 connection can transmit ~150,000 GB/month, the Data Transfer costs
could be as much as 150,000 GB @ $0.09/GB = $13,500/month,
which is a significant cost.
Clearly, if you have a dozen hard-working ERDDAPs on a cloud service, your
monthly Data Transfer fees could be substantial (up to $162,000/month).
(Again, it isn't that the service is overpriced,
it is that we are using and buying a lot of the service.)
<li>Data storage — Amazon charges $50/month per TB.
(Compare that to buying a 4TB enterprise drive outright for ~$50/TB,
although the RAID to put it in and administrative costs add to the total cost.)
So if you need to store lots of data in the cloud,
it might be fairly expensive (e.g., 100TB would cost $5000/month).
But unless you have a really large amount of data,
this is a smaller issue than the bandwidth/data transfer costs.
(Again, it isn't that the service is overpriced,
it is that we are using and buying a lot of the service.)
<br>
</ul>
<li><a class="selfLink" id="subsetting" href="#subsetting" rel="bookmark">The subsetting problem:</a>
The only way to efficiently distribute data from data files
is to have the program which is distributing the data (e.g., ERDDAP) running on
a server which has the data stored on a local hard drive
(or similarly fast access to a SAN or local RAID).
Local file systems allow ERDDAP™ (and underlying libraries, such as netcdf-java)
to request specific byte ranges from the files and get responses very quickly.
Many types of data requests from ERDDAP™ to the file
(notably gridded data requests where the stride value
is > 1) can't be done efficiently if the program
has to request the entire file or big chunks of a file
from a non-local (hence slower) data storage system and then extract a subset.
If the cloud setup doesn't give ERDDAP™ fast access to byte ranges of the files
(as fast as with local files),
ERDDAP's access to the data will be a severe bottleneck
and negate other benefits of using a cloud service.
</ul>
<a class="selfLink" id="HostedData" href="#HostedData" rel="bookmark"
>Hosted Data</a> —
<br>An alternative to the above cost benefit analysis
(which is based on the data owner (e.g., NOAA)
paying for their data to be stored in the cloud)
arrived around 2012, when Amazon
(and to a lesser extent, some other cloud providers)
started hosting some datasets in their cloud (AWS S3) for free
(presumably with the hope that
they could recover their costs
if users would rent AWS EC2 compute instances to work with that data).
Clearly, this makes cloud computing vastly more cost effective,
because the time and cost of uploading the data and hosting it are now zero.
With ERDDAP™ v2.0, there are new features to facilitate running ERDDAP
in a cloud:
<ul>
<li>Now, an EDDGridFromFiles or EDDTableFromFiles dataset can be
created from data files which are remote and accessible via the internet
(e.g., AWS S3 buckets) by using the <kbd>&lt;cacheFromUrl&gt;</kbd> and
<kbd>&lt;cacheSizeGB&gt;</kbd> options
(see the example datasets.xml entry after this list).
ERDDAP™ will maintain a local cache of the most recently used data files.
<li>Now, if any EDDTableFromFiles source files are compressed (e.g., .tgz),
ERDDAP™ will automatically decompress them when it reads them.
<li>Now, the ERDDAP™ thread responding to a given request will spawn worker threads
to work on subsections of the request if you use the
<kbd>&lt;nThreads&gt;</kbd> option. This parallelization should
allow faster responses to difficult requests.
</ul>
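<p>For example (a sketch with hypothetical values), an EDDTableFromFiles-type
dataset that reads its source files from a remote URL and keeps a local
cache might include:
<pre>
&lt;dataset type="EDDTableFromNcFiles" datasetID="myCachedDataset" active="true"&gt;
    &lt;!-- hypothetical web-accessible source of the data files,
         e.g., a publicly readable AWS S3 bucket --&gt;
    &lt;cacheFromUrl&gt;https://myBucket.s3.us-east-1.amazonaws.com/myData/&lt;/cacheFromUrl&gt;
    &lt;cacheSizeGB&gt;100&lt;/cacheSizeGB&gt;  &lt;!-- ERDDAP™ prunes the local cache to this size --&gt;
    &lt;fileDir&gt;/localCache/myData/&lt;/fileDir&gt;  &lt;!-- where the cached copies are kept --&gt;
    &lt;nThreads&gt;4&lt;/nThreads&gt;  &lt;!-- worker threads for a single difficult request --&gt;
    ...
&lt;/dataset&gt;
</pre>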
These changes solve the problem of AWS S3 not offering local, block-level
file storage and the (old) problem of access to S3 data having a significant lag.
(Years ago (~2014), that lag was significant; it is now much shorter.)
All in all, it means that setting up ERDDAP™ in the cloud works much better now.
<p><strong>Thanks</strong> —
Many thanks to Matthew Arrott and his group in the original OOI effort
for their work on putting ERDDAP™ in
the cloud and the resulting discussions.
<br>
<hr><h2><a class="selfLink" id="RemoteReplicationOfDatasets" href="#RemoteReplicationOfDatasets" rel="bookmark">Remote Replication of Datasets</a></h2>
There is a common problem that is related to the above discussion of grids and federations of ERDDAPs:
remote replication of datasets.
The basic problem is: a data provider maintains a dataset that changes occasionally
and a user wants to maintain an up-to-date local copy of this dataset (for any of
a variety of reasons). Clearly, there are a huge number of variations of this.
Some variations are much harder to deal with than others.
<ul>
<li>Fast Updates
<br>It's harder to keep the local dataset up-to-date <i>immediately</i> (e.g., within 3 seconds)
after every change to the source, rather than, for example, within a few hours.
<br>
<li>Frequent Changes
<br>Frequent changes are harder to deal with than infrequent changes.
For example, once-a-day changes are
much easier to deal with than changes every 0.1 second.
<br>
<li>Small Changes
<br>Small changes to a source file are harder to deal with than an entirely new file.
This is especially true if the small changes may be anywhere in the file.
Small changes are harder to detect and make it hard to isolate the data that needs to be replicated.
New files are easy to detect and efficient to transfer.
<br>
<li>Entire Dataset
<br>Keeping an entire dataset up-to-date is harder than maintaining just recent data.
Some users just need recent data (e.g., the last 8 days' worth).
<br>
<li>Multiple Copies
<br>Maintaining multiple remote copies at different sites is harder than maintaining one remote copy.
This is the scaling problem.
<br>
</ul>
There are obviously a huge number of variations of possible types of changes to
the source dataset and of the user's needs and expectations. Many of the variations are
very difficult to solve. The best solution for one situation is often not the
best solution for another situation — there isn't yet a universal great solution.
<h3><a class="selfLink" id="RemoteReplicationOfDatasets_ErddapTools" href="#RemoteReplicationOfDatasets_ErddapTools"
rel="bookmark"><strong>Relevant ERDDAP™ Tools</strong></a></h3>
ERDDAP™ offers several tools which can be used as part of a system which
seeks to maintain a remote copy of a dataset:
<ul>
<li>ERDDAP's <a rel="help" href="https://en.wikipedia.org/wiki/RSS"
>RSS (Rich Site Summary?) service<img
src="../images/external.png" alt=" (external link)"
title="This link to an external website does not constitute an endorsement."></a>
<br>offers a quick way to check if a dataset on a remote ERDDAP™ has changed.
<br>