Merge pull request #36 from EcoJulia/improvement-query-speed

Improve query speed
PoisotLab · tpoisot · Jul 28, 2020 · 973a724
2 parents db992e9 + 7dc3903
commit 973a724

Showing 22 changed files with 271 additions and 303 deletions.
diff --git a/Project.toml b/Project.toml
@@ -1,7 +1,7 @@
 name = "GBIF"
 uuid = "ee291a33-5a6c-5552-a3c8-0f29a1181037"
 authors = ["Timothée Poisot <timothee.poisot@umontreal.ca>"]
-version = "0.2.4"
+version = "0.3.0"
 
 [deps]
 Dates = "ade2ca70-3891-5945-98fb-dc099432e06a"
@@ -17,7 +17,8 @@ julia = "1.3"
 
 [extras]
 DataFrames = "a93c6f00-e57d-5684-b7b6-d8193f3e46c0"
+Query = "1a8c2f83-1ff3-5112-b086-8aa67b057ba1"
 Test = "8dfed614-e22c-5e08-85e1-65c5234f0b40"
 
 [targets]
-test = ["Test", "DataFrames"]
+test = ["Test", "DataFrames", "Query"]
diff --git a/README.md b/README.md
@@ -23,7 +23,6 @@ Pkg.add("GBIF")
 - get a single occurrence (`occurrence`)
 - look for multiple occurrences (`occurrences`)
 - paging function to get the next batch of occurrences (`occurrences!`)
-- quality control (`filter!`) and arbitrary filters
 - species and taxon lookup (`species`)
 
 ## How to contribute

diff --git a/docs/Project.toml b/docs/Project.toml
@@ -1,7 +1,11 @@
 [deps]
-Documenter = "e30172f5-a6a5-5a46-863b-614d45cd2de4"
 DataFrames = "a93c6f00-e57d-5684-b7b6-d8193f3e46c0"
+Documenter = "e30172f5-a6a5-5a46-863b-614d45cd2de4"
+Query = "1a8c2f83-1ff3-5112-b086-8aa67b057ba1"
+StatsPlots = "f3b207a7-027a-5e70-b257-86293d7955fd"
 
 [compat]
-Documenter = "0.24"
 DataFrames = "0.21"
+Documenter = "0.25"
+Query = "0.12"
+StatsPlots = "0.14"
diff --git a/docs/make.jl b/docs/make.jl
@@ -10,10 +10,12 @@ makedocs(
         "Home" => "index.md",
         "Manual" => [
             "Getting data" => "data.md",
-            "Filtering records" => "filter.md",
-            "DataFrames.jl support" => "dataframes.md"
+            "Types" => "types.md"
         ],
-        "Types" => "types.md"
+        "Examples" => [
+            "Northern cardinal" => "examples/cardinal.md",
+            "Bats rank-abundance" => "examples/bats.md"
+        ]
     ]
 )
 

diff --git a/docs/src/data.md b/docs/src/data.md
@@ -6,19 +6,21 @@
 taxon
 ```
 
-## Getting occurrence data
+## Searching for occurrence data
 
-The most common task is to retrieve a number of occurrences. The core type
-of this package is `GBIFRecord`, which stores a number of data and metadata
-associated with observations of occurrences.
+The most common task is to retrieve many occurrences according to a query. The
+core type of this package is `GBIFRecords`, which is a very lightweight type
+containing information about the query, and a list of `GBIFRecord` for every
+matching occurrence. Note that the GBIF "search" API is limited to 100000
+results, and will not return more than this amount.
 
 ### Single occurrence
 
 ```@docs
 occurrence
 ```
 
-This can be used to retrieve occurrence `1258202889`, with
+As an example, we can retrieve the occurrence with the key `1258202889`, using the following code:
 
 ```@example
 using GBIF
@@ -33,38 +35,45 @@ occurrences(t::GBIFTaxon)
 ```
 
 When called with no arguments, this function will return a list of the latest 20
-occurrences recorded in GBIF. Note that the `GBIFRecords` type, returned by
-`occurrences`, implements all the necessary methods to iterate over collections.
-For example, this allows writing the following:
+occurrences recorded in GBIF. Note that the `GBIFRecords` type, which is the
+return type of `occurrences`, implements the iteration interface. For example,
+this allows writing the following:
 
 ```@example
 using GBIF
 o = occurrences()
 for single_occ in o
-  print(single_occ)
+  print(single_occ.taxon.name)
 end
 ```
 
 ### Query parameters
 
+The queries must be given as pairs of parameter names and values:
+
 ```@docs
 occurrences(query::Pair...)
 occurrences(t::GBIFTaxon, query::Pair...)
 ```
 
 For example, we can get the data on observations of bats between -30 and 30 of
-latitudes using the following syntax:
+latitude using the following syntax:
 
 ```@example
 using GBIF
-bats = GBIF.taxon("Chiroptera"; strict=false)
+bats = GBIF.taxon("Chiroptera"; rank=:ORDER)
 for occ in occurrences(bats, "decimalLatitude" => (-30.0, 30.0))
   println("$(occ.scientific) -- latitude = $(occ.latitude)")
 end
 ```
 
 ### Batch-download of occurrences
 
+When calling `occurrences`, the list of possible `GBIFRecord` will be
+pre-allocated. Any subsequent call to `occurrences!` (on the `GBIFRecords`
+variable) will retrieve the next "page" of results, and add them to the
+collection:
+
 ```@docs
 occurrences!
 ```

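Review note on the paging behaviour documented in `docs/src/data.md`: the sketch below shows, in plain Julia and without network access, one way a pre-allocated collection can hide its unfilled slots behind a `view`. The `PagedRecords` type, the `total` function, and `fetch_page!` are hypothetical stand-ins chosen for illustration; they are not the package's actual implementation.

```julia
# Hypothetical stand-in for a paged collection; NOT the GBIF.jl implementation.
mutable struct PagedRecords
    records::Vector{String}  # pre-allocated storage; unfilled slots are #undef
    filled::Int              # number of slots filled so far
end

# Pre-allocate room for `n` records without initialising them.
PagedRecords(n::Int) = PagedRecords(Vector{String}(undef, n), 0)

# `length` counts retrieved records; `total` is the size of the full result set.
Base.length(p::PagedRecords) = p.filled
total(p::PagedRecords) = length(p.records)

# Only the initialised slots are visible through the view, and iteration
# is delegated to that view.
Base.view(p::PagedRecords) = view(p.records, 1:p.filled)
Base.iterate(p::PagedRecords, state...) = iterate(view(p), state...)
Base.eltype(::Type{PagedRecords}) = String

# Simulate retrieving the next "page" of at most `pagesize` records.
function fetch_page!(p::PagedRecords; pagesize::Int = 2)
    stop = min(p.filled + pagesize, total(p))
    for i in (p.filled + 1):stop
        p.records[i] = "record-$i"
    end
    p.filled = stop
    return p
end

p = fetch_page!(PagedRecords(5))   # first page
while length(p) < total(p)         # keep paging until everything is retrieved
    fetch_page!(p)
end
record_names = collect(p)
```

Because iteration goes through the view, a `for` loop over a partially filled collection never touches an `#undef` slot.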
diff --git a/docs/src/dataframes.md b/docs/src/dataframes.md
diff --git a/docs/src/examples/bats.md b/docs/src/examples/bats.md
@@ -0,0 +1,38 @@
+# Rank-abundance curves of bats in Europe
+
+In this example, we will use the `GBIF` package to produce a rank-abundance
+curve of Chiroptera species in Europe, based on data from 2000 to 2005.
+
+```@example bt
+using GBIF
+using DataFrames
+using Query
+using StatsPlots
+
+bats = GBIF.taxon("Chiroptera"; strict=false)
+occ = occurrences(bats, "continent" => "EUROPE", "year" => (2000, 2005), "limit" => 300)
+while length(occ) < size(occ)
+    occurrences!(occ)
+end
+```
+
+```@example bt
+by_country = occ |>
+  @filter(_.rank == "SPECIES") |>
+  @map({_.key, _.country, species=_.taxon.name}) |>
+  @filter(!ismissing(_.species)) |>
+  @filter(!ismissing(_.country)) |>
+  @groupby((_.country, _.species)) |>
+  @map({country = first(unique(_.country)), species = first(unique(_.species)), count = length(_)}) |>
+  @groupby(_.country) |>
+  @map({country = key(_), abundance = sort(_.count, rev=true), rank = 1:length(_)}) |>
+  @filter(length(_.abundance) > 5) |>
+  DataFrame |>
+  (d) -> flatten(d, [:abundance, :rank]) |>
+  (d) -> sort(d, :rank)
+
+theme(:wong)
+@df by_country plot(:rank, :abundance, group = :country, m=:circle, legend=:outertopright)
+xaxis!("Rank", :log)
+yaxis!("Observations", :log)
+```
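Review note on `docs/src/examples/bats.md`: the Query pipeline groups records by country and species, counts them, and ranks the counts within each country. The sketch below reproduces that computation with Base Julia only, which may help readers follow the grouping steps. The `records` tuples are made-up illustrative data, not GBIF output.

```julia
# Made-up (country, species) observations, for illustration only.
records = [
    ("FR", "Myotis myotis"), ("FR", "Myotis myotis"), ("FR", "Plecotus auritus"),
    ("DE", "Nyctalus noctula"), ("DE", "Myotis myotis"),
    ("DE", "Nyctalus noctula"), ("DE", "Nyctalus noctula"),
]

# Count observations for every (country, species) pair.
counts = Dict{Tuple{String,String},Int}()
for rec in records
    counts[rec] = get(counts, rec, 0) + 1
end

# Collect the per-species counts of each country, then sort them in decreasing order.
abundance = Dict{String,Vector{Int}}()
for ((country, _), n) in counts
    push!(get!(abundance, country, Int[]), n)
end
foreach(v -> sort!(v, rev = true), values(abundance))

# Pair each abundance with its rank (1 = most observed species).
rank_abundance = Dict(c => collect(zip(1:length(v), v)) for (c, v) in abundance)
```

Each entry of `rank_abundance` is a vector of `(rank, abundance)` tuples, which corresponds to the `rank` and `abundance` columns produced by `flatten` in the documented pipeline.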
diff --git a/docs/src/examples/cardinal.md b/docs/src/examples/cardinal.md
@@ -0,0 +1,85 @@
+# Observations of Northern Cardinal over time
+
+In this example, we will use the `GBIF` package to compare the number of
+observations of a species across several years. Specifically, we will look at records of the Northern Cardinal (*Cardinalis cardinalis*) in Québec, from 2011 to 2013. This example will allow us to highlight how `GBIFRecords` can be used with `Query` to select records and transform them.
+
+```@example qc
+using GBIF
+using DataFrames
+using Query
+using StatsPlots
+using Dates
+```
+
+We can get the taxonomic object for *Cardinalis cardinalis*:
+
+```@example qc
+sp_code = taxon("Cardinalis cardinalis", rank = :SPECIES)
+```
+
+The `rank = :SPECIES` argument is not required, as it is the default behaviour
+of the API. Yet, it helps the readability of the code to specify what we should
+be expecting. With this object created, we can define a rough bounding box for
+Québec:
+
+```@example qc
+lat, lon = (44.0, 62.0), (-80.0, -56.0)
+```
+
+This bounding box will also include a few parts of the continental USA, but this
+is not an issue, as we will filter them out once the occurrences have been
+retrieved. It would also be possible to add a `"country" => "CA"` parameter to
+the query.
+
+```@example qc
+obs_qc = occurrences(
+    sp_code,
+    "limit" => 300,
+    "hasCoordinate" => "true",
+    "decimalLatitude" => lat,
+    "decimalLongitude" => lon,
+    "year" => (2011, 2013)
+)
+```
+
+The `length` method for this object will tell us how many records we currently
+have, and the `size` method will tell us how many we can retrieve in total.
+Because the query parameters are going to remain within the `obs_qc` variable
+(in the `query` field, specifically), all we need to do is call `occurrences!`
+on this variable until all occurrences (of which there are `size(obs_qc)`) are
+retrieved.
+
+```@example qc
+while length(obs_qc) < size(obs_qc)
+    occurrences!(obs_qc)
+end
+```
+
+At the end of this loop, the `obs_qc` object will have all of the occurrences. Running this loop may take some time, as there are limitations on speed due to interacting with a remote server.
+
+The result is directly iterable, so we do not need to do anything specific to
+use it in a `for` loop - but if we want to get an array of `GBIFRecord`, we can
+use `collect(view(obs_qc))`. Why `view`? The `GBIFRecords` type always starts
+with enough "room" to put all the `GBIFRecord`, but any record that was not
+retrieved yet is `#undef`. Calling `view` will give us the records that are
+initialized (in versions of Julia starting from 1.5, this has no performance
+cost); `collect`ing the view generates a `Vector{GBIFRecord}`. Internally,
+iteration methods act on the view, so the unassigned records are invisible to
+the user.
+
+The next step is to actually convert the data into a form where we can plot
+them, and this showcases how the package can be used with `Query`:
+
+```@example qc
+d = obs_qc |>
+  @filter(_.rank == "SPECIES") |>
+  @filter(_.country == "Canada") |>
+  @map({_.key, year=year(_.date), month=month(_.date)}) |>
+  @groupby((_.year, _.month)) |>
+  @map({year = first(unique(_.year)), month = first(unique(_.month)), obs = length(_)}) |>
+  @orderby(_.month) |>
+  @thenby(_.year) |>
+  DataFrame
+
+@df d plot(:month, :obs, group=:year)
+```
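Review note on `docs/src/examples/cardinal.md`: the final pipeline counts observations per (year, month) and orders them by month, then by year. A minimal sketch of that grouping and two-key ordering, using only Base Julia and the `Dates` standard library, is given below. The `dates` vector is made-up illustrative data, not GBIF output.

```julia
using Dates

# Made-up observation dates, for illustration only.
dates = [
    Date(2011, 5, 3), Date(2011, 5, 20), Date(2012, 5, 1),
    Date(2012, 6, 9), Date(2013, 6, 2),
]

# Count observations for every (year, month) pair.
obs = Dict{Tuple{Int,Int},Int}()
for d in dates
    k = (year(d), month(d))
    obs[k] = get(obs, k, 0) + 1
end

# Order by month first, then by year, mirroring @orderby followed by @thenby.
ordered = sort(collect(obs), by = p -> (p.first[2], p.first[1]))
```

Sorting on the tuple `(month, year)` is the Base Julia counterpart of chaining `@orderby(_.month)` and `@thenby(_.year)`.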
diff --git a/docs/src/filter.md b/docs/src/filter.md
diff --git a/docs/src/index.md b/docs/src/index.md
@@ -1,25 +1,33 @@
 # Access GBIF data with Julia
 
-This package offers access to biodiversity data through the Global Biodiversity
-Information Facility ([GBIF](https://www.gbif.org/)) API. The package currently
-supports access to occurrence information, and limited support for taxonomic
-information. There are a limited number of cleaning routines built-in, but more
-can easily be added.
+This package offers access to biodiversity data stored by the Global
+Biodiversity Information Facility ([GBIF](https://www.gbif.org/)). The package
+currently offers a wrapper around the search API (to retrieve information on
+occurrences), and a limited wrapper around the species API (to retrieve the
+identifier of taxa).
 
-## How to install
+The focus of the package is on retrieving data; filtering and data analysis
+should be done using other packages from the Julia ecosystem. In particular, we
+provide support for `DataFrames` and `Query` (and therefore the rest of the
+"queryverse").
 
-The package can be installed from the Julia console:
+## Getting started
+
+The latest release of the package can be installed from the Julia console:
 
 ~~~ julia
 Pkg.add("GBIF")
 ~~~
 
-## How to use
-
 After installing it, load the package as usual:
 
 ~~~ julia
 using GBIF
 ~~~
 
-This documentation will walk you through the various features.
+## Core features
+
+- get taxonomic information using the `taxon` function
+- retrieve a single occurrence as a `GBIFRecord` using `occurrence`
+- search for multiple occurrences as a `GBIFRecords` according to a query using the `occurrences` function, and page through the results with `occurrences!`
+- `GBIFRecords` are fully iterable
diff --git a/src/GBIF.jl b/src/GBIF.jl
@@ -115,12 +115,6 @@ include("paging.jl")
 export occurrence, occurrences
 export occurrences!
 
-include("filter.jl")
-export have_both_coordinates, have_neither_zero_coordinates,
-  have_no_zero_coordinates, have_no_issues, have_ok_coordinates,
-  have_a_date
-export qualitycontrol!, showall!, filter!, allrecords!
-
 # Extends with DataFrames functionalities
 function __init__()
   @require DataFrames="a93c6f00-e57d-5684-b7b6-d8193f3e46c0" include("requires/dataframes.jl")