This repository has been archived by the owner on Dec 11, 2022. It is now read-only.

Commit

Merge pull request #36 from EcoJulia/improvement-query-speed
Improve query speed
tpoisot authored Jul 28, 2020
2 parents db992e9 + 7dc3903 commit 973a724
Showing 22 changed files with 271 additions and 303 deletions.
5 changes: 3 additions & 2 deletions Project.toml
@@ -1,7 +1,7 @@
name = "GBIF"
uuid = "ee291a33-5a6c-5552-a3c8-0f29a1181037"
authors = ["Timothée Poisot <timothee.poisot@umontreal.ca>"]
version = "0.2.4"
version = "0.3.0"

[deps]
Dates = "ade2ca70-3891-5945-98fb-dc099432e06a"
@@ -17,7 +17,8 @@ julia = "1.3"

[extras]
DataFrames = "a93c6f00-e57d-5684-b7b6-d8193f3e46c0"
Query = "1a8c2f83-1ff3-5112-b086-8aa67b057ba1"
Test = "8dfed614-e22c-5e08-85e1-65c5234f0b40"

[targets]
test = ["Test", "DataFrames"]
test = ["Test", "DataFrames", "Query"]
1 change: 0 additions & 1 deletion README.md
@@ -23,7 +23,6 @@ Pkg.add("GBIF")
- get a single occurrence (`occurrence`)
- look for multiple occurrences (`occurrences`)
- paging function to get the next batch of occurrences (`occurrences!`)
- quality control (`filter!`) and arbitrary filters
- species and taxon lookup (`species`)

## How to contribute
8 changes: 6 additions & 2 deletions docs/Project.toml
Original file line number Diff line number Diff line change
@@ -1,7 +1,11 @@
[deps]
Documenter = "e30172f5-a6a5-5a46-863b-614d45cd2de4"
DataFrames = "a93c6f00-e57d-5684-b7b6-d8193f3e46c0"
Documenter = "e30172f5-a6a5-5a46-863b-614d45cd2de4"
Query = "1a8c2f83-1ff3-5112-b086-8aa67b057ba1"
StatsPlots = "f3b207a7-027a-5e70-b257-86293d7955fd"

[compat]
Documenter = "0.24"
DataFrames = "0.21"
Documenter = "0.25"
Query = "0.12"
StatsPlots = "0.14"
8 changes: 5 additions & 3 deletions docs/make.jl
@@ -10,10 +10,12 @@ makedocs(
"Home" => "index.md",
"Manual" => [
"Getting data" => "data.md",
"Filtering records" => "filter.md",
"DataFrames.jl support" => "dataframes.md"
"Types" => "types.md"
],
"Types" => "types.md"
"Examples" => [
"Northern cardinal" => "examples/cardinal.md",
"Bats rank-abundance" => "examples/bats.md"
]
]
)

31 changes: 20 additions & 11 deletions docs/src/data.md
@@ -6,19 +6,21 @@
taxon
```

## Getting occurrence data
## Searching for occurrence data

The most common task is to retrieve a number of occurrences. The core type
of this package is `GBIFRecord`, which stores a number of data and metadata
associated with observations of occurrences.
The most common task is to retrieve many occurrences according to a query. The
core type of this package is `GBIFRecords`, which is a very lightweight type
containing information about the query, and a list of `GBIFRecord` for every
matching occurrence. Note that the GBIF "search" API is limited to 100000
results, and will not return more than this amount.

### Single occurrence

```@docs
occurrence
```

This can be used to retrieve occurrence `1258202889`, with
As an example, we can retrieve the occurrence with the key `1258202889`, using the following code:

```@example
using GBIF
@@ -33,38 +35,45 @@ occurrences(t::GBIFTaxon)
```

When called with no arguments, this function will return a list of the latest 20
occurrences recorded in GBIF. Note that the `GBIFRecords` type, returned by
`occurrences`, implements all the necessary methods to iterate over collections.
For example, this allows writing the following:
occurrences recorded in GBIF. Note that the `GBIFRecords` type, which is the
return type of `occurrences`, implements the iteration interface. For example,
this allows writing the following:

```@example
using GBIF
o = occurrences()
for single_occ in o
print(single_occ)
print(single_occ.taxon.name)
end
```

### Query parameters

The queries must be given as pairs of

```@docs
occurrences(query::Pair...)
occurrences(t::GBIFTaxon, query::Pair...)
```

For example, we can get the data on observations of bats between -30 and 30 of
latitudes using the following syntax:
latitude using the following syntax:

```@example
using GBIF
bats = GBIF.taxon("Chiroptera"; strict=false)
bats = GBIF.taxon("Chiroptera"; rank=:ORDER)
for occ in occurrences(bats, "decimalLatitude" => (-30.0, 30.0))
println("$(occ.scientific) -- latitude = $(occ.latitude)")
end
```
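
As a rough illustration of how query pairs could be turned into a request, the sketch below formats each pair as `key=value` and writes a two-element tuple as `min,max`, which is the range syntax used by the GBIF search API. The helper names `format_value` and `querystring` are hypothetical and are not part of this package; the actual internals may differ.

```julia
# Hypothetical helpers, for illustration only (not part of GBIF.jl)

# Scalars are written as-is
format_value(v) = string(v)

# A pair of numbers is written as a "min,max" range
format_value(v::Tuple{<:Real,<:Real}) = string(v[1], ",", v[2])

# Join all key/value pairs into a single query string
function querystring(pairs::Pair...)
    return join(("$(k)=$(format_value(v))" for (k, v) in pairs), "&")
end

qs = querystring("decimalLatitude" => (-30.0, 30.0), "limit" => 300)
println(qs)
```

The tuple method is what lets a range such as `(-30.0, 30.0)` be given directly as the value of a pair.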

### Batch-download of occurrences

When calling `occurrences`, the list of possible `GBIFRecord` will be
pre-allocated. Any subsequent call to `occurrences!` (on the `GBIFRecords`
variable) will retrieve the next "page" of results, and add them to the
collection:

```@docs
occurrences!
```
15 changes: 0 additions & 15 deletions docs/src/dataframes.md

This file was deleted.

38 changes: 38 additions & 0 deletions docs/src/examples/bats.md
@@ -0,0 +1,38 @@
# Rank-abundance curves of bats in Europe

In this example, we will use the `GBIF` package to produce a rank-abundance
curve of Chiroptera species in Europe, based on data from 2000 to 2005.

```@example bt
using GBIF
using DataFrames
using Query
using StatsPlots
bats = GBIF.taxon("Chiroptera"; strict=false)
occ = occurrences(bats, "continent" => "EUROPE", "year" => (2000, 2005), "limit" => 300)
while length(occ) < size(occ)
occurrences!(occ)
end
```

```@example bt
by_country = occ |>
@filter(_.rank == "SPECIES") |>
@map({_.key, _.country, species=_.taxon.name}) |>
@filter(!ismissing(_.species)) |>
@filter(!ismissing(_.country)) |>
@groupby((_.country, _.species)) |>
@map({country = first(unique(_.country)), species = first(unique(_.species)), count = length(_)}) |>
@groupby(_.country) |>
@map({country = key(_), abundance = sort(_.count, rev=true), rank = 1:length(_)}) |>
@filter(length(_.abundance) > 5) |>
DataFrame |>
(d) -> flatten(d, [:abundance, :rank]) |>
(d) -> sort(d, :rank)
theme(:wong)
@df by_country plot(:rank, :abundance, group = :country, m=:circle, legend=:outertopright)
xaxis!("Rank", :log)
yaxis!("Observations", :log)
```
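
To make the rank-abundance computation itself explicit, here is a minimal sketch in plain Julia that does not rely on `Query` or on any GBIF data. The species vector is made up for illustration; the steps (count per species, sort in decreasing order, assign ranks) mirror what the pipeline above does within each country.

```julia
# Made-up observations, one entry per record (illustration only)
observed = ["a", "b", "a", "c", "a", "b"]

# Count the number of records for each species
counts = Dict{String,Int}()
for s in observed
    counts[s] = get(counts, s, 0) + 1
end

# Abundances sorted from most to least observed, with matching ranks
abundance = sort(collect(values(counts)), rev=true)
rank = 1:length(abundance)

for (r, a) in zip(rank, abundance)
    println("rank $(r): $(a) observations")
end
```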
85 changes: 85 additions & 0 deletions docs/src/examples/cardinal.md
@@ -0,0 +1,85 @@
# Observations of Northern Cardinal over time

In this example, we will use the `GBIF` package to compare the number of
observations of a species over two years. Specifically, we will look at records of the Northern Cardinal (*Cardinalis cardinalis*) in Québec, from 2011 to 2013. This example will allow us to highlight how `GBIFRecords` can be used with `Query`, to select records and transform them.

```@example qc
using GBIF
using DataFrames
using Query
using StatsPlots
using Dates
```

We can get the taxonomic object for *Cardinalis cardinalis*:

```@example qc
sp_code = taxon("Cardinalis cardinalis", rank = :SPECIES)
```

The `rank = :SPECIES` argument is not required, as it is the default behaviour
of the API. Yet, it helps the readability of the code to specify what we should
be expecting. With this object created, we can define a rough bounding box for
Québec:

```@example qc
lat, lon = (44.0, 62.0), (-80.0, -56.0)
```

This bounding box will also include a few parts of the continental USA, but this
is not an issue as we will filter them out when we have done the occurrences
retrieval. It would also be possible to add a `"country" => "CA"` parameter to
the query.

```@example qc
obs_qc = occurrences(
sp_code,
"limit" => 300,
"hasCoordinate" => "true",
"decimalLatitude" => lat,
"decimalLongitude" => lon,
"year" => (2011, 2013)
)
```

The `length` method for this object will tell us how many records we currently
have, and the `size` method will tell us how many we can retrieve in total.
Because the query parameters are going to remain within the `obs_qc` variable
(in the `query` field, specifically), all we need to do is call `occurrences!`
on this variable until all occurrences (of which there are `size(obs_qc)`) are
retrieved.

```@example qc
while length(obs_qc) < size(obs_qc)
occurrences!(obs_qc)
end
```

At the end of this loop, the `obs_qc` object will have all of the occurrences. Running this loop may take some time, as there are limitations on speed due to interacting with a remote server.
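
The logic of this loop can be sketched without any network access. In the example below, a plain vector stands in for the remote server, and `fetch_page` and `fetch_all` are hypothetical names used only for illustration; this is not how the package is implemented, only the general idea of paging until the number of retrieved results matches the total.

```julia
# Hypothetical stand-in for a remote endpoint: return at most `limit`
# results, skipping the first `offset` ones
function fetch_page(server::Vector{Int}, offset::Int, limit::Int)
    last_index = min(offset + limit, length(server))
    return server[(offset + 1):last_index]
end

# Keep requesting pages until everything has been retrieved
function fetch_all(server::Vector{Int}, limit::Int)
    results = Int[]
    total = length(server)
    while length(results) < total
        # The number of results we already have is the next offset
        append!(results, fetch_page(server, length(results), limit))
    end
    return results
end

server = collect(1:7)
all_results = fetch_all(server, 3)
```

With seven results and a page size of three, the loop runs three times, and the last page contains a single element.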

The result is directly iterable, so we do not need to do anything specific to
use it in a `for` loop - but if we want to get an array of `GBIFRecord`, we can
use `collect(view(obs_qc))`. Why `view`? The `GBIFRecords` type always starts
with enough "room" to put all the `GBIFRecord`, but any record that was not
retrieved yet is `#undef`. Calling `view` will give us the records that are
initialized (in versions of Julia starting from 1.5, this has no performance
cost); `collect`ing the view generates a `Vector{GBIFRecord}`. Internally,
iteration methods act on the view, so the unassigned records are invisible to
the user.
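
The following self-contained sketch illustrates the idea of a pre-allocated collection with unassigned slots. The `Record` and `Records` types and the `add!` function are hypothetical stand-ins, not the actual definitions of `GBIFRecord` and `GBIFRecords`; they only show why working on a view hides the `#undef` entries.

```julia
# Hypothetical stand-ins, for illustration only
struct Record
    name::String
end

mutable struct Records
    occurrences::Vector{Record}  # pre-allocated, unfilled slots are #undef
    filled::Int                  # number of slots assigned so far
end

# Allocate room for `total` records without assigning any of them
Records(total::Int) = Records(Vector{Record}(undef, total), 0)

# Store a record in the next free slot
function add!(r::Records, rec::Record)
    r.filled += 1
    r.occurrences[r.filled] = rec
    return r
end

Base.length(r::Records) = r.filled
Base.view(r::Records) = view(r.occurrences, 1:r.filled)

# Iteration goes through the view, so unassigned slots are never touched
Base.iterate(r::Records, state...) = iterate(view(r), state...)

r = Records(4)
add!(r, Record("first"))
add!(r, Record("second"))
names = [rec.name for rec in r]
```

Here `collect(view(r))` gives a `Vector{Record}` with two elements, while `isassigned(r.occurrences, 3)` is `false`.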

The next step is to actually convert the data into a form where we can plot
them, and this showcases how the package can be used with `Query`:

```@example qc
d = obs_qc |>
@filter(_.rank == "SPECIES") |>
@filter(_.country == "Canada") |>
@map({_.key, year=year(_.date), month=month(_.date)}) |>
@groupby((_.year, _.month)) |>
@map({year = first(unique(_.year)), month = first(unique(_.month)), obs = length(_)}) |>
@orderby(_.month) |>
@thenby(_.year) |>
DataFrame
@df d plot(:month, :obs, group=:year)
```
51 changes: 0 additions & 51 deletions docs/src/filter.md

This file was deleted.

28 changes: 18 additions & 10 deletions docs/src/index.md
@@ -1,25 +1,33 @@
# Access GBIF data with Julia

This package offers access to biodiversity data through the Global Biodiversity
Information Facility ([GBIF](https://www.gbif.org/)) API. The package currently
supports access to occurrence information, and limited support for taxonomic
information. There are a limited number of cleaning routines built-in, but more
can easily be added.
This package offers access to biodiversity data stored by the Global
Biodiversity Information Facility ([GBIF](https://www.gbif.org/)). The package
currently offers a wrapper around the search API (to retrieve information on
occurrences), and a limited wrapper around the species API (to retrieve the
identifier of taxa).

## How to install
The focus of the package is on retrieving data; filtering and data analysis
should be done using other packages from the Julia ecosystem. In particular, we
provide support for `DataFrames` and `Query` (and therefore the rest of the
"queryverse").

The package can be installed from the Julia console:
## Getting started

The latest release of the package can be installed from the Julia console:

~~~ julia
Pkg.add("GBIF")
~~~

## How to use

After installing it, load the package as usual:

~~~ julia
using GBIF
~~~

This documentation will walk you through the various features.
## Core features

- get taxonomic information using the `taxon` function
- retrieve a single occurrence as a `GBIFRecord` using `occurrence`
- search for multiple occurrences as a `GBIFRecords` according to a query using the `occurrences` function, and page through the results with `occurrences!`
- `GBIFRecords` are fully iterable
6 changes: 0 additions & 6 deletions src/GBIF.jl
@@ -115,12 +115,6 @@ include("paging.jl")
export occurrence, occurrences
export occurrences!

include("filter.jl")
export have_both_coordinates, have_neither_zero_coordinates,
have_no_zero_coordinates, have_no_issues, have_ok_coordinates,
have_a_date
export qualitycontrol!, showall!, filter!, allrecords!

# Extends with DataFrames functionalities
function __init__()
@require DataFrames="a93c6f00-e57d-5684-b7b6-d8193f3e46c0" include("requires/dataframes.jl")

2 comments on commit 973a724

@tpoisot
Member Author


@JuliaRegistrator

Registration pull request created: JuliaRegistries/General/18600

After the above pull request is merged, it is recommended that a tag is created on this repository for the registered package version.

This will be done automatically if the Julia TagBot GitHub Action is installed, or can be done manually through the github interface, or via:

git tag -a v0.3.0 -m "<description of version>" 973a724e7ab9866e49d29162992a3933b181e8a8
git push origin v0.3.0
