
Only keep records that have been indexed to a catchment/flowline ID. #220

Open
dblodgett-usgs opened this issue Jul 10, 2023 · 3 comments

@dblodgett-usgs
Member

Currently, the crawler keeps all records that are read in, whether or not they get indexed. The crawler should instead keep only data that indexes to a comid.

When a crawl finishes, no rows with NULL comids should remain in the NLDI database. This could be made configurable, but the default should be to drop un-indexed features.
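For concreteness, a minimal sketch of what that post-crawl cleanup could look like, assuming a SQLAlchemy connection and an illustrative feature table name (not the crawler's actual schema or code):

```python
from sqlalchemy import create_engine, text

def drop_unindexed(db_url: str, feature_table: str = "nldi_data.feature_demo") -> int:
    """Delete rows that never matched a comid after a crawl finishes.

    The table name is a placeholder; the real crawler writes one feature
    table per crawler source.
    """
    engine = create_engine(db_url)
    with engine.begin() as conn:  # begin() commits on success, rolls back on error
        result = conn.execute(text(f"DELETE FROM {feature_table} WHERE comid IS NULL"))
        return result.rowcount
```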

@gzt5142
Collaborator

gzt5142 commented Jul 17, 2023

Will tackle this issue this week.

The first prerequisite is getting a fresh copy of the demo database, to be sure I'm working against the current standard schema and content.

  • Build and test against new nldi-db::demo database
  • Upon ingest, 'spatial join' to NLDI to set comid for each ingested feature (a rough sketch follows this list)
  • For those ingested features which did not match to a comid, we can either drop or keep them:
    • Default is to drop features where comid is empty/null
    • A command-line switch or other configurable option can override this behavior
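A rough sketch of the spatial-join step mentioned above, assuming a PostGIS-enabled database; the table and column names here are guesses for illustration and may not match the actual nldi-db schema:

```python
from sqlalchemy import create_engine, text

# Illustrative only: assign each un-indexed feature the comid of the NHDPlus
# catchment polygon it falls in. Table and column names are assumptions.
SPATIAL_JOIN_SQL = text("""
    UPDATE nldi_data.feature_demo AS f
    SET    comid = c.featureid
    FROM   nhdplus.catchmentsp AS c
    WHERE  f.comid IS NULL
      AND  ST_Intersects(c.the_geom, f.location)
""")

def index_features(db_url: str) -> None:
    engine = create_engine(db_url)
    with engine.begin() as conn:
        conn.execute(SPATIAL_JOIN_SQL)
```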

As we add configuration options, it may be worth discussing how the crawler is invoked. Right now, it is run from the Linux command line, with an --option-style mechanism for altering default behavior. I wonder if it makes sense to put all configuration into a yml or similar input file.
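As a strawman for that discussion, configuration could be layered so a YAML file sets defaults and command-line switches override them. Everything in this sketch (keys, flag names, defaults) is hypothetical, not the crawler's real options:

```python
import argparse
import yaml  # PyYAML; a hypothetical dependency for this sketch

# Hypothetical built-in defaults.
DEFAULTS = {
    "db_url": "postgresql://nldi@localhost/nldi",
    "drop_unmatched": True,
}

def load_config(path, cli_overrides):
    """Layer settings: built-in defaults < YAML file < command-line flags."""
    cfg = dict(DEFAULTS)
    if path:
        with open(path) as fh:
            cfg.update(yaml.safe_load(fh) or {})
    cfg.update({k: v for k, v in cli_overrides.items() if v is not None})
    return cfg

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="nldi-crawler configuration sketch")
    parser.add_argument("--config", help="path to a YAML configuration file")
    parser.add_argument("--keep-unmatched", dest="drop_unmatched",
                        action="store_false", default=None,
                        help="keep features that did not index to a comid")
    args = parser.parse_args()
    print(load_config(args.config, {"drop_unmatched": args.drop_unmatched}))
```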

@gzt5142
Collaborator

gzt5142 commented Jul 17, 2023

I think I may have misunderstood what has been happening with the crawler source table... I just pulled a fresh copy of the nldi-db repo into a pristine Docker environment. docker-compose up demo stands up a working database. But the contents of the crawler source table are confusing me:

[screenshot: contents of the crawler_source table from the demo database]

Of specific interest to me are the suffixes and the crawler source ID integers. Were those integers going to be renumbered to start from 1 with no gaps?

gzt5142 pushed a commit that referenced this issue Jul 31, 2023
gzt5142 pushed a commit that referenced this issue Jul 31, 2023
@gzt5142
Collaborator

gzt5142 commented Jul 31, 2023

I have ported the logic from the Java crawler into Python. Mostly, this is just arranging different framing around the SQL lifted directly from the Java repo.

Allowing "raw" SQL to execute is a minor security risk (injection concerns), but users have very limited ability to affect the variables in this code, so it is reasonably insulated from such attacks.
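For reference, one common way to keep that surface insulated is to bind any variable pieces as query parameters rather than interpolating them into the SQL string. A minimal SQLAlchemy sketch with an illustrative query (not the crawler's actual SQL):

```python
from sqlalchemy import create_engine, text

# Bound parameters (:source_id) are sent to the server separately from the
# SQL text, so a caller-supplied value can never be interpreted as SQL.
LOOKUP_SQL = text(
    "SELECT crawler_source_id, source_name, source_uri "
    "FROM nldi_data.crawler_source "
    "WHERE crawler_source_id = :source_id"
)

def get_source(db_url: str, source_id: int):
    engine = create_engine(db_url)
    with engine.connect() as conn:
        return conn.execute(LOOKUP_SQL, {"source_id": source_id}).fetchone()
```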

In terms of testing -- I was only able to match three features from source 11 (geoconnex contribution demo sites) against the NHD data in the NHDPlus artifact at https://github.com/internetofwater/nldi-db/releases/download/artifacts-2.0.0/

Looking for domain experts to help me understand if that is the expected result. @dblodgett-usgs

I drop all ingested features with COMID=0 after the crawl. This is not optional (yet).
