
Only keep records that have been indexed to a catchment/flowline ID. #220

Open
dblodgett-usgs opened this issue Jul 10, 2023 · 3 comments

@dblodgett-usgs
Member

Currently, the crawler keeps all records that are read in, whether or not they get indexed. The crawler should instead keep only data that indexes to a comid.

When a crawl finishes, no rows with NULL comids should remain in the NLDI database. This could be made configurable, but the default should be to drop un-indexed features.
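For concreteness, a minimal sketch of what that post-crawl cleanup could look like, assuming a SQLAlchemy connection and an illustrative feature table name (not the crawler's actual schema or code):

```python
from sqlalchemy import create_engine, text

def drop_unindexed(db_url: str, feature_table: str = "nldi_data.feature_demo") -> int:
    """Delete rows that never matched a comid after a crawl finishes.

    The table name is a placeholder; the real crawler writes one feature
    table per crawler source.
    """
    engine = create_engine(db_url)
    with engine.begin() as conn:  # begin() commits on success, rolls back on error
        result = conn.execute(text(f"DELETE FROM {feature_table} WHERE comid IS NULL"))
        return result.rowcount
```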

@gzt5142
Collaborator

gzt5142 commented Jul 17, 2023

Will tackle this issue this week.

The first prerequisite is getting a fresh copy of the demo database, to be sure I'm working against the current standard schema and content.

  • Build and test against new nldi-db::demo database
  • Upon ingest, 'spatial join' to NLDI to set comid for each ingested feature (a rough sketch follows this list)
  • For those ingested features which did not match to a comid, we can either drop or keep them:
    • Default is to drop features where comid is empty/null
    • A command-line switch or other configurable option can override this behavior
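A rough sketch of the spatial-join step mentioned above, assuming a PostGIS-enabled database; the table and column names here are guesses for illustration and may not match the actual nldi-db schema:

```python
from sqlalchemy import create_engine, text

# Illustrative only: assign each un-indexed feature the comid of the NHDPlus
# catchment polygon it falls in. Table and column names are assumptions.
SPATIAL_JOIN_SQL = text("""
    UPDATE nldi_data.feature_demo AS f
    SET    comid = c.featureid
    FROM   nhdplus.catchmentsp AS c
    WHERE  f.comid IS NULL
      AND  ST_Intersects(c.the_geom, f.location)
""")

def index_features(db_url: str) -> None:
    engine = create_engine(db_url)
    with engine.begin() as conn:
        conn.execute(SPATIAL_JOIN_SQL)
```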

As we add configuration options, it may be worth discussing how the crawler is invoked. Right now, it is run from the Linux command line, with an --option-style mechanism for altering default behavior. I wonder if it makes sense to put all configuration into a yml or similar input file.
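As a strawman for that discussion, configuration could be layered so a YAML file sets defaults and command-line switches override them. Everything in this sketch (keys, flag names, defaults) is hypothetical, not the crawler's real options:

```python
import argparse
import yaml  # PyYAML; a hypothetical dependency for this sketch

# Hypothetical built-in defaults.
DEFAULTS = {
    "db_url": "postgresql://nldi@localhost/nldi",
    "drop_unmatched": True,
}

def load_config(path, cli_overrides):
    """Layer settings: built-in defaults < YAML file < command-line flags."""
    cfg = dict(DEFAULTS)
    if path:
        with open(path) as fh:
            cfg.update(yaml.safe_load(fh) or {})
    cfg.update({k: v for k, v in cli_overrides.items() if v is not None})
    return cfg

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="nldi-crawler configuration sketch")
    parser.add_argument("--config", help="path to a YAML configuration file")
    parser.add_argument("--keep-unmatched", dest="drop_unmatched",
                        action="store_false", default=None,
                        help="keep features that did not index to a comid")
    args = parser.parse_args()
    print(load_config(args.config, {"drop_unmatched": args.drop_unmatched}))
```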

@gzt5142
Collaborator

gzt5142 commented Jul 17, 2023

I think I may have misunderstood what has been happening with the crawler source table... I just pulled a fresh copy of the nldi-db repo into a pristine Docker environment. docker-compose up demo stands up a working database. But the contents of the crawler source table are confusing me:

[screenshot: contents of the crawler_source table from the demo database]

Of specific interest to me are the suffixes and the crawler source ID integers. Were those integers going to be renumbered to start from 1 with no gaps?

gzt5142 pushed a commit that referenced this issue Jul 31, 2023
gzt5142 pushed a commit that referenced this issue Jul 31, 2023
@gzt5142
Collaborator

gzt5142 commented Jul 31, 2023

I have ported the logic from the Java crawler into Python. Mostly, this is just arranging different framing around the SQL lifted directly from the Java repo.

Allowing "raw" SQL to execute is a minor security risk (injection concerns), but users have very limited ability to affect the variables in this code, so it is reasonably insulated from such attacks.
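For reference, one common way to keep that surface insulated is to bind any variable pieces as query parameters rather than interpolating them into the SQL string. A minimal SQLAlchemy sketch with an illustrative query (not the crawler's actual SQL):

```python
from sqlalchemy import create_engine, text

# Bound parameters (:source_id) are sent to the server separately from the
# SQL text, so a caller-supplied value can never be interpreted as SQL.
LOOKUP_SQL = text(
    "SELECT crawler_source_id, source_name, source_uri "
    "FROM nldi_data.crawler_source "
    "WHERE crawler_source_id = :source_id"
)

def get_source(db_url: str, source_id: int):
    engine = create_engine(db_url)
    with engine.connect() as conn:
        return conn.execute(LOOKUP_SQL, {"source_id": source_id}).fetchone()
```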

In terms of testing -- I was only able to match three features from source 11 (geoconnex contribution demo sites) against the NHD data in the NHDPlus artifact at https://github.com/internetofwater/nldi-db/releases/download/artifacts-2.0.0/

Looking for domain experts to help me understand if that is the expected result. @dblodgett-usgs

I drop all ingested features with COMID=0 after the crawl. This is not optional (yet).
