Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement DataHub Catalogue client for dp-catalogue library #2902

Merged
merged 43 commits into from
Jan 19, 2024

Conversation

tom-webber
Copy link
Contributor

@tom-webber tom-webber commented Jan 12, 2024

  • Convert CatalogueClient class into ABC base, so OMD and DataHub client classes can be inherited
  • implement __init__ and upsert_table for DataHubCatalogueClient using DataHub gms
  • add tests to check DataHub client is submitting requests as expected (in the format supported by DataHub python sdk gms emitter).
  • add packages to pypoetry.toml
  • refactoring

Further changes - @MatMoore

  • get tests working with a snapshot testing approach (based on DataHub unit tests)
  • add integration test for datahub server
  • further refactoring to make the interface catalogue-agnostic
  • allow the dataset to be created within a container for its database (e.g. my_data_product_v1)
  • remove dummy data from platform schema
  • update readme & changelog

…t classes can be inherited

- implement `__init__` and `create_or_update_table` for DataHubCatalogueClient using DataHub gms
- add tests to check DataHub client is submitting requests as expected (in the format supported by DataHub python sdk gms emitter).
- add packages to pypoetry.toml
- refactoring
@tom-webber tom-webber requested a review from a team January 12, 2024 10:16
@tom-webber tom-webber linked an issue Jan 12, 2024 that may be closed by this pull request
@tom-webber tom-webber requested a review from a team January 17, 2024 11:44
@MatMoore MatMoore force-pushed the add-datahub-client-to-dp-catalogue-pkg branch from 4e89ba1 to 4275112 Compare January 17, 2024 13:48
@MatMoore
Copy link
Contributor

MatMoore commented Jan 17, 2024

I have tested using this branch to create a dataset on the apps and tools instance. See https://datahub.apps-tools.development.data-platform.service.justice.gov.uk/dataset/urn:li:dataset:(urn:li:dataPlatform:glue,my_data_product2.my_table3,PROD)/Schema?is_lineage_mode=false&schemaFilter=

I think I'm reasonably confident it's working end to end now, although there are some things we might need to come back to.

Notes:

  • Custom variables are not implemented yet - so expecting we'll need to add these and any other missing fields later
  • We have to pass tags at the table level now, since DatahubCatalogueClient is not passing along any tags from the Data Product (and either way I don't believe the dataset would inherit this from the data product).
  • With the DatahubCatalogueClient, you need to pass the name as $database_name.$table_name, if you want the database to show up as a container. Otherwise it sticks everything in a "Default" container. This is no longer necessary
  • In datahub entities have multiple "aspects". We were creating an aspect for the schema, and tags, but we were missing the one for basic properties associated with the dataset, so I've added that in.

The aim here is to hide catalogue specific details from the user of
this library.

This means we should expect to be able to type any value to
BaseCatalogueClient and then be able to substitute any of the concrete
classes without breaking things. (See Liskov's substitution principle)

This cannot be the case if the base class has methods which accept *any*
arguments, and the concrete implementations only handle specific
arguments.

To fix this I've modified the abstract methods to specify the arguments.

Since the schema_fqn, database_fqn, etc arguments reflected the
OpenMetadata graph structure, I've replaced this with a generic
"DataLocation" argument, which does not enforce a particularly
hierarchy.
- Pass a fully qualified table name ('database.table' rather than 'table') so
  that a container is created rather than allocating to the default
  container.

- Remove dummy value for mandatory platformSchema argument. Just use
  empty string.
@MatMoore MatMoore force-pushed the add-datahub-client-to-dp-catalogue-pkg branch from 6d2ffbd to 37130d2 Compare January 18, 2024 15:22
@MatMoore
Copy link
Contributor

MatMoore commented Jan 18, 2024

Here is a table created with the latest code:

https://datahub.apps-tools.development.data-platform.service.justice.gov.uk/dataset/urn:li:dataset:(urn:li:dataPlatform:glue,test_data_product_v2.test_table,PROD)/Schema?is_lineage_mode=false&schemaFilter=

Note that I have fully qualified the database name here (database.table), which groups everything from the same database in a container that shows up in the breadcrumb, and in the datahub search interface.
Screenshot 2024-01-18 at 15 25 36

We can change this later if we don't like it, but this keeps it consistent with what we were doing before, in the OpenMetadata implementation.

@MatMoore MatMoore self-assigned this Jan 19, 2024
@PriyaBasker23
Copy link
Contributor

Here is a table created with the latest code:

https://datahub.apps-tools.development.data-platform.service.justice.gov.uk/dataset/urn:li:dataset:(urn:li:dataPlatform:glue,test_data_product_v2.test_table,PROD)/Schema?is_lineage_mode=false&schemaFilter=

Note that I have fully qualified the database name here (database.table), which groups everything from the same database in a container that shows up in the breadcrumb, and in the datahub search interface. Screenshot 2024-01-18 at 15 25 36

We can change this later if we don't like it, but this keeps it consistent with what we were doing before, in the OpenMetadata implementation.

It would be good to map tables into its own database

Copy link
Contributor

@julialawrence julialawrence left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Priya approved.

@MatMoore MatMoore merged commit 8bb6e68 into main Jan 19, 2024
17 checks passed
@MatMoore MatMoore deleted the add-datahub-client-to-dp-catalogue-pkg branch January 19, 2024 14:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Connect registration APIs to catalogue
5 participants