Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Stop subclassing JSONSchema validator #1185

Merged
merged 5 commits into from
Oct 11, 2022

Conversation

WilliamJamieson
Copy link
Contributor

Potentential fix for issue #1172

@WilliamJamieson
Copy link
Contributor Author

@Julian this is my initial attempt to stop subclassing Validator for use in ASDF. I found your suggestion in python-jsonschema/jsonschema#981 (comment) a bit hard to implement because it fails to have several methods we use from the Validator class, rather than re-implement (copy from jsonschema) all the missing parts, I am simply patching the Validator class.

Please feel free to leave comments, alternate suggestions, or a review. I would welcome your thoughts.

@WilliamJamieson
Copy link
Contributor Author

WilliamJamieson commented Sep 1, 2022

I believe the failing tests are due to a bug in jsonschema, see python-jsonschema/jsonschema#994 for details.

@Julian
Copy link

Julian commented Sep 2, 2022

Hi! Monkeypatching an object you don't "own" (i.e. a validator class) is definitely even more fragile than subclassing from it, so this definitely could be improved somehow. I'll have to spend some time with it to see what to recommend though, unless someone else sees it first.

@WilliamJamieson
Copy link
Contributor Author

@Julian I completely agree that Monkeypatching is less than ideal. Essentially, I am just "replicating" the changes the previous subclassing would have accomplished. This "avoids" the warning that has been added as that breaks the required testing in downstream packages that rely on asdf (they turn all warnings into raised errors); moreover, the warning does mention that subclassing will be removed at a later date.

The alternate method to this is to start with what you suggested in python-jsonschema/jsonschema#981 (comment); however, as far as I can tell this then requires me to then replicate much of the jsonschema Validator class's interface through copying the necessary parts from the jsonschema source. This also seems fragile and less than ideal because it requires me to watch for all changes to the jsonschema Validator and adjust asdf accordingly.

As far as I can tell there isn't an interface provided by jsonschema which is rich enough to enable asdf to add its additional logic on top of the jsonschema Validator's iter_errors method. As far as I can tell, asdf truly needs to modify the iter_errors method in order to handle validation with respect to yaml tags (which are not part of the json specification and so are outside the jsonschema specification), which are used as a core part of asdf's functionality.

@WilliamJamieson
Copy link
Contributor Author

Tagging @eslavich to weigh in, as they understand how asdf leverages jsonschema better than I do.

@eslavich
Copy link
Contributor

eslavich commented Sep 5, 2022

I agree that we won't be able to support descending into tag schemas by composing jsonschema's validator or modifying the evolve-able attributes.

Should asdf simply vendorize jsonschema at a version that works for us? Our use case is bizarre and I don't think it's reasonable to expect jsonschema to officially support the level of customization that we need. Since we're stuck on draft 4 anyway, we won't miss the new features, and that draft is nearly a decade old so I don't expect that there will be much to maintain.

@Julian
Copy link

Julian commented Sep 7, 2022

As far as I can tell there isn't an interface provided by jsonschema which is rich enough to enable asdf to add its additional logic on top of the jsonschema Validator's iter_errors method. As far as I can tell, asdf truly needs to modify the iter_errors method in order to handle validation with respect to yaml tags (which are not part of the json specification and so are outside the jsonschema specification), which are used as a core part of asdf's functionality.

If there's any detail you can share about what's missing that'd definitely be useful to hear, though I still haven't dug in in detail yet, apologies.

@WilliamJamieson
Copy link
Contributor Author

If there's any detail you can share about what's missing that'd definitely be useful to hear, though I still haven't dug in in detail yet, apologies.

Currently the iter_errors that asdf uses is:

asdf/asdf/schema.py

Lines 266 to 311 in ebac36e

def iter_errors(self, instance, *args, **kwargs):
# We can't validate anything that looks like an external reference,
# since we don't have the actual content, so we just have to defer
# it for now. If the user cares about complete validation, they
# can call `AsdfFile.resolve_references`.
with self._context:
if self._context.seen(instance, self.schema):
# We've already validated this instance against this schema,
# no need to do it again.
return
if not visit_repeat_nodes:
self._context.add(instance, self.schema)
if (isinstance(instance, dict) and "$ref" in instance) or isinstance(instance, reference.Reference):
return
if not self.schema:
tag = getattr(instance, "_tag", None)
if tag is not None:
if self.serialization_context.extension_manager.handles_tag_definition(tag):
tag_def = self.serialization_context.extension_manager.get_tag_definition(tag)
schema_uris = tag_def.schema_uris
else:
schema_uris = [self.ctx.tag_mapping(tag)]
if schema_uris[0] == tag:
schema_uris = []
# Must validate against all schema_uris
for schema_uri in schema_uris:
try:
with self.resolver.resolving(schema_uri) as resolved:
yield from self.descend(instance, resolved)
except RefResolutionError:
msg = "Unable to locate schema file for '{}': '{}'"
warnings.warn(msg.format(tag, schema_uri), AsdfWarning)
if isinstance(instance, dict):
for val in instance.values():
yield from self.iter_errors(val)
elif isinstance(instance, list):
for val in instance:
yield from self.iter_errors(val)
else:
yield from super().iter_errors(instance)

It provides two functions:

  1. It prevents the validator from getting caught in "reference cycles" as yaml provides anchors which allow one to reference already stated information. See

    asdf/asdf/schema.py

    Lines 272 to 278 in ebac36e

    if self._context.seen(instance, self.schema):
    # We've already validated this instance against this schema,
    # no need to do it again.
    return
    if not visit_repeat_nodes:
    self._context.add(instance, self.schema)
  2. More importantly for asdf it handles the yaml tags, which asdf uses in a similar manner to $ref but instead asdf needs to pull a specific schema and use it to validate only the subtree in question against this new schema. This includes validating that the yaml tag from the data matches the one listed in the schema. See

    asdf/asdf/schema.py

    Lines 284 to 301 in ebac36e

    tag = getattr(instance, "_tag", None)
    if tag is not None:
    if self.serialization_context.extension_manager.handles_tag_definition(tag):
    tag_def = self.serialization_context.extension_manager.get_tag_definition(tag)
    schema_uris = tag_def.schema_uris
    else:
    schema_uris = [self.ctx.tag_mapping(tag)]
    if schema_uris[0] == tag:
    schema_uris = []
    # Must validate against all schema_uris
    for schema_uri in schema_uris:
    try:
    with self.resolver.resolving(schema_uri) as resolved:
    yield from self.descend(instance, resolved)
    except RefResolutionError:
    msg = "Unable to locate schema file for '{}': '{}'"
    warnings.warn(msg.format(tag, schema_uri), AsdfWarning)

The remaining components of the iter_errors we use just set it up so that it operates like the jsonschema iter_errors or directly call the jsonschema iter_errors itself.

The main point is that asdf has to do some work to handle yaml/asdf specific things outside the scope of the original iter_errors method (checking/handling them prior to calling down into the original iter_errors). The natural way (to me) we can support this is via subclassing, but I am open to suggestions.

@WilliamJamieson WilliamJamieson force-pushed the feature/validator branch 3 times, most recently from e96bf2d to 3bd3a38 Compare October 7, 2022 14:36
@WilliamJamieson WilliamJamieson marked this pull request as ready for review October 11, 2022 15:24
Copy link
Contributor

@braingram braingram left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me.

@WilliamJamieson WilliamJamieson merged commit f889d84 into asdf-format:master Oct 11, 2022
@WilliamJamieson WilliamJamieson deleted the feature/validator branch October 11, 2022 16:59
bmwiedemann pushed a commit to bmwiedemann/openSUSE that referenced this pull request Oct 22, 2022
https://build.opensuse.org/request/show/1030520
by user bnavigator + dimstar_suse
- Update to 2.13.0
  * The ASDF Standard is at v1.6.0
  * Add ability to pull information from schema about asdf file
    data, using ~asdf.AsdfFile.schema_info method. [#1167]
- Release 2.12.1
  * Overhaul of the ASDF documentation to make it more consistent
    and readable. [#1142, #1152]
  * Update deprecated instances of abstractproperty to
    abstractmethod [#1148]
  * Move build configuration into pyproject.toml [#1149, #1155]
  * Pin jsonschema to below 4.10.0. [#1171]
- Release 2.12.0
  * Added ability to display title as a comment in using the info()
    functionality. [#1138]
  * Add ability to set asdf-standard version for schema example
    items. [#1143]
- Add asdf-pr1185+pr1203-fix-jsonschema.patch
  * gh#asdf-format/asdf#1185, gh#asdf-format/asdf#1203
- Add asdf-pr1214-inst
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants