Add hdf5 response format #1292

Open
wants to merge 31 commits into base: main
Changes from 13 of 31 commits
e2d8010
Added support for returning optimade data in the hdf5 format.
JPBergsma Jul 28, 2022
079bd71
Added extra doctstrings to hdf5.py and made setting for enabling/disa…
JPBergsma Jul 29, 2022
0b71e9e
Added dependancies for hdf5 response to requirements.txt and setup.py.
JPBergsma Jul 29, 2022
9167351
Added enabled_response_formats to test config and disabled hdf5 tests…
JPBergsma Jul 29, 2022
7551132
Added enabled_response_formats to test config and disabled hdf5 tests…
JPBergsma Jul 29, 2022
e43297e
Merge branch 'master' into JPBergsma/add_HDF5_output_format
JPBergsma Jul 29, 2022
d811457
merges changes from master.
JPBergsma Jul 29, 2022
7952092
checking whether the not installing of numpy on github server was cau…
JPBergsma Jul 29, 2022
694894f
added hdf5_deps to extras_require.
JPBergsma Jul 29, 2022
8d51f55
Added numpy and h5py to install_requirements in setup.py
JPBergsma Jul 29, 2022
12b79e0
Use a query that does not have an _exampl_ field to test response for…
JPBergsma Jul 29, 2022
9fe4dcc
Added extra test and the supported response formats are now listed at…
JPBergsma Aug 3, 2022
1981032
Made some changes to the docstrings and type definitions so it will h…
JPBergsma Aug 4, 2022
79b48d6
The test for the single entry point did not work. This is fixed now
JPBergsma Aug 4, 2022
687ea78
Added more thorough check to see whetehr the response contnet type is…
JPBergsma Aug 4, 2022
fbfe0f7
Remove numpy and h5py from 'install_requires'.
JPBergsma Aug 4, 2022
a55bd82
Revert "Remove numpy and h5py from 'install_requires'."
JPBergsma Aug 4, 2022
43e326f
Remove h5py_deps and put numpy and h5py back in install_requires.
JPBergsma Aug 4, 2022
1e7e3f9
Processed comments from code review.
JPBergsma Aug 9, 2022
50cacf0
Fixed test_response_format.py
JPBergsma Aug 9, 2022
82f2b31
Added extra test values, and added support for handling nested lists …
JPBergsma Aug 9, 2022
15770f9
Merge branch 'master' into JPBergsma/add_HDF5_output_format
JPBergsma Aug 9, 2022
42864cb
Added extra test to check if response_format is in the enabled_respon…
JPBergsma Aug 10, 2022
7c6a562
Merge branch 'JPBergsma/add_HDF5_output_format' of https://github.com…
JPBergsma Aug 10, 2022
30af05a
Added filenames to the header.
JPBergsma Aug 15, 2022
47fa9ad
Changed the way the collection name is determined for the file name o…
JPBergsma Aug 16, 2022
9ef6b05
Merge branch 'master' into JPBergsma/add_HDF5_output_format
JPBergsma Sep 15, 2022
4ada284
Update requirements.txt
JPBergsma Sep 15, 2022
f1c309d
updated version requirement numpy in requirements.txt
JPBergsma Sep 18, 2022
b32278f
Small fields are now stored as attributes rather than datasets.
JPBergsma Sep 21, 2022
9597cca
Merge branch 'master' into JPBergsma/add_HDF5_output_format
JPBergsma Sep 21, 2022
3 changes: 3 additions & 0 deletions docs/api_reference/adapters/hdf5.md
@@ -0,0 +1,3 @@
# hdf5

::: optimade.adapters.hdf5
230 changes: 230 additions & 0 deletions optimade/adapters/hdf5.py
@@ -0,0 +1,230 @@
from io import BytesIO
from typing import Union, Any
from pydantic import AnyUrl
from datetime import datetime, timezone
from optimade.models import EntryResponseMany, EntryResponseOne
import h5py
import numpy as np


"""This adaptor class can be used to generate a hdf5 response instead of a json response and to convert the hdf5 response back into an python dictionary.
It can handle numeric data in a binary format compatible with numpy.
It is therefore more efficient than the JSON format at returning large amounts of numeric data.
It however also has more overhead resulting in a larger response for entries with little numeric data.
To enable support for your server the parameter "enabled_response_formats" can be specified in the config file.
It is a list of the supported response_formats. To support the hdf5 return format it should be set to: ["json", "hdf5"]
(support for the JSON format is mandatory)

Unfortunately, h5py does not support storing objects with the numpy.object type.
It is therefore not possible to directly store a list of dictionaries in a hdf5 file with h5py.
As a workaround, the index of a value in a list is used as a dictionary key so a list can be stored as a dictionary if neccesary.
"""


def generate_hdf5_file_content(
response_object: Union[EntryResponseMany, EntryResponseOne, dict, list, tuple]
) -> bytes:
"""This function generates the content of a hdf5 file from an EntryResponse object.
It should also be able to handle python dictionaries lists and tuples.

Parameters:
response_object: an OPTIMADE response object. This can be of any OPTIMADE entry type, such as structure, reference etc.

Returns:
A binary object containing the contents of the hdf5 file.
"""

temp_file = BytesIO()
hdf5_file = h5py.File(temp_file, "w")
if isinstance(response_object, (EntryResponseMany, EntryResponseOne)):
response_object = response_object.dict(exclude_unset=True)
store_hdf5_dict(hdf5_file, response_object)
hdf5_file.close()
file_content = temp_file.getvalue()
temp_file.close()
return file_content


def store_hdf5_dict(
hdf5_file: h5py._hl.files.File, iterable: Union[dict, list, tuple], group: str = ""
):
"""This function stores a python list, dictionary or tuple in a hdf5 file.
the currently supported datatypes are str, int, float, list, dict, tuple, bool, AnyUrl,
None ,datetime or any numpy type or numpy array.

Unfortunately, h5py does not support storing objects with the numpy.object type.
It is therefore not possible to directly store a list of dictionaries in a hdf5 file with h5py.
As a workaround, the index of a value in a list is used as a dictionary key so a list can be stored as a dictionary if neccesary.

Parameters:
hdf5_file: An hdf5 file like object.
iterable: The object to be stored in the hdf5 file.
group: This indicates to group in the hdf5 file the list, tuple or dictionary should be added.

Raises:
TypeError: If this function encounters an object with a type that it cannot convert to the hdf5 format
a ValueError is raised.
"""
if isinstance(iterable, (list, tuple)):
iterable = enumerate(iterable)
elif isinstance(iterable, dict):
iterable = iterable.items()
for x in iterable:
key = str(x[0])
value = x[1]
if isinstance(
value, (list, tuple)
): # For now, I assume that all values in the list have the same type.
if len(value) < 1: # case empty list
hdf5_file[group + "/" + key] = []
continue
val_type = type(value[0])
if val_type == dict:
hdf5_file.create_group(group + "/" + key)
store_hdf5_dict(hdf5_file, value, group + "/" + key)
elif val_type.__module__ == np.__name__:
try:
hdf5_file[group + "/" + key] = value
except TypeError as hdf5_error:
raise TypeError(
"Unfortunately, more complex numpy types, such as object, cannot yet be stored in hdf5."
) from hdf5_error
elif isinstance(value[0], (int, float)):
hdf5_file[group + "/" + key] = np.asarray(value)
elif isinstance(value[0], str):
hdf5_file[group + "/" + key] = value
elif isinstance(value[0], (list, tuple)):
list_type = get_recursive_type(value[0])
if list_type in (int, float):
hdf5_file[group + "/" + key] = np.asarray(value)
else:
hdf5_file.create_group(group + "/" + key)
store_hdf5_dict(hdf5_file, value, group + "/" + key)
else:
raise ValueError(
f"The list with type {val_type} cannot be converted to hdf5."
)
elif isinstance(value, dict):
hdf5_file.create_group(group + "/" + key)
store_hdf5_dict(hdf5_file, value, group + "/" + key)
elif isinstance(value, bool):
hdf5_file[group + "/" + key] = np.bool_(value)
elif isinstance(
value, AnyUrl
): # This case had to be placed above the str case as AnyUrl inherits from the string class, but cannot be handled directly by h5py.
hdf5_file[group + "/" + key] = str(value)
elif isinstance(
value,
(
int,
float,
str,
),
):
hdf5_file[group + "/" + key] = value
elif type(value).__module__ == np.__name__:
try:
hdf5_file[group + "/" + key] = value
except TypeError as hdf5_error:
raise TypeError(
"Unfortunately, more complex numpy types, such as object, cannot yet be stored in hdf5."
) from hdf5_error
elif isinstance(value, datetime):
hdf5_file[group + "/" + key] = value.astimezone(timezone.utc).strftime(
"%Y-%m-%dT%H:%M:%SZ"
)
elif value is None:
hdf5_file[group + "/" + key] = h5py.Empty("f")
else:
raise ValueError(
f"Unable to store a value of type: {type(value)} in hdf5 format."
)


def get_recursive_type(obj: Any) -> type:
"""If obj is a list or tuple this function returns the type of the first object in the list/tuple that is not a list
or tuple. If the list or tuple is empty it returns None.
Finally if the object is not a list or tuple it returns the type of the object.

Parameters:
obj: any python object

Returns:
The type of the objects that the object contains or the type of the object itself when it does not contain other objects."""

if isinstance(obj, (list, tuple)):
if len(obj) == 0:
return None
else:
if isinstance(obj[0], (list, tuple)):
return get_recursive_type(obj[0])
else:
return type(obj[0])
return type(obj)


def generate_response_from_hdf5(hdf5_content: bytes) -> dict:
"""Generates a response_dict from a HDF5 file like object.
It is similar to the response_dict generated from the JSON response, except that the numerical data will have numpy
types.

Parameters:
hdf5_content: the content of a hdf5 file.

Returns:
A dictionary containing the data of the hdf5 file."""

temp_file = BytesIO(hdf5_content)
hdf5_file = h5py.File(temp_file, "r")
response_dict = generate_dict_from_hdf5(hdf5_file)
return response_dict


def generate_dict_from_hdf5(
hdf5_file: h5py._hl.files.File, group: str = "/"
) -> Union[dict, list]:
"""This function returns the content of a hdf5 group.
Because of the workaround described under the store_hdf5_dict function, groups which have numbers as keys will be turned to lists(No guartee that the order is the same as in th eoriginal list).
Otherwise, the group will be turned into a dict.

Parameters:
hdf5_file: An HDF5_object containing the data that should be converted to a dictionary or list.
group: The hdf5 group for which the dictionary should be created. The default is "/" which will return all the data in the hdf5_object

Returns:
A dict or list containing the content of the hdf5 group.
"""

return_value = None
for key, value in hdf5_file[group].items():
if key.isdigit():
if return_value is None:
return_value = []
if isinstance(value, h5py._hl.group.Group):
return_value.append(
generate_dict_from_hdf5(hdf5_file, group=group + key + "/")
)
elif isinstance(value[()], h5py._hl.base.Empty):
return_value.append(None)
elif isinstance(value[()], bytes):
return_value.append(value[()].decode())
else:
return_value.append(value[()])

else: # Case dictionary
if return_value is None:
return_value = {}
if isinstance(value, h5py._hl.group.Group):
return_value[key] = generate_dict_from_hdf5(
hdf5_file, group=group + key + "/"
)
elif isinstance(value[()], h5py._hl.base.Empty):
return_value[key] = None
elif isinstance(value[()], bytes):
return_value[key] = value[()].decode()
else:
return_value[key] = value[()]

return return_value
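The index-as-key workaround described in the docstrings above can be sketched without h5py. The helper names here are hypothetical, not part of the module, and the numeric sort is one way to restore an ordering that the current implementation does not guarantee:

```python
# Illustrative helpers for the workaround that store_hdf5_dict and
# generate_dict_from_hdf5 apply internally: HDF5 has no list type, so list
# indices become string keys on the way in, and digit-only keys mark a
# group as "really a list" on the way out.

def list_to_indexed_dict(values):
    """Store a list as a dict keyed by the stringified index."""
    return {str(i): v for i, v in enumerate(values)}

def indexed_dict_to_list(mapping):
    """Recover the list; sort numerically, since iterating an HDF5 group
    is alphabetical and "10" would otherwise sort before "2"."""
    return [mapping[key] for key in sorted(mapping, key=int)]

species = [{"name": "Si"}, {"name": "O"}]
assert indexed_dict_to_list(list_to_indexed_dict(species)) == species
```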
7 changes: 7 additions & 0 deletions optimade/models/jsonapi.py
@@ -8,6 +8,7 @@
parse_obj_as,
root_validator,
)
import numpy
from optimade.models.utils import StrictField


@@ -365,4 +366,10 @@ class Config:
datetime: lambda v: v.astimezone(timezone.utc).strftime(
"%Y-%m-%dT%H:%M:%SZ"
),
numpy.int32: lambda v: int(v),
numpy.float32: lambda v: float(v),
numpy.int64: lambda v: int(v),
numpy.float64: lambda v: float(v),
numpy.bool_: lambda v: bool(v),
numpy.ndarray: lambda v: v.tolist(),
Member:
This seems to introduce a mandatory dependency on numpy. I would suggest that the HDF5Response is in a separate module and inherits from the JSON:API one. In the best case, it will just contain this additional config, but it may also make it easier to modify where necessary.

Contributor Author:
This is not directly related to the hdf5 format, so it would be strange to place it in an HDF5Response.
I want to be able to handle NumPy numbers internally, so the format of the numbers does not need to change when they are read from a file.

I can make it so that these encoders are only loaded when NumPy is present. However, I am not sure how we should indicate optional dependencies in setup.py or requirements.txt.

}
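A conditional registration along these lines would keep numpy optional, as the reviewer requests. This is a sketch, not the merged implementation; the encoder map would be merged into the pydantic Config:

```python
# Hypothetical sketch: register the numpy encoders only when numpy is
# importable, so plain-JSON deployments need not install it.
json_encoders = {}
try:
    import numpy

    json_encoders.update(
        {
            numpy.int32: int,
            numpy.int64: int,
            numpy.float32: float,
            numpy.float64: float,
            numpy.bool_: bool,
            numpy.ndarray: lambda v: v.tolist(),
        }
    )
except ImportError:
    pass  # numpy absent: JSON types only, HDF5 responses stay disabled
```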
4 changes: 4 additions & 0 deletions optimade/server/config.py
@@ -280,6 +280,10 @@ class ServerConfig(BaseSettings):
True,
description="If True, the server will check whether the query parameters given in the request are correct.",
)
enabled_response_formats: Optional[List[str]] = Field(
Member:
Should make an enum of supported formats, then do Optional[List[SupportedFormats]] like some of the other options

Contributor Author:
Ok, I am trying to do this, but it does make things more complicated because I now have to convert the enums to a string before I can do the comparisons in my code. It would be easier to use a Literal["json", "hdf5"] instead.

Contributor Author:
I am now using an Enum class to restrict which values can be specified for enabled_response_formats.

["json"],
description="""A list of the response formats that are supported by this server. Must include the "json" format.""",
)

@validator("implementation", pre=True)
def set_implementation_version(cls, v):
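A str-mixin enum would satisfy the reviewer's suggestion while avoiding the string conversions the author mentions, since members compare equal to plain strings. The class name here is hypothetical, not the one in the PR:

```python
from enum import Enum

class SupportedResponseFormats(str, Enum):
    """Hypothetical enum of response formats. Inheriting from str makes
    members compare equal to plain strings, so checks such as
    params.response_format in CONFIG.enabled_response_formats keep working."""
    JSON = "json"
    HDF5 = "hdf5"

enabled = [SupportedResponseFormats.JSON, SupportedResponseFormats.HDF5]
assert "hdf5" in enabled  # no str() conversion needed
assert SupportedResponseFormats.JSON == "json"
```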
4 changes: 2 additions & 2 deletions optimade/server/entry_collections/entry_collections.py
@@ -301,10 +301,10 @@ def handle_query_params(
# response_format
if (
getattr(params, "response_format", False)
and params.response_format != "json"
and params.response_format not in CONFIG.enabled_response_formats
):
raise BadRequest(
detail=f"Response format {params.response_format} is not supported, please use response_format='json'"
detail=f"Response format {params.response_format} is not supported, please use one of the supported response_formats: {','.join(CONFIG.enabled_response_formats)}"
)

# page_limit
3 changes: 2 additions & 1 deletion optimade/server/middleware.py
@@ -447,7 +447,8 @@ async def dispatch(self, request: Request, call_next):
if not isinstance(chunk, bytes):
chunk = chunk.encode(charset)
body += chunk
body = body.decode(charset)
if response.raw_headers[1][1] == b"application/vnd.api+json":
body = body.decode(charset)
Member:
Is this always guaranteed to be at [1][1]? Probably better to check via the header keys.

Contributor Author:
Good point. I have changed the code so that it now loops over all entries in the header.


if self._warnings:
response = json.loads(body)
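The header loop the author describes could look like the following sketch. Starlette stores `raw_headers` as a list of `(name, value)` byte pairs in no guaranteed position, so the content type should be located by key rather than by indexing `raw_headers[1][1]`; the function name is illustrative:

```python
# Find the content type by key instead of assuming its position.
def is_json_response(raw_headers):
    for name, value in raw_headers:
        if name.lower() == b"content-type":
            return value.startswith(b"application/vnd.api+json")
    return False  # no content-type header found

assert is_json_response([(b"content-type", b"application/vnd.api+json")])
assert not is_json_response([(b"content-type", b"application/x-hdf5")])
```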
7 changes: 4 additions & 3 deletions optimade/server/routers/info.py
@@ -40,7 +40,7 @@ def get_info(request: Request) -> InfoResponse:
"version": __api_version__,
}
],
formats=["json"],
formats=CONFIG.enabled_response_formats,
available_endpoints=["info", "links"] + list(ENTRY_INFO_SCHEMAS.keys()),
entry_types_by_format={"json": list(ENTRY_INFO_SCHEMAS.keys())},
is_index=False,
@@ -71,8 +71,9 @@ def get_entry_info(request: Request, entry: str) -> EntryInfoResponse:
properties = retrieve_queryable_properties(
schema, queryable_properties, entry_type=entry
)

output_fields_by_format = {"json": list(properties.keys())}
output_fields_by_format = {}
for outputformat in CONFIG.enabled_response_formats:
output_fields_by_format[outputformat] = list(properties.keys())
Member:
Suggested change
output_fields_by_format[outputformat] = list(properties.keys())
output_fields_by_format[outputformat] = list(properties)

.keys() is unnecessary if you just want a list of all keys (I see we use it above too, could be removed)

Contributor Author:
I have removed the unnecessary .keys() from this file.
It would probably be good to do a regex search for the "list(*.keys()" pattern, so we can remove these in all our code.


return EntryInfoResponse(
meta=meta_values(
23 changes: 20 additions & 3 deletions optimade/server/routers/utils.py
@@ -4,7 +4,7 @@
from datetime import datetime
from typing import Any, Dict, List, Set, Union

from fastapi import Request
from fastapi import Request, Response
from fastapi.responses import JSONResponse
from starlette.datastructures import URL as StarletteURL

@@ -22,6 +22,7 @@
from optimade.server.exceptions import BadRequest, InternalServerError
from optimade.server.query_params import EntryListingQueryParams, SingleEntryQueryParams
from optimade.utils import mongo_id_for_database, get_providers, PROVIDER_LIST_URLS
from optimade.adapters.hdf5 import generate_hdf5_file_content

__all__ = (
"BASE_URL_PREFIXES",
@@ -265,7 +266,7 @@ def get_entries(
if fields or include_fields:
results = handle_response_fields(results, fields, include_fields)

return response(
response_object = response(
links=links,
data=results,
meta=meta_values(
@@ -277,6 +278,14 @@
),
included=included,
)
if params.response_format == "json":
return response_object
elif params.response_format == "hdf5":
Member:
Need to check whether hdf5 is also enabled in the CONFIG.enabled_response_formats too right?

Member:
(I now see that this is done in handle_query_params, but perhaps another guard is needed here so that implementations can pick and choose which bits of the reference server they use.)

Contributor Author:
I have added an extra check.

return Response(
content=generate_hdf5_file_content(response_object),
media_type="application/x-hdf5",
headers={"Content-Disposition": "attachment"},
)


def get_single_entry(
@@ -313,7 +322,7 @@ def get_single_entry(
if fields or include_fields and results is not None:
results = handle_response_fields(results, fields, include_fields)[0]

return response(
response_object = response(
links=links,
data=results,
meta=meta_values(
@@ -325,3 +334,11 @@
),
included=included,
)
if params.response_format == "json":
return response_object
elif params.response_format == "hdf5":
return Response(
content=generate_hdf5_file_content(response_object),
media_type="application/x-hdf5",
headers={"Content-Disposition": "attachment"},
)
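The response-format dispatch used in both get_entries and get_single_entry, including the extra guard discussed in this thread, can be sketched with placeholders. Here `enabled_response_formats` stands in for `CONFIG.enabled_response_formats`, plain tuples stand in for FastAPI response objects, and serialization of the HDF5 payload is elided:

```python
# Sketch of the json/hdf5 dispatch with a local guard, so the helper stays
# safe even when reused outside handle_query_params.
enabled_response_formats = ["json", "hdf5"]

def dispatch_response(response_object, response_format):
    if response_format not in enabled_response_formats:
        raise ValueError(
            f"Response format {response_format} is not supported; "
            f"use one of: {', '.join(enabled_response_formats)}"
        )
    if response_format == "json":
        return ("application/vnd.api+json", response_object)
    # "hdf5": the real code wraps generate_hdf5_file_content(response_object)
    # in a Response with a Content-Disposition: attachment header
    return ("application/x-hdf5", response_object)

assert dispatch_response({}, "hdf5")[0] == "application/x-hdf5"
```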
2 changes: 2 additions & 0 deletions requirements.txt
@@ -1,8 +1,10 @@
elasticsearch-dsl==7.4.0
email_validator==1.2.1
fastapi==0.79.0
h5py==3.7.0
lark==1.1.2
mongomock==4.1.2
numpy==1.23.0
pydantic==1.9.1
pymongo==4.1.1
pyyaml==5.4