Add hdf5 response format #1292
base: main
Changes from 13 commits
@@ -0,0 +1,3 @@
# hdf5

::: optimade.adapters.hdf5
@@ -0,0 +1,230 @@
from io import BytesIO
from typing import Union, Any
from pydantic import AnyUrl
from datetime import datetime, timezone
from optimade.models import EntryResponseMany, EntryResponseOne
import h5py
import numpy as np


"""This adapter module can be used to generate an HDF5 response instead of a JSON response, and to convert the HDF5 response back into a Python dictionary.
It can handle numeric data in a binary format compatible with NumPy.
It is therefore more efficient than the JSON format at returning large amounts of numeric data.
It does, however, have more overhead, resulting in a larger response for entries with little numeric data.
To enable support on your server, the parameter "enabled_response_formats" can be specified in the config file.
It is a list of the supported response formats. To support the HDF5 return format it should be set to: ["json", "hdf5"]
(support for the JSON format is mandatory).

Unfortunately, h5py does not support storing objects with the numpy.object type.
It is therefore not possible to directly store a list of dictionaries in an HDF5 file with h5py.
As a workaround, the index of a value in a list is used as a dictionary key, so a list can be stored as a dictionary if necessary.
"""

def generate_hdf5_file_content(
    response_object: Union[EntryResponseMany, EntryResponseOne, dict, list, tuple]
) -> bytes:
    """This function generates the content of an HDF5 file from an EntryResponse object.
    It can also handle Python dictionaries, lists and tuples.

    Parameters:
        response_object: An OPTIMADE response object. This can be of any OPTIMADE entry type, such as structure, reference, etc.

    Returns:
        A binary object containing the contents of the HDF5 file.
    """

    temp_file = BytesIO()
    hdf5_file = h5py.File(temp_file, "w")
    if isinstance(response_object, (EntryResponseMany, EntryResponseOne)):
        response_object = response_object.dict(exclude_unset=True)
    store_hdf5_dict(hdf5_file, response_object)
    hdf5_file.close()
    file_content = temp_file.getvalue()
    temp_file.close()
    return file_content

def store_hdf5_dict(
    hdf5_file: h5py._hl.files.File, iterable: Union[dict, list, tuple], group: str = ""
):
    """This function stores a Python list, dictionary or tuple in an HDF5 file.
    The currently supported datatypes are str, int, float, list, dict, tuple, bool, AnyUrl,
    None, datetime, and any NumPy type or NumPy array.

    Unfortunately, h5py does not support storing objects with the numpy.object type.
    It is therefore not possible to directly store a list of dictionaries in an HDF5 file with h5py.
    As a workaround, the index of a value in a list is used as a dictionary key, so a list can be stored as a dictionary if necessary.

    Parameters:
        hdf5_file: An HDF5 file-like object.
        iterable: The object to be stored in the HDF5 file.
        group: The group in the HDF5 file to which the list, tuple or dictionary should be added.

    Raises:
        TypeError: If h5py cannot store a value (e.g. a more complex NumPy type such as object).
        ValueError: If this function encounters an object with a type that it cannot convert to the HDF5 format.
    """
    if isinstance(iterable, (list, tuple)):
        iterable = enumerate(iterable)
    elif isinstance(iterable, dict):
        iterable = iterable.items()
    for key, value in iterable:
        key = str(key)
        if isinstance(
            value, (list, tuple)
        ):  # For now, it is assumed that all values in the list have the same type.
            if len(value) < 1:  # Case: empty list.
                hdf5_file[group + "/" + key] = []
                continue
            val_type = type(value[0])
            if val_type == dict:
                hdf5_file.create_group(group + "/" + key)
                store_hdf5_dict(hdf5_file, value, group + "/" + key)
            elif val_type.__module__ == np.__name__:
                try:
                    hdf5_file[group + "/" + key] = value
                except TypeError as hdf5_error:
                    raise TypeError(
                        "Unfortunately, more complex NumPy types like object cannot yet be stored in HDF5. Error from h5py: "
                        + str(hdf5_error)
                    ) from hdf5_error
            elif isinstance(value[0], (int, float)):
                hdf5_file[group + "/" + key] = np.asarray(value)
            elif isinstance(value[0], str):
                hdf5_file[group + "/" + key] = value
            elif isinstance(value[0], (list, tuple)):
                list_type = get_recursive_type(value[0])
                if list_type in (int, float):
                    hdf5_file[group + "/" + key] = np.asarray(value)
                else:
                    hdf5_file.create_group(group + "/" + key)
                    store_hdf5_dict(hdf5_file, value, group + "/" + key)
            else:
                raise ValueError(
                    f"The list with type {val_type} cannot be converted to HDF5."
                )
        elif isinstance(value, dict):
            hdf5_file.create_group(group + "/" + key)
            store_hdf5_dict(hdf5_file, value, group + "/" + key)
        elif isinstance(value, bool):
            hdf5_file[group + "/" + key] = np.bool_(value)
        elif isinstance(
            value, AnyUrl
        ):  # This case has to be placed above the str case, as AnyUrl inherits from str but cannot be handled directly by h5py.
            hdf5_file[group + "/" + key] = str(value)
        elif isinstance(value, (int, float, str)):
            hdf5_file[group + "/" + key] = value
        elif type(value).__module__ == np.__name__:
            try:
                hdf5_file[group + "/" + key] = value
            except TypeError as hdf5_error:
                raise TypeError(
                    "Unfortunately, more complex NumPy types like object cannot yet be stored in HDF5. Error from h5py: "
                    + str(hdf5_error)
                ) from hdf5_error
        elif isinstance(value, datetime):
            hdf5_file[group + "/" + key] = value.astimezone(timezone.utc).strftime(
                "%Y-%m-%dT%H:%M:%SZ"
            )
        elif value is None:
            hdf5_file[group + "/" + key] = h5py.Empty("f")
        else:
            raise ValueError(
                f"Unable to store a value of type {type(value)} in HDF5 format."
            )

def get_recursive_type(obj: Any) -> Union[type, None]:
    """If obj is a list or tuple, this function returns the type of the first object in the list/tuple that is not a list
    or tuple. If the list or tuple is empty, it returns None.
    Finally, if the object is not a list or tuple, it returns the type of the object.

    Parameters:
        obj: Any Python object.

    Returns:
        The type of the objects the object contains, or the type of the object itself when it does not contain other objects."""

    if isinstance(obj, (list, tuple)):
        if len(obj) == 0:
            return None
        if isinstance(obj[0], (list, tuple)):
            return get_recursive_type(obj[0])
        return type(obj[0])
    return type(obj)
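To illustrate the recursion above, here is a standalone sketch of the same type-probing logic (a self-contained copy for illustration only, independent of the module; the name `recursive_type` is hypothetical):

```python
from typing import Any, Optional

def recursive_type(obj: Any) -> Optional[type]:
    # Descend into nested lists/tuples via the first element and report
    # the innermost element type; empty sequences yield None, and plain
    # scalars report their own type.
    if isinstance(obj, (list, tuple)):
        if not obj:
            return None
        if isinstance(obj[0], (list, tuple)):
            return recursive_type(obj[0])
        return type(obj[0])
    return type(obj)

print(recursive_type([[1.0, 2.0], [3.0]]))  # <class 'float'>
print(recursive_type([]))                   # None
```

Note that, like the function above, this only inspects the first element, which matches the stated assumption that all values in a list share the same type.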

def generate_response_from_hdf5(hdf5_content: bytes) -> dict:
    """Generates a response_dict from an HDF5 file-like object.
    It is similar to the response_dict generated from the JSON response, except that the numerical data will have NumPy
    types.

    Parameters:
        hdf5_content: The content of an HDF5 file.

    Returns:
        A dictionary containing the data of the HDF5 file."""

    temp_file = BytesIO(hdf5_content)
    hdf5_file = h5py.File(temp_file, "r")
    response_dict = generate_dict_from_hdf5(hdf5_file)
    return response_dict

def generate_dict_from_hdf5(
    hdf5_file: h5py._hl.files.File, group: str = "/"
) -> Union[dict, list]:
    """This function returns the content of an HDF5 group.
    Because of the workaround described under the store_hdf5_dict function, groups which have numbers as keys will be
    turned into lists (there is no guarantee that the order is the same as in the original list).
    Otherwise, the group will be turned into a dict.

    Parameters:
        hdf5_file: An HDF5 object containing the data that should be converted to a dictionary or list.
        group: The HDF5 group for which the dictionary should be created. The default is "/", which will return all the data in the HDF5 object.

    Returns:
        A dict or list containing the content of the HDF5 group.
    """

    return_value = None
    for key, value in hdf5_file[group].items():
        if key.isdigit():  # Case: list.
            if return_value is None:
                return_value = []
            if isinstance(value, h5py._hl.group.Group):
                return_value.append(
                    generate_dict_from_hdf5(hdf5_file, group=group + key + "/")
                )
            elif isinstance(value[()], h5py._hl.base.Empty):
                return_value.append(None)
            elif isinstance(value[()], bytes):
                return_value.append(value[()].decode())
            else:
                return_value.append(value[()])
        else:  # Case: dictionary.
            if return_value is None:
                return_value = {}
            if isinstance(value, h5py._hl.group.Group):
                return_value[key] = generate_dict_from_hdf5(
                    hdf5_file, group=group + key + "/"
                )
            elif isinstance(value[()], h5py._hl.base.Empty):
                return_value[key] = None
            elif isinstance(value[()], bytes):
                return_value[key] = value[()].decode()
            else:
                return_value[key] = value[()]

    return return_value
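The list-as-dict workaround described in the docstrings can be illustrated without h5py. This sketch (with hypothetical helper names) converts a list into an index-keyed mapping and back, which is essentially what the group keys in the HDF5 file look like:

```python
def list_to_indexed_dict(values):
    # Mirror of the storage workaround: list positions become string keys,
    # since h5py groups cannot hold a list of heterogeneous objects directly.
    return {str(i): v for i, v in enumerate(values)}

def indexed_dict_to_list(mapping):
    # Mirror of the read path, but sorting digit keys numerically so the
    # original order is recovered even beyond ten elements.
    return [mapping[k] for k in sorted(mapping, key=int)]

stored = list_to_indexed_dict(["H", "O", "H"])
print(stored)  # {'0': 'H', '1': 'O', '2': 'H'}
```

Note that the read path in generate_dict_from_hdf5 above appends values in the iteration order of the group, which is why its docstring warns that list order is not guaranteed; sorting keys numerically, as in this sketch, is one way that guarantee could be restored.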
@@ -280,6 +280,10 @@ class ServerConfig(BaseSettings):
        True,
        description="If True, the server will check whether the query parameters given in the request are correct.",
    )
    enabled_response_formats: Optional[List[str]] = Field(
        ["json"],
        description="""A list of the response formats that are supported by this server. Must include the "json" format.""",
    )

    @validator("implementation", pre=True)
    def set_implementation_version(cls, v):

Review comment: Should make an enum of supported formats, then do …

Reply: Ok, I am trying to do this, but it does make things more complicated, because I now have to convert the enums to a string before I can do the comparisons in my code. It would be easier to use a Literal["json", "hdf5"] instead.

Reply: I am now using an Enum class to restrict which values can be specified for enabled_response_formats.
@@ -447,7 +447,8 @@ async def dispatch(self, request: Request, call_next):
            if not isinstance(chunk, bytes):
                chunk = chunk.encode(charset)
            body += chunk
-        body = body.decode(charset)
+        if response.raw_headers[1][1] == b"application/vnd.api+json":
+            body = body.decode(charset)

        if self._warnings:
            response = json.loads(body)

Review comment: Is this always guaranteed to be at …

Reply: Good point, I have changed the code, so it now loops over all entries in the header.
@@ -40,7 +40,7 @@ def get_info(request: Request) -> InfoResponse:
                "version": __api_version__,
            }
        ],
-        formats=["json"],
+        formats=CONFIG.enabled_response_formats,
        available_endpoints=["info", "links"] + list(ENTRY_INFO_SCHEMAS.keys()),
        entry_types_by_format={"json": list(ENTRY_INFO_SCHEMAS.keys())},
        is_index=False,
@@ -71,8 +71,9 @@ def get_entry_info(request: Request, entry: str) -> EntryInfoResponse:
    properties = retrieve_queryable_properties(
        schema, queryable_properties, entry_type=entry
    )

-    output_fields_by_format = {"json": list(properties.keys())}
+    output_fields_by_format = {}
+    for outputformat in CONFIG.enabled_response_formats:
+        output_fields_by_format[outputformat] = list(properties.keys())

    return EntryInfoResponse(
        meta=meta_values(

Review comment: Suggested change: …

Reply: I have removed the unnecessary .keys() from this file.
@@ -4,7 +4,7 @@
from datetime import datetime
from typing import Any, Dict, List, Set, Union

-from fastapi import Request
+from fastapi import Request, Response
from fastapi.responses import JSONResponse
from starlette.datastructures import URL as StarletteURL

@@ -22,6 +22,7 @@
from optimade.server.exceptions import BadRequest, InternalServerError
from optimade.server.query_params import EntryListingQueryParams, SingleEntryQueryParams
from optimade.utils import mongo_id_for_database, get_providers, PROVIDER_LIST_URLS
+from optimade.adapters.hdf5 import generate_hdf5_file_content

__all__ = (
    "BASE_URL_PREFIXES",
@@ -265,7 +266,7 @@ def get_entries(
    if fields or include_fields:
        results = handle_response_fields(results, fields, include_fields)

-    return response(
+    response_object = response(
        links=links,
        data=results,
        meta=meta_values(

@@ -277,6 +278,14 @@ def get_entries(
        ),
        included=included,
    )
+    if params.response_format == "json":
+        return response_object
+    elif params.response_format == "hdf5":
+        return Response(
+            content=generate_hdf5_file_content(response_object),
+            media_type="application/x-hdf5",
+            headers={"Content-Disposition": "attachment"},
+        )


def get_single_entry(

Review comment: Need to check whether hdf5 is also enabled in the …

Reply: (I now see that this is done in the …

Reply: I have added an extra check.
@@ -313,7 +322,7 @@ def get_single_entry(
    if fields or include_fields and results is not None:
        results = handle_response_fields(results, fields, include_fields)[0]

-    return response(
+    response_object = response(
        links=links,
        data=results,
        meta=meta_values(

@@ -325,3 +334,11 @@ def get_single_entry(
        ),
        included=included,
    )
+    if params.response_format == "json":
+        return response_object
+    elif params.response_format == "hdf5":
+        return Response(
+            content=generate_hdf5_file_content(response_object),
+            media_type="application/x-hdf5",
+            headers={"Content-Disposition": "attachment"},
+        )
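The branching at the end of get_entries and get_single_entry follows the same pattern, plus the enabled-formats check the reviewer asked for. A minimal sketch of that content negotiation (hypothetical names, no FastAPI dependency; the real code wraps `generate_hdf5_file_content(...)` in a `Response`):

```python
def negotiate(response_format, response_object, enabled_formats=("json", "hdf5")):
    # Reject formats the server has not enabled (the extra check added
    # after review), then pick a media type for the serialisation.
    if response_format not in enabled_formats:
        raise ValueError(f"Unsupported response_format: {response_format}")
    if response_format == "json":
        return ("application/vnd.api+json", response_object)
    # response_format == "hdf5": the body would be the binary HDF5 content.
    return ("application/x-hdf5", response_object)

media_type, _ = negotiate("hdf5", {"data": []})
print(media_type)  # application/x-hdf5
```

Keeping the negotiation in one helper like this would also avoid duplicating the if/elif block across the two endpoint functions.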
Review comment: This seems to introduce a mandatory dependency on numpy. I would suggest that the HDF5Response is in a separate module and inherits from the JSON:API one. In the best case, it will just contain this additional config, but it may also make it easier to modify where necessary.

Reply: This is not directly related to the hdf5 format, so it would be strange to place it in an HDF5Response.
I want to be able to handle NumPy numbers internally, so the format of the numbers does not need to change when they are read from a file.
I can make it so that these encoders are only loaded when NumPy is present. However, I am not sure how we should indicate optional dependencies in setup.py or requirements.txt.