From 7b4d737244774dbef778d0bc2e6e33eaabf9de1f Mon Sep 17 00:00:00 2001
From: Quarto GHA Workflow Runner
Table of contents
@@ -164,12 +170,12 @@
3 Introduction
Data and metadata standards that adopt tools and practices of OSS (“open-source standards” henceforth) stand to reap many of the benefits that the OSS model has provided in the development of other technologies. The present report explores how OSS processes and tools have affected the development of data and metadata standards. The report will triangulate common features of a variety of use cases; it will identify some of the challenges and pitfalls of this mode of standards development, with a particular focus on cross-sector interactions; and it will make recommendations for future developments and policies that can help this mode of standards development thrive and reach its full potential.
-To understand how OSS development practices affect the development of data and metadata standards, it is informative to demonstrate this cross-fertilization through a few use cases. As we will see in these examples some fields, such as astronomy, high-energy physics and earth sciences have a relatively long history of shared data resources from organizations such as LSST and CERN, while other fields have only relatively recently become aware of the value of data sharing and its impact. These disparate histories inform how standards have evolved and how OSS practices have pervaded their development.
At the same time, these tools and practices are associated with risks that need to be mitigated.
One of the defining characteristics of OSS is its dynamism and its rapid evolution. Because OSS can be used by anyone and, in most cases, contributions can be made by anyone, innovations flow into OSS in a bottom-up fashion from user/developers. Pathways to contribution by members of the community are often well-defined: both from the technical perspective (e.g., through a pull request on GitHub, or other similar mechanisms), as well as from the social perspective (e.g., whether contributors need to accept certain licensing conditions through a contributor licensing agreement) and the socio-technical perspective (e.g., how many people need to review a contribution, what are the timelines for a contribution to be reviewed and accepted, what are the release cycles of the software that make the contribution available to a broader community of users, etc.). Similarly, open-source standards may also find themselves addressing use cases and solutions that were not originally envisioned through bottom-up contributions of members of a research community to which the standard pertains. However, while this dynamism provides an avenue for flexibility, it also presents a source of tension. This is because data and metadata standards apply to already existing datasets, and changes may affect the compliance of these existing datasets. Similarly, analysis technology stacks that are developed based on an existing version of a standard have to adapt to the introduction of new ideas and changes into a standard. Dynamic changes of this sort therefore risk causing a loss of faith in the standard among its user community and migration away from the standard. Likewise, if a standard evolves too rapidly, users may choose to stick to an outdated version of the standard for a long time, creating strains on the community of developers and maintainers of the standard, who will need to accommodate long deprecation cycles.
-Data standardization investment is justified if the standard is generalizable beyond any specific science domain. However while the use cases are domain sciences based, data standardization is seen as a data infrastructure and not a science investment. Moreover due to how science research funding works, scientists lack incentives to work across domains, or work on infrastructure problems.
+Data standardization investment is justified if the standard is generalizable beyond any specific science domain. However, while the use cases are based in the domain sciences, data standardization is seen as a data-infrastructure investment and not a science investment. Moreover, due to how science research funding works, scientists lack incentives to work across domains or to work on infrastructure problems.
-Data for scientific observations are often generated by proprietary instrumentation due to commercialization or other profit driven incentives. There islack of regulatory oversight to adhere to available standards or evolve Significant data transformation is required to get data to a state that is amenable to standards, if available. If not available, there is lack of incentive to set aside investment or resources to invest in establishing data standards.
+Data for scientific observations are often generated by proprietary instrumentation due to commercialization or other profit-driven incentives. There is a lack of regulatory oversight requiring adherence to available standards or their evolution. Significant data transformation is required to get data to a state that is amenable to standards, where standards are available. Where they are not, there is little incentive to set aside investment or resources for establishing data standards.
+Open-source standards development faces the challenges of adapting to new computing paradigms and technologies. Cloud computing provides a particularly stark set of opportunities and challenges. On the one hand, cloud computing offers practical solutions for many challenges of contemporary data-driven research. For example, the scalability of cloud resources addresses some of the challenges of the scale of data that is produced by instruments in many fields. The cloud also makes data access relatively straightforward, because of the ability to determine data access permissions in a granular fashion. On the other hand, cloud computing requires reinstrumenting many data formats. This is because cloud data access patterns are fundamentally different from the ones that are used in local POSIX-style file systems. Suspicion of cloud computing comes in two different flavors: the first comes from researchers and administrators who may be wary of costs associated with cloud computing, and especially of the difficulty of predicting these costs. Projects such as NSF’s CloudBank seek to mitigate some of these concerns by providing an additional layer of transparency into cloud costs (Norman et al. 2021). The other type of objection relates to the fact that cloud computing services, by their very nature, are closed ecosystems that resist portability and interoperability. Some aspects of the services are always going to remain hidden and privy only to the cloud computing service provider. In this respect, cloud computing runs afoul of some of the appealing aspects of OSS. That said, the development of “cloud native” standards can provide significant benefits in terms of the research that can be conducted. For example, NOAA plans to use cloud computing for integration across the multiple disparate datasets that it collects to build knowledge graphs that can be queried by researchers to answer questions that can only be answered through this integration.
Putting all the data “in one place” should help with that. Adaptation to the cloud in terms of data standards has driven development of new file formats. A salient example is the Zarr format (Miles et al. 2024), which supports random access into array-based datasets stored in cloud object storage, facilitating scalable and parallelized computing on these data. Indeed, data standards such as NWB (neuroscience) and OME (microscopy) now use Zarr as a backend for cloud-based storage. In other cases, file formats that were once not straightforward to use in the cloud, such as HDF5 and TIFF, have been adapted to cloud use (e.g., through the cloud-optimized GeoTIFF format).
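To make the idea of random access into chunked arrays more concrete, the following is a minimal sketch in plain Python, not the Zarr library or its API. It assumes an array split into a regular grid of fixed-size chunks, each stored as a separate object, and computes which chunks a rectangular read would have to fetch. The function names `chunks_for_slice` and `chunk_key`, and the dot-separated key naming, are hypothetical choices made for illustration only.

```python
from itertools import product


def chunks_for_slice(shape, chunk_shape, selection):
    """Return the chunk-grid indices that overlap a rectangular selection.

    shape, chunk_shape: tuples of ints, one entry per dimension.
    selection: tuple of (start, stop) pairs, one per dimension (stop exclusive).
    """
    per_dim = []
    for size, chunk, (start, stop) in zip(shape, chunk_shape, selection):
        if not (0 <= start < stop <= size):
            raise ValueError("selection out of bounds")
        first = start // chunk        # chunk containing the first element
        last = (stop - 1) // chunk    # chunk containing the last element
        per_dim.append(range(first, last + 1))
    # Cartesian product over dimensions gives every chunk that is touched.
    return list(product(*per_dim))


def chunk_key(index):
    """Build an illustrative object-store key for a chunk index (assumed naming)."""
    return ".".join(str(i) for i in index)


# A 1000 x 1000 array stored as 100 x 100 chunks (100 objects in total).
needed = chunks_for_slice((1000, 1000), (100, 100), ((250, 450), (0, 100)))
keys = [chunk_key(idx) for idx in needed]
```

Under these assumptions, reading rows 250 to 449 of the first 100 columns touches only three of the hundred chunk objects, and each of them can be fetched independently and in parallel. This is the property that makes chunked layouts a good fit for object storage.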