GSoC Ideas
Xbitinfo is an open-source Python package that enables lossy compression of geo-spatial data based on its information content. Embedded into the pangeo ecosystem, xbitinfo builds on top of xarray and dask and allows for fast compression and analysis of various data formats including netCDF and zarr. Xbitinfo addresses the challenge of ever larger datasets that are being created as more compute power becomes available. Climate simulations at sub-km resolution with petabytes of output are just one example where xbitinfo can help to keep the dataset manageable.
A good general introduction to the compression technique that we use is presented in this clip and this Nature paper. These introduce the general concept of bit rounding and the original Julia implementation. An introductory movie about the Python implementation can be found here.
To play around with xbitinfo and to familiarise yourself with its syntax, the README contains valuable information on how to install the necessary environment. A simple example is included there as well. Further usage examples are available at https://xbitinfo.readthedocs.io/. In case an issue occurs or documentation is missing, check the current issues and feel free to create a new one.
It can also be very helpful to use your own dataset and try to compress it with the help of xbitinfo.
If you plan to become a GSoC contributor, we encourage you to
- contact us early to show your interest in a project idea
- look through the current list of issues and identify one that you want to solve. You likely want to choose one that is marked as good-first-issue
- comment on the issue and request to be assigned to the issue so that we can make sure only one person is working on each issue
- open a pull request and discuss your contributions with the maintainers
If you consider applying for the project ideas below, it's a good idea to contact the potential mentors before submitting the application to answer any questions you might have. Please also review our contribution guidelines.
The deadline to upload your proposal at https://summerofcode.withgoogle.com/ is April 2nd, 18:00 UTC!
- Select a project idea (see below) and write a detailed proposal following this template using Google Docs (in advance!)
- Plan your prep work for the community bonding period (e.g. a Proof-Of-Concept)
- Define milestones for each evaluation phase (i.e. Prototype, Pilot / Final Demo)
- Plan your weekly work & deliverables (tasking out: high-level goals for each milestone)
- Describe the acceptance criteria ("Minimum Viable Product" of each phase)
- Share an early draft and discuss your approach in the group with mentors. We are here to help!
Do not forget to submit your application to the Google system when it is ready, ideally some days before the deadline (the server can be overloaded at the last minute). Xbitinfo takes part in GSoC as a sub-org of The Python Software Foundation.
Tips: read and follow the GSoC guide & PSF check-list
Here is a list of projects that we think are interesting and can be managed within the timeframe of the Google Summer of Code.
Datasets are continuously increasing in size, and geo-scientific datasets are no exception. Satellites gather images at ever higher resolution, and simulations of weather and climate resolve more and more detail.
This challenge aims at improving the performance of the algorithm and its implementation, in particular the calculation of the bitinformation content. Options are to analyse bottlenecks in the dask implementation and/or to interact with the Julia implementation BitInformation.jl.
- Skills: Shell and Python; familiarity with xarray, zarr, dask and Julia is highly beneficial.
- Difficulty: Medium
- Project length: 350 hours
- Potential mentors: Hauke Schulz (@observingClouds), Milan Klöwer (@milankl)
The bitinformation framework can, for a given bitstream, distinguish between so-called real and false information, whereby real information is defined as the mutual information between adjacent bits. The real information for a bitstream can be at most the bitstream's entropy, which can be thought of as unconditional information. The false information is the difference between entropy and real information. In simple terms: real + false = entropy.
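To make the real + false = entropy split concrete, here is a minimal sketch in plain Python. It is not the xbitinfo or BitInformation.jl implementation: the function names are hypothetical, and it works on a single stream of 0s and 1s rather than on bit positions of floating-point numbers. Both quantities are finite-sample estimates, so for short streams the estimated real information can slightly exceed the estimated entropy.

```python
import math
import random
from collections import Counter


def entropy_bits(bits):
    """Shannon entropy (in bits) of a sequence of 0s and 1s."""
    n = len(bits)
    return -sum((c / n) * math.log2(c / n) for c in Counter(bits).values())


def adjacent_mutual_information(bits):
    """Mutual information (in bits) between each bit and its successor."""
    pairs = list(zip(bits[:-1], bits[1:]))
    n = len(pairs)
    joint = Counter(pairs)
    left = Counter(a for a, _ in pairs)
    right = Counter(b for _, b in pairs)
    return sum(
        (c / n) * math.log2((c / n) / ((left[a] / n) * (right[b] / n)))
        for (a, b), c in joint.items()
    )


def split_information(bits):
    """Hypothetical helper: entropy, real and false information of a stream."""
    entropy = entropy_bits(bits)
    real = adjacent_mutual_information(bits)
    return {"entropy": entropy, "real": real, "false": entropy - real}


# Independent random bits: high entropy, almost no real information.
rng = random.Random(0)
noise_info = split_information([rng.getrandbits(1) for _ in range(10_000)])

# Perfectly predictable alternation: almost all of the entropy is real.
alternating_info = split_information([0, 1] * 5_000)

print(noise_info)
print(alternating_info)
```

The random stream plays the role of trailing mantissa bits (entropy close to 1 bit, real information close to 0), while the alternating stream is the opposite extreme.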
The trailing mantissa bits in a geospatial dataset are usually of high entropy but low real information. However, if the dataset was lossily compressed before, especially through quantisation in another binary format (e.g. the linear packing that is often employed in netCDF data), some of the false information would be interpreted as real, as the quantisation can introduce mutual information. For example, if every two neighbouring grid points are quantised to their average, the resulting bitstream consists only of pairs of 0s and pairs of 1s, e.g. 0011000011001111. We call this information, which is only a consequence of prior compression, "artificial".
At the moment, artificial information is avoided by using high-precision/uncompressed data, but such data is not always available, and users may not be aware of the issue. Looking at the bitwise information, one often sees information re-emerging in the trailing mantissa bits, but that is a very crude rule of thumb. Sometimes the information never drops to zero.
This project's aim is to find an information-theoretic approach to separate artificial information from real information. It will start with theoretical considerations and simple test cases, followed by the development of new algorithms or the refinement of existing ones. Ultimately, such an artificial information filter would be part of xbitinfo and should be switched on by default.
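The pairing example can be reproduced with a small experiment. The sketch below is an illustration under simplifying assumptions, not part of xbitinfo: it takes independent random bits, repeats each bit once to mimic two neighbours being quantised to the same value, and estimates the mutual information between adjacent bits before and after.

```python
import math
import random
from collections import Counter


def adjacent_mutual_information(bits):
    """Mutual information (in bits) between each bit and its successor."""
    pairs = list(zip(bits[:-1], bits[1:]))
    n = len(pairs)
    joint = Counter(pairs)
    left = Counter(a for a, _ in pairs)
    right = Counter(b for _, b in pairs)
    return sum(
        (c / n) * math.log2((c / n) / ((left[a] / n) * (right[b] / n)))
        for (a, b), c in joint.items()
    )


rng = random.Random(42)

# Independent bits: no information is shared between neighbours.
independent = [rng.getrandbits(1) for _ in range(20_000)]

# Hypothetical "prior compression": every bit appears twice in a row,
# giving streams like 0011000011001111.
paired = [bit for bit in independent for _ in range(2)]

mi_independent = adjacent_mutual_information(independent)
mi_paired = adjacent_mutual_information(paired)

print(f"independent bits: {mi_independent:.4f} bits")
print(f"paired bits:      {mi_paired:.4f} bits")
```

For a long stream the paired case approaches 0.75·log2(1.5) − 0.25 ≈ 0.19 bits, even though the underlying data carries no mutual information at all. This is the artificial information the project aims to filter out.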
- Skills: Background in maths/informatics or a similar field; familiarity with Shell, Python, xarray, zarr and dask is beneficial
- Difficulty: Difficult
- Project length: 350 hours
- Potential mentors: Milan Klöwer (@milankl), Hauke Schulz (@observingClouds)
The main application of xbitinfo is regularly gridded data on a longitude-latitude grid or similar, but many other grids exist for Earth system modelling. In principle, the bitinformation framework can be generalised to other grids too, as long as neighbouring grid cells are also adjacent in memory, but this application remains largely uninvestigated. In the most extreme case one could use bitinformation to compress an unstructured mesh with data stored using some space-filling curve. Challenges here are that the direction to the next grid cell is not uniform and that the distances between grid cells can change considerably too. How does the bitinformation change in regions of high or low resolution? Can we always assume bitinformation to be largely isotropic, so that a space-filling curve has little impact on the bitinformation analysis?
Furthermore, a large class of models uses polynomial representations of data, e.g. spectral models use Fourier modes, Legendre or Chebyshev polynomials, and spectral models of the sphere use spherical harmonics. Data in these spaces does not satisfy the "neighbouring grid cells are also adjacent in memory" requirement; however, bitrounding can still be applied to enhance compression. It is unclear how to generalise the bitinformation framework to data stored in such spaces, because this underlying assumption does not hold. How does the rounding error of bitrounded data project onto spectral space? How can bitrounding be done in spectral space to guarantee a bounded rounding error in grid-point space? How efficient is the round+lossless compression when applied in spectral space?
This project would, as a first step, involve the bitinformation analysis of data on various grids and spaces to understand how essentially the same rounding error can be guaranteed, regardless of the space, while maintaining the high compressibility of bitrounding. Libraries for the transforms between spaces are available and would not need to be developed, only used, in this project. As a second step, the bitinformation algorithms will be extended to allow for more flexibility when the space the data is stored in is known. This project will start with artificial data to better test the concepts to be developed in theory before going to real data.
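As background for these questions, the sketch below shows what bitrounding does to a single float32 value in the space it is applied in, and why the relative error there is bounded. It is a simplified stand-in written with the standard library only, not the xbitinfo or BitInformation.jl code: the function names are hypothetical, ties are rounded away from zero, and inputs are assumed to be finite.

```python
import struct


def to_float32(x):
    """Round a Python float to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def bitround_float32(x, keepbits):
    """Keep only `keepbits` explicit mantissa bits of x as a float32.

    Simplified sketch: ties round away from zero, input must be finite.
    """
    if not 0 < keepbits < 23:
        raise ValueError("keepbits must be between 1 and 22")
    drop = 23 - keepbits                      # trailing mantissa bits to zero out
    word = struct.unpack("<I", struct.pack("<f", x))[0]
    half = 1 << (drop - 1)                    # half a unit in the last kept place
    mask = (0xFFFFFFFF << drop) & 0xFFFFFFFF  # sign, exponent, leading mantissa bits
    return struct.unpack("<f", struct.pack("<I", (word + half) & mask))[0]


keepbits = 7
samples = [to_float32(v) for v in (0.1, 1.1, 273.15, -1013.25, 6.02e23)]

# Rounding to the nearest representable value is off by at most half a unit
# in the last kept place, so the relative error is at most 2**-(keepbits + 1).
max_rel_error = max(abs(bitround_float32(v, keepbits) - v) / abs(v) for v in samples)
print(max_rel_error, 2.0 ** -(keepbits + 1))
```

The bound holds in the space where the rounding is applied. Whether a comparable bound survives a transform between spectral and grid-point space is exactly the open question raised above.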
- Skills: Background in maths/informatics or a similar field; familiarity with Shell, Python, xarray, zarr and dask is beneficial
- Difficulty: Difficult
- Project length: 350 hours
- Potential mentors: Milan Klöwer (@milankl), Hauke Schulz (@observingClouds)
ERA5 is the most widely used dataset that describes the world's weather from 1940 to present at 25 km global resolution and hourly time steps. It is used as the ground truth for many machine learning applications and the weatherbench benchmark. It contains variables like temperature, wind, humidity, precipitation, radiation or pressure on 137 vertical levels from the surface far up into the stratosphere.
The idea of this project is not to improve xbitinfo to be better at compressing datasets such as ERA5 but, conversely, to use xbitinfo to compress (a subset of) ERA5 as well as possible and improve xbitinfo on the way, resolving issues as they appear. In contrast to other projects that are largely issue driven, you'd probably create a lot of issues on the way and hopefully resolve as many as possible. Could we compress ERA5 to 5x, 10x, 20x or maybe even 50x smaller than the original data as it's available on the Copernicus climate data store? Anything might be possible, but you'd also want to not sacrifice speed too much: compressing the entire dataset at 1 MB/s is likely too slow, 100 MB/s would already be much better. And then there's, of course, information that shouldn't unnecessarily be lost on the way. Could you achieve all of this while preserving at least 99% of real information? You see that, in practice, compression is an optimization game between size, speed and information.
- Skills: Background in maths/informatics or a similar field; familiarity with Shell, Python, xarray, zarr and dask is beneficial
- Difficulty: Difficult but also lots of room for creativity
- Project length: 350 hours
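Two of the three quantities in this game, size and speed, are easy to measure. The sketch below shows one way to do so with the standard library only. Everything in it is an assumption for illustration: the data is a synthetic temperature-like series rather than ERA5, zlib stands in for whichever lossless codec is actually used, and `keepbits=7` is an arbitrary choice rather than the result of a bitinformation analysis.

```python
import math
import random
import struct
import time
import zlib


def round_words(words, keepbits):
    """Round float32 bit patterns to `keepbits` mantissa bits (ties away from zero)."""
    drop = 23 - keepbits
    half = 1 << (drop - 1)
    mask = (0xFFFFFFFF << drop) & 0xFFFFFFFF
    return [(w + half) & mask for w in words]


def compression_stats(words):
    """Compression factor and zlib throughput for a list of 32-bit words."""
    raw = struct.pack(f"<{len(words)}I", *words)
    start = time.perf_counter()
    packed = zlib.compress(raw, 6)
    elapsed = max(time.perf_counter() - start, 1e-9)
    return {"factor": len(raw) / len(packed), "mb_per_s": len(raw) / 1e6 / elapsed}


# Synthetic stand-in for a temperature field in kelvin (not ERA5 data).
rng = random.Random(0)
n = 50_000
values = [280.0 + 10.0 * math.sin(i / 200.0) + rng.gauss(0.0, 0.05) for i in range(n)]
words = list(struct.unpack(f"<{n}I", struct.pack(f"<{n}f", *values)))

original = compression_stats(words)
rounded = compression_stats(round_words(words, keepbits=7))  # keepbits is arbitrary here

print(f"original: {original['factor']:.1f}x at {original['mb_per_s']:.0f} MB/s")
print(f"rounded:  {rounded['factor']:.1f}x at {rounded['mb_per_s']:.0f} MB/s")
```

The throughput here covers only the lossless step, not the rounding itself, and pure Python is far slower than the array-based code in xbitinfo, so the absolute numbers say little. The comparison between the two compression factors is the point.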
- Potential mentors: Milan Klöwer (@milankl), Hauke Schulz (@observingClouds)