Releases: thunder-project/thunder
Maintenance release
This is a maintenance release of Thunder.
The main focus is fixing a variety of deployment- and installation-related issues, and adding initial support for the recently released Spark 1.4. Thunder has not yet been extensively used alongside Spark 1.4, but with this release all core functionality has been verified.
Changes and Bug Fixes
- Fix launching error when starting Thunder with Spark 1.4 (addresses #201)
- Fix EC2 deployment with Spark 1.4
- More informative errors for handling import errors on startup
- Remove pylab when starting notebooks on EC2
- Improved dependency handling on EC2
- Updated documentation for factorization methods
Contributions
- Davis Bennet (@d-v-b): doc improvements
- Andrew Giessel (@andrewgiessel): EC2 deployment
- Jeremy Freeman (@freeman-lab): various bug fixes
If you have any questions, come chat with us, and stay tuned for Thunder 0.6.0 in the near future.
Fifth development release
We are pleased to announce the release of Thunder 0.5.0. This release introduces several new features, including a new framework for image registration algorithms, performance improvements for core data conversions, improved EC2 deployment, and many bug fixes. This release requires Spark 1.1.0 or later, and is compatible with the most recent Spark release, 1.3.0.
Major features
- A new image registration API inside the new `thunder.imgprocessing` package. See the tutorial.
- Significant performance improvements to the `Images` to `Series` conversion, including a `Blocks` object as an intermediate stage. The inverse conversion, from `Series` back to `Images`, is now supported.
- Support for tiff image files as an input format has been expanded and made more robust. Multiple image volumes can now be read from a single input file via the `nplanes` argument in the loading functions, and files can be read from nested directory trees using the `recursive=True` flag.
- New methods for working with multi-level indexing on `Series` objects, including `selectByIndex` and `seriesStatByIndex`; see the tutorial.
- Convenient new getter methods for extracting individual records or small sets of records using bracket notation, as in `Series[(x,y,z)]` or `Images[k]`.
- A new `serializable` decorator to make it easy to save/load small objects (e.g. models) to JSON, including handling of numpy arrays. See the saving/loading of `RegistrationModel` for an example.
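The idea behind the `serializable` decorator can be sketched in plain Python. The `Model` class and its `shifts` attribute below are hypothetical, and this toy version only round-trips plain attribute dictionaries; Thunder's actual decorator additionally handles numpy arrays:

```python
import json

def serializable(cls):
    # Class decorator adding simple JSON save/load of instance attributes.
    # A sketch of the concept only, not Thunder's implementation.
    def toJSON(self):
        return json.dumps(self.__dict__)

    @classmethod
    def fromJSON(kls, s):
        # Rebuild an instance without calling __init__, then restore state.
        obj = kls.__new__(kls)
        obj.__dict__.update(json.loads(s))
        return obj

    cls.toJSON = toJSON
    cls.fromJSON = fromJSON
    return cls

@serializable
class Model(object):
    # Hypothetical stand-in for a small model object like RegistrationModel.
    def __init__(self, shifts):
        self.shifts = shifts

m = Model([1.5, -2.0, 0.0])
restored = Model.fromJSON(m.toJSON())
```

The round trip preserves the attribute values, so small fitted models can be written to and read back from plain JSON files.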
Minor features
- Parameter files with a simple JSON schema can be loaded (useful for working with covariates) using `ThunderContext.loadParams`.
- A new method `ThunderContext.setAWSCredentials` handles AWS credential settings in managed cluster environments (where it may not be possible to modify system config files).
- An `Images` object can be saved to a collection of binary files using `Images.saveAsBinaryImages`.
- Data objects now have a consistent `__repr__` method, displaying uniform and informative results when these objects are printed.
- `Images` and `Series` objects now each offer a `meanByRegions()` method, which calculates a mean over one or more regions specified either by a set of indices or a mask image.
- `TimeSeries` has a new `convolve()` method.
- The `thunder` and `thunder-submit` executables have been modified to better expose the options available in the underlying `pyspark` and `spark-submit` Spark executable scripts.
- An improved and streamlined `Colorize` with new colorization options.
- Load data hosted by the Open Connectome Project with the `loadImagesOCP` method.
- New example data sets available, both for local testing and on S3.
- New tutorials: regression, image registration, multi-level indexing.
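As a rough illustration of what a convolution method computes, here is a minimal pure-Python full discrete convolution. Thunder's `convolve()` operates on distributed `TimeSeries` records, and its exact signature and output mode may differ; this is only the underlying arithmetic:

```python
def convolve(signal, kernel):
    # Full discrete convolution of two 1D sequences: each input sample
    # contributes a scaled copy of the kernel, shifted to its position.
    n, m = len(signal), len(kernel)
    out = [0.0] * (n + m - 1)
    for i, x in enumerate(signal):
        for j, k in enumerate(kernel):
            out[i + j] += x * k
    return out

# Convolving with a boxcar kernel [1, 1] sums adjacent samples.
result = convolve([1, 2, 3], [1, 1])  # -> [1, 3, 5, 3]
```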
Transition guide
- Some keyword parameters have been renamed for consistency with the Thunder style guide naming conventions. Examples are the `inputformat`, `startidx`, and `stopidx` parameters on the `ThunderContext` loading methods, which are now `inputFormat`, `startIdx`, and `stopIdx`, respectively. We expect minimal future changes to existing method and parameter names.
- The `Series` methods `normalize()` and `detrend()` have been moved to `TimeSeries` objects, which can be created by the `Series.toTimeSeries()` method.
- The default file extension for the binary `stack` format is now `bin` instead of `stack`. If you need to load files with the `stack` extension, you can use the `ext='stack'` keyword argument of `loadImages`.
- `export` is now a method on the `ThunderContext` instead of a standalone function, and now supports exporting to S3.
- The `loadImagesAsSeries` and `convertImagesToSeries` methods on `ThunderContext` now default to `shuffle=True`, making use of a revised execution path that should improve performance.
- The method for loading example data has been renamed from `loadExampleEC2` to `loadExampleS3`.
Deployment and development
- Anaconda is now the default Python installation on EC2 deployments, as well as on our Travis server for testing.
- EC2 scripts and unit tests provide quieter and prettier status outputs.
- Egg files are now included with official releases, so that a pip install of thunder-python can immediately be deployed on a cluster without cloning the repo and building an egg.
Contributions:
- Andrew Osheroff (data getter improvements)
- Ben Poole (optimized window normalization, image registration)
- Jascha Swisher (images to series conversion, serializable class, tif handling, get and meanBy methods, bug fixes)
- Jason Wittenbach (new series indexing functionality, regression and indexing tutorials, bug fixes)
- Jeremy Freeman (image registration, EC2 deployment, exporting, colorizing, bug fixes)
- Kunal Lillaney (loading from OCP)
- Michael Broxton (serializable class, new series statistics, improved EC2 deployment)
- Noah Young (improved EC2 deployment)
- Tom Sainsbury (image filtering, PNG saving options)
- Uri Laseron (submit scripts, Hadoop versioning)
Roadmap
Moving forward we will do a code freeze and cut a release every three months; the next release will be June 30th.
For 0.6.0 we will focus on the following components:
- A source extraction / segmentation API
- New capabilities for regression and GLM model fitting
- New image registration algorithms (including volumetric methods)
- Latent factor and network models
- Improved performance on single-core workflows
- Bug fixes and performance improvements throughout
If you are interested in contributing, let us know! Check out the existing issues or join us in the chatroom.
Maintenance release
We are happy to announce the 0.4.1 release of Thunder. This is a maintenance / bug fix release.
The focus is ensuring consistent array indexing across all supported input types and internal data formats. For 3D image volumes, the z-plane will now be on the third array axis (e.g. `ary[:,:,2]`), and will be in the same position for `Series` indices and the `dims` attribute on `Images` and `Series` objects. Visualizing image data with matplotlib's `imshow()` function will yield an image in the expected orientation, both for `Images` objects and for the arrays returned by a `Series.pack()` call. Other changes are described below.
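To make the convention concrete, here is a pure-Python sketch using nested lists in place of a numpy array; the hypothetical `zplane` helper plays the role of the `ary[:, :, z]` slice:

```python
# A tiny 2 x 2 x 3 volume as nested lists in (x, y, z) order, mirroring
# the convention that the z-plane sits on the third array axis.
vol = [[[x + 10 * y + 100 * z for z in range(3)]
        for y in range(2)]
       for x in range(2)]

def zplane(vol, z):
    # Equivalent of ary[:, :, z]: fix the third axis, keep x and y.
    return [[voxels[z] for voxels in row] for row in vol]

plane2 = zplane(vol, 2)  # the z = 2 plane, a 2 x 2 slice
```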
Changes and Bug Fixes
- Handling of wildcards in path strings for the local filesystem and S3 is improved.
- New `Data.astype` method for converting the numerical type of values.
- A `dtype` parameter has been added to the `ThunderContext.load*` methods.
- Several exceptions thrown by uncommon edge cases in tif handling code have been resolved.
- The `Series.pack()` method no longer automatically casts returned data to `float16`. This can instead be performed ahead of time using the new `astype` methods.
- `tsc.convertImagesToSeries()` did not previously write output files with tif file input when `shuffle=True`; this has been fixed.
- A `ValueError` thrown by the random sampling methods with numpy 1.9 has been resolved (issue #41).
- The `thunder-ec2` script will now generate a `~/.boto` configuration file containing AWS access keys on all nodes, allowing workers to access S3 with no additional configuration.
- Test example data files are now copied out to all nodes in a cluster as part of the `thunder-ec2` script.
- Now compatible with boto 2.8.0 and later versions, for EC2 deployments (issue #40).
- Fixed a dimension bug when colorizing 2D images with the `indexed` conversion type.
- Fixed an issue with the optimization approach being misspecified in colorization.
Thanks
- Joseph Naegele: reporting path and data type bugs
- Allan Wong: reporting random sampling bug
- Sung Soo Kim: reporting colorization optimization issue
- Thomas Sainsbury: reporting indexed colorization bug
Contributions
- Jascha Swisher (@industrial-sloth): unified indexing schemes, bug fixes
- Jeremy Freeman (@freeman-lab): bug fixes
Thanks very much for your interest in Thunder. Questions and comments can be sent to the mailing list.
Fourth development release
We are pleased to announce the release of Thunder 0.4.0.
This release introduces some major API changes, especially around loading and converting data types. It also brings some substantial updates to the documentation and tutorials, and better support for data sets stored on Amazon S3. While some big changes have been made, we feel that this new architecture provides a more solid foundation for the project, better supporting existing use cases, and encouraging contributions. Please read on for more!
Major Changes
- Data representation. Most data in Thunder now exists as subclasses of the new `thunder.rdds.Data` object. This wraps a PySpark RDD and provides several general convenience methods. Users will typically interact with two main subclasses of data, `thunder.rdds.Images` and `thunder.rdds.Series`, representing spatially- and temporally-oriented data sets, respectively. A common workflow will be to load image data into an `Images` object and then convert it to a `Series` object for further analysis, or just to convert `Images` directly to `Series` data.
- Loading data. The main entry point for most users remains the `thunder.utils.context.ThunderContext` object, available in the interactive shell as `tsc`, but this class has many new, expanded, or renamed methods, in particular `loadImages()`, `loadSeries()`, `loadImagesAsSeries()`, and `convertImagesToSeries()`. Please see the Thunder Context tutorial and the API documentation for more examples and detail.
- New methods for manipulating and processing images and series data, including refactored versions of some earlier analyses (e.g. routines from the package previously known as `timeseries`).
- Documentation has been expanded, and new tutorials have been added.
- Core API components are now exposed at the top level for simpler importing, e.g. `from thunder import Series` or `from thunder import ICA`.
- Improved support for loading image data directly from Amazon S3, using the boto AWS client library. The `load*` methods in `ThunderContext` now all support `s3n://` schema URIs as data path specifiers.
Notes about requirements and environments
- Spark 1.1.0 is required. Most functionality will be intact with earlier versions of Spark, with the exception of loading flat binary data.
- “Hadoop 1” jars as packaged with Spark are recommended, but Thunder should work fine if recompiled against the CDH4, CDH5, or “Hadoop 2” builds.
- Python 2 is required, version 2.6 or greater.
- PIL/pillow libraries are used to handle tif images. We have encountered some issues working with these libraries, particularly on OSX 10.9. Some errors related to image loading may be traceable to a broken PIL/pillow installation.
- This release has been tested most extensively in three environments: local usage, a private research compute cluster, and Amazon EC2 clusters stood up using the thunder-ec2 script packaged with the distribution.
Future Directions
Thunder is still young, and will continue to grow. Now is a great time to get involved! While we will try to minimize changes to the API, it should not yet be considered stable, and may change in upcoming releases. That said, if you are using or contemplating using Thunder in a production environment, please reach out and let us know what you’re working on, or post to the mailing list.
Contributors
Jascha Swisher (@industrial-sloth): loading functionality, data types, AWS compatibility, API
Jeremy Freeman (@freeman-lab): API, data types, analyses, general performance and stability
Maintenance release
This release includes bug fixes and other minor improvements.
Bug fixes
- Removed pillow dependency, to prevent a bug that appears to occur frequently in Mac OS 10.9 installations (87280ec)
- Customized EC2 installation and configuration, to avoid using Anaconda AMI, which was failing to properly configure mounted drives (fixes #21)
Improvements
Maintenance release
Maintenance release with bug fixes and minor improvements.
Bug fixes
- Fixed error specifying path to shell.py in pip installations
- Fixed a broken import that prevented use of Colorize
Improvements
- Query returns average keys as well as average values
- Loading example data from EC2 supports "requester pays" mode
- Fixed documentation typos (#19)
Third development release
This update adds new functionality for loading data, alongside changes to the API for loading, and a variety of smaller bug fixes.
API changes
- All data loading is performed through the new Thunder Context, a thin wrapper for a Spark Context. This context is automatically created when starting Thunder, and has methods for loading data from different input sources. `tsc.loadText` behaves identically to the `load` from previous versions.
- Example data sets can now be loaded from `tsc.makeExample`, `tsc.loadExample`, and `tsc.loadExampleEC2`.
- Output of the `pack` operation now preserves xy definition, but outputs will be transposed relative to previous versions.
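What "transposed relative to previous versions" means for a packed 2D plane can be sketched with plain nested lists, a simplified stand-in for the arrays `pack` actually returns:

```python
def transpose(plane):
    # Transposing a 2D nested list swaps the roles of rows and columns,
    # i.e. the x and y axes of a packed plane.
    return [list(col) for col in zip(*plane)]

old_output = [[1, 2, 3], [4, 5, 6]]   # a 2 x 3 plane
new_output = transpose(old_output)    # the same data, 3 x 2
```

Code that indexed into packed arrays from earlier versions should therefore swap its first two indices.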
New features
- Include design matrix with example data set on EC2
- Faster `nmf` implementation by changing update equation order (#15)
- Support for loading local MAT files into RDDs through `tsc.loadMatLocal`
- Preliminary support for loading binary files from HDFS using `tsc.loadBinary` (depends on features currently only available in Spark's master branch)
Bug fixes
Second development release
This is a significant update with changes and enhancements to the API, new analyses, and bug fixes.
Major changes
- Updated for compatibility with Spark 1.0.0, which brings with it a number of significant performance improvements
- Reorganization of the API such that all analyses are accessed through their respective classes and methods (e.g. `ICA.fit`, `Stats.calc`). Standalone functions use the same classes, and act as wrappers solely for non-interactive job submission (e.g. `thunder-submit factorization/ica <opts>`)
) - Executables included with the release for easily launching a PySpark shell, or an EC2 cluster, with Thunder dependencies and set-up handled automatically
- Improved and expanded documentation, built with Sphinx
- Basic functionality for colorization of results, useful for visualization, see example
- Registered project in PyPi
New analyses and features
- A `DataSet` class for easily loading simulated and real data examples
- A decoding package and `MassUnivariateClassifier` class, currently supporting two mass univariate classification analyses (`GaussNaiveBayes` and `TTest`)
- An `NMF` class for dense non-negative matrix factorization, a useful analysis for spatio-temporal decompositions
Bug fixes and other changes
- Renamed the `sigprocessing` library to `timeseries`
- Replaced `eig` with `eigh` for symmetric matrices
- Use `set` and broadcasting to speed up filtering for subsets in `Query`
- Several optimizations and bug fixes in basic saving functionality, including a new `pack` function
- Fixed handling of integer indices in `subtoind`
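The subscript-to-linear-index conversion that `subtoind` performs can be sketched in a few lines of plain Python. This is a zero-based, column-major sketch of the general idea only, not Thunder's exact implementation, which may differ in index base and axis ordering:

```python
def subtoind(sub, dims):
    # Convert an (x, y, z, ...) subscript tuple into a single linear
    # index, column-major: the first axis varies fastest. Zero-based
    # sketch; a MATLAB-style implementation would be one-based.
    ind, stride = 0, 1
    for s, d in zip(sub, dims):
        ind += s * stride
        stride *= d
    return ind

idx = subtoind((1, 2, 0), (2, 3, 4))  # -> 1 + 2*2 + 0*6 = 5
```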
First development release
First development release, highlighting newly refactored four analysis packages (clustering, factorization, regression, and sigprocessing) and more extensive testing and documentation
Release notes:
General
Preprocessing is an optional argument for all analysis scripts
Tests for accuracy for all analyses
Clustering
Max iterations and tolerance optional arguments for kmeans
Factorization
Unified singular value decomposition into one function with method option ("direct" or "em")
Made max iterations and tolerance optional arguments to ICA
Added a random seed argument to ICA to facilitate testing
Regression
All functions use derivatives of a single RegressionModel or TuningModel class
Allow input to RegressionModel classes to be arrays or tuples for increased flexibility
Made regression-related arguments to tuning optional arguments
Signal processing
All functions use derivatives of a single SigProcessMethod class
Added crosscorr function
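The kind of computation a cross-correlation function performs can be sketched in pure Python as a normalized correlation at a single lag. The signature and normalization below are illustrative assumptions, not Thunder's exact `crosscorr` API:

```python
def crosscorr(x, y, lag):
    # Normalized cross-correlation of two equal-length, non-constant
    # signals at a given lag: shift the alignment by `lag`, then
    # correlate the overlapping samples.
    n = len(x)

    def norm(v):
        # Mean-subtract and scale to unit norm.
        m = sum(v) / float(len(v))
        c = [a - m for a in v]
        s = sum(a * a for a in c) ** 0.5
        return [a / s for a in c]

    if lag >= 0:
        xs, ys = x[lag:], y[:n - lag]
    else:
        xs, ys = x[:n + lag], y[-lag:]
    return sum(a * b for a, b in zip(norm(xs), norm(ys)))

# A signal correlated with itself at zero lag gives 1.0; a delayed
# copy peaks at the matching lag.
r0 = crosscorr([1, 2, 3, 4], [1, 2, 3, 4], 0)
r1 = crosscorr([1, 2, 3, 4, 5], [0, 1, 2, 3, 4], 1)
```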
Thanks to many contributions from @JoshRosen!