
Merge pull request #265 from elfi-dev/dev
Release 0.7.1
vuolleko authored Apr 12, 2018
2 parents 1476806 + 79c7e9c commit b5807b1
Showing 15 changed files with 188 additions and 55 deletions.
7 changes: 4 additions & 3 deletions .travis.yml
Original file line number Diff line number Diff line change
@@ -12,9 +12,10 @@ matrix:
- mkdir -p /Users/travis/.matplotlib
- "echo 'backend: TkAgg' > /Users/travis/.matplotlib/matplotlibrc"
- brew update
- brew install python3
- virtualenv env -p python3
- source env/bin/activate
- brew upgrade python
- pip3 install virtualenv
- virtualenv py3env -p python3
- source py3env/bin/activate

cache: pip

10 changes: 9 additions & 1 deletion CHANGELOG.rst
@@ -1,9 +1,17 @@
Changelog
=========

0.7.1 (2018-04-11)
------------------
- Implemented model selection (elfi.compare_models). See API documentation.
- Fix threshold=0 in rejection sampling
- Set default batch_size to 1 in ParameterInference base class

0.7 (2017-11-30)
----------------

- Added new example: the stochastic Lotka-Volterra model
- Fix methods.bo.utils.minimize to be strictly within bounds
- Implemented the Two Stage Procedure, a method of summary-statistics diagnostics
- Added the MaxVar acquisition method
- Added the RandMaxVar acquisition method
- Added the ExpIntVar acquisition method
4 changes: 3 additions & 1 deletion README.md
@@ -1,4 +1,4 @@
**Version 0.7 released!** See the CHANGELOG and [notebooks](https://github.com/elfi-dev/notebooks).
**Version 0.7.1 released!** See the CHANGELOG and [notebooks](https://github.com/elfi-dev/notebooks).

**NOTE:** For the time being NetworkX 2 is incompatible with ELFI.

@@ -28,6 +28,8 @@ Other notable included algorithms and methods:
- Bayesian Optimization
- [No-U-Turn-Sampler](http://jmlr.org/papers/volume15/hoffman14a/hoffman14a.pdf), a Hamiltonian Monte Carlo MCMC sampler

ELFI also integrates tools for visualization, model comparison, diagnostics and post-processing.

See examples under [notebooks](https://github.com/elfi-dev/notebooks) to get started. Full
documentation can be found at http://elfi.readthedocs.io/. Limited user-support may be
asked from elfi-support.at.hiit.fi, but the
7 changes: 7 additions & 0 deletions docs/api.rst
@@ -265,6 +265,13 @@ Inference API classes
:members:
:inherited-members:

**Model selection**

.. currentmodule:: .

.. autofunction:: elfi.compare_models


Other
.....

28 changes: 28 additions & 0 deletions docs/faq.rst
@@ -10,3 +10,31 @@ produces outputs from the interval (1, 3).*
their definitions. There the uniform distribution uses the location/scale definition, so
the first argument defines the starting point of the interval and the second its length.

.. _vectorization:

*Q: What is vectorization in ELFI?*

**A**: Looping is relatively inefficient in Python, so whenever possible you should *vectorize*
your operations_. This means that repetitive computations are performed on a batch of data using
precompiled libraries (typically NumPy_), which effectively runs the loops in faster, compiled C code.
ELFI supports vectorized operations, and given the potentially huge savings in CPU time, it is
recommended to vectorize all user code whenever possible.

.. _operations: good-to-know.html#operations
.. _NumPy: http://www.numpy.org/

For example, imagine you have a simulator that depends on a scalar parameter and produces a vector of 5
values. When this is used in ELFI with ``batch_size`` set to 1000, ELFI draws 1000 values from the
parameter's prior distribution and gives this *vector* to the simulator. Ideally, the simulator should
efficiently process all 1000 parameter cases in one go and output an array of shape (1000, 5). When using
vectorized operations in ELFI, the length (i.e. the first dimension) of all output arrays should equal
``batch_size``. Note that because of this the evaluation of summary statistics, distances etc. should
bypass the first dimension (e.g. with NumPy functions using ``axis=1`` in this case).

See ``elfi.examples`` for tips on how to vectorize simulators and work with ELFI. In case you are
unable to vectorize your simulator, you can use `elfi.tools.vectorize`_ to mimic
vectorized behaviour, though without the performance benefits. Finally, for simplicity vectorization
is not assumed (``batch_size=1`` by default).

.. _`elfi.tools.vectorize`: api.html#elfi.tools.vectorize
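For concreteness, here is a minimal sketch of a vectorized simulator and summary statistic following the convention described above. The simulator, its 5-value output, and all names are hypothetical, mirroring the example in the text; the ``batch_size`` and ``random_state`` keyword arguments follow the operation convention documented by ELFI:

```python
import numpy as np

def simulator(mu, batch_size=1, random_state=None):
    # Vectorized: `mu` arrives as a vector of `batch_size` prior draws.
    random_state = random_state or np.random
    mu = np.asarray(mu).reshape(batch_size, 1)
    # A single call produces the whole (batch_size, 5) output array.
    return mu + random_state.normal(size=(batch_size, 5))

def mean_summary(y):
    # Summaries bypass the batch (first) dimension, hence axis=1.
    return np.mean(y, axis=1)

y = simulator(np.zeros(1000), batch_size=1000)  # shape (1000, 5)
s = mean_summary(y)                             # shape (1000,)
```

If a simulator cannot be vectorized, wrapping it with ``elfi.tools.vectorize`` (linked above) provides the same batched interface by looping internally, without the speed benefit.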

2 changes: 2 additions & 0 deletions docs/index.rst
@@ -34,6 +34,8 @@ ELFI also has the following non LFI methods:

.. _No-U-Turn-Sampler: http://jmlr.org/papers/volume15/hoffman14a/hoffman14a.pdf

Additionally, ELFI integrates tools for visualization, model comparison, diagnostics and post-processing.


.. toctree::
:maxdepth: 1
66 changes: 33 additions & 33 deletions docs/usage/tutorial.rst
@@ -18,10 +18,9 @@ settings.
import numpy as np
import scipy.stats
import matplotlib
import matplotlib.pyplot as plt
import logging
logging.basicConfig(level=logging.INFO)
logging.basicConfig(level=logging.INFO) # sometimes this is required to enable logging inside Jupyter
%matplotlib inline
%precision 2
@@ -251,7 +250,7 @@ a DAG.



.. note:: You will need the Graphviz_ software as well as the graphviz `Python package`_ (https://pypi.python.org/pypi/graphviz) for drawing this. The software is already installed in many unix-like OS.
.. note:: You will need the Graphviz_ software as well as the graphviz `Python package`_ (https://pypi.python.org/pypi/graphviz) for drawing this.

.. _Graphviz: http://www.graphviz.org
.. _`Python package`: https://pypi.python.org/pypi/graphviz
@@ -396,8 +395,8 @@ time is spent in drawing.

.. parsed-literal::
CPU times: user 2.28 s, sys: 165 ms, total: 2.45 s
Wall time: 2.45 s
CPU times: user 1.6 s, sys: 166 ms, total: 1.77 s
Wall time: 1.76 s
The ``sample`` method returns a ``Sample`` object, which contains
@@ -452,8 +451,8 @@ as long as it takes to generate the requested number of samples.
.. parsed-literal::
CPU times: user 222 ms, sys: 40.3 ms, total: 263 ms
Wall time: 261 ms
CPU times: user 198 ms, sys: 35.5 ms, total: 233 ms
Wall time: 231 ms
Method: Rejection
Number of samples: 1000
Number of simulations: 40000
@@ -497,9 +496,9 @@ been reached or a maximum of one second of time has been used.
Method: Rejection
Number of samples: 1000
Number of simulations: 190000
Threshold: 0.0855
Sample means: t1: 0.561, t2: 0.218
Number of simulations: 180000
Threshold: 0.088
Sample means: t1: 0.561, t2: 0.221
@@ -547,8 +546,8 @@ in our model:
.. parsed-literal::
CPU times: user 5.26 s, sys: 37.1 ms, total: 5.3 s
Wall time: 5.3 s
CPU times: user 5.01 s, sys: 60.9 ms, total: 5.07 s
Wall time: 5.09 s
@@ -558,8 +557,8 @@ in our model:
Method: Rejection
Number of samples: 1000
Number of simulations: 1000000
Threshold: 0.036
Sample means: t1: 0.561, t2: 0.227
Threshold: 0.0363
Sample means: t1: 0.554, t2: 0.216
@@ -580,8 +579,8 @@ anything. Let's do that.
.. parsed-literal::
CPU times: user 636 ms, sys: 1.35 ms, total: 638 ms
Wall time: 638 ms
CPU times: user 423 ms, sys: 3.35 ms, total: 426 ms
Wall time: 429 ms
@@ -591,8 +590,8 @@ anything. Let's do that.
Method: Rejection
Number of samples: 1000
Number of simulations: 1000000
Threshold: 0.0452
Sample means: t1: 0.56, t2: 0.228
Threshold: 0.0457
Sample means: t1: 0.55, t2: 0.216
@@ -610,8 +609,8 @@ simulations and only have to simulate the new ones:
.. parsed-literal::
CPU times: user 1.72 s, sys: 10.6 ms, total: 1.73 s
Wall time: 1.73 s
CPU times: user 1.44 s, sys: 17.9 ms, total: 1.46 s
Wall time: 1.47 s
@@ -621,8 +620,8 @@ simulations and only have to simulate the new ones:
Method: Rejection
Number of samples: 1000
Number of simulations: 1200000
Threshold: 0.0417
Sample means: t1: 0.561, t2: 0.225
Threshold: 0.0415
Sample means: t1: 0.55, t2: 0.215
@@ -640,8 +639,8 @@ standard numpy .npy files:
.. parsed-literal::
CPU times: user 25.8 ms, sys: 3.27 ms, total: 29 ms
Wall time: 28.5 ms
CPU times: user 28.7 ms, sys: 4.5 ms, total: 33.2 ms
Wall time: 33.4 ms
This stores the simulated data in binary ``npy`` format under
@@ -658,7 +657,7 @@ This stores the simulated data in binary ``npy`` format under
.. parsed-literal::
Files in pools/arraypool_3521077242 are ['d.npy', 't1.npy', 't2.npy', 'Y.npy']
Files in pools/arraypool_3375867934 are ['d.npy', 't1.npy', 't2.npy', 'Y.npy']
Now let's load all the parameters ``t1`` that were generated with numpy:
@@ -672,7 +671,7 @@ Now lets load all the parameters ``t1`` that were generated with numpy:
.. parsed-literal::
array([ 0.79, -0.01, -1.47, ..., 0.98, 0.18, 0.5 ])
array([ 0.36, 0.47, -1.66, ..., 0.09, 0.45, 0.2 ])
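Because the pool stores each node's batches as standard ``.npy`` files, they can also be inspected with plain NumPy outside ELFI. A small self-contained sketch, where a temporary directory stands in for the pool path shown above:

```python
import os
import tempfile
import numpy as np

# Write a mock batch the way ArrayPool stores it, then read it back.
pool_dir = tempfile.mkdtemp()
path = os.path.join(pool_dir, 't1.npy')
np.save(path, np.random.randn(1000))
t1 = np.load(path)  # shape (1000,)
```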
@@ -687,7 +686,7 @@ We can also close (or save) the whole pool if we wish to continue later:
.. parsed-literal::
arraypool_3521077242
arraypool_3375867934
And open it up later to continue where we left off. We can open it
@@ -718,12 +717,12 @@ You can delete the files with:
os.listdir(arraypool.path)
except FileNotFoundError:
print("The directry is removed")
print("The directory is removed")
.. parsed-literal::
The directry is removed
The directory is removed
Visualizing the results
@@ -820,8 +819,9 @@ sampler:
smc = elfi.SMC(d, batch_size=10000, seed=seed)
For sampling, one has to define the number of output samples, the number
of populations and a *schedule* i.e. a list of quantiles to use for each
population. In essence, a population is just refined rejection sampling.
of populations and a *schedule* i.e. a list of thresholds to use for
each population. In essence, a population is just refined rejection
sampling.

.. code:: ipython3
@@ -839,8 +839,8 @@ population. In essence, a population is just refined rejection sampling.
.. parsed-literal::
CPU times: user 1.72 s, sys: 154 ms, total: 1.87 s
Wall time: 1.56 s
CPU times: user 1.6 s, sys: 156 ms, total: 1.75 s
Wall time: 1.38 s
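The elided sampling cell boils down to choosing a decreasing list of thresholds, one per SMC population. A hedged sketch (the ``sample`` call mirrors the tutorial's usage; since ``smc`` is only constructed above, the call itself is left commented out):

```python
# One threshold per SMC population, tightened from one population to the next.
schedule = [0.7, 0.2, 0.05]

# With the elfi.SMC instance built above, sampling would then be:
# result_smc = smc.sample(1000, schedule)

# Each population refines the previous one, so the thresholds must decrease.
assert all(a > b for a, b in zip(schedule, schedule[1:]))
```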
We can have summaries and plots of the results just like above:
3 changes: 2 additions & 1 deletion elfi/__init__.py
@@ -12,6 +12,7 @@
import elfi.model.tools as tools
from elfi.client import get_client, set_client
from elfi.methods.diagnostics import TwoStageSelection
from elfi.methods.model_selection import *
from elfi.methods.parameter_inference import *
from elfi.methods.post_processing import adjust_posterior
from elfi.model.elfi_model import *
@@ -24,4 +25,4 @@
__email__ = 'elfi-support@hiit.fi'

# make sure __version_ is on the last non-empty line (read by setup.py)
__version__ = '0.7'
__version__ = '0.7.1'
59 changes: 59 additions & 0 deletions elfi/methods/model_selection.py
@@ -0,0 +1,59 @@
"""This module contains methods for model comparison and selection."""

import numpy as np


def compare_models(sample_objs, model_priors=None):
    """Find posterior probabilities for different models.

    The algorithm requires elfi.Sample objects from prerun inference methods. For example,
    the output from elfi.Rejection.sample is valid. The portion of samples for each model
    among the top discrepancies is adjusted by each model's acceptance ratio and prior
    probability.

    The discrepancies (including summary statistics) must be comparable so that it is
    meaningful to sort them!

    Parameters
    ----------
    sample_objs : list of elfi.Sample
        Resulting Sample objects from prerun inference models. The objects must include
        a valid `discrepancies` attribute.
    model_priors : array_like, optional
        Prior probability of each model. Defaults to 1 / n_models.

    Returns
    -------
    np.array
        Posterior probabilities for the considered models.

    """
    n_models = len(sample_objs)
    n_min = min([s.n_samples for s in sample_objs])

    # concatenate discrepancy vectors
    try:
        discrepancies = np.concatenate([s.discrepancies for s in sample_objs])
    except ValueError:
        raise ValueError("All Sample objects must include valid discrepancies.")

    # sort and take the smallest n_min
    inds = np.argsort(discrepancies)[:n_min]

    # calculate the portions of accepted samples for each model in the top discrepancies
    p_models = np.empty(n_models)
    up_bound = 0
    for i in range(n_models):
        low_bound = up_bound
        up_bound += sample_objs[i].n_samples
        p_models[i] = np.logical_and(inds >= low_bound, inds < up_bound).sum()

        # adjust by the number of simulations run
        p_models[i] /= sample_objs[i].n_sim

        # adjust by the prior model probability
        if model_priors is not None:
            p_models[i] *= model_priors[i]

    p_models = p_models / p_models.sum()

    return p_models
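To see the weighting logic of ``compare_models`` in action, here is a self-contained NumPy re-run of the same steps on two mock discrepancy vectors (all numbers are made up for illustration):

```python
import numpy as np

# Mock results of two prerun rejection runs: accepted discrepancies
# and the number of simulations each run used.
disc = [np.array([0.1, 0.2, 0.3, 0.4]),     # "model A"
        np.array([0.15, 0.25, 0.5, 0.6])]   # "model B"
n_sim = [1000, 2000]

n_min = min(len(d) for d in disc)
all_disc = np.concatenate(disc)
inds = np.argsort(all_disc)[:n_min]          # indices of the n_min smallest

p_models = np.empty(len(disc))
up_bound = 0
for i, d in enumerate(disc):
    low_bound, up_bound = up_bound, up_bound + len(d)
    # portion of model i's samples among the top discrepancies,
    # adjusted by the number of simulations the model needed
    p_models[i] = np.logical_and(inds >= low_bound, inds < up_bound).sum()
    p_models[i] /= n_sim[i]

p_models = p_models / p_models.sum()         # -> approx. [0.667, 0.333]
```

Dividing by ``n_sim`` corrects for the fact that a model whose rejection run needed more simulations to accept the same number of samples has a lower acceptance ratio, and so deserves less posterior weight.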