Non-deterministic dataset creation #16804

Closed
oXwvdrbbj8S4wo9k8lSN opened this issue Mar 30, 2020 · 7 comments

@oXwvdrbbj8S4wo9k8lSN

Describe the bug

Datasets created with make_classification differ slightly between two test machines. Not all columns differ, but some show minor differences. Within unit tests, however, comparisons of algorithm outputs fail whenever the inputs differ.
To verify that there is no fundamental difference between the test machines, a random numpy array was created with the following snippet; the outputs match exactly.

np.random.seed(42)
temp = np.random.random((1000,1000))
print(temp.sum())

Result on both machines:
500334.48617571464

Steps/Code to Reproduce

from sklearn.datasets import make_classification

X_generated, y_generated = make_classification(
    n_samples=57,
    n_features=369,
    random_state=485,
    n_informative=96 + 21,
    n_redundant=73,
    n_repeated=69,
    n_classes=52,
    shuffle=False,
)
print(X_generated.sum())

Expected Results

Exactly the same print-out on both machines.

Actual Results

machine1: -497.68147031410854
machine2: -497.6814703141058

Versions

machine1:

System:
    python: 3.7.6 | packaged by conda-forge | (default, Mar 23 2020, 23:03:20)  [GCC 7.3.0]
executable: /opt/conda/bin/python
   machine: Linux-4.19.76-linuxkit-x86_64-with-debian-buster-sid

Python dependencies:
       pip: 20.0.2
setuptools: 46.0.0.post20200311
   sklearn: 0.22.2.post1
     numpy: 1.18.1
     scipy: 1.4.1
    Cython: 0.29.15
    pandas: 1.0.3
matplotlib: 3.1.3
    joblib: 0.14.1

Built with OpenMP: True

machine2:

System:
    python: 3.7.6 | packaged by conda-forge | (default, Mar  5 2020, 15:27:18)  [GCC 7.3.0]
executable: /opt/conda/bin/python
   machine: Linux-4.15.0-1075-azure-x86_64-with-debian-buster-sid

Python dependencies:
       pip: 20.0.2
setuptools: 46.0.0.post20200311
   sklearn: 0.22.2.post1
     numpy: 1.18.1
     scipy: 1.4.1
    Cython: 0.29.15
    pandas: 1.0.2
matplotlib: 3.1.3
    joblib: 0.14.1

Built with OpenMP: True
@rth
Member

rth commented Mar 30, 2020

Reproducibility is an interesting topic. make_classification uses fairly standard numpy operations, so I don't think the problem is that the random seed isn't being passed somewhere.

I think the most likely source of the difference is the BLAS (conda list | grep blas), which is used when computing the sum, or perhaps the libc versions. Generally, a discrepancy in the 11th significant digit shouldn't matter for practical applications.

It would be interesting to dump both arrays on the two machines (np.save) and compare the relative error element by element, to see where it comes from. It could also be just the sum operation, due to BLAS.
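
For reference, a minimal sketch of such a comparison, assuming each machine has saved its array with np.save under the file names that appear later in this thread:

import numpy as np

# Arrays saved on each machine via np.save("X_generated_machineN.npy", X_generated)
X1 = np.load("X_generated_machine1.npy")
X2 = np.load("X_generated_machine2.npy")

# Element-wise relative error, guarding against division by zero
rel_err = np.abs(X1 - X2) / np.maximum(np.abs(X1), np.finfo(np.float64).tiny)
print("max relative error:", rel_err.max())

# Columns where any element differs, to see which feature types are affected
print("differing columns:", np.where((X1 != X2).any(axis=0))[0])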

@rth
Member

rth commented Mar 30, 2020

Also somewhat related: conda/conda-build#2140

@oXwvdrbbj8S4wo9k8lSN
Author

Here are the saved numpy arrays.

X_generated_machine1.zip
X_generated_machine2.zip

And the outputs from conda list | grep blas:
machine1:

blas                      2.14                   openblas    conda-forge
libblas                   3.8.0               14_openblas    conda-forge
libcblas                  3.8.0               14_openblas    conda-forge
liblapack                 3.8.0               14_openblas    conda-forge
liblapacke                3.8.0               14_openblas    conda-forge
libopenblas               0.3.7                h5ec1e0e_6    conda-forge

machine2:

blas                      2.14                   openblas    conda-forge
libblas                   3.8.0               14_openblas    conda-forge
libcblas                  3.8.0               14_openblas    conda-forge
liblapack                 3.8.0               14_openblas    conda-forge
liblapacke                3.8.0               14_openblas    conda-forge
libopenblas               0.3.7                h5ec1e0e_6    conda-forge

@rth
Member

rth commented Mar 30, 2020

Here are the saved numpy arrays.

>>> import numpy as np
>>> X1 = np.load("X_generated_machine1.npy")
>>> X2 = np.load("X_generated_machine2.npy")
>>> np.abs(X1 - X2).max()
8.526512829121202e-14
>>> (X1 - X2).sum()
-1.3681208943516765e-12
>>> np.abs(X1 - X2).sum()
4.225935573698436e-11
>>> X1.sum()
-497.68147031410854
>>> X2.sum()
-497.6814703141058

So the maximum abs error is 8e-14, and the problem is not with the sum operation.

And the outputs from conda list | grep blas:

Even if it's exactly the same BLAS, different kernels may be used depending on the available CPU flags (cat /proc/cpuinfo | grep flags), e.g. with or without AVX2/AVX-512, and these are only expected to produce the same output up to numerical tolerance,

>>> np.finfo(np.float64).eps
2.220446049250313e-16

We have had issues like this in the past where some tests only failed on AVX-512 machines. If it's not BLAS, it may be something on the RNG side or in libc. Pinning down exactly which operation is responsible would require more investigation, and I'm not sure it's worth it. Why do you need exact reproducibility in this case?
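
In practice this means cross-machine comparisons should allow a small tolerance rather than require bit-for-bit equality; a minimal sketch on the two saved arrays (the rtol/atol values are illustrative, not a recommendation):

import numpy as np

X1 = np.load("X_generated_machine1.npy")
X2 = np.load("X_generated_machine2.npy")

# Passes despite the ~8e-14 maximum absolute difference measured above
np.testing.assert_allclose(X1, X2, rtol=1e-10, atol=1e-12)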

@oXwvdrbbj8S4wo9k8lSN
Author

Why do you need exact reproducibility in this case?

'Need' is maybe a bit exaggerated. I stumbled upon this problem because I run unit tests on a wrapper of this function, in which I modify the outputs. Since we use the produced datasets in other unit tests, I wanted to make sure they are always consistent, so that we don't waste time debugging an algorithm just because its input differed. Consistency is currently checked with a hash value, which is why the checks failed.
So, as you said, this should be sufficient for practical applications, but non-deterministic behavior in places where I did not expect it always makes me a bit nervous. Basically, I just wanted to let you know about it.
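
The hashing scheme itself isn't shown in this issue, so the helper below is only a hypothetical illustration of why a byte-level hash is fragile here, and how rounding before hashing or a tolerance-based check behaves instead:

import hashlib
import numpy as np

def dataset_hash(X, decimals=None):
    # Hypothetical helper: hash the raw bytes, optionally rounding first to absorb
    # tiny FP noise (rounding is not bulletproof for values on a rounding boundary)
    if decimals is not None:
        X = np.round(X, decimals)
    return hashlib.sha256(np.ascontiguousarray(X).tobytes()).hexdigest()

X1 = np.load("X_generated_machine1.npy")
X2 = np.load("X_generated_machine2.npy")

print(dataset_hash(X1) == dataset_hash(X2))         # False: arrays differ in the ~14th digit
print(dataset_hash(X1, 8) == dataset_hash(X2, 8))   # very likely True after rounding
print(np.allclose(X1, X2, rtol=1e-10, atol=1e-12))  # True: tolerance-based check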

@oXwvdrbbj8S4wo9k8lSN
Author

oXwvdrbbj8S4wo9k8lSN commented Apr 1, 2020

I ran a few more tests and found that the problem occurs only with the informative columns, and only once a certain number of them is exceeded. This is only a heuristic, but if someone runs into the same problem: use at most 22 informative features and the results should match. The number of useless features (and likewise the repeated and redundant ones) did not matter in my tests. A sketch of that workaround is below.
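
A sketch of the workaround with the parameters from the original report; note that the 22-feature threshold is only the heuristic described above, not a documented guarantee:

from sklearn.datasets import make_classification

X, y = make_classification(
    n_samples=57,
    n_features=369,
    n_informative=22,   # at or below the heuristic threshold
    n_redundant=73,
    n_repeated=69,
    n_classes=52,
    random_state=485,
    shuffle=False,
)
print(X.sum())  # reported to match across machines when few informative features are used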

@cmarmo
Contributor

cmarmo commented Sep 25, 2020

Thanks @oXwvdrbbj8S4wo9k8lSN for your analysis. I think this will be useful for other users with similar use cases.
However, I'm closing this issue as I can't see how this could be fixed on the scikit-learn side. Feel free to reopen if you think that something more is needed.

@cmarmo cmarmo closed this as completed Sep 25, 2020