Non-deterministic dataset creation #16804

Closed
oXwvdrbbj8S4wo9k8lSN opened this issue Mar 30, 2020 · 7 comments

@oXwvdrbbj8S4wo9k8lSN

Describe the bug

Datasets created with make_classification differ slightly between two test machines. Not all columns differ, but some show minor differences. Within unit tests, however, comparisons of algorithm outputs fail whenever the inputs differ.
To verify that there is no fundamental difference between the test machines, a random numpy array was created with the following snippet; the outputs match exactly.

np.random.seed(42)
temp = np.random.random((1000,1000))
print(temp.sum())

Result on both machines:
500334.48617571464

Steps/Code to Reproduce

from sklearn.datasets import make_classification

X_generated, y_generated = make_classification(
    n_samples=57,
    n_features=369,
    random_state=485,
    n_informative=96 + 21,
    n_redundant=73,
    n_repeated=69,
    n_classes=52,
    shuffle=False,
)
print(X_generated.sum())

Expected Results

Exactly the same print-out on both machines.

Actual Results

machine1: -497.68147031410854
machine2: -497.6814703141058

Versions

machine1:

System:
    python: 3.7.6 | packaged by conda-forge | (default, Mar 23 2020, 23:03:20)  [GCC 7.3.0]
executable: /opt/conda/bin/python
   machine: Linux-4.19.76-linuxkit-x86_64-with-debian-buster-sid

Python dependencies:
       pip: 20.0.2
setuptools: 46.0.0.post20200311
   sklearn: 0.22.2.post1
     numpy: 1.18.1
     scipy: 1.4.1
    Cython: 0.29.15
    pandas: 1.0.3
matplotlib: 3.1.3
    joblib: 0.14.1

Built with OpenMP: True

machine2:

System:
    python: 3.7.6 | packaged by conda-forge | (default, Mar  5 2020, 15:27:18)  [GCC 7.3.0]
executable: /opt/conda/bin/python
   machine: Linux-4.15.0-1075-azure-x86_64-with-debian-buster-sid

Python dependencies:
       pip: 20.0.2
setuptools: 46.0.0.post20200311
   sklearn: 0.22.2.post1
     numpy: 1.18.1
     scipy: 1.4.1
    Cython: 0.29.15
    pandas: 1.0.2
matplotlib: 3.1.3
    joblib: 0.14.1

Built with OpenMP: True
@rth
Member

rth commented Mar 30, 2020

Reproducibility is an interesting topic. make_classification uses fairly standard numpy operations, so I don't think the problem is that the random seed isn't being passed somewhere.

I think the most likely source of the difference is the BLAS (conda list | grep blas), which is used when computing the sum, or perhaps the libc versions. Generally, a discrepancy in the 11th significant digit shouldn't matter for practical applications.

It would be interesting to dump both arrays on the two machines (np.save) and compare the relative error element by element, to see where it comes from. It could also be just the sum operation, due to BLAS.
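
For reference, a minimal sketch of such a comparison, assuming each machine has saved its array with np.save under the file names that appear later in this thread:

import numpy as np

# Arrays saved on each machine via np.save("X_generated_machineN.npy", X_generated)
X1 = np.load("X_generated_machine1.npy")
X2 = np.load("X_generated_machine2.npy")

# Element-wise relative error, guarding against division by zero
rel_err = np.abs(X1 - X2) / np.maximum(np.abs(X1), np.finfo(np.float64).tiny)
print("max relative error:", rel_err.max())

# Columns where any element differs, to see which feature types are affected
print("differing columns:", np.where((X1 != X2).any(axis=0))[0])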

@rth
Member

rth commented Mar 30, 2020

Also somewhat related: conda/conda-build#2140

@oXwvdrbbj8S4wo9k8lSN
Author

Here are the saved numpy arrays.

X_generated_machine1.zip
X_generated_machine2.zip

And the outputs from conda list | grep blas:
machine1:

blas                      2.14                   openblas    conda-forge
libblas                   3.8.0               14_openblas    conda-forge
libcblas                  3.8.0               14_openblas    conda-forge
liblapack                 3.8.0               14_openblas    conda-forge
liblapacke                3.8.0               14_openblas    conda-forge
libopenblas               0.3.7                h5ec1e0e_6    conda-forge

machine2:

blas                      2.14                   openblas    conda-forge
libblas                   3.8.0               14_openblas    conda-forge
libcblas                  3.8.0               14_openblas    conda-forge
liblapack                 3.8.0               14_openblas    conda-forge
liblapacke                3.8.0               14_openblas    conda-forge
libopenblas               0.3.7                h5ec1e0e_6    conda-forge

@rth
Member

rth commented Mar 30, 2020

Here are the saved numpy arrays.

>>> import numpy as np
>>> X1 = np.load("X_generated_machine1.npy")
>>> X2 = np.load("X_generated_machine2.npy")
>>> np.abs(X1 - X2).max()
8.526512829121202e-14
>>> (X1 - X2).sum()
-1.3681208943516765e-12
>>> np.abs(X1 - X2).sum()
4.225935573698436e-11
>>> X1.sum()
-497.68147031410854
>>> X2.sum()
-497.6814703141058

So the maximum abs error is 8e-14, and the problem is not with the sum operation.

And the outputs from conda list | grep blas:

Even if it's exactly the same BLAS, different kernels may be used depending on the available CPU flags (cat /proc/cpuinfo | grep flags), e.g. with or without AVX2/AVX-512, and these are only expected to produce the same output up to numerical tolerance,

>>> np.finfo(np.float64).eps
2.220446049250313e-16

We have had issues like this in the past where some tests only failed on AVX-512 machines. If it's not BLAS, it may be something on the RNG side or in libc. Pinning down exactly which operation is responsible would require more investigation, and I'm not sure it's worth it. Why do you need exact reproducibility in this case?
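
In practice this means cross-machine comparisons should allow a small tolerance rather than require bit-for-bit equality; a minimal sketch on the two saved arrays (the rtol/atol values are illustrative, not a recommendation):

import numpy as np

X1 = np.load("X_generated_machine1.npy")
X2 = np.load("X_generated_machine2.npy")

# Passes despite the ~8e-14 maximum absolute difference measured above
np.testing.assert_allclose(X1, X2, rtol=1e-10, atol=1e-12)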

@oXwvdrbbj8S4wo9k8lSN
Author

Why do you need exact reproducibility in this case?

'Need' is maybe a bit exaggerated. I stumbled upon this problem because I run unit tests on a wrapper of this function, in which I modify the outputs. Since we use the produced datasets in other unit tests, I wanted to make sure they are always consistent, so that we don't waste time debugging an algorithm just because its input differed. Consistency is currently checked with a hash value, which is why the checks failed.
So, as you said, this should be sufficient for practical applications, but non-deterministic behavior in places where I did not expect it always makes me a bit nervous. Basically, I just wanted to let you know about it.
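
The hashing scheme itself isn't shown in this issue, so the helper below is only a hypothetical illustration of why a byte-level hash is fragile here, and how rounding before hashing or a tolerance-based check behaves instead:

import hashlib
import numpy as np

def dataset_hash(X, decimals=None):
    # Hypothetical helper: hash the raw bytes, optionally rounding first to absorb
    # tiny FP noise (rounding is not bulletproof for values on a rounding boundary)
    if decimals is not None:
        X = np.round(X, decimals)
    return hashlib.sha256(np.ascontiguousarray(X).tobytes()).hexdigest()

X1 = np.load("X_generated_machine1.npy")
X2 = np.load("X_generated_machine2.npy")

print(dataset_hash(X1) == dataset_hash(X2))         # False: arrays differ in the ~14th digit
print(dataset_hash(X1, 8) == dataset_hash(X2, 8))   # very likely True after rounding
print(np.allclose(X1, X2, rtol=1e-10, atol=1e-12))  # True: tolerance-based check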

@oXwvdrbbj8S4wo9k8lSN
Author

oXwvdrbbj8S4wo9k8lSN commented Apr 1, 2020

I ran a few more tests and found that the problem occurs only with the informative columns, and only once a certain number of them is exceeded. This is only a heuristic, but if someone runs into the same problem: use at most 22 informative features and the results should match. The number of useless features (and likewise the repeated and redundant ones) did not matter in my tests. A sketch of that workaround is below.
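
A sketch of the workaround with the parameters from the original report; note that the 22-feature threshold is only the heuristic described above, not a documented guarantee:

from sklearn.datasets import make_classification

X, y = make_classification(
    n_samples=57,
    n_features=369,
    n_informative=22,   # at or below the heuristic threshold
    n_redundant=73,
    n_repeated=69,
    n_classes=52,
    random_state=485,
    shuffle=False,
)
print(X.sum())  # reported to match across machines when few informative features are used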

@cmarmo
Contributor

cmarmo commented Sep 25, 2020

Thanks @oXwvdrbbj8S4wo9k8lSN for your analysis. I think this will be useful for other users with similar use cases.
However, I'm closing this issue as I can't see how this could be fixed on the scikit-learn side. Feel free to reopen if you think that something more is needed.

@cmarmo cmarmo closed this as completed Sep 25, 2020