Non-deterministic dataset creation #16804
Reproducibility is an interesting topic. I think the most likely difference lies in BLAS. It would be interesting to dump both arrays on these two machines and compare them; a sketch of what that could look like follows.
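This is not from the original thread, just a minimal sketch of dumping the generated arrays with numpy and comparing them afterwards (the file names and make_classification arguments are placeholders):

import numpy as np
from sklearn.datasets import make_classification

# On each machine: generate the dataset with the same seed and dump it to disk.
X, y = make_classification(random_state=42)
np.save("X_generated_machineN.npy", X)  # "machineN" is a placeholder file name

# On one machine: load both dumps and compare them element-wise.
X1 = np.load("X_generated_machine1.npy")
X2 = np.load("X_generated_machine2.npy")
print(np.abs(X1 - X2).max())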
Also somewhat related: conda/conda-build#2140
Here are the saved numpy arrays: X_generated_machine1.zip
And the outputs from machine2:
>>> import numpy as np
>>> X1 = np.load("X_generated_machine1.npy")
>>> X2 = np.load("X_generated_machine2.npy")
>>> np.abs(X1 - X2).max()
8.526512829121202e-14
>>> (X1 - X2).sum()
-1.3681208943516765e-12
>>> np.abs(X1 - X2).sum()
4.225935573698436e-11
>>> X1.sum()
-497.68147031410854
>>> X2.sum()
-497.6814703141058

So the maximum abs error is 8e-14, and the problem is not with the sum operation.
Even if it's exactly the same BLAS, a different implementation may be used depending on the available CPU flags; we had issues like these in the past where some tests only failed on AVX512 machines. If it's not BLAS, maybe it's something on the RNG generation side or in libc. As to which operation exactly is responsible here, that would require more investigation, and I'm not sure it's worth it. Why do you need exact reproducibility in this case?
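As an aside (not part of the original comment), one way to check which BLAS a numpy build is linked against, and which threading libraries are loaded at runtime, might be the following; threadpoolctl is an optional dependency and may not be installed:

import numpy as np

# Build-time BLAS/LAPACK information.
np.show_config()

# Runtime view of the loaded BLAS/OpenMP libraries, if threadpoolctl is available.
try:
    from threadpoolctl import threadpool_info
    for lib in threadpool_info():
        print(lib.get("internal_api"), lib.get("version"), lib.get("filepath"))
except ImportError:
    pass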
'Need' is maybe a bit exaggerated. I stumbled upon this problem because I run unit tests on a wrapper of this function, in which I modify the outputs. Since we use the produced datasets for other unit tests, I wanted to make sure they are always consistent, so as not to waste time debugging an algorithm just because the input differed. The consistency is currently checked using a hash value, and that is why the checks failed (a sketch of such a check is shown below).
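This is not the actual test code, only an illustration of why a hash-based check is sensitive to these tiny differences while a tolerance-based comparison would not be (the dataset parameters are made up):

import hashlib
import numpy as np
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=100, n_features=20, random_state=0)

# A bit-for-bit hash: any change in the last bits of a float changes the digest.
digest = hashlib.sha256(X.tobytes()).hexdigest()
print(digest)

# A tolerance-based check would absorb differences of the order of 1e-14 instead:
# np.testing.assert_allclose(X, X_reference, rtol=1e-7)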
I ran a few more tests and found out that the problem occurs only with the informative columns, and only if a certain number of them is exceeded. This is only a heuristic, but if someone has the same problem: use at most 22 informative features and the results should match. The number of useless features (including repeated and redundant ones) did not matter in my tests. A sketch of that workaround follows.
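A sketch of the workaround described above; the threshold of 22 is the heuristic from that comment, not a documented guarantee, and all other parameters here are arbitrary:

from sklearn.datasets import make_classification

# Keep n_informative at or below 22; redundant, repeated and useless features
# did not affect reproducibility in the reported tests.
X, y = make_classification(
    n_samples=1000,
    n_features=50,
    n_informative=22,
    n_redundant=10,
    n_repeated=5,
    random_state=42,
)
print(X.sum())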
Thanks @oXwvdrbbj8S4wo9k8lSN for your analysis. I think this will be useful for other users with similar use cases.
Describe the bug
Datasets created using make_classification differ slightly between two test machines. Not all columns differ, but some show minor differences. Within unit tests, comparisons of algorithm outputs will fail if the inputs differ.
To make sure that no fundamental difference exists between the test machines, a random numpy array was created and summed (a sketch of such a check is shown below). The outputs match exactly.
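The original snippet is not preserved in this copy of the issue; a sanity check of this kind could look roughly like the sketch below. The array shape and seed are assumptions, so the number reported further down comes from the original snippet, not necessarily from this one:

import numpy as np

# Plain RNG output with no linear algebra involved; this matches across machines.
rng = np.random.RandomState(42)
arr = rng.uniform(size=(1000, 1000))
print(arr.sum())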
result on both machines:
500334.48617571464
Steps/Code to Reproduce
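The reproduction code is also missing from this copy of the issue. Based on the discussion above, a sketch along these lines (all parameters are assumptions, apart from n_informative being above the ~22 threshold) should exercise the same path:

from sklearn.datasets import make_classification

X, y = make_classification(
    n_samples=1000,
    n_features=50,
    n_informative=30,  # above the ~22 informative features where differences appeared
    random_state=42,
)
print(X.sum())  # compare this print-out between the two machines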
Expected Results
Exactly the same print-outs on both machines.
Actual Results
machine1: -497.68147031410854
machine2: -497.6814703141058
Versions
machine1:
machine2: