It must be a bug, a feature request, or a significant problem with
documentation (for small docs fixes please send a PR instead).
The form below must be filled out.
Here's why we have that policy: TensorFlow Model Analysis developers respond
to issues. We want to focus on work that benefits the whole community, e.g.,
fixing bugs and adding features. Support only helps individuals. GitHub also
notifies thousands of people when issues are filed. We want them to see you
communicating an interesting problem, rather than being redirected to Stack
Overflow.
System information
Have I written custom code (as opposed to using a stock example script
provided in TensorFlow Model Analysis): Yes
OS Platform and Distribution (e.g., Linux Ubuntu 16.04):
TensorFlow Model Analysis installed from (source or binary): from pip
TensorFlow Model Analysis version (use command below): 0.29.0
Python version: 3.7
Jupyter Notebook version:
Exact command to reproduce:
You can obtain the TensorFlow Model Analysis version with
python -c "import tensorflow_model_analysis as tfma; print(tfma.version.VERSION)"
Describe the problem
Recently, I've been upgrading our ML model R&D stack to more recent versions of TF and the TFX libraries. Previously we were using TF==2.3.2 and TFMA==0.26.1. After the upgrade, our TFMA job started failing with the following error when run with DataflowRunner, but not when run with DirectRunner:
DataflowRuntimeException: Dataflow pipeline failed. State: FAILED, Error:
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/dataflow_worker/batchworker.py", line 644, in do_work
work_executor.execute()
File "/usr/local/lib/python3.7/site-packages/dataflow_worker/executor.py", line 208, in execute
op.start()
File "dataflow_worker/shuffle_operations.py", line 63, in dataflow_worker.shuffle_operations.GroupedShuffleReadOperation.start
File "dataflow_worker/shuffle_operations.py", line 64, in dataflow_worker.shuffle_operations.GroupedShuffleReadOperation.start
File "dataflow_worker/shuffle_operations.py", line 79, in dataflow_worker.shuffle_operations.GroupedShuffleReadOperation.start
File "dataflow_worker/shuffle_operations.py", line 80, in dataflow_worker.shuffle_operations.GroupedShuffleReadOperation.start
File "dataflow_worker/shuffle_operations.py", line 84, in dataflow_worker.shuffle_operations.GroupedShuffleReadOperation.start
File "apache_beam/runners/worker/operations.py", line 348, in apache_beam.runners.worker.operations.Operation.output
File "apache_beam/runners/worker/operations.py", line 215, in apache_beam.runners.worker.operations.SingletonConsumerSet.receive
File "dataflow_worker/shuffle_operations.py", line 261, in dataflow_worker.shuffle_operations.BatchGroupAlsoByWindowsOperation.process
File "dataflow_worker/shuffle_operations.py", line 267, in dataflow_worker.shuffle_operations.BatchGroupAlsoByWindowsOperation.process
File "/usr/local/lib/python3.7/site-packages/apache_beam/transforms/trigger.py", line 1324, in process_elements
yield output.with_value(self.phased_combine_fn.apply(output.value))
File "/usr/local/lib/python3.7/site-packages/apache_beam/transforms/combiners.py", line 876, in merge_only
return self.combine_fn.merge_accumulators(accumulators)
File "/usr/local/lib/python3.7/dist-packages/apache_beam/transforms/combiners.py", line 659, in merge_accumulators
a in zip(self._combiners, zip(*accumulators_batch))
File "/usr/local/lib/python3.7/dist-packages/apache_beam/transforms/combiners.py", line 659, in <listcomp>
a in zip(self._combiners, zip(*accumulators_batch))
File "/root/.local/lib/python3.7/site-packages/tensorflow_model_analysis/metrics/tf_metric_wrapper.py", line 562, in merge_accumulators
for metric_index in range(len(self._metrics[output_name])):
TypeError: 'NoneType' object is not subscriptable
Since we skipped multiple releases and I didn't spot anything in the release notes that would cause this, I tried each release in turn to see where the issue first appeared; it was v0.29. On deeper investigation I found this patch, which was not mentioned in the release notes. In it, TFMA moved from calling its own _setup() function to relying on Beam's beam.CombineFn.setup(). However, according to Beam's documentation, beam.CombineFn.setup() is only supported on Dataflow Runner v2, which makes TFMA incompatible with Dataflow Runner v1. If Runner v1 never calls setup(), self._metrics would still be None when merge_accumulators runs, which would explain the TypeError in the traceback. Retrying with use_runner_v2 did indeed fix our issue.
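To make the suspected failure mode concrete, here is a minimal plain-Python sketch. It is not TFMA's or Beam's actual code: `MetricsCombinerSketch` and `merge_like_runner` are hypothetical stand-ins that only mimic the shape of a combiner whose state is built in `setup()`, assuming that one runner calls `setup()` before merging and the other does not.

```python
class MetricsCombinerSketch:
    """Hypothetical stand-in for a combiner that builds its state in setup()."""

    def __init__(self):
        # Stays None unless the runner calls setup() before merging.
        self._metrics = None

    def setup(self):
        # Illustrative state only: output name -> list of metric names.
        self._metrics = {"": ["metric_a", "metric_b"]}

    def merge_accumulators(self, accumulators):
        output_name = ""
        merged = []
        # Same access pattern as the failing line in the traceback.
        for metric_index in range(len(self._metrics[output_name])):
            merged.append(sum(acc[metric_index] for acc in accumulators))
        return merged


def merge_like_runner(combiner, accumulators, calls_setup):
    """Simulates a runner that either honours setup() or skips it."""
    if calls_setup:
        combiner.setup()
    return combiner.merge_accumulators(accumulators)


if __name__ == "__main__":
    accs = [[1, 2], [3, 4]]
    # A runner that calls setup() merges the accumulators normally.
    print(merge_like_runner(MetricsCombinerSketch(), accs, calls_setup=True))
    # A runner that skips setup() hits the TypeError on self._metrics.
    try:
        merge_like_runner(MetricsCombinerSketch(), accs, calls_setup=False)
    except TypeError as err:
        print("TypeError:", err)
```

In this sketch the only difference between the two runs is whether `setup()` was invoked, which matches the behaviour difference reported between the two runners.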
Does this change affect all TFMA runs? If so, TFMA should perhaps detect when a job is running on Dataflow Runner v1 and raise an exception, and the compatibility issue should be documented.
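One possible shape for such a check, sketched under assumptions: the helper name `uses_dataflow_runner_v1` is hypothetical, and it only inspects the flags explicitly passed on the command line (`--runner` and `--experiments`/`--experiment`). It cannot see service-side or SDK defaults, so it is a rough heuristic rather than a definitive detection.

```python
import argparse


def uses_dataflow_runner_v1(pipeline_args):
    """Returns True if the args select DataflowRunner without use_runner_v2.

    Hypothetical helper: looks only at explicitly passed flags.
    """
    parser = argparse.ArgumentParser(add_help=False, allow_abbrev=False)
    parser.add_argument("--runner", default="DirectRunner")
    # The flag may be repeated, and each value may be a comma-separated list.
    parser.add_argument(
        "--experiments", "--experiment",
        dest="experiments", action="append", default=[])
    known, _ = parser.parse_known_args(pipeline_args)

    if known.runner != "DataflowRunner":
        return False
    experiments = {
        item.strip()
        for value in known.experiments
        for item in value.split(",")
    }
    return "use_runner_v2" not in experiments


if __name__ == "__main__":
    args = ["--runner=DataflowRunner", "--project=my-project"]
    if uses_dataflow_runner_v1(args):
        print("Dataflow Runner v1 selected; consider --experiments=use_runner_v2")
```

Passing `--experiments=use_runner_v2` alongside `--runner=DataflowRunner` makes the helper return False, which corresponds to the workaround described above.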
Source code / logs
Include any logs or source code that would be helpful to diagnose the problem.
If including tracebacks, please include the full traceback. Large logs and files
should be attached. Try to provide a reproducible test case that is the bare
minimum necessary to generate the problem.