Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

viztracer seems to be incompatible with some python packages #259

Closed
CWHer opened this issue Aug 3, 2022 · 7 comments · Fixed by #261
Closed

viztracer seems to be incompatible with some python packages #259

CWHer opened this issue Aug 3, 2022 · 7 comments · Fixed by #261
Labels
bug Something isn't working

Comments

@CWHer
Copy link

CWHer commented Aug 3, 2022

I encountered this weird problem when I was using viztracer to profile reverb.

My python version is Python 3.9.12 and OS version is Ubuntu 20.04.4 LTS.

The exception details are as follows,

(acme) xxx@xxx:xxx$ viztracer issue-code.py 
...
[code concerned output]
...
Loading trace data from processes 0/1Traceback (most recent call last):         
  File "/anaconda/envs/acme/lib/python3.9/multiprocessing/util.py", line 300, in _run_finalizers
    finalizer()
  File "/anaconda/envs/acme/lib/python3.9/multiprocessing/util.py", line 224, in __call__
    res = self._callback(*self._args, **self._kwargs)
  File "/anaconda/envs/acme/lib/python3.9/site-packages/viztracer/main.py", line 586, in exit_routine
    self.save()
  File "/anaconda/envs/acme/lib/python3.9/site-packages/viztracer/main.py", line 567, in save
    builder.save(output_file=ofile)
  File "/anaconda/envs/acme/lib/python3.9/site-packages/viztracer/report_builder.py", line 180, in save
    self.generate_report(f, output_format="json", file_info=file_info)
  File "/anaconda/envs/acme/lib/python3.9/site-packages/viztracer/report_builder.py", line 162, in generate_report
    self.prepare_json(file_info=file_info)
  File "/anaconda/envs/acme/lib/python3.9/site-packages/viztracer/report_builder.py", line 106, in prepare_json
    self.load_jsons()
  File "/anaconda/envs/acme/lib/python3.9/site-packages/viztracer/report_builder.py", line 72, in load_jsons
    self.jsons.append(get_json(j))
  File "/anaconda/envs/acme/lib/python3.9/site-packages/viztracer/report_builder.py", line 27, in get_json
    json_str = f.read()
  File "/anaconda/envs/acme/lib/python3.9/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position 40682979: invalid start byte
[reverb/cc/platform/default/server.cc:84] Shutting down replay server

And the following is the issue code,

import time
from functools import wraps

import reverb


def timeLog(func):
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.time()
        result = func(*args, **kwargs)
        end = time.time()
        print(f"function: {func.__name__}, time: {end -start:.2f}s ")
        return result
    return wrapper


@timeLog
def runProducer(port, n_sample):
    client = reverb.Client(f'localhost:{port}')
    print(client.server_info())
    print("===Producer started===")
    for i in range(n_sample):
        # Creates a single item and data element [0, 1].
        client.insert(i, priorities={'my_table': 1.0})
    print("===Producer ended===")


@timeLog
def runConsumer(port, n_sample):
    client = reverb.Client(f'localhost:{port}')
    print(client.server_info())
    print("===Consumer started===")
    for i in range(n_sample):
        # Creates a single item and data element [0, 1].
        list(client.sample('my_table', num_samples=256))
    print("===Consumer ended===")


if __name__ == "__main__":
    server = reverb.Server(
        tables=[reverb.Table(
            name='my_table',
            sampler=reverb.selectors.Uniform(),
            remover=reverb.selectors.Fifo(),
            max_size=10000,
            # rate_limiter=reverb.rate_limiters.SampleToInsertRatio(2.58, 256, 50),)],
            rate_limiter=reverb.rate_limiters.MinSize(256),)],
    )

    # n_sample = 100000
    n_sample = 1000

    runProducer(server.port, n_sample)
    runConsumer(server.port, n_sample // 100)

    client = reverb.Client(f'localhost:{server.port}')
    print(client.server_info())

I'm wondering whether I have made some mistakes or if this is a compatibility issue with viztracer. 🤔

By the way, this is really a nice profiling tool. 🤗

@gaogaotiantian
Copy link
Owner

I can't reproduce this issue locally with python3.9 + WSL2 Ubuntu. The error normally means there's a file corruption. Are you able to reproduce this issue reliably with the exact code you provided?

@CWHer
Copy link
Author

CWHer commented Aug 4, 2022

Thanks for your timely reply. It seems there was something wrong with my environment. I create a new environment and everything is fine now.

I have never seen this kind of error before. Sorry for the inconvenience caused.

@CWHer CWHer closed this as completed Aug 4, 2022
@CWHer
Copy link
Author

CWHer commented Aug 4, 2022

Well, after using another version of reverb, the error occurs again.

FYI, reverb is part of acme. For me, pip install dm-reverb[tensorflow] causes no error. However, when combined with acme, I can reproduce this issue reliably.

The following are my pip installation log,

pip install viztracer
pip install tqdm
pip install dm-acme[jax,tf,envs]

together with the full requirements.txt,

absl-py==1.2.0
ale-py==0.7.5
astunparse==1.6.3
atari-py==0.3.0
attrs==22.1.0
autopep8==1.6.0
bsuite==0.3.5
cachetools==5.2.0
certifi @ file:///opt/conda/conda-bld/certifi_1655968806487/work/certifi
charset-normalizer==2.1.0
chex==0.1.3
cloudpickle==2.1.0
colorama==0.4.5
commonmark==0.9.1
cycler==0.11.0
decorator==5.1.1
dill==0.3.5.1
distrax==0.1.2
dm-acme==0.4.0
dm-control==0.0.364896371
dm-env==1.5
dm-haiku==0.0.7
dm-launchpad==0.5.0
dm-reverb==0.7.0
dm-sonnet==2.0.0
dm-tree==0.1.7
etils==0.6.0
flatbuffers==2.0
flax==0.5.3
fonttools==4.34.4
frozendict==2.3.4
future==0.18.2
gast==0.5.3
glfw==2.5.4
google-auth==2.9.1
google-auth-oauthlib==0.4.6
google-pasta==0.2.0
googleapis-common-protos==1.56.4
grpcio==1.48.0
gym==0.25.1
gym-notices==0.0.7
h5py==3.7.0
idna==3.3
imageio==2.21.0
importlib-metadata==4.12.0
importlib-resources==5.9.0
jax==0.3.15
jaxlib==0.3.15
jmp==0.0.2
keras==2.8.0
Keras-Preprocessing==1.1.2
kiwisolver==1.4.4
labmaze==1.0.5
libclang==14.0.6
lxml==4.9.1
Markdown==3.4.1
MarkupSafe==2.1.1
matplotlib==3.5.2
mizani==0.7.4
mock==4.0.3
msgpack==1.0.4
networkx==2.8.5
numpy==1.23.1
oauthlib==3.2.0
objprint==0.2.2
opt-einsum==3.3.0
optax==0.1.3
packaging==21.3
palettable==3.3.0
pandas==1.4.3
patsy==0.5.2
Pillow==9.2.0
plotnine==0.9.0
portpicker==1.5.2
promise==2.3
protobuf==3.20.1
psutil==5.9.1
pyasn1==0.4.8
pyasn1-modules==0.2.8
pycodestyle==2.9.0
Pygments==2.12.0
PyOpenGL==3.1.6
pyparsing==3.0.9
python-dateutil==2.8.2
pytz==2022.1
PyWavelets==1.3.0
PyYAML==6.0
requests==2.28.1
requests-oauthlib==1.3.1
rich==11.2.0
rlax==0.1.2
rsa==4.9
scikit-image==0.19.3
scipy==1.9.0
six==1.16.0
statsmodels==0.13.2
tabulate==0.8.10
tensorboard==2.8.0
tensorboard-data-server==0.6.1
tensorboard-plugin-wit==1.8.1
tensorflow==2.8.0
tensorflow-datasets==4.4.0
tensorflow-estimator==2.8.0
tensorflow-io-gcs-filesystem==0.26.0
tensorflow-metadata==1.9.0
tensorflow-probability==0.15.0
tensorstore==0.1.22
termcolor==1.1.0
tf-estimator-nightly==2.8.0.dev2021122109
tifffile==2022.8.3
toml==0.10.2
toolz==0.12.0
tqdm==4.64.0
trfl==1.2.0
typing_extensions==4.3.0
urllib3==1.26.11
viztracer==0.15.3
Werkzeug==2.2.1
wrapt==1.14.1
zipp==3.8.1

Could you please try this? 🤔

@CWHer CWHer reopened this Aug 4, 2022
@gaogaotiantian gaogaotiantian added the bug Something isn't working label Aug 4, 2022
@gaogaotiantian
Copy link
Owner

The reason of this bug is PyCapsule, an uncommon class used for passing a C void pointer between C extensions, which is used in jaxlib(mlir actually). I think the binary is built using some 3rd party library so the function name is left blank(more like garbage) on C level, which confused VizTracer. I can patch this, just need to think about what's the best way to do it.

@gaogaotiantian
Copy link
Owner

I believe this is a bug in either jax or mlir(llvm). I filed an issue in jax jax-ml/jax#11750

@gaogaotiantian
Copy link
Owner

So after some experiments, this is in kind of a grey area. VizTracer assumes every PyCFunctionObject has a static PyMethodDef, which contains a const char* pointer to its name and this pointer will be valid till the end of the execution. This is true for all CPython official code and almost all of the C extensions. However, mlir is probably creating a dynamic PyMethodDef which will be invalid after the PyCFunctionObject is released. I think they are creating a lot of anonymous disposable functions, which makes it very difficult to track their name in VizTracer.

There are a couple of solutions to this specific problem, but none of them are good:

  1. Keep a reference to all the PyCFunctionObject to ensure PyMethodDef is not released, which was actually the original implementation, but it was changed due to Does not work well together with tornado #47. If you think about it, it may not be the best idea to keep references to all the local variables, which will prevent them from being released.
  2. Copy the name string, instead of keeping the pointer to the string. This would result in a huge increase in memory, as in most cases, viztracer will record a lot of the same names(of the same function). This would also impact performance of course.
  3. Do something special for PyCapsule. Even though this issue is triggered by PyCapsule, it's more a problem for C extension implementation, so potentially it could happen anywhere.

This is a super rare corner case and probably will not happen a lot. In my opinion, it's at least 50% mlir's implementation's issue. I filed an issue llvm/llvm-project#56943 on llvm github and let's see how they will respond to this.

In the mean time, I will consider how to at least avoid an error from viztracer in this kind of situation.

@gaogaotiantian gaogaotiantian linked a pull request Aug 6, 2022 that will close this issue
@gaogaotiantian
Copy link
Owner

I added a workaround in #261 for this - an option --sanitize_function_name, which will check all the function names before dump and if it's not printable it will simply do not add the name.

If you want to try it out, you can install the mainline version by pip install git+https://github.com/gaogaotiantian/viztracer.

However, this is a very rare situation so I won't document this feature. It will stay quietly among the options.

To me this is an deficiency in mlir code, whether or not they agree. So there will be no major changes to the default behavior to accommodate this kind of issue.

On the other hand, the problematic function happened in import phase. If the only thing you are interested in is the code you are working on, you could do viztracer --ignore_frozen to skip the import stuff, which will, in this case, solve the problem.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants