CFG extraction timeout not working #106

Open
AlexVanMechelen opened this issue Apr 19, 2024 · 5 comments

@AlexVanMechelen
Contributor

AlexVanMechelen commented Apr 19, 2024

Issue

Sometimes the CFG extraction continues even after the timeout is hit. The line "Timeout reached when extracting CFG" gets printed to the screen, but angr keeps extracting the CFG, which significantly delays the CFG-based feature computation for that executable.

Reproduce

It's hard to reproduce as there is some randomness to it. It sometimes happens for an executable, but when trying again later with the same executable, the extraction stops successfully.
With the tool in the latest PR #105, I started extracting the CFG-based features for a dataset of 400 samples using 32 CPU cores. The features for the first 300 executables were extracted at a rate of approximately 3 seconds per executable. For the last few executables, however, the extraction time skyrockets because CFG extraction continues even after the timeout. At this time, after 1 h 40 min, the features of the last 30 executables are still being extracted.

Resolve

If this issue cannot be resolved directly, it may be worth making it possible to save progress in the dataset convert command, so that the user can halt it early when some executables take a very long time to extract. The user could then either continue with the majority of executables for which the features were extracted, or relaunch the conversion in the hope that the CFG extraction halts correctly at the timeout this time.

@AlexVanMechelen
Contributor Author

AlexVanMechelen commented Apr 20, 2024

Testing

To test if it was slowly making progress or actually stuck, I let the extraction run for about ten hours, but no progress was made after the first 25 minutes. To check where it got stuck, I interrupted with CTRL+C and got the following traceback:

Exception ignored in: <function PagedMemoryMixin.__del__ at 0x7760402d16c0>
Traceback (most recent call last):
  File "/home/user/.local/lib/python3.11/site-packages/angr/storage/memory_mixins/paged_memory/paged_memory_mixin.py", line 58, in __del__
    page.release_shared()
  File "/home/user/.local/lib/python3.11/site-packages/angr/storage/memory_mixins/paged_memory/pages/refcount_mixin.py", line 50, in release_shared
    with self.lock:
  File "/home/user/.local/lib/python3.11/site-packages/angr/misc/picklable_lock.py", line 16, in __enter__
    return self._lock.__enter__()
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/.local/lib/python3.11/site-packages/tinyscript/features/handlers.py", line 114, in __terminate_handler
    _hooks.quit(0)
  File "/home/user/.local/lib/python3.11/site-packages/tinyscript/features/handlers.py", line 64, in quit
    self.exit(code)
  File "/home/user/.local/lib/python3.11/site-packages/tinyscript/features/handlers.py", line 52, in exit
    self._orig_exit(code)
SystemExit: 0

Seems like angr gets in a deadlock when running out of memory pages (?)

Resolve

So maybe the aforementioned idea of incrementally saving progress of the extracted features to disk in a temporary folder can free up RAM and allow the complete extraction process to finish.

@AlexVanMechelen
Contributor Author

Testing

Using the tool in PR #108, I split the dataset into two equal-sized datasets of 200 executables.
I then ran the dataset convert command on both datasets, specifying some CFG-based features. After 2 h 30 min, one of them is (stuck) at 174/200 samples and the other at 197/200. As proposed before, if the CFG extraction timeout issue cannot be fixed, it might be useful to save progress for the executables for which the CFG-based features could be extracted, stop the dataset convert, then put -1 for all the CFG-based features of the other executables and compute only the non-CFG-based features for them.
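
The save-progress idea could be sketched roughly as follows. This is only an illustration under stated assumptions, not the tool's actual code: `convert_with_checkpoint`, the `extract_cfg` and `extract_other` callables, the JSON-lines output file and the feature names in `CFG_FEATURES` are all hypothetical.

```python
import json
from pathlib import Path

CFG_FEATURES = ("cfg_nodes", "cfg_edges")  # hypothetical feature names


def convert_with_checkpoint(samples, extract_cfg, extract_other, out_path):
    """Append one JSON line per sample so an interrupted run can be resumed."""
    out_path = Path(out_path)
    done = set()
    if out_path.exists():
        # Collect the samples already written by a previous (interrupted) run.
        with out_path.open() as fh:
            done = {json.loads(line)["sample"] for line in fh if line.strip()}
    with out_path.open("a") as fh:
        for sample in samples:
            if sample in done:
                continue  # already extracted earlier, do not redo the work
            try:
                cfg = extract_cfg(sample)
            except Exception:
                # Placeholder values when CFG extraction fails or times out.
                cfg = {name: -1 for name in CFG_FEATURES}
            row = {"sample": sample, **cfg, **extract_other(sample)}
            fh.write(json.dumps(row) + "\n")
            fh.flush()  # push the row out so progress survives an interruption
    with out_path.open() as fh:
        return [json.loads(line) for line in fh if line.strip()]
```

Relaunching with the same output path skips every sample that already has a row, so only the remaining executables are processed again.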

@dhondta dhondta self-assigned this Apr 27, 2024
@dhondta dhondta added the failure Issue found in production while not necessarily being a mistake label Apr 27, 2024
@dhondta
Collaborator

dhondta commented Apr 29, 2024

@AlexVanMechelen Please try. Not sure this will fix the issue but worth giving it a try.

@AlexVanMechelen
Contributor Author

AlexVanMechelen commented Apr 29, 2024

@dhondta This indeed fixes the issue of angr getting into a deadlock.
I tested a couple of datasets and the feature extraction finishes without blocking.

The broader issue of the CFG extraction timeout sometimes not working still remains, so a small percentage of samples have a significantly longer feature extraction time than others.
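
One generic way to enforce a hard time limit is to run the work in a child process that can be killed. The sketch below is not what the tool or angr does; `run_with_hard_timeout` is a hypothetical helper and the `code` string merely stands in for the extraction step.

```python
import subprocess
import sys


def run_with_hard_timeout(code: str, timeout: float) -> bool:
    """Run `code` in a child interpreter; return False if it exceeded `timeout`."""
    try:
        subprocess.run([sys.executable, "-c", code], timeout=timeout, check=True)
        return True
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child and waits for it before re-raising,
        # so no work keeps running in the background after the limit.
        return False
```

The trade-off is that results have to be passed back from the child explicitly (for example through a file), which an in-process timeout does not require.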

PS:

Although still an issue, this no longer blocks running experiments, especially when using #105. With multiprocessing, the samples for which the timeout doesn't work don't block others from starting. Therefore, all such samples get started as soon as possible and continue in parallel. An example with 12 such samples taking longer:

In an experiment with 64 CPU cores and a dataset of 402 samples, the first 389 samples finished extraction in 65 s (about 6 samples per second), while the last 12 samples took 11 min 10 s (about 0.018 samples per second).

For reference, an experiment without #105 (1 core) took 3 h 33 min 33 s for the same 402 samples (about 0.03 samples per second).
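
The rates quoted above can be recomputed from the reported sample counts and durations; `samples_per_second` is just a helper name chosen for this check.

```python
def samples_per_second(samples: int, seconds: float) -> float:
    """Throughput as samples divided by elapsed seconds."""
    return samples / seconds


fast = samples_per_second(389, 65)                         # first 389 samples, 65 s
slow = samples_per_second(12, 11 * 60 + 10)                # last 12 samples, 11 min 10 s
single = samples_per_second(402, 3 * 3600 + 33 * 60 + 33)  # 1 core, 3 h 33 min 33 s

print(f"{fast:.2f} {slow:.3f} {single:.3f}")  # 5.98 0.018 0.031
```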

@dhondta
Collaborator

dhondta commented May 1, 2024

@AlexVanMechelen OK, here is the explanation: TimeoutError could not be handled in the code section covered by the lock because that lock was a simple Lock primitive. Changing it to RLock (that is, a reentrant lock) means the code section is no longer left without releasing the lock when TimeoutError is raised.
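
The difference between the two primitives can be seen in isolation with Python's standard `threading` module. This is a minimal sketch, not angr's code, and `can_reacquire` is a hypothetical helper introduced for illustration: a thread that already holds a plain Lock cannot take it again, whereas an RLock allows it.

```python
import threading


def can_reacquire(lock) -> bool:
    """Return True if the thread that already holds `lock` can take it again."""
    with lock:
        # Non-blocking attempt, so a plain Lock reports failure instead of hanging.
        got_it = lock.acquire(blocking=False)
        if got_it:
            lock.release()
        return got_it


print(can_reacquire(threading.Lock()))   # False
print(can_reacquire(threading.RLock()))  # True
```

In the traceback earlier in this thread, the termination handler ran while the same thread was inside `with self.lock:`, which is presumably the kind of situation where reentrancy matters.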
