Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow to recursively search package #1270

Closed

Conversation

mcoconnor
Copy link

Description

Enables recursive mode for glob when processing the exclude_glob setting during package assembly. This allows the use of the ** operator in glob strings and brings the functionality in line with existing documentation.

GitHub Issues

Reported here: #1269

@coveralls
Copy link

Coverage Status

coverage: 74.688%. remained the same when pulling ddd6b65 on CCITatBCM:allow_exclude_glob_recursive into f2f03ba on zappa:master.

@javulticat javulticat self-requested a review November 6, 2023 03:27
Copy link
Member

@javulticat javulticat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TL;DR: Please add happy path and some failure mode unit tests, and I think this PR will be ready to merge! Thanks, great work!

Hi @mcoconnor, thanks for pointing this issue out and taking the initiative to create a PR to fix the issue yourself!

This change itself looks pretty good, but I'd like to ask that you also add some test cases to this PR before we merge it - just to confirm the behavior is working as expected, especially since it will be touching a common setting currently being actively used by a large portion of our userbase.

So, unit tests that prove that your changes now allow the ** glob operator to work as expected would be a great place to start.

I also think we probably want to have some tests that would prove that any other behavior that users currently experience/expect when using the exclude_glob setting (besides the ** recursion that is currently not working, of course) isn't going to regress or negatively impact the user experience if we merge this change.

For example, to play devil's advocate... using a recursive glob search over very large directories can be very slow, can it not?

  • What could the consequences be if someone has massive_dir\**\*.pyi in exclude_glob and then this change goes live?
  • Could that kind of thing risk causing major delays or even timeouts on CI runners if folks are deploying Zappa via some kind of CI pipeline?
    • Could CI timeouts prevent user applications from getting deployed in a timely manner (or at all)?
  • Could delays cause any open connection Zappa had with AWS to timeout?
    • If so, does Zappa have logic in place to automatically try reconnecting to AWS if that happens?
  • Could delays cause impatient users to think that Zappa is stuck/hanging when it's really just slowly recursively searching over a large directory listed in exclude_blob, which ends up with them force-quitting the process (potentially leaving their AWS environment in a very ugly state due to their force-quit during their half-deployed Zappa application).
    • Will those users then almost certainly come here to complain that "Zappa is broken because it just hangs whenever I try to use it"? 😄

All right - now, I'm not saying we need to solve for all of those potential issues before merging this PR, but if things can go wrong, we should at least know what those things could commonly be and how the most common problems will present themselves to the user (which is something tests come in very handy to help show us), which will then help teach us how we can ameliorate, or at least mitigate, these potential issues when, or ideally before, they happen to our users.

So, how can we solve for any of the potential issues that may arise from this rather classic example of a recursion performance dilemma?

  1. Set a timeout for how long we'll recursively look through each directory before moving on to the next one? But what do we do if we actually hit that limit?
  2. Enforce a size/file limit for directories that can be recursively searched using the ** operator? But how do we determine that limit? Especially when machine A might be able to recursively search a 500k file folder structure twice as fast as machine B.
  3. Have documentation explaining the potential issues and any workarounds we've identified that can be attempted by the users if they encounter those issues.

So, yeah... As you can see, even with a simple two line change, messing with recursion can open up a whole can of worms. 🙂

My two cents: you are making a change that fixes behavior that is currently incorrect. I like to have correct behavior, and imo, in this case, that outweighs any unlikely performance edge cases we may have to deal with. The likelihood that users are going to be doing recursive glob searches on directories large enough to cause them or us too much pain is probably very low for the vast majority of our users, so I'm willing to risk letting it go and dealing with any performance issues if/when our users complain of them.

That said, I still would love to see a few tests included in this PR that test some of these worst-case failure scenarios and how Zappa would react in those cases. As I hope I've explained, that will be helpful in showing us what to expect from any potential user issues if/when things go wrong, and the sooner we have that information, the sooner we can begin thinking about ways we might be able to improve the performance of these recursive glob searches. Thanks!

for path in glob.glob(os.path.join(temp_project_path, glob_path)):
# Use `recursive` to match paths deep in the directory tree
# https://github.com/zappa/Zappa/issues/1269
for path in glob.glob(os.path.join(temp_project_path, glob_path), recursive=True):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think there's any downside to using glob.iglob() here instead of glob.glob(), as iglob() returns an iterator, which should be easier on memory usage if a user decides to put massive folders with tons of files in exclude_glob? Especially since this code may often running be running on a CI runner, and the free ones that Github/Gitlab provide aren't provisioned with very much RAM at all (IIRC, it's like 1GB, and most is taken up my the OS and CI runner software).

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, yes, that's a great idea.

for path in glob.glob(os.path.join(temp_project_path, glob_path)):
# Use `recursive` to match paths deep in the directory tree
# https://github.com/zappa/Zappa/issues/1269
for path in glob.glob(os.path.join(temp_project_path, glob_path), recursive=True):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same question of using glob.iglob() instead of glob.glob(). (See other comment)

@mcoconnor
Copy link
Author

Thanks for the very thoughtful and thorough review (especially for such an apparently minor change). I'll be happy to add some tests when I have time to swing back to this. You've asked some interesting questions about performance management and I agree it would be nice to characterize the failure modes a little more and consider how to handle or document them. I'm not sure about including those in the test suite though--seems like intentionally triggering performance issues in a unit test will be a pain for anyone running tests going forward, though I'm open to other opinions.

I'll have to think a little bit about what mitigations might look like in this scenario; to a degree, I feel that either preemptively bombing the process (e.g. a timeout) or executing a halfway correct solution (e.g. limiting the number of results processed) could lead to more frustrating bugs than an obvious failure with a long-running process. The other good thing is that this affects a relatively early part of the deploy/update process so if it does hang, there isn't a lot of garbage that will be left other than the temp file tree that was being operated on.

Copy link

github-actions bot commented Apr 3, 2024

Hi there! Unfortunately, this PR has not seen any activity for at least 90 days. If the PR is still relevant to the latest version of Zappa, please comment within the next 10 days if you wish to keep it open. Otherwise, it will be automatically closed.

@github-actions github-actions bot added the no-activity [Bot] Closing soon if no new activity label Apr 3, 2024
Copy link

Hi there! Unfortunately, this PR was automatically closed as it had not seen any activity in at least 100 days. If the PR is still relevant to the latest version of Zappa, please open a new PR.

@github-actions github-actions bot added the auto-closed [Bot] Closed, details in comments label Apr 13, 2024
@github-actions github-actions bot closed this Apr 13, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
auto-closed [Bot] Closed, details in comments no-activity [Bot] Closing soon if no new activity
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants