Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consistency for File State #374

Merged
merged 64 commits into from
Jan 15, 2025
Merged

Consistency for File State #374

merged 64 commits into from
Jan 15, 2025

Conversation

foodprocessor
Copy link
Contributor

@foodprocessor foodprocessor commented Nov 25, 2024

This PR changes the file cache by:

  1. Creating a clear concept of file state, and using file locks to keep file state consistent.
  2. Deleting local files immediately when a user command tells us to do so.
  3. Refactoring isDownloadRequired to be simpler and easier to read.
  4. Refactoring OpenFile to open files immediately, when a download is not required.

1. File state:

I made this simplified diagram to show how I think file states should work. Every transition between states is now protected by a lock, without any risk of deadlock.
file cache state diagram

  • I added a LazyOpen flag to the flock struct to keep track of whether a file was in that state. Previously we only tracked that in the handle, which hid the true state of the file from other contexts.
  • The state of directories is still not protected. That may need to be a future PR.
  • The worst of this change is the file locking I added to take care of RenameDir. I decided the right way to protect the files in that case was to lock every file involved simultaneously (but without deadlock risk), before proceeding to change an entire prefix.
  • There are many new "internal" functions which are wrapped by component API functions. The only difference is that the internal functions usually run with a lock already acquired.
  • One change that may look strange is that I chose to add a lock parameter to functions which need to run with locks already acquired. This choice makes more sense in some places than others. Please give me feedback... maybe just a naming convention would be simpler.
  • While working on this, I did attempt to add more state flags to flock to signal when a file is in cloud storage and when it is in cache. This turned out to add a lot of code complexity, with the same information already being supplied by calls to os.Stat and GetAttr (which is usually cached!).
  • I did add a lock to GetAttr, and I do feel a bit weird about it. It should really be an RLock(), but that is not part of our LockMap, and I don't know if I have the heart to add it for one use.
  • TruncateFile puts the file through multiple state transitions, but when I examined the possibly bad outcomes of Truncate being interrupted by other unexpected requests or state changes, I really didn't find anything alarming. So, the code changes to make Truncate atomic don't really seem worth it to me.

2. Synchronous file deletion:

This change only affects the deletion of cached files which we were explicitly asked by the user to delete. Evictions are still asynchronous. These two very different situations used to use all the same code, which was causing lots of issues, because the two are very different cases. This caused a number of race conditions, bugs, which up until now I was "fixing" using bad code. Now I think async deletion is simply incorrect for user requests. So I removed it.

3. isDownloadRequired:

  • I removed the logic that checked if the file is expired, since the policy is in charge of that. If the file exists, it is not expired.
  • I cleaned up the refresh code, and changed the logic so if a file is open (not just lazy open, but actually open), then we will not download an update from the cloud, even if the refresh timer has expired.

4. OpenFile:

OpenFile had a slight issue, which is that if the user called Open with the create flag instead of calling Create explicitly, then we would end up creating a file handle without creating any file or object behind that. Then the user could call GetAttr and we would report that the file they have open doesn't exist.
I changed OpenFile to call isDownloadRequired right away, and open the local file immediately if a download is not required.

Changed file cache tests:

  • Drop 0s default timeouts. The minimum is 1s now, and unless we're explicitly testing eviction, there's no point having a low value there.
  • When a test needs a file to stay in cache for longer than a second, I changed the timeout instead of holding the file open to prevent eviction.
  • Changed tests to write to fake storage directly instead of wasting our time by using the cache timeout for no reason.
  • Close handles to satisfy the fearful handle leak gods.

What type of Pull Request is this? (check all applicable)

  • Refactor
  • Feature
  • Bug Fix
  • Optimization
  • Documentation Update

Checklist

  • Tested locally
  • Added new dependencies
  • Updated documentation
  • Added tests

Related Issues

  • Related Issue #
  • Closes #

Update tests to work with sync deletion.
Create files directly in loopback whenever possible.
@foodprocessor foodprocessor marked this pull request as draft November 25, 2024 16:58
Copy link
Contributor

@jfantinhardesty jfantinhardesty left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few questions/concerns. Are there some negative performance impacts that you have noticed now that deletes are all synchronized? It seems that this could end up slowing the filesystem down when there is a large cache as a lot of time would be spent synchronously deleting. Did this fix any prior race conditions when using the go race detector?

component/file_cache/file_cache_linux_test.go Show resolved Hide resolved
component/file_cache/file_cache_linux_test.go Show resolved Hide resolved
component/file_cache/file_cache_test.go Show resolved Hide resolved
@foodprocessor
Copy link
Contributor Author

A few questions/concerns. Are there some negative performance impacts that you have noticed now that deletes are all synchronized? It seems that this could end up slowing the filesystem down when there is a large cache as a lot of time would be spent synchronously deleting. Did this fix any prior race conditions when using the go race detector?

I'll try to address your questions one bit at a time:

  1. This does not change eviction. Eviction is still done asynchronously. This only makes explicit deletion and rename (where the user has deleted or renamed a file) synchronous. So, the performance impact of this change should be very very small.
  2. The race conditions that this change fixes were rare, and I had already done a fair amount of work to minimize or eliminate the risk from them in previous improvements to the file cache. I do not have any proof that there are exposed race conditions that this resolves.
  3. However, the real reason for this change is to tighten up the file cache to prepare for adding network fault tolerance. The consistency challenges that come with adding that feature leave little or no room for the type of awkward timing issues that the file cache has before this change.

@foodprocessor foodprocessor marked this pull request as ready for review November 26, 2024 21:31
@foodprocessor
Copy link
Contributor Author

A few questions/concerns. Are there some negative performance impacts that you have noticed now that deletes are all synchronized? It seems that this could end up slowing the filesystem down when there is a large cache as a lot of time would be spent synchronously deleting. Did this fix any prior race conditions when using the go race detector?

I'll try to address your questions one bit at a time:

1. This does not change eviction. Eviction is still done asynchronously. This only makes explicit deletion and rename (where the user has deleted or renamed a file) synchronous. So, the performance impact of this change should be very very small.

2. The race conditions that this change fixes were rare, and I had already done a fair amount of work to minimize or eliminate the risk from them in previous improvements to the file cache. I do not have any proof that there are exposed race conditions that this resolves.

3. However, the real reason for this change is to tighten up the file cache to prepare for adding network fault tolerance. The consistency challenges that come with adding that feature leave little or no room for the type of awkward timing issues that the file cache has before this change.

In the third point, I said this was for network fault tolerance... it's actually in prep for a much smaller feature: maintaining file handle validity after rename and deletion (to better fit with POSIX).

Copy link
Contributor

@jfantinhardesty jfantinhardesty left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that some of the locking and unlocking could be a bit aggressive. It seems that there is a possibility for deadlock. In particular, the unit tests are timing out due to deadlock. Also, I might recommend using the race detector when running the unit tests to check for any race conditions as you add / remove some of the locks.

@foodprocessor
Copy link
Contributor Author

I think that some of the locking and unlocking could be a bit aggressive. It seems that there is a possibility for deadlock. In particular, the unit tests are timing out due to deadlock. Also, I might recommend using the race detector when running the unit tests to check for any race conditions as you add / remove some of the locks.

If the tests can't finish then they can't fail! 😁

@foodprocessor foodprocessor marked this pull request as draft November 27, 2024 08:47
@foodprocessor foodprocessor marked this pull request as ready for review January 2, 2025 06:35
@foodprocessor foodprocessor changed the title Delete files synchronously File Cache Refactor Jan 2, 2025
@foodprocessor foodprocessor changed the title File Cache Refactor Consistency for File State Jan 2, 2025
@foodprocessor foodprocessor force-pushed the delete-files-synchronously branch from 258cd37 to 6655269 Compare January 7, 2025 05:17
@jfantinhardesty
Copy link
Contributor

Looks like there is a deadlock somewhere with the recent changes, all the unit tests timeout now.

@foodprocessor
Copy link
Contributor Author

Looks like there is a deadlock somewhere with the recent changes, all the unit tests timeout now.

Woops! Fixed it! In RenameDir, I incorrectly included locking the files (again!) during the rename process.

@foodprocessor foodprocessor merged commit 3dbfd26 into main Jan 15, 2025
16 of 17 checks passed
@foodprocessor foodprocessor deleted the delete-files-synchronously branch January 15, 2025 23:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants