
ENOSPC rework #53

Open
13 tasks
josefbacik opened this issue Feb 23, 2024 · 1 comment

Comments

@josefbacik
Contributor

We have a variety of issues here, and there are some things I want to get done in order to make progress. This isn't exhaustive, but it covers, more or less line by line, the things that I know are currently bad and need to be worked on.

Problems

  • space_info->lock contention. Any of our multi-threaded write workloads get bogged down in the ENOSPC machinery because we have to take this lock to check and see if we can make our reservation.
  • Inaccuracy. We play a lot of games around where we should be reserving space, and sometimes we get this wrong. The biggest place is fragmentation on inodes. If we have to split a large delalloc region into a bunch of little regions and we never write to that inode again, there is no mechanism for refilling the larger reservation need. This leads to transaction aborts when we try to finish the ordered extent.
  • Complicated accounting. This has gotten better through the years, but is still relatively complicated. We have the ->bytes_may_use counter which is the ultimate sum of all outstanding reservations, and we have all the various btrfs_block_rsv's which cart reservations around.

We could drastically increase our multi-threaded performance by addressing these issues, and reduce the corners where we can still end up with transaction aborts.

Overall design

I want to do a few things to move us in a better direction and address the above problems.

  1. Use per-cpu wherever possible. We hit this machinery pretty hard; we need to reduce the lock contention by relying on lockless strategies wherever possible.
  2. Always, always, always add our reservation to ->bytes_may_use, no matter what path we're in. This is to address the inaccuracy part. I would change the delalloc metadata reservations to immediately add to ->bytes_may_use when we know we have more reservation. This would be to make sure that any new writes coming in don't use up space that isn't available because of existing delalloc requirements.
  3. No longer use block_rsv's. We currently call use_block_rsv() to make sure we have a reservation to call into the allocator. I want to decouple our reservation system from our actual usage. With 2 above our reservation will always be kept up to date with the current worst case usage in the file system, so it's the entry points to this system (__reserve_bytes) that are responsible for balancing our outstanding reservations with our actual usage.
  4. Allow for overcommit up to the size of the disk. Currently we curtail the overcommit size to some fraction of available space. This is to be pessimistic about the case where we could use every reservation we have outstanding. But this doesn't and can't happen in practice; we are not going to allocate BTRFS_MAX_LEVEL * 3 blocks for every single delayed ref we have outstanding. Instead use our flushing infrastructure to scale with our usage, and simply be more aggressive as we get closer to full. We will utilize the global_block_rsv as a real, on disk reservation to make sure we can always get what we need to disk, and then fall back to the slow, not lockless path when we run out of free chunk space.
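The fast path implied by points 1 and 4 could be sketched roughly like this. This is illustrative user-space C, not btrfs kernel code: the struct layout, `try_reserve_fast`, and the C11 atomics standing in for percpu counters are all assumptions to show the shape of the lockless check.

```c
/* Hedged sketch: simulates the proposed lockless reservation fast path.
 * C11 atomics stand in for percpu counters; names are illustrative. */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

struct space_info {
    atomic_uint_least64_t bytes_may_use;   /* sum of outstanding reservations */
    atomic_uint_least64_t bytes_available; /* total - used - reserved - pinned */
    uint64_t disk_size;                    /* overcommit ceiling (point 4) */
};

/* Optimistically add the reservation; back it out and signal the caller
 * to take the slow (locked, flushing) path if the new total exceeds what
 * overcommit-to-disk-size allows. */
static bool try_reserve_fast(struct space_info *si, uint64_t bytes)
{
    uint64_t in_use = atomic_fetch_add(&si->bytes_may_use, bytes) + bytes;
    uint64_t avail = atomic_load(&si->bytes_available);

    /* Overcommit up to the disk size rather than a fraction of free space. */
    if (in_use <= avail || in_use <= si->disk_size)
        return true;

    atomic_fetch_sub(&si->bytes_may_use, bytes); /* undo, go slow path */
    return false;
}
```

The key design point is that the common case touches no locks at all; only reservations that would push us past the overcommit ceiling pay for the slow path.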

Steps to address the issues

Each of these checkboxes is a single patch series. Some of this can be done independently, but I've listed them in order I would do them. The idea is that anybody should be able to pick these up and make progress without having my level of intimate knowledge of the ENOSPC machinery.

  • Create a percpu bytes_may_use counter, make sure it is updated in all the places where space_info->bytes_may_use is updated.
  • Create a new per-space_info counter that is the available space in the space_info. This should also be percpu, but it may make more sense to be an atomic64_t. This should be essentially the same as calling btrfs_space_info_used(space_info, false). Make it get adjusted everywhere the other counters are updated (bytes_used, bytes_reserved, bytes_pinned etc).
  • Change the percpu bytes_may_use counter to be updated whenever the delalloc reservation is changed. Currently we use btrfs_calculate_inode_block_rsv_size() to update the ->size of the inode block_rsv, then we call btrfs_inode_rsv_release() to release excess reservation. However, we don't do anything if ->size is greater than ->reserved. Update btrfs_calculate_inode_block_rsv_size() to either subtract from the new percpu bytes_may_use if ->size is greater than ->reserved, or add to it if the opposite is true.
  • Update delayed refs resv to mirror what we now do for delalloc. btrfs_update_delayed_refs_rsv needs to be updated to add to the percpu bytes_may_use when ->size > ->reserved, and vice versa in the release path. This also needs to be done in the bg_inserts or bg_updates path.
  • Move the delayed iput into its own worker. Currently it's in the cleaner thread, which can get backed up behind snapshot deletion and other cleaner operations.
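The ->size vs ->reserved adjustment described in the third item above might look roughly like this sketch. The `block_rsv` fields mirror the real struct, but `update_rsv_size` and the plain `int64_t` standing in for the percpu counter are hypothetical, user-space illustrations only.

```c
/* Hedged sketch of folding the block_rsv shortfall into bytes_may_use. */
#include <stdint.h>

struct block_rsv {
    uint64_t size;     /* worst-case reservation we need */
    uint64_t reserved; /* what we actually hold */
};

/* On every recalculation, keep bytes_may_use reflecting the shortfall
 * (size in excess of reserved) so it always tracks worst-case usage:
 * remove the old shortfall, install the new size, add the new shortfall. */
static void update_rsv_size(struct block_rsv *rsv, uint64_t new_size,
                            int64_t *bytes_may_use)
{
    if (rsv->size > rsv->reserved)
        *bytes_may_use -= (int64_t)(rsv->size - rsv->reserved);
    rsv->size = new_size;
    if (rsv->size > rsv->reserved)
        *bytes_may_use += (int64_t)(rsv->size - rsv->reserved);
}
```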

Once those things are done (they can be done in any order), the following can be done, but must be done in order.

  • Remove ->bytes_may_use, replace its usage with the percpu bytes_may_use counter.
  • Add a lockless check in __reserve_bytes that checks the bytes_may_use counter plus our reservation against our available space counter. If it's under the threshold or can_overcommit is true return without taking any of the locks, otherwise go through the slow path.
  • Remove chunk allocation from the metadata space flushing.
  • Remove use_block_rsv and its ilk.
  • Instead of tracking delayed refs reservations in the delayed_refs_block_rsv, simply update ->bytes_may_use every time we add or remove a delayed ref or need to account for a dirty block group.
  • Remove the delayed_refs_rsv.
  • Update the bytes_may_use percpu counter whenever we dirty an inode, remove the block rsv migration code.
  • Stop starting a trans handle in btrfs_dirty_inode.
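The per-ref accounting that replaces the delayed_refs_rsv (fifth item in the list above) could be sketched as follows. The BTRFS_MAX_LEVEL * 3 worst case is taken from the design discussion above; the 16K nodesize, function names, and the atomic counter standing in for the percpu one are assumptions for illustration.

```c
/* Hedged sketch: account each delayed ref directly in bytes_may_use
 * instead of migrating reservations through delayed_refs_block_rsv. */
#include <stdatomic.h>
#include <stdint.h>

#define BTRFS_MAX_LEVEL 8
#define NODESIZE 16384ULL /* assumed default nodesize */

/* Worst case: one ref can dirty every level of up to three trees. */
static uint64_t delayed_ref_bytes(void)
{
    return (uint64_t)BTRFS_MAX_LEVEL * 3 * NODESIZE;
}

/* Adding a ref reserves its worst-case cost... */
static void add_delayed_ref(atomic_uint_least64_t *bytes_may_use)
{
    atomic_fetch_add(bytes_may_use, delayed_ref_bytes());
}

/* ...and running (or dropping) it releases that cost again. */
static void run_delayed_ref(atomic_uint_least64_t *bytes_may_use)
{
    atomic_fetch_sub(bytes_may_use, delayed_ref_bytes());
}
```

Because the counter is adjusted at add/run time, there is no block_rsv to migrate space into or out of, which is what lets the delayed_refs_rsv be removed entirely in the following step.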
@leebeck3

leebeck3 commented Nov 14, 2024

I'm gonna take the first 4.

intel-lab-lkp pushed a commit to intel-lab-lkp/linux that referenced this issue Dec 29, 2024
implements percpu_counter bytes_may_use_percpu in btrfs_space_info. Adds
2 functions to DECLARE_SPACE_INFO_UPDATE as well as some logic, enabling
the percpu counter to keep track of all bytes added to "bytes_may_use".
Also catching any errors as they may occur.

This commit is the first for this rework:
btrfs/btrfs-todo#53

Signed-off-by: Roger L. Beckermeyer III <[email protected]>
leebeck3 added a commit to leebeck3/btrfs-linux that referenced this issue Jan 10, 2025