RFD 159 Discussion #123

Open
jjelinek opened this issue Dec 3, 2018 · 4 comments
jjelinek commented Dec 3, 2018

Issue for discussing RFD 159 on Manta storage zone capacity

@davepacheco

Thanks again for writing this up! Based on talking to a few people about these issues, I think it would be useful to start with a clearer separation of the policy choice from the mechanism used to implement it. I think the policy we want is something like:

  • Manta should fill storage zones up to a max utilization of X% (for some specific X). X is chosen such that performance remains reliably good up to that utilization [at the expected fragmentation rate]. Ongoing work may allow us to increase X (or decrease the expected fragmentation rate), which decreases the cost of the system.
  • The max utilization should also account for a fixed amount of space used for temporary log files, crash dumps, and other system files.

The mechanism is more complicated:

  • We want Manta to stop trying to use a storage zone before it's actually full. Otherwise, requests would fail as the storage zone fills up, and we wouldn't have room to write log files and other files inside the storage zone. (This implies that Manta both is aware of the physical space available in each zone and stops writing to it before it's full.) This behavior is currently controlled globally (i.e., by Muskie for all storage zones) by the MUSKIE_MAX_UTILIZATION_PCT tunable.
  • We also want a quota on each zone to act as a backstop in case of a problem with the above mechanism.
  • Since MUSKIE_MAX_UTILIZATION_PCT is effectively applied after the quota, if the policy has a target of X%, the operator must configure the quota and MUSKIE_MAX_UTILIZATION_PCT such that the product of these two represents X% of usable storage on the box. That likely means MUSKIE_MAX_UTILIZATION_PCT will be greater than X.

I know none of this is news to you, and much of it is reflected later in the RFD, but I think it would be useful to highlight the distinction. I say that because of the confusion I've seen around these issues. Some people think that if X=95, then we should just set the quota to 95% of the box's available storage [and pretend there's no MUSKIE_MAX_UTILIZATION_PCT]. We should explain here why doing that would result in lots of request failures as zones fill up. We also may not be able to read from a totally full zone because nginx can't write to its request log. On the other hand, as the RFD mentions, we can't rely solely on MUSKIE_MAX_UTILIZATION_PCT because of the cases in production where that went wrong and we needed a backstop. These two considerations lead to the non-obvious result that if the quota is 95% and MUSKIE_MAX_UTILIZATION_PCT limits us to 95%, then we wind up using only about 90% (0.95 * 0.95).
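To make the arithmetic concrete, here is a rough sketch of how the two limits compose. The function names are made up for illustration, the 0.98 quota fraction is an arbitrary example value, and the sketch ignores any rounding in how utilization is reported.

```javascript
// Fraction of the box actually used when a quota (as a fraction of the
// box) and MUSKIE_MAX_UTILIZATION_PCT are both in effect.
// Hypothetical helper, not part of Muskie.
function effectiveUtilization(quotaFraction, muskieMaxUtilizationPct) {
    return quotaFraction * (muskieMaxUtilizationPct / 100);
}

// Muskie limit (in percent) needed to reach a target fraction of the box,
// given the quota fraction. Also a hypothetical helper.
function muskiePctForTarget(targetFraction, quotaFraction) {
    return (targetFraction / quotaFraction) * 100;
}

// Quota at 95% of the box and Muskie limit at 95%: about 90% of the box.
console.log(effectiveUtilization(0.95, 95).toFixed(4)); // 0.9025

// With an (arbitrary) quota at 98% of the box, a 95% target needs a
// Muskie limit above 95.
console.log(muskiePctForTarget(0.95, 0.98).toFixed(2)); // 96.94
```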

My suggestion of policy above itself is also somewhat tied to the mechanism (because the idea of a target percentage is based in part on having implemented that), but I still think some distinction here is useful.

Regarding the suggestion here:

Using a quota on the storage zone dataset to reserve 1 TiB of usable space should be more than adequate as a safety net for the rest of the system.

To set a fixed 1 TiB quota, the total top-level space for the 'zones' dataset should be used, minus 1 TiB. The resulting value should be applied as the quota for the storage zone's dataset.

I think it will under-use the box by about MUSKIE_MAX_UTILIZATION_PCT * 1 TiB.

There is one additional factor to be aware of with the Muskie limit. The way Muskie works, it actually rounds up the calculation. That is, with a setting of 95%, Muskie will actually use almost 96% of the storage in the zone. This provides a fortuitous situation where the storage zone's 1 TiB quota is more than offset on our production storage servers. Thus, the quota does not have to factor in to the general thinking around storage zone capacity.

Is that right, or is that backwards? I think the behavior here is that if the zone is 93.0001% full, Muskie treats that as 94% full (because Minnow uses ceil() when reporting the percent utilized). In that case, if MUSKIE_MAX_UTILIZATION_PCT = 95, then it will stop using zones when they reach 94% utilized, which I think is the reverse of the above.

If we want the target to be 95%, I think we'd set MUSKIE_MAX_UTILIZATION_PCT = 96. Then we'd be using 95% of the quota, which itself would be 1 TiB less than 95% of the box. I think that means we'd under-use each zone by MUSKIE_MAX_UTILIZATION_PCT * 1 TiB. Is that difference significant?
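As a rough way to judge whether the difference matters, here is a sketch that computes the shortfall for a fixed reserve. The 250 TiB pool size is an assumption picked only for illustration, and the helper name is hypothetical.

```javascript
const TIB = 2 ** 40;

// Bytes left unused relative to "fillFraction of the whole pool" when the
// quota is set to (pool - reserve) and the zone is filled to fillFraction
// of that quota. Hypothetical helper for illustration only.
function shortfallBytes(poolBytes, reserveBytes, fillFraction) {
    const quotaBytes = poolBytes - reserveBytes;
    const targetBytes = fillFraction * poolBytes;
    const actualBytes = fillFraction * quotaBytes;
    return targetBytes - actualBytes;
}

const pool = 250 * TIB; // assumed pool size
const shortfall = shortfallBytes(pool, 1 * TIB, 0.95);

console.log((shortfall / TIB).toFixed(2));          // 0.95 (TiB)
console.log((100 * shortfall / pool).toFixed(2));   // 0.38 (percent of the pool)
```

Under those assumptions the shortfall is the fill fraction times the reserve, independent of pool size, so it shrinks as a share of the box as pools get larger.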

We should develop a plan to gradually roll out increases of the muskie limit, beyond 95%, into production. Ideally we can do this incrementally, on a few storage nodes at a time, then observe performance there, and roll back if we encounter new ZFS problems, before we deploy the updated limit to the entire fleet.

I've been wondering if we'd need to do this. Have you given much thought to how to do it? I imagine that we would have a SAPI tunable at the storage zone for the target fill (either as a percent or byte count), we'd have Minnow report this with the capacity record, and Muskie would use the value in the record as the target instead of its global limit.
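Something like the following is what I have in mind, as a sketch only. The `targetPercentUsed` field name and the record shape are assumptions for illustration; `percentUsed` stands in for the utilization value the capacity record already carries.

```javascript
// Global fallback, standing in for MUSKIE_MAX_UTILIZATION_PCT.
const GLOBAL_MAX_UTILIZATION_PCT = 95;

// Pick the limit for one storage zone: the per-zone target if the capacity
// record carries one, otherwise the global limit. `targetPercentUsed` is
// an assumed field name, not an existing one.
function limitFor(record, globalMax) {
    return Number.isFinite(record.targetPercentUsed)
        ? record.targetPercentUsed
        : globalMax;
}

// A zone is eligible for writes while its reported utilization does not
// exceed its limit.
function isWritable(record, globalMax) {
    return record.percentUsed <= limitFor(record, globalMax);
}

const records = [
    { id: 'stor-a', percentUsed: 96 },                        // global limit applies
    { id: 'stor-b', percentUsed: 96, targetPercentUsed: 97 }  // zone with a raised target
];

for (const r of records) {
    console.log(r.id, isWritable(r, GLOBAL_MAX_UTILIZATION_PCT));
}
```

Falling back to the global limit when the record has no target would let us raise the limit on a few storage nodes first without touching the rest of the fleet.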

jclulow commented Dec 7, 2018

While hanging out in the OpenZFS project Slack thing, the subject of SPA slop space came up. Looking at it, I suspect it accounts for the described 3-4% disparity between what space exists at the zpool (SPA) level, and what can be used at the zfs level.

There is a comment around spa_slop_shift that describes this in some detail:

/*
 * Normally, we don't allow the last 3.2% (1/(2^spa_slop_shift)) of space in
 * the pool to be consumed.  This ensures that we don't run the pool
 * completely out of space, due to unaccounted changes (e.g. to the MOS).
 * It also limits the worst-case time to allocate space.  If we have
 * less than this amount of free space, most ZPL operations (e.g. write,
 * create) will return ENOSPC.
 *
 * Certain operations (e.g. file removal, most administrative actions) can
 * use half the slop space.  They will only return ENOSPC if less than half
 * the slop space is free.  Typically, once the pool has less than the slop
 * space free, the user will use these operations to free up space in the pool.
 * These are the operations that call dsl_pool_adjustedsize() with the netfree
 * argument set to TRUE.
 *
 * Operations that are almost guaranteed to free up space in the absence of
 * a pool checkpoint can use up to three quarters of the slop space
 * (e.g zfs destroy).
 *
 * A very restricted set of operations are always permitted, regardless of
 * the amount of free space.  These are the operations that call
 * dsl_sync_task(ZFS_SPACE_CHECK_NONE). If these operations result in a net
 * increase in the amount of space used, it is possible to run the pool
 * completely out of space, causing it to be permanently read-only.
 *
 * Note that on very small pools, the slop space will be larger than
 * 3.2%, in an effort to have it be at least spa_min_slop (128MB),
 * but we never allow it to be more than half the pool size.
 *
 * See also the comments in zfs_space_check_t.
 */
int spa_slop_shift = 5;
uint64_t spa_min_slop = 128 * 1024 * 1024;

There are also comments (referred to above) around the zfs_space_check_t type that describe the different types of space checks and reserved quantities.

When this came up in the channel, there was some discussion of capping this value, since 3.2% of a 250TB pool is about 8TB, which feels like a lot of space to discard for all of the economic reasons mentioned in this RFD.
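For a sense of scale, here is a small sketch of the rule as the comment above describes it: 1/(2^spa_slop_shift) of the pool, at least spa_min_slop, and never more than half the pool. It is written from the comment, not from the kernel source, and the function name is made up.

```javascript
const SPA_SLOP_SHIFT = 5;                 // from the quoted source
const SPA_MIN_SLOP = 128 * 1024 * 1024;   // 128MB, from the quoted source

// Slop space per the quoted comment. Division is used instead of a bit
// shift because JavaScript shifts operate on 32-bit integers.
function slopSpace(poolBytes) {
    const slop = Math.floor(poolBytes / 2 ** SPA_SLOP_SHIFT);
    const minimum = Math.min(Math.floor(poolBytes / 2), SPA_MIN_SLOP);
    return Math.max(slop, minimum);
}

const TB = 1e12;
console.log((slopSpace(250 * TB) / TB).toFixed(2)); // 7.81 (TB)

// On a 1 GiB pool the 128MB minimum wins over 1/32 of the pool.
console.log(slopSpace(1024 * 1024 * 1024) === SPA_MIN_SLOP); // true
```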

@KodyKantor

Dave,

There is one additional factor to be aware of with the Muskie limit. The way Muskie works, it actually rounds up the calculation. That is, with a setting of 95%, Muskie will actually use almost 96% of the storage in the zone. This provides a fortuitous situation where the storage zone's 1 TiB quota is more than offset on our production storage servers. Thus, the quota does not have to factor in to the general thinking around storage zone capacity.

Is that right, or is that backwards? I think the behavior here is that if the zone is 93.0001% full, Muskie treats that as 94% full (because Minnow uses ceil() when reporting the percent utilized). In that case, if MUSKIE_MAX_UTILIZATION_PCT = 95, then it will stop using zones when they reach 94% utilized, which I think is the reverse of the above.

Ah. That's my mistake. I didn't realize minnow used a ceiling operation when it computed its usage percentage. That's confusing behavior.

numericillustration commented Dec 21, 2018

Dave,

There is one additional factor to be aware of with the Muskie limit. The way Muskie works, it actually rounds up the calculation. That is, with a setting of 95%, Muskie will actually use almost 96% of the storage in the zone. This provides a fortuitous situation where the storage zone's 1 TiB quota is more than offset on our production storage servers. Thus, the quota does not have to factor in to the general thinking around storage zone capacity.

Is that right, or is that backwards? I think the behavior here is that if the zone is 93.0001% full, Muskie treats that as 94% full (because Minnow uses ceil() when reporting the percent utilized). In that case, if MUSKIE_MAX_UTILIZATION_PCT = 95, then it will stop using zones when they reach 94% utilized, which I think is the reverse of the above.

Ah. That's my mistake. I didn't realize minnow used a ceiling operation when it computed its usage percentage. That's confusing behavior.

While Minnow does use a ceiling calculation for the utilization, the picker code that determines whether a storage zone is too full to write to uses a <= comparison of that reported value against MUSKIE_MAX_UTILIZATION_PCT. So even though 94.1% gets reported as 95%, sharks with 0 <= actual utilization <= 95.0 will still get used for writes. It's only when they pass 95%, e.g. at 95.1%, which gets ceilinged to 96, that if (v.percentUsed <= maxUtilization) becomes 96 <= 95 and fails. So setting 95% lets you use up to 95% of that storage; anything above that, even 95.000001%, is reported as 96 and rejected:

> total=100
100
> used=95
95
> Math.ceil((used / total) * 100)
95
> used=95.000001
95.000001
> Math.ceil((used / total) * 100)
96
> 
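Putting the two halves together, a minimal sketch of the ceiling on the reporting side and the <= comparison on the picker side looks like this. The function names are hypothetical and this is not the actual Minnow or Muskie code.

```javascript
// Reporting side: utilization rounded up to a whole percent.
function reportedPercentUsed(usedBytes, totalBytes) {
    return Math.ceil((usedBytes / totalBytes) * 100);
}

// Picker side: a zone stays eligible while the reported value is <= the limit.
function acceptsWrites(usedBytes, totalBytes, maxUtilizationPct) {
    return reportedPercentUsed(usedBytes, totalBytes) <= maxUtilizationPct;
}

const total = 100;
for (const used of [93.0001, 94.1, 95, 95.000001, 95.1]) {
    console.log(used, reportedPercentUsed(used, total),
        acceptsWrites(used, total, 95));
}
```

With a limit of 95, the first three values are accepted (reported as 94, 95, 95) and the last two are rejected (reported as 96).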
