RFD 170 Manta Picker Component #132
Thanks! I'm curious to see what other things are added to this RFD. Meanwhile, here are some questions to consider.
Thanks for the comments @KodyKantor. I'm interested to hear your thoughts on my responses below.
Once picker is registered with DNS/zookeeper muskie can look it up at
The picker queries the shard 1 moray. Regardless, I believe today if the picker can't provide updated data it uses stale information (I need to double check that). The picker component could return a timestamp with each request, letting the requester know how stale the data is.
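To make that concrete, here's a minimal sketch (TypeScript, with hypothetical field names; this is not the actual RFD 170 API) of a picker response carrying such a timestamp:

```typescript
// Hypothetical response shape for a picker query; field names are
// illustrative, not the RFD 170 API.
interface PickerResponse {
  // When the picker last refreshed its view from the shard 1 moray.
  lastRefreshed: string; // ISO 8601 timestamp
  sharks: Array<{
    manta_storage_id: string;
    datacenter: string;
    availableMB: number;
  }>;
}

// A requester can compute how stale the data is relative to its own clock.
function stalenessMs(resp: PickerResponse): number {
  return Date.now() - Date.parse(resp.lastRefreshed);
}
```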
This brings up a good point. It would probably make sense to allow for versioning via the header or the API path.
This would still be an improvement, but it would provide us less flexibility. I cannot be certain but I doubt we need the same number of pickers as we need muskies.
These questions are more about how muskie would leverage a separate picker component. I was originally thinking this would be a separate RFD. My immediate goal here is to provide a picker that can be used with the rebalancer (RFD 162). However, I do see the additional risk here and perhaps, in the short to medium term, breaking out the picker as a separate SMF instance in the same muskie zone will satisfy the rebalancer requirements, provide some relief to MANTA-4091, and reduce overall risk. A couple of considerations for this approach:
If we do ultimately decide to use the separate picker zone approach we could, in much the same way that GETs are handled, query 2 picker zones at once for each PUT and simply choose the first one that responds. If one is delayed or unavailable, it is unlikely that two of, say, 50 pickers (a 90% reduction in our current deployment) are both pathological. Another idea would be to leave the choosing algorithm as part of the muskie zone, and only query the picker zone for an updated view of the storage nodes. This would essentially reduce the risk to the same one we have today, where muskie uses stale data when the shard 1 moray is overloaded. The rebalancer is going to have to use its own choosing algorithm anyway (or add additional filtering to the current choosing algorithm in the picker). One final alternative is to abandon this work completely and simply have the rebalancer query the shard 1 moray directly.
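As a rough illustration of the "query 2 picker zones at once" idea, here is a sketch in TypeScript; the hostnames and the queryPicker() helper are hypothetical stand-ins, not an existing client:

```typescript
// Sketch only: fan a PUT's storage-node query out to two picker instances
// and use whichever answers first. queryPicker() and the hostnames are
// hypothetical stand-ins for a real picker client.
function queryPicker(host: string, sizeMB: number): Promise<string[]> {
  // Stand-in for an HTTP request to a picker; simulated with a random delay.
  const delayMs = Math.random() * 100;
  return new Promise((resolve) =>
    setTimeout(() => resolve([`node-chosen-by-${host}-for-${sizeMB}MB`]), delayMs)
  );
}

async function chooseStorageNodes(sizeMB: number): Promise<string[]> {
  const pickers = ['picker-a.example.com', 'picker-b.example.com'];
  // Promise.any resolves with the first picker that responds successfully,
  // so one slow or unavailable picker does not stall the PUT.
  return Promise.any(pickers.map((host) => queryPicker(host, sizeMB)));
}
```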
I was thinking of adding versioning to the header, but I'd be interested to hear your thoughts on path versioning vs header versioning.
I thought that muskie didn't expose picker APIs, but I could be mistaken.
I don't have a strong opinion on headers vs path, more that where we have had versioned APIs it has generally been a source of regret. Path seems the more common approach, but where we do use versioning today (by way of restify) I think it is by header.
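For reference, a minimal sketch of the two styles being compared; the route shape and handler are illustrative only (a bare Node HTTP handler rather than the real muskie/restify stack):

```typescript
// Illustrative only: the two versioning styles under discussion, shown with
// a bare Node HTTP handler rather than the real muskie/restify stack.
import { IncomingMessage, ServerResponse } from 'node:http';

function handleVersioned(req: IncomingMessage, res: ServerResponse): void {
  // Path versioning: the version is embedded in the URL, e.g.
  //   GET /v1/storagenodes
  const pathVersion = req.url?.match(/^\/v(\d+)\//)?.[1];

  // Header versioning: the client asks for a version via a request header
  // (restify's versioned routes, for example, use accept-version).
  const headerVersion = req.headers['accept-version'];

  res.end(JSON.stringify({ pathVersion, headerVersion }));
}
```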
Thanks for starting this up Rui. I have a couple of initial thoughts and some notes on the discussion that's happened so far.
Right now the RFD has the following assumption:
It'd be useful to discuss how we come to that. I agree that in theory it should be true, but I guess it depends a lot on the implementation. Is there a reason that this is essential to the RFD? If this becomes false at some point, does that invalidate anything that we have or are planning? Is it just the issue of overload or is there something else here?
If we go down this route, how would you propose a client handle that stale data? At what point should it cut off from using a given picker versus asking another? Recording the timestamp is kind of useful, but then we should also suggest a policy for what that is. I think a lot of this will depend on what the interfaces will be, how many pickers are asked in parallel about information, and what the muskies and rebalancers end up needing to do. One gotcha with going to multiple pickers in parallel is it does mean that the pickers need to be able to handle 2-3x the reqs/sec that all the deployed muskies would if we do choosing in the pickers. That drops dramatically based on some of the other discussion points about how muskie uses picker.
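One way to picture such a staleness policy, purely as a sketch (the threshold and helper names below are assumptions, not anything the RFD specifies):

```typescript
// Sketch of one possible client-side staleness policy; the threshold and
// helper names are assumptions, not something the RFD specifies.
interface NodeView {
  lastRefreshed: string; // ISO 8601 timestamp reported by the picker
  sharks: string[];
}

const MAX_STALENESS_MS = 60_000; // policy knob each consumer would choose

function freshEnough(view: NodeView): boolean {
  return Date.now() - Date.parse(view.lastRefreshed) <= MAX_STALENESS_MS;
}

// Ask pickers in turn and fall back to the next one when the data returned
// is older than this consumer is willing to tolerate.
async function getView(pickers: Array<() => Promise<NodeView>>): Promise<NodeView> {
  for (const query of pickers) {
    const view = await query();
    if (freshEnough(view)) {
      return view;
    }
  }
  throw new Error('no picker returned sufficiently fresh data');
}
```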
If we're going to break it out, I think it's probably better that it just become its own service rather than a separate process. The main question is basically whether this element of scale (the picker) should be tied directly to another one (muskie). In general, if the pieces aren't implemented together for an explicit reason, then we should probably say no. This is part of what makes it complicated to scale binders, as they always embed a zookeeper. Another way to look at this is: should the life cycle of a picker be the same as the muskie's? By default, I don't know of any reason it needs to be or that they'd be directly related. If we wanted to update one of the two, it'd be nice to do so without impacting the other. If we go down the path of the picker being delivered in the muskie zone, we should still probably think of them as discrete components, and a given muskie zone should be able to use all of the picker services and not rely on the in-zone one, even if for some reason it prefers it.
@cburroughs I'd be interested to hear more about what has caused the regret around versioning. Is it around the maintenance burden it requires, or that it just didn't turn out to serve the intended purpose, or something else? Thanks!
Oh, I'm very sorry for causing confusion. I meant to say that unversioned APIs in Triton (that is, not having a version) have been a source of regret, as it fossilizes both client and server.
In general I like the separate service idea. The picker inside muskie already functions independently of the HTTP request path handling. I think it'd be fine to have the picker service just offer a current view of the available storage nodes rather than doing the actual choosing. The view should include a last-modified timestamp, and that would allow consumers of the service to make decisions about what to do if the data becomes too stale; the definition of "too stale" can be made by each consumer. If we go that route then it doesn't really make sense to call it the picker, but the point is that the source of contention is the querying of the shard 1 morays, so that is what the new service primarily needs to be concerned with to address any potential scaling problems. The muskie change could be very small in that the picker module could remain mostly intact, but using a client for this new service versus the moray client used today. One thing to note in the RFD is that the picker could be scaled up just like many other services such as muskie. It doesn't really matter which instance of picker a consumer's request is serviced by, because they are all presenting a view of the same data.
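To sketch what that separation might look like (all interface and client names here are hypothetical, not an existing module):

```typescript
// Rough sketch of the "view only" service shape described above; all names
// are hypothetical. The choosing logic stays with each consumer, which only
// swaps its data source from a moray client to a client for this service.
interface StorageNodeView {
  lastModified: string; // when the service last polled the shard 1 morays
  nodes: Array<{
    manta_storage_id: string;
    datacenter: string;
    availableMB: number;
  }>;
}

interface StorageViewClient {
  // Stands in for the moray client the picker module uses today.
  getView(): Promise<StorageNodeView>;
}

// Each consumer (muskie, the rebalancer, ...) applies its own filtering and
// staleness policy to the shared view.
async function chooseCandidates(client: StorageViewClient, minAvailableMB: number) {
  const view = await client.getView();
  return view.nodes.filter((n) => n.availableMB >= minAvailableMB);
}
```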
Ah ok, great. Thanks for the clarification!
@rmustacc Thanks for your comments. Responses inline.
I am not convinced there is a need for a
For the future, when muskie switches over to the separate picker model, it could either:
1. have the picker do the choosing and return only the chosen storage nodes, or
2. have the picker return its view of the storage nodes and leave the choosing to the consumer.
One advantage to #1 is that there is almost certainly less data going over the wire, but longer processing time on the picker zone. Also, the rebalancer needs (highly desires) #2.
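Purely for illustration, the two options might have request/response shapes along these lines (TypeScript, hypothetical names):

```typescript
// Hypothetical shapes for the two options above; names are illustrative.

// Option 1: the picker does the choosing and returns only the selected
// storage nodes, so less data crosses the wire but more work happens on
// the picker zone per request.
interface ChooseRequest {
  sizeMB: number;
  copies: number;
}
interface ChooseResponse {
  sharks: string[];
}

// Option 2: the picker returns its full view of the storage nodes and the
// consumer (muskie or the rebalancer) does its own choosing/filtering.
interface ViewResponse {
  lastModified: string;
  nodes: Array<{ manta_storage_id: string; availableMB: number }>;
}
```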
That's a good point. I think breaking chunks at the server boundary will be sufficient. Muskie currently operates off of data that is 30s old. I don't think the changes between when the first and last chunks arrive would be any more significant than they currently are.
Do you mean to say that this RFD should cover all consolidation of the shard 1 moray's consumers?
If the assumption doesn't hold then there is significantly less value in splitting off the picker. The overall additional load to ap-southeast (for example) would be about 0.15% if we just had the rebalancer query the shard 1 moray directly. (There are currently 656 muskie processes in ap-southeast, each with its own picker; adding the rebalancer as one more consumer makes it 1 of 657, or roughly 0.15%.)
I'd like this to be covered in a separate RFD. I think the answer here depends on the consumer, as well as the scale and workload of the region. Perhaps providing the timestamp is not valuable, but I could foresee a circumstance where we may regret not providing some primitive quantification of staleness.
That is a good point. We could impose a delay instead of a parallel request: say, if you haven't gotten a response within some interval, query another picker and use the data that you get back first; that muskie could then update its preferred picker based on the result.
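A rough sketch of that hedged-request idea (the delay value and helper names below are assumptions, not a proposal):

```typescript
// Sketch of the "impose a delay instead of a parallel request" idea: ask the
// preferred picker first and only fan out to a second picker if the first has
// not answered within some hedge delay. All names and values are assumptions.
function withTimeout<T>(p: Promise<T>, ms: number): Promise<T> {
  return new Promise((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error('hedge timeout')), ms);
    p.then(
      (v) => { clearTimeout(timer); resolve(v); },
      (e) => { clearTimeout(timer); reject(e); }
    );
  });
}

async function hedgedQuery<T>(
  preferred: () => Promise<T>,
  fallback: () => Promise<T>,
  hedgeDelayMs = 50 // illustrative value only
): Promise<T> {
  const first = preferred();
  try {
    return await withTimeout(first, hedgeDelayMs);
  } catch {
    // The preferred picker is slow (or failed): also ask a second picker and
    // take whichever responds first. The caller could then update which
    // picker it prefers based on who won.
    return Promise.any([first, fallback()]);
  }
}
```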
This is what I am currently working on as we iterate on this discussion.
This is an issue to discuss RFD 170 Manta Picker Component.