Set Pod Anti-Affinity to distribute primaries and secondaries over Nodes #51
Comments
Do you have
@cin we do have Cluster Status:

Status:
  Cluster:
    Label Selector Path: app.kubernetes.io/component=database,app.kubernetes.io/instance=cluster,app.kubernetes.io/name=node-for-redis,redis-operator.k8s.io/cluster-name=cluster-node-for-redis
    Max Replication Factor: 1
    Min Replication Factor: 1
    Nodes:
      Id:        e40ed93492a443b683633c0c278461673cc613fc
      Ip:        192.168.250.197
      Pod Name:  rediscluster-cluster-node-for-redis-whcb7
      Port:      6379
      Role:      Primary
      Slots:
        0-1
        5464-10923
      Zone:      cnaDevPool-NoSRIOV

      Id:        a698d551e968af578b93c102808b9de063067b4f
      Ip:        192.168.76.69
      Pod Name:  rediscluster-cluster-node-for-redis-m87m2
      Port:      6379
      Primary Ref: 89b8f81f6e390caf9d4080b92e96c9bc6d5bc432
      Role:      Replica
      Zone:      cnaDevPool-NoSRIOV

      Id:        cc3cbb3d49e9c6cc1df8e751d31cb4f4401d6f34
      Ip:        192.168.76.70
      Pod Name:  rediscluster-cluster-node-for-redis-hh7ms
      Port:      6379
      Role:      Primary
      Slots:
        2-5461
      Zone:      cnaDevPool-NoSRIOV

      Id:        a2906900b89c057b4d175c9803e8a4847ebd9c4c
      Ip:        192.168.20.197
      Pod Name:  rediscluster-cluster-node-for-redis-5rvsx
      Port:      6379
      Primary Ref: e40ed93492a443b683633c0c278461673cc613fc
      Role:      Replica
      Zone:      cnaDevPool-NoSRIOV

      Id:        ff18d5f2ee4c073edf9763bad42a310a4829d8fb
      Ip:        192.168.76.71
      Pod Name:  rediscluster-cluster-node-for-redis-qfhtx
      Port:      6379
      Primary Ref: cc3cbb3d49e9c6cc1df8e751d31cb4f4401d6f34
      Role:      Replica
      Zone:      cnaDevPool-NoSRIOV

      Id:        89b8f81f6e390caf9d4080b92e96c9bc6d5bc432
      Ip:        192.168.250.196
      Pod Name:  rediscluster-cluster-node-for-redis-t57vw
      Port:      6379
      Role:      Primary
      Slots:
        5462-5463
        10924-16383
      Zone:      cnaDevPool-NoSRIOV
Here's the logic in the chart. We're using the
FWIW, you can define a host anti-affinity (or w/e affinity you'd like) by overriding the
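For illustration, a host-level pod anti-affinity in standard Kubernetes form might look like the sketch below; the exact chart value name isn't shown here, so the top-level `affinity` key is an assumption about how the chart passes it through:

```yaml
# Sketch only: assumes the chart forwards an affinity block to the Redis pods.
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app.kubernetes.io/name: node-for-redis
          topologyKey: kubernetes.io/hostname
```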
Looks like all our nodes have the same label value for the zone label.
Ah, awesome. This is super helpful; I can try it out. Thanks for your help!
@cin how would this make sure that pods for the same shard are not scheduled on the same node? Wouldn't this apply generally to all pods?
Good point. An affinity won't help here. I will take a look at the scheduling code -- there's probably something we can do there. This has never come up for us as we generally don't share database nodes' resources. Just curious: how big is your cluster? Why aren't you using multiple zones for your workers?
Actually, can you link your rediscluster plugin output?
Yeah, plugin output:
It's interesting that the k8s scheduler put 3 pods on the 104 worker. The operator only kicks in after the pods are scheduled, to decide what type of Redis node each pod is going to be; in this case it had to go into best-effort mode. One thing you could possibly do to help would be to give your Redis pods more memory/CPU so that no more than 2 Redis pods can be scheduled on the same worker. I think that still won't eliminate cases where two primaries or two replicas get scheduled on the same worker, though. We should confirm this after bumping up the resources per Redis pod. Do you have info-level logging enabled by chance? We log quite a bit of info about what choices were made when determining the Redis node type (primary/replica).

TBH, I'm not sure you want your cluster set up this way (not saying we shouldn't look into making the operator behave better in this case -- especially for dev/qa/testing clusters). Is there a reason you're all in one zone? Are you sure you'll safely be able to share resources with replicas and primaries running on the same worker? One or more nodes in this zone could go out for whatever reason, and a zone outage would obviously mean certain downtime.
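A rough sketch of the kind of resource sizing being suggested, expressed as a plain pod-spec resources block; the chart value that carries this isn't shown here, and the numbers are placeholders:

```yaml
# Illustrative placeholders: pick requests large enough that at most two
# Redis pods fit on one worker; actual values depend on your node sizes.
resources:
  requests:
    cpu: "2"
    memory: 4Gi
  limits:
    cpu: "2"
    memory: 4Gi
```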
This is a dev environment. The prod env might be different from this.
I do not; I can enable it and try again. Apart from this, do you think it would make sense to add labels to pods identifying which shard they belong to?
I just tried it several times w/out it and at least the primaries and replicas were all on different nodes. You definitely wouldn't want two primaries (or two replicas) on a worker node. I'll look at the code and see if there's anything we can do to ensure primaries and their replicas don't end up on the same worker.
Do you mean adding a label for which slots the primary holds? My only argument against doing that is that labels have a length limit, and as nodes get added/removed their slots get shifted around (this list can actually grow pretty large). We added a lot of this information to the redis cluster plugin for this purpose. Until then it was hard to even figure out which nodes were primaries.
Thanks!
I mean all the pods belonging to the same shard (primaries and replicas) should have a label like
Since we are only keeping track of the current shard members, I don't think this would grow big.
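As a purely hypothetical sketch of that idea (the operator does not set such a label today, and the `shard-id` key below is invented for illustration), every pod in a shard could carry something like:

```yaml
# Hypothetical: redis-operator.k8s.io/shard-id is not an existing operator label.
metadata:
  labels:
    app.kubernetes.io/name: node-for-redis
    redis-operator.k8s.io/cluster-name: cluster-node-for-redis
    redis-operator.k8s.io/shard-id: "0"   # same value on a primary and its replicas
```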
Ah, it's all coming back to me now -- it's been a while since I looked at this part of the code. The way the Redis node type selection is done is a bit odd. Pods are not part of a Deployment or StatefulSet or anything, which gives a lot of flexibility in what you can do with downed workers, outages, etc. Only after all pods have been created are primaries and replicas sorted out. It's all done in one pass, so the algorithm suffers from not knowing where other primaries/replicas have been (or could be) scheduled; by the time it gets to the end of the list, there's only one pod left that can become a replica for the primary in question, and it sometimes happens to be on the same worker. I think I can make it better by not allowing it to pick replicas on the same worker in the "optimal" selection phase. I will test some things out today and see if it helps.

In regard to the label, where do you get the shard ID? I am not aware of such a property in Redis. There's the ID string, but that wouldn't help you identify the replicas easily. The other issues I have with this are that we'd be adding the label after the pod has been created (because we don't know what slots a pod will hold until after it's been scheduled), and that it feels like treating the pods more like pets than cattle.
Hey, thanks again for looking into it. I was checking out the code and found the simple logic that is used for pod scheduling. For our dev environment, I replaced the topologyKey from
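Assuming the swap was from a zone-scoped key to the node hostname key (an assumption; the original keys aren't shown here), the change would look roughly like:

```yaml
# Assumed before/after -- both are standard, well-known Kubernetes labels.
# before (spreads pods across zones, which all share one value in this cluster):
#   topologyKey: topology.kubernetes.io/zone
# after (spreads pods across individual workers):
topologyKey: kubernetes.io/hostname
```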
Yeah, there isn't such an ID. I just meant the shards could be given a unique ID in the operator, which we could use to keep track of a shard's primaries and secondaries.
In that case, the labels won't really help with scheduling.
I'm glad you have a workaround for the scheduling behavior. :) I tend to treat zones as racks in our operators because you don't really get that level of control/detail on managed cloud services. In your prod configuration, if you use the

In regard to the PR, things seem to be working much better. Deterministically placing (striping) the pods has helped immensely. You can still end up with some weird situations when one node gets more pods scheduled than others, but that is to be expected (and can be fixed by altering the pod's resources to better fit the node). We should add similar logic on scale-down, as it isn't picking the "ideal" pod(s) to remove. For example, if the RF is greater than or equal to the number of nodes, the current algorithm will often leave you with replicas on the primary's node.
Unfortunately, the workaround is just for the dev environment. In our prod env we would still need
That's awesome. I'll try to read through the scheduling logic. Thanks |
I'm confused about how your cluster is set up. So you want to run this cross-region (across multiple Kubernetes clusters)? There's no concept of racks in Kubernetes, so I'm not sure exactly what you're proposing. https://kubernetes.io/docs/setup/best-practices/multiple-zones/ is the only thing I'm aware of in this regard.
It's a single k8s cluster.
Maybe I can try creating a PR for it. Apologies for the late response; I was out sick.
It looks like your PR is actually trying to do this.
Hey folks, we were testing out the operator and saw that, when deploying, primary and secondary pods were not being distributed over nodes. Is there a way to set pod anti-affinity for Redis node pods so that primary and secondary pods are not scheduled on the same node?