Set Pod Anti-Affinity to distribute primaries and secondaries over Nodes #51

Open
4n4nd opened this issue Aug 9, 2022 · 20 comments

Comments

@4n4nd
Contributor

4n4nd commented Aug 9, 2022

Hey folks, we were testing out the operator and saw that, when deployed, primary and secondary pods were not being distributed across nodes. Is there a way to set pod anti-affinity for Redis node pods so that primary and secondary pods are not scheduled on the same node?

@4n4nd 4n4nd changed the title Set Pod Anti-Affinity to distribute primaries over Nodes Set Pod Anti-Affinity to distribute primaries and secondaries over Nodes Aug 9, 2022
@cin
Contributor

cin commented Aug 9, 2022

Do you have zoneAwareReplication enabled? We try not to put primaries and replicas in the same zone if possible.

@4n4nd
Contributor Author

4n4nd commented Aug 9, 2022

@cin we do have zoneAwareReplication enabled. How does the operator identify zones?

Cluster Status:

Status:
  Cluster:
    Label Selector Path:     app.kubernetes.io/component=database,app.kubernetes.io/instance=cluster,app.kubernetes.io/name=node-for-redis,redis-operator.k8s.io/cluster-name=cluster-node-for-redis
    Max Replication Factor:  1
    Min Replication Factor:  1
    Nodes:
      Id:        e40ed93492a443b683633c0c278461673cc613fc
      Ip:        192.168.250.197
      Pod Name:  rediscluster-cluster-node-for-redis-whcb7
      Port:      6379
      Role:      Primary
      Slots:
        0-1
        5464-10923
      Zone:         cnaDevPool-NoSRIOV
      Id:           a698d551e968af578b93c102808b9de063067b4f
      Ip:           192.168.76.69
      Pod Name:     rediscluster-cluster-node-for-redis-m87m2
      Port:         6379
      Primary Ref:  89b8f81f6e390caf9d4080b92e96c9bc6d5bc432
      Role:         Replica
      Zone:         cnaDevPool-NoSRIOV
      Id:           cc3cbb3d49e9c6cc1df8e751d31cb4f4401d6f34
      Ip:           192.168.76.70
      Pod Name:     rediscluster-cluster-node-for-redis-hh7ms
      Port:         6379
      Role:         Primary
      Slots:
        2-5461
      Zone:         cnaDevPool-NoSRIOV
      Id:           a2906900b89c057b4d175c9803e8a4847ebd9c4c
      Ip:           192.168.20.197
      Pod Name:     rediscluster-cluster-node-for-redis-5rvsx
      Port:         6379
      Primary Ref:  e40ed93492a443b683633c0c278461673cc613fc
      Role:         Replica
      Zone:         cnaDevPool-NoSRIOV
      Id:           ff18d5f2ee4c073edf9763bad42a310a4829d8fb
      Ip:           192.168.76.71
      Pod Name:     rediscluster-cluster-node-for-redis-qfhtx
      Port:         6379
      Primary Ref:  cc3cbb3d49e9c6cc1df8e751d31cb4f4401d6f34
      Role:         Replica
      Zone:         cnaDevPool-NoSRIOV
      Id:           89b8f81f6e390caf9d4080b92e96c9bc6d5bc432
      Ip:           192.168.250.196
      Pod Name:     rediscluster-cluster-node-for-redis-t57vw
      Port:         6379
      Role:         Primary
      Slots:
        5462-5463
        10924-16383
      Zone:         cnaDevPool-NoSRIOV

@cin
Contributor

cin commented Aug 9, 2022

Here's the logic in the chart. We're using the topology.kubernetes.io/zone label as the topologyKey. Is that label maybe not set? Or do you only have one zone, in which case we may have a bug?

@cin
Contributor

cin commented Aug 9, 2022

FWIW, you can define a host anti-affinity (or w/e affinity you'd like) by overriding the affinity value in the chart.
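
For illustration, a host-level anti-affinity might look roughly like the sketch below. It builds a standard Kubernetes `podAntiAffinity` term keyed on `kubernetes.io/hostname` and prints it as JSON (which is also valid YAML). The labels are copied from the Label Selector Path in the cluster status above. That the chart's `affinity` value accepts a plain Kubernetes Affinity object is an assumption, and the helper name `host_anti_affinity` is hypothetical. Note that a rule like this treats every matching pod alike.

```python
import json

# Labels copied from the "Label Selector Path" in the cluster status above.
REDIS_POD_LABELS = {
    "app.kubernetes.io/name": "node-for-redis",
    "app.kubernetes.io/instance": "cluster",
    "app.kubernetes.io/component": "database",
}


def host_anti_affinity(match_labels, weight=100):
    """Build a soft (preferred) pod anti-affinity keyed on the node hostname.

    Hypothetical helper for illustration; it is not part of the chart.
    """
    return {
        "podAntiAffinity": {
            "preferredDuringSchedulingIgnoredDuringExecution": [
                {
                    "weight": weight,
                    "podAffinityTerm": {
                        "labelSelector": {"matchLabels": dict(match_labels)},
                        "topologyKey": "kubernetes.io/hostname",
                    },
                }
            ]
        }
    }


affinity = host_anti_affinity(REDIS_POD_LABELS)
# Assumption: the chart's `affinity` value takes a standard Affinity object.
print(json.dumps({"affinity": affinity}, indent=2))
```

The soft (`preferred...`) form is used here so that pods can still be scheduled when there are fewer workers than Redis pods; a `required...` rule would leave such pods pending.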

@4n4nd
Contributor Author

4n4nd commented Aug 9, 2022

> We're using the topology.kubernetes.io/zone label as the topologyKey.

Looks like all our nodes have the same label value for topology.kubernetes.io/zone.

> FWIW, you can define a host anti-affinity (or w/e affinity you'd like) by overriding the affinity value in the chart.

Ah, awesome. This is super helpful; I can try it out. Thanks for your help!

@4n4nd
Contributor Author

4n4nd commented Aug 9, 2022

@cin how would this make sure that pods for the same shard are not scheduled on the same node? Wouldn't this apply generally to all pods?

@cin
Contributor

cin commented Aug 9, 2022

Good point. An affinity won't help here. I will take a look at the scheduling code -- there's probably something we can do there. This has never come up for us as we generally don't share database nodes' resources. Just curious...how big is your cluster? Why aren't you using multiple zones for your workers?

@cin
Contributor

cin commented Aug 9, 2022

Actually, can you link your rediscluster plugin output?

@4n4nd
Contributor Author

4n4nd commented Aug 9, 2022

> Actually, can you link your rediscluster plugin output?

yeah

plugin output:

  POD NAME                                     IP               NODE           ID                                        ZONE                USED MEMORY  MAX MEMORY  KEYS  SLOTS                  
  + rediscluster-cluster-node-for-redis-hh7ms  192.168.76.70    10.164.33.104  cc3cbb3d49e9c6cc1df8e751d31cb4f4401d6f34  cnaDevPool-NoSRIOV  20.56M       9.44G             2-5461                 
  | rediscluster-cluster-node-for-redis-qfhtx  192.168.76.71    10.164.33.104  ff18d5f2ee4c073edf9763bad42a310a4829d8fb  cnaDevPool-NoSRIOV  2.58M        9.44G                                    
  + rediscluster-cluster-node-for-redis-t57vw  192.168.250.196  10.164.33.103  89b8f81f6e390caf9d4080b92e96c9bc6d5bc432  cnaDevPool-NoSRIOV  2.67M        9.44G             5462-5463 10924-16383  
  | rediscluster-cluster-node-for-redis-m87m2  192.168.76.69    10.164.33.104  a698d551e968af578b93c102808b9de063067b4f  cnaDevPool-NoSRIOV  14.66M       9.44G                                    
  + rediscluster-cluster-node-for-redis-whcb7  192.168.250.197  10.164.33.103  e40ed93492a443b683633c0c278461673cc613fc  cnaDevPool-NoSRIOV  8.61M        9.44G             0-1 5464-10923         
  | rediscluster-cluster-node-for-redis-5rvsx  192.168.20.197   10.164.33.102  a2906900b89c057b4d175c9803e8a4847ebd9c4c  cnaDevPool-NoSRIOV  2.62M        9.44G                                    
                                                                
  NAME                    NAMESPACE  PODS   OPS STATUS  REDIS STATUS  NB PRIMARY  REPLICATION  ZONE SKEW     
  cluster-node-for-redis  default    6/6/6  ClusterOK   OK            3/3         1-1/1        0/0/BALANCED
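
As a rough way to read this output, the sketch below pairs each replica with its primary and reports the shards where both sit on the same worker. The rows are transcribed from the plugin output, and each replica's primary comes from the Primary Ref fields in the cluster status posted earlier. Pod names and IDs are shortened for readability, and the `Pod` tuple and `colocated_shards` helper are hypothetical names, not part of the plugin.

```python
from collections import namedtuple

Pod = namedtuple("Pod", "name node role redis_id primary_ref")

# Transcribed from the plugin output; IDs are shortened to 8 characters.
PODS = [
    Pod("hh7ms", "10.164.33.104", "primary", "cc3cbb3d", None),
    Pod("qfhtx", "10.164.33.104", "replica", "ff18d5f2", "cc3cbb3d"),
    Pod("t57vw", "10.164.33.103", "primary", "89b8f81f", None),
    Pod("m87m2", "10.164.33.104", "replica", "a698d551", "89b8f81f"),
    Pod("whcb7", "10.164.33.103", "primary", "e40ed934", None),
    Pod("5rvsx", "10.164.33.102", "replica", "a2906900", "e40ed934"),
]


def colocated_shards(pods):
    """Return (primary, replica) name pairs that share a worker node."""
    primaries = {p.redis_id: p for p in pods if p.role == "primary"}
    pairs = []
    for pod in pods:
        if pod.role != "replica":
            continue
        primary = primaries.get(pod.primary_ref)
        if primary is not None and primary.node == pod.node:
            pairs.append((primary.name, pod.name))
    return pairs


print(colocated_shards(PODS))
```

For this data the only colocated shard is hh7ms/qfhtx, both on 10.164.33.104.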

@cin
Contributor

cin commented Aug 9, 2022

It's interesting that the k8s scheduler put 3 pods on the 104 worker. After the pods are scheduled, that's when the operator kicks in to decide what type of Redis pod it's going to be. In this case, it had to go into best effort mode. One thing you could possibly do to help this would be to give your Redis pods more memory/CPU so no more than 2 Redis pods can be scheduled on the same worker. I think that's still not going to eliminate issues where two primaries or replicas get scheduled on the same worker. We should confirm this after bumping up the resources per Redis pod.

Do you have info level logging enabled by chance? We log quite a bit of info about what choices were made when determining the Redis node type (primary/replica).

TBH, I'm not sure you want your cluster set up this way (not saying we shouldn't look into making the operator behave better in this case -- especially for dev/qa/testing clusters). Is there a reason you're all in one zone? Are you sure you'll safely be able to share resources w/replicas and primaries running on the same worker? One or more nodes in this zone could go out for w/e reason. Obviously a zone outage would be certain downtime.

@4n4nd
Contributor Author

4n4nd commented Aug 9, 2022

> Is there a reason you're all in one zone?

This is a dev environment. The prod env might be different from this.

> Do you have info level logging enabled by chance?

I do not. I can enable it and try again.

Apart from this, do you think it would make sense to add labels to pods identifying which shard they belong to?

@cin
Contributor

cin commented Aug 9, 2022

I just tried it several times w/out it and at least the primaries and replicas were all on different nodes. You definitely wouldn't want two primaries (or two replicas) on a worker node. I'll look at the code and see if there's anything we can do to ensure primaries and their replicas don't end up on the same worker.

> Apart from this, do you think it would make sense to add labels to pods identifying which shard they belong to?

Do you mean adding a label for which slots the primary holds? My only argument against doing that is that labels have a length limit, and as nodes get added/removed their slots get shifted around (this list can actually grow pretty large). We added a lot of this information to the redis cluster plugin for this purpose. Until then it was hard to even figure out which nodes were primaries.
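
To make the length concern concrete: Kubernetes label values are limited to 63 characters and may only contain alphanumerics, `-`, `_` and `.` (starting and ending with an alphanumeric). The sketch below checks a slot list against those rules. The `slots_label` encoding is purely hypothetical, since the operator does not define such a label.

```python
import re

# Kubernetes label value rules: at most 63 characters, alphanumeric at both
# ends, with '-', '_' and '.' allowed in between. An empty value is valid too.
LABEL_VALUE_RE = re.compile(r"(([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9])?")
MAX_LABEL_VALUE_LEN = 63


def is_valid_label_value(value):
    """Check a string against the Kubernetes label value constraints."""
    if len(value) > MAX_LABEL_VALUE_LEN:
        return False
    return LABEL_VALUE_RE.fullmatch(value) is not None


def slots_label(ranges):
    """Hypothetical encoding: join slot ranges with '_' (commas are not allowed)."""
    return "_".join(f"{lo}-{hi}" for lo, hi in ranges)


# Slot ranges of one primary from the cluster status above.
compact = slots_label([(0, 1), (5464, 10923)])
# A made-up, heavily fragmented slot list, as might appear after many reshards.
fragmented = slots_label([(i * 100, i * 100 + 49) for i in range(20)])

print(compact, is_valid_label_value(compact))
print(len(fragmented), is_valid_label_value(fragmented))
```

A short slot list fits, but a fragmented one exceeds the limit quickly, and the natural comma-separated form is rejected outright.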

@4n4nd
Contributor Author

4n4nd commented Aug 10, 2022

> I'll look at the code and see if there's anything we can do to ensure primaries and their replicas don't end up on the same worker.

Thanks!

> Do you mean adding a label for which slots the primary holds?

I mean all the pods belonging to the same shard (primaries and replicas) should have a label like my-redis-shard=0.

> My only argument against doing that is that labels have a length limit, and as nodes get added/removed their slots get shifted around (this list can actually grow pretty large).

Since we are only keeping track of the current shard members, I don't think this would grow big.

@cin
Contributor

cin commented Aug 10, 2022

Ah, it's all coming back to me now -- it's been a while since I looked at this part of the code. The way the redis node type selection is done is a bit odd. Pods are not part of a deployment or statefulset or anything. This gives a lot of flexibility in what you can do w/downed workers, outages, etc. Only after all pods have been created are primaries and replicas sorted out. It's all done in one pass so the algorithm suffers from the problem of not knowing where other primaries/replicas have been scheduled (or could be scheduled); so by the time it gets to the end of the list, there's only one pod that can become a replica for the primary in question (and it happens to be on the same worker sometimes). I think I can make it better by not allowing it to pick replicas on the same worker in the "optimal" selection phase. I will test some things out today and see if it helps.
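
The effect of ignoring the worker during a single pass can be sketched with a toy model. This is not the operator's selection code, only an illustration under simplifying assumptions: primaries are already chosen, each needs one replica, and candidates are taken in list order. The pod-to-worker layout is the one from the plugin output above (worker IPs shortened to their last octet).

```python
from collections import namedtuple

Pod = namedtuple("Pod", "name node")

# Layout from the plugin output above; workers shortened to their last octet.
PRIMARIES = [Pod("hh7ms", "104"), Pod("t57vw", "103"), Pod("whcb7", "103")]
CANDIDATES = [Pod("qfhtx", "104"), Pod("m87m2", "104"), Pod("5rvsx", "102")]


def assign_replicas(primaries, candidates, avoid_same_node):
    """One-pass greedy assignment of one replica per primary.

    Illustrative only; not the operator's actual selection code.
    """
    free = list(candidates)
    assignment = {}
    for primary in primaries:
        if not free:
            break
        preferred = [c for c in free if c.node != primary.node]
        if avoid_same_node and preferred:
            choice = preferred[0]
        else:
            # Either node-unaware, or best effort because only pods on the
            # primary's own worker are left.
            choice = free[0]
        free.remove(choice)
        assignment[primary.name] = choice.name
    return assignment


def same_node_pairs(primaries, candidates, assignment):
    """List (primary, replica) pairs that ended up on the same worker."""
    nodes = {p.name: p.node for p in list(primaries) + list(candidates)}
    return [(p, r) for p, r in assignment.items() if nodes[p] == nodes[r]]


naive = assign_replicas(PRIMARIES, CANDIDATES, avoid_same_node=False)
aware = assign_replicas(PRIMARIES, CANDIDATES, avoid_same_node=True)
print(same_node_pairs(PRIMARIES, CANDIDATES, naive))
print(same_node_pairs(PRIMARIES, CANDIDATES, aware))
```

In this layout the node-unaware pass pairs hh7ms with qfhtx on the same worker, while the node-aware pass finds a pairing with no shared workers. Because the pass is still greedy, other layouts can leave the last primary with only a same-worker candidate, which is the best-effort case.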

In regard to the label, where do you get the shard ID? I am not aware of such a property in Redis. There's the ID string, but that wouldn't help you identify the replicas easily. The other issues I have with this are that we'd be adding the label after the pod has been created (bc we don't know what slots a pod will hold until after it's been scheduled), and that it also feels like treating the pods more like pets than cattle.

@4n4nd
Contributor Author

4n4nd commented Aug 15, 2022

Hey, thanks again for looking into it. I was checking out the code and found the simple logic that is used for pod scheduling. For our dev environment, I changed the topologyKey from topology.kubernetes.io/zone to a node-unique label, but for our prod env we want the pods to be spread over multiple zones as well as racks. To do that, we were thinking about adding a new chart parameter, rackAwareReplication, which would essentially add another topology key in the RedisCluster CR, along with some changes to the scheduling logic mentioned above. Wdyt about that solution?

> In regard to the label, where do you get the shard ID? I am not aware of such a property in Redis.

Yeah, there isn't such an ID. I just meant the shards could be given a unique ID in the operator, which we could use to keep track of a shard's primaries and secondaries.

> The other issues I have with this are that we'd be adding the label after the pod has been created (bc we don't know what slots a pod will hold until after it's been scheduled), and that it also feels like treating the pods more like pets than cattle.

In that case, the labels won't really help with scheduling.

@cin
Contributor

cin commented Aug 15, 2022

I'm glad you have a workaround for the scheduling behavior. :) I tend to treat zones as racks in our operators because you don't really get that level of control/detail on managed cloud services. In your prod configuration, if you use the zoneAwareReplication feature along with the default zone label, pods should distribute properly across zones and replicas should end up in different zones than primaries (and other replicas). This should work even w/out the PR that I started.

In regard to the PR, things seem to be working much better. Deterministically placing (striping) the pods has helped immensely. You can still end up w/some weird situations when one node gets more pods scheduled than others, but that is to be expected (and can be fixed by altering the pod's resources to better fit the node). We should add some similar logic on scale down, as it isn't picking the "ideal" pod(s) to remove. For example, if the RF is greater than or equal to the number of nodes, then the current algorithm will oftentimes leave you with replicas on the primary node.

@4n4nd
Contributor Author

4n4nd commented Aug 17, 2022

> I'm glad you have a workaround for the scheduling behavior.

Unfortunately, the workaround is just for the dev environment. In our prod env we would still need zoneAwareReplication as well as rackAwareReplication, since we want our pods to be distributed over zones (regions) as well as racks.

> In regard to the PR, things seem to be working much better.

That's awesome. I'll try to read through the scheduling logic. Thanks!

@cin
Contributor

cin commented Aug 17, 2022

I'm confused about how your cluster is set up. So you want to run this cross-region (across multiple Kubernetes clusters)? There's no concept of racks in Kubernetes, so I'm not sure exactly what you're proposing. https://kubernetes.io/docs/setup/best-practices/multiple-zones/ is the only thing I'm aware of in this regard.

@4n4nd
Contributor Author

4n4nd commented Aug 23, 2022

> So you want to run this cross-region (across multiple Kubernetes clusters)?

It's a single k8s cluster.
We have multiple zones, and each zone has multiple k8s nodes. Ideally the operator would schedule masters and replicas in different zones, but when it cannot do that (i.e., when it goes into best-effort mode), the operator should also try to keep the master and replica on different k8s nodes within the zone.
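
The preference order described here (another zone first, then another node in the same zone, then anything) could be expressed as a simple ranking. The sketch below is one possible reading of that request, not the operator's implementation. The `Pod` tuple, the function names, and the zone and node names are all hypothetical.

```python
from collections import namedtuple

Pod = namedtuple("Pod", "name zone node")


def placement_rank(primary, candidate):
    """Lower is better: 0 = other zone, 1 = same zone but other node, 2 = same node."""
    if candidate.zone != primary.zone:
        return 0
    if candidate.node != primary.node:
        return 1
    return 2


def pick_replica(primary, candidates):
    """Return the best-ranked candidate; ties go to the earliest in the list.

    Raises ValueError if there are no candidates.
    """
    return min(candidates, key=lambda c: placement_rank(primary, c))


primary = Pod("p0", "zone-a", "node-1")

# Single zone: fall back to a different node within the zone.
single_zone = [Pod("r0", "zone-a", "node-1"), Pod("r1", "zone-a", "node-2")]
print(pick_replica(primary, single_zone).name)

# A pod in another zone wins over any pod in the primary's zone.
multi_zone = single_zone + [Pod("r2", "zone-b", "node-3")]
print(pick_replica(primary, multi_zone).name)
```

Ranking by a tuple of topology levels would extend the same idea to additional keys, such as a rack label, without changing the selection function.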

> so I'm not sure exactly what you're proposing.

Maybe I can try creating a PR for it.

Apologies for the late response; I was out sick.

@4n4nd
Contributor Author

4n4nd commented Aug 23, 2022

> the operator should also try to keep the master and replica on different k8s nodes within the zone.

It looks like your PR is actually trying to do this.
