Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[DSIP-9][registry] add raft registration plugin #16352

Open
wants to merge 8 commits into
base: dev
Choose a base branch
from

Conversation

MatthewAden
Copy link

@MatthewAden MatthewAden commented Jul 20, 2024

Purpose of the pull request

This is a project for the OSPP 2024. This PR is close #10874

Brief change log

  • Added a new module: /dolphinscheduler-registry/dolphinscheduler-plugins/dolphinscheduler-registry-raft
  • Added raft's dependency in dolphinscheduler-bom/pom.xml
  • Added dolphinscheduler-registry-raft dependency in dolphinscheduler-registry/dolphinscheduler-registry-all/pom.xml

Verify this pull request

Unit Test Results:

image

Pseudo-Cluster Deployment Status:

3081722177424_ pic
3091722177435_ pic

Cluster Test

Test conditions:
2 master nodes, 1 worker node
OS: mac
Steps:
Start node1, node2, and node3 in sequence

  • Be able to normally elect a Leader
  • When the Leader of the cluster crashes, be able to normally elect a new Leader
  • The original crashed Leader re-joins the cluster and is automatically downgraded to a Follower

Copy link

boring-cyborg bot commented Jul 20, 2024

Thanks for opening this pull request! Please check out our contributing guidelines. (https://github.com/apache/dolphinscheduler/blob/dev/docs/docs/en/contribute/join/pull-request.md)

@SbloodyS SbloodyS added this to the 3.3.0 milestone Jul 20, 2024
@SbloodyS SbloodyS added the first time contributor First-time contributor label Jul 20, 2024
@SbloodyS SbloodyS requested a review from caishunfeng July 20, 2024 13:03
@SbloodyS SbloodyS added the feature new feature label Jul 20, 2024
@github-actions github-actions bot removed the document label Jul 20, 2024
@MatthewAden MatthewAden changed the title [Feature-10874][registry]add raft registration plugin [Feature-10874][registry] add raft registration plugin Jul 21, 2024
@MatthewAden MatthewAden force-pushed the feat_register_raft branch 3 times, most recently from 6f299ab to 71de6b2 Compare July 30, 2024 00:13
@SbloodyS SbloodyS requested a review from Radeity August 19, 2024 06:25
Comment on lines +575 to +591
jraft-core 1.3.14: https://mvnrepository.com/artifact/com.alipay.sofa/jraft-core/1.3.14 Apache 2.0
jctools-core 2.1.1: https://mvnrepository.com/artifact/org.jctools/jctools-core/2.1.1 Apache 2.0
commons-lang 2.6 https://mvnrepository.com/artifact/commons-lang/commons-lang/2.6 Apache 2.0
hessian 3.3.6: https://mvnrepository.com/artifact/com.alipay.sofa/hessian/3.3.6 Apache 2.0
metrics-core 4.2.11: https://mvnrepository.com/artifact/io.dropwizard.metrics/metrics-core/4.2.11 Apache 2.0
affinity 3.1.7: https://mvnrepository.com/artifact/net.openhft/affinity/3.1.7, Apache 2.0
disruptor 3.3.7: https://mvnrepository.com/artifact/com.lmax/disruptor/3.3.7, Apache 2.0
commons-compress 1.21: https://mvnrepository.com/artifact/org.apache.commons/commons-compress/1.21 Apache 2.0
bolt 1.6.4: https://mvnrepository.com/artifact/com.alipay.sofa/bolt/1.6.4, Apache 2.0
rocksdbjni 8.8.1: https://mvnrepository.com/artifact/org.rocksdb/rocksdbjni/8.8.1, Apache 2.0
jraft-rheakv-core 1.3.14: https://mvnrepository.com/artifact/com.alipay.sofa/jraft-rheakv-core/1.3.14, Apache 2.0
sofa-common-tools 1.0.12: https://mvnrepository.com/artifact/com.alipay.sofa.common/sofa-common-tools/1.0.12 Apache 2.0
annotations 12.0: https://mvnrepository.com/artifact/com.intellij/annotations/12.0 Apache 2.0
protostuff-core 1.7.2: https://mvnrepository.com/artifact/io.protostuff/protostuff-core/1.7.2 Apache 2.0
protostuff-api 1.7.2: https://mvnrepository.com/artifact/io.protostuff/protostuff-api/1.7.2 Apache 2.0
protostuff-runtime 1.7.2: https://mvnrepository.com/artifact/io.protostuff/protostuff-runtime/1.7.2 Apache 2.0
protostuff-collectionschema 1.7.2: https://mvnrepository.com/artifact/io.protostuff/protostuff-collectionschema/1.7.2 Apache 2.0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why we need to import so many dependencies from alipay?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We introduced this direct dependency.

    <dependency>
            <groupId>com.alipay.sofa</groupId>
            <artifactId>jraft-rheakv-core</artifactId>
            <version>${jraft.version}</version>
        </dependency>
    </dependencies>

These are all transitive dependencies. I'm not sure if these need to be written in the license.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should remove all dependencies from alipay.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should remove all dependencies from alipay.

+1

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The raft plugin depends on alipay jraft's rheakv, and other raft implementations do not have this kv store, if can't rely on alipay's jar, then this pr can't be continued, do you want to terminate this pr?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should remove all dependencies from alipay.

Why not accept Alipay's dependency?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The raft plugin depends on alipay jraft's rheakv, and other raft implementations do not have this kv store, if can't rely on alipay's jar, then this pr can't be continued, do you want to terminate this pr?

Open source projects are submitted to PR voluntarily and free of charge. Please notice that not all PR will be accepted and must be merged.

Why not accept Alipay's dependency?

Registry is the core component of DS. So that it must be robust and maintainable. The alipay's jraft-rheakv-core dependency obviously do not have these conditions and it has a lot of CVE. After using this dependency, every time you fix a problem and CVE in the future, you need to rely on its upgrade, and there may be incompatible updates during the upgrade process. This is unacceptable for the core components.

Copy link
Member

@ruanwenjun ruanwenjun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there are no master can this work? We should select the leader from the whole cluster rather than select the leader from master, I didn't see any design related to this PR, you should provide the newly design doc related to this PR.

@MatthewAden
Copy link
Author

If there are no master can this work? We should select the leader from the whole cluster rather than select the leader from master, I didn't see any design related to this PR, you should provide the newly design doc related to this PR.

Hi, this is an OSPP project. My mentor is @zhuxt2015. The design plan for the raft plugin has been discussed. The design link is Add new registry plugin based on raft. According to the discussed plan, masters will form a raft cluster, and the leader will only be elected among the masters, just like in ZK. For the three master nodes, at least two must be alive to provide services normally.

@SbloodyS SbloodyS added the DSIP label Aug 30, 2024
@SbloodyS SbloodyS changed the title [Feature-10874][registry] add raft registration plugin [DSIP-9][registry] add raft registration plugin Aug 30, 2024
@MatthewAden MatthewAden force-pushed the feat_register_raft branch 2 times, most recently from 9f50e1c to d98caa1 Compare September 11, 2024 07:35
Copy link

Please retry analysis of this Pull-Request directly on SonarCloud

@SbloodyS
Copy link
Member

Please add mysql and postgresql with raft registry cluster test in CI. You can refer to https://github.com/apache/dolphinscheduler/tree/dev/.github/workflows/cluster-test

@ruanwenjun
Copy link
Member

If there are no master can this work? We should select the leader from the whole cluster rather than select the leader from master, I didn't see any design related to this PR, you should provide the newly design doc related to this PR.

Hi, this is an OSPP project. My mentor is @zhuxt2015. The design plan for the raft plugin has been discussed. The design link is Add new registry plugin based on raft. According to the discussed plan, masters will form a raft cluster, and the leader will only be elected among the masters, just like in ZK. For the three master nodes, at least two must be alive to provide services normally.

If the cluster does have master, then the cluster will not work? This is unacceptable.

@zhuxt2015
Copy link
Contributor

If there are no master can this work? We should select the leader from the whole cluster rather than select the leader from master, I didn't see any design related to this PR, you should provide the newly design doc related to this PR.

Hi, this is an OSPP project. My mentor is @zhuxt2015. The design plan for the raft plugin has been discussed. The design link is Add new registry plugin based on raft. According to the discussed plan, masters will form a raft cluster, and the leader will only be elected among the masters, just like in ZK. For the three master nodes, at least two must be alive to provide services normally.

If the cluster does have master, then the cluster will not work? This is unacceptable.

Just like zk plugin, if the zk cluster goes down, the ds cluster will also go down.

@ruanwenjun
Copy link
Member

If there are no master can this work? We should select the leader from the whole cluster rather than select the leader from master, I didn't see any design related to this PR, you should provide the newly design doc related to this PR.

Hi, this is an OSPP project. My mentor is @zhuxt2015. The design plan for the raft plugin has been discussed. The design link is Add new registry plugin based on raft. According to the discussed plan, masters will form a raft cluster, and the leader will only be elected among the masters, just like in ZK. For the three master nodes, at least two must be alive to provide services normally.

If the cluster does have master, then the cluster will not work? This is unacceptable.

Just like zk plugin, if the zk cluster goes down, the ds cluster will also go down.

If so, big -1 for this plugin, this doesn't help with SLA, I can not find any reason why we wouldn't use jdbc registry plugin instead of this plugin.

Comment on lines +65 to 79
@PostConstruct
public void initializeRegistryPaths() {
try {
if (!registry.exists(RegistryNodeType.MASTER.getRegistryPath())) {
registry.put(RegistryNodeType.MASTER.getRegistryPath(), EMPTY, false);
}
if (!registry.exists(RegistryNodeType.WORKER.getRegistryPath())) {
registry.put(RegistryNodeType.WORKER.getRegistryPath(), EMPTY, false);
}
if (!registry.exists(RegistryNodeType.ALERT_SERVER.getRegistryPath())) {
registry.put(RegistryNodeType.ALERT_SERVER.getRegistryPath(), EMPTY, false);
}
} catch (Exception e) {
log.error("Failed to initialize registry paths", e);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please don't eat the exception here, why you change the code? Just for happy?

Comment on lines +61 to +65
while (System.currentTimeMillis() < endTimeMillis) {
if (raftRegistryClient.isConnectivity()) {
return;
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will cause the CPU bump, and process killed.

Comment on lines +143 to +156
public Collection<String> getRegistryDataChildren(String key) {
String basePath = null;
if (key.startsWith(RegistryNodeType.MASTER.getRegistryPath())) {
basePath = RegistryNodeType.MASTER.getRegistryPath();
} else if (key.startsWith(RegistryNodeType.WORKER.getRegistryPath())) {
basePath = RegistryNodeType.WORKER.getRegistryPath();
} else if (key.startsWith(RegistryNodeType.ALERT_SERVER.getRegistryPath())) {
basePath = RegistryNodeType.ALERT_SERVER.getRegistryPath();
} else {
throw new UnsupportedOperationException("unsupported get registry data children by key:" + key);
}
List<KVEntry> kvEntries = rheaKvStore.bScan(basePath + Constants.SINGLE_SLASH,
basePath + Constants.SINGLE_SLASH + Constants.RAFT_END_KEY);
return getRegistryList(kvEntries);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's strange, the plugin layer needs to know the business logic.

@SbloodyS
Copy link
Member

If so, big -1 for this plugin, this doesn't help with SLA, I can not find any reason why we wouldn't use jdbc registry plugin instead > of this plugin.

+1, High availability and stability are the first things we should ensure. Any new function should not be based on lowering the standards of these two.

@zhuxt2015
Copy link
Contributor

If so, big -1 for this plugin, this doesn't help with SLA, I can not find any reason why we wouldn't use jdbc registry plugin instead > of this plugin.

+1, High availability and stability are the first things we should ensure. Any new function should not be based on lowering the standards of these two.

This plugin does not reduce the stability and availability of the ds cluster, and will maintain the same impact on the cluster as other plugins.

@SbloodyS
Copy link
Member

This plugin does not reduce the stability and availability of the ds cluster, and will maintain the same impact on the cluster as other plugins.

If the cluster does have master, then the cluster will not work.

This is against zk and jdbc registry since neither of these two types will cause the worker server and alert server to fail to operate normally because of the downtime of the master server.

@zhuxt2015
Copy link
Contributor

This plugin does not reduce the stability and availability of the ds cluster, and will maintain the same impact on the cluster as other plugins.

If the cluster does have master, then the cluster will not work.

This is against zk and jdbc registry since neither of these two types will cause the worker server and alert server to fail to operate normally because of the downtime of the master server.

Raft plugin won‘t do that either.

@SbloodyS
Copy link
Member

Raft plugin won‘t do that either.

Based on this PR, this problem exists. Why do you say this?

@zhuxt2015
Copy link
Contributor

Raft plugin won‘t do that either.

Based on this PR, this problem exists. Why do you say this?

I’m sorry that I didn‘t explain it clearly. I mean we will fix this problem.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[DSIP-9][Feature][Server] Add new registry plugin based on raft
4 participants