This repository has been archived by the owner on Dec 17, 2024. It is now read-only.

Add the ability to join networks to "seed mode" so that we can run a seed cluster #182

Merged
merged 11 commits into main from seed_mode_add_discovery on Apr 10, 2024

Conversation

bhechinger
Contributor

To run this reasonably in Kubernetes, we need to be able to run multiple seed pods.

This change has nodes run in p2p-beacon mode to try to join an existing network. This gives us two things:

  • Running a seed cluster with multiple seed nodes
  • Completely shutting down the seed nodes and then starting them back up by bootstrapping off an existing network

At startup, the seed will attempt to contact the nodes listed in the discovery list. After enough failed attempts, instead of giving up it will fall back to bootstrapping a brand-new network.
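
A minimal sketch of that retry-then-fallback flow, assuming hypothetical helpers try_join_network and bootstrap_new_network plus the tokio, tracing, and anyhow crates; this is illustrative only, not the code in this PR:

// Hypothetical sketch of the discovery/bootstrap flow; names are illustrative.
use std::time::Duration;
use tokio::time::sleep;

const MAX_ATTEMPTS: u32 = 10; // assumed retry budget before falling back

async fn start_seed(discovery_list: &[String]) -> anyhow::Result<()> {
    for attempt in 1..=MAX_ATTEMPTS {
        for addr in discovery_list {
            // Try to join an existing network through a node from the discovery list.
            if try_join_network(addr).await.is_ok() {
                tracing::info!("joined existing network via {addr}");
                return Ok(());
            }
        }
        tracing::warn!("discovery attempt {attempt}/{MAX_ATTEMPTS} failed, retrying");
        sleep(Duration::from_secs(5)).await;
    }

    // No peer answered after enough attempts: bootstrap a brand-new network instead.
    tracing::info!("no discovery peers reachable, bootstrapping a new network");
    bootstrap_new_network().await
}

// Placeholder stubs so the sketch is self-contained.
async fn try_join_network(_addr: &str) -> anyhow::Result<()> {
    Err(anyhow::anyhow!("not implemented in this sketch"))
}

async fn bootstrap_new_network() -> anyhow::Result<()> {
    Ok(())
}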

@bhechinger bhechinger added the enhancement (New feature or request) label on Apr 10, 2024
@bhechinger bhechinger requested review from tuommaki and musitdev April 10, 2024 12:08
@bhechinger bhechinger self-assigned this Apr 10, 2024
tuommaki and others added 3 commits April 10, 2024 13:12
This change adds support for Prometheus metrics export via the P2P protocol. Metrics are exported with `TextEncoder` and compressed with `zstd` at the default compression level (`3` at the time of implementation).
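
A minimal sketch of that export path, assuming the prometheus and zstd crates (the actual wiring into the P2P protocol in this repository may differ):

use prometheus::{Encoder, TextEncoder};

// Sketch: render all registered metrics in the Prometheus text exposition
// format, then compress the payload with zstd at compression level 3.
fn export_compressed_metrics() -> anyhow::Result<Vec<u8>> {
    let metric_families = prometheus::gather();
    let mut text = Vec::new();
    TextEncoder::new().encode(&metric_families, &mut text)?;
    let compressed = zstd::encode_all(text.as_slice(), 3)?;
    Ok(compressed)
}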
* add program id Tx dep missing detection

* add tx hash to save db error logs

* sync Tx dep verification and tx save to be sure parents are saved before children

* test run tx with all its program ids

---------

Co-authored-by: Tuomas Mäkinen <[email protected]>
Contributor

@musitdev musitdev left a comment

I think there are some modifications here that came from my branch, which I merged yesterday. Perhaps you'll need to rebase from main?

@bhechinger
Contributor Author

I think there are some modifications here that came from my branch, which I merged yesterday. Perhaps you'll need to rebase from main?

I had already rebased, but it mangled the formatting. I just needed to re-run cargo fmt.

Contributor

@musitdev musitdev left a comment

Perhaps you can test the watchdog modification to see if it works?

Comment on lines +407 to 424
// Start a basic healthcheck so kubernetes has something to wait on.
let listener = TcpListener::bind(config.http_healthcheck_listen_addr).await?;
tracing::info!("Healthcheck listening on: {}", listener.local_addr()?);

loop {
    sleep(Duration::from_secs(1));
    let (stream, _) = listener.accept().await?;

    let io = TokioIo::new(stream);

    tokio::task::spawn(async move {
        if let Err(err) = http1::Builder::new()
            .serve_connection(io, service_fn(ok))
            .await
        {
            tracing::error!("Error serving connection: {:?}", err);
        }
    });
}
Contributor

To avoid starting a new HTTP server, you can move lines 243 and 244:

    let scheduler_watchdog_sender =
        watchdog::start_healthcheck(config.http_healthcheck_listen_addr).await?;

to just before the if (for example at line 239) and remove these lines.
That will activate the healthcheck watchdog for all node types. The scheduler verification will fail, but the / entry point will be OK.
I expect it will just log an error. If that's too annoying, keep the HTTP server you added, and I'll remove it when I change the watchdog to activate per service.

Contributor Author

I wasn't sure if it was safe to reuse that, so I just whipped up something quick to get past the issue I had (without the readiness check, Kubernetes won't wait before starting the next StatefulSet pod). I'm OK with ignoring any errors it logs for now, so fix it whenever you get around to it.
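
For context, the ok handler referenced in the snippet above is not shown in this thread; a hypothetical minimal version (hyper 1.x style, possibly different from the PR's actual code) just answers 200 so the Kubernetes readiness probe on / succeeds:

use std::convert::Infallible;

use http_body_util::Full;
use hyper::body::{Bytes, Incoming};
use hyper::{Request, Response};

// Always answer 200 OK with a tiny body; enough for a readiness probe on "/".
async fn ok(_req: Request<Incoming>) -> Result<Response<Full<Bytes>>, Infallible> {
    Ok(Response::new(Full::new(Bytes::from("OK"))))
}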

Contributor Author

OK, as discussed on Slack, this is actually an issue: it generates thousands of errors a second. I'll merge this, and then we can update it once you've changed the watchdog.

Contributor

@tuommaki tuommaki left a comment

I'm fine with the PR in general, but maybe the Nix files could be extracted into a separate PR and checked for dependencies?

Comment on lines +24 to +27
p.google-cloud-sdk-gce
(p.google-cloud-sdk.withExtraComponents [p.google-cloud-sdk.components.gke-gcloud-auth-plugin])
p.k9s
p.kube-capacity
Contributor

These shouldn't be needed by the node directly, right?

@bhechinger bhechinger merged commit 683bff9 into main Apr 10, 2024
4 checks passed
@bhechinger bhechinger deleted the seed_mode_add_discovery branch April 10, 2024 15:17
@bhechinger
Contributor Author

I'm fine with the PR in general, but maybe the Nix files could be extracted into a separate PR and checked for dependencies?

Oh shoot, I missed this message before merging. Let's catch up on Slack about how to handle this.
