This repository has been archived by the owner on Dec 17, 2024. It is now read-only.

Add the ability to join networks to "seed mode" so that we can run a seed cluster #182

Merged
merged 11 commits into main from seed_mode_add_discovery on Apr 10, 2024

Conversation

bhechinger
Contributor

To run this reasonably in Kubernetes, we need to be able to run multiple seed pods.

This change has nodes run in p2p-beacon mode to try to join an existing network. This gives us two things:

  • Running a seed cluster with multiple seed nodes
  • Completely shutting down the seed nodes and then starting them back up by bootstrapping off an existing network

At startup, the seed will attempt to contact the nodes listed in the discovery list. After enough failed attempts, instead of giving up it will fall back to bootstrapping a brand-new network.
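
A minimal sketch of that retry-then-fallback flow, assuming hypothetical helpers try_join_network and bootstrap_new_network plus the tokio, tracing, and anyhow crates; this is illustrative only, not the code in this PR:

// Hypothetical sketch of the discovery/bootstrap flow; names are illustrative.
use std::time::Duration;
use tokio::time::sleep;

const MAX_ATTEMPTS: u32 = 10; // assumed retry budget before falling back

async fn start_seed(discovery_list: &[String]) -> anyhow::Result<()> {
    for attempt in 1..=MAX_ATTEMPTS {
        for addr in discovery_list {
            // Try to join an existing network through a node from the discovery list.
            if try_join_network(addr).await.is_ok() {
                tracing::info!("joined existing network via {addr}");
                return Ok(());
            }
        }
        tracing::warn!("discovery attempt {attempt}/{MAX_ATTEMPTS} failed, retrying");
        sleep(Duration::from_secs(5)).await;
    }

    // No peer answered after enough attempts: bootstrap a brand-new network instead.
    tracing::info!("no discovery peers reachable, bootstrapping a new network");
    bootstrap_new_network().await
}

// Placeholder stubs so the sketch is self-contained.
async fn try_join_network(_addr: &str) -> anyhow::Result<()> {
    Err(anyhow::anyhow!("not implemented in this sketch"))
}

async fn bootstrap_new_network() -> anyhow::Result<()> {
    Ok(())
}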

@bhechinger bhechinger added the enhancement (New feature or request) label on Apr 10, 2024
@bhechinger bhechinger requested review from tuommaki and musitdev April 10, 2024 12:08
@bhechinger bhechinger self-assigned this Apr 10, 2024
tuommaki and others added 3 commits April 10, 2024 13:12
This change adds support for Prometheus metrics export via the P2P protocol. Metrics are exported with `TextEncoder` and compressed with `zstd` at the default compression level (`3` at the time of implementation).
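
A minimal sketch of that export path, assuming the prometheus and zstd crates (the actual wiring into the P2P protocol in this repository may differ):

use prometheus::{Encoder, TextEncoder};

// Sketch: render all registered metrics in the Prometheus text exposition
// format, then compress the payload with zstd at compression level 3.
fn export_compressed_metrics() -> anyhow::Result<Vec<u8>> {
    let metric_families = prometheus::gather();
    let mut text = Vec::new();
    TextEncoder::new().encode(&metric_families, &mut text)?;
    let compressed = zstd::encode_all(text.as_slice(), 3)?;
    Ok(compressed)
}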
* add program id Tx dep missing detection

* add tx hash to save db error logs

* sync Tx dep verification and tx save to be sure parents are saved before children

* test run tx with all its program ids

---------

Co-authored-by: Tuomas Mäkinen <[email protected]>
Contributor

@musitdev musitdev left a comment

I think there are some modifications here that came from my branch, which I merged yesterday. Perhaps you'll need to rebase from main?

@bhechinger
Contributor Author

I think there are some modifications here that came from my branch, which I merged yesterday. Perhaps you'll need to rebase from main?

I had already rebased, but it mangled the formatting. I just needed to re-run cargo fmt.

Contributor

@musitdev musitdev left a comment

Perhaps you can test the watchdog modification to see if it works?

Comment on lines +407 to 424
// Start a basic healthcheck so kubernetes has something to wait on.
let listener = TcpListener::bind(config.http_healthcheck_listen_addr).await?;
tracing::info!("Healthcheck listening on: {}", listener.local_addr()?);

loop {
    sleep(Duration::from_secs(1));
    let (stream, _) = listener.accept().await?;

    let io = TokioIo::new(stream);

    tokio::task::spawn(async move {
        if let Err(err) = http1::Builder::new()
            .serve_connection(io, service_fn(ok))
            .await
        {
            tracing::error!("Error serving connection: {:?}", err);
        }
    });
}
Contributor

To avoid starting a new HTTP server, you can move lines 243 and 244:

    let scheduler_watchdog_sender =
        watchdog::start_healthcheck(config.http_healthcheck_listen_addr).await?;

to just before the if (for example at line 239) and remove these lines.
That will activate the healthcheck watchdog for all node types. The scheduler verification will fail, but the / entry point will be OK.
I expect it will just log an error. If that's too annoying, keep the HTTP server you added, and I'll remove it when I change the watchdog to activate per service.

Contributor Author

I wasn't sure if it was safe to reuse that, so I just whipped up something quick to get past the issue I had (without the readiness check, Kubernetes won't wait before starting the next StatefulSet pod). I'm OK with ignoring any errors it logs for now, so fix it whenever you get around to it.
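
For context, the ok handler referenced in the snippet above is not shown in this thread; a hypothetical minimal version (hyper 1.x style, possibly different from the PR's actual code) just answers 200 so the Kubernetes readiness probe on / succeeds:

use std::convert::Infallible;

use http_body_util::Full;
use hyper::body::{Bytes, Incoming};
use hyper::{Request, Response};

// Always answer 200 OK with a tiny body; enough for a readiness probe on "/".
async fn ok(_req: Request<Incoming>) -> Result<Response<Full<Bytes>>, Infallible> {
    Ok(Response::new(Full::new(Bytes::from("OK"))))
}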

Contributor Author

OK, as discussed on Slack, this is actually an issue: it generates thousands of errors a second. I'll merge this, and then we can update it once you've changed the watchdog.

Contributor

@tuommaki tuommaki left a comment

I'm fine with the PR in general, but maybe the Nix files could be extracted into a separate PR and checked for dependencies?

Comment on lines +24 to +27
p.google-cloud-sdk-gce
(p.google-cloud-sdk.withExtraComponents [p.google-cloud-sdk.components.gke-gcloud-auth-plugin])
p.k9s
p.kube-capacity
Contributor

These shouldn't be needed by the node directly, right?

@bhechinger bhechinger merged commit 683bff9 into main Apr 10, 2024
4 checks passed
@bhechinger bhechinger deleted the seed_mode_add_discovery branch April 10, 2024 15:17
@bhechinger
Contributor Author

I'm fine with the PR in general, but maybe the Nix files could be extracted into a separate PR and checked for dependencies?

Oh shoot, I missed this message before merging. Let's catch up on Slack about how to handle this.
