Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JDS can't submit block if some transactions are missing #912

Closed
GitGab19 opened this issue May 18, 2024 · 32 comments · Fixed by #1025
Closed

JDS can't submit block if some transactions are missing #912

GitGab19 opened this issue May 18, 2024 · 32 comments · Fixed by #1025
Assignees
Milestone

Comments

@GitGab19
Copy link
Collaborator

As you can see from the logs attached here, this is the messages flow which causes the issue:

  • NewTemplate received by JDC
  • RequestTransactionData from JDC to TP
  • RequestTransactionDataSuccess from TP to JDC
  • DeclareMiningJob from JDC to JDS
  • ProvideMissingTransactions from JDS to JDC
  • In the meantime that JDC is answering with ProvideMissingTransactionsSuccess a block is found, so there's a SubmitSolution from JDC to JDS
  • JDS doesn't have txs that were asked to JDC through ProvideMissingTransaction message, so errors are generated

JDS logs:
Screenshot 2024-05-17 at 12 13 18

JDC logs:
Screenshot 2024-05-17 at 12 13 46

The big problem here is that connection between JDS and JDC is closed in this case, so JDC stops working properly after that.
We should manage this situation better, avoiding the connection to be closed.

@GitGab19
Copy link
Collaborator Author

GitGab19 commented Jun 3, 2024

I just want to add that I got this issue during last weeks of testing on testnet3.
As I switched to testnet4 I'm not getting into this again. So I don't know exactly why this was happening, probably something related to the huge amount of txs in the mempool we had on testnet3 during last month (?).
The issue that needs to be fixed anyway (imo) is that JDS should not disconnect client if this happens

@plebhash

This comment was marked as resolved.

@GitGab19

This comment was marked as resolved.

@plebhash

This comment was marked as resolved.

@plebhash plebhash moved this from Todo 📝 to In Progress 🏗️ in SV2 Roadmap 🛣️ Jun 4, 2024
@plebhash plebhash added the bug Something isn't working label Jun 7, 2024
@plebhash
Copy link
Collaborator

plebhash commented Jun 8, 2024

I'm writing a Message Generator test to reproduce this scenario.
Here is my current strategy:

main test:

  • launches real JDS
  • launches JDC mock

JDC mock:

  • sends SetupConnection to JDS and waits for SetupConnection.Success
  • sends AllocateMiningJobToken to JDS and waits for AllocateMiningJobToken.Success
  • sends DeclareMiningJob to JDS (while missing txs), and waits for ProvideMissingTransactions
  • does not send ProvideMissingTransactions, but sends SubmitSolution instead

@GitGab19 since you reported this bug and have a more refined understanding of the message flow, can you make sure I'm not missing anything?

Please note that this is not following strictly the issue description. Since JDC is a mock, there's no need to include messages that represent the interactions between JDC and TP.

Also, here's a few questions that come to mind:

  • does JDS also need to interact with a TP? or in other words, do we also need to mock a TP? or maybe launch a real TP?
  • what should be the content of DeclareMiningJob.mining_job_token?

@GitGab19
Copy link
Collaborator Author

It seems everything is there (in the message flow) 👍
JDS sends the block (once it can reconstruct it) through RPC (usign the mini_client_rpc developed by @lorbax) to the node.
But in order to reproduce this, it should also work without any TP. Because the error is generated even before the block is submitted through RPC, since the JDS is not able to recontruct it before sending.
The most important point of this issue I opened, is that even if JDS is not able to reconstruct the block, it should not close the connection with JDC. Block is also propagated by JDC and Pool as well, so there's no reason to stop everything, closing connection with JDC.

@GitGab19
Copy link
Collaborator Author

It seems everything is there (in the message flow) 👍 JDS sends the block (once it can reconstruct it) through RPC (usign the mini_client_rpc developed by @lorbax) to the node. But in order to reproduce this, it should also work without any TP. Because the error is generated even before the block is submitted through RPC, since the JDS is not able to recontruct it before sending. The most important point of this issue I opened, is that even if JDS is not able to reconstruct the block, it should not close the connection with JDC. Block is also propagated by JDC and Pool as well, so there's no reason to stop everything, closing connection with JDC.

@plebhash to be completely sure about the unnecessity of TP in the MG test, try to run JDS without it and see if JDC is able to connect to it.

@GitGab19
Copy link
Collaborator Author

The content of DeclareMiningJob.mining_job_token should be the token sent by JDS through AllocateMiningJobToken message afaik

@lorbax
Copy link
Collaborator

lorbax commented Jun 10, 2024

The most important point of this issue I opened, is that even if JDS is not able to reconstruct the block, it should not close the connection with JDC. Block is also propagated by JDC and Pool as well, so there's no reason to stop everything, closing connection with JDC.

If I remember correctly, some time ago we decided that if the JDS cannot propagate the block, or if there an RPC error, the JDS should close the connection. So, I think this is the expected behavior. I don't like it too, tbh

@GitGab19
Copy link
Collaborator Author

The most important point of this issue I opened, is that even if JDS is not able to reconstruct the block, it should not close the connection with JDC. Block is also propagated by JDC and Pool as well, so there's no reason to stop everything, closing connection with JDC.

If I remember correctly, some time ago we decided that if the JDS cannot propagate the block, or if there an RPC error, the JDS should close the connection. So, I think this is the expected behavior. I don't like it too, tbh

I don't remember this, imo it shouldn't close connection for this specific error. Maybe it would make sense for other RPC errors (I don't think so but I don't have a strong opinion on this) but not for this one

@lorbax
Copy link
Collaborator

lorbax commented Jun 18, 2024

A posssible approach is to use rpc client in order relay transactions to the bitcoin network

@Fi3
Copy link
Collaborator

Fi3 commented Jun 18, 2024

yep I think that JDS should send to bitcoind every missing txs that it receive

@plebhash
Copy link
Collaborator

plebhash commented Jun 21, 2024

@plebhash to be completely sure about the unnecessity of TP in the MG test, try to run JDS without it and see if JDC is able to connect to it.

right now my test is incomplete, and it looks like this:

main test:

  • launches real JDS
  • launches JDC mock

JDC mock:

  1. sends SetupConnection to JDS and waits for SetupConnection.Success
  2. sends AllocateMiningJobToken to JDS and waits for AllocateMiningJobToken.Success
  3. sends DeclareMiningJob to JDS (while missing txs), and waits for ProvideMissingTransactions

if I remove step 3, JDS never complains about not being able to reach TP.

But with step 3, I see these logs:

STD OUT: actiondoc: "This action sends DeclareMiningJob and awaits for a ProvideMissingTransactions"
STD OUT: SEND JobDeclaration(
STD OUT:     DeclareMiningJob(
STD OUT:         DeclareMiningJob {
...
STD OUT:         },
STD OUT:     ),
STD OUT: )
STD OUT: Working on result 1/1: MatchMessageType: 85 (0x55)
STD OUT: RECV Sv2Frame {
STD OUT:     header: Header {
STD OUT:         extension_type: 0,
STD OUT:         msg_type: 85,
STD OUT:         msg_length: U24(
STD OUT:             8,
STD OUT:         ),
STD OUT:     },
STD OUT:     payload: None,
STD OUT:     serialized: Some(
STD OUT:         Slice {
STD OUT:             offset: 0x00007f884de37010,
STD OUT:             len: 14,
STD OUT:             index: 1,
STD OUT:             shared_state: SharedState(
STD OUT:                 128,
STD OUT:             ),
STD OUT:             owned: None,
STD OUT:         },
STD OUT:     ),
STD OUT: }
STD OUT: MATCHED MESSAGE TYPE 85
STD OUT: TEST OK
STD OUT: Disconnecting from client due to error receiving: receiving from an empty and closed channel - 127.0.0.1:34264
STD OUT: 2024-06-21T17:49:36.393849Z  INFO roles_logic_sv2::handlers::job_declaration: Received DeclareMiningJob with id: 0
STD OUT: 2024-06-21T17:49:36.393857Z DEBUG roles_logic_sv2::handlers::job_declaration: DeclareMiningJob: DeclareMiningJob { request_id: 0, mining_job_token: Ref([1, 0, 0, 0]), version: 0, coinbase_prefix: Ref([2, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 255, 255, 255, 255, 55, 2, 128, 121, 0, 83, 116, 114, 97, 116, 117, 109, 32, 118, 50, 32, 83, 82, 73, 32, 80, 111, 111, 108]), coinbase_suffix: Ref([255, 255, 255, 255, 2, 168, 247, 5, 42, 1, 0, 0, 0, 22, 0, 20, 235, 225, 183, 220, 194, 147, 204, 170, 14, 231, 67, 168, 111, 137, 223, 130, 88, 194, 8, 252, 0, 0, 0, 0, 0, 0, 0, 0, 38, 106, 36, 170, 33, 169, 237, 226, 201, 13, 62, 213, 94, 164, 53, 216, 76, 246, 14, 110, 125, 255, 48, 66, 12, 220, 90, 217, 209, 75, 129, 37, 185, 117, 116, 254, 30, 81, 159, 1, 32, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]), tx_short_hash_nonce: 0, tx_short_hash_list: Seq064K([Ref([95, 135, 113, 8, 147, 179])], PhantomData<&binary_codec_sv2::datatypes::non_copy_data_types::inner::Inner<true, 6, 0, 0>>), tx_hash_list_hash: Ref([133, 189, 184, 91, 252, 203, 225, 42, 233, 16, 77, 119, 76, 134, 93, 189, 192, 159, 221, 130, 150, 196, 18, 32, 54, 212, 138, 255, 57, 63, 118, 74]), excess_data: Ref([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]) }
STD OUT: 2024-06-21T17:49:36.393898Z DEBUG jd_server::lib::job_declarator: Send message: PMT. Updating the JDS mempool.
STD OUT: 2024-06-21T17:49:36.394037Z ERROR jd_server::lib::mempool::error: NoClient
STD OUT: 2024-06-21T17:49:36.394050Z ERROR jd_server::lib::mempool::error: Unable to establish RPC connection with Template Provider (possible reasons: not fully synced, down)
STD OUT: 2024-06-21T17:49:36.394372Z ERROR network_helpers_sv2::noise_connection_tokio: Disconnected from client while reading : early eof - 127.0.0.1:40364
STD OUT: 2024-06-21T17:49:36.394390Z DEBUG jd_server::lib::status: Error: ChannelRecv(RecvError)
STD OUT: 2024-06-21T17:49:36.394466Z ERROR network_helpers_sv2::noise_connection_tokio: Disconnecting from client due to error receiving: receiving from an empty and closed channel - 127.0.0.1:40364
STD OUT: 2024-06-21T17:49:36.394517Z ERROR jd_server: SHUTDOWN from Downstream: Channel recv failed: `RecvError`
STD OUT: Try to restart the downstream listener
STD OUT: 2024-06-21T17:49:46.315564Z  INFO jd_server: Interrupt received
TEST OK

this is not doing anything related to SubmitSolution yet, and we are already seeing the connection being dropped

@GitGab19 do you think this is enough to reproduce the issue you wanted to report here? or should SubmitSolution be part of the test and I'm still missing something?

one interesting thing is that JDS sends a ProvideMissingTransactions to JDC before closing the connection.

plebhash added a commit to plebhash/stratum that referenced this issue Jun 21, 2024
@GitGab19
Copy link
Collaborator Author

@plebhash are you using our hosted TP for this test? Could it be that JDS is not using the correct port for RPC?

@GitGab19
Copy link
Collaborator Author

GitGab19 commented Jun 24, 2024

Btw, SubmitSolution (the one sent by JDC to JDS) should be included in some way. Or at least we should try to propagate the block on JDS side through its rpc_client, so that we check if JDS is able to reconstruct the block or not, and if it's going to close connection with JDC or not

@plebhash
Copy link
Collaborator

@plebhash are you using our hosted TP for this test? Could it be that JDS is not using the correct port for RPC?

so far there's absolutely no TP on this setup, but it seems we need one

I don't think we are able to mock a TP for RPC, right? it seems to me that the only solution is to launch a real TP

@GitGab19
Copy link
Collaborator Author

@plebhash are you using our hosted TP for this test? Could it be that JDS is not using the correct port for RPC?

so far there's absolutely no TP on this setup, but it seems we need one

I don't think we are able to mock a TP for RPC, right? it seems to me that the only solution is to launch a real TP

I think that connecting this JDS to our hosted one could be just fine!

@lorbax
Copy link
Collaborator

lorbax commented Jun 24, 2024

@plebhash are you using our hosted TP for this test? Could it be that JDS is not using the correct port for RPC?

so far there's absolutely no TP on this setup, but it seems we need one
I don't think we are able to mock a TP for RPC, right? it seems to me that the only solution is to launch a real TP

I think that connecting this JDS to our hosted one could be just fine!

The node will likely refuse remote rpc connection. I remember that when I dev the rpc client wasn't possible to connect to VPS for this reason. It can be disable btw, but I don't remember well.

@GitGab19
Copy link
Collaborator Author

@plebhash are you using our hosted TP for this test? Could it be that JDS is not using the correct port for RPC?

so far there's absolutely no TP on this setup, but it seems we need one
I don't think we are able to mock a TP for RPC, right? it seems to me that the only solution is to launch a real TP

I think that connecting this JDS to our hosted one could be just fine!

The node will likely refuse remote rpc connection. I remember that when I dev the rpc client wasn't possible to connect to VPS for this reason. It can be disable btw, but I don't remember well.

I enabled this yesterday: #990 (comment)

@lorbax
Copy link
Collaborator

lorbax commented Jun 24, 2024

IMO, if the JDS cannot reconstruct the block, there is no way it can send a submit_block, and then it returns an error.
Then only thing we can do about it is to handle this error, possibly disconnecting if this happens.

So, here is the exact part where the JDS disconnects if there is a mempool error.

// this is called by `error_handling::handle_result!`
pub async fn handle_error(sender: &Sender, e: JdsError) -> error_handling::ErrorBranch {
    tracing::debug!("Error: {:?}", &e);
    match e {
        JdsError::Io(_) => send_status(sender, e, error_handling::ErrorBranch::Break).await,
        JdsError::ChannelSend(_) => {
            //This should be a continue because if we fail to send to 1 downstream we should continue
            //processing the other downstreams in the loop we are in. Otherwise if a downstream fails
            //to send to then subsequent downstreams in the map won't get send called on them
            send_status(sender, e, error_handling::ErrorBranch::Continue).await
        }
        JdsError::ChannelRecv(_) => {
            send_status(sender, e, error_handling::ErrorBranch::Break).await
        }
        JdsError::BinarySv2(_) => send_status(sender, e, error_handling::ErrorBranch::Break).await,
        JdsError::Codec(_) => send_status(sender, e, error_handling::ErrorBranch::Break).await,
        JdsError::Noise(_) => send_status(sender, e, error_handling::ErrorBranch::Continue).await,
        JdsError::RolesLogic(_) => send_status(sender, e, error_handling::ErrorBranch::Break).await,
        JdsError::Custom(_) => send_status(sender, e, error_handling::ErrorBranch::Break).await,
        JdsError::Framing(_) => send_status(sender, e, error_handling::ErrorBranch::Break).await,
        JdsError::PoisonLock(_) => send_status(sender, e, error_handling::ErrorBranch::Break).await,
        JdsError::Sv2ProtocolError(_) => {
            send_status(sender, e, error_handling::ErrorBranch::Break).await
        }
        JdsError::MempoolError(_) => {
            send_status(sender, e, error_handling::ErrorBranch::Break).await
        }
        JdsError::ImpossibleToReconstructBlock(_) => {
            send_status(sender, e, error_handling::ErrorBranch::Continue).await
        }
        JdsError::NoLastDeclaredJob => {
            send_status(sender, e, error_handling::ErrorBranch::Continue).await
        }
    }
}

it is in status.rs and you can see that the unwanted error is this one.

        JdsError::MempoolError(_) => {
            send_status(sender, e, error_handling::ErrorBranch::Break).await
        }

To fix it, you can either returning a error_handling::ErrorBranch::Continue, but in this way the jds never disconnects for any mempool error, or remove the catch all variant and add some granularity in error management.

@plebahash this is related to a plan that you wanted to do some moneths ago (I share you the document in pvt)

plebhash added a commit to plebhash/stratum that referenced this issue Jul 9, 2024
plebhash added a commit to plebhash/stratum that referenced this issue Jul 9, 2024
@plebhash plebhash moved this from In Progress 🏗️ to Ready For Review 🔍 in SV2 Roadmap 🛣️ Jul 9, 2024
plebhash added a commit to plebhash/stratum that referenced this issue Jul 9, 2024
plebhash added a commit to plebhash/stratum that referenced this issue Jul 9, 2024
plebhash added a commit to plebhash/stratum that referenced this issue Jul 9, 2024
plebhash added a commit to plebhash/stratum that referenced this issue Jul 9, 2024
plebhash added a commit to plebhash/stratum that referenced this issue Jul 9, 2024
plebhash added a commit to plebhash/stratum that referenced this issue Jul 9, 2024
plebhash added a commit to plebhash/stratum that referenced this issue Jul 12, 2024
plebhash added a commit to plebhash/stratum that referenced this issue Jul 12, 2024
plebhash added a commit to plebhash/stratum that referenced this issue Jul 12, 2024
plebhash added a commit to plebhash/stratum that referenced this issue Jul 18, 2024
plebhash added a commit to plebhash/stratum that referenced this issue Jul 18, 2024
plebhash added a commit to plebhash/stratum that referenced this issue Jul 18, 2024
plebhash added a commit to plebhash/stratum that referenced this issue Jul 18, 2024
plebhash added a commit to plebhash/stratum that referenced this issue Jul 18, 2024
plebhash added a commit to plebhash/stratum that referenced this issue Jul 23, 2024
plebhash added a commit to plebhash/stratum that referenced this issue Jul 23, 2024
plebhash added a commit to plebhash/stratum that referenced this issue Jul 25, 2024
plebhash added a commit to plebhash/stratum that referenced this issue Jul 26, 2024
plebhash added a commit to plebhash/stratum that referenced this issue Jul 26, 2024
plebhash added a commit to plebhash/stratum that referenced this issue Jul 26, 2024
this is the actual fix for stratum-mining#912

use handle_result! macro

following suggestion by @lorbax stratum-mining#1025 (comment)

the implementation diverged a bit from the suggestion, but it was still a good reminder that we should leverage `handle_result!` macro here
plebhash added a commit to plebhash/stratum that referenced this issue Jul 26, 2024
plebhash added a commit to plebhash/stratum that referenced this issue Jul 26, 2024
this is the actual fix for stratum-mining#912

use handle_result! macro

following suggestion by @lorbax stratum-mining#1025 (comment)

the implementation diverged a bit from the suggestion, but it was still a good reminder that we should leverage `handle_result!` macro here
plebhash added a commit to plebhash/stratum that referenced this issue Jul 26, 2024
@github-project-automation github-project-automation bot moved this from Ready For Review 🔍 to Done ✅ in SV2 Roadmap 🛣️ Jul 26, 2024
Fi3 pushed a commit to Fi3/stratum-1 that referenced this issue Nov 26, 2024
this is the actual fix for stratum-mining#912

use handle_result! macro

following suggestion by @lorbax stratum-mining#1025 (comment)

the implementation diverged a bit from the suggestion, but it was still a good reminder that we should leverage `handle_result!` macro here
Fi3 pushed a commit to Fi3/stratum-1 that referenced this issue Nov 26, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

6 participants