Concurrency issues when reserving Tuples in parallel #10
Hi @kindlich. This is definitely not the expected behaviour and will cause inconsistency or make some tuples inaccessible. We will look into this issue and try to improve the handling of reservations as soon as possible. As a quick workaround until this problem is fixed, you could, for example, wrap your command in simple retry logic like the following:

```bash
# Retry secret creation up to 3 times. Print the secret id and break the loop if the CLI
# returns exit code 0; otherwise sleep for 50..100 ms before the next attempt.
parallel "for i in 1 2 3; do o=\$(java -jar cs.jar amphora create-secret {} 2> /dev/null); [ \$? -eq 0 ] && echo \"\$o\" && break || sleep \$(printf \"0.%03d\" \$((50+\$RANDOM%50))); done" ::: {1..2}
```

However, this is only a client-side workaround until the underlying reservation handling is fixed.

Best regards,
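For readability, the retry one-liner above can also be written as a standalone script. The following is just a sketch with the same behaviour; it reuses the cs.jar CLI, the example secret values 1 and 2, and GNU Parallel from the examples in this thread:

```bash
#!/bin/bash
# Sketch: the retry workaround from above, expanded into a readable script.
create_secret() {
  local value="$1"
  for attempt in 1 2 3; do
    # Try to create the secret; on success print the returned secret id and stop retrying.
    if output=$(java -jar cs.jar amphora create-secret "$value" 2>/dev/null); then
      echo "$output"
      return 0
    fi
    # Back off for roughly 50..100 ms before the next attempt.
    sleep "$(printf "0.%03d" $((50 + RANDOM % 50)))"
  done
  return 1
}
export -f create_secret           # make the function visible to the shells spawned by parallel
parallel create_secret ::: {1..2}
```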
If you are having problems with inconsistencies where Castor is no longer able to create or share reservations, you can also try clearing all reservations from the cache and resetting the consumption and reservation markers to 0. The following script performs the described task for a standard deployment:

```bash
cat <<"EOF" | bash -s
#!/bin/bash
for vcp in starbuck apollo
do
  kubectl config use-context kind-"$vcp"
  kubectl exec $(kubectl get pods | awk '/^cs-redis/{print $1}') -- bash -c "redis-cli --scan --pattern castor* | xargs redis-cli del"
  kubectl exec $(kubectl get pods | awk '/cs-postgres/{print $1}') -- psql -U cs castor -c "UPDATE tuple_chunk_meta_data SET consumed_marker = 0, reserved_marker = 0;"
done
EOF
```

This way, you may not have to redeploy the entire Carbyne Stack Virtual Cloud.
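To check whether the reset took effect, you could afterwards inspect the cache and the markers. This is a minimal sketch under the same assumptions as the script above (standard deployment, kind contexts starbuck and apollo, and the cs-redis/cs-postgres pod naming):

```bash
#!/bin/bash
for vcp in starbuck apollo
do
  kubectl config use-context kind-"$vcp"
  # The scan should return no castor* keys once the cached reservations are gone.
  kubectl exec $(kubectl get pods | awk '/^cs-redis/{print $1}') -- redis-cli --scan --pattern "castor*"
  # Both markers should read 0 for every tuple chunk after the reset.
  kubectl exec $(kubectl get pods | awk '/cs-postgres/{print $1}') -- psql -U cs castor -c "SELECT consumed_marker, reserved_marker FROM tuple_chunk_meta_data;"
done
```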
* Add test to reproduce / cover concurrency issue described by #10
* Add TupleChunkFragment entity, repository, and service
* Add initial chunk fragmentation to improve handling of concurrent requests

Signed-off-by: Sebastian Becker <[email protected]>
Is there anything left to do here @sbckr? Otherwise, let's close this issue.
Well, there is still an issue if too many parallel requests come in. If more than N requests arrive at once, the master will block some HTTP calls since no DB entities are available. Let's imagine we have 1 Tuple Fragment available, but 2 concurrent requests: one request gets the fragment, while the other finds no free fragment and its HTTP call is blocked.
Though the question is, is this something the client just has to deal with (e.g., by retrying), or should this be handled at another level?
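To provoke the described situation deliberately, one could fire more concurrent requests than there are fragments available. The following is just a sketch reusing the CLI call from the original report; the count of 10 is an arbitrary choice:

```bash
# Launch more concurrent create-secret calls than there are tuple fragments;
# the surplus requests should then block on the master as described above.
parallel java -jar cs.jar amphora create-secret {} ::: {1..10}
```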
This issue has been marked stale because it has been open for 90 days with no activity. It will be automatically closed in 30 days if no further activity occurs.
@sbckr: Anything left here?
Hey there,
in some tests we found out that requesting tuples in parallel can run into timing problems:
If the latter reservation gets applied before the prior one, then trying to apply the prior one results in an error.
Since the issue is about timing, the example below can succeed or fail, but after a few attempts you usually get the error.
I'm using GNU Parallel to create the parallel requests:

```bash
# Create 2 secrets in parallel (amphora needs input-mask tuples, so it fails when reserving these)
parallel java -jar cs.jar amphora create-secret {} ::: {1..2}
```
Below is a graphic of what I think is the case:
BR,
kindlich