feat(wal): reduce concurrent conflicts between block write operations and poll operations #1554

Closed
wants to merge 7 commits into from

CLFutureX
Contributor

@Chillax-0v0 Chillax-0v0 changed the title feat(wal):reduce concurrent conflicts between block write operations and poll operations (#1550) feat(wal): reduce concurrent conflicts between block write operations and poll operations (#1550) Jul 11, 2024
@Chillax-0v0 Chillax-0v0 changed the title feat(wal): reduce concurrent conflicts between block write operations and poll operations (#1550) feat(wal): reduce concurrent conflicts between block write operations and poll operations Jul 11, 2024
@CLAassistant

CLAassistant commented Jul 11, 2024

CLA assistant check
All committers have signed the CLA.

@Chillax-0v0
Contributor

Hello, @RapperCL, you can update your code within the same PR; there's no need to create multiple PRs.

@Chillax-0v0
Contributor

We have provided a tool called "WriteBench" for testing WAL performance. You can use it to compare the performance before and after the modifications to confirm their effectiveness.

@CLFutureX
Contributor Author

> We have provided a tool called "WriteBench" for testing WAL performance. You can use it to compare the performance before and after the modifications to confirm their effectiveness.

Okay, I'll find some time later to test it.

@Chillax-0v0
Contributor

#1550 (comment)

Chillax-0v0 previously approved these changes Jul 12, 2024
@CLFutureX
Contributor Author

CLFutureX commented Jul 13, 2024

> We have provided a tool called "WriteBench" for testing WAL performance. You can use it to compare the performance before and after the modifications to confirm their effectiveness.

I ran some simple local tests with WriteBench, and the results are basically consistent with expectations: the performance improvement becomes more evident as the concurrency level increases. Here is the test data for two scenarios.
Local environment: 8 cores, 32 GB RAM, 1 TB SSD (KXG60ZNV1T02 by KIOXIA). Other applications were running on the machine, so treat the results as a reference only.

First scenario: -p="D:\Users\chenyong152\AppData\Local\autmo\test2.log" -c=2000000000 -d=8 --iops=3000 --threads=1000 --throughput=100000000 --record-size=1000 --duration=600
The data below is a representative sample captured after multiple runs.
Before optimization:
Append task | Append Rate 9189 msg/s 8974 KB/s | Avg Latency 8134.035 ms | Max Latency 8499.516 ms
Append task | Append Rate 13265 msg/s 12954 KB/s | Avg Latency 8941.929 ms | Max Latency 9461.795 ms
Append task | Append Rate 11912 msg/s 11633 KB/s | Avg Latency 9463.223 ms | Max Latency 10415.449 ms
Append task | Append Rate 13178 msg/s 12870 KB/s | Avg Latency 10461.069 ms | Max Latency 11141.391 ms
Append task | Append Rate 9241 msg/s 9024 KB/s | Avg Latency 11108.357 ms | Max Latency 11949.226 ms
Append task | Append Rate 13931 msg/s 13605 KB/s | Avg Latency 12160.423 ms | Max Latency 12670.754 ms
Append task | Append Rate 11233 msg/s 10969 KB/s | Avg Latency 13010.865 ms | Max Latency 13496.373 ms
Append task | Append Rate 14456 msg/s 14117 KB/s | Avg Latency 14119.631 ms | Max Latency 14467.095 ms
After optimization:
Append task | Append Rate 18671 msg/s 18234 KB/s | Avg Latency 12118.829 ms | Max Latency 12655.492 ms
Append task | Append Rate 17453 msg/s 17044 KB/s | Avg Latency 13147.148 ms | Max Latency 13486.921 ms
Append task | Append Rate 26202 msg/s 25588 KB/s | Avg Latency 13849.179 ms | Max Latency 14235.799 ms
Append task | Append Rate 13330 msg/s 13017 KB/s | Avg Latency 14573.423 ms | Max Latency 15058.457 ms
Append task | Append Rate 20215 msg/s 19741 KB/s | Avg Latency 15530.781 ms | Max Latency 15899.685 ms
Append task | Append Rate 17378 msg/s 16970 KB/s | Avg Latency 16253.505 ms | Max Latency 16724.087 ms
Append task | Append Rate 14256 msg/s 13922 KB/s | Avg Latency 17183.877 ms | Max Latency 17574.566 ms
Append task | Append Rate 25682 msg/s 25080 KB/s | Avg Latency 17916.577 ms | Max Latency 18312.136 ms

Second scenario: increase the number of threads to 5000 to simulate higher concurrency.
-p="D:\Users\chenyong152\AppData\Local\autmo\test2.log" -c=2000000000 -d=8 --iops=3000 --threads=5000 --throughput=100000000 --record-size=1000 --duration=600
The data below is a representative sample captured after multiple runs.
Before optimization:
Append task | Append Rate 2298 msg/s 2244 KB/s | Avg Latency 3097.933 ms | Max Latency 3661.817 ms
Append task | Append Rate 1959 msg/s 1913 KB/s | Avg Latency 4245.371 ms | Max Latency 4683.423 ms
Append task | Append Rate 2353 msg/s 2297 KB/s | Avg Latency 4873.614 ms | Max Latency 5710.320 ms
Append task | Append Rate 1408 msg/s 1375 KB/s | Avg Latency 6015.723 ms | Max Latency 6967.013 ms
Append task | Append Rate 2704 msg/s 2641 KB/s | Avg Latency 7113.552 ms | Max Latency 7427.943 ms
Append task | Append Rate 4166 msg/s 4069 KB/s | Avg Latency 7803.141 ms | Max Latency 8241.843 ms
Append task | Append Rate 1369 msg/s 1337 KB/s | Avg Latency 8977.165 ms | Max Latency 9297.051 ms
Append task | Append Rate 2984 msg/s 2914 KB/s | Avg Latency 9725.852 ms | Max Latency 10238.513 ms
Append task | Append Rate 4475 msg/s 4370 KB/s | Avg Latency 10668.280 ms | Max Latency 11192.358 ms
Append task | Append Rate 3246 msg/s 3170 KB/s | Avg Latency 11473.678 ms | Max Latency 12005.956 ms
Append task | Append Rate 3431 msg/s 3351 KB/s | Avg Latency 12385.363 ms | Max Latency 13277.032 ms
Append task | Append Rate 3800 msg/s 3711 KB/s | Avg Latency 13457.989 ms | Max Latency 13866.654 ms
Append task | Append Rate 3802 msg/s 3713 KB/s | Avg Latency 14345.427 ms | Max Latency 14766.458 ms
After optimization:
Append task | Append Rate 7140 msg/s 6972 KB/s | Avg Latency 7088.313 ms | Max Latency 8904.666 ms
Append task | Append Rate 8430 msg/s 8232 KB/s | Avg Latency 7390.201 ms | Max Latency 9563.043 ms
Append task | Append Rate 18891 msg/s 18448 KB/s | Avg Latency 8437.945 ms | Max Latency 10448.201 ms
Append task | Append Rate 12305 msg/s 12016 KB/s | Avg Latency 9075.242 ms | Max Latency 10991.726 ms
Append task | Append Rate 5990 msg/s 5850 KB/s | Avg Latency 9932.271 ms | Max Latency 11851.828 ms
Append task | Append Rate 25127 msg/s 24538 KB/s | Avg Latency 10879.178 ms | Max Latency 13030.854 ms
Append task | Append Rate 23232 msg/s 22688 KB/s | Avg Latency 11595.691 ms | Max Latency 12423.481 ms
Append task | Append Rate 14419 msg/s 14081 KB/s | Avg Latency 12774.103 ms | Max Latency 13222.653 ms
Append task | Append Rate 19914 msg/s 19447 KB/s | Avg Latency 13402.113 ms | Max Latency 14173.474 ms
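
To compare the runs more easily, the append rates can be averaged. The following is a minimal sketch, not part of WriteBench: the class name `AppendRateSummary` and the method `meanAppendRate` are hypothetical, and the parser only assumes the `Append Rate <n> msg/s` field shown in the output above.

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for summarizing the output lines above; not part of WriteBench.
final class AppendRateSummary {
    // Matches the "Append Rate <n> msg/s" field of a benchmark output line.
    private static final Pattern RATE = Pattern.compile("Append Rate (\\d+) msg/s");

    /** Mean append rate in msg/s over every line that carries a rate field. */
    static double meanAppendRate(List<String> lines) {
        long sum = 0;
        int count = 0;
        for (String line : lines) {
            Matcher m = RATE.matcher(line);
            if (m.find()) {
                sum += Long.parseLong(m.group(1));
                count++;
            }
        }
        if (count == 0) {
            throw new IllegalArgumentException("no append-rate samples found");
        }
        return (double) sum / count;
    }

    public static void main(String[] args) {
        // First two samples of each run from the 1000-thread scenario.
        List<String> before = List.of(
            "Append task | Append Rate 9189 msg/s 8974 KB/s | Avg Latency 8134.035 ms | Max Latency 8499.516 ms",
            "Append task | Append Rate 13265 msg/s 12954 KB/s | Avg Latency 8941.929 ms | Max Latency 9461.795 ms");
        List<String> after = List.of(
            "Append task | Append Rate 18671 msg/s 18234 KB/s | Avg Latency 12118.829 ms | Max Latency 12655.492 ms",
            "Append task | Append Rate 17453 msg/s 17044 KB/s | Avg Latency 13147.148 ms | Max Latency 13486.921 ms");
        double b = meanAppendRate(before);
        double a = meanAppendRate(after);
        System.out.printf("before=%.1f msg/s, after=%.1f msg/s, ratio=%.2f%n", b, a, a / b);
    }
}
```

Averaged over all of the samples listed above, the mean append rate goes from roughly 12,051 to 19,148 msg/s with 1000 threads (about 1.6x) and from roughly 2,923 to 15,050 msg/s with 5000 threads (about 5.1x). The 5000-thread runs have different sample counts (13 before, 9 after), so the second ratio is only indicative.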

@superhx
Collaborator

superhx commented Jul 15, 2024

This PR does greatly improve WAL performance in many high-contention scenarios.

However, it also makes the correctness of the WAL under concurrency difficult to reason about. Since the number of concurrent writers in practice usually matches the number of CPU cores, the current locking model is not a bottleneck for WAL performance. We do not recommend introducing a more complex locking model, as it would increase maintenance costs.

Could the WAL adopt a better threading and concurrency model in the future that offers both good performance and stronger correctness guarantees (for example, a single-threaded model or multiple single-threaded workers)?

@CLFutureX
Contributor Author

CLFutureX commented Jul 15, 2024

> This PR does greatly improve WAL performance in many high-contention scenarios.
>
> However, it also makes the correctness of the WAL under concurrency difficult to reason about. Since the number of concurrent writers in practice usually matches the number of CPU cores, the current locking model is not a bottleneck for WAL performance. We do not recommend introducing a more complex locking model, as it would increase maintenance costs.
>
> Could the WAL adopt a better threading and concurrency model in the future that offers both good performance and stronger correctness guarantees (for example, a single-threaded model or multiple single-threaded workers)?

In fact, I had previously considered a multiple single-threaded model, but given the strong ordering guarantees required here, it is difficult to isolate contention that way.

As I understand it, a multiple single-threaded model is not suitable for scenarios that require global ordering; it fits scenarios where only local ordering is required.

Under the current write design, blocks must be written in order and windowData.startOffset must be updated correctly. This in turn requires ordering among writeBlocks and, subsequently, among poll operations.

As a result, concurrency safety is needed for contention among write operations, between write and poll operations, between poll and IO operations, and among IO operations.

Previously, global ordering was ensured through blockLock.

After this optimization, blockLock is used among write operations, while pollBlockLock is introduced for contention between write and poll operations, between poll and IO operations, and among IO operations.
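
The lock split described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: the class `SplitLockWindow`, its fields, and the representation of a block by its start offset are all hypothetical. Only the two lock names come from the discussion.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical illustration of lock splitting; not the actual WAL classes.
final class SplitLockWindow {
    private final ReentrantLock blockLock = new ReentrantLock();     // contention among writers
    private final ReentrantLock pollBlockLock = new ReentrantLock(); // write/poll and poll/IO contention

    private long nextOffset = 0;                                  // guarded by blockLock
    private final Queue<Long> pendingBlocks = new ArrayDeque<>(); // guarded by pollBlockLock

    /** Write path: reserves an offset range, then publishes the block for polling. */
    long write(int size) {
        blockLock.lock();
        try {
            long start = nextOffset;
            nextOffset += size;
            // Lock order is always blockLock -> pollBlockLock, so the two locks cannot deadlock.
            pollBlockLock.lock();
            try {
                pendingBlocks.add(start); // published in offset order because blockLock is still held
            } finally {
                pollBlockLock.unlock();
            }
            return start;
        } finally {
            blockLock.unlock();
        }
    }

    /** Poll path: drains published blocks in offset order without ever taking blockLock. */
    List<Long> poll() {
        pollBlockLock.lock();
        try {
            List<Long> drained = new ArrayList<>(pendingBlocks);
            pendingBlocks.clear();
            return drained;
        } finally {
            pollBlockLock.unlock();
        }
    }
}
```

In this sketch the poll path never waits behind a writer that is still doing work under blockLock; it only contends for the short critical section that publishes a block.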

Based on the multiple single-threaded model, potential optimization points include:

  • Optimizing the IO thread pool into a multiple single-threaded model to reduce contention among IO threads for the same blocking queue. However, since the poll thread already applies frequency control and batching, contention on the blocking queue may not be intense.
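
For reference, a multiple single-threaded model is commonly built from several single-thread executors, each with its own queue, with tasks routed by key. The sketch below is a generic illustration (the class `ShardedExecutors` and its methods are hypothetical); it shows why the model gives per-key ordering but no global ordering across keys.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical illustration of a multiple single-threaded model.
final class ShardedExecutors implements AutoCloseable {
    private final ExecutorService[] workers;

    ShardedExecutors(int n) {
        workers = new ExecutorService[n];
        for (int i = 0; i < n; i++) {
            workers[i] = Executors.newSingleThreadExecutor(); // each worker owns its own queue
        }
    }

    /** Tasks with the same key run on the same thread, so they keep submission order. */
    Future<?> submit(Object key, Runnable task) {
        int idx = Math.floorMod(key.hashCode(), workers.length);
        return workers[idx].submit(task);
    }

    @Override
    public void close() {
        for (ExecutorService w : workers) {
            w.shutdown();
        }
    }

    /** Submits n tasks under one key and returns the order in which they ran. */
    static List<Integer> demo(int n) {
        List<Integer> seen = new ArrayList<>(); // only touched by one worker thread
        try (ShardedExecutors pool = new ShardedExecutors(4)) {
            Future<?> last = null;
            for (int i = 0; i < n; i++) {
                final int v = i;
                last = pool.submit("stream-1", () -> seen.add(v));
            }
            if (last != null) {
                last.get(); // waits for the final task; earlier ones ran before it
            }
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        }
        return seen;
    }
}
```

Tasks submitted under different keys may land on different workers and interleave arbitrarily, which is why this model alone does not provide the global ordering discussed above.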

Additionally, the pollBlockLock introduced in this PR is an application of lock splitting and does not significantly increase complexity. Conceptually, it is a lock for poll nodes and does not conflict with existing concepts. In practice, it simply replaces the original blockLock at the places where that lock was used.

@CLFutureX
Contributor Author

@Chillax-0v0 @superhx PTAL
The following changes have been made. They improve not only performance but also the threading model.

  1. Removal of Additional Lock: The newly added lock, pollBlockLock, has been removed.

  2. Single-Threaded Model for Poll Operations: Based on the designed wakeup mechanism, the poll operation has been completely decoupled from the write operation. The write thread no longer participates in the poll operation, ensuring a pure single-threaded model for poll operations.

  3. Updating startOffset for IO Operations: The update process for startOffset is now handled through a combination of volatile variables and a concurrent-safe queue. This decoupling separates IO operations from both poll and write operations.
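
One way to combine a volatile variable with a concurrent-safe queue for item 3 is sketched below. It is an assumption-laden illustration, not the code in this PR: the class `StartOffsetTracker`, the `Done` type, and the rule that startOffset advances over contiguous completed blocks are all hypothetical.

```java
import java.util.Comparator;
import java.util.PriorityQueue;
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical illustration; the real startOffset semantics may differ.
final class StartOffsetTracker {
    /** A block covering offsets [start, end) whose IO has completed. */
    static final class Done {
        final long start;
        final long end;

        Done(long start, long end) {
            this.start = start;
            this.end = end;
        }
    }

    // IO threads publish completions here without taking any lock.
    private final ConcurrentLinkedQueue<Done> completions = new ConcurrentLinkedQueue<>();
    // Touched only by the single poll thread, so it needs no synchronization.
    private final PriorityQueue<Done> waiting =
        new PriorityQueue<>(Comparator.comparingLong((Done d) -> d.start));
    // Single writer (poll thread), many readers.
    private volatile long startOffset;

    StartOffsetTracker(long initialOffset) {
        this.startOffset = initialOffset;
    }

    /** Called by any IO thread once the block [start, end) has been written. */
    void onBlockWritten(long start, long end) {
        completions.offer(new Done(start, end));
    }

    /** Called only by the poll thread: advances startOffset over contiguous completed blocks. */
    long advance() {
        Done d;
        while ((d = completions.poll()) != null) {
            waiting.add(d);
        }
        long offset = startOffset;
        while (!waiting.isEmpty() && waiting.peek().start == offset) {
            offset = waiting.poll().end;
        }
        startOffset = offset; // volatile write makes the new value visible to readers
        return offset;
    }

    long startOffset() {
        return startOffset;
    }
}
```

Because only one thread ever writes startOffset, a volatile field is enough here; IO threads that finish out of order simply park their results in the queue until the gap is filled.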

Adjusted Threading Model: The reactor master-slave threading model is now implemented as follows:

  • Master Thread: poll thread.
  • Slave Threads: IO threads.
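
A rough sketch of this master-slave arrangement follows. It is a simplification under stated assumptions: the class `PollReactor` is hypothetical, and a `LinkedBlockingQueue` stands in for the wakeup mechanism, which the actual change may implement differently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration of one poll (master) thread feeding IO (slave) threads.
final class PollReactor implements AutoCloseable {
    private final BlockingQueue<Runnable> pending = new LinkedBlockingQueue<>();
    private final ExecutorService ioPool;
    private final Thread pollThread;

    PollReactor(int ioThreads) {
        ioPool = Executors.newFixedThreadPool(ioThreads);
        pollThread = new Thread(this::pollLoop, "wal-poll");
        pollThread.start();
    }

    /** Write path: enqueue only. The writer never runs poll logic itself. */
    void submitBlock(Runnable blockWrite) {
        pending.add(blockWrite);
    }

    private void pollLoop() {
        List<Runnable> batch = new ArrayList<>();
        try {
            while (!Thread.currentThread().isInterrupted()) {
                batch.add(pending.take()); // sleeps until a writer enqueues a block
                pending.drainTo(batch);    // batch whatever else has arrived
                for (Runnable io : batch) {
                    ioPool.execute(io);    // hand off to the IO threads
                }
                batch.clear();
            }
        } catch (InterruptedException e) {
            // close() interrupts the poll thread to end the loop
        }
    }

    @Override
    public void close() {
        pollThread.interrupt();
        try {
            pollThread.join(); // stop dispatching before shutting the pool down
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        ioPool.shutdown();
    }

    /** Submits the given number of no-op block writes and returns how many ran. */
    static int demo(int blocks) {
        AtomicInteger written = new AtomicInteger();
        CountDownLatch done = new CountDownLatch(blocks);
        try (PollReactor reactor = new PollReactor(2)) {
            for (int i = 0; i < blocks; i++) {
                reactor.submitBlock(() -> {
                    written.incrementAndGet();
                    done.countDown();
                });
            }
            done.await(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return written.get();
    }
}
```

In this sketch all polling and dispatch decisions happen on one thread, so the poll path needs no lock of its own; writers and IO threads interact with it only through thread-safe queues.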

@CLFutureX CLFutureX closed this Aug 15, 2024