(D)CUBIC, NewReno and Prague refactoring #1818
base: master
Conversation
…atellite dcubic test, unused variables fix,
# Conflicts: # picoquictest/edge_cases.c
…imited, cc_slow_start_increase pass through
Wow! That's a lot of work. I will review later, but this is impressive. Regarding recovery: yes, that's a bug. We should make sure that recovery is properly entered for cubic, etc. But I share your concern about trying to change too much in a single PR. It is probably better to first complete the code cleanup that you are doing while keeping test results unchanged, then do a series of smaller PRs each fixing an individual issue. That way, we can separate the refactor, which should not change test results, from individual improvements, for which we can verify that the test results change in the expected ways.
I left a series of comments, mostly asking for a few comments in the code in places where we will need further action, plus a suggested name change for one of the cc functions.
I see that all tests are passing, with just a few requiring minor adjustment of the completion time target. For me, that's the key. If you address the comments below, we can probably merge that now. I would not try to solve all problems in this one large PR, because I am concerned about potential merge conflicts if we wait much longer.
We should then open specialized PRs. First, one PR to add support for recovery. Probably another to verify the "app limited" detection and reaction. And another for DCubic, with additional tests to simulate competition on the same link between Cubic (plain) and DCubic, and investigate potential issues.
    /* Otherwise, return current cwin. */
    return path_x->cwin;
}
This function is not estimating bandwidth. It is estimating a value of cwin, similar to the "target cwin" used in BBR. Maybe rename to picoquic_cc_target_cwin_estimation or some such?
That's true.
I have used the word "increase" for functions that only return the portion of increased bytes, and "update" for functions that return the CWIN.
Is picoquic_cc_update_target_cwin_estimation an option?
    /* In consecutive Slow Start. */
    return picoquic_cc_slow_start_increase(path_x, nb_delivered / PICOQUIC_HYSTART_PP_CSS_GROWTH_DIVISOR);
}
Is this meant for hystart++ support?
Yes, it is.
I copied part of my HyStart++ macros/constants into the code for the CSS version of the increase function.
So the current code defines the recommended growth divisor (= 4, as specified by the HyStart++ RFC) as a macro.
Another option is to change the in_css parameter to a double, to specify the divisor from outside the function (more flexible).
My next paper will be about HyStart++. I will share my results here. But I can't give a date, because I'm currently very busy with careful resume.
@@ -203,7 +281,7 @@ uint64_t picoquic_cc_increased_window(picoquic_cnx_t* cnx, uint64_t previous_win
     else {
         double w = (double)previous_window;
         w /= (double)PICOQUIC_TARGET_RENO_RTT;
-        w *= (cnx->path[0]->rtt_min > PICOQUIC_TARGET_SATELLITE_RTT)? PICOQUIC_TARGET_SATELLITE_RTT: cnx->path[0]->rtt_min;
+        w *= (cnx->path[0]->rtt_min > PICOQUIC_TARGET_SATELLITE_RTT) ? PICOQUIC_TARGET_SATELLITE_RTT : (double)cnx->path[0]->rtt_min;
         new_window = (uint64_t)w;
You copied the existing code, and that's fine. But it is a bit funky. It is meant to open the window faster when the RTT is long, but since then we came up with "careful resume", which solves a large part of the problem when a previous bandwidth estimate is available, and then with increases based on the peak bandwidth estimate, which also provides for rapid growth of cwin on these high-delay links. So maybe we should add a note that this code should be revisited and perhaps removed. If we keep it, it would be nice to use integer arithmetic instead of (double).
But let's keep the code as is for now. Maybe add a comment.
In our careful resume (CR) implementation this part of the code is already removed. :)
https://github.com/hfstco/picoquic/tree/cr
We submitted a paper about CR and I think it will be published in March. It focuses on CR and first measurements on satellite connections (GEO, LEO, ...). I want to share it, but I think the IEEE copyright forbids me to. Maybe I can share a pre-submission version of it. I have to take a look.
But yes, this part of the code can be removed; it just doesn't have to be yet.
 *
 * Isn't win_cubic always larger than cubic_state->W_reno?
 * If clause is unnecessary here.
 */
/* Pick the largest */
Let's review that closely. The Cubic specification assumes a set of "dimensions", such as times in seconds and windows expressed in packets. The picoquic code expects CWIN to be expressed in bytes, with a typical target being "Cwin = bandwidth * RTT", with bandwidth in bytes per second and RTT in seconds; in practice, "Cwin = bandwidth * RTTus / 1000000", since times are measured in microseconds. I think that w_reno and w_cubic are both computed in bytes, and that the test here just follows the spec. I would expect w_reno to be larger than w_cubic when the RTT is small.
Ok, I wrote this comment early in December. I can't clearly remember and have to check that again.
        cubic_state->W_reno = W_cubic * (double)path_x->send_mtu;
-       cubic_state->ssthresh = (uint64_t)(cubic_state->W_max * cubic_state->beta * (double)path_x->send_mtu);
+       cubic_state->ssthresh = (uint64_t)(cubic_state->W_max * PICOQUIC_CUBIC_BETA * (double)path_x->send_mtu);
        path_x->cwin = (uint64_t)cubic_state->W_reno;
    }
}
Not a request to change the code, but a comment. If we implement "recovery", the spurious detection could be tied to exiting recovery: either go back to avoidance through "correct spurious", or switch to avoidance with the current parameters if a packet sent after entering recovery is acked.
Ok, in that case it would be good to save the beta used for the CWIN reduction in the CC state, so the correct values can be recovered.
@@ -1502,6 +1502,8 @@ int initial_pto_srv_test()
     int has_initial;
     int has_handshake;
     uint64_t simulated_time = 0;
+    //uint64_t simulated_rtt = 20000;
+    //uint64_t simulated_pto = 4 * simulated_rtt;
I am curious. These two lines are not present in my copy of the code. Is this a code synchronization issue? We could have a merge conflict here, but the fix of the conflict is obvious.
Maybe. My IDE told me that these values are not used; that's why I commented them out. I will remove them completely, then.
I suggest you just remove them. Otherwise, this will be flagged as a merge conflict.
@@ -130,7 +130,8 @@ int l4s_prague_test()
 {
     picoquic_congestion_algorithm_t* ccalgo = picoquic_prague_algorithm;

-    int ret = l4s_congestion_test(ccalgo, 1, 3500000, 7, 1500, 0, NULL);
+    /* TODO increased max_completion_time for 100ms, because of "app limited" changes */
+    int ret = l4s_congestion_test(ccalgo, 1, 3600000, 7, 1500, 0, NULL);
That increase is probably OK.
Yes, app limited slightly increases the max_completion_time of some test cases. To be honest, I haven't checked in which CC era app limited hits.
Let's open an issue about testing of app limited. We should do a set of connection trials, some app limited and some not. Then monitor the "app limited" signals, to verify that it matches the state of the application. The issue will capture that, and then we can do a separate PR to develop the tests and correct whatever bugs surface.
    return demo_server_test(PICOHTTP_ALPN_HQ_LATEST, picoquic_h09_server_callback, NULL, satellite_test_scenario, nb_satellite_test_scenario,
-       demo_test_stream_length, 1, 0, 10750000, 0, NULL, NULL, NULL, 0);
+       demo_test_stream_length, 1, 0, 11000000, 0, NULL, NULL, NULL, 0);
}
That's not a dramatic increase, but it would be nice to understand what triggered it. Just like for the h3zero test.
Yes, as mentioned above, I will add a simple printf which tells me the timestamp at which app limited hits. So we can find out whether app limited hits at the beginning of the connection/SS and prevents CWIN from growing more quickly, or is spread over the whole connection. I will investigate.
That's why I'm careful with changes to the test suite and didn't include app limited for NewReno. I think if we understand what app limited changes in these test cases, we can fix the other failing test cases too.
If you can do it simply, yes do that. But see previous comment about opening an issue for improved testing of app limited, which we will need to do anyhow.
-   return satellite_test_one(picoquic_bbr_algorithm, 100000000, 5300000, 250, 3, 0, 0, 0, 0, 0, 0);
+   /* TODO test changed, app limited, verify. */
+   /* return satellite_test_one(picoquic_bbr_algorithm, 100000000, 5300000, 250, 3, 0, 0, 0, 0, 0, 0); */
+   return satellite_test_one(picoquic_bbr_algorithm, 100000000, 5500000, 250, 3, 0, 0, 0, 0, 0, 0);
}
That increase is probably OK. The 5.5 second time largely passes the "less than 7 seconds" bar. See previous comment about funky code for increasing CWIN faster with long delays. But the "app limited" comment is surprising. The satellite tests are not supposed to be app limited, ever. We probably need to open an issue to revisit the whole "app limited" detection.
See above.
    /* TODO check max_completion_time */
    return satellite_test_one(picoquic_prague_algorithm, 100000000, 5300000, 250, 3, 0, 0, 0, 1, 0, 0);
}
Wow. Both the dcubic and prague values appear better than BBR! And yes, adding these tests is a good idea.
I have added them to cover the seed_cwin parts of prague and dcubic. Maybe there are other test cases that are better suited to cover these parts of the code.
Generally, my goal is to reach more than 90% code coverage.
Most of the uncovered lines in (D)CUBIC are part of the "recovery alg_state not really set/entered" problem.
First, we do not need to worry too much about test coverage of the tests themselves. By definition, the test code will include branches for potential errors in the main code; if the main code works well, these branches will not be used.
Second, my goal in the previous coverage PR was to get to at least 90% lines and 75% branches. I am not sure that the version on which your PR is based includes the results of that coverage effort. That's one of the reasons why I would like to encourage you to merge this PR sooner rather than later, and open separate issues for the areas that need further work.
WORK IN PROGRESS
Refactor of the CC code. I moved many common parts to cc_common, but I would like to keep the code readable and understandable, so that what a CC algorithm is doing can be seen by just looking at its code. So, some parts stay in the CC code file even if they duplicate another CC algorithm. (Avoiding general picoquic_cc_do_congestion_avoidance(cc_algo, cc_param) functions.)
Additionally, I try to change the CWIN (and other path variables) only inside the CC code file. Most of the common CC functions return new values or increase them by a delta.
While doing some coverage runs, I recognized that CUBIC will not enter the recovery phase anyway. (alg_state isn't set to picoquic_cubic_alg_recovery at any time.) Even if we call enter_recovery(), we immediately transition to SS or CA. Of course, recovery_sequence is set. Is this intended, or can we remove the recovery state from the enum? Or set it to recovery while the recovery_sequence is set.
Changes:
TODO:
- loss filter in NewReno
- app limited for NewReno
Any feedback, changes, and help would be appreciated.