-
Notifications
You must be signed in to change notification settings - Fork 734
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Private/mmeeks/debongmore #10334
Private/mmeeks/debongmore #10334
Conversation
Wow - the more I read here, the more scared I get; some questions.
Then there are more questions:
What is this 'isIdle' condition I just renamed ? - why is the _pollTimeout related in any way to when we should close sockets ? perhaps it should have the same value - but the _pollTimeout - is the default amount of time the poll runs when there is no event - it seems totally conceptually unrelated to the amount of time we should wait before closing sockets - luckily it is plausible time. However as a constant - not a member it should not have an '_' prefix - it looks like an instance variable. What does the 'TO Criteria:' mean ? in the comment - can we make that more readable, less shouty, and so on ? Concerned really. |
net/WebSocketHandler.hpp
Outdated
@@ -674,7 +674,8 @@ class WebSocketHandler : public ProtocolHandlerInterface | |||
LOG_WRN("CheckTimeout: Timeout websocket: Ping: last " << _pingMicroS.last() << "us, avg " | |||
<< _pingMicroS.average() << "us >= " << _pingTimeout.count() << "us over " | |||
<< (int)_pingMicroS.duration() << "s"); | |||
shutdownSilent(); | |||
if( isIPType() ) | |||
shutdownSilent(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Outch, missed the internal kit usage obviously, as also not considered in the modified orig code.
Thank you!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This needs more refinement (WIP), i.e. determination whether to use or not use WS ping checkTimeout to also allow unit testing. The latter uses a local socket.
.. needs not to return true if local.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would this disconnect the internal connection between coolwsd and kit?
The internal connection should never be disconnected, even if it idles for hours. We should have no disconnection logic for them, whatsoever.
It was included in the PR #9916 description By memory, it was discussed with you and/or Ash .. i.e.
The original code had a period==timeout of 18s. This code has a period of 3s and a timeout of 2s. However, this timeout is not triggered in case of a one-off, World Ping Test - global ping test, Most other utilities I have re-checked now seem to use higher fixed values
Originally the WS ping/pong used (fixed 18s) without disconnecting. I re-reviewed the referenced documentation and the code and will followup on it in a bit. |
Yes, its about inactivity (badly named idle) and wrongly named _pollTimeout
Note that this inactivity is disrupted by WS ping, hence also caps the WS ping average timeout to 64s.
An immutable private entity like previous code's
Sorry, TO is the common abbreviation for timeout - so I should have used lower case |
8835ff0
to
5ad12dc
Compare
Rebased and adjusted the |
Considering Zato's websocket-timeouts The
Bottom line, should render our WS ping more robust. Edit: I have wrapped this up in #10337 and will provide PR. |
Adding a commit to this PR using above naming scheme, expanding the inactivity timeout to 3600s and adding a unit test. The fix for WS Ping will go into another PR addressing #10337 |
Resulted in the forced push'ed three commits
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Needs fixing as described - but we can merge this to avoid the random closure of sockets and data-loss from documents with their Kit connection cut out from under them.
<< (int)_pingMicroS.duration() << "s"); | ||
shutdownSilent(); | ||
return true; | ||
std::shared_ptr<StreamSocket> socket = _socket.lock(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why the lock and the copy when this comes in as a parameter ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm quite disturbed by closing connections on longer average pings too - where does that come from ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm quite disturbed by closing connections on longer average pings too - where does that come from ?
I described the source (and idea) of WS ping TO instrument #10334 (comment) as well as wrapped it up in #10337
I am currently about to finish my draft PR for #10337 and will post its initial version,
currently refining the UTs for it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why the lock and the copy when this comes in as a parameter ?
the _socket.lock()
has moved up to checkTimeout
to cover for the socket->isIPType()
check, which you have correctly introduced with your change request (but not coded all out).
net/Socket.hpp
Outdated
@@ -1067,7 +1063,7 @@ class StreamSocket : public Socket, | |||
HostType hostType, ReadType readType = ReadType::NormalRead, | |||
std::chrono::steady_clock::time_point creationTime = std::chrono::steady_clock::now() ) : | |||
Socket(fd, type, creationTime), | |||
_pollTimeout( net::Defaults::get().SocketPollTimeout ), | |||
_inactivityTimeout( net::Defaults::get().InactivityTimeout ), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we copy this number into a member variable - rather than using it directly in the arithmetic ? surely getting that number from the defaults is far from slow - and it would then be clearer, simpler, fewer lines of code, less memory used for the socket class and so on (?). Please fix that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have cleaned up these parameter for a WIP for #10337, only querying the net::Defaults::get() once.
Due to the singleton magic nature of net::Defaults::get() (for UTs) it must be queried
to get the thread-safe single instance.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But then we have a perfectly good way of changing defaults for UTs that I highlighted in reviews (configure()
), which was ignored in favor or introducing this (unnecessary) net::Defaults
(and a new function configNet()
).
I really don't understand why we need configNet()
that takes net::Defaults
when configure()
does that in a homogeneous way, vis-a-vis configuration options. This new approach doesn't make these values "configurable", it creates a parallel way of changing them, that is orthogonal to the configuration file. Not a good idea. Also, net::Defaults
has all-public members, so there is absolutely no thread safety here.
It's very easy to go in the wrong direction, with uninformed changes, thinking we're making major improvement. Better to understand the current logic and reasoning, before replacing it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will put these changes in this branch (simplified net::Defaults properties cache).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, @mmeeks. Looks much simpler, although still more complex than my original suggestion to stick to checkTimeout()
as much as possible.
It would be great to get a confirmation if this fixes a potential disconnection of the kits. Need to know if I should stop the hunt or not.
net/WebSocketHandler.hpp
Outdated
@@ -674,7 +674,8 @@ class WebSocketHandler : public ProtocolHandlerInterface | |||
LOG_WRN("CheckTimeout: Timeout websocket: Ping: last " << _pingMicroS.last() << "us, avg " | |||
<< _pingMicroS.average() << "us >= " << _pingTimeout.count() << "us over " | |||
<< (int)_pingMicroS.duration() << "s"); | |||
shutdownSilent(); | |||
if( isIPType() ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if (isIPType())
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
revised and fixed with my amended commit 5ad12dc#diff-ee87941216cc8c8f64d6d5d5a0c1e50f07cb88bbc5f5295fbb1740e90a678fb1R679
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would this disconnect the internal connection between coolwsd and kit?
yes, internal connections are of type unix to the LocalServerSocket, validated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Re net::Default configuration:
- Its configuration via
configNet
UT callback was added in parallel since it is beyond the WSD scope,
where the general WSD configuration resides. - Lack of thread sync is due to its lifecycle the same reason as WSD configuration,
i.e. we callconfigNet
pre-creation of sockets etc. - It is correct that we had duplication of maximum sockets (ahem), which will be revised due to changed semantics. However, the build global will influence this metric. Not too great, I agree.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
WIP: Dropping the net::Defaults instance and configNet
(using configure
) as requested.
net/WebSocketHandler.hpp
Outdated
@@ -674,7 +674,8 @@ class WebSocketHandler : public ProtocolHandlerInterface | |||
LOG_WRN("CheckTimeout: Timeout websocket: Ping: last " << _pingMicroS.last() << "us, avg " | |||
<< _pingMicroS.average() << "us >= " << _pingTimeout.count() << "us over " | |||
<< (int)_pingMicroS.duration() << "s"); | |||
shutdownSilent(); | |||
if( isIPType() ) | |||
shutdownSilent(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would this disconnect the internal connection between coolwsd and kit?
The internal connection should never be disconnected, even if it idles for hours. We should have no disconnection logic for them, whatsoever.
net/Socket.hpp
Outdated
@@ -1067,7 +1063,7 @@ class StreamSocket : public Socket, | |||
HostType hostType, ReadType readType = ReadType::NormalRead, | |||
std::chrono::steady_clock::time_point creationTime = std::chrono::steady_clock::now() ) : | |||
Socket(fd, type, creationTime), | |||
_pollTimeout( net::Defaults::get().SocketPollTimeout ), | |||
_inactivityTimeout( net::Defaults::get().InactivityTimeout ), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But then we have a perfectly good way of changing defaults for UTs that I highlighted in reviews (configure()
), which was ignored in favor or introducing this (unnecessary) net::Defaults
(and a new function configNet()
).
I really don't understand why we need configNet()
that takes net::Defaults
when configure()
does that in a homogeneous way, vis-a-vis configuration options. This new approach doesn't make these values "configurable", it creates a parallel way of changing them, that is orthogonal to the configuration file. Not a good idea. Also, net::Defaults
has all-public members, so there is absolutely no thread safety here.
It's very easy to go in the wrong direction, with uninformed changes, thinking we're making major improvement. Better to understand the current logic and reasoning, before replacing it.
We discussed this exact point privately, yes. Let me quote myself:
So it would seem I didn't quite agree that "pings are the only reliable alive-checking instrument." Indeed, we already detect closed sockets, in To repeat what I've been advocating:
But most certainly we have to be extremely vigilant not to close WebSocket connections blindly and without justification, which are the lifelines of documents and their life-cycle. Hope this clarifies where I stand. Happy to discuss where there is a mismatch between my understanding/position and the requirements, of course. |
Yup.
We had this in place for replies. Now we have general client inactivity (set to 1 hour, with revised commits in this branch). The idle connection criteria was part of the requirement. This metric is checked in
In line with proposal #10337 and described here above.
This is a great addition. Hence we would want an addition property like
Yes, this is covered with the WS ping timeout within its This overall slow connection criteria was part of the requirement.
Yup, thank you. Revising the work within this branch. |
failed desktop cypress test |
Change-Id: Ifc18ae945f9946e511a394dd92d4e884e6d3ae6d Signed-off-by: Michael Meeks <[email protected]>
We have one to the Kit process - that is allowed to take way longer than the ping time to respond, and it was not clear that this new close behavior on ping/ping was introduced in: e8a3ceb Which also appears to add a 2 second non-responding websocket ping timeout for some reason (!?). Signed-off-by: Michael Meeks <[email protected]> Signed-off-by: Sven Göthel <[email protected]> Change-Id: Iad74f253bdac5636757b130b299b5deacda658db
45c2005
to
61f5b7a
Compare
Rebased and added commit 4ceaeb0 |
61f5b7a
to
4ceaeb0
Compare
Fixed CI failure mode in unit tests - only execute executable coolwsd ;-) but ... same Cypress test fails locally - interestingly. |
4e1133f
to
ffae823
Compare
thank you! I got this message too late after my push .. outch, @mmeeks please be so kind and re-push, thank you. duck and hide |
Avoiding this: + test -s ./coolwsd + echo 'Cleaning up...' Cleaning up... + ./coolwsd --disable-cool-user-checking --cleanup /tmp/jenkins9134351613063228334.sh: line 41: ./coolwsd: Permission denied Signed-off-by: Michael Meeks <[email protected]> Change-Id: I17886f1198fedd51e7b90ce1759abfcd0323835c
ffae823
to
89d229e
Compare
My cleanup work continues here #10355 |
Summary
TODO
Checklist
make prettier-write
and formatted the code.make check
make run
and manually verified that everything looks okay