Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merge flush-in-batch capability #40

Merged
merged 52 commits into from
Dec 4, 2024
Merged

Merge flush-in-batch capability #40

merged 52 commits into from
Dec 4, 2024

Conversation

BLuedtke
Copy link
Collaborator

@BLuedtke BLuedtke commented Nov 26, 2024

Closes #39

So far, when bidib_flush was called (bidib_transmission_send.c), the content of the send buffer would be written byte-by-byte. This results in one write system call per byte (see the bidib_serial_port_write function). However, write is perfectly capable of writing more than one byte at a time.

I therefore added a write_n ("write n bytes") analogue to the write call we have used so far. However, we cannot simply pass the buffer and the buffer size to write_n as-is, because bytes for escaping and the crc (checksum) have to be inserted somewhere in the buffer. Various ways of dealing with this have been benchmarked, see this comment here. One of the solutions turned out to be superior, specifically the one using an auxilliary statically allocated buffer.

The old byte-by-byte flushing is still present, in case it is needed as a fallback (e.g., if the auxilliary buffer is too small, though I don't expect this to happen).

In addition to the write-stuff, I also added some logging if received packet processing takes longer than a certain threshold. Not directly related, but it was due to be added anyhow IMO.

I want to re-run the physical tests a few times before merging, but I think the code is ready to review anyway.

@BLuedtke BLuedtke linked an issue Nov 26, 2024 that may be closed by this pull request
@BLuedtke BLuedtke self-assigned this Nov 26, 2024
@BLuedtke
Copy link
Collaborator Author

BLuedtke commented Nov 27, 2024

The physical test "drive with two trains at the same time" for swtbahn-full seems relatively fragile regarding occupancy detection/misses, especially if the train's wheels are somewhat dirty. As far as I can tell, these issues are not related to the flush-in-batch changes. -> Improve physical tests.

@BLuedtke
Copy link
Collaborator Author

Issue seems to go deeper - from time to time, the whole application locks up. I don't know if its because of lock contention or something else. If any, the trackstate rwlock seems to be the problem, which is not surprising as it has to be locked extremely often. However, I can't rule out external influences of e.g. the raspberry pi. -> not sure what is actually blocking.

@BLuedtke
Copy link
Collaborator Author

Trying to analyze it with gdb -> problem doesn't occur when running via gdb. WTF.

@BLuedtke
Copy link
Collaborator Author

I rewrote the guard mechanism for the trackstate members - now there is one mutex per trackstate member, except for points and signals which are grouped together. Until now, we had one lock for the whole trackstate collection, which is very corse. The change from a read-write lock to a mutex was to keep it a bit simpler - with this more fine grained access control, I don't think we need the parallel read capability anymore for the trackstate.

@BLuedtke
Copy link
Collaborator Author

BLuedtke commented Dec 3, 2024

I implemented a good part of the trivial requested changes and marked the corresponding conversations as resolved.

@BLuedtke
Copy link
Collaborator Author

BLuedtke commented Dec 3, 2024

One more change I would like to perform:
There are 2 read-write locks, one for bidib_boards, one for bidib_trains. These collections they guard contain, as far as I can see, mostly static information parsed from the config files. However, they are defined in bidib_state.c, which is why the corresponding locks were called bidib_state_boards_rwlock and bidib_state_trains_rwlock, respectively.
The actual runtime state of entities that changes regularly is contained in bidib_track_state, also defined in bidib_state.c.

When I return to the codebase after a longer break, I'm sometimes confused by the naming - bidib_state_trains_rwlock sounds like it should guard the STATE of the trains, i.e., bidib_track_state.trains, but it guards bidib_trains. Therefore, I'd like to rename bidib_state_boards_rwlock to bidib_boards_rwlock and bidib_state_trains_rwlock to bidib_trains_rwlock. Might as well do it now, with all these changes to the locks/mutexes already done.
Will do that probably after all the previously requested changes have been adressed, to keep the timeline simpler.

@BLuedtke
Copy link
Collaborator Author

BLuedtke commented Dec 3, 2024

Tested the memory-related changes with Valgrind, fixed leaks accordingly.

@eyip002
Copy link
Member

eyip002 commented Dec 3, 2024

@BLuedtke What SWTbahn platform did you test the code on? On the SWTbahn Lite, I remember the Linux GUI freezing every so often for a few seconds, but couldn't figure out why. This was happening even when I was only navigation through folders.

@BLuedtke
Copy link
Collaborator Author

BLuedtke commented Dec 4, 2024

@BLuedtke What SWTbahn platform did you test the code on? On the SWTbahn Lite, I remember the Linux GUI freezing every so often for a few seconds, but couldn't figure out why. This was happening even when I was only navigation through folders.

So far, I've always been testing on the swtbahn-full.
I can't recall full UI freezes on the swtbahn-full, but I also suspect that the freeze/lockup we are seeing is related to the raspberry pi in one way or another. Perhaps it just overheats and throttles, perhaps it's unfortunate scheduling, maybe its when the syslog gets synchronized/written to the SD card -> syslog blocks, and so on.
We already have raspberry pi 5's here, but we haven't had the time to switch the Pi's. Also requires some setup adjustments, as for the pi 5 I'm considering switching to the raspberry pi ubuntu image instead of Raspbian. I should task Jochen with setting up the pi 5 to also work with the screen for showing the IP address, and for the VNC server setup etc..
It would still be good to know what exactly is happening, but maybe that would cost too much time. I guess we move forward with the rest of the changes and see what happens once we switch to a pi 5.

@BLuedtke
Copy link
Collaborator Author

BLuedtke commented Dec 4, 2024

One more thing I could try is to switch to the realtime kernel and see if the freezes still occur.

@BLuedtke
Copy link
Collaborator Author

BLuedtke commented Dec 4, 2024

the heartbeat thread/log causes the tests to take a bit longer, as when joining it might be in a sleep(2). Will see how that could be fixed.

@BLuedtke
Copy link
Collaborator Author

BLuedtke commented Dec 4, 2024

Heartbeat problem fixed by dividing the long sleep into several 0.1s ones and checking if bidib is still running.

The read-write locks I wrote about earlier have been renamed.

I think that's all for this pull request.

@eyip002
Copy link
Member

eyip002 commented Dec 4, 2024

Changes look good

@eyip002 eyip002 merged commit 1959bd1 into master Dec 4, 2024
1 check passed
@eyip002 eyip002 deleted the 39-flush-in-batch branch December 4, 2024 18:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Flush in batch instead of byte-by-byte
2 participants