-
Notifications
You must be signed in to change notification settings - Fork 63
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: Fix a segfault when using sendDataOnExit with Linux on Docker. #2018
Conversation
Codecov Report
@@ Coverage Diff @@
## main #2018 +/- ##
==========================================
- Coverage 80.23% 80.22% -0.01%
==========================================
Files 403 403
Lines 24864 24864
Branches 2993 2994 +1
==========================================
- Hits 19949 19948 -1
Misses 4134 4134
- Partials 781 782 +1
|
...edApplications/Common/MultiFunctionApplicationHelpers/MultiFunctionApplicationHelpers.csproj
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems like a safe way to handle the shutdown. Good job narrowing the problem down to the one sampler. I'm guessing that the problem occurs when trying to subscribe to the EventPipe after the process has started to shutdown. If you can reproduce the problem without the New Relic agent, it may be worth opening a ticket with Microsoft to see if they can add more protections to the EventPipe.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good! I suspect there are other places where we can start something up during shutdown that we should keep an eye out for.
Description
The repro provided by the customer demonstrates one possible way the issue can occur. In the repro, sendDataOnExit is enabled and StartAgent API call is used to get the agent running. Immediately after starting the SetApplicationName API is called and then the app shuts down. This triggers a segfault in Docker running Linux (kubernetes as well). In the case of the repro, the agent gets connected and then the name change triggers the reconnect so that agent is not connected anymore. sendDataOnExit forces Connect to be synchronous as well.
I narrowed down the trouble to the clean shutdown code that attempts to send data and then disconnect - this is called by default. The aggregators then check for sendDataOnExit and if it is present they try and harvest and send data. By adding a boolean field to track the very basic connected state, the clean shutdown only occurs if we are connected.I narrowed the issue down to the GCSamplerNetCore class, at least indirectly. When the agent is shutting down we make config changes as well as requesting various services to stop, include the types derived from AbstractSampler (a ConfigurationBasedService). The problem I was seeing is that GCSamplerNetCore was stopped, but then it was started agent moments later. The log files ends with "EnableEvents" begin called and a segfault occurs. It looks like the service is starting again due to the base class (AbstractSampler) still returning "true" for Enabled. Disabling the samplers via config did not help with this in my repro.
To address this, I updated the AbstractSampler.Enabled property to also check "Agent.IsAgentShuttingDown" which is set immediately after Shutdown is called. This way, even if the config changes have not made it to every sampler, it will still return disabled and not attempt to start agent. This worked and I was able to remove my previous "connected" changes.
Author Checklist
Reviewer Checklist