
fix: Fix a segfault when using sendDataOnExit with Linux on Docker. #2018

Merged · 4 commits · Nov 2, 2023

Conversation

@jaffinito (Member) commented Oct 31, 2023

Description

The repro provided by the customer demonstrates one way the issue can occur. In the repro, sendDataOnExit is enabled and the StartAgent API call is used to get the agent running. Immediately after starting, the SetApplicationName API is called and then the app shuts down. This triggers a segfault when running Linux in Docker (and in Kubernetes). In the repro, the agent connects, and the name change then triggers a reconnect, so the agent is no longer connected. sendDataOnExit also forces Connect to be synchronous.
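The repro flow described above can be sketched roughly as follows. This is a minimal sketch, not the customer's actual code: it assumes sendDataOnExit has been enabled via agent configuration, and the application name ("RenamedApp") is made up. StartAgent and SetApplicationName are the public agent API calls named in the description.

```csharp
using NewRelic.Api.Agent;

class Program
{
    static void Main()
    {
        // Start the agent explicitly rather than waiting for the first transaction.
        NewRelic.Api.Agent.NewRelic.StartAgent();

        // Changing the application name causes the agent to disconnect and reconnect.
        NewRelic.Api.Agent.NewRelic.SetApplicationName("RenamedApp");

        // The app exits immediately. With sendDataOnExit enabled, the shutdown
        // path tries to harvest and send data while the agent is mid-reconnect,
        // which triggered the segfault on Linux in Docker/Kubernetes.
    }
}
```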

I first narrowed the trouble down to the clean shutdown code that attempts to send data and then disconnect; this runs by default. The aggregators then check for sendDataOnExit and, if it is set, try to harvest and send data. By adding a boolean field to track a very basic connected state, the clean shutdown only occurred if we were connected.

I then narrowed the issue down to the GCSamplerNetCore class, at least indirectly. When the agent is shutting down, we make config changes and request that various services stop, including the types derived from AbstractSampler (a ConfigurationBasedService). The problem I was seeing is that GCSamplerNetCore was stopped, but was then started again moments later. The log file ends with "EnableEvents" being called, and then a segfault occurs. It looks like the service starts again because the base class (AbstractSampler) still returns true for Enabled. Disabling the samplers via config did not help in my repro.

To address this, I updated the AbstractSampler.Enabled property to also check Agent.IsAgentShuttingDown, which is set immediately after Shutdown is called. This way, even if the config changes have not reached every sampler, Enabled still returns false and the sampler does not attempt to start again. This worked, and I was able to remove my previous "connected" changes.

  • Updates AbstractSampler.Enabled to also check Agent.IsAgentShuttingDown as part of its conditional.
  • Adds a finest-level log message in AgentManager to note that a "clean" shutdown is occurring (I liked the extra detail, but it's optional).
  • Updates the Scheduler.ExecuteEvery "existing timer" message to specify the timer it is stopping.
  • Updates MultiFunctionApplicationHelpers.csproj to not run the NugetAudit. It was blocking my builds.
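The core of the change listed above can be sketched as follows. This is a hedged sketch, not the actual diff: AbstractSampler.Enabled and Agent.IsAgentShuttingDown are named in the description, but the constructor shape, field names, and the static flag's placement are assumptions for illustration.

```csharp
public static class Agent
{
    // Assumed shape: set immediately after Shutdown is called, before config
    // changes have propagated to every ConfigurationBasedService.
    public static volatile bool IsAgentShuttingDown;
}

public abstract class AbstractSampler
{
    private readonly bool _configuredEnabled;

    protected AbstractSampler(bool configuredEnabled)
    {
        _configuredEnabled = configuredEnabled;
    }

    // Before the fix this reflected only configuration, so a sampler such as
    // GCSamplerNetCore could be restarted moments after being stopped during
    // shutdown. Also checking the shutdown flag keeps the sampler disabled
    // even when the config change has not reached it yet.
    public virtual bool Enabled => _configuredEnabled && !Agent.IsAgentShuttingDown;
}
```

The design point is that the shutdown flag is a single, immediately visible signal, so no sampler depends on config propagation ordering to know the process is exiting.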

Author Checklist

  • Unit tests, Integration tests, and Unbounded tests completed
  • Performance testing completed with satisfactory results (if required)

Reviewer Checklist

  • Perform code review
  • Pull request was adequately tested (new/existing tests, performance tests)

@codecov-commenter commented Oct 31, 2023

Codecov Report

Merging #2018 (c59f8cd) into main (4b75587) will decrease coverage by 0.01%.
The diff coverage is 0.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #2018      +/-   ##
==========================================
- Coverage   80.23%   80.22%   -0.01%     
==========================================
  Files         403      403              
  Lines       24864    24864              
  Branches     2993     2994       +1     
==========================================
- Hits        19949    19948       -1     
  Misses       4134     4134              
- Partials      781      782       +1     
Files                                                  Coverage Δ
...nt/NewRelic/Agent/Core/Samplers/AbstractSampler.cs  96.15% <0.00%> (-3.85%) ⬇️
src/Agent/NewRelic/Agent/Core/Time/Scheduler.cs        75.34% <0.00%> (ø)

... and 1 file with indirect coverage changes

@jaffinito jaffinito marked this pull request as ready for review November 1, 2023 20:38
@nrcventura (Member) left a comment
This seems like a safe way to handle the shutdown. Good job narrowing the problem down to the one sampler. I'm guessing that the problem occurs when trying to subscribe to the EventPipe after the process has started to shutdown. If you can reproduce the problem without the New Relic agent, it may be worth opening a ticket with Microsoft to see if they can add more protections to the EventPipe.

@chynesNR (Member) left a comment

Looks good! I suspect there are other places where we can start something up during shutdown that we should keep an eye out for.

@jaffinito jaffinito merged commit 3ac75a0 into main Nov 2, 2023
73 checks passed
@jaffinito jaffinito deleted the fix/docker-segfault-fix branch November 2, 2023 15:26