
fix: Fix a segfault when using sendDataOnExit with Linux on Docker. #2018

Merged · 4 commits · Nov 2, 2023

Conversation

@jaffinito (Member) commented Oct 31, 2023

Description

The repro provided by the customer demonstrates one way the issue can occur. In the repro, sendDataOnExit is enabled and the StartAgent API call is used to get the agent running. Immediately after starting, the SetApplicationName API is called and then the app shuts down. This triggers a segfault when running Linux in Docker (and in Kubernetes). In the repro, the agent connects, and the name change then triggers a reconnect, so the agent is no longer connected. sendDataOnExit also forces Connect to be synchronous.
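The repro flow described above can be sketched roughly as follows. This is a minimal sketch, not the customer's actual code: it assumes sendDataOnExit has been enabled via agent configuration, and the application name ("RenamedApp") is made up. StartAgent and SetApplicationName are the public agent API calls named in the description.

```csharp
using NewRelic.Api.Agent;

class Program
{
    static void Main()
    {
        // Start the agent explicitly rather than waiting for the first transaction.
        NewRelic.Api.Agent.NewRelic.StartAgent();

        // Changing the application name causes the agent to disconnect and reconnect.
        NewRelic.Api.Agent.NewRelic.SetApplicationName("RenamedApp");

        // The app exits immediately. With sendDataOnExit enabled, the shutdown
        // path tries to harvest and send data while the agent is mid-reconnect,
        // which triggered the segfault on Linux in Docker/Kubernetes.
    }
}
```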

I first narrowed the trouble down to the clean shutdown code that attempts to send data and then disconnect; this runs by default. The aggregators then check for sendDataOnExit and, if it is set, try to harvest and send data. By adding a boolean field to track a very basic connected state, the clean shutdown only occurred if we were connected.

I then narrowed the issue down to the GCSamplerNetCore class, at least indirectly. When the agent is shutting down, we make config changes and request that various services stop, including the types derived from AbstractSampler (a ConfigurationBasedService). The problem I was seeing is that GCSamplerNetCore was stopped, but was then started again moments later. The log file ends with "EnableEvents" being called, and then a segfault occurs. It looks like the service starts again because the base class (AbstractSampler) still returns true for Enabled. Disabling the samplers via config did not help in my repro.

To address this, I updated the AbstractSampler.Enabled property to also check Agent.IsAgentShuttingDown, which is set immediately after Shutdown is called. This way, even if the config changes have not reached every sampler, Enabled still returns false and the sampler does not attempt to start again. This worked, and I was able to remove my previous "connected" changes.

  • Updates AbstractSampler.Enabled to also check Agent.IsAgentShuttingDown as part of its conditional.
  • Adds a finest-level log message in AgentManager to note that a "clean" shutdown is occurring (I liked the extra detail, but it's optional).
  • Updates the Scheduler.ExecuteEvery "existing timer" message to specify the timer it is stopping.
  • Updates MultiFunctionApplicationHelpers.csproj to not run the NugetAudit. It was blocking my builds.
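The core of the change listed above can be sketched as follows. This is a hedged sketch, not the actual diff: AbstractSampler.Enabled and Agent.IsAgentShuttingDown are named in the description, but the constructor shape, field names, and the static flag's placement are assumptions for illustration.

```csharp
public static class Agent
{
    // Assumed shape: set immediately after Shutdown is called, before config
    // changes have propagated to every ConfigurationBasedService.
    public static volatile bool IsAgentShuttingDown;
}

public abstract class AbstractSampler
{
    private readonly bool _configuredEnabled;

    protected AbstractSampler(bool configuredEnabled)
    {
        _configuredEnabled = configuredEnabled;
    }

    // Before the fix this reflected only configuration, so a sampler such as
    // GCSamplerNetCore could be restarted moments after being stopped during
    // shutdown. Also checking the shutdown flag keeps the sampler disabled
    // even when the config change has not reached it yet.
    public virtual bool Enabled => _configuredEnabled && !Agent.IsAgentShuttingDown;
}
```

The design point is that the shutdown flag is a single, immediately visible signal, so no sampler depends on config propagation ordering to know the process is exiting.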

Author Checklist

  • Unit tests, Integration tests, and Unbounded tests completed
  • Performance testing completed with satisfactory results (if required)

Reviewer Checklist

  • Perform code review
  • Pull request was adequately tested (new/existing tests, performance tests)

@codecov-commenter commented Oct 31, 2023

Codecov Report

Merging #2018 (c59f8cd) into main (4b75587) will decrease coverage by 0.01%.
The diff coverage is 0.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #2018      +/-   ##
==========================================
- Coverage   80.23%   80.22%   -0.01%     
==========================================
  Files         403      403              
  Lines       24864    24864              
  Branches     2993     2994       +1     
==========================================
- Hits        19949    19948       -1     
  Misses       4134     4134              
- Partials      781      782       +1     
Files                                                  Coverage Δ
...nt/NewRelic/Agent/Core/Samplers/AbstractSampler.cs  96.15% <0.00%> (-3.85%) ⬇️
src/Agent/NewRelic/Agent/Core/Time/Scheduler.cs        75.34% <0.00%> (ø)

... and 1 file with indirect coverage changes

@jaffinito jaffinito marked this pull request as ready for review November 1, 2023 20:38
@nrcventura (Member) left a comment
This seems like a safe way to handle the shutdown. Good job narrowing the problem down to the one sampler. I'm guessing that the problem occurs when trying to subscribe to the EventPipe after the process has started to shutdown. If you can reproduce the problem without the New Relic agent, it may be worth opening a ticket with Microsoft to see if they can add more protections to the EventPipe.

@chynesNR (Member) left a comment

Looks good! I suspect there are other places where we can start something up during shutdown that we should keep an eye out for.

@jaffinito jaffinito merged commit 3ac75a0 into main Nov 2, 2023
73 checks passed
@jaffinito jaffinito deleted the fix/docker-segfault-fix branch November 2, 2023 15:26