
Startup hang caused by the deployment of New Relic #2809

Open
indy-singh opened this issue Oct 4, 2024 · 24 comments
Labels
bug (Something isn't working)
community (To tag external issues and PRs)

Comments

@indy-singh

indy-singh commented Oct 4, 2024

Description

At a very high level: once we deploy New Relic (both the infra agent and the dotnet agent), we experience start-up hangs in our dotnet6/dotnet8 apps.

We ran into this problem last year (dotnet/aspnetcore#50317) but we didn't realise New Relic was the root cause until we tried to roll it out again this week (because I saw this fix) and the problem re-appeared.

Specifically, the error that is logged at the highest level is: ASP.NET Core app failed to start after 120000 milliseconds

Expected Behavior

New Relic should not cause the app to hang on start-up.

Troubleshooting or NR Diag results

It's very difficult to establish a troubleshooting workflow because the hang is very intermittent and sporadic. It almost feels like the app waits to connect to the NR backend and then times out after exactly 2 minutes.

Steps to Reproduce

Unable to reproduce at all; it only happens in production, very intermittently/sporadically.

Your Environment

Windows Server 2022 (was 2019) - experienced it on both OS versions.
dotnet8 (was dotnet6) - experienced it on both dotnet versions.
On-prem
Hosted via IIS
NewRelicDotNetAgent_10.30.0_x64.msi (deployed using DSC: ADDLOCAL=NETCoreSupport)
newrelic-infra-amd64.1.57.1.msi (deployed using DSC)

Additional context

I know it's very long and a lot to take in, but I did a ton of work in this ticket: dotnet/aspnetcore#50317

Our newrelic-infra.yml looks like this:-

# required
license_key: REDACTED
# to save cpu; we don't send process metrics, so we don't need the local agent to collect them
metrics_process_sample_rate: -1
# to save cpu; we don't use any of the plugins, so turn them off locally
disable_all_plugins: true
# to save cpu; we don't have removable drives, so turn off detection locally
win_removable_drives: false
# to save cpu; we don't care to monitor the WU state of a host
windows_updates_refresh_sec: -1
# to save cpu; we don't care to monitor the services.msc of a host
windows_services_refresh_sec: -1

Our newrelic.config looks like this:-

<?xml version="1.0"?>
<!-- Copyright (c) 2008-2020 New Relic, Inc.  All rights reserved. -->
<!-- For more information see: https://docs.newrelic.com/docs/agents/net-agent/configuration/net-agent-configuration/ -->
<configuration xmlns="urn:newrelic-config" agentEnabled="true">
	<service licenseKey="REDACTED"/>
	<application/>
	<log level="off"/>
	<allowAllHeaders enabled="true"/>
	<attributes enabled="true">
		<exclude>request.headers.cookie</exclude>
		<exclude>request.headers.authorization</exclude>
		<exclude>request.headers.proxy-authorization</exclude>
		<exclude>request.headers.x-*</exclude>
		<include>request.headers.*</include>
	</attributes>
	<transactionTracer enabled="false"/>
	<distributedTracing enabled="false"/>
	<errorCollector enabled="false"/>
	<browserMonitoring autoInstrument="false"/>
	<threadProfiling>
		<ignoreMethod>System.Threading.WaitHandle:InternalWaitOne</ignoreMethod>
		<ignoreMethod>System.Threading.WaitHandle:WaitAny</ignoreMethod>
	</threadProfiling>
	<applicationLogging enabled="false"/>
	<utilization detectAws="false" detectAzure="false" detectGcp="false" detectPcf="false" detectDocker="false" detectKubernetes="false"/>
	<slowSql enabled="false"/>
	<distributedTracing enabled="false"/>
	<codeLevelMetrics enabled="false"/>
</configuration>

Our appsettings.json looks like this:-

{
  "Logging": {
    "LogLevel": {
      "Default": "Information",
      "Log4Net": "Debug",
      "Microsoft": "Warning",
      "Microsoft.Hosting.Lifetime": "Information"
    }
  },
  "AllowedHosts": "*",
  "NewRelic.AppName": "REDACTED",
  "NewRelic.AgentEnabled": "true"
}
@indy-singh
Author

Places I can find that match 120 seconds:-

public const float MinimumSamplingDurationSeconds = 120; // 2 minutes

private static readonly TimeSpan _timeBetweenExecutions = TimeSpan.FromMinutes(2);

I don't know the architecture of the agent (yet), but I guess the most likely candidate is whichever code path is invoked during start-up.

My suspicion initially falls on

if (_configuration.CollectorSyncStartup || _configuration.CollectorSendDataOnExit)

As we keep the default settings, I think we would fall into the sync path (and therefore block).

@indy-singh
Author

Hmmm, https://docs.newrelic.com/docs/apm/agents/net-agent/configuration/net-agent-configuration/#service-syncStartup says it defaults to false.

So does that mean it is https://docs.newrelic.com/docs/apm/agents/net-agent/configuration/net-agent-configuration/#service-requestTimeout, which implies that it is 120 seconds, as we keep sendDataOnExit at the default value (false)?

@nrcventura
Member

Because you are not enabling send data on exit or sync startup, the agent handles its connection logic on a separate thread pool thread via a timer that should be triggered without a delay.

_scheduler.ExecuteOnce(Connect, TimeSpan.Zero);

I don't think any of the other duration settings that you found in the code are related to this problem, because those things, with the exception of the connect step (which happens on a different thread), all require the agent to complete the connect process first.
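
Put differently, the startup path is roughly this shape (a simplified sketch stitched together from the two snippets quoted in this thread, not the agent's exact code):

// Rough sketch only (not the real implementation): how the two settings decide
// whether the connect happens on the caller's thread or on a scheduler thread.
public void Start()
{
    if (_configuration.CollectorSyncStartup || _configuration.CollectorSendDataOnExit)
    {
        // Sync path: connect on the calling thread, so a slow or unreachable
        // collector could hold up application startup.
        Connect();
    }
    else
    {
        // Default path: connect via the scheduler on a thread pool thread,
        // so application startup is not blocked waiting on the collector.
        _scheduler.ExecuteOnce(Connect, TimeSpan.Zero);
    }
}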

In order to understand what is going on we will most likely need a memory dump from a time when the application fails to start, and we would also likely need DEBUG or FINEST logs. Because these files can be potentially sensitive, the best way to get support on this issue is to work with our New Relic support team, because they have the ability to receive these types of files without having them shared publicly on GitHub. If you are unable to send these files to New Relic support, you can attempt to analyze the memory dump yourself, but we'll need to know which threads/tasks are blocked by which other threads/tasks.

Most of the startup hang/deadlock issues that I've seen are related to applications using a Configuration Manager that needs to reach out to an external system. Usually these calls to the external system are something that the agent also instruments. We have a default list of known instrumented methods that have caused this problem.

private static HashSet<string> _deferInitializationOnTheseMethods = new HashSet<string>
{
    "System.Net.Http.HttpClient.SendAsync",
    "System.Net.Http.SocketsHttpHandler.SendAsync",
    "System.Net.HttpWebRequest.SerializeHeaders",
    "System.Net.HttpWebRequest.GetResponse"
};

We also allow extending this list using an environment variable.
var methodsFromEnvVar = System.Environment.GetEnvironmentVariable("NEW_RELIC_DELAY_AGENT_INIT_METHOD_LIST");
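
Conceptually, the entries from that variable just get merged into the same set; something like this (a simplified illustration; the semicolon delimiter and trimming here are assumptions rather than the exact parsing):

// Illustration only (not the exact agent code): merge env-var entries into the defer list.
var methodsFromEnvVar = System.Environment.GetEnvironmentVariable("NEW_RELIC_DELAY_AGENT_INIT_METHOD_LIST");
if (!string.IsNullOrWhiteSpace(methodsFromEnvVar))
{
    foreach (var method in methodsFromEnvVar.Split(';'))
    {
        if (!string.IsNullOrWhiteSpace(method))
        {
            _deferInitializationOnTheseMethods.Add(method.Trim());
        }
    }
}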

The difficult part will be determining what to add to that list. In the linked aspnetcore issue you provided a sample startup.cs.txt file, and in that file it looks like you might be using a postgresql database with the configuration manager. If that's the case you may need to add the classes and methods defined in the sql instrumentation xml associated with the Postgres comments. Here is an example of one of those sections.
<!-- Postgres SQL Driver-->
<match assemblyName="Npgsql" className="Npgsql.NpgsqlCommand">
	<exactMethodMatcher methodName="ExecuteDbDataReader" parameters="System.Data.CommandBehavior"/>
	<exactMethodMatcher methodName="ExecuteNonQuery" />
	<exactMethodMatcher methodName="ExecuteScalar" />
</match>
<match assemblyName="Npgsql" className="Npgsql.NpgsqlCommand" minVersion="4.1">
	<exactMethodMatcher methodName="ExecuteReader" parameters="System.Data.CommandBehavior" />
</match>
</tracerFactory>

@indy-singh
Author

Hi @nrcventura ,

Appreciate the detailed reply.

In order to understand what is going on we will most likely need a memory dump from a time when the application fails to start, and we would also likely need DEBUG or FINEST logs.

I would love to get a memory dump. Unfortunately, this is pretty hard because we only know it's hung after the deploy has failed and the process has been automatically killed off by IIS because it is deemed "unhealthy". I can change pretty much anything, but it's not something I can do quickly as it would affect the entire CI/CD pipeline.

I did however push out the change to the logging late on Friday, and this morning it has happened twice already.

<log level="all" maxLogFileSizeMB="100"/>

Last log:-

[2024-10-07T09:38:00.405Z, PID: 31444] [aspnetcorev2.dll] ASPNET_CORE_GLOBAL_MODULE::Terminate

Timeout log:-

[2024-10-07T09:37:59.247Z, PID: 31444] [aspnetcorev2_inprocess.dll] InvalidOperationException 'Managed server didn't initialize after 120000 ms.' caught at D:\a\_work\1\s\src\Servers\IIS\AspNetCoreModuleV2\InProcessRequestHandler\inprocessapplication.cpp:489 

First log:-

[2024-10-07T09:35:59.161Z, PID: 31444] [aspnetcorev2.dll] Initializing logs for 'C:\Program Files\IIS\Asp.Net Core Module\V2\aspnetcorev2.dll'. Process Id: 31444. File Version: 18.0.24141.6. Description: IIS ASP.NET Core Module V2. Commit: .

Most of the startup hang/deadlock issues that I've seen are related to applications using a Configuration Manager that needs to reach out to an external system.

Thanks, that gives me a thread to pull on. We do store our config keys etc. in PG, and all apps reach out to PG at startup time to get various details, settings etc.
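
For anyone following along, the shape of what our apps do at startup is roughly the following (made-up names and a simplified sketch, not our actual provider, but it shows why those Npgsql methods get hit before the app is fully up):

// Illustrative sketch only (hypothetical names, not our real code): a configuration
// provider that reads key/value settings from PostgreSQL while the host is being built,
// i.e. exactly the kind of startup-time Npgsql call the agent instruments.
using Microsoft.Extensions.Configuration;
using Npgsql;

public sealed class PgSettingsSource : IConfigurationSource
{
    public IConfigurationProvider Build(IConfigurationBuilder builder) => new PgSettingsProvider();
}

public sealed class PgSettingsProvider : ConfigurationProvider
{
    public override void Load()
    {
        using var connection = new NpgsqlConnection("Host=REDACTED;Database=REDACTED");
        connection.Open(); // Npgsql.NpgsqlConnection.Open

        using var command = new NpgsqlCommand("SELECT key, value FROM settings", connection);
        using var reader = command.ExecuteReader(); // Npgsql.NpgsqlCommand.ExecuteReader
        while (reader.Read()) // Npgsql.NpgsqlDataReader.Read
        {
            Data[reader.GetString(0)] = reader.GetString(1);
        }
    }
}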

I'll try adding each method one by one to NEW_RELIC_DELAY_AGENT_INIT_METHOD_LIST and see if the failures stop (this may take some time given the intermittent nature of the problem).

I've attached the relevant logs from the most recent failure.

newrelic_agent_Allium.WebApp_099_31444.zip

Cheers,
Indy

@indy-singh
Author

I've pushed out

[System.Environment]::SetEnvironmentVariable('NEW_RELIC_DELAY_AGENT_INIT_METHOD_LIST', 'Npgsql.NpgsqlCommand.ExecuteReader', [System.EnvironmentVariableTarget]::Machine)

To all of our DEMO hosts.

I guess if we don't get another error for a week we can be somewhat confident it might be that.

Hard to prove.

@nrcventura
Member

Thanks for the update. Yes, these types of problems are very difficult to reproduce. You might be able to reproduce the problem more reliably if you enabled syncStartup, which is one of the settings that you previously found.

I suspect that you may also need the NpgSql.Connection.Open method.

@indy-singh
Author

indy-singh commented Oct 7, 2024

Just happened again, updated environment variable to:-

[System.Environment]::SetEnvironmentVariable('NEW_RELIC_DELAY_AGENT_INIT_METHOD_LIST', 'Npgsql.NpgsqlCommand.ExecuteReader;Npgsql.NpgsqlConnection.Open', [System.EnvironmentVariableTarget]::Machine)

@indy-singh
Author

Happened twice this morning, updated environment variable to:-

$methods = @(
    "Npgsql.NpgsqlConnection.Open"
    "Npgsql.NpgsqlCommand.ExecuteReader"
    "Npgsql.NpgsqlDataReader.Read"
    "Npgsql.NpgsqlConnection.OpenAsync"
    "Npgsql.NpgsqlCommand.ExecuteReaderAsync"
    "Npgsql.NpgsqlDataReader.ReadAsync"    
);

$concat = $methods -join ';';

[System.Environment]::SetEnvironmentVariable("NEW_RELIC_DELAY_AGENT_INIT_METHOD_LIST", $concat, [System.EnvironmentVariableTarget]::Machine);

I'm not convinced the start-up hang is caused by fetching the configuration from PG, so I'm including all of the methods that can be used on that path. This should help us rule it in or out.

Back to the waiting game!

Cheers,
Indy

@indy-singh
Author

Tiny update from me;

  • I tried syncStartup for local-dev reproduction and got absolutely nowhere.
  • I tried to block the domains below using the Windows HOSTS file (collected using ipconfig /displaydns), but the dot-net agent was fine and simply logged retries with some sort of back-off strategy, so that rules out the NR backend <-> dot-net agent communication :-
0.0.0.0 collector-eu01.cell.eu.nr-data.net
0.0.0.0 collector.eu01.nr-data.net
0.0.0.0 fastly-tls12-collector.eu01.nr-data.net
0.0.0.0 fastly-tls12-infra-api.eu.newrelic.com
0.0.0.0 infra-api.cell.eu.nr-data.net
0.0.0.0 infra-api.eu.newrelic.com
0.0.0.0 infrastructure-command-api.eu.newrelic.com
0.0.0.0 newrelic.com
0.0.0.0 tls12.eu.newrelic.com.cdn.cloudflare.net
  • I did notice some HTTP 409s in the dot-net agent log; not sure if that's related/expected? They don't line up with any start-up hangs, but seeing a 409 in the wild is rare enough for me to note it.
  • I did notice some 5xx in the infra agent log, but I guess that is to be expected to some extent.
  • I peppered the CI/CD deploy logs with Get-Date -Format "o" and hope to get an exact timeline going. I've got a theory that the presence of the NR agent(s) is enough to cause the start-up time to balloon and it only tips over 120 seconds sometimes.
  • I'm basically ignoring the infra agent for now, because the only apps that are affected are the dotnet8 apps (we have some legacy 472 apps too, but we only installed the .NET Core support component of the agent).

Cheers,
Indy

@nrcventura
Member

If you are not seeing the problem when syncStartup is enabled, then I'm guessing that you are not running into the deadlock problem that we've seen in the past. I opened an issue to add a feature to just disable the usage of the ConfigurationManager altogether so that we can more easily rule that problem in or out before dedicating more time to identify what to add to that delayed initialization list.

I have seen other weird cases in certain environments where enabling agent logs higher than debug or info level has slowed down application startup enough (in those environments) to trigger the default aspnetcore app startup timeout. I don't think this is likely to be the problem that you are experiencing, but https://docs.newrelic.com/docs/apm/agents/net-agent/configuration/net-agent-configuration/#log-enabled explains how to disable our logging system (this does not affect capturing and forwarding agent log data).

Another situation where we have seen deadlocks occur is within the agent's usage of the .net Event Pipe, which we use to collect garbage collection metrics for the application. The deadlock that we've seen occurs (only occasionally) when we try to unsubscribe our EventListener from the Event Pipe EventSource. The collection of this data can be disabled, and doing so has resolved some similar problems (usually noticed when the application is trying to end). To accomplish that you can add <add key="NewRelic.EventListenerSamplersEnabled" value="false" /> to the appSettings element in the New Relic config file (we don't have an environment variable implemented for this). This is the thing that I would try next.

@indy-singh
Author

So, I don't want to jinx anything, but since I added:-

$methods = @(
    "Npgsql.NpgsqlConnection.Open"
    "Npgsql.NpgsqlCommand.ExecuteReader"
    "Npgsql.NpgsqlDataReader.Read"
    "Npgsql.NpgsqlConnection.OpenAsync"
    "Npgsql.NpgsqlCommand.ExecuteReaderAsync"
    "Npgsql.NpgsqlDataReader.ReadAsync"    
);

$concat = $methods -join ';';

[System.Environment]::SetEnvironmentVariable("NEW_RELIC_DELAY_AGENT_INIT_METHOD_LIST", $concat, [System.EnvironmentVariableTarget]::Machine);

We've not had a start-up hang. I'm not saying that I think this is solved, but as time goes on my confidence does increase 🤞🏽

Cheers,
Indy

@indy-singh
Author

indy-singh commented Oct 15, 2024

Small update: we've not seen the exact same start-up hang again. So I'm getting closer to saying it looks like this particular issue (getting external configuration from PG) is resolved. Right now NR is only deployed to our DEMO environment; the real test will be getting it onto our LIVE environment (the hang was more frequent when we first attempted the NR deployment last year).

If we get to the end of the week without another start-up hang, I'll spend next week getting onto a LIVE environment and see what happens.

Tangentially: we've had two instances of the same application hanging after the app had booted, while warming up other external services (in this case BigQuery/GetTable). I don't think it is NR, but I don't really believe in coincidences, so I thought I'd mention it here just in case I need to keep the issue open a bit longer.

Cheers,
Indy

@nrcventura
Member

Thank you for the updates. I'm not aware of any problems with someone using BigQuery, but we also do not currently support out of the box instrumentation for BigQuery, so I do not know what could be the problem there.

@indy-singh
Author

Argh!

The start-up hang has happened again.

From event viewer:-

[2024-10-16T06:11:56.536Z, PID: 8672] [aspnetcorev2_inprocess.dll] InvalidOperationException 'Managed server didn't initialize after 120000 ms.' caught at D:\a\_work\1\s\src\Servers\IIS\AspNetCoreModuleV2\InProcessRequestHandler\inprocessapplication.cpp:489 

I have a NewRelic.Profiler.8672.log, but I don't have the rotation-based agent log for [pid: 8672]. The last 10 lines in the rotation-based log file (newrelic_agent_VehicleData.WebApp_029.log) are:-

2024-10-16 06:09:13,931 NewRelic  DEBUG: [pid: 19716, tid: 111] Request(cce3a86f-56b1-4ae7-8a55-6043b9477b3d): Invocation of "analytic_event_data" yielded response : {}
2024-10-16 06:09:13,931 NewRelic FINEST: [pid: 19716, tid: 111] Transaction Event harvest finished.
2024-10-16 06:09:15,720 NewRelic FINEST: [pid: 19716, tid: 96] Error Event harvest starting.
2024-10-16 06:09:15,720 NewRelic FINEST: [pid: 19716, tid: 95] Custom Event harvest starting.
2024-10-16 06:09:15,893 NewRelic FINEST: [pid: 19716, tid: 92] Trx Noop: Attempting to execute NewRelic.Providers.Wrapper.Sql.OpenConnectionAsyncWrapper found from InstrumentedMethodInfo: Method: Npgsql.NpgsqlConnection, Npgsql, Version=8.0.5.0, Culture=neutral, PublicKeyToken=5d8b90d52f46fda7:OpenAsync(System.Threading.CancellationToken), RequestedWrapperName: OpenConnectionTracerAsync, IsAsync: False, RequestedMetricName: , RequestedTransactionNamePriority: , StartWebTransaction: False
2024-10-16 06:09:15,893 NewRelic FINEST: [pid: 19716, tid: 92] Trx Noop: No transaction, skipping method Npgsql.NpgsqlConnection.OpenAsync(System.Threading.CancellationToken)
2024-10-16 06:09:15,893 NewRelic FINEST: [pid: 19716, tid: 92] Trx Noop: Attempting to execute NewRelic.Providers.Wrapper.Sql.OpenConnectionWrapper found from InstrumentedMethodInfo: Method: Npgsql.NpgsqlConnection, Npgsql, Version=8.0.5.0, Culture=neutral, PublicKeyToken=5d8b90d52f46fda7:Open(System.Boolean,System.Threading.CancellationToken), RequestedWrapperName: OpenConnectionTracer, IsAsync: False, RequestedMetricName: , RequestedTransactionNamePriority: , StartWebTransaction: False
2024-10-16 06:09:15,893 NewRelic FINEST: [pid: 19716, tid: 92] Trx Noop: No transaction, skipping method Npgsql.NpgsqlConnection.Open(System.Boolean,System.Threading.CancellationToken)
2024-10-16 06:09:15,918 NewRelic FINEST: [pid: 19716, tid: 92] Trx Noop: Attempting to execute NewRelic.Providers.Wrapper.Sql.SqlCommandWrapper found from InstrumentedMethodInfo: Method: Npgsql.NpgsqlCommand, Npgsql, Version=8.0.5.0, Culture=neutral, PublicKeyToken=5d8b90d52f46fda7:ExecuteReader(System.Data.CommandBehavior), RequestedWrapperName: SqlCommandTracer, IsAsync: False, RequestedMetricName: , RequestedTransactionNamePriority: , StartWebTransaction: False
2024-10-16 06:09:15,918 NewRelic FINEST: [pid: 19716, tid: 92] Trx Noop: No transaction, skipping method Npgsql.NpgsqlCommand.ExecuteReader(System.Data.CommandBehavior)

NewRelic.Profiler.8672.zip

Now looking into the other avenues of investigation.

Cheers,
Indy

@jaffinito
Member

The profiler log is only going to have data for the profiler, but we will also want to see the managed agent logs. Managed agent logs can contain information that is potentially sensitive, especially at higher log levels. The best way for us to get these files would be to work with our New Relic support team, since they have the ability to receive these types of files without having them shared publicly on GitHub.

@indy-singh
Author

indy-singh commented Oct 16, 2024

Hi @jaffinito,

Can you clarify what you mean by managed agent logs? I believe I uploaded those over here.

I do review the logs before I upload them. I can open a ticket through our NR account so that we can share logs if something sensitive is in them.

Cheers,
Indy

@indy-singh
Author

indy-singh commented Oct 17, 2024

Happened again tonight at

[2024-10-17T18:13:11.391Z, PID: 7804] [aspnetcorev2_inprocess.dll] InvalidOperationException 'Managed server didn't initialize after 120000 ms.' caught at D:\a\_work\1\s\src\Servers\IIS\AspNetCoreModuleV2\InProcessRequestHandler\inprocessapplication.cpp:489 

There is only a profiler log; there aren't any agent logs that pertain to 7804.

Edit: I've not had the opportunity to pursue the other avenues of investigation.

Cheers,
Indy

@indy-singh
Author

To accomplish that you can add <add key="NewRelic.EventListenerSamplersEnabled" value="false" /> to the appsettings element in the new relic config file (we don't have an environment variable implemented for this). This is the thing that I would try next.

This is now done. I've modified "C:\ProgramData\New Relic\.NET Agent\newrelic.config"; it now looks like this:-

<?xml version="1.0"?>
<!-- Copyright (c) 2008-2020 New Relic, Inc.  All rights reserved. -->
<!-- For more information see: https://docs.newrelic.com/docs/agents/net-agent/configuration/net-agent-configuration/ -->
<configuration xmlns="urn:newrelic-config" agentEnabled="true">
	<appSettings>
		<add key="NewRelic.EventListenerSamplersEnabled" value="false"/>
	</appSettings>
	<service licenseKey="REDACTED"/>
	<application/>
	<log level="all" maxLogFileSizeMB="100"/>
	<allowAllHeaders enabled="true"/>
	<attributes enabled="true">
		<exclude>request.headers.cookie</exclude>
		<exclude>request.headers.authorization</exclude>
		<exclude>request.headers.proxy-authorization</exclude>
		<exclude>request.headers.x-*</exclude>
		<include>request.headers.*</include>
	</attributes>
	<transactionTracer enabled="false"/>
	<distributedTracing enabled="false"/>
	<errorCollector enabled="false"/>
	<browserMonitoring autoInstrument="false"/>
	<threadProfiling>
		<ignoreMethod>System.Threading.WaitHandle:InternalWaitOne</ignoreMethod>
		<ignoreMethod>System.Threading.WaitHandle:WaitAny</ignoreMethod>
	</threadProfiling>
	<applicationLogging enabled="false"/>
	<utilization detectAws="false" detectAzure="false" detectGcp="false" detectPcf="false" detectDocker="false" detectKubernetes="false"/>
	<slowSql enabled="false"/>
	<distributedTracing enabled="false"/>
	<codeLevelMetrics enabled="false"/>
</configuration>

I confirmed via the logs that changing the config did something:-

2024-10-17 22:46:22,361 NewRelic   INFO: [pid: 24308, tid: 48] Reconnecting due to configuration change
2024-10-17 22:46:22,363 NewRelic FINEST: [pid: 24308, tid: 48] Sampler NewRelic.Agent.Core.Samplers.GCSamplerNetCore has been requested to stop.
2024-10-17 22:46:22,363 NewRelic FINEST: [pid: 24308, tid: 48] Sampler NewRelic.Agent.Core.Samplers.CpuSampler has been requested to stop.
2024-10-17 22:46:22,363 NewRelic FINEST: [pid: 24308, tid: 48] Sampler NewRelic.Agent.Core.Samplers.MemorySampler has been requested to stop.
2024-10-17 22:46:22,363 NewRelic FINEST: [pid: 24308, tid: 48] Sampler NewRelic.Agent.Core.Samplers.ThreadStatsSampler has been requested to stop.

On the topic of environment variables, does NEW_RELIC_DISABLE_SAMPLERS not do the same thing?

Found via grokking the sdk src:-

public bool DisableSamplers { get { return EnvironmentOverrides(_localConfiguration.application.disableSamplers, "NEW_RELIC_DISABLE_SAMPLERS"); } }
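
My read of EnvironmentOverrides is "the environment variable wins if it's set, otherwise fall back to the local config value"; roughly like this (my own sketch from skimming, not the actual implementation):

// Sketch only (not the agent's real code): env var overrides local config when present.
private static bool EnvironmentOverrides(bool localValue, string environmentVariableName)
{
    var value = System.Environment.GetEnvironmentVariable(environmentVariableName);
    return bool.TryParse(value, out var parsed) ? parsed : localValue;
}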

Cheers,
Indy

@nrcventura
Member

The DisableSamplers setting disables all of our samplers, including the CPU and memory samplers. The one I recommended disabling only disables a couple of samplers: the EventListener samplers for garbage collection metrics and threading metrics.

@indy-singh
Author

Happened again. I should explain that I've hijacked the CI/CD pipeline on weekday evenings (00 07,19-23 * * TUE-FRI) to basically redeploy our entire application tier (we call this a 'Server Recycle'), so I have a constant stream of "deploys" happening.

What's interesting is that a Server Recycle is basically an iisreset plus hitting the URLs to warm up the apps. The binaries don't change.

Before I added all the PG methods, we used to get the same startup hang on actual deploys too (i.e. the binaries have changed and we don't use iisreset; we just kill the respective pid of that w3wp.exe).

I'm beginning to think we are playing whack-a-mole: some combination of our platform code + NR causes a start-up hang some of the time.

I think our current state of affairs is that when we do an iisreset and then warm up the apps, we run the risk of the start-up hang.

To prove that, I'll take the iisreset out of our Server Recycle and see what happens.

Cheers,
Indy

@indy-singh
Author

I finally replicated it on LOCAL. THANK GOD.

We have a small console app that is invoked by our OMD/nagios setup every 1-5 seconds (I think); it looks like this (I've appended .txt to make it upload-able).

program.cs.txt

I leave that running in the background using powershell:-

while($true) {Start-Sleep -Seconds 1; .\CWDotNetDiagClient.exe;}

And then I simulate what a deploy/server recycle does with this (it's copied directly from the scripts; a lot of it is many years old):-

$WebApplicationCreatorConfigurationDirectory = "D:\code\CodeweaversWorld\src\WebApplicationCreator\LocalConfiguration";
$WebApplicationCreatorExecutable = "D:\code\CodeweaversWorld\src\WebApplicationCreator\app\IISApplicationExecutor\bin\IISApplicationExecutor.exe";
$Timeout = 480;
$AbsoluteUrl = "http://datamining.localiis/services/application.svc/warmup";

while ($true) {    
    Get-WmiObject -Namespace 'root\WebAdministration' -class 'WorkerProcess' -Filter "NOT AppPoolName LIKE '%servicemonitor%'" `
    | Select-Object -Property AppPoolName, ProcessId `
    | ForEach-Object { 
        Write-Host "Killing '$($_.AppPoolName)' with pid '$($_.ProcessId)'";
        Stop-Process -Id $_.ProcessId -Force -ErrorAction SilentlyContinue;
    };
    & "C:\Windows\System32\iisreset.exe" /stop ;
    & "$WebApplicationCreatorExecutable" "ZZZ_PLACEHOLDER" "RESET" "$WebApplicationCreatorConfigurationDirectory" "NOPROMPT";
    & "C:\Windows\System32\iisreset.exe" /start ;
    Start-Sleep -Seconds 2;
    & "$WebApplicationCreatorExecutable" "ZZZ_PLACEHOLDER" "ALL" "$WebApplicationCreatorConfigurationDirectory" "NOPROMPT";    
    Invoke-WebRequest -Uri $AbsoluteUrl -TimeoutSec $Timeout -UseBasicParsing -DisableKeepAlive -MaximumRedirection 0;
}

The clue, on re-reading the entire issue thread, was:-

The deadlock that we've seen occurs (only occasionally) when we try to unsubscribe our EventListener from the Event Pipe EventSource.

Many thanks @nrcventura for that nugget, it took a while to percolate.

I really should get to bed, maybe tonight I won't dream about this issue.

Cheerio!
Indy

@indy-singh
Author

Managed to replicate it a 2nd & 3rd time (took 185 tries) on LOCAL; this time I was quick enough to grab a memory dump.

I'll share them on Monday via the NR support ticket route.

Meanwhile I had a quick poke around using dotnet dump:-

> syncblk
Index         SyncBlock MonitorHeld Recursion Owning Thread Info          SyncBlock Owner
    2 000001E2B298B030            5         1 000001E2B3550430 413c  27   000001a235c153b0 System.Collections.Generic.List`1[[System.WeakReference`1[[System.Diagnostics.Tracing.EventSource, System.Private.CoreLib]], System.Private.CoreLib]]

> setthread 27
> clrstack
OS Thread Id: 0x413c (27)
        Child SP               IP Call Site
00000014A3AFF438 00007ffec3202f44 [HelperMethodFrame: 00000014a3aff438]
00000014A3AFF540 00007FFE141CC832 System.Threading.ThreadPool.get_CompletedWorkItemCount() [/_/src/libraries/System.Private.CoreLib/src/System/Threading/ThreadPool.Windows.cs @ 204]
00000014A3AFF570 00007FFE141CC79D System.Diagnostics.Tracing.RuntimeEventSource+<>c.<OnEventCommand>b__35_9() [/_/src/libraries/System.Private.CoreLib/src/System/Diagnostics/Tracing/RuntimeEventSource.cs @ 110]
00000014A3AFF5A0 00007FFE141CC511 System.Diagnostics.Tracing.IncrementingPollingCounter.UpdateMetric() [/_/src/libraries/System.Private.CoreLib/src/System/Diagnostics/Tracing/IncrementingPollingCounter.cs @ 57]
00000014A3AFF600 00007FFE141CC406 System.Diagnostics.Tracing.CounterGroup.ResetCounters() [/_/src/libraries/System.Private.CoreLib/src/System/Diagnostics/Tracing/CounterGroup.cs @ 206]
00000014A3AFF670 00007FFE141CC0BD System.Diagnostics.Tracing.CounterGroup.EnableTimer(Single) [/_/src/libraries/System.Private.CoreLib/src/System/Diagnostics/Tracing/CounterGroup.cs @ 158]
00000014A3AFF6D0 00007FFE141C6B18 System.Diagnostics.Tracing.CounterGroup.OnEventSourceCommand(System.Object, System.Diagnostics.Tracing.EventCommandEventArgs) [/_/src/libraries/System.Private.CoreLib/src/System/Diagnostics/Tracing/CounterGroup.cs @ 73]
00000014A3AFF750 00007FFE1396CD5A System.Diagnostics.Tracing.EventSource.DoCommand(System.Diagnostics.Tracing.EventCommandEventArgs) [/_/src/libraries/System.Private.CoreLib/src/System/Diagnostics/Tracing/EventSource.cs @ 2738]
00000014A3AFF7B0 00007FFE1396C61E System.Diagnostics.Tracing.EventSource.SendCommand(System.Diagnostics.Tracing.EventListener, System.Diagnostics.Tracing.EventProviderType, Int32, System.Diagnostics.Tracing.EventCommand, Boolean, System.Diagnostics.Tracing.EventLevel, System.Diagnostics.Tracing.EventKeywords, System.Collections.Generic.IDictionary`2<System.String,System.String>) [/_/src/libraries/System.Private.CoreLib/src/System/Diagnostics/Tracing/EventSource.cs @ 2574]
00000014A3AFF830 00007FFE1396C45B System.Diagnostics.Tracing.EventSource+OverrideEventProvider.OnControllerCommand(System.Diagnostics.Tracing.ControllerCommand, System.Collections.Generic.IDictionary`2<System.String,System.String>, Int32) [/_/src/libraries/System.Private.CoreLib/src/System/Diagnostics/Tracing/EventSource.cs @ 2438]
00000014A3AFF890 00007FFE139694C9 System.Diagnostics.Tracing.EventPipeEventProvider.HandleEnableNotification(System.Diagnostics.Tracing.EventProvider, Byte*, Byte, Int64, Int64, EVENT_FILTER_DESCRIPTOR*) [/_/src/libraries/System.Private.CoreLib/src/System/Diagnostics/Tracing/EventPipeEventProvider.cs @ 58]
00000014A3AFF8E0 00007FFE13969077 System.Diagnostics.Tracing.EventProviderImpl.ProviderCallback(System.Diagnostics.Tracing.EventProvider, Byte*, Int32, Byte, Int64, Int64, EVENT_FILTER_DESCRIPTOR*) [/_/src/libraries/System.Private.CoreLib/src/System/Diagnostics/Tracing/EventProvider.cs @ 1310]
00000014A3AFF940 00007FFE13968E14 System.Diagnostics.Tracing.EventPipeEventProvider.Callback(Byte*, Int32, Byte, Int64, Int64, EVENT_FILTER_DESCRIPTOR*, Void*) [/_/src/libraries/System.Private.CoreLib/src/System/Diagnostics/Tracing/EventPipeEventProvider.cs @ 68]

> clrstack -all
OS Thread Id: 0x26c0
        Child SP               IP Call Site
00000014A3A7C738 00007ffec31fffa4 [HelperMethodFrame_1OBJ: 00000014a3a7c738] System.Threading.Monitor.ReliableEnter(System.Object, Boolean ByRef)
00000014A3A7C880 00007FFE13902B8E System.Diagnostics.Tracing.EventSource.Initialize(System.Guid, System.String, System.String[]) [/_/src/libraries/System.Private.CoreLib/src/System/Diagnostics/Tracing/EventSource.cs @ 1619]
00000014A3A7C930 00007FFE141C873C System.Data.DataCommonEventSource..ctor()
00000014A3A7C990 00007FFE141C862D System.Data.DataCommonEventSource..cctor() [/_/src/libraries/System.Data.Common/src/System/Data/Common/DataCommonEventSource.cs @ 12]
00000014A3A7D1F8 00007ffe7345c663 [HelperMethodFrame: 00000014a3a7d1f8]
00000014A3A7D300 00007FFE141C85DA System.Data.Common.DbConnectionStringBuilder.Clear() [/_/src/libraries/System.Data.Common/src/System/Data/Common/DbConnectionStringBuilder.cs @ 283]
00000014A3A7D340 00007FFE141BA91E Npgsql.NpgsqlConnectionStringBuilder.Init()
00000014A3A7D380 00007FFE141BA239 Npgsql.NpgsqlConnectionStringBuilder..ctor()
00000014A3A7D3B0 00007FFE141BA026 Npgsql.NpgsqlConnection..cctor()
00000014A3A7DC28 00007ffe7345c663 [HelperMethodFrame: 00000014a3a7dc28]
00000014A3A7DD30 00007FFE141B9FCC Npgsql.NpgsqlConnection..ctor()
00000014A3A7DD70 00007FFE13913900 System.RuntimeType.CreateInstanceDefaultCtor(Boolean, Boolean) [/_/src/coreclr/System.Private.CoreLib/src/System/RuntimeType.CoreCLR.cs @ 3878]
00000014A3A7DDD0 00007FFE141B97DA log4net.Appender.AdoNetAppender.CreateConnection(System.Type, System.String)
00000014A3A7DE10 00007FFE141B8958 log4net.Appender.AdoNetAppender.InitializeDatabaseConnection()
00000014A3A7DE80 00007FFE13963502 log4net.Repository.Hierarchy.XmlHierarchyConfigurator.ParseAppender(System.Xml.XmlElement)
00000014A3A7DF20 00007FFE13962CFF log4net.Repository.Hierarchy.XmlHierarchyConfigurator.FindAppenderByReference(System.Xml.XmlElement)
00000014A3A7DF90 00007FFE13963330 log4net.Repository.Hierarchy.XmlHierarchyConfigurator.ParseAppender(System.Xml.XmlElement)
00000014A3A7E030 00007FFE13962CFF log4net.Repository.Hierarchy.XmlHierarchyConfigurator.FindAppenderByReference(System.Xml.XmlElement)
00000014A3A7E0A0 00007FFE139618E1 log4net.Repository.Hierarchy.XmlHierarchyConfigurator.ParseChildrenOfLoggerElement(System.Xml.XmlElement, log4net.Repository.Hierarchy.Logger, Boolean)
00000014A3A7E160 00007FFE139606A5 log4net.Repository.Hierarchy.XmlHierarchyConfigurator.ParseLogger(System.Xml.XmlElement)
00000014A3A7E1E0 00007FFE1395DC02 log4net.Repository.Hierarchy.XmlHierarchyConfigurator.Configure(System.Xml.XmlElement)
00000014A3A7E260 00007FFE1395D177 log4net.Repository.Hierarchy.Hierarchy.XmlRepositoryConfigure(System.Xml.XmlElement)
00000014A3A7E2D0 00007FFE13953004 log4net.Config.XmlConfigurator.InternalConfigure(log4net.Repository.ILoggerRepository, System.IO.Stream)
00000014A3A7E340 00007FFE13951B9F log4net.Config.XmlConfigurator.InternalConfigure(log4net.Repository.ILoggerRepository, System.IO.FileInfo)
00000014A3A7E3D0 00007FFE139515D3 log4net.Config.XmlConfigurator.InternalConfigureAndWatch(log4net.Repository.ILoggerRepository, System.IO.FileInfo)
00000014A3A7E440 00007FFE13911316 log4net.Config.XmlConfigurator.ConfigureAndWatch(System.IO.FileInfo)
00000014A3A7E4B0 00007FFE1390FEFD Codeweavers.Logging.Logger..cctor()
00000014A3A7ED68 00007ffe7345c663 [HelperMethodFrame: 00000014a3a7ed68]
00000014A3A7EE70 00007FFE1390FCA5 Codeweavers.Logging.Logger.GetLogger(System.String)
00000014A3A7EED0 00007FFE1390F451 DataMining.WebApp.Program.Main(System.String[])
OS Thread Id: 0x3180
        Child SP               IP Call Site
00000014A427F8F0 00007ffec31fffa4 [DebuggerU2MCatchHandlerFrame: 00000014a427f8f0]
OS Thread Id: 0x367c
        Child SP               IP Call Site
00000014A52CD8F8 00007ffec31fffa4 [HelperMethodFrame_1OBJ: 00000014a52cd8f8] System.Threading.Monitor.ReliableEnter(System.Object, Boolean ByRef)
00000014A52CDA40 00007FFE13902B8E System.Diagnostics.Tracing.EventSource.Initialize(System.Guid, System.String, System.String[]) [/_/src/libraries/System.Private.CoreLib/src/System/Diagnostics/Tracing/EventSource.cs @ 1619]
00000014A52CDAF0 00007FFE1396E434 System.Diagnostics.Tracing.FrameworkEventSource..ctor() [/_/src/coreclr/System.Private.CoreLib/System.Private.CoreLib.Generators/Generators.EventSourceGenerator/FrameworkEventSource.g.cs @ 12]
00000014A52CDB40 00007FFE1396E34D System.Diagnostics.Tracing.FrameworkEventSource..cctor() [/_/src/libraries/System.Private.CoreLib/src/System/Diagnostics/Tracing/FrameworkEventSource.cs @ 14]
00000014A52CE3A8 00007ffe7345c663 [HelperMethodFrame: 00000014a52ce3a8]
00000014A52CE4B0 00007FFE1396DFC5 System.Threading.ThreadPoolWorkQueue..ctor() [/_/src/libraries/System.Private.CoreLib/src/System/Threading/ThreadPoolWorkQueue.cs @ 442]
00000014A52CE570 00007FFE1396DB01 System.Threading.ThreadPool..cctor() [/_/src/libraries/System.Private.CoreLib/src/System/Threading/ThreadPoolWorkQueue.cs @ 1430]
00000014A52CEDE8 00007ffe7345c663 [HelperMethodFrame: 00000014a52cede8]
00000014A52CEEF0 00007FFE1396D940 System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1[[System.Threading.Tasks.VoidTaskResult, System.Private.CoreLib]].AwaitUnsafeOnCompleted[[System.Runtime.CompilerServices.ConfiguredValueTaskAwaitable`1+ConfiguredValueTaskAwaiter[[System.Boolean, System.Private.CoreLib]], System.Private.CoreLib]](ConfiguredValueTaskAwaiter<Boolean> ByRef, System.Runtime.CompilerServices.IAsyncStateMachineBox)
00000014A52CEF50 00007FFE1396B338 Codeweavers.Logging.Async.ChannelAsyncForwardingAppender+<ForwardLoggingEventsFromBuffer>d__25.MoveNext()
00000014A52CF130 00007FFE13968EF7 System.Runtime.CompilerServices.AsyncMethodBuilderCore.Start[[System.__Canon, System.Private.CoreLib]](System.__Canon ByRef) [/_/src/libraries/System.Private.CoreLib/src/System/Runtime/CompilerServices/AsyncMethodBuilderCore.cs @ 38]
00000014A52CF1A0 00007FFE139695A8 Codeweavers.Logging.Async.ChannelAsyncForwardingAppender.ForwardLoggingEventsFromBuffer()
00000014A52CF210 00007FFE13969199 Codeweavers.Logging.Async.ChannelAsyncForwardingAppender+<ForwardingThreadExecute>d__24.MoveNext()
00000014A52CF300 00007FFE13968EF7 System.Runtime.CompilerServices.AsyncMethodBuilderCore.Start[[System.__Canon, System.Private.CoreLib]](System.__Canon ByRef) [/_/src/libraries/System.Private.CoreLib/src/System/Runtime/CompilerServices/AsyncMethodBuilderCore.cs @ 38]
00000014A52CF370 00007FFE13968C38 Codeweavers.Logging.Async.ChannelAsyncForwardingAppender.ForwardingThreadExecute()
00000014A52CF3E0 00007FFE13968AED System.Threading.Tasks.Task`1[[System.__Canon, System.Private.CoreLib]].InnerInvoke() [/_/src/libraries/System.Private.CoreLib/src/System/Threading/Tasks/Future.cs @ 501]
00000014A52CF420 00007FFE13968990 System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object) [/_/src/libraries/System.Private.CoreLib/src/System/Threading/ExecutionContext.cs @ 179]
00000014A52CF490 00007FFE13968655 System.Threading.Tasks.Task.ExecuteWithThreadLocal(System.Threading.Tasks.Task ByRef, System.Threading.Thread) [/_/src/libraries/System.Private.CoreLib/src/System/Threading/Tasks/Task.cs @ 2345]
00000014A52CF780 00007ffe7345c663 [DebuggerU2MCatchHandlerFrame: 00000014a52cf780]
OS Thread Id: 0x413c
        Child SP               IP Call Site
00000014A3AFF438 00007ffec3202f44 [HelperMethodFrame: 00000014a3aff438]
00000014A3AFF540 00007FFE141CC832 System.Threading.ThreadPool.get_CompletedWorkItemCount() [/_/src/libraries/System.Private.CoreLib/src/System/Threading/ThreadPool.Windows.cs @ 204]
00000014A3AFF570 00007FFE141CC79D System.Diagnostics.Tracing.RuntimeEventSource+<>c.<OnEventCommand>b__35_9() [/_/src/libraries/System.Private.CoreLib/src/System/Diagnostics/Tracing/RuntimeEventSource.cs @ 110]
00000014A3AFF5A0 00007FFE141CC511 System.Diagnostics.Tracing.IncrementingPollingCounter.UpdateMetric() [/_/src/libraries/System.Private.CoreLib/src/System/Diagnostics/Tracing/IncrementingPollingCounter.cs @ 57]
00000014A3AFF600 00007FFE141CC406 System.Diagnostics.Tracing.CounterGroup.ResetCounters() [/_/src/libraries/System.Private.CoreLib/src/System/Diagnostics/Tracing/CounterGroup.cs @ 206]
00000014A3AFF670 00007FFE141CC0BD System.Diagnostics.Tracing.CounterGroup.EnableTimer(Single) [/_/src/libraries/System.Private.CoreLib/src/System/Diagnostics/Tracing/CounterGroup.cs @ 158]
00000014A3AFF6D0 00007FFE141C6B18 System.Diagnostics.Tracing.CounterGroup.OnEventSourceCommand(System.Object, System.Diagnostics.Tracing.EventCommandEventArgs) [/_/src/libraries/System.Private.CoreLib/src/System/Diagnostics/Tracing/CounterGroup.cs @ 73]
00000014A3AFF750 00007FFE1396CD5A System.Diagnostics.Tracing.EventSource.DoCommand(System.Diagnostics.Tracing.EventCommandEventArgs) [/_/src/libraries/System.Private.CoreLib/src/System/Diagnostics/Tracing/EventSource.cs @ 2738]
00000014A3AFF7B0 00007FFE1396C61E System.Diagnostics.Tracing.EventSource.SendCommand(System.Diagnostics.Tracing.EventListener, System.Diagnostics.Tracing.EventProviderType, Int32, System.Diagnostics.Tracing.EventCommand, Boolean, System.Diagnostics.Tracing.EventLevel, System.Diagnostics.Tracing.EventKeywords, System.Collections.Generic.IDictionary`2<System.String,System.String>) [/_/src/libraries/System.Private.CoreLib/src/System/Diagnostics/Tracing/EventSource.cs @ 2574]
00000014A3AFF830 00007FFE1396C45B System.Diagnostics.Tracing.EventSource+OverrideEventProvider.OnControllerCommand(System.Diagnostics.Tracing.ControllerCommand, System.Collections.Generic.IDictionary`2<System.String,System.String>, Int32) [/_/src/libraries/System.Private.CoreLib/src/System/Diagnostics/Tracing/EventSource.cs @ 2438]
00000014A3AFF890 00007FFE139694C9 System.Diagnostics.Tracing.EventPipeEventProvider.HandleEnableNotification(System.Diagnostics.Tracing.EventProvider, Byte*, Byte, Int64, Int64, EVENT_FILTER_DESCRIPTOR*) [/_/src/libraries/System.Private.CoreLib/src/System/Diagnostics/Tracing/EventPipeEventProvider.cs @ 58]
00000014A3AFF8E0 00007FFE13969077 System.Diagnostics.Tracing.EventProviderImpl.ProviderCallback(System.Diagnostics.Tracing.EventProvider, Byte*, Int32, Byte, Int64, Int64, EVENT_FILTER_DESCRIPTOR*) [/_/src/libraries/System.Private.CoreLib/src/System/Diagnostics/Tracing/EventProvider.cs @ 1310]
00000014A3AFF940 00007FFE13968E14 System.Diagnostics.Tracing.EventPipeEventProvider.Callback(Byte*, Int32, Byte, Int64, Int64, EVENT_FILTER_DESCRIPTOR*, Void*) [/_/src/libraries/System.Private.CoreLib/src/System/Diagnostics/Tracing/EventPipeEventProvider.cs @ 68]

A quick Google around leads me to
dotnet/runtime#105682
dotnet/runtime#104322

So it looks like connecting to the event pipe just as the app is initialising can cause a hang. From the dump above, it looks like thread 0x413c owns the EventSource/EventListener lock while its counter callback is stuck waiting on the ThreadPool static constructor, thread 0x367c is running the ThreadPool static constructor but is blocked initialising FrameworkEventSource (which needs that same lock), and the main thread is blocked in EventSource.Initialize (via Npgsql/log4net) behind the same lock, which matches the linked runtime issues.

What is weird is why the deployment of NR causes the start-up hang for us.

That console app we've been using has been in place since 2023-03 and we've never seen the hang without NR agent(s) being deployed.

Cheers,
Indy

@nrcventura
Member

A lot of the code in the DiagnosticSource library tries to do as little work as possible if there are no subscribers to a particular source (DiagnosticSource, EventPipe, ActivitySource, Meter, etc.). The New Relic agent enables an EventListener by default to subscribe to the EventPipe data source in order to capture garbage collection metrics (this was the primary way to capture that type of data for .net core < 3.0, which is when we started collecting that data). When .net standard 2.1 came out and .net core 3.1 was released, more APIs became available for collecting garbage collector data, but the agent still needed to support older applications. Similarly, .net 6 has a better way to collect this data, but those APIs are not readily available unless the agent can stop targeting .net standard 2.0 and switch over to targeting .net 6 or .net 8 directly (which would be a breaking change requiring a major release).

In the meantime, I've seen the EventPipe data source (when subscribed to from within an application instead of out-of-process) have its stability increase and decrease with different .net versions. At this point, we do have plans to migrate away from the EventPipe for collecting garbage collection data, but it will result in a breaking change because the data we will be able to capture will be different. We also would like to align the data that we capture with OpenTelemetry, which is what .net is increasingly aligning with in general.
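
For reference, the in-process subscription at the heart of this is essentially an EventListener along these lines (a simplified, generic sketch, not the agent's actual sampler code):

using System.Collections.Generic;
using System.Diagnostics.Tracing;

// Simplified sketch of an in-process counter/GC EventListener (illustration only, not the
// agent's actual sampler). This is the kind of subscription that setting
// NewRelic.EventListenerSamplersEnabled to false turns off.
public sealed class GcCounterListenerSketch : EventListener
{
    protected override void OnEventSourceCreated(EventSource eventSource)
    {
        if (eventSource.Name == "System.Runtime")
        {
            // Ask the runtime to publish EventCounters once per second.
            EnableEvents(eventSource, EventLevel.Informational, EventKeywords.All,
                new Dictionary<string, string> { ["EventCounterIntervalSec"] = "1" });
        }
    }

    protected override void OnEventWritten(EventWrittenEventArgs eventData)
    {
        if (eventData.EventName == "EventCounters")
        {
            // Counter payloads (e.g. gc-heap-size, gen-0-gc-count) arrive here.
        }
    }
}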

Hopefully, the problem you are experiencing goes away when the event listener samplers are disabled.
