Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RestoreTask randomly fails after upgrading to latest version #13545

Closed
akoeplinger opened this issue Jun 13, 2024 · 27 comments · Fixed by NuGet/NuGet.Client#5905
Closed

RestoreTask randomly fails after upgrading to latest version #13545

akoeplinger opened this issue Jun 13, 2024 · 27 comments · Fixed by NuGet/NuGet.Client#5905
Assignees
Labels
Functionality:Restore Priority:1 High priority issues that must be resolved in the current sprint. Type:Bug
Milestone

Comments

@akoeplinger
Copy link

akoeplinger commented Jun 13, 2024

NuGet Product Used

dotnet.exe

Product Version

6.11.0-rc.90

Worked before?

NuGet in dotnet sdk 9.0-preview4

Impact

It's more difficult to complete my work

Repro Steps & Context

Since we updated to a dotnet 9.0-preview6 SDK we started seeing RestoreTask failing randomly in the VMR build: dotnet/sdk#41477 (comment)

After the fix in NuGet/NuGet.Client#5845 we bumped again (i.e. we're now using a very recent nuget.client) we now see these messages

D:\a_work\1\s.dotnet\sdk\9.0.100-preview.6.24311.23\NuGet.targets(170,5): The feed 'dotnet-public [https://pkgs.dev.azure.com/dnceng/public/_packaging/dotnet-public/nuget/v3/index.json]' lists package 'System.Text.Json.7.0.3' but multiple attempts to download the nupkg have failed. The feed is either invalid or required packages were removed while the current operation was in progress. Verify the package exists on the feed and try again. [D:\a_work\1\s\src\nuget-client\build\restorehelper.targets]

or

D:\a_work\1\s.dotnet\sdk\9.0.100-preview.6.24311.23\NuGet.targets(170,5): error : The feed 'dotnet-public [https://pkgs.dev.azure.com/dnceng/public/_packaging/dotnet-public/nuget/v3/index.json]' lists package 'Castle.Core.5.1.1' but multiple attempts to download the nupkg have failed. The feed is either invalid or required packages were removed while the current operation was in progress. Verify the package exists on the feed and try again. [D:\a_work\1\s\src\efcore\EFCore.sln]

I've attached some binlogs: binlogs.zip

It's happening for random packages and given we only started seeing this when bumping the dotnet SDK (and thus nuget) I think it's more likely a nuget bug than an issue with the AzDO feed.

Is there some way we can enable more logging?

/cc @nkolev92 @kartheekp-ms

Verbose Logs

No response

@lewing
Copy link

lewing commented Jun 17, 2024

We're seeing this and #13551 repeatedly in CI is there something we can do to help get this resolved

cc @jebriede @aortiz-msft

@jeffkl
Copy link
Contributor

jeffkl commented Jun 17, 2024

I think what's going on is Azure DevOps Artifacts Services public upstreams is causing this. When a package is restored from a feed with upstream(s) for the first time, the account doing the restore must have Contributor permissions to populate the package. If the user is not a Contributor, the package will not be made available, and the server returns an HTTP 404.

https://learn.microsoft.com/en-us/azure/devops/artifacts/how-to/set-up-upstream-sources?view=azure-devops&tabs=nuget

image

The NuGet repo recently hit this problem and we had to get permissions fixed on the feed to get unblocked. I guess CI agents need to be using an account with the correct permissions in order to populate feeds? Can you confirm if this workflow is using upstream feeds?

@akoeplinger
Copy link
Author

The dotnet-public feed doesn't have upstreams.

@jeffkl
Copy link
Contributor

jeffkl commented Jun 17, 2024

@nkolev92 do you know what's going on? It looks like the same symptoms where the feed lists a particular version but then NuGet receives a 404 when trying to download the package.

@akoeplinger
Copy link
Author

Looking at the code this is the only place where PackageNotFoundProtocolException is thrown: https://github.com/NuGet/NuGet.Client/blob/20f05435be385abfe74737b6433dc80fd3b3b504/src/NuGet.Core/NuGet.Protocol/Utility/FindPackagesByIdNupkgDownloader.cs#L69-L88

Following ProcessNupkgStreamAsync feels like there might be a race condition hidden somewhere :)

@lewing
Copy link

lewing commented Jun 18, 2024

Also tracking in dotnet/dnceng#3100 and dotnet/runtime#103526

@nkolev92 nkolev92 added Functionality:Restore Priority:1 High priority issues that must be resolved in the current sprint. and removed Triage:Untriaged labels Jun 19, 2024
@nkolev92
Copy link
Member

@lewing, @akoeplinger

I'll reach out offline as well, but if possible, can anyone try using .NET 9, P4 and seeing if the problems persist.

@am11
Copy link

am11 commented Jun 19, 2024

#13551 is tracked as dotnet/runtime#103526. Retry always fixes the error so it's intermittent but consistently failing couple of builds a day.

@kartheekp-ms
Copy link
Contributor

kartheekp-ms commented Jun 20, 2024

I worked with @embetten from the Azure Artifacts team. I was told that the dotnet-public feed has responded with the following status codes for the last 7 days.

Response Code Count
303 63,421,011
200 52,874,790
404 8,246,998
409 233
401 17,498
202 472
500 33

Upon checking if there are any 404s for either 'System.Text.Json.7.0.3' or 'Castle.Core.5.1.1' packages, surprisingly no records were found in telemetry and IIS/AFD logs. Emily mentioned that one potential reason why the feed responded with a 404 might be that the AzDO service was still processing the package after its push while there was a request to download it, which would have failed with a 404. For example, the 'System.Text.Json.7.0.3' package was ingested into the feed on 06/13, and the issue was reported on the same day.

@nkolev92
Copy link
Member

@kartheekp-ms @embetten

dotnet/sdk#41590 lists recent failures, in particular: https://dev.azure.com/dnceng-public/public/_build/results?buildId=711675&view=logs&j=5960db8e-22cf-5780-65a0-f69bf3efbd20&t=663b2309-e924-59c7-4a84-d59cf2e17f86&l=1171 for example:

The feed 'net-sdk-supporting-feed [https://pkgs.dev.azure.com/dnceng/public/_packaging/dotnet9/nuget/v3/index.json]' lists package 'Roslyn.Diagnostics.Analyzers.3.11.0-beta1.24170.2'

However it was publishing on 3/20, https://dev.azure.com/dnceng/public/_artifacts/feed/dotnet9/NuGet/Roslyn.Diagnostics.Analyzers/overview/3.11.0-beta1.24170.2.

So I don't think timing is the likely issue.
Can you maybe check out the 404s for that feed?

@kartheekp-ms
Copy link
Contributor

kartheekp-ms commented Jun 20, 2024

However it was publishing on 3/20,
Can you maybe check out the 404s for that feed?

Yeah, it is the same case with 'Castle.Core.5.1.1' package for dotnet-public feed where the package was ingested to the feed on Jan 5, 2023 but the NuGet error log says that package is not found.

@nkolev92
Copy link
Member

Hey everyone,

I've done some more investigations with some help from @zivkan and @jeffkl, and while we don't have anything conclusive we do have a way forward.

Investigation summary:

As mentioned a few times, currently what's getting thrown is a PackageNotFoundException.
https://github.com/NuGet/NuGet.Client/blob/79e6e486ee9f80296601485358ce6ac9f5e302d1/src/NuGet.Core/NuGet.Protocol/Utility/FindPackagesByIdNupkgDownloader.cs#L82-L88
This can happen in a situation, where the server returns 404, or 204, or 500 etc. If the http call is not successful, NuGet will retry up to 6 times: https://github.com/NuGet/NuGet.Client/blob/79e6e486ee9f80296601485358ce6ac9f5e302d1/src/NuGet.Core/NuGet.Protocol/Utility/FindPackagesByIdNupkgDownloader.cs#L260.
When the calls are made, the code in https://github.com/NuGet/NuGet.Client/blob/79e6e486ee9f80296601485358ce6ac9f5e302d1/src/NuGet.Core/NuGet.Protocol/Utility/FindPackagesByIdNupkgDownloader.cs#L73-L76 won't get executed unless something was downloaded from the server.

Good news is that there's a lot of logging. We expect that 404s might be happen, and NuGet does log all http results, but it does so at normal verbosity.
There are also messages in case retries happen: https://github.com/NuGet/NuGet.Client/blob/79e6e486ee9f80296601485358ce6ac9f5e302d1/src/NuGet.Core/NuGet.Protocol/Utility/FindPackagesByIdNupkgDownloader.cs#L287-L313.
Unfortunately, these messages are not available to us for investigation right now, given that nearly all restores at minimal verbosity right now.

I did consider adding additional messages, but I believe there's plenty of information in the current logs (once we get the higher verbosity logs), to make further progress.

Call to action

  1. Given that there are reports of .NET 9 P4, can someone please attempt using .NET 9 P4 to help us narrow down whether there's a client regression, or if there's something happening server side.
  2. Can the restore verbosity be increased to detailed?

Beyond this, it'd also be helpful to know things like:
How often are these issues happening? How many failures vs how many builds?

@nkolev92 nkolev92 self-assigned this Jun 20, 2024
@akoeplinger
Copy link
Author

@nkolev92 I'm having trouble figuring out how to increase the log verbosity for the RestoreTask, would you mind showing which property you had in mind?

Note that I already posted the binlogs in the original comment in case you missed it.

@akoeplinger
Copy link
Author

akoeplinger commented Jun 21, 2024

I looked at the binlogs again and I don't see a GET request for the packages which fail later on, only for the index.json.

@nkolev92
Copy link
Member

@akoeplinger

There's no special property, the cli argument should work.

@akoeplinger
Copy link
Author

so you're saying this will emit more data into the binlog?

@nkolev92
Copy link
Member

No, my bad I should've been explicit. I was just answering that part of the question.

Checking now to see if I can understand how the Get/Cache message would never happen and yet we'd get to throw that exception.

@nkolev92
Copy link
Member

The only way I could see the log message not showing and not have another message such as cancellation or a retry one, is if it somehow exited before hitting this code path: https://github.com/NuGet/NuGet.Client/blob/d1d2e260de9b8f20175e7766aa88e1ce1ece6b4e/src/NuGet.Core/NuGet.Protocol/Utility/FindPackagesByIdNupkgDownloader.cs#L258-L263.

I think we'll probably need to add some custom logging here.

@WeihanLi
Copy link

C:\hostedtoolcache\windows\dotnet\sdk\9.0.100-preview.5.24307.3\NuGet.targets(170,5): error MSB4181: The "RestoreTask" task returned false but did not log an error. [D:\a\1\s\WeihanLi.Web.Extensions.sln]

I may meet the same issue, success with a retry

CI: https://weihanli.visualstudio.com/Pipelines/_build/results?buildId=6428&view=logs&j=12f1170f-54f2-53f3-20dd-22fc7dff55f9&t=7dae026f-9e99-5d65-075e-3f7579577f94&s=96ac2280-8cb4-5df5-99de-dd2da759617d
build logs: https://weihanli.visualstudio.com/4e7b2e5c-fece-4f31-9ba4-4bc9d0f3fa3d/_apis/build/builds/6428/logs/11
Commit: WeihanLi/WeihanLi.Web.Extensions@6fa2899

@akoeplinger
Copy link
Author

I think we'll probably need to add some custom logging here.

@nkolev92 is this being worked on? We can do some runs on AzDO with a patched nuget.client .dll if necessary

@ViktorHofer
Copy link

Kindly asking for an update on this issue. This is still affecting most of our builds.

So roughly about 100 hits in the last 5 days.

@nkolev92
Copy link
Member

nkolev92 commented Jul 1, 2024

I was out unexpectedly all of last week.
Resuming investigations today.

@nkolev92
Copy link
Member

nkolev92 commented Jul 1, 2024

Catching up here, @ViktorHofer, I think dotnet/runtime#103823 looks like it's a runtime issue and something likely unrelated to this particular problem, so I'm not tracking that one as a duplicate of this.

@nkolev92
Copy link
Member

nkolev92 commented Jul 2, 2024

I think we'll probably need to add some custom logging here.

@nkolev92 is this being worked on? We can do some runs on AzDO with a patched nuget.client .dll if necessary

@akoeplinger

Yep, we'll have to do that.
I'm adding some logging on a branch. I'll ping once I have a build.

@nkolev92
Copy link
Member

nkolev92 commented Jul 3, 2024

Update:
Added some logging and got some results. Unfortunately those are not enough to help us narrow down the issue.
We're gonna try again.

akoeplinger added a commit to akoeplinger/NuGet.Client that referenced this issue Jul 4, 2024
When multiple threads try to write to the fields we can sometimes hit a case where _retryCount is set to 0.
This caused the loop in ProcessHttpSourceResultAsync() to be skipped completely because `retry <= maxRetries`:
https://github.com/NuGet/NuGet.Client/blob/6c642c2d63717acd4bf92050a42691928020eb89/src/NuGet.Core/NuGet.Protocol/Utility/FindPackagesByIdNupkgDownloader.cs#L258-L328

Fix this by using `Lazy<T>` for initializing the fields which is thread-safe.

Fixes NuGet/Home#13545
@akoeplinger
Copy link
Author

We figured out the root cause: NuGet/NuGet.Client#5905

I ran three sdk-unified-build-full builds with this change and didn't hit the issue yet :)

akoeplinger added a commit to akoeplinger/NuGet.Client that referenced this issue Jul 4, 2024
When multiple threads try to access the `RetryCount` property we sometimes hit a case where `RetryCount` returned 0 (*).
This caused the loop in `ProcessHttpSourceResultAsync()` to be skipped completely because `retry <= maxRetries`: https://github.com/NuGet/NuGet.Client/blob/6c642c2d63717acd4bf92050a42691928020eb89/src/NuGet.Core/NuGet.Protocol/Utility/FindPackagesByIdNupkgDownloader.cs#L258-L328

Fix this by using `Lazy<T>` for initializing the fields which is thread-safe.

Fixes NuGet/Home#13545

(*) The reason is that code like `int RetryCount => _retryCount ??= 6;` gets turned into:

```csharp
int valueOrDefault = _retryCount.GetValueOrDefault();
if (!_retryCount.HasValue)
{
    valueOrDefault = 6;
    _retryCount = valueOrDefault;
    return valueOrDefault;
}
return valueOrDefault;
```

Suppose Thread A arrives first and calls `GetValueOrDefault()` (which is 0 for int) but then Thread B interjects and sets the value, now when Thread A resumes `_retryCount.HasValue` is true so we skip the if block and return valueOrDefault i.e. 0.
@nkolev92
Copy link
Member

nkolev92 commented Jul 5, 2024

Some more context for the followers of this issue:

  • We don't really think there's a NuGet change that caused this issue to appear in P5/P6 and not in P4. This has been problematic code since the first implementation, and something else around NuGet has changed that make it more likely for this timing issue to happen. An idea could be some JIT changes.
  • The NuGet enhanced retry is now default. So if you're setting env vars such as NUGET_ENABLE_ENHANCED_HTTP_RETRY, you probably don't need to do it anymore.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Functionality:Restore Priority:1 High priority issues that must be resolved in the current sprint. Type:Bug
Projects
None yet
Development

Successfully merging a pull request may close this issue.

9 participants