nativeaot/SmokeTests/Exceptions failing with Assertion failed: (n_heaps <= heap_number) || !gc_t_join.joined() #103839

Closed
elinor-fung opened this issue Jun 21, 2024 · 13 comments
Labels: area-GC-coreclr, blocking-clean-ci, Known Build Error

Comments

@elinor-fung
Member

elinor-fung commented Jun 21, 2024

Assertion failed: (n_heaps <= heap_number) || !gc_t_join.joined(), file D:\a\_work\1\s\src\coreclr\gc\gc.cpp, line 6988

Return code:      1
Raw output file:      C:\h\w\B51009A0\w\B29A098A\uploads\Reports\nativeaot.SmokeTests\Exceptions\Exceptions\Exceptions.output.txt
Raw output:
BEGIN EXECUTION
call C:\h\w\B51009A0\p\nativeaottest.cmd C:\h\w\B51009A0\w\B29A098A\e\nativeaot\SmokeTests\Exceptions\Exceptions\ Exceptions.dll 
Exception caught!
Null reference exception in write barrier caught!
Null reference exception caught!
Test Stacktrace with exception on stack:
   at BringUpTest.FilterWithStackTrace(Exception) + 0x28
   at BringUpTest.Main() + 0x31c
   at System.Runtime.EH.FindFirstPassHandler(Object, UInt32, StackFrameIterator&, UInt32&, Byte*&) + 0x188
   at System.Runtime.EH.DispatchEx(StackFrameIterator&, EH.ExInfo&) + 0x161
   at System.Runtime.EH.RhThrowEx(Object, EH.ExInfo&) + 0x4b
   at BringUpTest.Main() + 0xaf

Exception caught via filter!
Expected: 100
Actual: 3
END EXECUTION - FAILED
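
For readers triaging this log, the failed assertion can be read as a plain boolean predicate. The sketch below is only a Python model of the expression quoted in the log, not the GC source; the function name `assertion_holds` and the small state grid are illustrative assumptions. It enumerates which combinations of values would trip the assert.

```python
from itertools import product


def assertion_holds(n_heaps: int, heap_number: int, joined: bool) -> bool:
    """Python mirror of (n_heaps <= heap_number) || !gc_t_join.joined()."""
    return (n_heaps <= heap_number) or (not joined)


# Enumerate a small, arbitrary grid of states and keep those that fail.
failing = [
    (n, h, j)
    for n, h, j in product(range(1, 4), range(0, 4), (False, True))
    if not assertion_holds(n, h, j)
]

for n_heaps, heap_number, joined in failing:
    print(f"n_heaps={n_heaps} heap_number={heap_number} joined={joined}")
```

Every failing state has `heap_number < n_heaps` together with `joined` being true, which is the only way both sides of the `||` can be false.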

Build Information

Build: https://dev.azure.com/dnceng-public/cbb18261-c48f-4abb-8651-8cdcb5474649/_build/results?buildId=715849
Build error leg or test failing: nativeaot\SmokeTests\Exceptions\Exceptions\Exceptions.cmd
Pull request: #103821

Error Message

Fill the error message using step by step known issues guidance.

{
  "ErrorMessage": "Assertion failed: (n_heaps <= heap_number) || !gc_t_join.joined()",
  "ErrorPattern": "",
  "BuildRetry": false,
  "ExcludeConsoleLog": false
}
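
As a rough illustration of how an entry like the one above could be checked against a console log, here is a minimal sketch. It assumes `ErrorMessage` is treated as a literal substring and `ErrorPattern` as a regular expression; the helper `log_matches` is hypothetical and is not the matching code used by the CI tooling.

```python
import json
import re

KNOWN_ISSUE = json.loads("""
{
  "ErrorMessage": "Assertion failed: (n_heaps <= heap_number) || !gc_t_join.joined()",
  "ErrorPattern": "",
  "BuildRetry": false,
  "ExcludeConsoleLog": false
}
""")


def log_matches(issue: dict, log_text: str) -> bool:
    """Return True if any log line matches the issue entry (assumed semantics)."""
    message = issue.get("ErrorMessage") or ""
    pattern = issue.get("ErrorPattern") or ""
    lines = log_text.splitlines()
    if message:
        # Assumption: literal substring match, no escaping needed.
        return any(message in line for line in lines)
    if pattern:
        # Assumption: regular-expression match per line.
        return any(re.search(pattern, line) for line in lines)
    return False


LOG = r"""Assertion failed: (n_heaps <= heap_number) || !gc_t_join.joined(), file D:\a\_work\1\s\src\coreclr\gc\gc.cpp, line 6988
Return code:      1
"""

print(log_matches(KNOWN_ISSUE, LOG))
```

Because the message contains characters such as `(` and `|` that are special in regular expressions, a literal substring match avoids having to escape them.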

Report

| Build | Definition | Test | Pull Request |
|---|---|---|---|
| 782793 | dotnet/runtime | nativeaot.SmokeTests.WorkItemExecution | #106713 |
| 780945 | dotnet/runtime | nativeaot.SmokeTests.WorkItemExecution | #106662 |
| 779397 | dotnet/runtime | readytorun/GenericCycleDetection/Depth1Test/Depth1Test.cmd | #80154 |
| 777119 | dotnet/runtime | nativeaot.SmokeTests.WorkItemExecution | #106474 |
| 777004 | dotnet/runtime | nativeaot.SmokeTests.WorkItemExecution | #106419 |
| 776719 | dotnet/runtime | nativeaot.SmokeTests.WorkItemExecution | #105946 |
| 775455 | dotnet/runtime | nativeaot.SmokeTests.WorkItemExecution | #106309 |
| 770671 | dotnet/runtime | nativeaot.SmokeTests.WorkItemExecution | |
| 769702 | dotnet/runtime | nativeaot.SmokeTests.WorkItemExecution | #106130 |
| 768094 | dotnet/runtime | nativeaot.SmokeTests.WorkItemExecution | #106010 |
| 761651 | dotnet/runtime | nativeaot.SmokeTests.WorkItemExecution | #105757 |
| 757283 | dotnet/runtime | nativeaot.SmokeTests.WorkItemExecution | #105578 |

Summary

| 24-Hour Hit Count | 7-Day Hit Count | 1-Month Count |
|---|---|---|
| 0 | 4 | 12 |

@elinor-fung elinor-fung added the blocking-clean-ci and Known Build Error labels Jun 21, 2024
@dotnet-issue-labeler dotnet-issue-labeler bot added the needs-area-label label Jun 21, 2024
@dotnet-policy-service dotnet-policy-service bot added the untriaged label Jun 21, 2024
@elinor-fung elinor-fung added area-NativeAOT-coreclr and removed needs-area-label labels Jun 21, 2024
Contributor

Tagging subscribers to this area: @agocke, @MichalStrehovsky, @jkotas
See info in area-owners.md if you want to be subscribed.

Contributor

Tagging subscribers to this area: @dotnet/gc
See info in area-owners.md if you want to be subscribed.

@jkotas
Member

jkotas commented Jun 22, 2024

Looks like a DATAS race condition. @dotnet/gc Could you please take a look?

Note that the nativeaot\SmokeTests\Exceptions test is explicitly opted into server GC to get some server GC coverage during the default CI run.

@mangod9 mangod9 removed the untriaged label Jun 26, 2024
@mangod9 mangod9 added this to the 9.0.0 milestone Jun 26, 2024
@mrsharm
Member

mrsharm commented Jul 5, 2024

Are there any dumps available? I can't seem to find them. Tried to repro locally to no avail. Seems like it's a low probability assertion failure (2 / month).

@MichalStrehovsky
Member

> Are there any dumps available? I can't seem to find them. Tried to repro locally to no avail. Seems like it's a low probability assertion failure (2 / month).

Yeah, it doesn't look like infra captured a dump for this.

There are 4 hits per month but we don't have any dedicated server GC testing. This is the one and only test we run with server GC enabled. We rely on CoreCLR testing to catch GC bugs right now (even this test is not really testing Server GC - it just tests that setting the csproj property to enable server GC actually enables the server GC).

@mangod9
Member

mangod9 commented Aug 9, 2024

@mrsharm @MichalStrehovsky are any dumps available for this, or is there a local repro?

@mrsharm
Member

mrsharm commented Aug 9, 2024

I couldn't repro this locally, nor could I get any dumps. My one guess (a long shot) is that this might be related to the other DATAS race condition we found via the Reliability Framework, where there is a race in GetHeap while change_heap_count is invoked, but without a dump it's difficult to validate.
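
To make the kind of race described above concrete, here is a purely hypothetical sketch. It does not reproduce the GC's GetHeap or change_heap_count code; the class `HeapTable` and its methods `pick_index` and `is_active` are invented for illustration. It simulates, in a single thread and in a fixed order, a lookup that computes a heap index and is then overtaken by a heap-count change before the index is used.

```python
class HeapTable:
    """Toy stand-in for a set of heaps whose count can change at run time."""

    def __init__(self, n_heaps: int) -> None:
        self.n_heaps = n_heaps

    def pick_index(self, proc_id: int) -> int:
        # First half of a lookup: choose a heap using the current count.
        return proc_id % self.n_heaps

    def is_active(self, index: int) -> bool:
        # Second half: the index is only valid while it is below the count.
        return index < self.n_heaps

    def change_heap_count(self, new_count: int) -> None:
        self.n_heaps = new_count


def lookup_without_race(table: HeapTable, proc_id: int):
    index = table.pick_index(proc_id)
    return index, table.is_active(index)


def lookup_with_interleaved_resize(table: HeapTable, proc_id: int, new_count: int):
    index = table.pick_index(proc_id)     # reader computes index with old count
    table.change_heap_count(new_count)    # count changes in between
    return index, table.is_active(index)  # reader now holds a stale index


print(lookup_without_race(HeapTable(4), 3))
print(lookup_with_interleaved_resize(HeapTable(4), 3, 2))
```

In the interleaved case the reader ends up with an index that is no longer below the heap count, which is the general shape of a check-then-use race. Whether this matches the actual failure is, as the comment says, unverified without a dump.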

@mangod9
Member

mangod9 commented Aug 9, 2024

The reliability framework issue was fixed, correct? Looks like this issue reproed today.

@mrsharm
Member

mrsharm commented Aug 9, 2024

> The reliability framework issue was fixed, correct? Looks like this issue reproed today.

It wasn't - I think we were still working on a solution. CC: @Maoni0.

@mangod9
Member

mangod9 commented Aug 9, 2024

ah ok. We can tag it as such then, and see if the repro stops after that is fixed.

@Maoni0
Member

Maoni0 commented Aug 21, 2024

I made a fix at #106752.

@cshung cshung closed this as completed Aug 22, 2024
@mrsharm
Member

mrsharm commented Aug 22, 2024

@cshung, we should wait some time before confirming this issue is truly fixed - I am observing that the bot is still picking up the same failures.

@mrsharm mrsharm reopened this Aug 22, 2024
@cshung
Member

cshung commented Aug 22, 2024

@mrsharm, wouldn't the bot reopen it if it finds new failures? I was hoping to confirm the fix that way. The builds found by the bot seem to be either from 9.0 or from two days ago.

@mrsharm mrsharm closed this as completed Aug 22, 2024
@github-actions github-actions bot locked and limited conversation to collaborators Sep 22, 2024