Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[.NET 8] <PublishTrimmed> no longers works but throws build process for an infinite loop #94831

Closed
DierkDroth opened this issue Nov 16, 2023 · 17 comments · Fixed by #95041
Closed
Assignees
Labels
area-Tools-ILLink .NET linker development as well as trimming analyzers regression-from-last-release
Milestone

Comments

@DierkDroth
Copy link

DierkDroth commented Nov 16, 2023

Description

So far I built my ARM64 binary by
dotnet publish -c Release -r linux-arm64 -p:PublishSingleFile=true -p:SelfContained=true -p:PublishReadyToRun=true MyProject.csproj

Here are my project settings:

	<PropertyGroup>
		<OutputType>Exe</OutputType>
		<TargetFramework>net8.0</TargetFramework>
		<VersionPrefix>1.10.14</VersionPrefix>
		<RunAnalyzersDuringLiveAnalysis>False</RunAnalyzersDuringLiveAnalysis>
		<RunAnalyzersDuringBuild>False</RunAnalyzersDuringBuild>
		<EnableNETAnalyzers>False</EnableNETAnalyzers>
		<ProduceReferenceAssembly>False</ProduceReferenceAssembly>
		<PublishTrimmed>true</PublishTrimmed>
	</PropertyGroup>

Turns out <PublishTrimmed>true</PublishTrimmed> no longers works: the build which tools ~3 minutes on .NET 7 now did not terminate the 'optimization phase' after > 9 hours.

What's going wrong and how could I trim my build as before?

Reproduction Steps

Please see above

Expected behavior

Trimming should work with .NET8 as it worked with .NET 7

Actual behavior

Trimming does not work with .NET8 as it worked with .NET 7

Regression?

Yes

Known Workarounds

None

Configuration

No response

Other information

No response

@dotnet-issue-labeler dotnet-issue-labeler bot added the needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners label Nov 16, 2023
@ghost ghost added the untriaged New issue has not been triaged by the area owner label Nov 16, 2023
@DierkDroth DierkDroth changed the title [.NET 8] <PublishTrimmed> no longers work but throws build process for an infinite loop [.NET 8] <PublishTrimmed> no longers works but throws build process for an infinite loop Nov 16, 2023
@lewing lewing added the area-Tools-ILLink .NET linker development as well as trimming analyzers label Nov 16, 2023
@ghost
Copy link

ghost commented Nov 16, 2023

Tagging subscribers to this area: @agocke, @sbomer, @vitek-karas
See info in area-owners.md if you want to be subscribed.

Issue Details

Description

So far I built my ARM64 binary by
dotnet publish -c Release -r linux-arm64 -p:PublishSingleFile=true -p:SelfContained=true -p:PublishReadyToRun=true MyProject.csproj

Here are my project settings:

	<PropertyGroup>
		<OutputType>Exe</OutputType>
		<TargetFramework>net8.0</TargetFramework>
		<VersionPrefix>1.10.14</VersionPrefix>
		<RunAnalyzersDuringLiveAnalysis>False</RunAnalyzersDuringLiveAnalysis>
		<RunAnalyzersDuringBuild>False</RunAnalyzersDuringBuild>
		<EnableNETAnalyzers>False</EnableNETAnalyzers>
		<ProduceReferenceAssembly>False</ProduceReferenceAssembly>
		<PublishTrimmed>true</PublishTrimmed>
	</PropertyGroup>

Turns out <PublishTrimmed>true</PublishTrimmed> no longers works: the build which tools ~3 minutes on .NET 7 now did not terminate the 'optimization phase' after > 9 hours.

What's going wrong and how could I trim my build as before?

Reproduction Steps

Please see above

Expected behavior

Trimming should work with .NET8 as it worked with .NET 7

Actual behavior

Trimming does not work with .NET8 as it worked with .NET 7

Regression?

Yes

Known Workarounds

None

Configuration

No response

Other information

No response

Author: DierkDroth
Assignees: -
Labels:

untriaged, area-Tools-ILLink, needs-area-label

Milestone: -

@lewing lewing removed the needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners label Nov 16, 2023
@marek-safar marek-safar removed the untriaged New issue has not been triaged by the area owner label Nov 16, 2023
@marek-safar marek-safar added this to the 8.0.x milestone Nov 16, 2023
@vitek-karas
Copy link
Member

Thanks for reporting this. Unfortunately, there's not enough info here to figure out what's going on.

Could you please try some of these:

  • Running the "hanging" publish - can you look what process is hanging. I would expect it to be eating CPU cycles instead of just waiting on something. If it's related to trimming it would be a process dotnet.exe and with illink.dll on the command line - but I'm guessing.
  • If you find that process, can you try attaching a debugger to it (VS attach debugger would work well for this)? Then break the debugger and look for threads which have something interesting on the callstack - if it's trimming related the code for that lives in Mono.Linker namespace - the callstacks should be safe to share, the trimmer runs only code from the trimmer and the framework, there won't be anything from your project visible on the stack.
  • I assume you can't share your project, but maybe you could share the list of NuGet dependencies (it's common that the problem is caused by some code from the NuGet dependencies and those are public, so should not be a problem to share)
  • Anything special about your project you can think of? (I know it's a vague question, but you never know)

@DierkDroth
Copy link
Author

Thanks for getting back to me @vitek-karas.

Curious: would it help if I attached the untrimmed binary so you could try running the trim logic on it?

@vitek-karas
Copy link
Member

We would need all of the input binaries and the trimmer command line. You could try using https://github.com/vitek-karas/illinkrepro. Basically you run your publish with /bl to produce msbuild.binlog. The tool in that repo then parses that and extracts the trimmer command line from it. Then it copies all of the input assemblies into a repro directory and constructs a copy of the command line. If you're comfortable sharing that, it should let me reproduce the failure/hang exactly. The repro will contain all of the input assemblies (not source code, compiled IL) and the command line to trimmer (which may sometimes contains certain local paths).

@DierkDroth
Copy link
Author

DierkDroth commented Nov 16, 2023

Thanks @vitek-karas.

To clarify:

  • I run the build without the trimming option but with the /bl option
  • this creates a msbuild.binlog file

Could I just attach that msbuild.binlog here and you then would have what you needed to further analyze the case?

Note: I already had quite a few support cases/issues with the great Platform UNO team, where I happily shared BINLOG files. That's why I wouldn't have any concern on sharing that file.

Please let me know if you felt that would be a way to go or if I rather should go through the steps you outlined above. Thanks

@vitek-karas
Copy link
Member

Small correction:

  • You run the publish command which hangs (no changes) just add /bl
  • This creates the "interesting" binlog

The binlog is unfortunately not enough - it contains the command line to the trimmer, but not the input assemblies. For that something like the tool from the illinkrepro repo, to collect the input assemblies as well, is needed.

We can also arrange for a more private way of sending this data, if you want. The best way then is to create a Visual Studio Feedback item and attach the data there.

@DierkDroth
Copy link
Author

Thanks @vitek-karas. I just created a repo and filed it as Visual Studio Feedback item.

Side note: It would make the handling of your illinkrepo tool more handy if just a single, self-contained file would be created. I had to copy the illinkrepo.* files + all dlls to the target directory.

Please let me know if you needed anything else.

@jeffschwMSFT
Copy link
Member

@vitek-karas
Copy link
Member

@DierkDroth thanks a lot for the repro - I was able to investigate.

@sbomer for ideas about fixing:

The problem is a combination of the interprocedural data flow and a limit in ValueSet to avoid exponential state expansion. The way we implement the limit is to reset the size of the ValueSet when it reaches a limit (256 right now). This works well except when the value is part of hoisted local set. In that case the reset modifies the interprocedural state and causes the analysis to never reach a stable state. It basically goes from 0 to the limit size, some of the values in between are captured in the state. Then it resets back to 0, so the next iteration will see a different value, and it will start to grow again.

The repro for this is a very large state machine. The state field will be assigned constant values (in this app's case it's 533 different values), and that will grow the set, eventually it reaches 256 values and resets back to 0, but that we notice that one of the numbers we've added is missing, so we add it again -> grows to 256 again, and so on and on.

A "Simple" repro looks like:

			[ExpectedWarning ("IL2072", CompilerGeneratedCode = true)]
			static async void VeryLargeStateMachine()
			{
				Type t = GetWithPublicFields ();

				// 100
				await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync ();
				await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync ();
				await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync ();
				await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync ();
				await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync ();
				await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync ();
				await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync ();
				await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync ();
				await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync ();
				await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync ();

				// 200
				await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync ();
				await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync ();
				await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync ();
				await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync ();
				await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync ();
				await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync ();
				await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync ();
				await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync ();
				await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync ();
				await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync ();

				// 300
				await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync ();
				await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync ();
				await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync ();
				await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync ();
				await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync ();
				await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync ();
				await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync ();
				await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync ();
				await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync ();
				await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync (); await MethodAsync ();

				t.RequiresAll ();
			}

@vitek-karas
Copy link
Member

As for the potential fix... the only idea I was able to come up with is to:

  • We can't reset the ValueSet to empty, because it removes already added state, and we rely on the value sets to be "Add only", otherwise we can't guarantee reaching a stable state.
  • If we hit the limit of values, set the value to "overflow" special value and then in Meet - "overflow" meet "anything" => "overflow". That way the value will eventually stabilize on "overflow".
  • We would then treat the "overflow" as an empty set for the purposes of enumeration of values - this is to avoid producing warnings (empty set causes the analysis to "skip" actions on it).

@DierkDroth
Copy link
Author

DierkDroth commented Nov 20, 2023

Thanks for your feedback @vitek-karas. I'm happy to help and delighted that you had been able to isolate the cause of trouble.

Curious: you mentioned 'exponential state expansion'. Would you consider my app as being rather 'big', that this scenario was not uncovered before?!?

@sbomer
Copy link
Member

sbomer commented Nov 20, 2023

@vitek-karas thanks for investigating. I agree with your assessment - it didn't occur to me at the time that resetting the values in the meet operation breaks the invariant that meet should always move down the lattice. I agree we should add a special "overflow" (or "bottom") value with the meet behavior you described. Maybe it would be better to treat it as an unknown value (treat it as a pattern we couldn't analyze and produce a warning) instead of skipping it.

@vitek-karas
Copy link
Member

@DierkDroth

Your app is not that big, although there is one method in it which is very large and has LOT of async calls in it. That's what caused the problem. But overall it's not that crazy ("only ~500 await calls"). The exponential state problem is that the tooling simulates all potential values a variable can take in the method, and it's very easy to blow that set out of the water. For example:

				typeof (TestType).RequiresAll (); // Force data flow analysis

				object[] data = new object[20];
				if (true) data[0] = new object ();
				if (true) data[1] = new object ();
				if (true) data[2] = new object ();
				if (true) data[3] = new object ();
				if (true) data[4] = new object ();
				if (true) data[5] = new object ();
				if (true) data[6] = new object ();
				if (true) data[7] = new object ();
				if (true) data[8] = new object ();
				if (true) data[9] = new object ();
				if (true) data[10] = new object ();
				if (true) data[11] = new object ();
				if (true) data[12] = new object ();
				if (true) data[13] = new object ();
				if (true) data[14] = new object ();
				if (true) data[15] = new object ();
				if (true) data[16] = new object ();
				if (true) data[17] = new object ();
				if (true) data[18] = new object ();
				if (true) data[19] = new object ();

Now just assume that tool is not smart enough to figure out the condition, that it has to assume in each if both branches can be taken. Then the number of possible states for this test is on the order of 2^20. The tool obviously can't represent it in memory, so we have to "give up" at some point.

@vitek-karas
Copy link
Member

@sbomer
I guess you're right. The trimmer should play it safe, and thus if it can't figure something out, it should warn. In this case it might be annoying, but it's probably better than saying "everything's fine" (insert your favorite fire meme ;-))

@sbomer sbomer self-assigned this Nov 20, 2023
@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label Nov 20, 2023
sbomer added a commit that referenced this issue Nov 27, 2023
#87634 introduced a limit
on the number of values tracked in dataflow analysis, but this
approach was incorrect because resetting the set of tracked
values was effectively moving up the dataflow lattice, breaking
the invariant and causing potential hangs.

The hang was observed in
#94831, where (as found
by @vitek-karas) the hoisted local `state` field of an async
method with many await points was getting a large number of
tracked values, hitting the limit, being reset to "empty", but
then growing again, causing the analysis not to converge.

The fix is to instead introduce a special value
`ValueSet<TValue>.Unknown` that is used for this case. It acts
like "bottom" in the lattice, so that it destroys any other
inputs on meet ("bottom" meet anything else is "bottom").

In this testcase the `state` field isn't involved in dataflow
warnings, so this actually allows the method to be analyzed
correctly, producing the expected warning for the `Type` local
that flows across all of the await points.

Fixes #94831
@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label Nov 27, 2023
sbomer added a commit to sbomer/runtime that referenced this issue Nov 27, 2023
…et#95041)

dotnet#87634 introduced a limit
on the number of values tracked in dataflow analysis, but this
approach was incorrect because resetting the set of tracked
values was effectively moving up the dataflow lattice, breaking
the invariant and causing potential hangs.

The hang was observed in
dotnet#94831, where (as found
by @vitek-karas) the hoisted local `state` field of an async
method with many await points was getting a large number of
tracked values, hitting the limit, being reset to "empty", but
then growing again, causing the analysis not to converge.

The fix is to instead introduce a special value
`ValueSet<TValue>.Unknown` that is used for this case. It acts
like "bottom" in the lattice, so that it destroys any other
inputs on meet ("bottom" meet anything else is "bottom").

In this testcase the `state` field isn't involved in dataflow
warnings, so this actually allows the method to be analyzed
correctly, producing the expected warning for the `Type` local
that flows across all of the await points.

Fixes dotnet#94831
@DierkDroth
Copy link
Author

@sbomer thanks for working on that issue.

Curious: would this result in a .NET8 update? If so, what would be the expected timeframe?

@sbomer
Copy link
Member

sbomer commented Nov 28, 2023

Thank you for reporting it! I'm backporting the fix into .NET 8 at #95302 - if approved, it should be part of the next .NET 8 servicing release, so probably one of the next Patch Tuesdays.

@DierkDroth
Copy link
Author

Excellent! Thanks for your feedback @sbomer

@github-actions github-actions bot locked and limited conversation to collaborators Dec 29, 2023
jeffschwMSFT pushed a commit that referenced this issue Jan 2, 2024
…) (#95302)

#87634 introduced a limit
on the number of values tracked in dataflow analysis, but this
approach was incorrect because resetting the set of tracked
values was effectively moving up the dataflow lattice, breaking
the invariant and causing potential hangs.

The hang was observed in
#94831, where (as found
by @vitek-karas) the hoisted local `state` field of an async
method with many await points was getting a large number of
tracked values, hitting the limit, being reset to "empty", but
then growing again, causing the analysis not to converge.

The fix is to instead introduce a special value
`ValueSet<TValue>.Unknown` that is used for this case. It acts
like "bottom" in the lattice, so that it destroys any other
inputs on meet ("bottom" meet anything else is "bottom").

In this testcase the `state` field isn't involved in dataflow
warnings, so this actually allows the method to be analyzed
correctly, producing the expected warning for the `Type` local
that flows across all of the await points.

Fixes #94831
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-Tools-ILLink .NET linker development as well as trimming analyzers regression-from-last-release
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants