
Running out of disk space on 1ES hosted Ubuntu VM image #1537

Closed
oleksandr-didyk opened this issue Dec 11, 2023 · 8 comments
Labels: Ops - Facilely (Operations issues that are easily accomplished or attained)


oleksandr-didyk commented Dec 11, 2023

Related to dotnet/arcade#13036

Since last week the source-build release pipeline has been hitting a "No space left on device" exception during a stage that requires ~6GB of disk space: https://dev.azure.com/dnceng/internal/_build/results?buildId=2332997&view=logs&j=3a04dffd-22cf-5707-fa41-9bd4708cc1db&t=43d9a424-b64c-5d30-c3bd-0fca912bd01a
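
For reference, a minimal pre-flight check of the kind that would have surfaced this earlier. This is only a sketch, not part of the actual release pipeline; the 6 GB threshold simply mirrors the stage requirement mentioned above.

```yaml
steps:
  # Sketch of a pre-flight disk check: fail fast with a clear error instead of
  # hitting "No space left on device" halfway through the stage.
  - bash: |
      set -euo pipefail
      min_free_gb=6    # assumed requirement of the ~6GB stage above
      free_gb=$(df -BG --output=avail "$(Agent.TempDirectory)" | tail -1 | tr -dc '0-9')
      echo "Free space under the agent temp directory: ${free_gb} GB"
      if [ "${free_gb}" -lt "${min_free_gb}" ]; then
        echo "##vso[task.logissue type=error]Only ${free_gb} GB free; the stage needs ~${min_free_gb} GB."
        exit 1
      fi
    displayName: Check available disk space (sketch)
```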

Based on the agent pool description in Azure, there should be no disk-space issues for this pipeline, as the VM has a temp disk of ~100GB. Nevertheless, reading through the issue linked above, we discovered that the settings shown in Azure for the hosted pool do not reflect the actual state: the procured VM has far less disk space available due to ephemeral disks.

It would be great if this could be addressed so that the source-build release pipeline can rely on the agent pool it uses, especially since its disk requirements are fairly low (maybe something like a maintenance job would help here?).
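
To make the "maintenance job" idea concrete, here is a hedged sketch of a cleanup step. The paths below are ones that tend to be large on hosted Ubuntu images; whether they exist, or are safe to remove, on this particular 1ES image is an assumption that would need to be verified first.

```yaml
steps:
  # Sketch of a cleanup/maintenance step run before the space-hungry stage.
  # Each removal is best-effort (|| true) so a missing path does not fail the job.
  - bash: |
      set -uo pipefail
      df -h /                                        # free space before cleanup
      sudo rm -rf /usr/local/lib/android || true     # Android SDK, often several GB if present
      sudo rm -rf /opt/hostedtoolcache/CodeQL || true
      sudo docker system prune --all --force || true # unused images/layers, if docker is installed
      df -h /                                        # free space after cleanup
    displayName: Free up disk space (sketch)
```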

The main reasons for using the pool in the first place are:

  • Ubuntu OS (most of the existing scripting was written with it in mind)
  • Dependent tooling available out of the box, specifically the az command line and its azure-devops extension. Installing the CLI and the extension requires installing Python packages, something we don't really want to deal with, especially given how restrictive the Secure Supply Chain task is when it comes to PyPI (see the sketch after this list).
  • Relatively quick agent procurement given the low resource requirements; the pipeline is just moving bits around, validating release information, and creating communications with partners.
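
The sketch referenced in the tooling bullet above: roughly what bootstrapping the az tooling would look like on an image that does not ship it. The extension is installed as a Python wheel, which is the Python-package dependency mentioned above; on the current pool the CLI and extension are assumed to be preinstalled, so none of this is needed there.

```yaml
steps:
  # Sketch of bootstrapping the az CLI's azure-devops extension on a bare image.
  - bash: |
      set -euo pipefail
      az version                                     # fails if the az CLI itself is missing
      az extension add --name azure-devops --only-show-errors
      az extension list --output table               # confirm the extension is registered
    displayName: Bootstrap az devops tooling (sketch)
```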
dougbu (Member) commented Jan 8, 2024

I believe that, like dotnet/arcade#13036, there's no further action we can take on this issue. I would appreciate your thoughts on simply closing this, @oleksandr-didyk, @ilyas1974, and @premun.

dougbu (Member) commented Jan 8, 2024

Also adding @mmitche since his PR for the somewhat-related #1514 issue is still open. My bet is we don't need both this issue and #1514 to track this.

oleksandr-didyk (Author) commented Jan 9, 2024

> I believe that, like dotnet/arcade#13036, there's no further action we can take on this issue. I would appreciate your thoughts on simply closing this, @oleksandr-didyk, @ilyas1974, and @premun.

I do think there is some action that can be taken here, such as either forbidding the use of the image completely, or documenting the image's limitations and any other helpful notes about this issue somewhere.

I am not sure exactly what the team would prefer or have time for, but given that some time was spent digging into the cause of the problem for the release infrastructure, I do think it's worth making sure someone else doesn't run into the same problem in a few months.

CC: @tkapin

tkapin (Member) commented Jan 19, 2024

I agree with Alex's statement above. If we don't use clean machines for each build, we should run a cleanup prior to each build to ensure the machines aren't filling up and failing non-deterministically.

missymessa (Member) commented:

@oleksandr-didyk and @tkapin what action should FR take on this issue?

oleksandr-didyk (Author) commented Jan 23, 2024

> @oleksandr-didyk and @tkapin what action should FR take on this issue?

IMO we should either forbid the use of the image, allow it and clearly document its limitations, or, as Tomas mentioned, provide a mechanism to keep the images clean so that the disk space limits aren't hit non-deterministically.

No preference for any of these; it would just be nice to have some clarity so we don't fall victim to the same issue in a few months.

garath (Member) commented Feb 6, 2024

I'm moving this from FR to Ops as it doesn't need immediate attention and isn't blocking.

I think it would be interesting for Ops to establish some clearer expectations on what the available disk space is in this queue. Perhaps even just the calculation: "this is the size of the disk that comes with this VM size, and here is how big the OS image build ends up being; thus you can expect n gigs free."
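
One low-effort way to ground that "n gigs free" number would be a one-off diagnostic job on the queue whose output gets recorded in the pool documentation. A sketch, using only standard Linux tools:

```yaml
steps:
  # Sketch of a diagnostic step: record the disk layout and actual free space an
  # agent from this queue starts with, so the documented expectation matches reality.
  - bash: |
      set -euo pipefail
      lsblk --output NAME,SIZE,TYPE,MOUNTPOINT       # OS disk vs. ephemeral/temp disk
      df -h                                          # free space per mount point
      echo "Agent work folder: $(Agent.WorkFolder)"
      df -h "$(Agent.WorkFolder)"                    # the space builds actually get to use
    displayName: Report agent disk layout (sketch)
```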

> If we don't use clean machines for each build, we should ...

To be clear, these machines are recycled after each build. Each job gets a new, fresh machine.

davfost self-assigned this Feb 26, 2024
davfost (Contributor) commented Feb 26, 2024

The original space issue was resolved in the parent issue by migrating pipelines to use different pools.

Since we are moving to 'stateless' 1ES Hosted Pools, the free space on a machine is defined by the VM size specified in the pool settings in the AzDo portal:

Getting To The VM Size Setting: (screenshot)

Viewing The Temp Drive For The VM Type Selected: (screenshot)

NOTE: The 1ES Hosted Pool team has said the Temp drive should be a consistent size and be completely cleaned on each VM allocation.
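
Since the temp drive is the part whose size is tied to the VM size, one mitigation for space-hungry stages is to route large intermediate output there. A hedged sketch, assuming $(Agent.TempDirectory) is mapped onto that drive on this image (the df call below would confirm it); SOURCE_BUILD_STAGING_DIR is a made-up variable name used only for illustration.

```yaml
steps:
  # Sketch: place large intermediates on the temp drive described in the note above.
  # Whether Agent.TempDirectory actually lives on the temp disk of this image is an
  # assumption; the df output below shows which mount it resolves to.
  - bash: |
      set -euo pipefail
      staging="$(Agent.TempDirectory)/staging"       # hypothetical staging location
      mkdir -p "$staging"
      echo "##vso[task.setvariable variable=SOURCE_BUILD_STAGING_DIR]$staging"
      df -h "$(Agent.TempDirectory)"                 # confirm capacity and backing disk
    displayName: Stage large outputs on the temp drive (sketch)
```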

davfost closed this as completed Feb 26, 2024