
Running out of disk space on 1ES hosted Ubuntu VM image #1537

Closed
oleksandr-didyk opened this issue Dec 11, 2023 · 8 comments
Labels: Ops - Facilely (Operations issues that are easily accomplished or attained)


oleksandr-didyk commented Dec 11, 2023

Related to dotnet/arcade#13036

Since last week the source-build release pipeline has been hitting a "No space left on device" exception during a stage that requires ~6GB of disk space: https://dev.azure.com/dnceng/internal/_build/results?buildId=2332997&view=logs&j=3a04dffd-22cf-5707-fa41-9bd4708cc1db&t=43d9a424-b64c-5d30-c3bd-0fca912bd01a
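
For reference, a minimal pre-flight check of the kind that would have surfaced this earlier. This is only a sketch, not part of the actual release pipeline; the 6 GB threshold simply mirrors the stage requirement mentioned above.

```yaml
steps:
  # Sketch of a pre-flight disk check: fail fast with a clear error instead of
  # hitting "No space left on device" halfway through the stage.
  - bash: |
      set -euo pipefail
      min_free_gb=6    # assumed requirement of the ~6GB stage above
      free_gb=$(df -BG --output=avail "$(Agent.TempDirectory)" | tail -1 | tr -dc '0-9')
      echo "Free space under the agent temp directory: ${free_gb} GB"
      if [ "${free_gb}" -lt "${min_free_gb}" ]; then
        echo "##vso[task.logissue type=error]Only ${free_gb} GB free; the stage needs ~${min_free_gb} GB."
        exit 1
      fi
    displayName: Check available disk space (sketch)
```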

Based on the agent pool description in Azure, there should be no disk-space issues for this pipeline, as the VM has a temp disk of ~100GB. Nevertheless, reading through the issue linked above, we discovered that the settings shown in Azure for the hosted pool do not reflect the actual state: the procured VM has far less disk space available due to ephemeral disks.

It would be great if this could be addressed so that the source-build release pipeline can rely on the agent pool it uses, especially since its disk requirements are fairly low (maybe something like a maintenance job would help here?).
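
To make the "maintenance job" idea concrete, here is a hedged sketch of a cleanup step. The paths below are ones that tend to be large on hosted Ubuntu images; whether they exist, or are safe to remove, on this particular 1ES image is an assumption that would need to be verified first.

```yaml
steps:
  # Sketch of a cleanup/maintenance step run before the space-hungry stage.
  # Each removal is best-effort (|| true) so a missing path does not fail the job.
  - bash: |
      set -uo pipefail
      df -h /                                        # free space before cleanup
      sudo rm -rf /usr/local/lib/android || true     # Android SDK, often several GB if present
      sudo rm -rf /opt/hostedtoolcache/CodeQL || true
      sudo docker system prune --all --force || true # unused images/layers, if docker is installed
      df -h /                                        # free space after cleanup
    displayName: Free up disk space (sketch)
```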

The main reasons for using the pool in the first place are:

  • Ubuntu OS (most of the existing scripting was written with it in mind)
  • Dependent tooling available out of the box, specifically the az command line and its azure-devops extension. Installing the CLI and the extension requires installing Python packages, something we don't really want to deal with, especially given how restrictive the Secure Supply Chain task is when it comes to PyPI (see the sketch after this list).
  • Relatively quick agent procurement given the low resource requirements; the pipeline is just moving bits around, validating release information, and creating communications with partners.
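
The sketch referenced in the tooling bullet above: roughly what bootstrapping the az tooling would look like on an image that does not ship it. The extension is installed as a Python wheel, which is the Python-package dependency mentioned above; on the current pool the CLI and extension are assumed to be preinstalled, so none of this is needed there.

```yaml
steps:
  # Sketch of bootstrapping the az CLI's azure-devops extension on a bare image.
  - bash: |
      set -euo pipefail
      az version                                     # fails if the az CLI itself is missing
      az extension add --name azure-devops --only-show-errors
      az extension list --output table               # confirm the extension is registered
    displayName: Bootstrap az devops tooling (sketch)
```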
dougbu (Member) commented Jan 8, 2024

I believe that, like dotnet/arcade#13036, there's no further action we can take on this issue. I would appreciate your thoughts on simply closing this, @oleksandr-didyk, @ilyas1974, and @premun.

dougbu (Member) commented Jan 8, 2024

Also adding @mmitche since his PR for the somewhat-related #1514 issue is still open. My bet is we don't need both this issue and #1514 to track this.

oleksandr-didyk (Author) commented Jan 9, 2024

> I believe that, like dotnet/arcade#13036, there's no further action we can take on this issue. I would appreciate your thoughts on simply closing this, @oleksandr-didyk, @ilyas1974, and @premun.

I do think there is some action that can be taken here, such as either forbidding the use of the image completely, or documenting the image's limitations and any other helpful notes about this issue somewhere.

I am not sure exactly what the team would prefer or have time for, but given that some time was spent digging into the cause of the problem for the release infrastructure, I do think it's worth making sure someone else doesn't run into the same problem in a few months.

CC: @tkapin

tkapin (Member) commented Jan 19, 2024

I agree with Alex's statement above. If we don't use clean machines for each build, we should run a cleanup prior to each build to ensure the machines aren't filling up and failing non-deterministically.

missymessa (Member) commented:

@oleksandr-didyk and @tkapin what action should FR take on this issue?

oleksandr-didyk (Author) commented Jan 23, 2024

> @oleksandr-didyk and @tkapin what action should FR take on this issue?

IMO we should either forbid the use of the image, allow it and clearly document its limitations, or, as Tomas mentioned, provide a mechanism to keep the images clean so that the disk space limits aren't hit non-deterministically.

No preference for any of these; it would just be nice to have some clarity so we don't fall victim to the same issue in a few months.

garath (Member) commented Feb 6, 2024

I'm moving this from FR to Ops as it doesn't need immediate attention and isn't blocking.

I think it would be interesting for Ops to establish some clearer expectations on what the available disk space is in this queue. Perhaps even just the calculation: "this is the size of the disk that comes with this VM size, and here is how big the OS image build ends up being; thus you can expect n gigs free."
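
One low-effort way to ground that "n gigs free" number would be a one-off diagnostic job on the queue whose output gets recorded in the pool documentation. A sketch, using only standard Linux tools:

```yaml
steps:
  # Sketch of a diagnostic step: record the disk layout and actual free space an
  # agent from this queue starts with, so the documented expectation matches reality.
  - bash: |
      set -euo pipefail
      lsblk --output NAME,SIZE,TYPE,MOUNTPOINT       # OS disk vs. ephemeral/temp disk
      df -h                                          # free space per mount point
      echo "Agent work folder: $(Agent.WorkFolder)"
      df -h "$(Agent.WorkFolder)"                    # the space builds actually get to use
    displayName: Report agent disk layout (sketch)
```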

> If we don't use clean machines for each build, we should ...

To be clear, these machines are recycled after each build. Each job gets a new, fresh machine.

davfost self-assigned this Feb 26, 2024
davfost (Contributor) commented Feb 26, 2024

The original space issue was resolved in the parent issue by migrating pipelines to use different pools.

Since we are moving to 'stateless' 1ES Hosted Pools, the free space on a machine is defined by the VM size specified in the pool settings in the AzDo portal:

Getting To The VM Size Setting: (screenshot)

Viewing The Temp Drive For The VM Type Selected: (screenshot)

NOTE: The 1ES Hosted Pool team has said the Temp drive should be a consistent size and be completely cleaned on each VM allocation.
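
Since the temp drive is the part whose size is tied to the VM size, one mitigation for space-hungry stages is to route large intermediate output there. A hedged sketch, assuming $(Agent.TempDirectory) is mapped onto that drive on this image (the df call below would confirm it); SOURCE_BUILD_STAGING_DIR is a made-up variable name used only for illustration.

```yaml
steps:
  # Sketch: place large intermediates on the temp drive described in the note above.
  # Whether Agent.TempDirectory actually lives on the temp disk of this image is an
  # assumption; the df output below shows which mount it resolves to.
  - bash: |
      set -euo pipefail
      staging="$(Agent.TempDirectory)/staging"       # hypothetical staging location
      mkdir -p "$staging"
      echo "##vso[task.setvariable variable=SOURCE_BUILD_STAGING_DIR]$staging"
      df -h "$(Agent.TempDirectory)"                 # confirm capacity and backing disk
    displayName: Stage large outputs on the temp drive (sketch)
```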

davfost closed this as completed Feb 26, 2024