Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add file size check #1491

Merged
merged 4 commits into from
Dec 6, 2023
Merged

Add file size check #1491

merged 4 commits into from
Dec 6, 2023

Conversation

fbenke-pik
Copy link
Contributor

@fbenke-pik fbenke-pik commented Dec 1, 2023

Purpose of this PR

Following up on this discussion, this PR introduces a simple file size limit check for all files currently under version control. It is part of the workflow test-code that is run for any PRs and commits on main, master and develop branches.

The 600 kB limit is chosen to exceed the current largest file
modules/31_fossil/exogenous/input/p31_fix_fuelex.put (570.67kB)

Type of change

  • New feature

Checklist:

  • My code follows the coding etiquette
  • I performed a self-review of my own code
  • I explained my changes within the PR, particularly in hard-to-understand areas
  • I checked that the in-code documentation is up-to-date

@fbenke-pik
Copy link
Contributor Author

Remark: I adapted the magpie approach here.

Discussion points:

  • Do we want to reduce the file size limit further?
  • Do we want to restrict the check to new files only? In that case, we need to come up with a clever way to define "new files". Maybe new files in comparison to develop?
  • If we want to reduce file size limit, but not restrict it to new files, we must come up with a way to deal with files currently exceeding the limit: either fixing the files or defining hard-coded exceptions comes to mind.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at the large files in the repository

$ git ls-tree -r HEAD --name-only | xargs ls -sh | sort -k 1hr,1 | head -n 20
572K modules/31_fossil/exogenous/input/p31_fix_fuelex.put
216K core/input/grid/EDGAR_CO2_2010_excl_agr_and_intl_transport.mz
216K core/input/grid/EDGAR_CO2.mz
212K core/input/grid/EDGAR_SO2_2005_excl_agr_and_intl_transport.mz
212K core/input/grid/EDGAR_SO2.mz
204K config/scenario_config_21_EU11_Fit_for_55_sensitivity.csv
200K tutorials/figures/git-7-pull-request-github-1.PNG
172K tutorials/figures/git-8-pull-request-github-2.PNG
148K tutorials/figures/appResults_window.png
120K core/sets.gms
112K scripts/output/comparison/notebook_templates/AriadneComparison.Rmd
108K main.gms

I feel it should be easy to move some stuff to mrremind and go to maybe 250 KB.

Also, could this not be put under the .github/ directory? The Remind directory is cluttered as it is …

@fbenke-pik
Copy link
Contributor Author

fbenke-pik commented Dec 1, 2023

Also, could this not be put under the .github/ directory? The Remind directory is cluttered as it is …

Yes, happy to do that

I feel it should be easy to move some stuff to mrremind and go to maybe 250 KB.

True, we can look into moving stuff. For some of the listed files, I have ideas. For others, not.
@LaviniaBaumstark should this be done before or after merging this PR?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having looked into this a bit, I would call this a workflow or something. It is not a git hook, and should not be confused with one.

  • Do we want to restrict the check to new files only? In that case, we need to come up with a clever way to define "new files". Maybe new files in comparison to develop?

git ls-tree -r HEAD --name-only "lists the contents of a given tree object", i.e. the files currently existing in the repository.
git diff --cached --name-only lists "the changes you staged for the next commit" i.e. the new files.
So drop the first to just get changed files.

Better yet, use substring(grep("^[AM]", system("git status --short --porcelain", intern = TRUE), value = TRUE), first = 4) to get only files that have been either added or modified, and not rely on the fact that deleted files do not exist and do not have a file size (out <- out[!is.na(out$size), ]) ;)

  • If we want to reduce file size limit, but not restrict it to new files, we must come up with a way to deal with files currently exceeding the limit: either fixing the files or defining hard-coded exceptions comes to mind.

I do not see the point in checking the size of all files for every single merge request. Looks like some sort of code atavism to me.

@fbenke-pik
Copy link
Contributor Author

fbenke-pik commented Dec 5, 2023

From what I can see, the problem with both your approaches is that the only work for the latest commit.
However, the actual check is run when you try to merge into develop or push to the develop branch of your fork in the github repo and that could include more than one commit. So any solution looking at only new files must then identify all relevant changes spanning possibly multiple commits. Or am I missing sth here, Michaja?

There is one alternative, to make this check part of these checks in pre-commit.ci using the hook "check-added-large-files". If this works as I hope it would, there should be a check of file size for all new files when opening a PR.

@fbenke-pik
Copy link
Contributor Author

Looks like pre-commit check won't work as expected: https://results.pre-commit.ci/run/github/226360184/1701792228.w1y28gzhSL-kQ9aVwps_zw

It is just fine with me adding a 14 MB mif on a test branch.

@0UmfHxcvx5J7JoaOhFSs5mncnisTJJ6q
Copy link
Member

Looks like pre-commit check won't work as expected: https://results.pre-commit.ci/run/github/226360184/1701792228.w1y28gzhSL-kQ9aVwps_zw

It is just fine with me adding a 14 MB mif on a test branch.

LOL.

  1. Does it work on the MAgPIE side?
  2. Which check was that? "check for added large files..............................................Passed" looks like some build-in Github stuff?
  3. Add
       files <- union(system("git ls-tree -r HEAD --name-only", intern = TRUE),
                      system("git diff --cached --name-only", intern = TRUE))
       out <- data.frame(files = files, size = round(file.size(files) / 1024, 2))
    +  print(file.info(files)["size"])
    +  stop()
       out <- out[!is.na(out$size), ]
       out <- out[out$size 
    to take a look what files the workflow actually sees.

@fbenke-pik
Copy link
Contributor Author

fbenke-pik commented Dec 6, 2023

Looks like pre-commit check won't work as expected: https://results.pre-commit.ci/run/github/226360184/1701792228.w1y28gzhSL-kQ9aVwps_zw

There is one alternative, to make this check part of these checks in pre-commit.ci using the hook "check-added-large-files". If this works as I hope it would, there should be a check of file size for all new files when opening a PR.

Sorry for the confusion, this problem does not concern the code added in this PR. It concerns an alternative approach I tried here and mentioned in the second quote above. It is not used by MAgPIE, and does not seem to work for us, so just ignore this.

So to answer your questions

  1. No, MAgPIE does not try this approach.
  2. Yes, it is built-in Github stuff, triggered by adding a hook in pre-config.ci
  3. This alternative approach does not utilize the code in .github/workflows/size-check

The approach of this PR works as expected (see below) and @LaviniaBaumstark will check what to do with the files larger 250kB and in the best case we can decrease the limit.

image

@LaviniaBaumstark
Copy link
Member

p31_fix_fuelex

Looking at the large files in the repository

$ git ls-tree -r HEAD --name-only | xargs ls -sh | sort -k 1hr,1 | head -n 20
572K modules/31_fossil/exogenous/input/p31_fix_fuelex.put
216K core/input/grid/EDGAR_CO2_2010_excl_agr_and_intl_transport.mz
216K core/input/grid/EDGAR_CO2.mz
212K core/input/grid/EDGAR_SO2_2005_excl_agr_and_intl_transport.mz
212K core/input/grid/EDGAR_SO2.mz
204K config/scenario_config_21_EU11_Fit_for_55_sensitivity.csv
200K tutorials/figures/git-7-pull-request-github-1.PNG
172K tutorials/figures/git-8-pull-request-github-2.PNG
148K tutorials/figures/appResults_window.png
120K core/sets.gms
112K scripts/output/comparison/notebook_templates/AriadneComparison.Rmd
108K main.gms

I feel it should be easy to move some stuff to mrremind and go to maybe 250 KB.

Also, could this not be put under the .github/ directory? The Remind directory is cluttered as it is …

regarding the biggest file: I would suggest to delete the whole realization https://github.com/remindmodel/remind/tree/develop/modules/31_fossil/exogenous @nicobauer?
regarding the EDGE-T data: I will talk to the transport people

@fbenke-pik fbenke-pik marked this pull request as ready for review December 6, 2023 14:31
@fbenke-pik
Copy link
Contributor Author

fbenke-pik commented Dec 6, 2023

Ok, since this is a working solution, I'd say we merge this for now and lower the threshold later as we delete/move files.

I am open to more elegant solutions replacing this later, but feel like this simple solution works well for now.

@LaviniaBaumstark
Copy link
Member

Ok, since this is a working solution, I'd say we merge this for now and lower the threshold later as we delete/move files.

I am open to more elegant solutions replacing this later, but feel like this simple solution works well for now.

I agree

@fbenke-pik fbenke-pik merged commit e1e232d into remindmodel:develop Dec 6, 2023
2 checks passed
@0UmfHxcvx5J7JoaOhFSs5mncnisTJJ6q
Copy link
Member

It concerns an alternative approach I […] mentioned in the second quote above.

Oh, I missed that one. Sorry.

From what I can see, the problem with both your approaches is that the only work for the latest commit.
However, the actual check is run when you try to merge into develop or push to the develop branch of your fork in the github repo and that could include more than one commit. So any solution looking at only new files must then identify all relevant changes spanning possibly multiple commits. Or am I missing sth here, Michaja?

Depends on when the workflow is run. Merges in git (other then fast-forwarding) work by applying all changes like a patch, stage them, and then commit them, which is why you only get one merge commit. So git status --short --procelain | grep "^[AM]" will show all files added or modified by all commits subsumed in the merging commit about to be made (when put in the pre-merge-commit hook).
So, if the workflow would be run as a git commit hook would, then it would see all the staged changes, and work on them.
If however the workflow is run on the branch to be merged (which I suspect), then there are no changes to be merged, and the system("git diff --cached --name-only", intern = TRUE) portion is quite useless.

So, the difference between hooks and workflows is not trivial.

@fbenke-pik
Copy link
Contributor Author

fbenke-pik commented Dec 6, 2023

I see, the workflow is currently triggered by push and pull request events. See

So I'd say, it would not be a proper commit hook in any case with the current solution.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants