This repository has been archived by the owner on Jan 31, 2024. It is now read-only.

Repository bloat #178

Open
OndrejMarsalek opened this issue Apr 17, 2017 · 8 comments

Comments

@OndrejMarsalek
Collaborator

Because the repository never forgets, it easily bloats with data that is checked in and then removed again. The repository currently takes up around 220 MB, while the working tree is only around 35 MB. I looked for resources that could help and found this:

https://stubbisms.wordpress.com/2009/07/10/git-script-to-show-largest-pack-objects-and-trim-your-waist-line/

Running the script, I get a list that starts with the entries below. I think we should try to filter most of these from the history. The list is sorted by size and the last entries shown are around 1 MB, so looking even further might still make sense. If we don't maintain a separate repository for examples, we need to be careful not to make people download hundreds of MB just to run a simple simulation.

All sizes are in kB. The pack column is the size of the object, compressed, inside the pack file.
size   pack   SHA                                       location
30565  8618   0c12d4a625b53bca605d080b3ef03514d614a9c6  examples/ppi/qtip4pf/qtip4pf.pos_7.xyz
30565  8617   19aea3e92732eb3721a38cfcf909a5372ad8c7bc  examples/ppi/qtip4pf/qtip4pf.pos_5.xyz
30565  8618   2a1c03bc59f33d6095a663541892668fd93ae724  examples/ppi/qtip4pf/qtip4pf.pos_2.xyz
30565  8617   3041cf398fb10fa955b34971bb88664f334f8965  examples/ppi/qtip4pf/qtip4pf.pos_1.xyz
30565  8618   72ecdcbb1d92a474a12411ffee7097495a40895a  examples/ppi/qtip4pf/qtip4pf.pos_4.xyz
30565  8619   c1e61916a1dda9c95133da4c734a4e1d64cdf54a  examples/ppi/qtip4pf/qtip4pf.pos_0.xyz
30565  8618   c590357735fca50c993f3973e3bd54ea8010d02f  examples/ppi/qtip4pf/qtip4pf.pos_6.xyz
30565  8617   ffc3f0c99c22b9a5dc6c5bffc8de767c19b7d219  examples/ppi/qtip4pf/qtip4pf.pos_3.xyz
30564  9198   12e029354bca5043f93fb8436b5a1ccbfea65c81  examples/ppi/qtip4pf/qtip4pf.force_3.xyz
30564  9197   31d1f02d9a1ce3f0224bcc221e0769b1c51df38d  examples/ppi/qtip4pf/qtip4pf.force_0.xyz
30564  9197   6e1ba45f6efc329e1eeae1e90caba4dc5efd45ff  examples/ppi/qtip4pf/qtip4pf.force_5.xyz
30564  9196   a11d414c934f211410b606ec6a10b4c6e15ab460  examples/ppi/qtip4pf/qtip4pf.force_6.xyz
30564  9196   a4311bf1b268d6d58e25f259a364d4f7438cda4e  examples/ppi/qtip4pf/qtip4pf.force_2.xyz
30564  9197   b9b993ac2df51a380ae5b65da42cc7da389cac13  examples/ppi/qtip4pf/qtip4pf.force_7.xyz
30564  9198   c55b9aba770369b4c6516441b553834202b519d3  examples/ppi/qtip4pf/qtip4pf.force_4.xyz
30564  9197   e7281c1b170259d7f71bacb2ec2c05fe5e6e27d3  examples/ppi/qtip4pf/qtip4pf.force_1.xyz
17721  3605   a79efe62102e38b34e29fed219f00afda8a1892d  examples/ppi/qtip4pf/benchmark/qtip4pf.energies.dat
15500  4339   71032e78a592a092937c30b2a6ce771506e8c5ff  examples/lammps/h2o-mts/MTS-Ensemble/trial-01/rpc.pos_0.xyz
13785  13162  992bda057bd0e8dd4b9fb3fafa1b39d2f0f5e2f6  data/diss-zurich10.pdf
8995   1411   25347623730ac47702369201032ad73f7aa80cda  test/ph2/test_ph2.pdb
6608   6392   fd3d30f6b2a249968f88e2157aebda584e837de2  movies/ice-cage.flv
4345   1760   cdceced9914f6622f14a1878eca1d14791b33a39  examples/lammps/paracetamol-einstein/input.xml
3374   3179   1d1fd08788b781080445763eb970b1c9eb6b5dd5  data/ceri14psik-highlight.pdf
3224   1463   34d41828d417b38a91534326c4acdd7270f4f3f9  examples/lj/nst/reference/lj-nst.pos_0.pdb
3122   2993   cff1c905a7651caa25aac7f44f2e775a50096794  data/i-pi_1.0.zip
2902   2769   8d650c408757239e49ab82138bb63c5ec124028a  data/tut-lugano10.pdf
2085   1959   9c8e64ebb9605c8f3849c71cd83ea450a9a703e1  data/lugano10.pdf
1972   266    49a85bc43c83197f25d01fb6d17f9c3638dcf93a  examples/lammps/newdyn/nst-ice.xc.pdb
1913   333    3b8aea8aae32e049510f9f41c5ffe8951182bd59  examples/cp2k/basis/dftd3.dat
1859   158    4416e8913a2d64baaa44c126ce9cc43c21ce16bf  examples/lammps/h2o-mts/MTS-Ensemble/trial-04/log.lammps
1787   1003   1ae7d88e2ac652f70a6587c768447c0d2e61c25e  examples/lj/nst/reference/lj-nst.pos_0.pdb
1736   1011   037343786bd33f7d9e74af45b0ad2512e6b0c7e1  examples/lj/nst/reference/lj-nst.pos_0.pdb
1556   431    57652c560fac09ed25128b1db0058ee993a31202  examples/lammps/h2o-mts/MTS-Ensemble/trial-04/rpc.pos_0.xyz
1344   814    14266545d35648a107bbe12aba9852e466f2a5e9  examples/lj/nst/reference/lj-nst.pos_0.pdb
1228   1197   3e63eeeddbb442bb3040a33195a5a23de7668d64  images/header-homepage.jpg
1213   1183   b82671699bbf75cbe79fcab1dbc88924612282b4  data/i-pi_hands-on.zip
1034   419    21e92a4ed33f15d863ce7798449a506eb53da4f8  examples/lammps/paracetamol-phonons/simulation-fd.dynmat
996    411    5bcaf53662751881fd0f3a261046992301e4aab8  examples/lammps/paracetamol-debye/hessian.data
989    409    da9da9f135199c5f408cd06d1d526f4cf893c843  examples/lammps/paracetamol-phonons/simulation-fd.hess
950    404    2119abc2cd589b225fac326fb5e99cd3d720bc5c  examples/lammps/paracetamol-phonons/simulation-fd.mode
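
For anyone who wants to reproduce a listing like this without the linked script, here is a minimal Python sketch of the same idea. It is not the linked script itself: it assumes the sizes come from `git verify-pack -v` and the paths from `git rev-list --objects --all`, and the function names (`parse_verify_pack`, `parse_rev_list`, `largest_blobs`, `report`) are made up for illustration.

```python
import glob
import os
import subprocess


def parse_verify_pack(text):
    """Return {sha: (size_bytes, packed_bytes)} for blobs in `git verify-pack -v` output."""
    sizes = {}
    for line in text.splitlines():
        fields = line.split()
        # Object lines look like: <sha> <type> <size> <size-in-pack> <offset> [...]
        if len(fields) >= 5 and fields[1] == "blob":
            sizes[fields[0]] = (int(fields[2]), int(fields[3]))
    return sizes


def parse_rev_list(text):
    """Return {sha: path} from `git rev-list --objects --all` output."""
    paths = {}
    for line in text.splitlines():
        sha, _, path = line.partition(" ")
        if path:  # commits are listed without a path, skip them
            paths[sha] = path
    return paths


def largest_blobs(pack_text, rev_text, top=10):
    """Return (size_kB, pack_kB, sha, path) rows, largest first."""
    sizes = parse_verify_pack(pack_text)
    paths = parse_rev_list(rev_text)
    rows = [
        (size // 1024, packed // 1024, sha, paths.get(sha, "?"))
        for sha, (size, packed) in sizes.items()
    ]
    rows.sort(reverse=True)
    return rows[:top]


def report(repo=".", top=10):
    """Collect the two git listings for a non-bare clone and combine them."""
    pattern = os.path.join(os.path.abspath(repo), ".git", "objects", "pack", "pack-*.idx")
    idx_files = glob.glob(pattern)
    if not idx_files:
        return []  # nothing is packed yet; running `git gc` first creates a pack
    pack_text = subprocess.run(
        ["git", "-C", repo, "verify-pack", "-v", *idx_files],
        capture_output=True, text=True, check=True,
    ).stdout
    rev_text = subprocess.run(
        ["git", "-C", repo, "rev-list", "--objects", "--all"],
        capture_output=True, text=True, check=True,
    ).stdout
    return largest_blobs(pack_text, rev_text, top)
```

Calling `report(".")` from inside a clone returns the rows; blobs that are no longer reachable from any ref show up with `?` as their location.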
@ipoltavskyi
Collaborator

ipoltavskyi commented Apr 17, 2017 via email

@OndrejMarsalek
Collaborator Author

That would certainly be useful and will make the working tree slimmer, but the trickier part is removing it and other large deleted files from the repository. Because this means rewriting history, I want to be careful. Does anyone have experience with filter-branch?

@tomspur
Collaborator

tomspur commented Apr 17, 2017

I did this a few times and it worked just fine. It is just unclear to me whether you then need to delete this whole repository or whether the older branches can stay. So far, I was the only user of the repositories where I did this, so this wasn't an issue back then.

@ceriottm
Collaborator

ceriottm commented Apr 17, 2017 via email

@OndrejMarsalek
Collaborator Author

I just tried using this tool:

https://rtyley.github.io/bfg-repo-cleaner/

and it seems to work great. After filtering all files larger than 1 MB (the specifics can be tweaked for the production run, of course), I get a much more acceptable 25 MB .git directory instead of the 220 MB. You can try it yourself locally; just make sure that you don't push anything. Note that I found it better to work without the --mirror option, so that you can run git commands more easily, including the large-file script I posted before.
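
As a rough sketch of what such a local trial run could look like, the Python snippet below builds and runs the command sequence from the BFG documentation: strip large blobs, then expire the reflog and garbage-collect so the space is actually freed. The jar location (`bfg.jar`), the clone path, and the helper names are assumptions for illustration; only run this on a throwaway clone.

```python
import subprocess


def bfg_cleanup_commands(repo, bfg_jar="bfg.jar", limit="1M"):
    """Return the commands for a local BFG trial run on the clone at `repo`.

    `bfg_jar` is a hypothetical path to the downloaded BFG jar; `limit` is
    passed to BFG's --strip-blobs-bigger-than option.
    """
    return [
        # Rewrite history, dropping blobs above the size limit.
        ["java", "-jar", bfg_jar, "--strip-blobs-bigger-than", limit, repo],
        # The old objects stay on disk until the reflog is expired and gc runs.
        ["git", "-C", repo, "reflog", "expire", "--expire=now", "--all"],
        ["git", "-C", repo, "gc", "--prune=now", "--aggressive"],
    ]


def run_cleanup(repo, **kwargs):
    """Run the commands in order and stop at the first failure. Does NOT push."""
    for cmd in bfg_cleanup_commands(repo, **kwargs):
        print("running:", " ".join(cmd))
        subprocess.run(cmd, check=True)
```

For example, `run_cleanup("i-pi-trial", bfg_jar="/path/to/bfg.jar")` would process a fresh clone in a directory named `i-pi-trial` (a made-up name). By default BFG leaves the contents of the latest commit untouched, so files still present in the current working tree are kept.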

The push to GitHub will be the most sensitive part of this operation. Once that is done, everyone with write access must update their local clones and never push from a clone of the old bloated repository. We need to find a way to coordinate this. I suggest setting a date and time well in advance, sending a big fat warning to everyone with push access, and getting explicit confirmation that they know about it and will not push the old repository.

@grhawk
Contributor

grhawk commented Apr 18, 2017

The main problem is that once we rewrite the history, everyone must delete their local repo and download the one with the new history...
[EDIT]
exactly as Ondrej said at the end of his last post :)

@OndrejMarsalek
Collaborator Author

It requires some coordination, but unless we plan to turn it into a weekly activity, I think it is worth it. The best way to ensure that it stays rare is to be careful when pushing stuff to the repo.
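
One way to make "being careful" less dependent on memory is a local pre-commit hook that refuses large files. The following is only a sketch under assumptions: the 1 MB threshold and the function names are made up, and the hook has to be installed by each contributor, because hooks are not distributed with a clone.

```python
import subprocess
import sys

LIMIT_BYTES = 1_000_000  # assumed threshold, adjust to taste


def find_oversized(sizes, limit=LIMIT_BYTES):
    """Return the sorted paths whose staged size exceeds `limit` bytes."""
    return sorted(path for path, size in sizes.items() if size > limit)


def staged_sizes():
    """Return {path: size_in_bytes} for files added, copied or modified in the index."""
    names = subprocess.run(
        ["git", "diff", "--cached", "--name-only", "-z", "--diff-filter=ACM"],
        capture_output=True, text=True, check=True,
    ).stdout
    sizes = {}
    for path in filter(None, names.split("\0")):
        # ":<path>" names the staged version of the file, not the working-tree copy.
        out = subprocess.run(
            ["git", "cat-file", "-s", f":{path}"],
            capture_output=True, text=True, check=True,
        ).stdout
        sizes[path] = int(out)
    return sizes


def main():
    """Print the offending files and return a non-zero status to block the commit."""
    too_big = find_oversized(staged_sizes())
    for path in too_big:
        print(f"refusing to commit {path}: larger than {LIMIT_BYTES} bytes", file=sys.stderr)
    return 1 if too_big else 0
```

To use it as a hook, the script would be saved as an executable `.git/hooks/pre-commit` and end with `sys.exit(main())`. A deliberate large commit can still bypass it with `git commit --no-verify`.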

@grhawk
Contributor

grhawk commented Apr 26, 2017

This could also be the right moment to separate the examples from the repo of the actual code. It would make code revision much, much simpler...

5 participants