-
Notifications
You must be signed in to change notification settings - Fork 163
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Cleanup export temp attachment files #1326
base: master
Are you sure you want to change the base?
Conversation
… in the system tmp folder which never get cleaned up, and have it return a data stream of the zip file contents intead. Next up: fix the two places in the rest of the code which were calling this, and adapt them to use the data stream in lieu of the file name.
the first caller of create_attachments_zipfile(), to treat the result as a data stream, and catch and record any IOErrors as failed Export objects in the database. The prior logic was disgusting: it created a zip file in /tmp, read it into memory to create a new file (a copy) elsewhere in the filesystem, and never bothered with the original file in /tmp.
…(), to accept the result back as a stream, and used django's FileWrapper to respond to the http request immediately, rather than calling another to go read the temp file first (which, of course, never gets cleaned up). The response_with_mimetype_and_name() function in utils.export_tools.py is just a mess: it depends on reading all its responses from the file system, and mimetype assumption that all valid response types start with 'application/' is just flat wrong!
It's not properly tested (b/c there are no good data sets for dev) but the changes are straightforward, and in the worst case scenario, someone's attached file export will fail. But on the plus side, /tmp will no longer fill up with unnecessary files, and push the server to 0% free disk space. |
I see that this still writes to the file system in the case of an export, this essentially has the same effect as the tmp folder with the disadvantage that now even after a reboot the file will still exist. Would it be ideal to use django storage here so that it can be saved on whichever default storage is in use e.g S3, rather than on the root file system always. |
Either you didn't read the code correctly, or you didn't understand it. In the current master, you create a temporary zip file, but then set delete to False[1], which goes against the default[2], and thus leaves the file in /tmp, without removing it. Instead of passing back the data stream, so that temp storage can cleanup the file on its own, all you pass back is the temporary file name, and then you have to do a read from disk again. So when an http download request happens through zip_export(), you create a zip file in /tmp, read it back into memory from the file system, respond to the http request, but leave it in /tmp because you have no way of handling the post-response condition. Worse, when you call create_attachments_zipfile() from generate_attachments_zip_export() you create a zip file in /tmp, read the that zip file into memory so that you can make a new file (a copy!) elsewhere in the filesystem[3] and you leave the zip file in /tmp! You could have just at least moved the file from /tmp, without creating a copy unnecessarily. Abusing S3 and accumulating superfluous space and bandwidth costs b/c you do not use temp storage properly is not a good idea, either; it's just sweeping the problem under the rug. Anyway, I am particularly upset since Forhub.org crashed on Monday because of this nonsense: we went from 80 gb free to 0 in a very short time. My revisions do not leave files in /tmp after they are created, if you look at them carefully. Also, there is no such thing as a file surviving in /tmp in ubuntu after reboot (which is fortunate, b/c it was the only way we recovered on Monday). [1] Line 201 in 9062e9d
[2] https://docs.python.org/2/library/tempfile.html#tempfile.NamedTemporaryFile [3] Line 691 in 5e8e176
|
The filesystem storage vs. non-storage discussion is a bit over my head. Sorry I don't have the time to dig in fully at the moment. My one comment on the pull request (PR) is to delete this test if it is no longer relevant. It is failing on travis, and seems to be no longer relevant. (There are two others, but they are not related to the functionality in question in the PR). |
So that test is still valid, though its secondary assertion[1] is technically incorrect. According to the relevant standard[2], an attached (as opposed to an inlined) file should also suggest the filename in the same string. So the test failed b/c 'attachment; filename=transportation_2011_07_25.zip' != 'attachment;' but the test assertion should be rewritten. I also did a code review with @chrisnatali and @csytan who made some good suggestions that I will implement in this branch. So let me make those updates (along with revising the test) and test on dev before this is merged. [1] formhub/main/tests/test_form_exports.py Line 121 in 9062e9d
[2] http://tools.ietf.org/html/rfc2183 |
… disposition string, which should include the filename (xform id string).
…ary file-like object corresponding to the zipped file, rather than the entire data stream of the file contents, since it is safer for the functions which invoke it to handle it themselves. As with the previous revision, this function does *not* leave files in /tmp after the request is completed.
…le, rather than the full data stream of the prior version.
… handle of the zip file, rather than the full data stream of the prior version.
The challenge now is testing this properly; I deployed this branch via fabric to dev.formhub.org but I haven't been able to confirm completely that it would work properly on the regular formhub.org. Any suggestions? |
My suggestion is to link it to the s3 account where you cloned ossap photos
|
That is long gone, though I could spin up that image quickly enough. Also, TBH, I'm not sure I'm triggering the export correctly through the UI, so let's go over that tomorrow. |
try: | ||
zip_data = create_attachments_zipfile(attachments) | ||
zip_data.seek(0) | ||
with open(os.path.join(file_path, filename), 'wb') as f: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where is this file being saved? Notice this is during an aync export, my worry was that this is in the local file system and would count towards using the storage space for the root file system. Is this file deleted at some point?
I'm not sure what you're asking here. The only point of this branch is to fix the function called create_attachments_zipfile[1] which creates zip files in /tmp and leaves them there, because of how the NamedTemporaryFile is called[2]. Since create_attachments_zipfile is called in two places in the code[3], I merely changed those interfaces to accept the file-like object TemporaryFile returns, rather than the string of the filename. In doing it this way, we do not leave a file in /tmp after the request is completed. Now, you seem to be asking me what the logic inside generate_attachments_zip_export is doing. I should be asking you that. I have no idea. I'm merely preserving the existing interfaces so that export will continue to do what it's been doing. A full refactoring (including untangling all the Byzantine logic in export and elsewhere) may not be in the cards. My only priority in this branch is making sure we never run out of disk space because /tmp filled up with zip files (this actually happened this past Monday). [1] https://github.com/SEL-Columbia/formhub/blob/master/utils/viewer_tools.py#L199 Line 564 in a5b6ef7
Line 679 in 5e8e176
|
BTW, while I was going about figuring out how to test this pull request properly, I noticed that you use delete=False for NamedTemporaryFile in at least three other places[1],[2],[3]. Only in one of those cases do you cleanup the temporary file correctly[4]. So even the function I just fixed, you're calling attachment.full_filepath[5] on each attachment, which means that even though the temporary zip file will get cleaned up, none of the media files that compose it won't, because if they were originally from S3, they've just been freshly written to /tmp using the delete=False flag! This is just astounding... [1] formhub/odk_logger/models/attachment.py Line 45 in 9062e9d
[2] formhub/odk_viewer/models/export.py Line 151 in 5e8e176
[3] Line 649 in a5b6ef7
[4] Line 655 in a5b6ef7
[5] Line 207 in 9062e9d
|
Changed create_attachments_zipfile() to stop leaving behind zip files in the system tmp folder which never get cleaned up.