Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cleanup export temp attachment files #1326

Open
wants to merge 7 commits into
base: master
Choose a base branch
from

Conversation

dpapathanasiou
Copy link
Contributor

Changed create_attachments_zipfile() to stop leaving behind zip files in the system tmp folder which never get cleaned up.

Denis Papathanasiou added 3 commits June 3, 2014 15:46
… in the system tmp folder

which never get cleaned up, and have it return a data stream of the zip file contents intead.

Next up: fix the two places in the rest of the code which were calling this, and adapt them to
use the data stream in lieu of the file name.
the first caller of create_attachments_zipfile(), to treat the result
as a data stream, and catch and record any IOErrors as failed Export
objects in the database.

The prior logic was disgusting: it created a zip file in /tmp, read it
into memory to create a new file (a copy) elsewhere in the filesystem,
and never bothered with the original file in /tmp.
…(), to

accept the result back as a stream, and used django's FileWrapper to respond
to the http request immediately, rather than calling another to go read the
temp file first (which, of course, never gets cleaned up).

The response_with_mimetype_and_name() function in utils.export_tools.py is just
a mess: it depends on reading all its responses from the file system, and
mimetype assumption that all valid response types start with 'application/' is
just flat wrong!
@dpapathanasiou
Copy link
Contributor Author

It's not properly tested (b/c there are no good data sets for dev) but the changes are straightforward, and in the worst case scenario, someone's attached file export will fail.

But on the plus side, /tmp will no longer fill up with unnecessary files, and push the server to 0% free disk space.

@ukanga
Copy link
Contributor

ukanga commented Jun 4, 2014

I see that this still writes to the file system in the case of an export, this essentially has the same effect as the tmp folder with the disadvantage that now even after a reboot the file will still exist. Would it be ideal to use django storage here so that it can be saved on whichever default storage is in use e.g S3, rather than on the root file system always.

@dpapathanasiou
Copy link
Contributor Author

Either you didn't read the code correctly, or you didn't understand it.

In the current master, you create a temporary zip file, but then set delete to False[1], which goes against the default[2], and thus leaves the file in /tmp, without removing it.

Instead of passing back the data stream, so that temp storage can cleanup the file on its own, all you pass back is the temporary file name, and then you have to do a read from disk again.

So when an http download request happens through zip_export(), you create a zip file in /tmp, read it back into memory from the file system, respond to the http request, but leave it in /tmp because you have no way of handling the post-response condition.

Worse, when you call create_attachments_zipfile() from generate_attachments_zip_export() you create a zip file in /tmp, read the that zip file into memory so that you can make a new file (a copy!) elsewhere in the filesystem[3] and you leave the zip file in /tmp!

You could have just at least moved the file from /tmp, without creating a copy unnecessarily.

Abusing S3 and accumulating superfluous space and bandwidth costs b/c you do not use temp storage properly is not a good idea, either; it's just sweeping the problem under the rug.

Anyway, I am particularly upset since Forhub.org crashed on Monday because of this nonsense: we went from 80 gb free to 0 in a very short time.

My revisions do not leave files in /tmp after they are created, if you look at them carefully.

Also, there is no such thing as a file surviving in /tmp in ubuntu after reboot (which is fortunate, b/c it was the only way we recovered on Monday).

[1]

tmp = NamedTemporaryFile(delete=False)

[2] https://docs.python.org/2/library/tempfile.html#tempfile.NamedTemporaryFile
[3]
temp_file = open(zip_file)

@prabhasp
Copy link
Contributor

prabhasp commented Jun 4, 2014

The filesystem storage vs. non-storage discussion is a bit over my head. Sorry I don't have the time to dig in fully at the moment.

My one comment on the pull request (PR) is to delete this test if it is no longer relevant. It is failing on travis, and seems to be no longer relevant. (There are two others, but they are not related to the functionality in question in the PR).

@dpapathanasiou
Copy link
Contributor Author

So that test is still valid, though its secondary assertion[1] is technically incorrect.

According to the relevant standard[2], an attached (as opposed to an inlined) file should also suggest the filename in the same string.

So the test failed b/c 'attachment; filename=transportation_2011_07_25.zip' != 'attachment;' but the test assertion should be rewritten.

I also did a code review with @chrisnatali and @csytan who made some good suggestions that I will implement in this branch.

So let me make those updates (along with revising the test) and test on dev before this is merged.

[1]

self.assertEqual(response['Content-Disposition'], 'attachment;')

[2] http://tools.ietf.org/html/rfc2183

Denis Papathanasiou added 4 commits June 5, 2014 14:46
… disposition

string, which should include the filename (xform id string).
…ary file-like

object corresponding to the zipped file, rather than the entire data stream of the
file contents, since it is safer for the functions which invoke it to handle it
themselves.

As with the previous revision, this function does *not* leave files in /tmp after
the request is completed.
…le, rather than

the full data stream of the prior version.
… handle of the zip file,

rather than the full data stream of the prior version.
@dpapathanasiou
Copy link
Contributor Author

The challenge now is testing this properly; I deployed this branch via fabric to dev.formhub.org but I haven't been able to confirm completely that it would work properly on the regular formhub.org.

Any suggestions?

@prabhasp
Copy link
Contributor

prabhasp commented Jun 5, 2014

My suggestion is to link it to the s3 account where you cloned ossap photos
not too long ago, and trying to do an export on one of the ossap datasets.
On Jun 5, 2014 4:45 PM, "Denis Papathanasiou" [email protected]
wrote:

The challenge now is testing this properly; I deployed this branch via
fabric to dev.formhub.org but I haven't been able to confirm completely
that it would work properly on the regular formhub.org.

Any suggestions?

Reply to this email directly or view it on GitHub
#1326 (comment).

@dpapathanasiou
Copy link
Contributor Author

That is long gone, though I could spin up that image quickly enough.

Also, TBH, I'm not sure I'm triggering the export correctly through the UI, so let's go over that tomorrow.

try:
zip_data = create_attachments_zipfile(attachments)
zip_data.seek(0)
with open(os.path.join(file_path, filename), 'wb') as f:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where is this file being saved? Notice this is during an aync export, my worry was that this is in the local file system and would count towards using the storage space for the root file system. Is this file deleted at some point?

@dpapathanasiou
Copy link
Contributor Author

I'm not sure what you're asking here.

The only point of this branch is to fix the function called create_attachments_zipfile[1] which creates zip files in /tmp and leaves them there, because of how the NamedTemporaryFile is called[2].

Since create_attachments_zipfile is called in two places in the code[3], I merely changed those interfaces to accept the file-like object TemporaryFile returns, rather than the string of the filename.

In doing it this way, we do not leave a file in /tmp after the request is completed.

Now, you seem to be asking me what the logic inside generate_attachments_zip_export is doing.

I should be asking you that.

I have no idea. I'm merely preserving the existing interfaces so that export will continue to do what it's been doing.

A full refactoring (including untangling all the Byzantine logic in export and elsewhere) may not be in the cards.

My only priority in this branch is making sure we never run out of disk space because /tmp filled up with zip files (this actually happened this past Monday).

[1] https://github.com/SEL-Columbia/formhub/blob/master/utils/viewer_tools.py#L199
[2] https://github.com/SEL-Columbia/formhub/blob/master/utils/viewer_tools.py#L201
[3]

zip_file = create_attachments_zipfile(attachments)

zip_file = create_attachments_zipfile(attachments)

@dpapathanasiou
Copy link
Contributor Author

BTW, while I was going about figuring out how to test this pull request properly, I noticed that you use delete=False for NamedTemporaryFile in at least three other places[1],[2],[3].

Only in one of those cases do you cleanup the temporary file correctly[4].

So even the function I just fixed, you're calling attachment.full_filepath[5] on each attachment, which means that even though the temporary zip file will get cleaned up, none of the media files that compose it won't, because if they were originally from S3, they've just been freshly written to /tmp using the delete=False flag!

This is just astounding...

[1]

tmp = NamedTemporaryFile(suffix=ext, delete=False)

[2]
tmp = NamedTemporaryFile(suffix=ext, delete=False)

[3]
tmp = NamedTemporaryFile(delete=False)

[4]
os.unlink(tmp.name)

[5]
z.write(attachment.full_filepath, attachment.media_file.name)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants