Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Abstract ref_to_m5 from cram_io #589

Open
wants to merge 2 commits into
base: develop
Choose a base branch
from

Conversation

ThomasHickman
Copy link
Contributor

This pull request extracts the function m5_to_ref from cram_io and adds tests. m5_to_ref and helper functions for it are added in a new file called ref.c.

I've also added a new hFILE extension called hFILE_ref, internal to ref.c as a wrapper hFILE to carry information required by cram_populate_ref (such as file name, mFILE handle for MMapped files etc.). and also implemented the proposed in-memory I/O proposed in #417, so that this m5_to_ref returns a hFILE. Sorry about the number of commits, it took a bit of trying to get this working - I could squash the commits down into one if necessary

@ThomasHickman
Copy link
Contributor Author

I've squashed the commits into two commits and made a pull request for the in-memory I/O at #590. The original history is now at https://github.com/wtsi-hgi/htslib/tree/ref-old

@ThomasHickman
Copy link
Contributor Author

I've had a look at the double buffering problem - it's hard to stop buffering happening in a hFILE, so I think it's best to have a implement a "wrapper hFILE", which contains extra information, but redirects all hFILE calls to the wrapper object. I've implemented this above and made hFILE_refs a hFILE wrapper

@jkbonfield
Copy link
Contributor

Now the mem-io branch has been merged, is it possible to untangle this PR by removing those merges and rebasing on develop? I gave it a try, but it's a bit tricky and I ended up with something that didn't compile! You understand this bit better than me.

hfile.c Outdated
@@ -459,6 +488,14 @@ off_t hseek(hFILE *fp, off_t offset, int whence)

int hclose(hFILE *fp)
{
if(fp->is_hFILE_wrapper){
fp->backend->close(fp);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This generates a warning:

hfile.c: In function 'hclose':
hfile.c:493:27: warning: ignoring return value of function declared with attribute warn_unused_result [-Wunused-result]
         fp->backend->close(fp);

ref.c Outdated
}

int ref_close(hFILE* hf){
int error_code = 0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unused variable

ref.c Outdated

if(hfref->mf)
return mfclose(hfref->mf);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Undefined return value here. Please add an else clause.

@ThomasHickman
Copy link
Contributor Author

ThomasHickman commented Dec 14, 2017

I've rebased the current develop branch onto this, and cleaned up what's in the Abstract ref_to_m5 from cram_io and Add hFile wrappers commits, so that whitespace isn't changed and commits change what they say. To do this, I've also needed to force push to wtsi-hgi's branch, so you may need to force pull to get the commits correctly.
I've also reverted the changes in open_trace_file.c, as getting the filename that this returns isn't necessary, now I've implemented hFILE wrappers, fixed some other issues that I found while looking though the code and fixed the warnings reported by @jkbonfield.

@ThomasHickman
Copy link
Contributor Author

For ease of reviewing the code, this is what I've done in this pull request:

  1. Moved get_cache_basedir, expand_cache_path, get_int_threadid, mkdir_prefix to ref.c, all of which were helpers functions of parts of cram_populate_ref which dealt with ref fetching.
  2. Created m5_to_ref from cram_populate_ref, moving over the ref fetching code.
  3. Made cram_populate_ref compatiable with results returned by m5_to_ref
  4. Created hFILE wrappers, which are hFILEs which don't contain a buffer or a file pointer but contains extra information fields and a pointer to another hFILE. All public calls are redirected to be in terms or the inner hFILE, except hclose which calls both the wrapper and the inner hFILE's close function. Backends that are hFILE_wrappers are created with only a close method in the hFILE_backend structure and the open method calls hfile_init_wrapper instead of hfile_init.
  5. Created a hFILE_ref, which is a hFILE wrapper to store information needed by cram_populate_ref and to store information to correctly close mFILEs on hclose.

If it would be easier to review, I could also move the changes that implement hFILE_wrappers into a new pull request.

@valeriuo
Copy link
Contributor

Please make sure that all tests pass, on both HTSlib and samtools. I get 13 failures when building samtools (develop) against this HTSlib branch.

@ThomasHickman
Copy link
Contributor Author

In the above commits:

aclocal.m4 Outdated
@@ -0,0 +1,290 @@
# generated automatically by aclocal 1.15.1 -*- Autoconf -*-
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove this file from the pull request and add it to .gitignore.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. I think aclocal.m4 will be recreated by autoreconf -i or similar.

test/test_view -D test/xx#rg.tmp.cram > /dev/null && \
rm -rf $$REF_CACHE; \
else \
echo "Warning: testing of EBI fetching disabled"; \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Complete the message with instructions about how to enable this test.

@@ -121,6 +121,11 @@ AC_ARG_ENABLE([s3],
[support Amazon AWS S3 URLs])],
[], [enable_s3=check])

AC_ARG_ENABLE([http-tests],
[AS_HELP_STRING([--disable-http-tests],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Describe the option --disable-http-tests in the Configuration section of the INSTALL file.

@@ -361,6 +365,26 @@ check test: $(BUILT_PROGRAMS) $(BUILT_TEST_PROGRAMS)
test/test-regidx
cd test && REF_PATH=: ./test.pl $${TEST_OPTS:-}

test_ebi_fetch:
@if [ "$(ENABLE_HTTP_TESTS)" = true ] ; then \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add this script to a separate .sh file and run it from here (similarly to what has been done for the tabix test above).

cram/cram_io.c Outdated
fd->refs->fp = fp;
fd->refs->fn = r->fn;
r->is_md5 = 1;
if (!(tag = sam_hdr_find_key(fd->header, ty, "M5", NULL)))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you add curly braces to the else block, add them to the if block, as well, even if there is only one instruction in the block.

cram/cram_io.c Outdated
sanitise_SQ_lines(fd);
if (fd->refs->fp) {
if (bgzf_close(fd->refs->fp) != 0)
return -1;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct the indentation of this row.

cram/cram_io.c Outdated
fd->refs = refs;
if (fd->refs->fp) {
if (bgzf_close(fd->refs->fp) != 0)
return -1;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct the indentation of this row.

cram/cram_io.c Outdated
@@ -2383,7 +2168,7 @@ ref_entry *cram_ref_load(refs_t *r, int id, int is_md5) {
return NULL;
r->fn = e->fn;
if (!(r->fp = bgzf_open_ref(r->fn, "r", is_md5)))
return NULL;
return NULL;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change the tab to spaces.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@valeriuo All the code in the cram dir has mix of tabs and spaces. We should probably do a single commit to change it to match the rest of htslib, but for now it is like this because it was initially a direct copy from Staden io_lib which had the code in this layout.

We don't really need to fix these in cram dir in this PR, but we do when we move things to htslib's ref.c (which has been done).

AC_ARG_ENABLE([http-tests],
[AS_HELP_STRING([--disable-http-tests],
[disables tests that require querying to a http server])],
[disable_http_tests=yes], [disable_http_tests=no])
Copy link
Contributor

@valeriuo valeriuo Dec 20, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, it was decided that HTTP tests should be disabled by default. So the option should be renamed --enable-http-tests and the associated variable should only be set to true when this option is selected.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. Although @daviesrob may have a trivial perl http server (with no complex perl package dependencies) that could permit testing against a local server without needing network access. (I'm a bit wary of going that route though unless it's very little extra code to maintain.)

Copy link
Contributor

@jkbonfield jkbonfield left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The overall aim of this PR is very worth while and it's long overdue for the monstrosity of reference handling in cram_io.c to be tidied up and split into its own file. So thank you for working on this.

The basic restructuring seems sound, although I haven't yet tested it in anger and I see a few potential memory problems lurking. I'll give it some real world testing to see what happens.

I do wonder though whether the whole mFILE thing should be retired in favour of the in-memory hFILE layer istelf. When the CRAM code was written there was no hFILE layer present, but now we have two things serving the a similar purpose. However perhaps this improvement is best done as a subsequent PR.

PS. The main thing that mFILE can do whereas hFILE in-memory code cannot is utilise mmap.

hfile.c Outdated
@@ -461,6 +489,15 @@ off_t hseek(hFILE *fp, off_t offset, int whence)

int hclose(hFILE *fp)
{
if(fp->is_hFILE_wrapper){
// need to close this file anyway, ignore the return value
(void)fp->backend->close(fp);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You need to check the return value here. You can't discard it with a void cast either as the compiler still notices! :-)

hfile.c: In function 'hclose':
hfile.c:494:9: warning: ignoring return value of function declared with attribute warn_unused_result [-Wunused-result]
         (void)fp->backend->close(fp);
         ^

aclocal.m4 Outdated
@@ -0,0 +1,290 @@
# generated automatically by aclocal 1.15.1 -*- Autoconf -*-
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. I think aclocal.m4 will be recreated by autoreconf -i or similar.

AC_ARG_ENABLE([http-tests],
[AS_HELP_STRING([--disable-http-tests],
[disables tests that require querying to a http server])],
[disable_http_tests=yes], [disable_http_tests=no])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. Although @daviesrob may have a trivial perl http server (with no complex perl package dependencies) that could permit testing against a local server without needing network access. (I'm a bit wary of going that route though unless it's very little extra code to maintain.)

cram/cram_io.c Outdated
@@ -2383,7 +2168,7 @@ ref_entry *cram_ref_load(refs_t *r, int id, int is_md5) {
return NULL;
r->fn = e->fn;
if (!(r->fp = bgzf_open_ref(r->fn, "r", is_md5)))
return NULL;
return NULL;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@valeriuo All the code in the cram dir has mix of tabs and spaces. We should probably do a single commit to change it to match the rest of htslib, but for now it is like this because it was initially a direct copy from Staden io_lib which had the code in this layout.

We don't really need to fix these in cram dir in this PR, but we do when we move things to htslib's ref.c (which has been done).

hfile.c Outdated
errno = save;
fp = ((hFILE_wrapper*)fp)->innerhf;

free(fp);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is wrong. I think you wanted something like

hFILE *wrapper = ((hFILE_wrapper*)fp)->innerhf;
free(fp);
fp = wrapper;

However I'd need to think about it a bit to know which is the inner vs outer we should actually be freeing here.

htslib/ref.h Outdated
#ifndef HTSLIB_REF_H
#define HTSLIB_REF_H

#include "bgzf.h"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is bgzf.h here? I don't think this will work when installed either.

ref.c Outdated

// The buffer is freed in mfclose, but we need something for
// hclose to free (the caller of this), or it will error
((hFILE_wrapper*)hf)->innerhf->buffer = malloc(1);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feels icky. Is the comment really true about needing something? Note we can call free(NULL) fine these days. It's something that I still end up checking for out of habit, but it's been this way for decades now.

I haven't tested it, but would prefer ((hFILE_wrapper*)hf)->innerhf->buffer = NULL if that works. Alternatively if we're about to free innerhf->buffer and it's the same pointer as in the mfile then we can call mfsteal instead of mfclose which leaves the data unfreed (maybe nuances with mmap?) and let the hclose free as normal?

ref.c Outdated
else{
mf = NULL;
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm unsure about the above.

mfsteal returns the buffer and size from the mFILE and then destroys (frees) the mFILE. It's designed to turn an mFILE into a C buffer basically. Thus after the mfsteal call mf is now undefined and we cannot dereference it and nor should we pass it to open_hfile_ref.

Copy link
Contributor

@jkbonfield jkbonfield Jan 10, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm looking further I see how it works. It only uses mf if mfsteal failed, otherwise it sets mf to NULL. mfsteal can only fail if mfdetach fails, which in turn only fails if we're using mmap.

I think the logic in this function is therefore correct (sorry), but I'm wondering if it's a symptom of my mfsteal being too complex in error handling.

At the very least we probably should do something like this to get to a consistent point whether or not the steal fails:

if (!seq) {
    seq = mf->data;
    mf->data = NULL;
    mfdestroy(mf);
}

and remove the mf parameter from open_file_href. This all ties up with the ref_close comments though too, as that's where href->mf is used. Is the ref_close only expected to be doing mFILE things due to mmap?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for spamming. I've finally figured out what's happening here and I feel guilty for commenting on what turns out to be entirely my own demons!

So, just incase anyone else who is reading this cares...

The original code in cram_populate_ref was:

    /* Otherwise search full REF_PATH; slower as loads entire file */
    if ((mf = open_path_mfile(tag->str+3, ref_path, NULL))) {
        size_t sz;
        r->seq = mfsteal(mf, &sz);
        if (r->seq) {
            r->mf = NULL;
        } else {
            // keep mf around as we couldn't detach                                                              
            r->seq = mf->data;
            r->mf = mf;
        }
        r->length = sz;
        r->is_md5 = 1;
    } else {

(The comment would probably have been useful to keep, albeit with explanation that mfdetach won't free the data if we used mmap.)

Basically if the open_path_mfile used mmap on an md5 file then we need to keep the file descriptor open as we need it for munmap later on. Otherwise once it's open we can just steal the pointer can close it. This would have been the original way it was done, with mmap appearing later on. With hindsight it probably should have kept mf valid in all scenarios, for simplicity, and never attempted the mfsteal part at all. We probably needed a "mf_close_if_able" function or similar that closed the file descriptor if not mmaped so it's no longer consuming resources, but keep the rest of the mFILE struct present.

Hence hFILE has now taken the role as replacement for ref_entry *r and r->mf. It's the same code as my original, just with different structure names. Sorry Thomas. :)

I think what all of this points towards is that mFILE is overkill for what we need here. It was designed as a transparent drop in for stdio FILE I/O so it supports read/write, seeking, update, etc. We really just need a load-or-mmap function, which is what the hfile mem layer provides (minus the missing mmap bit).

PS. My head hurts!

ref.c Outdated
perror(path_tmp);

// Not fatal - we have the data already so keep going.
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We lost a return 0 (NULL now) in the if (!fp) clause from the original cram_io.c copy. The intent was if we fail to create a local temporary file name to cache the reference, then so be it - we do still have our reference and the code can continue to work, but for whatever reason (eg permission denied) being able to write to the cache is a non-fatal condition.

However continuing means we end up calling hwrite, hfclose etc on NULL.

Copy link
Contributor

@jkbonfield jkbonfield Jan 10, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have validated that this is an issue and that adding return 0 after the comment fixes it.
Try setting REF_CACHE to a directory that you don't have write access to and running the test-ref tool. It'll segmentation fault.

Edit: not return 0, but something like this:

        if (!fp) {
            perror(path_tmp);

            // Not fatal - we have the data already so keep going.                                               
            free(path);
            return ref;
        }

This still leaks 56 bytes somewhere though, but that appears to happen in the other code paths too:

==29772== 56 bytes in 1 blocks are definitely lost in loss record 4 of 7
==29772==    at 0x4C2CFA7: malloc (vg_replace_malloc.c:296)
==29772==    by 0x458038: hfile_init_fixed (hfile.c:127)
==29772==    by 0x458108: create_hfile_mem (hfile.c:815)
==29772==    by 0x458108: hopenv_mem (hfile.c:864)
==29772==    by 0x459C4A: hopen (hfile.c:1086)
==29772==    by 0x45AD62: mFILE_to_hFILE_ref (ref.c:245)
==29772==    by 0x45AE53: m5_to_ref (ref.c:317)
==29772==    by 0x43F105: cram_populate_ref (cram_io.c:1934)
==29772==    by 0x43F105: cram_get_ref (cram_io.c:2280)
==29772==    by 0x429F71: cram_decode_slice (cram_decode.c:2290)
==29772==    by 0x42A4F3: cram_next_slice (cram_decode.c:3103)
==29772==    by 0x42A4F3: cram_get_seq (cram_decode.c:3176)
==29772==    by 0x42AAC0: cram_get_bam_seq (cram_decode.c:3236)
==29772==    by 0x419EFD: sam_read1 (sam.c:1346)
==29772==    by 0x40417F: main (test_view.c:285)

}

return error_code;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a newline at the end of file.

cram/cram_io.c Outdated
return -1;
}
r->seq = seq;
r->mf = hfref->mf;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As we have hfref=ref above on line 1981 and free(ref) at line 2001, this is accessing freed memory. The free statement needs moving down a few lines.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Furthermore, the free(ref) isn't sufficient. If ref is an hFILE_wrapper then we also need to free the innerhf too. Eg:

            if (ref->is_hFILE_wrapper)
                free(((hFILE_wrapper *)ref)->innerhf); // hack!                                                  
            free(ref); // Can't hclose as this would free the buffer                                             

This is all feeling like the wrong solution and it's going to fail if we ever have other forms of wrappers, but without the above we leak memory.

I'm trying to recall the reasoning for the wrapper and why it can't just be an hFILE_mem hfile class without any wrapping up. I think the wrapper is there because we have another type (hFILE_ref), but perhaps we just want hFILE_mem expanded to be able to do the same as the ref one.

The ref version adds mf and file_name fields. mf is only needed due to mmap and the requirement to unmap at the end. File_name I can't see a use for in this code. If we have requirements for these then we may be better off adding them both to an extended hFILE_mem structure instead of introducing an entirely new wrapper layer.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jkbonfield The use of hfile_wrappers comes from the need to use the existing code in cram_populate_ref and functions that call it, without much modification - I didn't want to rewrite the logic in cram_populate_ref and cram_get_ref (and maybe other places). This required a structure that could hold both in-memory structures to be contained as well holding arbitrary hFiles generated from https://github.com/wtsi-hgi/htslib/blob/1a6a7e242d75c9b746653ee548d2bce430cb905b/ref.c#L304. The file name, length and mFile (if used) also needed to be stored, where the file name and length are used here: https://github.com/wtsi-hgi/htslib/blob/1a6a7e242d75c9b746653ee548d2bce430cb905b/cram/cram_io.c#L1989 and here: https://github.com/wtsi-hgi/htslib/blob/1a6a7e242d75c9b746653ee548d2bce430cb905b/cram/cram_io.c#L1983.

I could rewrite relevant code such that m5_to_ref would use an extended in-memory hFile pointing either memory or mmapped data, where this extended structure would contain a flag indicating whether this is an mmapped structure and it's length. However, it may be better to work to merge this PR as it is with the inelegant implementation, then refractor the code at a later time.

This creates a new file: `ref.c`, which contains a function `m5_to_path`. This function returns a valid path to a reference from the MD5 checksum of a reference. This replaces logic in cram_io, open_trace_file and mFILE.
This adds functionality implemented in b39e724.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants