Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Excel validator #62

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

FranBonath
Copy link
Contributor

Script to check library submission sheets for:

  • duplications of indexes in pools
  • color balance of indexes
  • wrong indexes (e.g. non-nucleotide letters)
  • mix of different indexes in one pool
  • index length
  • ordered cycle number vs. fragment length average

…ss of library submission sheets. Checks for empty entries, missing values and does basic checks on indexes, like check for double index in pool, color balance and length.
@@ -0,0 +1,495 @@
#!/usr/bin/env python

# load libraries
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should you organise library?
I like to sort things ;-)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorting is nice, but note that from __future__ import print_function is special and must be the first line in the script.

return(False, warnings_numeric)

# currently not used
#def validate_mol(self, min_mol, max_mol):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How could this part of the code be used?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is for future implementation of a check on pool molarity. I am waiting for Mattias to do some stuff, so Johannes (or me) can create a database, which is required to know which application requires what molarity.

@maxulysse
Copy link
Member

You know that you're better at Python than I.
But you requested me, so here are some comments (apart from that, I have no idea what the code is doing, and it looks good)

If I remember well there should be some guidelines for such coding practices.

I would comments about what the functions return, just before the definition of the function instead of at the end.
You should always add a space before # when you comment, or not, but be consistent ;-)
I love sorted lists (when importing libraries or such).

@jgruselius
Copy link

Nice work Fran. Out of curiosity how this works I looked through your code and left some comments. Sorry for sticking my nose in 😇

min_length = sorted(count_length.keys())[0]
index_list_colour = []
for index in index_seq:
index_colour = index.replace('T','G').replace('A','R').replace('C', 'R')

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not 100% sure what you do here (maybe you should add more comments explaining it 😛).. I'm pretty sure on two channel chemistry an A gives signal in both channels, i.e. it has both colors.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are absolutely right! My bad! I was too concentrated on getting the script to work...

if lev.distance(index.lower(), index_seq[i].lower()) < LibrarySheet.MAX_DISTANCE:
logger.warning('The index sequences {} and {} in pool {}'\
' display low diversity (only {} nt difference).'\
.format(index,index_seq[i], pool_name, lev.distance(index.lower(), index_seq[i].lower())))

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You calculate the same Lev. distance twice if the test returns true, if it's an expensive operation maybe you should store the result from lev.distance in a variable for re-use.

index_count = 1
for index in index_seq:
# checks that indexes only contain valid letters
charRE = re.compile(r'[^ATCGNatcgn\-.]')

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The regex doesn't change between iterations so you should probably compile it outside the loop.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was about to point out that . is a special regex character which means "any character" and that your regex was broken. However it seems that if it's within [^.] then it's a special case where it's not a special character. Anyway, I would escape it to help future people reading your code 😅 (it should still work the same when escaped)

charRE = re.compile(r'[^ATCGNatcgn\-\.]')

for row in index_list_colour:
try:
column.append(row[row_nr])
except IndexError:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe you should motivate why you catch and ignore this error.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, and log - even if it's a debug level log. pass is bad style 😉

Copy link
Member

@ewels ewels left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know you didn't ask for my review, but as others were getting in on the act I couldn't resist taking a look as well..

Some of the code made me go a bit cross-eyed when trying to figure out what you were doing, but generally it looks good - nice work! 👍

I've left a few comments generally whinging about code style and being "Pythonic". None of these are a big deal though, they're just best-practice things. They make life easier for anyone else maintaining your code (typically that person is future-you). People whined at me about this stuff for several years and it was annoying, but now that it's mostly ingrained, it's my time to wine at other people.. 😁

@@ -0,0 +1,495 @@
#!/usr/bin/env python

# load libraries
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorting is nice, but note that from __future__ import print_function is special and must be the first line in the script.

library_information_validator.py Outdated Show resolved Hide resolved
warning_low_div, warning_index_length,\
warning_index_balance = validator.validate_index(result_index, pool, sindex)

pool_warning = pool, [warning_low_div, warning_index_length, \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As above, the code clarity here is a little scary. Is there a reason that you can't use a dictionary? eg. something like:

pool_warnings.append({
    'pool': pool,
    'warning_low_div': warning_low_div,
    'warning_index_length': warning_index_length,
    'warning_index_balance': warning_index_balance,
    'warnings_cycle': warnings_cycle[i],
    'warning_index_mix': warning_index_mix,
    'warning_length_comp': warning_length_comp,
})

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No reason, just inexperience :/

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It made a bit more sense as I got further on into the code later. But generally, yes returning a structured response like this is better IMHO.

library_information_validator.py Outdated Show resolved Hide resolved
library_information_validator.py Outdated Show resolved Hide resolved
library_information_validator.py Show resolved Hide resolved
library_information_validator.py Outdated Show resolved Hide resolved
for row in index_list_colour:
try:
column.append(row[row_nr])
except IndexError:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, and log - even if it's a debug level log. pass is bad style 😉

library_information_validator.py Outdated Show resolved Hide resolved
library_information_validator.py Outdated Show resolved Hide resolved
@@ -57,12 +59,11 @@ def projectID(self):
if(len(re.findall('P\d+P\d+', plate_id))>0):
project_id_user = re.findall('P\d+', plate_id)[0]
else:
logger.error(
sys.exit(logger.error(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The argument for sys.exit() is usually a number which corresponds to an exit code. 0 is good, anything other than 0 is bad (error) - typically 1 when writing code.

logger.error is probably returning None or something else odd. What you want instead I think is the logger.error call followed by sys.exit(1)...

Copy link
Contributor Author

@FranBonath FranBonath Apr 1, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I got the idea to do it like this from the python documentation. There is says as follows:

"The optional argument arg can be an integer giving the exit status (defaulting to zero), or another type of object. [...] If another type of object is passed, None is equivalent to passing zero, and any other object is printed to stderr and results in an exit code of 1. In particular, sys.exit("some error message") is a quick way to exit a program when an error occurs."

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right - but this doesn't really change my point. logger.error always returns None (I just checked), which means that you are always exiting with exit status 0 - this is interpreted as exiting successfully.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ps. To clarify: sys.exit("some error message") would exit with a non-zero exit code. It's just sys.exit(logger.error("some error message")) which won't..

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:( and I thought I was right for once. But I bow to the master and will change it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

beard

@alneberg
Copy link
Member

@FranBonath should we close this one?

@FranBonath
Copy link
Contributor Author

shrug
or we update and implement it in the website, so ppl can pre-check their sample sheets.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants