Skip to content

Find the contig sequence from a FASTA file of DNA sequences.

Notifications You must be signed in to change notification settings

sgoodfriend/contig-sequence

Repository files navigation

contig-sequence

To sequence the DNA for a given individual we typically fragment each chromosome to many small pieces that can be sequenced in parallel and then re-assemble the sequenced fragments into one long DNA sequence. This challenge takes on a specific subtask of this process.

Challenge

The input to the problem is at most 50 DNA sequences (i.e, the character set is limited to T/C/G/A) whose length does not exceed 1000 characters. The sequences are given in FASTA format (https://en.wikipedia.org/wiki/FASTA_format). These sequences are all different fragments of one chromosome.

The specific set of sequences you will get satisfy a very unique property: there exists a unique way to reconstruct the entire chromosome from these reads by gluing together pairs of reads that overlap by more than half their length.

The output of your program should be this unique sequence that contains each of the given input strings as a substring.

Example input (example.fasta)

>Frag_56
ATTAGACCTG
>Frag_57
CCTGCCGGAA
>Frag_58
AGACCTGCCG
>Frag_59
GCCGGAATAC

Example output

ATTAGACCTGCCGGAATAC

Usage

Dependencies

contig_sequence.py requires Biopython, a collection of Python tools for computational biology. Biopython is used to parse the FASTA files into individual sequences. To install Biopython with pip:

$ sudo pip install biopython

contig_sequence Usage

To find the contig sequence, you can either run contig_sequence.py with or without invoking python:

$ python contig_sequence.py test_seqs/example.fasta
ATTAGACCTGCCGGAATAC

The contig sequence is outputted to stdout unless if a single contig sequence cannot be found. In such cases there are is no stdout and an error code is returned:

  • 1: Python could not import Biopython.
  • -1: The inputted file had no sequences.
  • -2: Could not find a single contig sequence that encompasses all of the sequence data.
  • -3: The contig sequence is a loop.

Unit Test Usage

To run the unit tests associated with contig_sequence.py on the command line:

$ python -m unittest test_contig_sequence

Approach

Finding the contig sequence can be split into 6 steps:

  1. Read the fasta file to a list of biopython SeqRecords using SeqIO
  2. Use the fragments to generate a SeqPrefixHashMap, which represents a lookup table of prefix subsequence hashes. This map will be used in multiple pattern Rabin-Karp string searching to find matching ends.
  3. SeqPrefixHashMap.node_to_edges() iterates through all fragments and possible suffix subsequences. If the suffix's hash matches prefixes in the SeqPrefixHashMap table, compare the suffix and prefix for a match. If they match, generate a SeqJoinEdge.
  4. Generate a directed graph SeqJoinGraph of nodes (SeqNode represents a fragment) and edges (SeqJoinEdge generated by SeqPrefixHashMap.node_to_edges()).
  5. Find the longest path that traverses all of the nodes using SeqJoinLongestPath.
  6. Use the longest path of edges from SeqJoinGraph.__longest_path() to generate the output sequence in SeqJoinGraph.sequence().

SeqNode, SeqJoinEdge, and SeqJoinGraph are structural to solving the longest path problem. SeqHash is a class object that contains hashing constants and a helper method.

The most interesting parts are found in SeqPrefixHashMap (Steps 2-3) and SeqJoinLongestPath (Step 5), so I will expand upon what these classes are doing.

Steps 2-3: SeqPrefixHashMap

The prefix hash map in SeqPrefixHashMap is generated in the constructor. A 2-level dictionary (key: subsequence length, subsequence hash, value: list of SeqNodes) is created by iterating through each fragment. A hash for the first half of the fragment is calculated. Subsequent hashes are calculated by shifting the hash 7 bits, adding the next letter's value to the hash, and then modding the new value by the large prime Q. After each hash calculation, the SeqNode for this length-fragment pair is added to the dictionary.

Building up the edges in SeqPrefixHashMap.node_to_edges() is done in similar fashion by iterating through each fragment and calculating the hash of the latter half (suffix). Lookup length-hash in the table to see if there are any prefixes that possibly match. If so, the suffix is compared to the prefix by codon. A match generates a SeqJoinEdge, which tracks source and destination nodes and also the index in the source fragment where the overlap begins. Next, an additional letter is added to the beginning of the suffix until the entire fragment is tested. The hash calculation is a bit more complicated requiring multiplying the new letter by (27*k mod Q) before adding it to the existing hash. SeqHash.rq_for_length() handles the overflow by iteratively calculating the multiplier and modding by Q and then saving the result into an array.

Step 5: SeqJoinLongestPath

SeqJoinLongestPath is an amalgamation of two algorithms to calculate the longest path. If you can guarantee that the graph is linear (SeqJoinLongestPath.is_linear_graph() checks for this), then SeqJoinLongestPath.linear_longest_path() can be used. This simply finds the root node of the graph and then iterates through each node's edge till reaching the end.

If you cannot guarantee a linear graph, then the general longest path algorithm SeqJoinLongestPath.dfs_longest_path() is used. This algorithm will try to take advantage of a root node and end node being immediately obvious, but will otherwise have to resort to finding the longest path between every pair of nodes, returning the longest result. The dfs algorithm checks every simple path between source and destination, thus making the algorithm exponential in the worst case, which can be made even worse than a normal complete digraph because each length of suffix can match with every node on its own (a fragment of length 1000 with 100 other fragments can have 50,000 edges leading out of it).

Algorithmic Complexity

Space

The SeqPrefixHashMap takes O(NL) space, where N is the number of fragments & L is the length of fragments. SeqJoinGraph takes O(N+E) space. E is typically O(N); however, can be O(N2L) in worst-case. Therefore, typical space complexity is linear O(n) = O(NL); however, can be O(N2) in worst-case.

Time

All the 6 steps occur sequentially, so the time complexity is the worst of each step:

  1. Read file: O(NL) = O(n) (linear to input)
  2. SeqPrefixHashMap generation: O(NL) = O(n)
  3. SeqJoinEdge generation using SeqPrefixHashMap: Worst-case: O(n2). Typical: O(NL) = O(n)
  4. Generate SeqJoinGraph: Worst-case: O(N2L). Typical: O(N)
  5. Find longest path: Worst-case: O(2E) = O(2N*NL) = O(2Nn). Typical: O(N)
  6. Generate sequence from longest path: O(n)

Therefore, performance is typically linear to the size of the input; however, a worst-case of exponential time to the input is possible.

About

Find the contig sequence from a FASTA file of DNA sequences.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages