Question: how to extend a HMM to use String-like types as its state space? #12
Comments
Hi @camilogarciabotero, glad that you might be using my package!
Thank you for the quick response. The encoding part sounds nice and achievable, I'll give it a try.

On the other hand, as I understood it, the emissions display distributions that are independent between states (although each character's probability depends on the immediately previous character). So that, for instance, the

Sorry if I'm not very clear with the descriptions. I am following Ch. 2 of the textbook by Axelson-Fisk (2015) for reference.

Axelson-Fisk, M. (2015). Single Species Gene Finding. In Comparative Gene Finding: Models, Algorithms and Implementation (pp. 29-105).
Ok, I see the differences and yes, it makes sense that the emissions are ultimately influenced by both... Now I see your point. Then, is the second representation still not achievable with HiddenMarkovModels?
Not in the current state of the package. This specific variant is called an autoregressive HMM (AR-HMM), and it requires a slightly different implementation. I'm not sure I will include it in the package, at least in the near future. There is a trick, however, which would allow you to estimate an AR-HMM using a standard HMM. The trick is to define a new, aggregate state
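One way to read this trick, assuming the aggregate state pairs the hidden coding state with the current nucleotide (which matches the 8-state description later in this thread), is sketched below in plain Julia. The names `coding_trans`, `nuc_mats` and `aggregate_transitions`, the numbers, and the choice of drawing the next nucleotide from the matrix of the next coding state are all assumptions for illustration; none of this is the package's API.

```julia
# Illustrative inputs (made-up numbers, not estimates from real data):
# coding_trans[c, c2] : probability of moving from coding state c to c2
#                       (1 = coding, 2 = non-coding)
# nuc_mats[c][n, n2]  : probability that nucleotide n2 follows n in coding state c
coding_trans = [0.9 0.1; 0.2 0.8]
nuc_mats = [repeat([0.4 0.2 0.2 0.2], 4, 1), fill(0.25, 4, 4)]

# Build the transition matrix over aggregate states (c, n), flattened to 1:K*M.
function aggregate_transitions(coding_trans, nuc_mats)
    K = size(coding_trans, 1)   # number of coding states (2 here)
    M = size(nuc_mats[1], 1)    # number of nucleotides (4 here)
    A = zeros(K * M, K * M)
    for c in 1:K, n in 1:M, c2 in 1:K, n2 in 1:M
        i = (c - 1) * M + n
        j = (c2 - 1) * M + n2
        # Assumption: the next nucleotide is drawn from the matrix of the
        # *next* coding state c2, conditioned on the current nucleotide n.
        A[i, j] = coding_trans[c, c2] * nuc_mats[c2][n, n2]
    end
    return A
end

A = aggregate_transitions(coding_trans, nuc_mats)
```

With this construction every row of `A` sums to one, and the observation at each step is simply the nucleotide component of the aggregate state, so a standard HMM with deterministic emissions can stand in for the autoregressive model.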
Wait, on second thought I guess it is indeed possible with v0.1, but I need to wrap my head around it. I will try to spit out an example.
Cool, never heard of AR-HMMs. However, they are not referenced in the gene-finding theory I was following. It seems to be a more general framework where the observations depend on several previous steps, but the Markov process of the nucleotides seems to be only first-order, with each nucleotide depending only on the immediately previous one. So that the probability of a new sequence

Now in the HMMs the

```julia
function dnahmm(N)
    π = [π₁, π₂, π₃, π₄]
    A = zeros(Int, 4, 4) # initial distribution
    dists(A₁, A₂) # Coding and Non-coding transition probabilities
end
```

I will need to define a custom function to calculate the

```julia
N = Dict('N' => 0,'C' => 1)
```

And then I'm not quite sure how to make the predictions... but I would expect to do something like:

```julia
state_seq, obs_seq = rand(hmm, "ATCGTTGGGGGGCATGCCATGTTCGAGAGTCTTTGACCCAAGACACGTAACCTATGCTTGAACGCGCTGGGAAAT")
```
I will be looking forward to your example!
Hey @camilogarciabotero!
The reason I'm doubting is because I can't seem to estimate the exact parameters even with very long sequences. But it might just be due to the hill-climbing behavior of the Baum-Welch algorithm, which is vulnerable to local optima.
This is great! I can't wait to try it and let you know how it goes for me (how do you suggest I try it out? Making a fork? What's the best way?). Regarding:
I was actually working on a simple package to model DNA (
You can fork and play around with the code in the pull request #24 :)
Closed by mistake. The file with the code is now on
Ok, thanks again for following up on this question and FR. I tried it out and have some questions:
Why is
It looks like:
I think this might be a little bit simpler to have a However, as I could spot in the documentation, we can now easily create different HMMs with the interface, so my guess is that it could be easily modified...
As for the error, check out the new file `test/dna.jl`, it's just that I forgot to initialize the probability vectors with `rand_prob_vec`. It's fixed on the main branch. As for the storage format, I could have separated the two matrices `A` but I put them together. Maybe not my best move, but it's easy to change.
Yeah, I tried the update and it runs now, my bad. Regarding the states, for a simple DNA alphabet I am now getting this error for the following:
The order I gave in the docstring is an arbitrary example, it doesn't matter how you map letters to numbers as long as you do it consistently. This error is weird cause I test against it in my code, did you just run my file or make some changes?
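A consistent letter-to-number mapping can be as small as a pair of dictionaries. The sketch below uses plain Julia; the names `NUC_TO_INT`, `INT_TO_NUC`, `encode_dna` and `decode_dna` and the particular ordering A, C, G, T are hypothetical choices, not something the package prescribes.

```julia
# Hypothetical mapping; any ordering works as long as it is used everywhere.
const NUC_TO_INT = Dict('A' => 1, 'C' => 2, 'G' => 3, 'T' => 4)
const INT_TO_NUC = Dict(v => k for (k, v) in NUC_TO_INT)

# Turn a DNA string into a vector of integer observations.
encode_dna(seq::AbstractString) = [NUC_TO_INT[c] for c in seq]

# Turn integer observations back into a DNA string.
decode_dna(codes::AbstractVector{<:Integer}) = join(INT_TO_NUC[i] for i in codes)

obs = encode_dna("ATCGTT")
roundtrip = decode_dna(obs)
```

Keeping both directions next to each other makes it harder to encode with one ordering and decode with another.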
Ok, sorry for the bad attempts. I was finally able to run it completely. It looks like it is actually finding the hidden states. The thing is that since the
Maybe we can use those values to instantiate the
Yeah, definitely use realistic values, that's what I meant when I said you should test on actual field data. My initializations are all random so they don't mean anything.

If you already know the parameters

If you want to estimate the parameters, we need to provide a good initial guess to the
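One possible source of realistic starting values is to count nucleotide transitions in sequences whose coding status is already annotated. The following is a minimal sketch in plain Julia, assuming the sequence is already encoded as integers in `1:4`; the function name `estimate_transitions` and the `pseudocount` smoothing are assumptions added for illustration and are not part of HiddenMarkovModels.jl.

```julia
# Estimate a first-order nucleotide transition matrix from an encoded sequence.
# `pseudocount` (an added assumption) avoids zero rows for unseen nucleotides.
function estimate_transitions(obs::AbstractVector{<:Integer};
                              nstates::Int=4, pseudocount::Real=1.0)
    counts = fill(float(pseudocount), nstates, nstates)
    for t in 2:length(obs)
        counts[obs[t-1], obs[t]] += 1   # count the transition obs[t-1] -> obs[t]
    end
    return counts ./ sum(counts; dims=2)  # normalize each row to sum to one
end

# Toy encoded sequence (1 = A, 2 = C, 3 = G under a hypothetical mapping).
P = estimate_transitions([1, 2, 1, 2, 3])
```

Running this separately on annotated coding and non-coding stretches would give two matrices that could serve as an initial guess.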
yes
I will try them out, thank you very much for all the help. I was trying to understand the functions:
Are they custom for the
No, these functions are not related to how `nuc_trans` is stored (or how you encode the letters into numbers). They are just used to convert between a pair of integers (representing the coding state <= 2 and the nucleotide <= 4) and a single integer state <= 8. The former representation is the biological one, but the latter is used by my package.
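Such a conversion can be sketched with integer division. The function names `pair_to_state` and `state_to_pair` and the specific layout (all nucleotides of coding state 1 first, then those of coding state 2) are assumptions for illustration; the package's own helpers may order the states differently.

```julia
const N_NUC = 4  # number of nucleotides

# (coding state in 1:2, nucleotide in 1:4) -> single state in 1:8
pair_to_state(c::Integer, n::Integer) = (c - 1) * N_NUC + n

# single state in 1:8 -> (coding state, nucleotide)
function state_to_pair(s::Integer)
    c, n = divrem(s - 1, N_NUC)
    return c + 1, n + 1
end

s = pair_to_state(2, 3)   # second coding state, third nucleotide
pair = state_to_pair(s)
```

The two functions are inverses of each other, so a decoded sequence of single-integer states can always be mapped back to the biological pair.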
Hi @gdalle, I've been thinking about and trying this feature a bit, but still haven't figured out how to plug in the two variables

Best.
It all depends on whether you want to do decoding or learning. For decoding, these values are quite important; for learning they don't matter much cause they are just initial guesses. Why don't you try it with uniform distributions and see what happens?
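To illustrate why the values matter for decoding, here is a small self-contained Viterbi sketch in plain Julia that works directly on the two coding states and treats the nucleotide transition probability as the emission term. Everything in it is an assumption for illustration: the function name `viterbi_coding`, the toy parameters, and the convention that the nucleotide at time `t` is scored with the matrix of the coding state at time `t`. It does not use the package's own decoding functions.

```julia
# Most likely sequence of coding states for an integer-encoded DNA sequence.
# init_coding[c]      : initial probability of coding state c
# init_nuc[c][n]      : probability of the first nucleotide n in state c
# coding_trans[c, c2] : coding state transition probabilities
# nuc_mats[c][n, n2]  : nucleotide transition probabilities within state c
function viterbi_coding(obs, init_coding, init_nuc, coding_trans, nuc_mats)
    K, T = length(init_coding), length(obs)
    δ = fill(-Inf, K, T)   # best log-probability of a path ending in state c at time t
    ψ = zeros(Int, K, T)   # backpointers
    for c in 1:K
        δ[c, 1] = log(init_coding[c]) + log(init_nuc[c][obs[1]])
    end
    for t in 2:T, c2 in 1:K
        emit = log(nuc_mats[c2][obs[t-1], obs[t]])
        scores = [δ[c, t-1] + log(coding_trans[c, c2]) for c in 1:K]
        ψ[c2, t] = argmax(scores)
        δ[c2, t] = scores[ψ[c2, t]] + emit
    end
    path = zeros(Int, T)
    path[T] = argmax(δ[:, T])
    for t in T-1:-1:1
        path[t] = ψ[path[t+1], t+1]   # follow backpointers
    end
    return path
end

# Toy parameters: state 1 strongly prefers nucleotide 1, state 2 is uniform.
init_coding = [0.5, 0.5]
init_nuc = [fill(0.25, 4), fill(0.25, 4)]
coding_trans = [0.9 0.1; 0.1 0.9]
nuc_mats = [repeat([0.7 0.1 0.1 0.1], 4, 1), fill(0.25, 4, 4)]

path_a = viterbi_coding([1, 1, 1, 1, 1], init_coding, init_nuc, coding_trans, nuc_mats)
path_g = viterbi_coding([3, 3, 3, 3], init_coding, init_nuc, coding_trans, nuc_mats)
```

With fully uniform nucleotide matrices in both states, the two paths would be indistinguishable, which is why informative values are needed when the goal is decoding rather than learning.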
Hey @gdalle

Thanks for working on this package. I'm very interested in applying this package to work with the `BioSequences.jl` package, which provides specialized types for working with DNA sequences. I'm interested in the application of HMMs to the gene-finding problem. There are several examples; a simple one consists of classifying a DNA sequence over an alphabet $S = \{A, C, G, T\}$ into coding or non-coding states. So, normally what one has is a string of characters (`ACTATCTATCT...`) in which one wants to locate the coordinates of the coding and non-coding regions. Generally, a coding sequence (CDS) has the characteristic that each triplet (codon) is translated into an amino acid.

Here is a simple representation of the problem as an HMM.

Where the sequence of characters corresponds to the emissions, and the coding (C) and non-coding (N) hidden states are the ones we want to unveil.
My question now is, what would be the best way to implement or extend the HiddenMarkovModels.jl types and methods such that we can have `BioSequences.jl` types as state space?

I am imagining something like this:

Now, assuming that in this sequence there exist two states, coding (C) and non-coding (N):
I then want to create a `hmm` model so that it takes the new sequence and the transitions of the emissions (`C -> G`, `C -> T`, ...) and predicts the locations of the hidden states.

Now, nucleotides in each state (C and N) display transition frequencies that are characteristic of each state. Normally one can represent the transitions between nucleotides as a Markov chain, where each transition probability $a_{ij}$ with $i,j \in S$ in a DNA sequence of length $T$ can be obtained as:
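Assuming $c_{ij}$ denotes the number of times nucleotide $i$ is immediately followed by nucleotide $j$ in the sequence (notation introduced here for illustration), the usual maximum-likelihood estimate takes the form:

$$a_{ij} = \frac{c_{ij}}{\sum_{k \in S} c_{ik}}$$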
The initial probability distribution is then denoted as $\pi = \{\pi_{A}, \pi_{C}, \pi_{G}, \pi_{T}\}$, estimated from the transition between characters $c$ from the sequence:
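One common estimate, assuming $c_{i}$ denotes the number of occurrences of nucleotide $i$ in the sequence (an assumed reading of the notation), is the relative frequency:

$$\pi_{i} = \frac{c_{i}}{\sum_{k \in S} c_{k}}$$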
Where probabilities of a new sequence are given by:
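For a first-order Markov chain with initial distribution $\pi$ and transition probabilities $a_{ij}$, the probability of a sequence $X = x_{1} x_{2} \dots x_{T}$ factorizes as follows (the symbols $X$ and $x_{t}$ are introduced here for illustration):

$$P(X) = \pi_{x_{1}} \prod_{t=2}^{T} a_{x_{t-1} x_{t}}$$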
In the classification problem they are used in a likelihood-ratio test to classify based on a decision rule:
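A typical form of such a rule, assuming $P(X \mid C)$ and $P(X \mid N)$ denote the probability of the sequence $X$ under the coding and non-coding Markov chains respectively (notation assumed here), classifies $X$ as coding when:

$$\log \frac{P(X \mid C)}{P(X \mid N)} > \eta$$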
Where $\eta$ is a threshold value for a significance level.
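The likelihood-ratio classification can be sketched in a few lines of plain Julia. The names `markov_loglik` and `log_odds`, the toy parameters, the threshold `η = 0.0` and the integer encoding of the sequence are all assumptions for illustration.

```julia
# Log-probability of an integer-encoded sequence under a first-order Markov chain.
function markov_loglik(obs, init, trans)
    ll = log(init[obs[1]])
    for t in 2:length(obs)
        ll += log(trans[obs[t-1], obs[t]])
    end
    return ll
end

# Log-likelihood ratio between the coding and non-coding chains.
log_odds(obs, coding, noncoding) =
    markov_loglik(obs, coding.init, coding.trans) -
    markov_loglik(obs, noncoding.init, noncoding.trans)

# Toy models (made-up numbers): the coding chain favors nucleotide 1.
coding = (init = fill(0.25, 4), trans = repeat([0.7 0.1 0.1 0.1], 4, 1))
noncoding = (init = fill(0.25, 4), trans = fill(0.25, 4, 4))

η = 0.0                 # hypothetical threshold
obs = [1, 1, 1, 1]      # encoded toy sequence
ratio = log_odds(obs, coding, noncoding)
is_coding = ratio > η
```

Working in log space avoids numerical underflow on long sequences, since the product of many small probabilities becomes a sum of logarithms.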
This led me to the idea that it might be necessary to use an extra package (or another implementation) to encode the Markov chain in order to get the transitions from nucleotide to nucleotide. I hope this was clear and rings a bell about what could be a nice strategy.
Best.