Markov Chains - 23 August 2011 - BioInformatics Pakistan

		Monday, 2025-07-14
	BioInfo Pakistan

Home | Sign Up | Log In

Welcome Guest | RSS

Total online: 1

Guests: 1

Users: 0

Markov Chains

1. Description:

A Markov chain, named for Andrey Markov, is a mathematical system that undergoes transitions from one state to another (from a finite or countable number of possible states) in a chainlike manner. It is a random process characterized as memoryless: the next state depends only on the current state and not on the entire past. This specific kind of "memorylessness" is called the Markov property. Markov chains have many applications as statistical models of real-world processes.

2. Key Points:

If we can predict all of the properties of a sequence knowing only the conditional dinucleotide probabilities, then that sequence is an example of a Markov chain
A Markov chain is defined as a sequence of states in which each state depends only on the previous state

3. Formalism For Markov Chains:

M=(Q,.,P) is a Markov chain, where
Q = vector (1,..,n) is the list of states

Q(1)=A, Q(2)=C, Q(3)=G, Q(4)=T for DNA

p = vector (p1,..,pn) is the initial probability of each state

p(i)=pQ(i) (e,g., p(1)=pA for DNA)

P= n x n matrix where the entry in row i and column j is the probability of observing state j if the previous state is i and the sum of entries in each row is 1 ( dinucleotide probabilities)

P(i,j)=p*Q(i)Q(i) (e.g., P(1,2)=p*AC for DNA)

4. Generating Markov Chains:

Given Q,.,P (and a random number generator), we can generate sequences that are members of the Markov chain M
If .,P are derived from a single sequence, the family of sequences generated by M will include that sequence as well as many others
If .,P are derived from a sampled set of sequences, the family of sequences generated by M will be the population from which that set has been sampled

5. Matlab Code For Generating Markov Chains:

chars = ['a' 'c' 'g' 't'];

% the dinucs array shows the frequency of observing the character in the

% row followed by the character in the column

% these values show strong preference for c-c

dinucs = [2, 1, 2, 0; 0, 8, 0, 1; 2, 0, 2, 0; 1, 0, 0, 1];

% these values restrict transitions more

%dinucs = [2, 0, 2, 0; 0, 8, 0, 0; 2, 0, 2, 0; 1, 1, 0, 1];

% calculate mononucleotide frequencies only as the probability of

% starting with each nucleotide

monocounts = sum(dinucs,2);

monofreqs = monocounts/sum(monocounts);

cmonofreqs = cumsum(monofreqs);

% calculate dinucleotide frequencies and cumulative dinuc freqs

freqs = dinucs./repmat(monocounts,1,4);

cfreqs = cumsum(freqs,2);

disp('Dinucleotide frequencies (transition probabilities)');

fprintf(' %c %c %c %c\n',chars)

for i=1:4

fprintf('%c %f %f %f %f\n',chars(i),freqs(i,:))

end

nseq = 10;

for ntries=1:20

rnums = rand(nseq,1);

% start sequence using mononucleotide frequencies

seq(1) = min(find(cmonofreqs>=rnums(1)));

for i=2:nseq

% extend it using the appropriate row from the dinuc freqs

seq(i) = min(find(cfreqs(seq(i-1),:)>=rnums(i)));

end

output=chars(seq);

disp(strvcat(output));

end

Given examples sequences that are from either + model (CpG island) or - model (not CpG island), can calculate the probability that each nucleotide will occur for each model (the a values for each model)

+ A C G T - A C G T

A 0.180 0.274 0.426 0.120 A 0.300 0.205 0.285 0.210

C 0.171 0.368 0.274 0.188 C 0.322 0.298 0.078 0.302

G 0.161 0.339 0.375 0.125 G 0.248 0.246 0.298 0.208

T 0.079 0.355 0.384 0.182 T 0.177 0.239 0.292 0.292

ß A C G T

A -0.740 0.419 0.580 -0.803

C -0.913 0.302 1.812 -0.685

G -0.624 0.461 0.331 -0.730

T -1.169 0.573 0.393 -0.679

9. Example:

What is relative probability of C+G+C+ compared with C-G-C-?

First calculate log-odds ratio: S(CGC)= ß(CG) +ß(GC)=1.812+0.461=2.273

Convert to relative probability: 22.273=4.833

Relative probability is ratio of (+)to(-) P(+)=4.833 P(-)

Convert to percentage

P(+) + P(-) = 1 4.833
P(-) + P(-) = 1
P(-) = 1/5.833 = 17%

Conclusion

P(+)=83% P(-)=17%