%% HAEMOPHIILUS Example of sequence statistics and gene finding with MATLAB
% This demonstration looks at some statistics about the DNA content of the
% Haemophilus Influenzae. It shows also some algorithms for finding ORFs
% and for assessing their significance.
%%
% This demo belongs to the collection of Case Studies in Computational
% Genomics, mostly based on classic papers, and mostly based on the contents
% of the book
%
% Introduction to Computational Genomics, A Case Studies Approach
% Cambridge University Press, 2006
% Nello Cristianini and Matthew Hahn
%
% The demos of the other case studies, pointers to software and papers that
% are available on-line can be found on the website:
web('http://www.computational-genomics.net/');
% Demo by Elisa Ricci.
%% Introduction
% Haemophilus influenzae is a non-motile Gram-negative coccobacillus first
% described in 1892 by Dr. Robert Pfeiffer during an influenza pandemic.
% It is generally aerobic, but can grow as a facultative anaerobe. It is
% responsible for a wide range of clinical diseases.
% The Haemophilus influenzae genome was the first to be sequenced and
% assembled in a free-living organism. It contains about 1.8 million base
% pairs and is estimated to have 1,740 genes.
%% Accessing NCBI data from the MATLAB workspace
% The dna sequence can be obtained from the GenBank database with the
% accession number NC_000907. Using the *getgenbank* function with the
% 'SequenceOnly' flag just the nucleotide sequence is loaded into the
% MATLAB workspace.
Hflu = getgenbank('NC_000907','SequenceOnly',true);
%%
% The sequence can be also downloaded from the website www.computational-genomics.net
% Then you can load the data from a .mat file using the command
load Hflu
%%
% The MATLAB *whos* command gives information about the size of the
% sequence.
whos Hflu
%%
%The total length of the Haemophilus Influenzae genome is 1830138 bp.
%% Statistical sequence analysis
% You will examine some statistics properties of the sequence. You can look
% at the composition of the nucleotides with the *basecount* function.
bases=basecount(Hflu)
%%
% The four nucleotides are not used at equal frequency across the genome: A
% and T are much common than G and C.
%%
% Others symbols occur in the sequence. They correspond to positions in the
% sequence that are are not clearly one base or another and they are due,
% for example, to sequencing uncertainties. In this case for example:
% N: aNy base
% K: G or T
% R: A or G (puRine)
% Y: C or T (pYrimidine)
% M: A or C (aMino)
% S: C or G
% W: A or T
% Instead of grouping other simbols together, you can count them individually.
bases=basecount(Hflu,'Others','Full')
%%
% You can also calculate the frequency of each nucleotide.
[bases frequency]=basecountfrequency(Hflu,'Others','Full')
%%
% It shold be pointed out that these are on the 5'-3' strand of DNA, but
% we know exactly the frequency of all the the bases on the other strand
% because of the complementarity of the double helix. Anyway we could
% verify it using again the *basecountfrequency* function.
[bases frequency]=basecountfrequency(seqrcomplement(Hflu),'Others','Full')
%% CG content
% The local fluctuations in the frequencies of nucleotides provide
% different interesting information. The local base composition by a sliding
% window of variable size can be measured. In the following the window
% size is assumed 90000 bp and 20000 bp respectively.
ntdensity1(Hflu,'window',90000)
%%
ntdensity1(Hflu,'window',20000)
%%
% The phenomenon called horizontal gene transfer can be observed in the
% second plot: the H. influenzae genome contains a 30Kb region of unusual
% GC content from 1.56Mb to 1.59Mb.
%% k-mer frequency and motif bias
% Now look at the dimers in the sequence and display the 2-mer frequencies
% in the table matrix. The simbols others than A, C, G, T are not
% considered.
[dimers,matrix] = dimercount(Hflu)
%%
% You can also plot the frequency of certain di-nucleotides of interest:
% for example considering the local fluctuations in the frequencies
% of dimers AT and CG.
density = dndensity1(Hflu)
%%
% As a further analysis you can also compare the observed frequency of each
% dimer with the one expected under the multinomial model. This is called
% genome signature.
gs=gensign(Hflu)
%%
% You can look at codons using *codoncount*. That function counts the
% codons on a particular reading frame. With no options, the function shows
% the codon counts on the first reading frame.
codoncount(Hflu)
%% Exploring the Open Reading Frames (ORFs)
% In a nucleotide sequence an obvious thing to look for is if there are any
% open reading frames. The function *seqorfs* can be used to determine the
% ORFs in a sequence.
orf=seqorfs(Hflu);
%%
% The variable orf is a structure with information about the start/stop
% positions of each ORFs, its length and which reading frame it is in.
% Here the minimun number of codons for an ORF to be considered valid is 10
% (default value).
%%
% The minimun number of codons for an ORF to be considered valid can be
% also set. Here a threshold of 70 and 80 amino acids are considered
% leading to about 2175 and 1966 ORFs respectively.
[orf n_70]=seqorfs(Hflu,'MINIMUMLENGTH',70);
n_70
%%
[orf n_80]=seqorfs(Hflu,'MINIMUMLENGTH',80);
n_80
%%
% You can also look for the total number of ORFs
[orf n]=seqorfs(Hflu,'MINIMUMLENGTH',1);
n
%% Detecting significant ORFs
% The classic approach to decide whether an ORF is a good candidate as a
% gene is to calculate the probability of seeing an ORF of a certain length
% L in a random sequence. To test the significance of ORFs a
% single-nucleotide permutation test can be used.
[orf1 n1]=seqorfs(Hflu(randperm(length(Hflu))),'MINIMUMLENGTH',1);
n1
%%
% The ORF Length of the original and of the randomized sequences can be
% compared.
ORFLength=[];
for i=1:6
ORFLength=[ORFLength; orf(i).Length'];
end
ORFLength1=[];
for i=1:6
ORFLength1=[ORFLength1; orf1(i).Length'];
end
%%
% A threshold must be chosen in order to keep as candidate genes those ORFs
% in the H. Influenzae genome that are longer than most (or all) the ORFs
% in the random sequence.
max_threshold=max(ORFLength1)
n_max=length(find(ORFLength>=max_threshold))
%%
% You can also use a more tolerant threshold in order to keep all ORFs of
% length equal to or greater than the top 1% of random ORFs.
empirical_threshold=prctile(ORFLength1,99)
n_emp=length(find(ORFLength>=empirical_threshold))