Week 1: May 19, 2003 - May 23, 2003
- Monday, May 19, 2003
Initial meeting: Andrew and Yung-Pin.
- Tuesday, May 20, 2003
Andrew wrote Mathematica code that can read the decimal expansions
of real numbers and store them as arrays; Discrete Fourier analysis of the
decimal digits of Pi.
- Wednesday, May 21, 2003
Discussion on randomness; literature on the algorithm for calculating individual
hexadecimal digits of Pi; more on discrete Fourier analysis of the decimal
digits of Pi.
- Thursday, May 22, 2003
Andrew wrote simple Mathematica code that can search patterns and
count their occurrences in the decimal expansions of real numbers.
(Click here to see Andrew's program in Mathematica.)
- Friday, May 23, 2003
Andrew used histograms to see the distributions of varous spacing times between
the occurrences of particular patterns in the expansion (in any base) of a
real number.
Week 2: May 26, 2003 - May 30, 2003
- Monday, May 26, 2003
Memorial Day Holiday
- Tuesday, May 27, 2003
Nick started; Rogers Science Research Program lunch meeting in Olin Lounge.
- Wednesday, May 28, 2003
Nick wrote a C-program that can read biological sequences and count the frequencies
of the bases in the sequences.
- Thursday, May 29
Nick developed a C-program that can search patterns in biological sequences;
Nick computed the relative frequencies of the 64 codons in four different
DNA sequences.
- Friday, May 30
Nick looked at the scatter plots of the codon occurrences of two different
DNA sequences.
Week 3: June 2, 2003 - June 6, 2003
- Monday, June 2, 2003
Michal started.
Andrew began investigating tools of number theory including LLL and PSLQ algorithms
and also tried to understand how to read discrete Fourier spectra.
Nick and Michal wrote a C-program that computes maximum alignment scores between
two DNA sequences.
(Click here to see Michal's C-program dnasubstring.
The program asks for a length of string and the original string of DNA,
then it finds the segment of the string of that length that appears most often.)
- Tuesday, June 3, 2003
Rogers Science Research presentation "Peer-to-peer networking" by
Jens Mache, David Ely, Melanie Gilbert, Jason Gimba, Matt Wilkinson (CS) in
Olin 204.
Nick and Michal worked on a C-program that can count the occurrences of all
subsequences in a biological sequence.
(Click here to see Michal's C-program dnasubstringspecific.
The program asks for the original string of DNA and a specific substring that
it searches for, and then it counts the frequency it appears.)
Andrew began study on the complete genome of the cowpox virus, and tried to
collapse the 224,501 base sequence into a single generating function.
- Wendesday, June 4, 2003
Nick gave a tutorial on DNA sequences.
The team began looking at hidden Markov chains for modeling DNA sequences.
- Thursday, June 5, 2003
Nick looked at if the codon usage is correlated with context, position within
the gene, and amino acid content. (Can we use the frequencies of codon usage
as a measure of similarity? At the same time, we would like to take the codon
orders into account.)
Michal checked the hidden Markov chain model on a virus genome.
Andrew investigated and read more on ``1/f spectrum'' papers by Wentian Li.
- Friday, June 6, 2003
Michal looked at SARS virus and lipase.
Andrew reexamined the PSLQ algorithm.
Nick used C-programs to check the codon usage.
Week 4: June 9, 2003 - June 13, 2003
- Monday, June 9, 2003
Michal applied a program in C++ that recursively finds substrings that appear
minimum number of times with a specific length.
Nick computed and compared the codon usage frequencies in DNA sequences of
chick, cow, human, and mouse.
Andrew browsed the thesis paper "Fast Fourier Transform Analysis of DNA
Sequences" by Russel W. Hanson of Reed College.
- Tuesday, June 10, 2003
Rogers Science Research presentations by Louis Kuo, Angela Blum, Nick Tadros,
David Finigan (Chemistry) and Greg Hermann, Caroline Hieb, Elizabeth Kwan,
Lena Schroeder (Biology) in Olin 204.
Michal tried debugging a program in C.
Nick worked more on the bias of codon usage in DNA sequences.
Andrew looked at DNA sequences from the geometric and graph theoretic perspectives.
- Wednesday, June 11, 2003
The group met to discuss how to prepare for the presentation on July 1.
- Thursday, June 12, 2003
Andrew explored heuristic models of neucleotide-to-neucleotide relations.
Nick worked more on the codon usage and graphed a bar chart.
Michal tried to understand more on the hidden Markov model and its applications
to the DNA analysis.
- Friday, June 13, 2003
Nick and Andrew summarized the questions we would like to investigate: How
to align DNA sequences; How to assign scores to mismatches while comparing
two DNA sequences; How to estimate the mutation rate; How to locate a mutation
if it occurs.
Michal computed the number of occurrences of a particular substring in Tuberculosis.
Week 5: June 16, 2003 - June 20, 2003
- Monday, June 16, 2003
Michal read more on the hidden Markov model and its applications to the DNA
analysis.
Nick worked on algorithms of aligning biological sequences.
Andrew explored more on heuristic models (for example, a random walk model)
of neucleotide-to-neucleotide relations.
- Tuesday, June 17, 2003
Rogers Science Research presentations by two physics groups (led by Steve
Tufte and Tom Olson) in Olin 204.
The group met to discuss more about how to present on July 1.
Michal gave a brief tutorial on hidden Markov models to the group.
(It is a very important step to estimate the probabilities of tranisition
and emission in constructing a hidden Markov model.)
Nick explored on the single-double-triple base correlation in DNA sequences.
- Wednesday, June 11, 2003
Nick wrote a program that tries to find reference points for identifying mismatches
and insersions/deletions in DNA sequence alignments.
Andrew learned how to use the Linux computer system and worked on a C-program
that simulates a random walk.
Michal wrote and debugged (it now works) a program to find the highest score
of a string based on the collapsed state transition probabilities of a hidden
Markov model.
- Thursday, June 18, 2003
Andrew completed a DNA tetrahedral random walk program with Michal's help
and read the literature on chaos game.
(Click here to see a tetrahedra random walk for
the first 1000 base pairs of the cowpox virus genome; and click here to see
the random walk for the entire cowpox virus genome.)
Michal continued to make progress on the score algorithm by using a different
calculating method and taking the "emission probability" into account.
- Friday, June 19, 2003
Andrew considered length-scaled tetrahedral random walks for DNA sequences
and read the paper "A Mathematical Theory of Communication" by C.E.
Shannon.
Michal modified a program that remembers the path with the highest score and
then uses that information to update the parameters (state transition probabilities
and emission probabilities) of a hidden Markov model.
Week 6: June 23, 2003 - June 27, 2003
- Monday, June 23, 2003
Michal came up with a way of estimating the probability parameters of a hidden
Markov model without aligning sequences.
Andrew drew the tetrahedral random walks of the DNA sequences of eight different
genes.
- Tuesday, June 24, 2003
Rogers Science Research presentation by Tom Olson, Brett Tomlin, and Collin
Trail (Physics) on "Modeling and characterizing systems that exhibit
chaotic dynamics"
The group prepared for an outline of the presentation on July 1.
** Intro of hidden Markov models (HMM)
** The decimal expansion of Pi and other examples (related to HMM)
** Bio intro
** Biological HMMs: Are there patterns? Any information hidden? Emission and
transition probabilities of a HMM
** Construction of a HMM: Alignment problem; Using iteration; Does the iteration
ensure convergence? Use of an existing model
** Conclusion: Why is this significant?
- Wednesday, June 25, 2003
Michal worked on a C-program that iteratively gives the maximum likelihood
estimates of the probability parameters of a hidden Markov model without aligning
sequences.
Nick helped with building a hidden Markov chain. (What are reasonable transition
states? transition probabilities to initalize? emission probabilities to initialnize?)
Andrew relooked at the frequencies of single bases, double bases, and triple
bases of a DNA sequence. Andrew also considered a way for continuously smoothing
the path of a discrete tetrahedral random walk of a DNA sequence, together
with the changepoint detection along the path.
- Thursday, June 26, 2003
The group rediscussed the outline of the presentation on July 1.
Michal worked more on a C-program that iteratively gives the maximum likelihood
estimates of the probability parameters of a hidden Markov model.
- Friday, June 27, 2003
The group practiced for the presentation on July 1.
Andrew would start with an outline and compare Pi in base 4 to a DNA sequence
through tetrahedral random walks.
Nick would talk about DNA and introduce alignment: what you would see if you
had two sequences together - match, mismatch, indel.
Michal would introduce a hidden Markov model.
Nick then talked about alignment type, scoring method, and the dilemma.
Michal then gave a possible solution for the dilemma through HMMs and introduced
her future work.
Nick talked his future work.
Andrew then gave a nice way of visualizing alignments via tetrahedral random
walks and talked his future work.
Finally Andrew concluded the presentation.
Week 7: June 30, 2003 - July 4, 2003
- Monday, June 30, 2003
The group practiced for the presentation on July 1.
- Tuesday, July 1, 2003
Rogers Science Research presentation on "Stress" by Deborah Lycan,
Nina Jozanovic, and Alex Sundberg (Biochemistry).
The group presented "DNA sequence comparison and alignment" for
the Rogers Science Research in Olin 301. (See our presentation
slides here.)
Andrew explored on using MATLAB in the geology lab.
- Wednesday, July 2, 2003
Nick worked on algorithms for sequence alignment without using scores (for
example, by the largest common substring).
Michal continued working on the C-program that iteratively gives the maximum
likelihood estimates of the probability parameters of a hidden Markov model.
Dr. Debrorah Lycan talked to Andrew about some possible biological significance
of a changepoint along a DNA path, and Andrew talked to Dr. Jeff Ely about
a method, together with a C-program, for identifying changepoints.
- Thursday, July 3, 2003
Michal worked on how to recursively estimate the probability parameters of
a hidden Markov model.
Nick continued working on an algorithm for global sequence alignment without
using scores.
Andrew worked on how to identify changepoints in the path of a DNA tetrahedral
random walk.
- Friday, July 4, 2003
July 4th Independence holiday.
Week 8: July 7, 2003 - July 11, 2003
- Monday, July 7, 2003
Michal estimated the probability parameters of a hidden Markov model for three
sets of closely related DNA sequences.
Nick worked more on the global sequence alignment without using scores.
Andrew looked at the x-y-z projections of a DNA tetrahedral random walk for
detecting changepoints, and talked to Dr. Jeff Ely again about a C-code that
identifies "corners" of a curve.
- Tuesday, July 8, 2003
Rogers Science Research presentations "Tracking Granules in Hippocampal
Neurons" by Bethe Scalettar, Dmitri Gurkins, and Scooter Johnson (Physics)
and
"The molecular basis of long-term memory" by Janis Lochner, Alexis
Hansen, and Leah Honigman (Biochemistry).
Michal's numerical work showed that the probabilitiy parameters estimated
from "the best fit to a prescribed hidden Markov model" converges
biasedly to 1 or 0.
Andrew considered "DNA bouquets" for locally aligning DNA sequences.
Nick worked more on the global sequence alignment without using scores.
- Wednesday, July 9, 2003
Michal used the "median" method to estimate the probability parameters
of a hidden Markov model.
Nick compiled and finished a C-program that does the global sequence alignment
without using scores.
Andrew continued working on C-code to identify changepoints of a DNA tetrahedral
random walk.
- Thursday, July 10, 2003
Michal's "median" method showed convergence of the estimates of
the probability parameters of a hidden Markov model.
Nick chose the Glycogen phosphorylase family of eight DNA sequences for pairwise
global alignments.
Andrew considered normalizing the "vectors" between the changepoints
of a DNA tetrahedral random walk and represent them on the surface of a unit
sphere.
- Friday, July 11, 2003
Michal's tried the "median" method on two different DNA sequences
in different order to see whether the order of fitting to a prescribed hidden
Markov model affects the converging values of the estimates of the probability
parameters.
Nick continued working on the pairwise alignments of the Glycogen phosphorylase
family of eight DNA sequences.
Andrew pondered how to match the orders of the normalized vectors on the two
spheres.
Week 9: July 14, 2003 - July 18, 2003
- Monday, July 14, 2003
Michal tried the "median" method to repeat training a prescribed
model by multiple sequences.
Nick gave the estimates of probability parameters of a hidden Markov model
based on 28 pairwise alignments of the Glycogen phosphorylase family of eight
DNA sequences. (For example, the transition probability from a match to a
match is about 70%.)
Andrew produced random points on a unit sphere. (Click here to see the Mathematica
program.)
- Tuesday, July 15, 2003
Rogers Science Research presentations by Peter Drake, Heather Cook, and Amanda
Venghaus (Computer Science) and by Herschel Snodgrass, Brenna Gillman, and
John Worst (Physics).
Michal ran the "median" method on families of multiple close sequences
and observed consistent stablization of transition probabilities across different
families.
Nick gave more estimates of probability parameters of a hidden Markov model
based on pairwise alignments of various families of DNA sequences.
- Wednesday, July 16, 2003
Michal ran the "median" method on distantly related sequences and
it did not show the initial fluctuation while training sequences to the modified
model.
Nick considered how to "score" based on the estimates of probability
parameters of a hidden Markov model.
- Thursday, July 17, 2003
Nick gave the estimates of the probabilitiy parameters of a hidden Markov
model from aligning the human and fly Hox B4 gene sequences. Here is the table
of the result.
Andrew tried to use Mathematica to plot the path of a DNA random walk without
involving the do-loop or the for-loop. (Only table is used.)
- Friday, July 18, 2003
Nick gave more estimates of the probabilitiy parameters of a hidden Markov
model by pairwise-aligning more DNA sequences.
Andrew tried the Mathematica code with using table only to plot the path of
a huge DNA random walk.
Week 10, July 21, 2003 - July 25, 2003
- Monday, July 21, 2003
Nick modified the "largest common substring" alignment algorithm
(to allow a certain percentage of mismatch) to get a better global pairwise
alignment of DNA sequences.
Michal used the estimates of the probability parameters of a hidden Markov
model obtained by Nick to construct a scoring system.
Andrew rechecked the path of a DNA (cowpox genome) random walk produced by
a C-program and compared it with a Mathematica program.
- Tuesday, July 22, 2003
Rogers Science Research presentations by Kellar Autumn, Amanda Gassett, Shannon
McGonagle, and Nathan Schwab (Biology) and by Liz Safran, Jonny Armstrong,
and Jemile Erdem (Geological Science).
Nick and Michal worked on a scoring system based on the estimates of the probability
parameters of a hidden Markov model. (What is a reasonable penalty for an
insertion or a deletion?)
Andrew looked at more paths of various DNA random walks. (Many virus DNA sequences
show a long-range linear trend. Some don't. See the path of SARSbj04
DNA sequence.)
- Wednesday, July 23, 2003
Nick computed the penalty scores based on the estimates of the probability
parameters of a hidden Markov model. Here is an EXCEL table
of the results.
Michal and Nick drafted for their poster.
- Thursday, July 24, 2003
Andrew used matrix algebra to locate a largest common substring while aligning
two sequences.
Michal and Nick prepared for their poster.
- Friday, July 25, 2003
Michal and Nick prepared for their poster. (The last day of Michal's eight-week
research work.)
Andrew "smoothed" a binary sequence based on the runs. (For example,
101111100 will become 101232100.)
Week 11, July 28, 2003 - August 1, 2003
- Monday, July 28, 2003
Andrew put the cDNA on the top of its DNA sequence of Bacteriophage P4 to
examine any particular patterns of coding regions.
- Tuesday, July 29, 2003
Nick applied a scoring system obtained from a hidden Markov model to align
sequences to the hidden Markov model. (The alignment does not seem to be good,
and it may be caused by the fact that the penalty assigned to an InDel is
too low.)
Andrew used "moving average" to locate changepoints for DNA sequence
segmentation.
- Wednesday, May 28, 2003
Nick compared his sequence alignments (using the original scoring system obtained
from a hidden Markov model and a modified scoring system) with the alignment
via the Needleman-Wunsch algorithm.
Andrew continued more work on DNA sequence segmentation. (A threshold can
be used to locate where a spike occurs.)
- Thursday, May 29
Nick computed the scores of the alignments that were used to estimate the
probability parameters of a hidden Markov model and redo the alignments based
on the scoring system to see whether the alignments could be improved or the
alignments could be used to update the probability parameters of the hidden
Markov model.
Andrew realized the method of "averaging" is obtaining the local
average rates of change over the chosen windows and the spikes occur at the
places where the local rates of change are zero. See Andrew's Mathematica
program that makes segmentation of the Bacteriophage lambda genome.
- Friday, May 30
Andrew revised the algorithm that locates changepoints, and worked on the
poster.
Nick found that the global alignment seems to get worse while using the scoring
system obtained from the hidden Markov model.