Atish Agarwala's blog
Blog about science and math.
https://atishagarwala.github.io/
Tue, 08 Aug 2017 21:11:50 +0000

Null model for gap finding algorithm

<p>Today’s post will be about a simple null model for the distribution of “gaps” in coverage when metagenomic reads are mapped
against a reference genome. We can use this null model to try to detect places in the genome where a lack of coverage is real
and due to that genetic element not being present (at least not at high abundance) in the gene pool.</p>
<h2 id="model-definition">Model definition</h2>
<p>Suppose we have a genome of length <script type="math/tex">L</script>. Suppose we have a sample of <script type="math/tex">R</script> reads, each of length <script type="math/tex">\ell</script>, which map
onto the genome. Further suppose that the read start positions are distributed independently and uniformly at random along the genome. We then define a <em>gap</em> as a
contiguous region on the genome that has no coverage - that is, a consecutive string of nucleotides which have no
reads mapping to them. Under this model, what is the empirical distribution <script type="math/tex">\hat{\rho}(g)</script>, where <script type="math/tex">g</script> is the size of a
gap?</p>
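This null model is easy to simulate directly. Here is a minimal sketch (all parameter values and function names are my own, purely illustrative) that drops reads uniformly at random and extracts the maximal uncovered gaps:

```python
import random

def simulate_gaps(L, R, ell, seed=0):
    """Drop R reads of length ell uniformly at random on a genome of
    length L; return the lengths of all maximal uncovered gaps that
    have coverage on their left flank."""
    rng = random.Random(seed)
    covered = [False] * L
    for _ in range(R):
        start = rng.randrange(L - ell + 1)
        for i in range(start, start + ell):
            covered[i] = True
    gaps, run, seen_covered = [], 0, False
    for c in covered:
        if c:
            if run and seen_covered:  # count only gaps with a left flank
                gaps.append(run)
            run, seen_covered = 0, True
        else:
            run += 1
    return gaps  # a trailing gap at the genome end is not counted

gap_lengths = simulate_gaps(L=50_000, R=1_000, ell=100)
```

Histogramming `gap_lengths` over many seeds gives a Monte Carlo estimate of the empirical distribution to compare against the expectations computed below.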
<h2 id="computing-expectations">Computing expectations</h2>
<p>As posed, we want to find a distribution of distributions - the probability distribution over empirical distributions.
In general, this is a complicated problem. However, we can show that
computing <script type="math/tex">n(g)\equiv{\rm E}[\hat{\rho}(g)]</script>, the expected number of gaps of size <script type="math/tex">g</script>, is a far more tractable task.</p>
<p>There are <script type="math/tex">L-g</script> potential gaps of size exactly <script type="math/tex">g</script>. Suppose that <script type="math/tex">g\ll L</script>, so this number is well approximated
by <script type="math/tex">L</script>. (We will later see that this is a reasonable assumption for large <script type="math/tex">L</script>.) If we can compute the probability
that any one of these regions is a maximal gap, linearity of expectation gives <script type="math/tex">n(g)</script> exactly, even though the events for different regions are not independent.</p>
<p>For a region of length <script type="math/tex">g</script> to be a gap, and not be contained in any other gap, there must be no reads
in the block, and at least one read flanking each side of the gap. We can compute the probability
of such an event directly. We label the positions in the gap by the integers <script type="math/tex">0</script> to <script type="math/tex">g-1</script>, inclusive.
We adopt the convention that a read is labeled by the position of its leftmost nucleotide. Our condition
then amounts to:</p>
<ul>
<li>No reads which start from <script type="math/tex">-(\ell-1)</script> to <script type="math/tex">g-1</script> inclusive.</li>
<li>At least one read which starts at <script type="math/tex">-\ell</script> and at least one which starts at <script type="math/tex">g</script>.</li>
</ul>
<p>We start with the computation of the probability of the first event. There are <script type="math/tex">g+\ell-1</script> positions
which can’t have a read start; this gives us</p>
<script type="math/tex; mode=display">P(\text{no gap reads}) = \left(1-\frac{g+\ell-1}{L}\right)^{R}</script>
<p>If <script type="math/tex">g+\ell-1\ll L</script>, we have</p>
<script type="math/tex; mode=display">P(\text{no gap reads}) \approx e^{-(g+\ell-1)R/L}</script>
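A quick numerical sanity check of this approximation, with made-up parameter values:

```python
import math

# Illustrative parameters: 1 Mb genome, 10,000 reads of length 100,
# candidate gap of size 500, so g + ell - 1 = 599 << L.
L, R, ell, g = 1_000_000, 10_000, 100, 500

forbidden = g + ell - 1                 # start positions that would cover the gap
exact = (1 - forbidden / L) ** R        # exact P(no gap reads)
approx = math.exp(-forbidden * R / L)   # exponential approximation

rel_err = abs(exact - approx) / approx  # well under 1% here
```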
<p>We now need to calculate the probability that there is at least one read flanking either side of the
gap, given that there are no reads in the gap. Let <script type="math/tex">F_{1}</script> and <script type="math/tex">F_{2}</script> be the events that there
are reads flanking the left and right sides of the gap, respectively. We have
<script type="math/tex; mode=display">P(\text{flank}|\text{no gap reads}) = 1-P(\text{no flank}|\text{no gap reads})</script>
<p>which we can expand by inclusion-exclusion in terms of the individual flanking probabilities:</p>
<script type="math/tex; mode=display">P(\text{flank}|\text{no gap reads}) = 1-P(\text{no flank 1}|\text{no gap reads})-P(\text{no flank 2}|\text{no gap reads})+P(\text{no flank 1, 2}|\text{no gap reads})</script>
<p>Conditioned on no reads starting in the forbidden region, each read starts uniformly among the remaining <script type="math/tex">L-g-\ell+1</script> positions, so</p>
<script type="math/tex; mode=display">P(\text{flank}|\text{no gap reads}) = 1-2\left(\frac{L-g-\ell}{L-g-\ell+1}\right)^{R} +\left(\frac{L-g-\ell-1}{L-g-\ell+1}\right)^{R}</script>
<p>We can rewrite and use the large <script type="math/tex">L</script> approximation again:</p>
<script type="math/tex; mode=display">P(\text{flanking}|\text{no gap reads}) = 1-2\left(1-\frac{1}{L}\right)^{R} +\left(1-\frac{2}{L}\right)^{R} \approx (1-e^{-\frac{R}{L}})^{2}</script>
<p>We can factor out the genome length <script type="math/tex">L</script> by defining the per nucleotide coverage <script type="math/tex">C\equiv \frac{\ell R}{L}</script>.
The expected number of gaps <script type="math/tex">n(g)</script> is given by</p>
<script type="math/tex; mode=display">n(g) = L P(\text{no gap reads})P(\text{flanking}|\text{no gap reads}) \approx L(1-e^{-C/\ell})^{2}e^{-Cg/\ell}e^{-C(1-1/\ell)}</script>
<p>If <script type="math/tex">\ell\gg1</script> as well, we have</p>
<script type="math/tex; mode=display">n(g) \approx L(1-e^{-C/\ell})^{2}e^{-C}e^{-Cg/\ell}</script>
<p>We have a roughly exponential distribution, with scale <script type="math/tex">\ell/C</script> and total number of gaps given by
<script type="math/tex">\frac{\ell L}{C}(1-e^{-C/\ell})^{2}e^{-C}</script>. The “typical” largest gap approximately occurs when
<script type="math/tex">\sum_{g = g_{max}}^{\infty}n(g) = 1</script>, which occurs at</p>
<script type="math/tex; mode=display">g_{max} = \frac{\ell}{C}\ln(\ell L/C(1-e^{-C/\ell})^{2}e^{-C}) = \frac{\ell}{C}\ln( L^{2}R^{-1}(1-e^{-C/\ell})^{2}e^{-C})</script>
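The closed-form expressions above can be bundled into a small helper. A sketch with illustrative parameters (the function names are mine, not from any library):

```python
import math

def expected_gaps(g, L, R, ell):
    """n(g): expected number of maximal gaps of length exactly g,
    using the large-L, large-ell form derived above."""
    C = ell * R / L
    return L * (1 - math.exp(-C / ell)) ** 2 * math.exp(-C) * math.exp(-C * g / ell)

def typical_largest_gap(L, R, ell):
    """g_max solving sum_{g >= g_max} n(g) = 1."""
    C = ell * R / L
    total = (ell * L / C) * (1 - math.exp(-C / ell)) ** 2 * math.exp(-C)
    return (ell / C) * math.log(total)

# Illustrative numbers giving coverage C = ell * R / L = 1
gmax = typical_largest_gap(L=1_000_000, R=10_000, ell=100)
```

With these numbers the typical largest gap comes out to roughly 820 nucleotides.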
<p>For low coverage (<script type="math/tex">C\ll\ell</script>), the total number of gaps is roughly <script type="math/tex">\frac{CL}{\ell}e^{-C} = Re^{-C}</script> and
<script type="math/tex">g_{max}\approx \frac{\ell}{C}\ln(L^{2}C^{2}/R\ell^{2}e^{-C})</script>, or <script type="math/tex">g_{max}\approx \frac{\ell}{C}\ln(Re^{-C})</script>.
This makes sense; as <script type="math/tex">C</script> goes to zero there are <script type="math/tex">R</script> total gaps, which corresponds to the space between adjacent
reads, which are all non-intersecting.</p>
<p>The cumulative distribution functions are often useful, as they transform nicely
under rescalings. Additionally, if we are using the null distribution to filter out spurious gaps, we can
use the CDF to define a cutoff. Let <script type="math/tex">N(g)</script> be the number of gaps with length greater than <script type="math/tex">g</script>. We have</p>
<script type="math/tex; mode=display">N(g) = \frac{\ell L}{C}(1-e^{-C/\ell})^{2}e^{-C} e^{-Cg/\ell}</script>
<p>or, in the low coverage case,</p>
<script type="math/tex; mode=display">N(g) = Re^{-C} e^{-Cg/\ell}</script>
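One way to use this in practice: pick a tolerance <script type="math/tex">\alpha</script> and call only gaps longer than the length where <script type="math/tex">N(g)</script> drops below <script type="math/tex">\alpha</script>, since such gaps are unlikely under the null. A sketch using the low-coverage form (the function and parameter values are illustrative assumptions):

```python
import math

def gap_cutoff(L, R, ell, alpha=0.05):
    """Smallest g with N(g) = R e^{-C} e^{-C g / ell} < alpha.
    Gaps longer than this are candidates for genuinely missing sequence."""
    C = ell * R / L
    # Solve R e^{-C} e^{-C g / ell} = alpha for g.
    return (ell / C) * math.log(R * math.exp(-C) / alpha)

# Low coverage example: C = 0.2
cutoff = gap_cutoff(L=1_000_000, R=2_000, ell=100, alpha=0.05)
```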
<h2 id="model-validity-and-relaxations">Model validity and relaxations</h2>
<p>The expectation function <script type="math/tex">n(g)</script> will be a good approximation of the empirical distribution when the genome is long
compared to both the read lengths and the typical gap lengths. In this case, we can think of dividing the genome into subblocks
which are each statistically independent in terms of gap distributions. This is true up to the block-spanning
gaps, which may contribute to the tail of the distribution. One can compare the tails of the blocked and full
expectation distributions to try to understand when there are many such roughly independent blocks, and therefore
when the empirical distribution is close to the expectation. Furthermore, the fact that the distribution does
not have heavy tails suggests that the expectation function will characterize the empirical distribution well.</p>
<p>This model makes some strong assumptions. First, it assumes a fixed read length and uniform coverage across the genome. Both of
these assumptions are expected to fail: read lengths will have some distribution (though this distribution will be quite narrow for Illumina reads),
and more importantly there will be regions of the genome which naturally have higher or lower coverage. One easy
way to extend the model would be to explore the effects of different distributions of coverage and read sizes. A simple
approach would be to take these distributions from a dataset of interest to see if the data can inform the model.</p>
<h2 id="using-the-model-to-fit-data">Using the model to fit data</h2>
<p>Naively fitting the model to data leads to issues; in particular, there are global coverage biases due to GC content etc.,
and often one can find parts of the genome with extremely high coverage (eg in locations which are more conserved across
different types). Coverage issues are further complicated by the possible presence of multiple closely related types in the
population. Therefore, figuring out the appropriate model parameters can be non-trivial, especially in the sparse coverage
regime. For example, the coverage <script type="math/tex">C</script> can be biased by high coverage areas and lead to too many gaps being called.
It is useful then to think about the parameters and figure out which ones may be more or less sensitive to
these issues.</p>
<p>There are three free parameters in the model:</p>
<ul>
<li><script type="math/tex">L</script>, the genome length.</li>
<li><script type="math/tex">\ell</script>, the typical read length.</li>
<li><script type="math/tex">R</script>, the number of reads.</li>
</ul>
<p>Note that the coverage, which shows up in many of the expressions as a useful quantity, is a composite involving all three.</p>
<p>In most cases, <script type="math/tex">L</script> is known and fixed. The question then comes down to <script type="math/tex">\ell</script> and <script type="math/tex">R</script>. Given some estimate of the coverage
<script type="math/tex">C</script> (an easily computable average), how should we estimate <script type="math/tex">\ell</script> and <script type="math/tex">R</script>?</p>
<p>With Illumina short reads, the distribution of <script type="math/tex">\ell</script> (or really, the mapped <script type="math/tex">\ell</script>) tends to be relatively narrow.
Therefore, if we revise our estimate of <script type="math/tex">C</script> via filtering or some other method, we should treat the change as updating
the estimate of <script type="math/tex">R</script> rather than <script type="math/tex">\ell</script>. Otherwise we can run into issues where lowering the coverage raises the
total number of gaps, even in the low coverage regime.</p>
Fri, 04 Aug 2017 00:00:00 +0000
https://atishagarwala.github.io/2017/08/04/Null-model-coverage.html
Tags: kitp, QBio2017, eco-evo

KITP Course week 2

<p>Had a great Sunday, biked to Santa Barbara with Michael and a few others from the course. Got some delicious tacos, and some
nice ice cream too. Ended with barbeque and ping pong. Not a bad day off.</p>
<p>Second week is about to start. We’ll do some DNA extraction and sequencing later in the week, but until then I’ll look through
the data. Thought I’d use this space to try to plan out what I want to do. I tend to feel a bit overwhelmed when first getting
a new dataset, not knowing what to do first. Feel additionally stressed about this one, because we have a time constraint for
the course, and I think I’m feeling a little more pressure/competitiveness to do something good. We’ll see how that turns out.</p>
<p>Overall the goal for this week is to get a bit more familiar with the data.</p>
<h2 id="visualizing-the-data">Visualizing the data</h2>
<p>Guilhem, Paul Rainey’s grad student, had put together some nice visualizations of some of the sequencing data. I think my first
goal is to recreate/extend that. What I want to do is make a script that does the following, given a particular dataset as
input:</p>
<ul>
<li>Gives me the overall average coverage and the distribution of coverage per contig/over the genome.</li>
<li>Histograms of mapping accuracy at different positions. One would think that we should get >90% accuracy of a match
if a metagenome read really came from the same type; the errors should be dominated by the nucleotide errors of the
original genome. We can use the timepoint where things came from to try to judge this.</li>
<li>Coverage cumulative distributions per nucleotide, averaged over a small window. Do this with different cutoffs (probably
some kind of logarithmic scale) in order to get a sense of what the reads at different frequencies look like.</li>
</ul>
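For the windowed coverage distributions in the last item, a minimal sketch of the computation, with a made-up toy coverage array (a real script would read per-base coverage from the mapped data, e.g. samtools depth output):

```python
def windowed_coverage_cdf(coverage, window=100):
    """Average per-nucleotide coverage in non-overlapping windows,
    then return (level, fraction of windows <= level) pairs."""
    means = sorted(
        sum(coverage[i:i + window]) / window
        for i in range(0, len(coverage) - window + 1, window)
    )
    n = len(means)
    return [(m, (k + 1) / n) for k, m in enumerate(means)]

# Toy example: 300 bp with a coverage dip in the middle window
coverage = [10] * 100 + [0] * 100 + [10] * 100
cdf = windowed_coverage_cdf(coverage, window=100)
```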
<p>In addition, if I have time I may want to make a data exploration tool which lets me load a file, put in a contig number
and base pair range, and shows me a variety of these stats in that given window. We’ll see how long that would
take to code/how useful that would be.</p>
<h2 id="evolutionary-scenarios">Evolutionary scenarios</h2>
<p>By looking at the data with the above tools, we can try to falsify the various evolutionary scenarios which
could have occurred:</p>
<ul>
<li><strong>Individual was present at early times, didn’t fix until later ones.</strong> Here horizontal gene transfer within
the bottle may have contributed to structural variation, but that/initial diversity dominated rather than
transfer between bottles. Can check to see if gaps get filled in other bottles, if gaps get filled in the
vertical treatment, and how correlated gaps are in different bottles.</li>
<li><strong>Some individual(s) got transferred in from another bottle.</strong> Maybe some whole bacteria were transferred;
they may be adaptive in another condition, so got to high enough frequency to be transferred, and then got transferred
wholesale to the new bottle. Would look for some conditions where all the gaps are filled, as well as correlation between
gaps coming up.</li>
<li><strong>Horizontal gene transfer of specific elements.</strong> The golden goose. Would want to show that some gaps get filled in,
and that gaps don’t come at all from vertical conditions. Also need to find the junctions and show that they differ bottle to
bottle.</li>
</ul>
<h2 id="later-analyses">Later analyses</h2>
<p>Analyses I want to get to later in the week once I have a better sense of the data include:</p>
<p><strong>Gap analysis.</strong> I want to try and come up with some way of identifying gaps, and classifying them. I need to use the
coverage
to compute a basic null expectation for how large gaps can be at random, as well as use the reads mapping on to their same
timepoint to create two methods of finding the lengths of gaps which may be real. Can also use time course correlation
as well. In addition I need to write a simple windowed average to actually find the ends of the gaps (all tunable).</p>
<p><strong>Low frequency matches.</strong> Allowing for lower frequency matches may start filling in the gaps; maybe this could be used to
try and figure out if, at low coverage/earlier timepoints when gaps are somewhat speckled, the filling is real or not.</p>
<p><strong>Guessing at the evolutionary history of gaps.</strong> Using the above and the time course, can we say something about the origin
of/evolutionary history of the gaps? This would be the ultimate goal.</p>
<p><strong>Find reads which span the junctions.</strong> This would be good to see if, for example, when the gap gets filled in in other
bottles whether or not it actually is incorporated in the same place, or if there are differences.</p>
Sun, 30 Jul 2017 00:00:00 +0000
https://atishagarwala.github.io/2017/07/30/KITP-week-2.html
Tags: kitp, QBio2017, eco-evo

Thoughts on theoretical ecology

<p>Short post today. Started work on the project, more on that when we actually get things done. Will just talk about some
thoughts on the talks/discussions from today.</p>
<p>Otto Cordero gave a talk today on genome evolution. He presented some evidence that genome size is shrinking in the absence of
any selection for bacteria, and that the number of genes scales linearly with genome size in bacteria. The claim was that this disproves
the neutral theory of gene duplication; I still find the fetish for neutral theories to be a bit odd.
He also presented his vision for
clustering/grouping things in bacterial systems: trying to find the genetic resolution at which there are individuals who
are separated because they recombine homologously with each other much more than with others (in a way where there is some
separation of scales). It’s an attractive thought, especially because it gives a definition which seems more
functional/biological in nature.</p>
<p>Seppe Kuehn then talked a bit about his work with the 3 species closed ecosystem. Was pretty interesting, not sure what to
make of it though. They tried to analyze “ecomodes” of dynamics (aka just doing PCA on replicate trajectories). Ended up with
some modes which were kind of coherent, but with 3 species kind of hard to know what to make of it. I feel like you want to
be able to do something better than PCA, since PCA depends on the numbers of types you have and how you’ve grouped them.
Not sure about what an alternative is, but things to think about.</p>
<p>We then had a students-only session with Otto in the evening, where he talked about some work looking at communities on
granules from a bioreactor. They claimed that for the most abundant types, there was little to no granule-to-granule overlap in
what was there. This suggested that the species level classification was not useful, and/or that to understand ecology
maybe we need to understand genes/functions, or bacteria as pathways. This in general got me thinking about the overall
question of what ecological models should look like. Maybe for more realistic/useful/interesting ecological models, we should
try to think about things in terms of metabolic pathways? Not sure how this is changed by the fact that predation
type interactions seem to allow for new pathways (or if that gets incorporated into the framework).</p>
<p>I asked a question about whether or not one would need to consider a large number of types to understand any ecological
system. It seems most of the systems presented about here have a few key players (or that’s what the belief is). He
suggested that low abundance guys may be important, but that if something like metabolic pathways are explanatory
for microbial ecology, the number of things to consider would be set by that rather than the actual
number of types in the population.</p>
<p>Overall I think I like Otto’s work/philosophy. I think he has an interesting and useful viewpoint on many of the questions in
biology, especially with regards to ecology. The quantitative nature of his work sometimes feels incomplete/suspect/not well
presented, but the experiments/thoughts are usually sound enough that I can usually get something out of it. Very glad
that he came out to KITP, and looking forward to his research talk on Monday.</p>
<p>I also had a nice chat with Matti today about how to analyze/think about ecological time series data. I’m still not sure
of the answer, but I proposed 3 things that someone might want to do with such data:</p>
<ol>
<li><strong>Prediction/curve fitting.</strong> Given one half of a time series of species/type/gene abundances, predict the second half.
This would then suggest that something about the ecological data has been captured by a hopefully low dimensional model.</li>
<li><strong>Extrapolation.</strong> What would have happened if more of one type or another were present? What is the long term fate
of an ecosystem? Being able to do this would imply some general understanding of the system that is encoded in some useful
way.</li>
<li><strong>Extraction of interactions.</strong> The word interactions is used loosely here; really, this is any scenario where one can
use model parameters to say something about biological processes. This could be as coarse as inferring a Lotka-Volterra
like predator prey interaction, or a model which spits out something about metabolic rates.</li>
</ol>
<p>Still not really sure how to do any of these, but food for thought. Thinking about time series analysis and
thinking about what sorts of ecology models (toy or otherwise) one should be thinking about/playing around with
are the two overall questions which interest me in theoretical ecology. I hope I can get more of a sense of
the latter especially while I’m here; it would be great to come up with ideas for interaction schemes that “look like”
trophic levels/metabolic networks and have the right flavor about them, in order to start asking questions.</p>
<p>I’ll end with a quote today, paraphrased from Boris Shraiman: <em>“Make sure your models are falsifiable, not just solutions
looking for problems.”</em></p>
Thu, 27 Jul 2017 00:00:00 +0000
https://atishagarwala.github.io/2017/07/27/Thoughts-on-ecology-Cordero.html
Tags: kitp, QBio2017, eco-evo

Paper project brainstorming

<p>Had our first meeting with the “Paper project” group, looking into the ecology of a community degrading cellulose (with
possible induced HGT). Right now we’re trying to figure out what we want to do for a project. Figured I’d write down some
ideas I’ve had here, in order to organize thoughts.</p>
<h2 id="data-and-experiments">Data and experiments</h2>
<p>The data we have are:</p>
<ul>
<li>Metagenomic short reads from T0 (before splitting into vertical and horizontal treatments), T1-4, T10, T16, and T24.</li>
<li>20 whole genomes from T24 and from T16 (one for each flask). One set of genomes is from pseudomonas; the others are from
random individuals.</li>
</ul>
<p>In addition, we have samples from T0, T16, and T24. We have the capability to do the following experiments:</p>
<ul>
<li>Cellulose degradation assays.</li>
<li>Plating/colony growth</li>
<li>Illumina sequencing</li>
<li>24 nanopore sequence runs</li>
</ul>
<p>The latter will ostensibly be used to assemble some more whole genomes.</p>
<p>We need to figure out what to do with the data/experiments. The main thing that needs to be decided on as a group is the whole
genome sequencing; the rest may require some coordination/working together, but if people have different ideas/directions they
want to go in those all can probably be pursued.</p>
<h2 id="specific-questions">Specific questions</h2>
<p>Some scientific questions I think would be fun to answer are:</p>
<h3 id="what-is-the-rate-of-recombination-in-the-population">What is the rate of recombination in the population?</h3>
<p>Could we use the recombination sites inferred by the gaps
(and maybe checked by other methods) to try to estimate the level of recombination in the population? This may require
either additional genomes at the final timepoint, and/or access to the ancestral strain. At the very least,
recombination induced by the “phage juice” could be estimated. Here, in order to not get tripped up by
amplification by selection/hitchhiking, we may want to find “neutral” recombinations (either by looking in specific
genes, or by finding recombination events which can be traced to the original flask). One big confounding factor
of course will be the composition of the phage juice.</p>
<h3 id="can-we-find-additional-evidence-that-the-gaps-are-due-to-recombination">Can we find additional evidence that the gaps are due to recombination?</h3>
<p>Would like to try to find breakpoints, to rule
out the possibility that adaptive bacteria are the only ones who got transferred. Could also see if putative phage
sequences are in the recombining gaps.</p>
<h3 id="what-fraction-of-recombination-events-are-homologous">What fraction of recombination events are homologous?</h3>
<p>Getting quantitative estimates of this could be cool. One thought I had was to find the timepoint before
the recombination event was thought to have occurred, and then try to match things into the gap allowing more
and more error. Then you’ll have a curve of coverage vs. allowable error, which will be increasing; if the derivative
of this curve is high at a low error rate, then perhaps homologous recombination has occurred. As a control, one could do
the same for some region which is spanned.</p>
<h3 id="what-fraction-of-recombination-events-are-related-to-selection">What fraction of recombination events are related to selection?</h3>
<p>If there is a recombination event which happened in the first four timepoints, maybe we can see the selective
effects of these recombinants. Could give some indication of the evolutionary value of the recombinations, and
of how much the observed recombination is enriched by selection.</p>
<h3 id="what-does-the-diversity-at-finer-scales-look-like-in-the-flaskbetween-flasks">What does the diversity at finer scales look like in the flask/between flasks?</h3>
<p>Would be interesting to look at
subspecies abundances, to see if there is a lot more diversity there as well. Could take 16s or something and try to use
a minimal clustering/cluster free approach. Would need some checks here like distance from large types, and correlation
across timepoints.</p>
<h3 id="what-does-the-ecological-dynamics-look-like">What do the ecological dynamics look like?</h3>
<p>Once we have something that looks like abundances of types, we could try to look at the dynamics. Could do so on multiple
scales of clustering, to see how much substructure there is in the population. Things to look at would include correlations
between relative abundances.</p>
<p>We could also possibly try to indirectly figure out how the gene content of different types has changed by the end of the
evolution. We could take the first few timepoints and see what reads that map to genes are correlated with what OTUs.
If there are some OTUs which have the same ratio of hits to a gene consistently over a few timepoints, then we can say
that the gene is probably only present in that OTU.</p>
<p>Some technical challenges which will have to be overcome for some of these questions:</p>
<h3 id="how-can-we-estimate-relative-abundances-from-the-metagenomics-data">How can we estimate relative abundances from the metagenomics data?</h3>
<p>We need to figure out to what extent abundances in the short reads actually correspond to abundances of types. Would be
good if we can figure out checks on this - maybe by looking at the distribution of coverage with respect to the reference
genome? Could also look at specific regions which are thought to be either highly variable or not very variable.</p>
<h3 id="how-can-we-error-correct-16s-etc-to-low-distances">How can we error correct 16s etc to low distances?</h3>
<p>Could be worth coding up a much simpler DADA/something like what Tikhonov did in his paper.</p>
<h3 id="what-is-the-right-null-model-for-gapsthe-distribution-of-gaps">What is the right null model for gaps/the distribution of gaps?</h3>
<p>Two thoughts here: one is to just think through the combinatorics. The other would be to assume that any gaps in the vertical
transmission treatment are spurious, and to look at their distribution.</p>
<h2 id="thoughts-about-the-sequencing">Thoughts about the sequencing</h2>
<p>There is also the question of what should be whole genome sequenced. Paul suggested one approach would be to look at the
early times and try to track down the ancestors of some of the later time guys. Rachel suggested just sequencing random guys
at late times would be good just to see what’s in the population.</p>
<p>I think it would be good to spend some (maybe half?) of our sequences on the late timepoint, to get a sense of the
diversity of types/shared or unshared nature of gaps which may indicate horizontal gene transfer. For this reason I think
going for the pseudomonas in particular would be good, just to reduce the search space. Perhaps at the same time,
we can spend some Illumina sequencing to try and see if there are types which look like the ancestors of sequenced guys.
If so, then we can try to sequence them; otherwise, just sequence more guys at late timepoints.
Also, I think we should sequence lines which contributed a lot to horizontal gene transfer (as judged by the gaps).</p>
<h2 id="proposed-plan-of-attack">Proposed plan of attack</h2>
<p>For the questions I’m interested in, I think what I’d want to do would be:</p>
<ul>
<li>Try to see if there is breakpoint signal.</li>
<li>Try to understand how read abundance relates to actual abundance.</li>
<li>See if my method of finding homologous recombination would make any sense.</li>
<li>Try to get out the (corrected) 16S abundances.</li>
<li>Track 16S dynamics over time.</li>
<li>See if any genes correlate with specific 16S subtypes.</li>
<li>Look at late time whole genomes, find recombinations which occurred early, and try to estimate their fitnesses.</li>
</ul>
<p>Probably more than 2 weeks of work, especially if I’m just learning the tools, but I think if I get started, I’ll be
able to complete at least one of the directions. At any rate excited for the next 1.5 weeks!</p>
Wed, 26 Jul 2017 00:00:00 +0000
https://atishagarwala.github.io/2017/07/26/Paper-project-thoughts.html
Tags: kitp, QBio2017, eco-evo, paperproject

Project assignments and random thoughts

<p>Another day of talks, and we got our project assignments. Also I pipetted for the first time!</p>
<h2 id="talks">Talks</h2>
<p>Bruce Levine gave a really nice talk about various topics in microbiology. He went through the history of things like
the nature of mutations, bacteriophage, and more, and gave his thoughts on interesting directions (spatial structure
and phage interaction). Favorite quote: “Evolution is about dN/dt and not dN/dS.”</p>
<p>Erdal Toprak talked about his work with the morbidostat and evolution of antibiotic resistance. Cool dataset, but I was
left a bit unsatisfied. He presented fitness landscapes based on minimum inhibitory concentration, but I would have
really liked to see some estimation of what the selective effect for each mutation was when it arose. Something similar
to this was shown with some work estimating the yield of the reaction catalyzed by the resistance encoding proteins, would
have been cool to see if that changed the predictability of evolution.</p>
<p>Also as always there was the question of how to talk about epistasis. Epistasis was described here again with pairwise
interactions, with the magnitude/sign epistasis language which I really abhor. I guess that sort of description is not
too bad when the number of mutations is small, but even here with 5 mutations higher order epistasis definitely mattered.
Wonder how best to talk about epistasis with a small number of mutations - I still think the pairwise epistasis inspired
language is not great, but maybe the way I talk about things is a bit too broad.</p>
<p>It was interesting to think about/see
a scenario where under strong selective pressure only a few mutations seemed to matter. I wonder what it’s like during an
actual infection where there are more cells/the selective pressure (antibiotic concentration) has a different temporal
profile. Also, apparently a lot of the protocols used for administering antibiotics seem somewhat arbitrary - someone
set them up at some point, they seem to work ok, but haven’t really been tested.</p>
<p>Relatedly, talked to Portia about her project studying a fitness landscape of a few mutations which confer antibiotic
resistance. Here she generated landscapes with respect to a whole bunch of drugs, and actually measured growth rates -
that is, things that are in the right units to understand evolution! The idea was to predict cycles of drugs
which would lead to reversion to the wild type. Definitely a cool idea, to try to understand something about
the fitness landscape and take advantage of dynamics to prevent strains with multiple resistances from arising.
Still, I wonder about two things: one, will there be other sites on the genome which will give benefits? Once we get to
high dimensions, this kind of manipulation may become harder. Secondly, how long do non-optimal genotypes
last in these populations?</p>
<p>For example, if a strain was resistant to drug A, but in the presence of drug B reverting
to the ancestral phenotype was beneficial and it began to do so, how long do the strains resistant to A persist? If it’s
long enough, they may rise up again when A comes back. If they are still at some non-trivial number, they may cause
resistance to flare up again faster than it would if resistance to A established via mutation alone. Would be good
to get some of the numbers to play around with to try and get at the plausibility of different scenarios.
I guess a related question would be, how often does resistance arise from standing variation rather than de novo mutation?</p>
<p>Also had a good chat with Otto Cordero about various topics. He mentioned that phage may attack mobile genetic elements
in bacteria more than other ones - which may mean that I should think about recombination in my bacteria-phage
coevolution stuff. I think a postdoc from his lab is going to talk about this at some point, and he’s giving a talk
later in the week, so hope to learn some more about the actual biology of the problem I’m thinking about!</p>
<h2 id="the-project">The project</h2>
<p>Looks like we’ll work on 2 projects, one for the first two weeks, and another for the second two weeks. The first two
weeks I’ll be working on the ecology of cellulose eating bacteria with <a href="http://www.stevenquistad.com/">Steven Quistad</a>.
In short, they passaged 10 different communities sampled from compost in minimal media with cellulose (ie a piece of
paper) as a carbon source. They transferred the microbes once every two weeks for a year. Each community was
passaged in one of two ways: first, there was the <em>Vertical</em> group, which was passaged normally. Second, there
was a <em>Horizontal</em> group which, every time it was passaged, also got some “phage juice”.</p>
<p>This phage juice consisted of a sample from all 10 horizontal groups, mixed together, and filtered of all particles larger
than 2 microns. This means that the juice should just consist of chemicals, DNA, and possibly bacteriophage. The idea
is that the horizontal group is getting some genetic elements shared across all the communities, and there is possibility
of horizontal gene transfer. If some useful gene gets to high frequency in one population, it may be transferred to the
other vials and uptaken there.</p>
<p>I think this is a pretty cool experiment, because it has the possibility of both having interesting community dynamics
(with the vertical alone), and says something about migration/horizontal gene transfer (with the horizontal treatment).
It would be cool if both inter and intra line information could be combined to do things like assess community
evolution and stability, new genes sweeping through the population, and the rate of recombination events. Stay tuned
for what direction I (and the other students) will take this in! Really excited to work on this, was my first choice
of all the projects. I’ll be working on a project with millifluidics in weeks 3 and 4, and will talk more about that
when I get there.</p>
Tue, 25 Jul 2017 00:00:00 +0000
https://atishagarwala.github.io/2017/07/25/Projects-assigned.html
KITP Qbio Day 1: Community selection and project proposals<p>Today marked the first day of the Qbio course!</p>
<p>Kicked off with Paul Rainey giving a talk about microbial ecology. He started off talking about a system of Pseudomonas
with three different types: wrinkly, smooth, and fuzzy. They have some non-trivial frequency dependent selection where
each one can invade the other, and he described some experiments to characterize them.</p>
<p>He then talked about an experiment which may be one of the ones I get a chance to work on, involving soil bacteria.
They took some samples from compost, and put them in minimal media with some paper (cellulose) as the main
carbon source. One set of 10 lines was passaged with serial dilution (and new paper) every two weeks. Another
set was passaged similarly, but with filtered “phage juice” added as well. The “juice” was
made from liquid from all 10 lines mixed and filtered down to virus/dna particle sizes. The idea here was that
one set of lines had the possibility for “migration” of genes - no individual bacteria making it through, but genomic
elements getting to migrate between populations.</p>
<p>The hope was that if an adaptive mutation/beneficial gene rose to
high levels in one population it would be sent to others, and then incorporated via horizontal gene transfer. They
have whole genome sequences for some individuals at the last timepoint, and then compared shotgun metagenome
reads to these whole genome sequences to see if they line up. They claim that there are large “gaps” whose emergence
can be tracked through time (and between lines in the case of the second experimental condition), which suggest
a large amount of horizontal gene transfer. It’s a cool experimental idea, and I would be curious how strong
that evidence of horizontal gene transfer actually is, and to try to falsify different scenarios. Would also
be cool to see if one can back out things about the evolutionary dynamics, and maybe even the incorporation
rates of horizontal gene transfer from those data.</p>
<p>The third section of his talk was quite interesting - talking about the possibility of selection on the level of the
community. The question was: how can you get emergent behavior where selection/heritability occur on the level of
some collection of types rather than just a few? A toy example given was a model of growth with spatial structure -
where there were patchy resources, and organisms had to travel from patch to patch. There was a tension between fast
growth at short timescales (growing to take over your patch), and surviving long enough to transmit yourself
to another patch. You could find regimes where guys who grew more slowly would, in the long term, win out. With a similar
setup you could also, in the right parameter regime, evolve a steady frequency of “altruistic” types which were sterile,
but conferred a fitness benefit to their downstream offspring.</p>
<p>This was a pretty interesting idea - that a multi-step growth dynamic with separation of timescales could lead to stabilization
of diversity in some way, and lead to types being generated/surviving which wouldn’t get a chance to in simpler environments.
In some sense this goes
against the notion of “fitness” that can be used to simply quantify evolutionary outcomes; in another way it just says that
what we call fitness might have to be coarse grained over the appropriate scale, and depend on certain higher order
moments/spatial averages of the system (ie frequency dependent selection).
I wonder if this sort of idea could be deployed to explain the existence of other cheater systems. I also wonder what
consequences this has for the role of stochasticity/differences in local population structure. Could this stuff be tied
into Michael (Pearce)’s spatial structure models? This also reminds me of classic antagonistic pleiotropy stuff, like
how flies may make themselves more mortal in order to have more early life growth.</p>
<p>He then talked about experimental setups where one could select “at the community level” by explicitly selecting for
mixed populations. In simulation you can do this in the right regime. I found this one a lot more artificial and a lot less
compelling; it reminds me of ecology models where the niches are “put in by hand”. I’ll have to think more about
the original example, and what that might say about what sorts of models/scenarios might be interesting to think about.</p>
<p>We also had the project presentations. I think I’ll talk about them more in subsequent posts, since I’ll be working on one or
two of them in the following weeks. Needless to say I think there are a few exciting ones that hopefully I’ll get to do
some cool stuff with. Stay tuned!</p>
Mon, 24 Jul 2017 00:00:00 +0000
https://atishagarwala.github.io/2017/07/24/Qbio-First-Day.html
Starting KITP Qbio course<p>First blog post in a few months, was pretty busy with the wedding (and then a lot of relaxing after that). Thought I’d try
to start up the blog again while I’m at the <a href="https://www.kitp.ucsb.edu/qbio/2017-course-description">2017 KITP QBio course</a>.
Goal will be to write 2 posts per week with thoughts, interesting discussions, and ideas for the project.</p>
<p>Overall I’m pretty excited for the course; been thinking about ecology and evolution for a bit now, and hopefully this will
help focus my directions/thoughts there. I’ll get exposure to what kind of cutting edge experimental work and
techniques are going on in the field right now, and hopefully learn some long-overdue genomics analysis/bioinformatics-y
skills.</p>
<p>People seem very nice; they had a great welcome BBQ for us, and so far I find the other students fun and easy to talk to.
Looking forward to learning from them and working alongside them during the course. They seem more bent towards
biology/bioinformatics, but I think that’s a plus since I kind of always have an easy in with the physics crowd coming
out of Daniel’s group.</p>
<p>I always feel a bit weird leaving the bubble of our office/group and coming out to the “real world” of science; I often feel
like I’m not accomplished enough to be rubbing elbows with the big PIs. I think it’s a bit more than that; I don’t feel
like I’m quite scholarly enough, or have an understanding of enough of the details which are important in a field to
say intelligent things about projects, or pitch ideas for new ones. I think right now my main skill (other than being
decent at playing around with modelling and stats stuff) is asking critical questions given a bunch of information
(which may not be very unique critical thoughts).</p>
<p>I think this year, I’d like to be able to get to the level of synthesis - being able to really generate
thoughts/ideas which are good directions to pursue. I’ve started to do that a bit for the networks stuff, but there
I have some advantage of networks just being mathematical objects; even then there’s a bit of a question of whether any of the
directions can coalesce into something interesting. I hope I can use the QBio course to propel me forward a bit
in that direction for eco-evo dynamics. One thing I’m slowly appreciating is that it’s not enough to have ideas that
seem good given the current information you have (ie “in theory”); you really need to have some of those ideas be good
“in practice”. I want to try to hold myself to a higher standard in terms of idea generation/higher level thought about
projects, and give myself the tools to do so. I think part of it will involve just spending more time thinking about these
things, and maybe more importantly learning to do so efficiently. Doesn’t do any good to go around in circles. Also, should
probably read more (or maybe just read better).</p>
<p>Since not much has happened yet, I’ll include some pictures of the Munger Residence, a new fancy building meant to house
KITP visitors in a kind of collaborative work/play/living space. Would be sweet to get to stay here, but for now we’re just
in the dorms.</p>
<p><br /></p>
<p><img src="/images/entrance.jpg" alt="Entrance" /></p>
<p><br /></p>
<p>Main entrance.</p>
<p><br /></p>
<p><img src="/images/interior_board.jpg" alt="Interior board" /></p>
<p><br /></p>
<p>Lots of chalkboards around, kind of a dream setup for a theorist.</p>
<p><br /></p>
<p><img src="/images/piano_and_boards.jpg" alt="Piano and boards" /></p>
<p><br /></p>
<p>Also (at least) two very nice pianos.</p>
<p><br /></p>
<p><img src="/images/dining_room.jpg" alt="Dining room" /></p>
<p><br /></p>
<p>Super fancy dining room.</p>
Sun, 23 Jul 2017 00:00:00 +0000
https://atishagarwala.github.io/2017/07/23/KITP-Qbio-course-start.html
Some thoughts on eigenvalues and SVD<p>I’ve been thinking a bit about spectra and singular values as I’ve been working
on some ideas related to ecology and neuroscience, and I’ve decided to write up
my musings. I’ve realized that I’ve never really sat down and thought about SVD,
as I usually deal with Hermitian, anti-Hermitian, or orthogonal matrices
which are all normal (so the singular vectors are just eigenvectors).
In this post I’ll just go through some thoughts on ideas in linear algebra, and
in a later post I may expand on the neuroscience and ecology ideas that got me
started thinking about this.</p>
<h2 id="definitions">Definitions</h2>
<p>We’ll be talking about linear operators (matrices, in coordinates), which
I’ll denote by boldface capital letters like <script type="math/tex">\textbf{M}</script>. Vectors will
be lower case bold <script type="math/tex">\textbf{x}</script>, and scalars will just be normal
“mathed” characters like <script type="math/tex">\lambda</script>.</p>
<h2 id="eigenvectors-and-jordan-form">Eigenvectors and Jordan form</h2>
<p>I’ll start by first reviewing some basic things about eigenvectors. Given
a linear operator <script type="math/tex">\textbf{M}</script>, sending <script type="math/tex">\mathbb{R}^{N}</script>
to <script type="math/tex">\mathbb{R}^{N}</script>, a vector <script type="math/tex">\textbf{x}</script> is an <em>eigenvector</em>
if we have <script type="math/tex">\textbf{M}\textbf{x} = \lambda \textbf{x}</script>, where <script type="math/tex">\lambda</script>
is the associated <em>eigenvalue</em>. This definition is good enough for finite
dimensional vector spaces. Otherwise, we can generalize the definition of
eigenvalue to that of the <em>spectrum</em> of an operator, which is the collection
of values <script type="math/tex">\lambda</script> such that <script type="math/tex">\textbf{M}-\lambda\,\textbf{Id}</script> is not invertible. Eigenvalues
are always in the spectrum, but in infinite dimensional vector spaces we can
get into funky situations where there are elements in the spectrum which don’t
correspond to any eigenvectors.</p>
<p>However, we’ll stick to finite dimensional vector spaces for this post (over
<script type="math/tex">\mathbb{C}</script>, for the pedants at home). In a
finite dimensional vector space, there is always at least one eigenvalue. One
way to see this is to think about the determinant of <script type="math/tex">\textbf{M}-\lambda\,\textbf{Id}</script>. This
is a polynomial in <script type="math/tex">\lambda</script>, so we know that we always have at least one root over the
complex numbers. With each eigenvalue, there is associated at least one
eigenvector.</p>
<p>If we have no degenerate eigenvalues
(one unique eigenvalue for each dimension), then we are guaranteed to have
an <em>eigenbasis</em> - a set of linearly independent eigenvectors which span our space.
In the eigenbasis coordinate system, <script type="math/tex">\textbf{M}</script> has a very simple form: it is
diagonal with entries equal to the eigenvalues <script type="math/tex">\lambda</script>. This is the source
of the power of eigenvalues and eigenvectors; they give us a coordinate system
where the linear operator can be described as simply as possible, as <script type="math/tex">N</script>
independent multiplications as opposed to <script type="math/tex">N^{2}</script> for a “typical” matrix.
In many linear or near-linear dynamical systems, understanding the <em>eigenmodes</em>
(dynamics in the eigenbasis) individually leads to direct understanding of the
global dynamics.</p>
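<p>A minimal numerical sketch of this (using numpy, with a small example matrix of my own choosing that has distinct eigenvalues):</p>

```python
import numpy as np

# A small example matrix with distinct eigenvalues, so an eigenbasis exists.
M = np.array([[2.0, 1.0],
              [0.5, 3.0]])

lam, B = np.linalg.eig(M)  # columns of B are the eigenvectors

# In the eigenbasis coordinate system, M becomes diagonal:
# B^{-1} M B = diag(lambda_1, ..., lambda_N).
D = np.linalg.inv(B) @ M @ B
assert np.allclose(D, np.diag(lam))
```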
<p>However, if we have a degeneracy (repeated eigenvalues, ie multiple roots in
the characteristic polynomial), then we are not guaranteed to have a complete
eigenbasis. In the worst case scenario we are guaranteed a nested set of invariant
subspaces. That is, we have some collection of vectors
<script type="math/tex">\{\textbf{v}_{\lambda,i}\}</script> such that</p>
<script type="math/tex; mode=display">\textbf{M}\left(\text{span}\{\textbf{v}_{\lambda,i}\}_{i\leq j}\right)\subseteq\text{span}\{\textbf{v}_{\lambda,i}\}_{i\leq j}</script>
<p>for every <script type="math/tex">j</script>. More concretely, this means we can find a basis where
<script type="math/tex">\textbf{M}</script> is almost diagonal, possibly with 1’s on the off diagonal.
This is known as the <em>Jordan canonical form</em>, and the blocks look like</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{pmatrix}
\lambda & 1 & & \\
 & \lambda & 1 & \\
 & & \lambda & \ddots\\
 & & & \ddots
\end{pmatrix} %]]></script>
<p>The non-diagonalizability is related to <em>nilpotent</em> elements - submatrices which
are non-zero, but become zero when raised to some power. A simple example is
the 3-d matrix</p>
<script type="math/tex; mode=display">% <![CDATA[
\textbf{N}_{3} = \begin{pmatrix}
0 & 1 & 0\\
0 & 0 & 1\\
0 & 0 & 0
\end{pmatrix} %]]></script>
<p>This matrix only has one eigenvalue (0) and one eigenvector (the first
coordinate). It is already in Jordan form. We can compute its powers as</p>
<script type="math/tex; mode=display">% <![CDATA[
\textbf{N}_{3} = \begin{pmatrix}
0 & 1 & 0\\
0 & 0 & 1\\
0 & 0 & 0
\end{pmatrix},~\textbf{N}_{3}^{2} = \begin{pmatrix}
0 & 0 & 1\\
0 & 0 & 0\\
0 & 0 & 0
\end{pmatrix},~\textbf{N}_{3}^{3} = \begin{pmatrix}
0 & 0 & 0\\
0 & 0 & 0\\
0 & 0 & 0
\end{pmatrix} %]]></script>
<p>The matrix is evidently a shift operator. It becomes 0 after the 3rd power.
(Note: in linear dynamical systems, operators like this set up dynamics
polynomial in time,
as opposed to the exponential dynamics set up by eigenvectors normally. In the
above example, if coordinate 1 is position, coordinate 2 becomes speed and 3
acceleration).</p>
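<p>This is easy to check numerically (a numpy sketch of the matrices above):</p>

```python
import numpy as np

# The 3x3 nilpotent shift operator from the text.
N3 = np.array([[0, 1, 0],
               [0, 0, 1],
               [0, 0, 0]])

N3_sq = N3 @ N3        # shifts by two: a single 1 left in the top-right corner
N3_cube = N3_sq @ N3   # the third power vanishes entirely

assert np.array_equal(N3_cube, np.zeros((3, 3), dtype=int))
# All eigenvalues are 0, as expected for a nilpotent matrix.
assert np.allclose(np.linalg.eigvals(N3), 0)
```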
<h2 id="invariant-subspaces-and-qr-decomposition">Invariant subspaces and QR decomposition</h2>
<p>While the eigenbasis decomposition is the most common and useful
way to characterize matrices, it is far from the only informative one.
We will talk briefly about converting matrices to triangular forms, with an
eye towards talking about singular value decomposition in the next section.</p>
<p>For any finite dimensional linear operator (matrix), we can find some basis
in which it is upper triangular. In some sense we already know this because of
the Jordan canonical form mentioned in the previous section; however we can
go through an alternative calculation which gets us a different decomposition.</p>
<p>What kind of operator does an upper triangular matrix represent? In terms
of the basis coordinates, the first coordinate maps to a subspace spanned by
the first coordinate, the first two coordinates map to a subspace spanned by
the first two, and so on and so forth. In this way, there is a nested
set of <em>invariant subspaces</em> - linear subspaces that map into themselves.</p>
<p>How can we construct such a set of nested invariant subspaces? One way
is to construct one inductively. Suppose that we have an invariant subspace
of dimension <script type="math/tex">k</script>. We can find one of dimension <script type="math/tex">k+1</script> in the following way.
Let <script type="math/tex">\textbf{P}_{k}</script> be the orthogonal projection onto our <script type="math/tex">k</script> dimensional
invariant subspace <script type="math/tex">V_{k}</script>.
That is, on <script type="math/tex">V_{k}</script> it is the identity;
orthogonal vectors map to 0. Then consider</p>
<script type="math/tex; mode=display">\textbf{M}_{k+1}\equiv (\textbf{Id}-\textbf{P}_{k})\textbf{M}</script>
<p>This linear map outputs onto the orthogonal subspace <script type="math/tex">(V_{k})_{\perp}</script>.
Therefore, restricted to <script type="math/tex">(V_{k})_{\perp}</script> it is a map from that subspace
onto itself. It has some eigenvector <script type="math/tex">v_{k+1}</script>. While <script type="math/tex">v_{k+1}</script> is an eigenvector
of <script type="math/tex">\textbf{M}_{k+1}</script>, it is generally not an eigenvector of <script type="math/tex">\textbf{M}</script>; however,
<script type="math/tex">\textbf{M}v_{k+1}</script> differs from <script type="math/tex">\lambda v_{k+1}</script> only by a component lying in <script type="math/tex">V_{k}</script>. Therefore,
<script type="math/tex">\textbf{M}</script> maps <script type="math/tex">V_{k}\oplus\text{span}\{v_{k+1}\}</script> into itself; we now have
a <script type="math/tex">k+1</script> dimensional subspace!</p>
<p>Our base case is a 1-dimensional invariant subspace; that is, an eigenvector.
We already know that each linear operator has an eigenvector; therefore we
are done. We’ve proven that a nested set of invariant subspaces exists.</p>
<p>In fact, we’ve done more. Each new basis vector was chosen orthogonal to
the invariant subspace it extends, so the basis is orthonormal. We have proven
the existence of a decomposition</p>
<script type="math/tex; mode=display">\textbf{M} = \textbf{U}\textbf{T}\textbf{U}^{\dagger}</script>
<p>where <script type="math/tex">\textbf{T}</script> is upper triangular, and <script type="math/tex">\textbf{U}</script> is a <em>unitary</em> matrix
(generalized rotation for
complex vector spaces, <script type="math/tex">\textbf{U}^{\dagger}\textbf{U} = \textbf{Id}</script>). Here
<script type="math/tex">\dagger</script> is the adjoint
(transpose + complex conjugate).</p>
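<p>This decomposition is what scipy’s <code>schur</code> routine computes directly (a sketch on a random example matrix of my choosing):</p>

```python
import numpy as np
from scipy.linalg import schur

rng = np.random.default_rng(0)
M = rng.standard_normal((4, 4))

# Schur decomposition: M = U T U^dagger, U unitary, T upper triangular.
# For a real input, output='complex' gives a genuinely triangular T
# (output='real' would give a block-triangular "quasi" form).
T, U = schur(M, output='complex')

assert np.allclose(U @ T @ U.conj().T, M)       # reconstruction
assert np.allclose(U.conj().T @ U, np.eye(4))   # U is unitary
assert np.allclose(T, np.triu(T))               # T is upper triangular
```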
<p>Related to this decomposition is the <em>QR decomposition</em> where <script type="math/tex">\textbf{M}</script>
can be written as <script type="math/tex">\textbf{Q}\textbf{R}</script> - where the first matrix is
orthogonal and the second is upper triangular. While our above decomposition
scheme depended on the eigenvectors chosen, the QR decomposition is
unique if <script type="math/tex">\textbf{M}</script> is invertible, and we choose the diagonal elements of
<script type="math/tex">\textbf{R}</script> to be positive and ordered. Note that RQ, LQ, and QL decompositions
also exist (L being lower triangular).</p>
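<p>numpy computes the QR decomposition directly (a quick sketch; note that numpy does not enforce the positive-diagonal convention on <code>R</code>, so its output is unique only up to signs):</p>

```python
import numpy as np

rng = np.random.default_rng(1)
M = rng.standard_normal((3, 3))

# QR decomposition: Q orthogonal, R upper triangular.
Q, R = np.linalg.qr(M)

assert np.allclose(Q @ R, M)            # reconstruction
assert np.allclose(Q.T @ Q, np.eye(3))  # Q is orthogonal
assert np.allclose(R, np.triu(R))       # R is upper triangular
```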
<h2 id="singular-value-decomposition">Singular value decomposition</h2>
<p>In the above decompositions, we’ve secretly added a hidden structure:
an inner product. Unitary matrices (and even the adjoint) implicitly assume
that some inner product exists. After all, a rotation is an operator that leaves
<em>lengths</em> invariant!</p>
<p>The most commonly used type of unitary transformation of a matrix is
the <em>singular value decomposition</em> (SVD). We decompose our matrix
<script type="math/tex">\textbf{M}</script> as</p>
<script type="math/tex; mode=display">\textbf{M} = \textbf{U}\mathbf{\Lambda}\textbf{V}^{\dagger}</script>
<p>where <script type="math/tex">\textbf{U}</script> and <script type="math/tex">\textbf{V}</script> are unitary, and <script type="math/tex">\mathbf{\Lambda}</script>
is diagonal. In order to make the decomposition unique up to
complex phases, one picks the diagonal entries of
<script type="math/tex">\mathbf{\Lambda}</script> to be positive and ordered (which can be accomplished
by permutation and multiplying phases into the unitary matrices).
The columns of <script type="math/tex">\textbf{V}</script> are the <em>right singular vectors</em>
and the columns of <script type="math/tex">\textbf{U}</script> are the <em>left singular vectors</em>.</p>
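<p>A quick numpy sketch of the decomposition (on a random, non-square example matrix, to show it applies there too):</p>

```python
import numpy as np

rng = np.random.default_rng(2)
M = rng.standard_normal((4, 3))  # SVD works for non-square matrices as well

U, s, Vh = np.linalg.svd(M, full_matrices=False)  # Vh is V^dagger

# Singular values come out non-negative and sorted in decreasing order.
assert np.all(s >= 0) and np.all(s[:-1] >= s[1:])
# Reconstruction: M = U Lambda V^dagger.
assert np.allclose(U @ np.diag(s) @ Vh, M)
```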
<p>There are two ways to think about SVD, and the singular values themselves. One
is as solutions to the optimization problem</p>
<script type="math/tex; mode=display">\textbf{u}_{c},\textbf{v}_{c} = \underset{\textbf{u},\textbf{v}}{\arg\max}
~\textbf{u}^{\dagger}\textbf{M}\textbf{v}</script>
<p>where <script type="math/tex">\textbf{u}</script> and <script type="math/tex">\textbf{v}</script> are constrained to be fixed length.
The singular values are the stationary values of this constrained optimization;
the pairs of vectors <script type="math/tex">\textbf{u}</script> and <script type="math/tex">\textbf{v}</script> give us the sets of
left and right singular vectors respectively. These vectors obey the equations</p>
<script type="math/tex; mode=display">\lambda_{u}\textbf{u}_{c} = \textbf{M}\textbf{v}_{c},~\lambda_{v}\textbf{v}_{c} = \textbf{M}^{\dagger}\textbf{u}_{c}</script>
<p>If we pick the pair that gives us the maximal value, we can “shave it off”
from the space: we know that inputting in <script type="math/tex">\textbf{v}_{c}</script> gets us
an output proportional to <script type="math/tex">\textbf{u}_{c}</script>. However, if we have an input
orthogonal to <script type="math/tex">\textbf{v}_{c}</script>, its output will be orthogonal to
<script type="math/tex">\textbf{u}_{c}</script> by the second equation. Therefore we can take one dimension
out of the input space and one out of the output space; we can iterate
and construct the SVD as defined above.</p>
<p>Note that this kind of “optimization” definition nets us eigenvectors
and eigenvalues
when <script type="math/tex">\textbf{M}</script> is symmetric, and we compute
<script type="math/tex">\textbf{u}^{\text{T}}\textbf{M}\textbf{u}</script>. This optimization formulation
correctly generalizes to the case where the matrices are not symmetric.</p>
<p>Another way to get the singular values and vectors is to consider
the matrices <script type="math/tex">\textbf{M}\textbf{M}^{\dagger}</script> and
<script type="math/tex">\textbf{M}^{\dagger}\textbf{M}</script>. These are both <em>Hermitian</em> matrices
(generalization of symmetric to complex valued). Hermitian matrices
all have a complete, orthogonal eigenbasis with real eigenvalues.
If we cheat and plug in the SVD form of <script type="math/tex">\textbf{M}</script> into each
of the formulae we get</p>
<script type="math/tex; mode=display">\textbf{M}\textbf{M}^{\dagger} = \textbf{U}\mathbf{\Lambda}\textbf{V}^{\dagger}\textbf{V}\mathbf{\Lambda}^{\dagger}\textbf{U}^{\dagger} = \textbf{U}\mathbf{\Lambda}\mathbf{\Lambda}^{\dagger}\textbf{U}^{\dagger}</script>
<p>and similarly</p>
<script type="math/tex; mode=display">\textbf{M}^{\dagger}\textbf{M} = \textbf{V}\mathbf{\Lambda}^{\dagger}\mathbf{\Lambda}\textbf{V}^{\dagger}</script>
<p>Therefore, the singular vectors are the eigenvectors of the two product matrices,
and the singular values are the (positive) square roots of the
eigenvalues of the two operators. (As a bonus note, these formulae
tell us that for real <script type="math/tex">\textbf{M}</script>, the singular vectors are real as well.)</p>
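<p>A numerical check of this correspondence (numpy, on a random real example matrix):</p>

```python
import numpy as np

rng = np.random.default_rng(3)
M = rng.standard_normal((3, 3))

s = np.linalg.svd(M, compute_uv=False)

# M^T M is Hermitian, so eigvalsh applies; its eigenvalues (returned in
# ascending order) are the squared singular values of M.
evals = np.linalg.eigvalsh(M.T @ M)

assert np.allclose(np.sort(s**2), evals)
```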
<h2 id="features-of-svd-and-applications">Features of SVD and applications</h2>
<p>There are a few important notes to make about SVD. The first is that
SVD is very coordinate dependent. It depends on the (often implicit)
choice of an inner product on our space, from which notions of orthogonality
can be defined (and unitary matrices/transposes). The SVD
is invariant under rotations (which don’t change the inner product)
but not under non-orthogonal transformations (which correlate coordinates).</p>
<p>The second is a related point: the singular values are in general NOT the
same as the eigenvalues, even in magnitude! As an example, consider the matrix</p>
<script type="math/tex; mode=display">% <![CDATA[
\textbf{M} = \begin{pmatrix}
1 & 1 \\
0 & 1
\end{pmatrix} %]]></script>
<p>It has one degenerate eigenvalue <script type="math/tex">\lambda = 1</script>. Now consider
<script type="math/tex">\textbf{M}^{\dagger}\textbf{M}</script>; we have</p>
<script type="math/tex; mode=display">% <![CDATA[
\textbf{M}^{\dagger}\textbf{M} = \begin{pmatrix}
2 & 1 \\
1 & 1
\end{pmatrix} %]]></script>
<p>If the eigenvalues were matched to the singular values, this matrix would have
eigenvalues 1 and 1. However, its trace is 3; therefore its singular
values and eigenvalues don’t match! More on which values matter later.</p>
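<p>We can verify the mismatch numerically for the example matrix above:</p>

```python
import numpy as np

M = np.array([[1.0, 1.0],
              [0.0, 1.0]])

# Both eigenvalues are 1 (degenerate)...
assert np.allclose(np.linalg.eigvals(M), [1.0, 1.0])

# ...but the singular values are not: the sum of squared singular values
# equals trace(M^T M) = 3, which 1^2 + 1^2 cannot match.
s = np.linalg.svd(M, compute_uv=False)
assert not np.allclose(s, [1.0, 1.0])
assert np.isclose(np.sum(s**2), 3.0)
```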
<p>Also note that SVD can be applied to non-square matrices. In this case,
<script type="math/tex">\textbf{V}</script> is a square matrix with dimensions of the input, <script type="math/tex">\textbf{U}</script>
is the same for the output, and <script type="math/tex">\mathbf{\Lambda}</script> has the same dimensions
as <script type="math/tex">\textbf{M}</script> - and is diagonal or 0 (rectangular diagonal matrix).
The number of singular values is the minimum of the
dimensions of <script type="math/tex">\textbf{M}</script> - the structure fundamentally can only have
the dimensionality of the simplest vector space involved!</p>
<p>The singular values match the magnitudes of the eigenvalues only when the
matrix is <em>normal</em> - that is, it has an orthogonal eigenbasis. In this case
<script type="math/tex">\textbf{U} = \textbf{V}</script> up to complex phases. In fact, thinking about singular values gives
us another way to characterize normal dynamics: an operator <script type="math/tex">\textbf{M}</script>
is normal if and only if</p>
<script type="math/tex; mode=display">\textbf{M}\textbf{M}^{\dagger} = \textbf{M}^{\dagger}\textbf{M}</script>
<p>In other words, normal operators commute with their transposes! We prove this
quite simply; the two products above can be used to get the <script type="math/tex">\textbf{V}</script>
and the <script type="math/tex">\textbf{U}</script>, and if the singular vectors are the same, the
matrices from which they are derived should be the same too.</p>
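<p>A sketch of this normality test in numpy (the example matrices here are my own choices):</p>

```python
import numpy as np

def is_normal(M, tol=1e-10):
    """Check whether M commutes with its adjoint."""
    return np.allclose(M @ M.conj().T, M.conj().T @ M, atol=tol)

# A symmetric matrix is normal; its singular values are the |eigenvalues|.
S = np.array([[2.0, 1.0],
              [1.0, 2.0]])
assert is_normal(S)
assert np.allclose(np.linalg.svd(S, compute_uv=False),
                   np.sort(np.abs(np.linalg.eigvals(S)))[::-1])

# The shear matrix from the earlier example is not normal.
assert not is_normal(np.array([[1.0, 1.0], [0.0, 1.0]]))
```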
<p>All of these points come up when we use SVD in the “real world”.
The most common useage of singular value decomposition is to analyze and
reduce the dimensionality of data. Consider a data matrix <script type="math/tex">\textbf{M}</script>
constructed by “gluing together” data (column) vectors <script type="math/tex">\textbf{v}_{i}</script>:</p>
<script type="math/tex; mode=display">% <![CDATA[
\textbf{M} = \begin{pmatrix}
\textbf{v}_{1} & \textbf{v}_{2} & \textbf{v}_{3} & ...
\end{pmatrix} %]]></script>
<p>The right singular vectors <script type="math/tex">\textbf{V}</script> give some sense of the correlations
between different data elements. The left vectors <script type="math/tex">\textbf{U}</script> give a basis
over the data space. To reduce the dimensionality of data while preserving
the structure, we can project the data onto only the left singular
vectors which correspond to the largest singular values.</p>
<p>What structure are we preserving? Recall that the <script type="math/tex">\textbf{U}</script> are
the eigenvectors of <script type="math/tex">\textbf{M}\textbf{M}^{\dagger}</script>. In our data matrix,
the <script type="math/tex">ij</script>th entry of this matrix corresponds to</p>
<script type="math/tex; mode=display">(\textbf{M}\textbf{M}^{\dagger})_{ij} = \sum_{\alpha}(\textbf{v}_{\alpha})_{i}
(\textbf{v}_{\alpha})_{j}</script>
<p>If the mean is subtracted out of our data, this is (up to normalization) the covariance between
coordinates <script type="math/tex">i</script> and <script type="math/tex">j</script>! Therefore, the <script type="math/tex">\textbf{U}</script> give us an ordered basis
which contributes to the covariance. Projecting our data into some subset of the
<script type="math/tex">\textbf{U}</script> then preserves the covariance structure, while reducing
dimensionality. This is the idea behind the popular dimensionality reduction
technique <em>PCA</em>. The eigenvectors <script type="math/tex">\textbf{U}</script> are called <em>principal components</em>
(PCs). PCA is the appropriate thing to do when Gaussian distributions
are involved, but is surprisingly useful in situations where that is not the
case.</p>
<p>Our understanding of SVD tells us a few things about PCA. First, it is
rotationally invariant. If we multiply our data by any unitary matrix, the PCA
will basically be the same; PCs will be in direct 1-1 correspondence.
However, any non-unitary linear
transformation will change the structure of the PCs. The principal values will
be different, and the PCs won’t map to each other 1-1. This is just for
arbitrary <em>linear</em> transforms! Non-linear transforms will be even less related.
That’s why if you’re doing PCA on data which isn’t Gaussian distributed,
it may be worth randomly transforming the data, doing PCA, and seeing if the
conclusions you draw are the same.</p>
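<p>A minimal PCA-via-SVD sketch on toy data (the data shape, mixing matrix, and rotation angle here are all illustrative choices of mine), which also checks the rotational invariance:</p>

```python
import numpy as np

rng = np.random.default_rng(4)
# Toy data: 2 features x 200 samples, with correlated coordinates.
X = np.array([[2.0, 0.0],
              [1.0, 0.5]]) @ rng.standard_normal((2, 200))

Xc = X - X.mean(axis=1, keepdims=True)  # subtract the mean from each feature
U, s, _ = np.linalg.svd(Xc, full_matrices=False)

# Project onto the top principal component (largest singular value).
projected = U[:, :1].T @ Xc
assert projected.shape == (1, 200)

# Rotating the data rotates the PCs but leaves the singular values alone.
theta = 0.7
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
s_rot = np.linalg.svd(R @ Xc, compute_uv=False)
assert np.allclose(s_rot, s)
```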
<h2 id="normal-vs-non-normal-dynamics">Normal vs. non-normal dynamics</h2>
<p>We conclude with an example of the qualitative differences between understanding
things from the singular value picture versus the eigenvalue decomposition.
I may expand this in a later post with some examples from neuroscience and
ecology modelling which inspired me to revisit my understanding of matrix
decomposition.</p>
<p>Consider a linear dynamical system with the equation</p>
<script type="math/tex; mode=display">\dot{\textbf{x}} = \textbf{M}\textbf{x}</script>
<p>Let <script type="math/tex">\textbf{w}</script> be some fixed vector. We want to ask: when can a
perturbation in the <script type="math/tex">\textbf{w}</script> direction have no effect on the dynamics,
even though the <script type="math/tex">\textbf{w}</script>
component of <script type="math/tex">\textbf{x}</script> remains present at late times?</p>
<p>In general, we can solve linear dynamical systems by diagonalizing
<script type="math/tex">\textbf{M}</script>, and solving the resulting uncoupled system of equations. (We
won’t deal with the case of degeneracy here.) Let
<script type="math/tex">\textbf{M} = \textbf{B}\mathbf{\lambda}\textbf{B}^{-1}</script> for diagonal
<script type="math/tex">\mathbf{\lambda}</script>. Let <script type="math/tex">\textbf{y} = \textbf{B}^{-1}\textbf{x}</script>.
Then each component of <script type="math/tex">\textbf{y}</script> evolves independently:</p>
<script type="math/tex; mode=display">y_{i}(t) = c_{i}e^{\lambda_{i}t}, \qquad c_{i} = y_{i}(0)</script>
<p>We can use this basis to answer our question.</p>
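<p>Before doing so, a quick numerical sanity check of the diagonalization recipe itself. This sketch solves a made-up 4x4 system by diagonalizing and cross-checks against direct numerical integration (the matrix, time, and initial condition are all invented for the example):</p>

```python
import numpy as np

rng = np.random.default_rng(2)
M = rng.normal(size=(4, 4))
x0 = rng.normal(size=4)
t = 0.5

# Diagonalize M = B diag(lam) B^{-1}; in y = B^{-1} x coordinates the
# system decouples, so y_i(t) = c_i exp(lam_i t) with c = B^{-1} x0.
lam, B = np.linalg.eig(M)
c = np.linalg.solve(B, x0.astype(complex))
x_diag = (B @ (c * np.exp(lam * t))).real

# Cross-check against RK4 integration of xdot = M x.
def rk4_step(x, dt):
    k1 = M @ x
    k2 = M @ (x + 0.5 * dt * k1)
    k3 = M @ (x + 0.5 * dt * k2)
    k4 = M @ (x + dt * k3)
    return x + (dt / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)

x_num = x0.copy()
dt = 1e-3
for _ in range(int(round(t / dt))):
    x_num = rk4_step(x_num, dt)

assert np.allclose(x_diag, x_num)
```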
<p>Consider first the case where <script type="math/tex">\textbf{M}</script> is normal. Then <script type="math/tex">\textbf{B}</script> is
unitary, so in the coordinate
system of the <script type="math/tex">\textbf{y}</script>, the inner product is just the standard inner product.
If <script type="math/tex">\textbf{w}</script> does not affect the dynamics, that means that <script type="math/tex">\textbf{w}</script>
does not project much onto components of <script type="math/tex">\textbf{y}</script> which have
large <script type="math/tex">\lambda_{i}</script>. Let <script type="math/tex">\textbf{w}' = \textbf{B}^{-1}\textbf{w}</script>; then we have</p>
<script type="math/tex; mode=display">\textbf{w}^{\dagger}\textbf{x} = \textbf{w}'^{\dagger}\textbf{y}</script>
<p>which is given by</p>
<script type="math/tex; mode=display">\textbf{w}^{\dagger}\textbf{x} = \sum_{i}(\textbf{w}')_{i}c_{i}e^{\lambda_{i}t}</script>
<p>We know this is small since <script type="math/tex">\textbf{w}</script> does not project much onto the
<script type="math/tex">\textbf{y}</script> which matter; therefore <script type="math/tex">\textbf{w}</script> does not play into the dynamics
much.</p>
<p>However, what happens if <script type="math/tex">\textbf{M}</script> is not normal? Then we can arrive at our
desired behavior, which we can see in one of two ways. First, suppose that
<script type="math/tex">\textbf{w}</script> has a large projection onto the leading left singular vectors <script type="math/tex">\textbf{U}</script>
but is uncorrelated with the corresponding right singular vectors <script type="math/tex">\textbf{V}</script>. Then
<script type="math/tex">\textbf{M}\textbf{w}</script> is small, so perturbations along <script type="math/tex">\textbf{w}</script> barely
drive the dynamics; but the output of <script type="math/tex">\textbf{M}</script> lies along the
<script type="math/tex">\textbf{U}</script>, onto which <script type="math/tex">\textbf{w}</script> projects strongly, so
<script type="math/tex">\textbf{w}^{\dagger}\textbf{x}</script> can remain large at late times.</p>
<p>This explanation works, but is a bit annoying since it’s not in terms of the
eigenbasis - which gives us a complete understanding of the dynamics. Here’s
an alternative explanation.</p>
<p>Suppose again that <script type="math/tex">\textbf{M}</script> is non-normal. Suppose now that
<script type="math/tex">\textbf{w}</script> is a zero eigenvector of <script type="math/tex">\textbf{M}</script>. That is, it doesn’t
contribute at all to the dynamics. Now when we calculate
<script type="math/tex">\textbf{w}^{\dagger}\textbf{x}</script>, we have</p>
<script type="math/tex; mode=display">\textbf{w}^{\dagger}\textbf{x} =
(\textbf{w}')^{\dagger}\textbf{B}^{\dagger}\textbf{B}\textbf{y}</script>
<p>Now, even though <script type="math/tex">\textbf{w}</script> is an eigenvector of <script type="math/tex">\textbf{M}</script> (ie, a
coordinate basis vector in the <script type="math/tex">\textbf{y}</script> coordinates), the inner products
are no longer compatible between the bases. The inner product is defined by
the matrix <script type="math/tex">\textbf{B}^{\dagger}\textbf{B}</script>, which for non-unitary <script type="math/tex">\textbf{B}</script> induces
correlations between the different eigenmodes! Therefore even though
<script type="math/tex">\textbf{w}</script> has literally 0 dynamical contribution, projecting onto
<script type="math/tex">\textbf{w}</script> in the original space still gets us signal.</p>
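<p>A two-dimensional sketch makes this concrete (the specific shear matrix is made up for illustration): <script type="math/tex">\textbf{w}</script> is an eigenvector with eigenvalue zero, so it contributes nothing to the dynamics of its own, yet the decaying mode leaks a large, persistent component into the <script type="math/tex">\textbf{w}</script> direction.</p>

```python
import numpy as np

# Made-up non-normal (shear-like) matrix with eigenvalues 0 and -1.
M = np.array([[0.0, 10.0],
              [0.0, -1.0]])
assert not np.allclose(M @ M.T, M.T @ M)   # M is not normal

# w spans the zero-eigenvalue mode: M w = 0, so a perturbation along w
# never feeds back into the other coordinate.
w = np.array([1.0, 0.0])
assert np.allclose(M @ w, 0.0)

# Exact solution of xdot = M x for this M:
#   x2(t) = x2(0) exp(-t),  x1(t) = x1(0) + 10 x2(0) (1 - exp(-t))
def x_at(t, x0):
    return np.array([x0[0] + 10.0 * x0[1] * (1.0 - np.exp(-t)),
                     x0[1] * np.exp(-t)])

# Start with NO component along w; at late times w^dagger x is still large.
x_late = x_at(10.0, np.array([0.0, 1.0]))
assert w @ x_late > 9.9
```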
<p>Here we arrive at an interesting feature of non-normal dynamics: the input-output
behavior is not symmetric as it is for normal dynamics. In other words,
coordinates which are nice for understanding dynamics don’t play nicely with the
inner product. This is not important when there is symmetry up to linear
transformations in the problem; one can then switch bases at will. However,
when there is a particular basis with meaning (ie, the basis of neuron
activities or species abundances), or when the noise characteristics are known
in a particular basis (ie the basis with uncorrelated noise), then
the distinction between normal and non-normal dynamics truly matters.</p>
<h2 id="links-to-neuroscience-and-ecology">Links to neuroscience and ecology</h2>
<p>I won’t go into this now, but the question of normality of dynamics may
be important for modelling efforts in both theoretical neuroscience and
theoretical ecology. I hope to expand on this in a later post, when I have
thought a bit more about how these things apply.</p>
Sun, 05 Mar 2017 00:00:00 +0000
https://atishagarwala.github.io/2017/03/05/eig-and-svd-1.html
https://atishagarwala.github.io/2017/03/05/eig-and-svd-1.html
linearalgebra, spectra
Bulk fitness assay code up!<p>Just publicly released my fitness assay code!</p>
<p><a href="https://github.com/barcoding-bfa/fitness-assay-python">https://github.com/barcoding-bfa/fitness-assay-python</a></p>
<p>This is essentially the first bit of science I’ve done that’s been published, and my first time releasing code into the wild.
Kind of cool to see the first successful project of my PhD actually out there.</p>
<p>In case you’re interested in reading the paper, you can find it <a href="http://dx.doi.org/10.1016/j.cell.2016.08.002">here</a>. Basically
I helped develop ways to analyze data on abundance of different subtypes of yeast in order to figure out which ones were
better, and by how much. That paper is more of a methods one, but we have a more sciencey one in the works where we use this
analysis to figure out (or attempt to figure out!) how exactly the adaptive subtypes are gaining an advantage. I’ll post
about that when it comes out, look for it around March!</p>
Fri, 27 Jan 2017 00:00:00 +0000
https://atishagarwala.github.io/2017/01/27/Fitness-assay-code-done.html
https://atishagarwala.github.io/2017/01/27/Fitness-assay-code-done.html
fitnessassay, experimentalevo
Does this work?<p>File is in <code class="highlighter-rouge">/_posts</code>.</p>
<p>And now for some Latex $\int\tau u\bar{f}\hat{f}$:</p>
<script type="math/tex; mode=display">e = mc^{2}</script>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">def</span> <span class="nf">a_bit_of_code</span><span class="p">(</span><span class="n">some_val</span><span class="p">):</span>
    <span class="k">return</span> <span class="n">some_val</span></code></pre></figure>
Tue, 17 Jan 2017 09:24:01 +0000
https://atishagarwala.github.io/junk/2017/01/17/test-post.html
https://atishagarwala.github.io/junk/2017/01/17/test-post.html
trying_jekyll, junk