March 30, 2015

Causality, part II (was it caused by Part I?)

This post serves as a follow-up to a Synthetic Daisies post written in 2012 on new methods to detect causality in data.

Here are a few interesting readings at the intersection of data analysis and the philosophy of science. The first [1] is a new arXiv paper [2] that evaluates two machine learning approaches to inferring causality. A plethora of discriminative machine learning techniques have emerged in recent years to address relatively simple relationships. When it comes to cause and effect itself, the distinguishing signal is often subtle and unclear even for seemingly obvious sets of relationships. The techniques evaluated in [2] are called Additive Noise Methods [3] and Information Geometric Causal Inference [4]. A dataset called CauseEffectPairs [5] was used to benchmark each method and to show that causal relationships can be uncovered from a wide variety of data.
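To make the idea behind [4] a little more concrete, here is a minimal sketch of a slope-based score in the spirit of Information Geometric Causal Inference. It is an illustration only, not the code benchmarked in [2]: the function names (igci_slope_score, infer_direction), the min-max rescaling, and the toy data (an evenly spaced x with the noiseless relationship y = x^3) are all assumptions made for brevity. The direction with the lower mean log-slope is taken as the causal one.

```python
import math


def _rescale(values):
    """Min-max rescale to [0, 1] (assumes a uniform reference measure)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("variable is constant")
    return [(v - lo) / (hi - lo) for v in values]


def igci_slope_score(cause, effect):
    """Mean log-slope of `effect` against `cause`, after sorting by `cause`."""
    pairs = sorted(zip(_rescale(cause), _rescale(effect)))
    total, count = 0.0, 0
    for (x0, y0), (x1, y1) in zip(pairs, pairs[1:]):
        dx, dy = x1 - x0, y1 - y0
        if dx != 0 and dy != 0:  # skip ties, where the log-slope is undefined
            total += math.log(abs(dy / dx))
            count += 1
    if count == 0:
        raise ValueError("no usable adjacent pairs")
    return total / count


def infer_direction(x, y):
    """Prefer the direction with the lower score."""
    c_xy = igci_slope_score(x, y)
    c_yx = igci_slope_score(y, x)
    if c_xy < c_yx:
        return "X->Y"
    if c_yx < c_xy:
        return "Y->X"
    return "undecided"


# Toy data (hypothetical): x evenly spaced on [0, 1], y a noiseless function of x.
xs = [i / 199 for i in range(200)]
ys = [v ** 3 for v in xs]
print(infer_direction(xs, ys))  # prints X->Y
```

On this toy example the score is negative in the X to Y direction and positive in the reverse direction, so the sketch prefers X to Y. Real data are noisy, so a decision that rests on a small difference between the two scores should be treated with caution.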

The second paper (or rather series of papers) is on the topic of strong inference [6]. Strong inference is an alternative to hyper-reductionism and the use of over-simplified models. Strong inference involves the use of a conditional inductive tree to examine the possible causes for a given phenomenon [7]. Potential causes (or hypotheses) represent nodes of the tree, and these hypotheses are falsified as one moves through the tree using either inductive or empirical criteria. Unlike the machine learning models we discussed, the goal is to lead a researcher to key experiments that help to uncover the sources of variation. In general, this process of elimination leads us to the best answers. Yet according to Platt [6], this approach can ultimately provide us with axiomatic statements.
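The elimination logic of strong inference can be sketched in a few lines of Python. This is only a toy: the hypothesis labels (H1 to H3), the experiment names, the predicted outcomes, and the observed results are all hypothetical placeholders, and real hypotheses are rarely falsified this cleanly.

```python
def strong_inference(predictions, observed):
    """Prune hypotheses whose predictions disagree with each observed outcome.

    predictions: {hypothesis: {experiment: predicted outcome}}
    observed:    ordered list of (experiment, observed outcome) pairs
    """
    surviving = set(predictions)
    trace = []
    for experiment, outcome in observed:
        if len(surviving) <= 1:
            break  # nothing left to discriminate between
        falsified = {h for h in surviving
                     if predictions[h][experiment] != outcome}
        surviving -= falsified
        trace.append((experiment, sorted(falsified), sorted(surviving)))
    return surviving, trace


# Hypothetical hypotheses and what each predicts for two crucial experiments.
predictions = {
    "H1": {"exp_a": True, "exp_b": True},
    "H2": {"exp_a": True, "exp_b": False},
    "H3": {"exp_a": False, "exp_b": True},
}
# Hypothetical outcomes of running the experiments in order.
observed = [("exp_a", True), ("exp_b", False)]

surviving, trace = strong_inference(predictions, observed)
for step in trace:
    print(step)
print(surviving)
```

Each pass through the loop corresponds to one branch point in the tree: an experiment is chosen because the remaining hypotheses disagree about its outcome, and whatever is inconsistent with the result is dropped.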

Conceptual steps involved in strong inference. COURTESY: Figure 1 in [8].

While this seems to be a fruitful methodology, it has turned out to be more a source of inspiration than of analytical rigor [9]. Strong inference has influenced a variety of scientific fields, concentrated in the biological and social sciences. Platt predicted [6] that fields which adopted strong inference would experience a greater number of breakthrough advances. However, in testing Platt's predictions regarding the efficacy of strong inference, it has been found that advances are not directly related to the adoption of the method [10]. This could be due to our incomplete understanding of the factors that drive scientific discovery and the rate of advancement.

[2] Mooij, J.M., Peters, J., Janzing, D., Zscheischler, J., and Scholkopf, B.   Distinguishing cause from effect using observational data: methods and benchmarks. arXiv, 1412.3773 (2014).

[3] Hoyer, P.O., Janzing, D., Mooij, J.M., Peters, J., and Scholkopf, B.   Nonlinear causal discovery with additive noise models. In Advances in Neural Information Processing Systems (NIPS), 21, 689-696 (2009).

[4] Daniusis, P., Janzing, D., Mooij, J.M., Zscheischler, J., Steudel, B., Zhang, K., and Scholkopf, B. Inferring deterministic causal relations. In Proceedings of the 26th Annual Conference on Uncertainty in Artificial Intelligence (UAI), 143-150 (2010).

[5] This work was part of the CauseEffect Pairs Challenge and was presented at NIPS 2013.

[6] Platt, J.R.   Strong Inference: certain systematic methods of scientific thinking may produce much more rapid progress than others. Science, 146(3642), 347-352 (1964).

[7] Neuroskeptic   Is Science Broken? Let's Ask Karl Popper. Neuroskeptic blog, March 15 (2015).

[8] Fudge, D.S.   Fifty years of J.R. Platt's Strong Inference. Journal of Experimental Biology, 217, 1202-1204 (2014).

[9] Davis, R.H.   Strong Inference: rationale or inspiration? Perspectives in Biology and Medicine, 49(2), 238-250 (2006).

[10] O'Donohue, W. and Buchanan, J.A.   The Weaknesses of Strong Inference. Behavior and Philosophy, 29, 1-20 (2001).

March 14, 2015

A Modest Framework for Scientific Transparency

Here are six points for the integration of open-access science publishing and open data. This was developed from personal practice and research in addition to interactions with the Research Data Service (University of Illinois) and the SciFund challenge. This pipeline begins at the write-up stage, but some points rely on practice prior to analysis and write-up.

A)   Preprint (e.g. kernel of hypothesis- or question-driven results).

A number of options exist for this, including arXiv, bioRxiv, PLoS One, or another permanent location that provides a formal archival address or digital object identifier (DOI). The core paper should be brief (6-12 pages) and formal.

B)   Advanced methods/theory.

These can be submitted as supplemental materials, either in the same repository as the preprint itself or on another permanent server. As opposed to simple auxiliary files, this should be set up more along the lines of an IPython notebook.

C)   Advanced Analysis.

This can be treated in the same manner as the advanced methods/theory. This will include transformational datasets (e.g. time-frequency decompositions, log transforms, combinations of data from multiple sources in a common framework) and the associated data tables and figures/graphs.

D)   Datasets.

1)   Raw Data: images, unprocessed vectorial or matricial output.

These will be stored as formatted image files, ASCII files, or tabular files.

2)   Processed Data: numeric variables, simple annotation.

These will be appended to the raw data either in the file or as linked files in the same directory.

3)   Higher-level Data: correlational, data fusion, decompositional.

These will include the transformational datasets mentioned in the section on Advanced Analysis. These datasets are to be linked to the raw and processed data directory. Simple annotation methods will confirm the identity.

4) Higher-level Representation: RDF/XML descriptive models, algorithmic (e.g. data landscapes, possibility spaces).

These types of representations can help us go beyond the typical reliance on “statistical significance” and “future directions” to provide a rigorous approach to guide future investigations. An example of this is parameterization models from existing data.
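One way to read the "simple annotation" that links higher-level data back to raw and processed data is as a small manifest that records a checksum for every file along with the files it was derived from. The sketch below is a minimal version of that idea under assumed conventions: the directory names (raw, processed), the file names, and the manifest fields (sha256, derived_from) are hypothetical and not part of any existing standard.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path):
    """Checksum a file in chunks so large raw files are not loaded at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root, derived_from):
    """Map each file under `root` to its checksum and its declared sources."""
    root = Path(root)
    manifest = {}
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        rel = path.relative_to(root).as_posix()
        manifest[rel] = {
            "sha256": sha256_of(path),
            "derived_from": derived_from.get(rel, []),
        }
    return manifest


def changed_files(root, manifest):
    """List files that are missing or whose checksum no longer matches."""
    root = Path(root)
    return [rel for rel, entry in manifest.items()
            if not (root / rel).is_file()
            or sha256_of(root / rel) != entry["sha256"]]


def demo():
    """Build a manifest for a throwaway directory, then alter one raw file."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "raw").mkdir()
        (root / "processed").mkdir()
        (root / "raw" / "trial01.csv").write_text("t,signal\n0,1.0\n")
        (root / "processed" / "trial01_log.csv").write_text("t,log_signal\n0,0.0\n")
        links = {"processed/trial01_log.csv": ["raw/trial01.csv"]}
        manifest = build_manifest(root, links)
        before = changed_files(root, manifest)
        (root / "raw" / "trial01.csv").write_text("t,signal\n0,2.0\n")
        after = changed_files(root, manifest)
        return before, after


if __name__ == "__main__":
    print(demo())
```

A manifest like this could be stored next to the data and regenerated whenever a derived dataset is updated, so that a reader can confirm which raw files a given figure or table traces back to.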

E)   Blogging Publicity.

All materials should be promoted through a blog post. This can be in the form of a feature article, or as a series of annotated links. This can be followed up with reposting key features of the initial post to a social blog like Tumblr or sharing a link via Twitter.

F)   Peer Commentary.

While this is typically kept confidential, there are so-called post-peer-review venues that provide a means to review work (e.g. PeerJ, F1000). This includes both formal (actionable) statements and informal statements in the form of critiques. 

This outline represents the entirety of a scientific reporting pipeline (from formal write-up to published items), although I am no doubt missing something. I will be fleshing each of these points out in future posts with real data and examples from Orthogonal Research and my work at the University of Illinois.

March 9, 2015

Review of "Arrival of the Fittest"

"Arrival of the Fittest" is the latest book by Andreas Wagner, a professor at the University of Zurich in Switzerland. The book [1] tackles a subtly complicated topic: the evolution and evolvability of innovations using a biochemical and computational perspective. For the most part, Wagner succeeds at presenting an elegant case for how the ability to naturally evolve innovations lies at the heart of the evolutionary process. To get there, however, Wagner must introduce us to a number of semi-obscure concepts (at least to the layman). The book can be summarized in four parts: survival of the most novel (I), the concept of innovability (II), the safe and the risky (III), and multiple origins, multiple solutions (IV). I will give a technical review of the book by highlighting these four interrelated themes.

I. Survival of the most novel.
To understand the propagation of innovative solutions throughout the tree of life, it is important to understand the difference between the force of evolution and the role of selection. Wagner proposes that, rather than creating innovations, the role of natural selection is to preserve them. Innovations themselves are the result of mutation, recombination and historical contingencies. The first two mechanisms are capable of randomly generating either very simple innovations or the components of more complex innovations. But to achieve the "tinkering" that seems to be prevalent in complex genetic pathways and phenotypes, we need to have standard meta-components that can lock in previous changes. This allows for the limited exploration of a fitness space without incurring the cost of losing previous advances.

Wagner teases historical contingencies apart into two classes: building blocks and standards. In the case of building blocks, the working components that result from previous innovation are modularized into larger units. This allows increasing complexity to emerge from a stochastic process (evolution) and accelerates the process of finding and retaining novelties (innovation). The coordination of building blocks often leads to standards, which are much more flexible than hyper-specialized systems and much more able to deal with hyper-complexity.

Building blocks assembled into a complex structure.

Wagner illustrates this by comparing metabolic engines (a high-tolerance biological engine) with the internal combustion engine (a high-precision mechanical innovation). While metabolic engines can use an interchangeable set of fuels and reactants, internal combustion engines require precise specifications to operate. One might argue that this implies mechanical engines will be "optimized" while metabolic engines will merely be made "good enough", but this is indeed the point. Rather than survival of the fittest, we observe a survival of the fit enough, with selection acting most strongly on functional novelties that augment fitness.

A high-tolerance but highly-specific engine (internal combustion) type.

II. Concept of innovability.
In the case of complex metabolic networks, the basic function has not changed throughout evolutionary history. What has changed is the number of reactions, which scales with evolutionary complexity. This scaling requires very general standards. But to make these standards interoperable across divergent evolution, a diversity of building blocks and regulatory mechanisms is also required. This leads us to a set of principles that can explain general trends in innovation rather than explaining innovations on a case-by-case basis. Another such principle involves the number of parts and identity of a specific innovation. The number of interacting parts and their configuration provides a means to compare specific innovations in a hypothetical manner. Wagner proposes that this be done using a high-dimensional structure such as a hypercube [2] or a neutral network [3].

In a hypercube representation, each node represents a specific genotype, while the edges represent pathways (or the accumulation of mutations) between genotypes. In short, the shortest paths are the most probable. In a highly-connected space, a long distance can be traveled across the space in a small number of mutations (or edges).
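As a rough illustration of the hypercube picture, the sketch below treats genotypes as tuples of bits and each edge as a single point mutation, then uses breadth-first search to count the fewest mutations between two genotypes. The binary alphabet, the function names, and the optional viable predicate (a hypothetical stand-in for genotypes that retain function) are assumptions made for this example and are not taken from the book.

```python
from collections import deque


def neighbors(genotype):
    """All genotypes exactly one point mutation (one bit flip) away."""
    return [genotype[:i] + (1 - genotype[i],) + genotype[i + 1:]
            for i in range(len(genotype))]


def mutational_distance(start, goal, viable=lambda g: True):
    """Fewest single mutations from start to goal through viable genotypes.

    Returns None when no viable path exists.
    """
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        genotype, steps = queue.popleft()
        for nxt in neighbors(genotype):
            if nxt in seen or not viable(nxt):
                continue
            if nxt == goal:
                return steps + 1
            seen.add(nxt)
            queue.append((nxt, steps + 1))
    return None


# With every genotype viable, distance equals the number of differing sites.
print(mutational_distance((0, 0, 0, 0), (1, 1, 1, 1)))  # prints 4

# Blocking the two direct intermediates forces a longer detour.
blocked = {(0, 0, 1), (0, 1, 0)}
print(mutational_distance((0, 0, 0), (0, 1, 1),
                          viable=lambda g: g not in blocked))  # prints 4
```

A hypercube over n sites has 2^n genotypes but no two of them are more than n mutations apart, which is one way to see how a large space can be crossed in a small number of steps. The second call shows the other side of this: when some genotypes are off limits, the route between two nearby genotypes can double in length.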

III. The safe and the risky.
How can the structure of a biological system make things safe for innovation? Certainly, blindly changing key components of a metabolic network or developmental scaffolding without regard for essential function can result in lethality. The presence of the building blocks and standards principles provides a failsafe means to tinker or change without lethal disruption. But there are two other principles at work: redundancy and connectivity. These principles are arguably more important in acting as gatekeepers for the fitness benefits reaped by innovations.

A machine finding its way through a maze. One way to search for novel solutions in a de novo fashion.

Redundancy involves the presence of multiple parts that play a similar or interchangeable role in the functioning of a system. For example, both genetic and metabolic networks can be robust to the removal of single components. When components get removed without having a consequence on function [4], we can say that such a component is redundant. Gene duplications can play a similar role: single copies can be knocked out without a large detriment to fitness. Related to redundant function is, of course, robustness. Robustness can be thought of as how much redundancy exists in a specific system. Wagner gives the example of phenocopying as a way in which developmental robustness can lead to redundant phenotypic configurations.

Connectivity is a less appreciated aspect of evolution. Yet in terms of functional and conceptual unity, connectivity is an essential component of evolving systems that produce innovation. Three aspects of connectivity are most important here: interaction type, density, and connection order. There are a number of interaction types in biological networks that can give rise to innovation. For example, we can look to genetic networks (interaction between genes), breeding networks (interactions between conspecifics), or even the aforementioned neutral networks (interactions between evolutionary configurations) for ways in which connectivity can yield pathways that lead to innovations of high fitness.

An example of a genetic (gene interaction) network. Sometimes this approach is called "hairball science". COURTESY: Figure 1 in [5].

Network density (or the number of interconnections between nodes) allows a greater number of potential innovative solutions to be reached in fewer evolutionary steps and/or less time (depending on your perspective). In the neutral network, a number of "safe" pathways of equivalent fitness are created for the innovating pathway or organism to explore. While the network density is important in facilitating innovation, the connection order of the network is also important. In networks with a short connection order (e.g. small-world networks), even random paths through the network can yield large-scale, non-deleterious change.
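The effect of a few extra connections on how quickly a network can be crossed can be checked with a small experiment. The sketch below reads "connection order" as the typical shortest-path length between nodes, which is an interpretation rather than a definition from the book. The network size (40 nodes), the plain ring layout, and the two hand-picked shortcut edges are likewise arbitrary choices for illustration.

```python
from collections import deque


def ring_lattice(n):
    """A cycle of n nodes, each linked only to its two nearest neighbors."""
    return {i: {(i - 1) % n, (i + 1) % n} for i in range(n)}


def with_edges(graph, edges):
    """Return a copy of the graph with extra undirected edges added."""
    new = {node: set(nbrs) for node, nbrs in graph.items()}
    for a, b in edges:
        new[a].add(b)
        new[b].add(a)
    return new


def mean_path_length(graph):
    """Average shortest-path length over all reachable ordered node pairs."""
    total = pairs = 0
    for source in graph:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for nbr in graph[node]:
                if nbr not in dist:
                    dist[nbr] = dist[node] + 1
                    queue.append(nbr)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs


ring = ring_lattice(40)
shortcut = with_edges(ring, [(0, 20), (10, 30)])  # two long-range links

print(mean_path_length(ring))
print(mean_path_length(shortcut))
```

Adding just two long-range links to the 40 existing edges lowers the mean path length, since an added edge can never lengthen a shortest path and here it shortens many of them. This is the basic small-world intuition: a modest increase in density, placed well, shortens the routes available for exploration.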

An example of constituent genotypes in a neutral network with reference to robustness. COURTESY: Ricardo Azevedo, Wikipedia.

IV. Multiple origins, multiple solutions.
The final point of Wagner's book is to remind us that innovations can be achieved through multiple unique solutions, and that existing innovations may have had multiple origin points. While evolution also has no unique solution, innovation is particularly subject to parallel evolution. Since this book does not shy away from computational representations, Wagner applies the idea of evolution strategies [6] to illustrate how innovations can be modeled as a five-part process. Specifically, innovation involves trial-and-error, population-based exploration, solutions with multiple origins, a combinatorial structure, and a stochastic process with a mutation-selection structure. Because these factors also define the evolutionary process, we might say that innovation is inseparable from evolution by natural selection. Thinking more broadly and considering social systems, cultural innovation might also be best characterized as an evolutionary process. Wagner's approach to innovation as evolvability is a useful and accessible introduction to the topic, and places this new set of concepts and terminology directly into the context of evolution by natural selection.
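For readers who have not seen an evolution strategy [6] written down, here is a minimal sketch of the mutation-selection loop. It is a generic textbook-style variant and not a model from Wagner's book: the objective function (a simple sum of squares to be minimized), the population sizes, the fixed mutation step size, and the absence of recombination are all simplifying assumptions.

```python
import random


def sphere(x):
    """Toy objective to minimize: the sum of squared coordinates."""
    return sum(v * v for v in x)


def evolution_strategy(fitness, dim=5, mu=5, lam=20, sigma=0.3,
                       generations=100, seed=0):
    """A (mu + lambda) strategy: parents and offspring compete for survival."""
    rng = random.Random(seed)
    # Several independent starting points stand in for multiple origins.
    population = [[rng.uniform(-5, 5) for _ in range(dim)] for _ in range(mu)]
    history = [min(fitness(p) for p in population)]
    for _ in range(generations):
        offspring = []
        for _ in range(lam):
            parent = rng.choice(population)
            # Mutation: add Gaussian noise to every coordinate (trial and error).
            offspring.append([g + rng.gauss(0, sigma) for g in parent])
        # Selection: keep the mu best of parents plus offspring.
        population = sorted(population + offspring, key=fitness)[:mu]
        history.append(fitness(population[0]))
    return population[0], history


best, history = evolution_strategy(sphere)
print(history[0], history[-1])
```

Because parents compete alongside their offspring, the best solution found so far is never lost, which echoes the earlier point that selection preserves innovations rather than creating them.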

[1] For other, briefer reviews, please see: Hoppe, R.B.   Andreas Wagner: Arrival of the Fittest: Solving Evolution’s Greatest Puzzle. Panda's Thumb blog, November 4 (2014) AND Pagel, M.   The Neighborly Nature of Evolution. Nature, 514, 34 (2014).

[2] For an example, please see: Gavrilets, S. and Gravner, J.   Percolation on the Fitness Hypercube and the Evolution of Reproductive Isolation. Journal of Theoretical Biology, 184(1), 51–64 (1997).

[3] For an example, please see: Wagner, A.   Robustness and Evolvability in Living Systems. Princeton University Press (2005).

[4] Li, J., Yuan, Z., and Zhang, Z.   The Cellular Robustness by Genetic Redundancy in Budding Yeast. PLoS Genetics, 6(11), e1001187 (2010).

[5] Magtanong, L.   Dosage suppression genetic interaction networks enhance functional wiring diagrams of the cell. Nature Biotechnology, 29, 505-511 (2011).

[6] Evolution strategies assume that evolution is an optimizing process that results from the iterative application of a mutation-selection model. For more, please see: Schwefel, H-P.   Numerical Optimization of Computer Models. Wiley Press, Chichester (1981) AND Beyer, H-G. and Schwefel, H-P.   Evolution Strategies: A Comprehensive Introduction. Journal of Natural Computing, 1(1), 3–52 (2002).