A central truth of the genomics revolution is that only a small percentage of newly found genes can be reliably annotated (i.e., assigned a function with confidence) by even the most sensitive sequence-comparison methods. Adding structural information improves this percentage only slightly, and the sheer complexity of even the simplest biological systems dictates that function discovery must involve the collection and successful integration of data from a variety of data streams. For small samples, human analysis of disparate data sources is feasible, but this becomes impossible at genome scale. High-throughput function discovery therefore requires integrated data.
Keith Allen describes efforts to integrate data from sequence annotation, gene expression profiling, biochemical profiling, and detailed phenotypic analysis. Data integration is only a prerequisite for function discovery, however, as all of the data must be transformed into coherent data sets. We define coherent data as truly comparable data from multiple technology platforms. Thus the data streams themselves must be translated so that they are cross-compatible and can be analyzed simultaneously by a single method. Once coherent data sets are established, any analytical tool (e.g., cluster analysis) that could be applied to a single data set can be applied to all of the data at once. In this way, meaningful comparisons can be made between data points from heterogeneous sources, and biological relationships can be discerned that would otherwise have been missed. He shows how this approach has worked in a pilot study of herbicide action in Arabidopsis.
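The idea of translating heterogeneous data streams into a single coherent data set can be sketched in code. In this illustrative example (not from the study itself), measurements for the same set of genes from two hypothetical platforms are z-score standardized so they share a common scale, joined into one matrix, and then analyzed with a single method; all variable names and values are assumptions made for the sketch.

```python
# Sketch: building a "coherent" data set from two hypothetical platforms.
# Values and platform names are illustrative only.
import numpy as np

rng = np.random.default_rng(0)

# Rows = genes, columns = conditions; the platforms differ in scale and units.
expression = rng.normal(loc=100.0, scale=20.0, size=(6, 4))  # e.g. array intensities
biochem = rng.normal(loc=0.5, scale=0.1, size=(6, 3))        # e.g. metabolite levels

def zscore(x):
    """Standardize each column to mean 0, variance 1 (the 'translation' step)."""
    return (x - x.mean(axis=0)) / x.std(axis=0)

# After standardization the platforms are comparable, so they can be joined
# into one matrix and analyzed simultaneously.
coherent = np.hstack([zscore(expression), zscore(biochem)])

# A single analysis, here correlation-based similarity between genes,
# now spans both data streams at once.
similarity = np.corrcoef(coherent)
print(similarity.shape)  # one gene-by-gene matrix covering both platforms
```

In practice the "translation" step would be far more involved (platform-specific normalization, matching identifiers across data sources), but the principle is the same: once everything is on a comparable scale, one clustering or similarity analysis can operate over all of it.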