Phylogenetic data, and the trees inferred from them, represent a hugely valuable resource for evolutionary biological research. The data are often expensive and time-consuming to acquire, and the results from analyses of these data (typically trees) represent a vast investment of effort and expertise across the global community of bioinformaticians and systematists. Trees, and their underlying character data, are often repurposed in other areas of biology, notably in evolutionary studies that seek to test patterns of genomic evolution or macroevolutionary trends. Despite their enormous value, recent research by the PDRA estimates that less than 4% of the phylogenetic trees published in 2010 are available in machine-readable form. Our proposal stands at the leading edge of content mining technology. We will create Open Source 'data liberation' software tools that will allow us to unlock the greater proportion of phyloinformatic data from where they are currently buried in the literature. These data will include phylogenetic trees, branch lengths and support values (extracted from the SVG content of PDF files), analytical methods and indices of data quality (from figure legends and the main body of the text), and the underlying molecular and morphological character data. We will also derive full bibliographic and geographical data for each source paper. We will test, refine and perfect these tools by applying them to PLoS, BMC, Elsevier, Wiley and Springer online content from the 21st century. Once the data are extracted, we will ensure that their immense interdisciplinary (evolutionary biology, ecology, ethology, palaeobiology and conservation) and legacy potential is realised by making them available online in an explicitly open manner. We will also use the data ourselves to address several related questions concerning research effort, phyloinformatic data quality and the progress of systematic research.
While there is renewed interest in, and emphasis on, curating underlying research data and results (exemplified by projects such as TreeBASE, Dryad, BMC's partnership with LabArchives, and FigShare), these ventures rely upon author submission, which is rarely mandated by journals. Uptake has been slow and coverage is woeful. The data archiving success of NCBI/GenBank for nucleotide sequences (N.B., not alignments, trees or other results, and certainly not morphology) is the exception rather than the rule in the Biological Sciences. For the foreseeable future, therefore, there is a pressing need to retrospectively gather data from the published literature. This project is highly novel in both scale and ambition. If it succeeds in re-extracting the majority of phylogenetic data from the last decade, the software will be easy for others to adapt and modify to suit the data re-extraction needs of other areas of science. This will better harness the billions of pounds of research money hitherto invested in obtaining and analysing data that were then locked down and obscured in PDF publications once projects were completed. The project is also broadly transdisciplinary, bringing together a macroevolutionary phylogeneticist (Wills), a chemoinformaticist (Murray-Rust), and a young, up-and-coming researcher (Mounce). The potential wider benefits of this project are vast and diverse: content mining techniques are estimated to be capable of generating up to £200 billion annually in added value for Europe alone. We cannot claim to generate those benefits directly, but we will create open tools and generate open data that will greatly facilitate other commercial, industrial and academic ventures.
Effective start/end date: 1/02/14 → 1/08/15