Dataset for "Timing of replication is a determinant of neutral substitution rates but does not explain slow Y chromosome evolution in rodents"

  • Catherine Pink (Creator)
  • Laurence Hurst (Creator)
  • Martin Lercher (Heinrich-Heine-Universität Düsseldorf) (Creator)

Dataset

Description

The dataset consists of intronic substitution rates (Ki) for mouse-rat orthologs using the July 2007 (mm9) and November 2004 (rn4) assemblies respectively, both obtained from the UCSC Table Browser. Intronic substitution rates were corrected for multiple hits according to the model of Tamura and Kumar (2002) Mol Biol Evol, 19:1727-1736. The dataset was further filtered to control for selective effects as described in the methodologies of Pink et al. (2009) Genome Biol Evol. 1:13-22, and Pink and Hurst (2010) Mol Biol Evol. 27(5): 1077-1086. Two intronic substitution rate datasets are provided. The main findings of Pink and Hurst (2010) were based on a filtered dataset, from which all introns thought to be evolving under purifying selection were removed; these were introns that failed a test for clusters of conserved bases, potentially indicative of hidden functional sites. An unfiltered dataset underpins the supplementary findings. Full details of the test and other filters for selection are provided in Pink et al. (2009).

The dataset combines intronic substitution rates with mouse replication times. Replication timing data for mouse cell lines prior to differentiation were downloaded from www.replicationdomain.org (Hiratani et al. (2008) PLoS Biol, 6:e245). The four available datasets were treated as replicates: three derived from embryonic stem cells and a fourth derived from induced pluripotent stem (iPS) cells. Positive values indicate early replication and negative values indicate replication later in S-phase.
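The description does not state how the four replicates were combined, so the following Python sketch is illustrative only. It assumes each replicate is a mapping from a probe identifier to a timing value and pools them by taking the per-probe mean over probes present in every replicate. The function names and the input layout are hypothetical and are not taken from the published scripts.

```python
from statistics import mean


def pool_replicates(replicates):
    """Average timing values per probe across replicates.

    ASSUMPTION: `replicates` is a non-empty list of dicts mapping a probe
    identifier to a timing value. Only probes present in every replicate
    are kept. The published scripts may combine replicates differently.
    """
    shared = set(replicates[0]).intersection(*replicates[1:])
    return {probe: mean(rep[probe] for rep in replicates)
            for probe in sorted(shared)}


def timing_class(value):
    """Interpret the sign of a timing value as described in the text."""
    if value > 0:
        return "early"
    if value < 0:
        return "late"
    return "undetermined"  # exactly zero: the description gives no rule
```

For example, two toy replicates `{"p1": 1.0, "p2": -1.0}` and `{"p1": 0.5, "p2": -2.0, "p3": 3.0}` share only `p1` and `p2`, so `p3` is dropped before averaging.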

Positions of genes on the mouse genome were defined by the terminal 5’ and 3’ base pairs of the coding sequence. These positions were obtained from annotations of the July 2007 assembly (mm9). As mouse replication timing data were assigned genomic coordinates based on the February 2006 assembly (mm8), the standalone liftOver utility and associated chain file mm9ToMm8.over.chain, both obtained from UCSC, were used to convert gene positions from mm9 to mm8. Genic replication times were then calculated by averaging the times determined for probe positions overlapping any part of the orthologous gene, within the limits of the coding sequence. Both means and medians are provided for each gene.
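The averaging step can be sketched in Python as follows. This is a minimal illustration rather than the published script: it assumes each probe is reduced to a single coordinate on the same chromosome and assembly as the gene, and that the gene is given by the two terminal coding-sequence positions after conversion. The function name and argument layout are hypothetical.

```python
from statistics import mean, median


def genic_replication_time(cds_start, cds_end, probes):
    """Return (mean, median) timing of probes inside a gene's coding limits.

    ASSUMPTIONS: `probes` is an iterable of (position, timing_value) pairs
    on the same chromosome and assembly as the gene; each probe is treated
    as a single point; the interval is inclusive at both ends.
    Returns None when no probe falls inside the gene.
    """
    # The 5' end may have the larger coordinate for minus-strand genes.
    low, high = sorted((cds_start, cds_end))
    values = [value for position, value in probes if low <= position <= high]
    if not values:
        return None
    return mean(values), median(values)
```

Genes with no overlapping probe return `None` here, so that a missing measurement is not confused with a timing value of zero.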

The dataset also includes intronic GC content and extent of intronic G+T skew for each ortholog, the latter used as a proxy for germ-line expression rate. The published datasets are provided in the original .txt and .xls formats produced by the scripts, as well as in .xlsx and .csv versions for preservation purposes, and contain the variables described above. Details of methodologies are provided in the publications Pink et al. (2009) and Pink and Hurst (2010), and in the accompanying readme file. The readme file also contains details of the original sources of input data. Scripts used to process these input data and create the final datasets are also provided.
Date made available: 2015
Publisher: University of Bath
