The dataset consists of intronic substitution rates (Ki) for mouse-rat orthologs, calculated from the July 2007 (mm9) mouse and November 2004 (rn4) rat assemblies, both obtained from the UCSC Table Browser. Intronic substitution rates were corrected for multiple hits according to the model of Tamura and Kumar (2002) Mol Biol Evol, 19:1727-1736. The dataset was further filtered to control for selective effects as described in the methodologies of Pink et al. (2009) Genome Biol Evol. 1:13-22, and Pink and Hurst (2010) Mol Biol Evol. 27(5): 1077-1086. Two intronic substitution rate datasets are provided, one filtered and one unfiltered. The main findings of Pink and Hurst (2010) were based on the filtered dataset, which was purged of all introns thought to be evolving under purifying selection, that is, introns that had failed a test for clusters of conserved bases potentially indicative of hidden functional sites. The unfiltered dataset underpins supplementary findings. Full details of the test are provided in Pink et al. (2009).
The dataset combines intronic substitution rates with mouse replication times. Replication timing data for mouse cell lines prior to differentiation were downloaded from www.replicationdomain.org (Hiratani et al. (2008) PLoS Biol, 6:e245). The four available datasets were treated as replicates: three derived from embryonic stem cells and a fourth derived from induced pluripotent stem (iPS) cells. Positive values indicate early replication and negative values indicate replication later during S-phase.
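The handling of replicates and the sign convention can be illustrated with a minimal Python sketch. The description above does not state how the four replicates were combined, so the per-probe averaging shown here is an assumption, as are the function names and the dictionary layout (probe identifier mapped to timing value).

```python
from statistics import mean


def pool_replicates(replicates):
    """Average timing values per probe across replicate datasets.

    ASSUMPTION: replicates are pooled by a simple per-probe mean; the
    dataset description only says they were "treated as replicates".
    `replicates` is a list of dicts mapping probe id -> timing value.
    """
    collected = {}
    for replicate in replicates:
        for probe_id, value in replicate.items():
            collected.setdefault(probe_id, []).append(value)
    return {probe_id: mean(values) for probe_id, values in collected.items()}


def timing_class(value):
    """Interpret the sign of a timing value: positive = early, negative = late."""
    if value > 0:
        return "early"
    if value < 0:
        return "late"
    return "undetermined"  # exactly zero: the description gives no rule


# Hypothetical example values, not taken from the real dataset.
example = [{"p1": 1.0, "p2": -0.5}, {"p1": 0.5, "p2": -1.5}]
pooled = pool_replicates(example)
```

A probe missing from one replicate is simply averaged over the replicates in which it is present; whether that matches the original analysis is not stated in the source.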
Positions of genes on the mouse genome were defined by the terminal 5’ and 3’ bp of the coding sequence. These positions were obtained from annotations of the July 2007 assembly (mm9). As mouse replication timing data were assigned genomic coordinates based on the February 2006 assembly (mm8), the standalone liftOver utility and the associated chain file mm9ToMm8.over.chain, both obtained from UCSC, were used to convert positions between builds. Genic replication times were then taken as the average of the times determined for probe positions overlapping any part of the orthologous gene, within the limits of the coding sequence. Both means and medians are provided for each gene.
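The coordinate conversion and per-gene averaging described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' pipeline: the function names, the BED file names, and the tuple layout for probes are hypothetical, probes are assumed to lie on the same chromosome as the gene, and coordinates are treated as inclusive. Only the liftOver argument order (oldFile, map.chain, newFile, unMapped) and the chain file name come from the tool and the text.

```python
from statistics import mean, median


def liftover_command(bed_in, chain_file, bed_out, unmapped_out):
    """Build the argument list for the standalone UCSC liftOver utility.

    liftOver expects: oldFile map.chain newFile unMapped.
    The list can be passed to subprocess.run() where liftOver is installed.
    """
    return ["liftOver", bed_in, chain_file, bed_out, unmapped_out]


def genic_replication_time(probes, cds_start, cds_end):
    """Return (mean, median) timing of probes overlapping the coding sequence.

    `probes` is an iterable of (start, end, timing_value) tuples on the same
    chromosome and assembly as the gene (ASSUMED layout). A probe counts if
    it overlaps any part of the interval [cds_start, cds_end].
    Returns None when no probe overlaps the gene.
    """
    values = [
        value
        for start, end, value in probes
        if start <= cds_end and end >= cds_start
    ]
    if not values:
        return None
    return mean(values), median(values)


# Hypothetical file names; only the chain file name appears in the text.
cmd = liftover_command(
    "genes_mm9.bed", "mm9ToMm8.over.chain", "genes_mm8.bed", "unmapped.bed"
)

# Hypothetical probe values for illustration only.
probes = [(100, 150, 0.5), (200, 250, -0.3), (400, 450, 1.0), (900, 950, -2.0)]
result = genic_replication_time(probes, 120, 420)
```

Reporting both the mean and the median, as the dataset does, makes it possible to check whether a few outlying probes dominate a gene's replication time.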
The dataset also includes intronic GC content and the extent of intronic G+T skew for each ortholog, the latter used as a proxy for germ-line expression rate. The published datasets are provided as xlsx, tsv and text files containing the variables described above. Details of methodologies are provided in the publications Pink et al. (2009) and Pink and Hurst (2010), as well as in the accompanying readme file.
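As a rough illustration of the two compositional variables, the Python sketch below computes GC content and a G+T skew for a single sequence. The exact skew formula used for the dataset is not given in this description; the definition here, (G+T - A-C) / (G+T+A+C), is an assumption, and the function names are hypothetical. The cited publications and the readme should be consulted for the definition actually used.

```python
def base_counts(sequence):
    """Count A, C, G and T, ignoring case and any other symbols (e.g. N)."""
    sequence = sequence.upper()
    return {base: sequence.count(base) for base in "ACGT"}


def gc_content(sequence):
    """Fraction of unambiguous bases that are G or C."""
    counts = base_counts(sequence)
    total = sum(counts.values())
    if total == 0:
        return None
    return (counts["G"] + counts["C"]) / total


def gt_skew(sequence):
    """Excess of G+T over A+C, normalised by the number of unambiguous bases.

    ASSUMED definition: (G + T - A - C) / (G + T + A + C).
    """
    counts = base_counts(sequence)
    total = sum(counts.values())
    if total == 0:
        return None
    return (counts["G"] + counts["T"] - counts["A"] - counts["C"]) / total


# Hypothetical toy sequence, not taken from the dataset.
toy_intron = "GGTTACnn"
gc = gc_content(toy_intron)
skew = gt_skew(toy_intron)
```

Under this assumed definition, a positive skew means an excess of G and T over A and C on the strand supplied.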