First author: Faith Okamoto - fokamoto@ucsd.edu Corresponding author: Abraham A Palmer - aapalmer@ucsd.edu Code: https://github.com/Palmer-Lab-UCSD/y-mt-code-for-okamoto-et-al Code DOI: 10.5281/zenodo.11493119 Description of contents: Sample info --This comma separated values (CSV) file has information (sex, library preparation method, and date of birth) for all modern rats used in this study. Raw genotypes --This zipped archive contains variant call files (VCFs) of Y and mitochondrial (MT) Chromosome variants: single nucleotide polymorphisms (SNPs) in low-coverage modern HS rat samples (modern_HS_shallow_sequenced.vcf.gz), SNPs and indels in high-coverage modern HS rat samples (modern_HS_deep_sequenced.vcf.gz), SNPs and indels in high-coverage HS founder samples (HS_founders.vcf.gz), and short tandem repeats in both high-coverage modern HS rat and high-coverage HS founder samples (STRs.vcf.gz). Methods used to generate VCFs are described in the section "Genotype datasets" in the primary associated publication. All SNPs and indels were filtered to those within our standard reference panel. All modern HS rat samples are filtered to those which passed QC. Also included are mRatBN7.2's reference MT sequence (mRatBN_7_2_MT.fasta) and PLINK-binary genotype files for X Chromosome SNPs (chrX.bed/bim/fam) in rats with brain hemisphere RNA-seq data. X SNPs were filtered using the same thresholds as were used to make the GRM, except that Hardy-Weinberg equilibrium tests were conducted with only females. Genetic relationship matrix --This zipped archive contains a genetic relationship matrix (GRM) made as described in the section "GWAS phenotype association" in the primary associated publication. In brief, it used all autosomal variants from genotyped modern HS rats, filtered using standard GWAS quality thresholds (minor allele frequency, missingness, Hardy-Weinberg equilibrium). GWAS phenotypes --This zipped archive contains CSVs for GWAS phenotypes collected by the NIDA Center for GWAS in Outbred Rats, processed as described in the section "GWAS phenotype association" in the primary associated publication. One CSV is a table (GWAS_phenotype_table.csv) where RFIDs are rows and traits are columns. Another (trait_dictionary.csv) matches each trait to its project. A separate phenotype file (kidneys.csv) has kidney count at birth, if known. Gene expression --This zipped archive contains RatGTex gene expression tables using mRatBN7.2. Methods to generate them are described on RatGTex (https://ratgtex.org/about/). "log2" values, included for all tissues (.rn7.expr.log2.bed.gz), are the base-2 logarithm of (RSEM + 1), using original RSEM values. For "Brain", both transcripts per million (Brain.rn7.expr.tpm.bed.gz) and inverse quantile normalized (Brain.rn7.expr.iqn.bed.gz) tables are also included. Also provides lookup tables for tissue long name to abbreviation (tissue_name_map.csv), Ensembl ID to common gene name/symbol (gene_name_map.csv), and contigs which are cis to a chromosome (Y_MT_cis_contigs.csv). Sequencing depths --This file contains CSVs with average read depth along given regions, with 1-indexed positions as rows and RFIDs as columns. Depths generated by SAMtools (using the options "-r NC_001665.2 -a", except for Y_depth.csv) from BAM (binary Sequence Alignment/Map: SAM format) files produced by the genotyping pipelines in References. Depths in low-coverage modern HS rats along the MT Chromosome, (MT_depth.csv), Y Chromosome (Y_depth.csv), and the gene Med14Y (modern_shallow_Med14Y.csv) in low-coverage modern HS rats are included. (Y_depth.csv summed depth within Y haplogroup/library preparation method groups, instead of separately for each RFID, due to the size of the Y Chromosome). Also included are areas near the gene Med14Y in founders (founders_near_Med14Y.csv) and high-coverage modern HS rats (modern_deep_near_Med14Y.csv). Association test results --This zipped archive contains CSV files with p-values for associations. Methods on how the associations were performed are in the sections "GWAS phenotype association" and "Gene expression association" in the primary associated publication, as well as in brief below in Methods. Association tests are separated as __tests.csv: Y_gene_expression_tests.csv has all association tests between gene expression phenotypes and Y haplogroup. All files are sorted by unadjusted p-value. Methods: The following approach was taken to generate the association test result files. GWAS phenotypes --Run MLMA (--mlma) with GCTA; genotypes are Y/MT haplogroup (each encoded as a single SNP, either 0 for Y1/MT1 or 2 for Y2/MT2), phenotype is one column of GWAS_phenotype_table.csv, and the GRM is the autosomal one included here. Combine all "Freq" and "p" columns from the .mlma result files, by chromosome. Add a new "name" column for the phenotype name. Create an adjusted p-value "adj-p" column by running the Benjamini-Hochberg (BH) false discovery rate (FDR) procedure ("p.adjust" in R), separately for each chromosome. Gene expression --Starting from the "log2" read counts, and haplogroups created as detailed in the sections "Two major versions of Y are present in modern HS rats" and "Two versions of MT are present in modern HS rats" and in the primary associated publication, the below analysis pipeline is run for each tissue separately. 1. Convert expression values to original RSEM values by (log2(RSEM)^2) - 1. 2. Remove samples which lack a haplogroup. 3. Remove genes which are expressed in less than 10% of samples. 4. Compare read counts between haplogroups using a two-sample Wilcoxon rank-sums test ("wilcox.test" in R). This procedure generates p-values for each tissue-gene-haplogroup combination. Metadata about each p-value, such as Ensembl ID, tissue name, gene location, and number of samples with each haplogroup, should be added to the relevant rows. Finally, create an adjusted p-value "adj-p" column by running the BH FDR procedure, separately for each chromosome. Data dictionary: Here are details of the information included under various CSV column headers. Some CSVs (e.g. GWAS_phenotype_table.csv) are simple X-by-Y tables. For other file formats, refer to their documentation. Other file types --VCF version 4.0 is documented by the The International Genome Sample Resource (https://www.internationalgenome.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-40) --PLINK-binary genotype files are documented by PLINK (https://www.cog-genomics.org/plink/1.9/formats) --GRM (.grm.bin, .grm.id, and .grm.N.bin) files are documented by PLINK (https://www.cog-genomics.org/plink/1.9/formats#grm) --BED expression format is documented by RatGTex (https://ratgtex.org/about/); Briefly, the first four columns are gene metadata (location and ID), while the other columns, labeled by RFID, include expression values for each sample. sample_info.csv: --rfid: unique RFID of this rat, used in phenotype and genotype files --sex: sex of the rat, either M or F --seq_method: library preparation method used to sequence this rat's sample, either ddGBS or lcWGS --dob: date of birth of this rat, in YYYY-MM-DD format, if known trait_dictionary.csv --trait: phenotype name matching GWAS_phenotype_table.csv --project: title of the project which produced this phenotype kidneys.csv --rfid: unique RFID of this rat --n_kidneys: kidney count at birth tissue_name_map.csv --short_name: abbreviation of tissue name; used in filename --long_name: full tissue name gene_name_map.csv --ensembl_id: Ensembl ID of the gene --gene_name: common gene name/symbol _GWAS_phenotype_tests.csv --name: phenotype name matching GWAS_phenotype_table.csv --Freq: frequency of the nonreference (Y2/MT2) haplogroup --p: unadjusted p-value from association test --adj_p: p-value after multiple test correction was applied _gene_expression_tests.csv --ensembl_id: Ensembl ID of the gene --chr: chromosome/contig name the gene is on --tissue: tissue the samples are from --n1: number of samples with reference (Y1/MT1) haplogroup --n2: number of samples with nonreference (Y2/MT2) haplogroup --p: unadjusted p-value from association test --adj_p: p-value after multiple test correction was applied Technical details: Software versions used to generate the above mentioned files are listed below. Any dependencies associated with these packages can be found via their respective documentation. See also the Reagent Table in the Supplementary Files of the primary associated publication. Genotyping: See References. Notably, for low-coverage samples, alignment was done by BWA-mem v0.7.17, and imputation by STITCH v1.6.6 GRM production and X SNP filtration: PLINK v1.90b3.31 (64 bit) GWAS: GCTA v1.26.0 Gene expression file production: Munro et al. 2022 doi:10.17504/protocols.io.rm7vzyk92lx1/v1 Gene expression association: R v4.2.3, edgeR v3.40.2 Calculating sequencing depth: SAMtools v1.3 References: Gileta AF, Gao J, Chitre AS, Bimschleger HV, St. Pierre CL, Gopalakrishnan S, Palmer AA. 2020. Adapting Genotyping-by-Sequencing and Variant Calling for Heterogeneous Stock Rats. G3: Genes|Genomes|Genetics. 10:2195–2205. doi:10.1534/g3.120.401325. Chen D. 2022. Palmer Lab High Coverage WGS Genotyping Pipeline. doi:10.5281/zenodo.6584834. Chen D, Chitre A, Cheng R, Peng B, Polesskaya O, Palmer A. 2023. Palmer Lab Heterogeneous Stock Rats Genotyping Pipeline. doi:10.5281/zenodo.10002191.