Corresponding author: Abraham Palmer - aapalmer@ucsd.edu

Primary associated publication: Okamoto F, Chitre AS, Sanches TM, Chen D, Munro D, NIDA Center for GWAS in Outbred Rats, Polesskaya O, Palmer AA. 2023. Y and Mitochondrial Chromosomes in the Heterogeneous Stock Rat Population. bioRxiv. doi:10.1101/2023.11.29.566473

Code: https://github.com/Palmer-Lab-UCSD/y-mt-code-for-okamoto-et-al

Zenodo DOI: 10.5281/zenodo.10234037

Description of contents:

Sample metadata --This comma-separated values (CSV) file has information (sex, library preparation method, and date of birth) for all modern rats used in this study.

Raw genotypes --This zipped archive contains four Variant Call Format (VCF) files for the Y and mitochondrial (MT) chromosomes: low-coverage modern HS rat samples (modern_HS_shallow_sequenced.vcf.gz), high-coverage modern HS rat samples (modern_HS_deep_sequenced.vcf.gz), high-coverage HS founder samples (HS_founders.vcf.gz), and short tandem repeats in both high-coverage modern HS rat and high-coverage HS founder samples (STRs.vcf.gz). Methods used to generate these VCFs are described in the section labeled "Genotype datasets" in the primary associated publication. In all VCFs, single nucleotide polymorphisms (SNPs) and indels have been filtered to those which appear in our standard reference panel. Modern HS rat samples have been filtered to those which passed QC. In addition, the mRatBN7.2 reference MT sequence, AY172581.1, is included as mRatBN_7_2_mt.fasta.

Genetic relationship matrix --This zipped archive contains a genetic relationship matrix (GRM) made as described in the section labeled "GWAS phenotype association" in the primary associated publication. In brief, all autosomal variants from genotyped modern HS rats were filtered using standard GWAS quality thresholds (minor allele frequency, missingness, Hardy-Weinberg equilibrium) and used as input to the GRM.

GWAS phenotypes --This zipped archive contains CSVs for GWAS phenotypes collected by the NIDA Center for GWAS in Outbred Rats, processed as described in the section labeled "GWAS phenotype association" in the primary associated publication. Phenotype names are anonymized to respect unpublished collaborations. One CSV is a table (gwas_phenotypes_table.csv) where RFIDs are rows and traits are columns. Another (trait_dictionary.csv) matches each trait to its project. A separate phenotype file (kidneys.csv) has kidney count at birth, if known.

Gene expression phenotypes --This zipped archive contains RatGTex gene expression tables using mRatBN7.2. Methods to generate them are described on RatGTex (https://ratgtex.org/about/). "log2" values, included for all tissues (.rn7.expr.log2.bed.gz), are the base-2 logarithm of (RSEM + 1), using original RSEM values. For "Brain", both transcripts per million (Brain.rn7.expr.tpm.bed.gz) and inverse quantile normalized (Brain.rn7.expr.iqn.bed.gz) tables are also included.

Sequencing depth along MT --This file contains read depth along the MT chromosome (positions are rows, RFIDs are columns), generated by SAMtools (using the "-r NC_001665.2 -a" options) from BAM files (the binary form of the Sequence Alignment/Map, or SAM, format) produced by the genotyping pipelines described in References.

Association test results --This zipped archive contains CSV files with p-values for associations. Methods on how the GWAS were performed can be found in the sections labeled "GWAS phenotype association" and "Gene expression association" in the primary associated publication, and in brief under Methods below. Association tests are separated into files ending in _tests.csv, one per phenotype type and chromosome: for example, gene_expression_Y_tests.csv has all association tests between gene expression phenotypes and Y haplotype. All files are sorted by unadjusted p-value.

Methods:

The following approach was taken to generate the association test result files.

GWAS phenotypes --Run MLMA (--mlma) with GCTA; the genotypes are the Y/MT haplotypes (each encoded as a single SNP), the phenotype is one column of gwas_phenotypes_table.csv, and the GRM is the autosomal one included in this dataset. Combine all "Freq" and "p" columns from the .mlma result files, for each chromosome separately. Add a new "name" column for the phenotype name. Create an adjusted p-value "adj_p" column by running the Benjamini-Hochberg (BH) false discovery rate (FDR) procedure ("p.adjust" in R), separately for each chromosome.

Gene expression --Start from the "log2" read counts and the haplotype groups created as detailed in the sections labeled "Two versions of MT are present in modern HS rats" and "Two versions of Y are present in modern HS rats" in the primary associated publication. The analysis pipeline below is run for each tissue separately.
1. Convert expression values back to original RSEM values. Because each "log2" value is log2(RSEM + 1), the original value is 2^(log2 value) - 1.
2. Remove samples which lack a haplotype group.
3. Remove genes which are expressed in fewer than 10% of samples.
4. Compare read counts between haplotype groups using a two-sample Wilcoxon rank-sum test ("wilcox.test" in R).
This procedure generates p-values for each tissue-gene-haplotype combination. Metadata about each p-value, such as Ensembl ID, tissue name, gene location, and number of samples with each haplotype, is then added to the relevant rows. Finally, create an adjusted p-value "adj_p" column by running the Benjamini-Hochberg (BH) false discovery rate (FDR) procedure ("p.adjust" in R), separately for each chromosome.

Data dictionary:

Here are details of the information included under various CSV column headers. Some CSVs (gwas_phenotypes_table.csv, Sequencing depth along MT) are simple X-by-Y tables. For other file formats, refer to their documentation.

Other file types
--VCF version 4.0 is documented by The International Genome Sample Resource (https://www.internationalgenome.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-40)
--GRM (.grm.bin, .grm.id, and .grm.N.bin) files are documented by PLINK (https://www.cog-genomics.org/plink/1.9/formats#grm)
--BED expression format is documented by RatGTex (https://ratgtex.org/about/). Briefly, the first four columns are gene metadata (location and ID), while the other columns, labeled by RFID, include expression values for each sample.

Sample metadata:
--rfid: unique RFID of this rat, used in phenotype and genotype files
--sex: sex of the rat, either M or F
--seq_method: library preparation method used to sequence this rat's sample, either ddGBS or lcWGS
--dob: date of birth of this rat, in YYYY-MM-DD format, if known

trait_dictionary.csv
--trait: anonymized phenotype name matching gwas_phenotypes_table.csv
--project: title of the project which produced this phenotype

kidneys.csv
--rfid: unique RFID of this rat
--n_kidneys: kidney count at birth

gwas_phenotype__tests.csv
--name: anonymized phenotype name matching gwas_phenotypes_table.csv
--Freq: frequency of the nonreference (Y2/MT2) haplotype
--p: unadjusted p-value from association test
--adj_p: p-value after multiple test correction has been applied

gene_expression__tests.csv
--ensembl_id: Ensembl ID of the gene
--tissue: tissue the samples are from
--chr: chromosome/contig name the gene is on
--n1: number of samples with reference (Y1/MT1) haplotype
--n2: number of samples with nonreference (Y2/MT2) haplotype
--p: unadjusted p-value from association test
--adj_p: p-value after multiple test correction has been applied

Technical details:

All software versions used to generate the above-mentioned files are listed below. Any dependencies associated with these packages can be found via their respective documentation.

Genotyping: See References.
Notably, for low-coverage samples, alignment was done by BWA-mem v0.7.17, and imputation by STITCH v1.6.6.
GRM production: PLINK v1.90b3.31 (64 bit)
GWAS: GCTA v1.26.0
Gene expression file production: Munro et al. 2022. doi:10.17504/protocols.io.rm7vzyk92lx1/v1
Gene expression association: R v4.2.3, edgeR v3.40.2
Calculating sequencing depth: SAMtools v1.3

References:

Gileta AF, Gao J, Chitre AS, Bimschleger HV, St. Pierre CL, Gopalakrishnan S, Palmer AA. 2020. Adapting Genotyping-by-Sequencing and Variant Calling for Heterogeneous Stock Rats. G3 (Bethesda). 10(7):2195–2205. doi:10.1534/g3.120.401325.

Chen D, Chitre A, Cheng R, Peng B, Polesskaya O, Palmer A. 2023 Oct 20. Palmer Lab High Coverage WGS DeepVariant Genotyping Pipeline. doi:10.5281/zenodo.10027133. [accessed 2023 Nov 28]. https://zenodo.org/records/10027133.

Chen D, Chitre A, Cheng R, Peng B, Polesskaya O, Palmer A. 2023 Oct 20. Palmer Lab Heterogeneous Stock Rats Genotyping Pipeline. doi:10.5281/zenodo.10002191. [accessed 2023 Nov 29]. https://zenodo.org/records/10002191.
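
Illustrative sketch:

Three of the gene expression steps described under Methods can be sketched in Python: recovering RSEM values from the "log2" tables, dropping rarely expressed genes, and Benjamini-Hochberg adjustment. This is a minimal illustration under stated assumptions, not the code used for the publication (that analysis used R and is available at the GitHub link above). The function names are hypothetical, "expressed" is assumed here to mean an RSEM value greater than zero, and the Wilcoxon rank-sum test itself is not reproduced.

```python
def log2_to_rsem(log2_value):
    """Invert log2(RSEM + 1), so RSEM = 2**log2_value - 1."""
    return 2.0 ** log2_value - 1.0


def is_expressed_enough(rsem_values, min_fraction=0.10):
    """Return True if a gene should be kept.

    Assumption: a sample counts as expressing the gene when RSEM > 0.
    The gene is kept when at least min_fraction of samples express it.
    """
    n_expressed = sum(1 for value in rsem_values if value > 0)
    return n_expressed >= min_fraction * len(rsem_values)


def benjamini_hochberg(p_values):
    """Return BH-adjusted p-values in the same order as the input.

    Walks from the largest p-value to the smallest, scaling each by
    n / rank and taking a running minimum so the adjusted values stay
    monotone and never exceed 1.
    """
    n = len(p_values)
    # Indices sorted from largest to smallest p-value.
    order = sorted(range(n), key=lambda i: p_values[i], reverse=True)
    adjusted = [0.0] * n
    running_min = 1.0
    for steps_from_top, index in enumerate(order):
        rank = n - steps_from_top  # 1-based rank in ascending order
        running_min = min(running_min, p_values[index] * n / rank)
        adjusted[index] = running_min
    return adjusted


if __name__ == "__main__":
    # Toy values only; these are not taken from the dataset.
    toy_log2 = [0.0, 3.0, 1.0]
    print([log2_to_rsem(value) for value in toy_log2])
    print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```

In the workflow described under Methods, the adjustment would be applied once per chromosome (Y and MT) to the full set of p-values for that chromosome, matching the "separately for each chromosome" note.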