First author: Faith Okamoto - fokamoto@ucsd.edu
Corresponding author: Abraham A Palmer - aapalmer@ucsd.edu
Code: https://github.com/Palmer-Lab-UCSD/y-mt-code-for-okamoto-et-al
Code DOI: 10.5281/zenodo.11493119

Description of contents:

Sample info
--This comma separated values (CSV) file has information (sex, library
preparation method, and date of birth) for all modern rats used in this study.

Raw genotypes
--This zipped archive contains variant call files (VCFs) of Y and mitochondrial
(MT) Chromosome variants: single nucleotide polymorphisms (SNPs) in low-coverage
modern HS rat samples (modern_HS_shallow_sequenced.vcf.gz), SNPs and indels in
high-coverage modern HS rat samples (modern_HS_deep_sequenced.vcf.gz), SNPs and
indels in high-coverage HS founder samples (HS_founders.vcf.gz), and short
tandem repeats in both high-coverage modern HS rat and high-coverage HS founder
samples (STRs.vcf.gz). Methods used to generate VCFs are described in the
section "Genotype datasets" in the primary associated publication. All SNPs and
indels were filtered to those within our standard reference panel. All modern HS
rat samples are filtered to those which passed QC. Also included are mRatBN7.2's
reference MT sequence (mRatBN_7_2_MT.fasta) and PLINK-binary genotype files for
X Chromosome SNPs (chrX.bed/bim/fam) in rats with brain hemisphere RNA-seq data.
X SNPs were filtered using the same thresholds as were used to make the GRM,
except that Hardy-Weinberg equilibrium tests were conducted with only females.

Genetic relationship matrix
--This zipped archive contains a genetic relationship matrix (GRM) made as
described in the section "GWAS phenotype association" in the primary associated
publication. In brief, it used all autosomal variants from genotyped modern HS
rats, filtered using standard GWAS quality thresholds (minor allele frequency,
missingness, Hardy-Weinberg equilibrium).

GWAS phenotypes
--This zipped archive contains CSVs for GWAS phenotypes collected by the NIDA
Center for GWAS in Outbred Rats, processed as described in the section "GWAS
phenotype association" in the primary associated publication. One CSV is a table
(GWAS_phenotype_table.csv) where RFIDs are rows and traits are columns. Another
(trait_dictionary.csv) matches each trait to its project. A separate phenotype
file (kidneys.csv) has kidney count at birth, if known.

Gene expression
--This zipped archive contains RatGTex gene expression tables using mRatBN7.2.
Methods to generate them are described on RatGTex (https://ratgtex.org/about/).
"log2" values, included for all tissues (<tissue>.rn7.expr.log2.bed.gz), are the
base-2 logarithm of (RSEM + 1), using original RSEM values. For "Brain", both
transcripts per million (Brain.rn7.expr.tpm.bed.gz) and inverse quantile
normalized (Brain.rn7.expr.iqn.bed.gz) tables are also included. Also provides
lookup tables for tissue long name to abbreviation (tissue_name_map.csv),
Ensembl ID to common gene name/symbol (gene_name_map.csv), and contigs which are
cis to a chromosome (Y_MT_cis_contigs.csv).

Sequencing depths
--This file contains CSVs with average read depth along given regions, with 
1-indexed positions as rows and RFIDs as columns. Depths generated by SAMtools
(using the options "-r NC_001665.2 -a", except for Y_depth.csv) from BAM (binary
Sequence Alignment/Map: SAM format) files produced by the genotyping pipelines
in References. Depths in low-coverage modern HS rats along the MT Chromosome,
(MT_depth.csv), Y Chromosome (Y_depth.csv), and the gene Med14Y
(modern_shallow_Med14Y.csv) in low-coverage modern HS rats are included.
(Y_depth.csv summed depth within Y haplogroup/library preparation method groups,
instead of separately for each RFID, due to the size of the Y Chromosome). Also
included are areas near the gene Med14Y in founders (founders_near_Med14Y.csv)
and high-coverage modern HS rats (modern_deep_near_Med14Y.csv).

Association test results
--This zipped archive contains CSV files with p-values for associations.
Methods on how the associations were performed are in the sections "GWAS
phenotype association" and "Gene expression association" in the primary
associated publication, as well as in brief below in Methods. Association tests
are separated as <chromosome>_<category>_tests.csv: Y_gene_expression_tests.csv
has all association tests between gene expression phenotypes and Y haplogroup.
All files are sorted by unadjusted p-value.

Methods:

The following approach was taken to generate the association test result files.

GWAS phenotypes
--Run MLMA (--mlma) with GCTA; genotypes are Y/MT haplogroup (each encoded as a
single SNP, either 0 for Y1/MT1 or 2 for Y2/MT2), phenotype is one column of
GWAS_phenotype_table.csv, and the GRM is the autosomal one included here.
Combine all "Freq" and "p" columns from the .mlma result files, by chromosome.
Add a new "name" column for the phenotype name. Create an adjusted p-value
"adj-p" column by running the Benjamini-Hochberg (BH) false discovery rate (FDR)
procedure ("p.adjust" in R), separately for each chromosome.

Gene expression
--Starting from the "log2" read counts, and haplogroups created as detailed in
the sections "Two major versions of Y are present in modern HS rats" and "Two
versions of MT are present in modern HS rats" and  in the primary associated
publication, the below analysis pipeline is run for each tissue separately.
  1. Convert expression values to original RSEM values by (log2(RSEM)^2) - 1.
  2. Remove samples which lack a haplogroup.
  3. Remove genes which are expressed in less than 10% of samples.
  4. Compare read counts between haplogroups using a two-sample Wilcoxon
  rank-sums test ("wilcox.test" in R).
This procedure generates p-values for each tissue-gene-haplogroup combination.
Metadata about each p-value, such as Ensembl ID, tissue name, gene location, and
number of samples with each haplogroup, should be added to the relevant rows.
Finally, create an adjusted p-value "adj-p" column by running the BH FDR
procedure, separately for each chromosome.

Data dictionary:

Here are details of the information included under various CSV column headers.
Some CSVs (e.g. GWAS_phenotype_table.csv) are simple X-by-Y tables. For other
file formats, refer to their documentation.

Other file types
--VCF version 4.0 is documented by the The International Genome Sample Resource
(https://www.internationalgenome.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-40)
--PLINK-binary genotype files are documented by PLINK 
(https://www.cog-genomics.org/plink/1.9/formats)
--GRM (.grm.bin, .grm.id, and .grm.N.bin) files are documented by PLINK
(https://www.cog-genomics.org/plink/1.9/formats#grm)
--BED expression format is documented by RatGTex (https://ratgtex.org/about/);
Briefly, the first four columns are gene metadata (location and ID), while the
other columns, labeled by RFID, include expression values for each sample.

sample_info.csv:
--rfid: unique RFID of this rat, used in phenotype and genotype files
--sex: sex of the rat, either M or F
--seq_method: library preparation method used to sequence this rat's sample,
either ddGBS or lcWGS
--dob: date of birth of this rat, in YYYY-MM-DD format, if known

trait_dictionary.csv
--trait: phenotype name matching GWAS_phenotype_table.csv
--project: title of the project which produced this phenotype

kidneys.csv
--rfid: unique RFID of this rat
--n_kidneys: kidney count at birth

tissue_name_map.csv
--short_name: abbreviation of tissue name; used in filename
--long_name: full tissue name

gene_name_map.csv
--ensembl_id: Ensembl ID of the gene
--gene_name: common gene name/symbol

<chromosome>_GWAS_phenotype_tests.csv
--name: phenotype name matching GWAS_phenotype_table.csv
--Freq: frequency of the nonreference (Y2/MT2) haplogroup
--p: unadjusted p-value from association test
--adj_p: p-value after multiple test correction was applied

<chromosome>_gene_expression_tests.csv
--ensembl_id: Ensembl ID of the gene
--chr: chromosome/contig name the gene is on
--tissue: tissue the samples are from
--n1: number of samples with reference (Y1/MT1) haplogroup
--n2: number of samples with nonreference (Y2/MT2) haplogroup
--p: unadjusted p-value from association test
--adj_p: p-value after multiple test correction was applied

Technical details:

Software versions used to generate the above mentioned files are listed
below. Any dependencies associated with these packages can be found via their
respective documentation. See also the Reagent Table in the Supplementary Files
of the primary associated publication.

Genotyping: See References. Notably, for low-coverage samples, alignment was
done by BWA-mem v0.7.17, and imputation by STITCH v1.6.6
GRM production and X SNP filtration: PLINK v1.90b3.31 (64 bit)
GWAS: GCTA v1.26.0
Gene expression file production: Munro et al. 2022
doi:10.17504/protocols.io.rm7vzyk92lx1/v1
Gene expression association: R v4.2.3, edgeR v3.40.2
Calculating sequencing depth: SAMtools v1.3

References:

Gileta AF, Gao J, Chitre AS, Bimschleger HV, St. Pierre CL, Gopalakrishnan S,
Palmer AA. 2020. Adapting Genotyping-by-Sequencing and Variant Calling for
Heterogeneous Stock Rats. G3: Genes|Genomes|Genetics. 10:2195–2205.
doi:10.1534/g3.120.401325.

Chen D. 2022. Palmer Lab High Coverage WGS Genotyping Pipeline.
doi:10.5281/zenodo.6584834.

Chen D, Chitre A, Cheng R, Peng B, Polesskaya O, Palmer A. 2023. Palmer Lab
Heterogeneous Stock Rats Genotyping Pipeline. doi:10.5281/zenodo.10002191.