Corresponding author: Abraham Palmer - aapalmer@ucsd.edu

Primary associated publication: Okamoto F, Chitre AS, Sanches TM, Chen D, 
Munro D, NIDA Center for GWAS in Outbred Rats, Polesskaya O, Palmer AA 2023. 
Y and Mitochondrial Chromosomes in the Heterogeneous Stock Rat Population. 
bioRxiv. doi:10.1101/2023.11.29.566473

Code: https://github.com/Palmer-Lab-UCSD/y-mt-code-for-okamoto-et-al
Zenodo DOI: 10.5281/zenodo.10234037

Description of contents:

Sample metadata
--This comma-separated values (CSV) file has information (sex, library
preparation method, and date of birth) for all modern rats used in this study.

Raw genotypes
--This zipped archive contains four Variant Call Format (VCF) files: low-coverage
modern HS rat samples (modern_HS_shallow_sequenced.vcf.gz), high-coverage modern
HS rat samples (modern_HS_deep_sequenced.vcf.gz), high-coverage HS founder
samples (HS_founders.vcf.gz), and short tandem repeats in both high-coverage
modern HS rat and high-coverage HS founder samples (STRs.vcf.gz) for the Y and
mitochondrial (MT) chromosomes. Methods used to generate these VCFs are
described in the section labeled "Genotype datasets" in the primary associated
publication. All VCFs have had single nucleotide polymorphisms (SNPs) and indels
filtered to those which appear in our standard reference panel. All modern HS
rat samples have been filtered to those which passed QC. In addition, the
mRatBN7.2 reference MT sequence, AY172581.1, is included as mRatBN_7_2_mt.fasta.

Genetic relationship matrix
--This zipped archive contains a genetic relationship matrix (GRM) made as
described in the section labeled "GWAS phenotype association" in the primary
associated publication. In brief, all autosomal variants from genotyped modern
HS rats were filtered using standard GWAS quality thresholds (minor allele
frequency, missingness, Hardy-Weinberg equilibrium) and input to the GRM.

GWAS phenotypes
--This zipped archive contains CSVs for GWAS phenotypes collected by the NIDA
Center for GWAS in Outbred Rats, processed as described in the section labeled
"GWAS phenotype association" in the primary associated publication. Phenotype
names are anonymized to respect unpublished collaborations. One CSV is a table
(gwas_phenotypes_table.csv) where RFIDs are rows and traits are columns. 
Another (trait_dictionary.csv) matches each trait to its project. A separate
phenotype file (kidneys.csv) has kidney count at birth, if known.

Gene expression phenotypes
--This zipped archive contains RatGTex gene expression tables using mRatBN7.2.
Methods to generate them are described on RatGTex (https://ratgtex.org/about/).
"log2" values, included for all tissues (<tissue>.rn7.expr.log2.bed.gz), are the
base-2 logarithm of (RSEM + 1), using original RSEM values. For "Brain", both
transcripts per million (Brain.rn7.expr.tpm.bed.gz) and inverse quantile
normalized (Brain.rn7.expr.iqn.bed.gz) tables are also included.

Sequencing depth along MT
--This file contains read depth along the MT chromosome, with positions as rows
and RFIDs as columns. It was generated by SAMtools (using the "-r NC_001665.2
-a" options) from the BAM (binary Sequence Alignment/Map) files produced by the
genotyping pipelines described in References.

Association test results
--This zipped archive contains CSV files with p-values for associations.
The association methods are described in the sections labeled
"GWAS phenotype association" and "Gene expression association" in the primary
associated publication, and are summarized under Methods below. Association
tests are split into files named <category>_<chromosome>_tests.csv; for example,
gene_expression_Y_tests.csv has all association tests between gene expression
phenotypes and Y haplotype. All files are sorted by unadjusted p-value.

Methods:

The following approach was taken to generate the association test result files.

GWAS phenotypes
--Run MLMA (--mlma) with GCTA. The genotype is the Y or MT haplotype (each
encoded as a single SNP), the phenotype is one column of
gwas_phenotypes_table.csv, and the GRM is the autosomal GRM included in this
dataset. Combine all "Freq" and "p" columns from the .mlma
result files, for each chromosome separately. Add a new "name" column for the
phenotype name. Create an adjusted p-value "adj_p" column by running the
Benjamini-Hochberg (BH) false discovery rate (FDR) procedure ("p.adjust" in R),
separately for each chromosome.
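
The BH adjustment described above can be sketched as follows. This is a minimal
illustration in Python, not the code used for the deposited files (those were
produced with "p.adjust" in R). The function name bh_adjust and the example
p-values are hypothetical. The adjustment would be applied once to the Y tests
and once to the MT tests.

```python
def bh_adjust(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    # Visit p-values from largest to smallest so that a running minimum
    # enforces monotonicity of the adjusted values.
    order = sorted(range(m), key=lambda i: pvalues[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for position, idx in enumerate(order):
        rank = m - position  # 1-based rank of this p-value in ascending order
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical unadjusted p-values for the tests on one chromosome.
y_pvalues = [0.01, 0.04, 0.03, 0.20]
y_adjusted = bh_adjust(y_pvalues)
```

Adjusting each chromosome separately means the number of tests m is counted per
chromosome, not across the Y and MT results combined.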

Gene expression
--Start from the "log2" read counts and the haplotype groups created as detailed
in the sections labeled "Two versions of MT are present in modern HS rats" and
"Two versions of Y are present in modern HS rats" in the primary associated
publication. The analysis pipeline below is run for each tissue separately.
  1. Convert "log2" values back to original RSEM values: RSEM = 2^x - 1, where
  x is the "log2" value, i.e. log2(RSEM + 1).
  2. Remove samples which lack a haplotype group.
  3. Remove genes that are expressed in fewer than 10% of samples.
  4. Compare read counts between haplotype groups using a two-sample Wilcoxon
  rank-sum test ("wilcox.test" in R).
This procedure generates p-values for each tissue-gene-haplotype combination.
Metadata about each p-value, such as Ensembl ID, tissue name, gene location, and
number of samples with each haplotype, should be added to the relevant rows.
Finally, create an adjusted p-value "adj_p" column by running the
Benjamini-Hochberg (BH) false discovery rate (FDR) procedure ("p.adjust" in R),
separately for each chromosome.
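
Steps 1 to 4 above can be sketched for a single gene as follows. This is a
minimal Python illustration under stated assumptions, not the R code used for
the deposited files. All function names are hypothetical, "expressed" is assumed
to mean RSEM > 0, and haplotype groups are assumed to be coded 1 and 2. The
p-value uses the normal approximation with tie and continuity corrections; R's
"wilcox.test" may instead use an exact test for small samples without ties, so
results can differ slightly.

```python
import math


def log2_to_rsem(value):
    """Step 1: invert log2(RSEM + 1) to recover the original RSEM value."""
    return 2.0 ** value - 1.0


def is_expressed_enough(rsem_values, min_fraction=0.10):
    """Step 3: True if RSEM > 0 (assumption) in at least min_fraction of samples."""
    expressed = sum(1 for v in rsem_values if v > 0)
    return expressed >= min_fraction * len(rsem_values)


def rank_sum_p(group1, group2):
    """Step 4: two-sided rank-sum p-value by normal approximation."""
    n1, n2 = len(group1), len(group2)
    if n1 == 0 or n2 == 0:
        return math.nan
    n = n1 + n2
    pooled = sorted([(v, 0) for v in group1] + [(v, 1) for v in group2])
    rank_sum1, tie_term, i = 0.0, 0.0, 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        midrank = (i + 1 + j) / 2.0  # average of 1-based ranks i+1..j
        rank_sum1 += midrank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        tie_term += (j - i) ** 3 - (j - i)
        i = j
    u = rank_sum1 - n1 * (n1 + 1) / 2.0
    variance = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if variance == 0:
        return math.nan  # every value tied, so the statistic is undefined
    diff = u - n1 * n2 / 2.0
    correction = 0.5 if diff > 0 else (-0.5 if diff < 0 else 0.0)
    z = (diff - correction) / math.sqrt(variance)
    return math.erfc(abs(z) / math.sqrt(2.0))


def gene_p_value(log2_by_rfid, haplotype_by_rfid):
    """Run steps 1-4 for one gene; returns None if the gene is filtered out."""
    # Step 2: drop samples that lack a haplotype group.
    kept = [(haplotype_by_rfid[r], log2_to_rsem(x))
            for r, x in log2_by_rfid.items()
            if haplotype_by_rfid.get(r) is not None]
    values = [v for _, v in kept]
    if not values or not is_expressed_enough(values):
        return None
    return rank_sum_p([v for h, v in kept if h == 1],
                      [v for h, v in kept if h == 2])


# Hypothetical data: ten samples with RSEM 1..10, plus one without a haplotype.
rfids = list("abcdefghij")
log2_values = {r: math.log2(i + 2) for i, r in enumerate(rfids)}
log2_values["k"] = 3.0
haplotypes = {r: (1 if i < 5 else 2) for i, r in enumerate(rfids)}
haplotypes["k"] = None
p_value = gene_p_value(log2_values, haplotypes)
```

The resulting p-values would then be collected per tissue and adjusted per
chromosome as described above.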

Data dictionary:

Here are details of the information included under various CSV column headers.
Some CSVs (gwas_phenotypes_table.csv, Sequencing depth along MT) are simple
X-by-Y tables. For other file formats, refer to their documentation.

Other file types
--VCF version 4.0 is documented by the International Genome Sample Resource
(https://www.internationalgenome.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-40)
--GRM (.grm.bin, .grm.id, and .grm.N.bin) files are documented by PLINK
(https://www.cog-genomics.org/plink/1.9/formats#grm)
--BED expression format is documented by RatGTex (https://ratgtex.org/about/).
Briefly, the first four columns are gene metadata (location and ID), while the
other columns, labeled by RFID, include expression values for each sample.
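
The BED expression layout described above can be read with a short sketch like
the following. It is a minimal Python illustration that assumes a tab-delimited
file with a single header row; the header names, gene IDs, RFIDs, and values in
the example are made up, and the function name read_expression_bed is
hypothetical. See the RatGTex documentation for the authoritative format.

```python
import csv
import io


def read_expression_bed(handle):
    """Split each row into four gene metadata fields and per-RFID values."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    rfids = header[4:]  # columns after the first four are labeled by RFID
    genes = []
    for row in reader:
        metadata = row[:4]  # gene location and ID
        values = dict(zip(rfids, (float(x) for x in row[4:])))
        genes.append((metadata, values))
    return rfids, genes


# Made-up two-gene, two-sample table standing in for a real file.
example = io.StringIO(
    "#chr\tstart\tend\tgene_id\tRAT1\tRAT2\n"
    "1\t100\t101\tGENE_A\t0.0\t3.0\n"
    "2\t200\t201\tGENE_B\t1.5\t2.5\n"
)
rfids, genes = read_expression_bed(example)
```

For a real file, the handle could come from gzip.open(path, "rt") in place of
the in-memory example.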

Sample metadata:
--rfid: unique RFID of this rat, used in phenotype and genotype files
--sex: sex of the rat, either M or F
--seq_method: library preparation method used to sequence this rat's sample,
either ddGBS or lcWGS
--dob: date of birth of this rat, in YYYY-MM-DD format, if known

trait_dictionary.csv
--trait: anonymized phenotype name matching gwas_phenotypes_table.csv
--project: title of the project which produced this phenotype

kidneys.csv
--rfid: unique RFID of this rat
--n_kidneys: kidney count at birth

gwas_phenotype_<chromosome>_tests.csv
--name: anonymized phenotype name matching gwas_phenotypes_table.csv
--Freq: frequency of the nonreference (Y2/MT2) haplotype
--p: unadjusted p-value from association test
--adj_p: p-value after multiple test correction has been applied

gene_expression_<chromosome>_tests.csv
--ensembl_id: Ensembl ID of the gene
--tissue: tissue the samples are from
--chr: chromosome/contig name the gene is on
--n1: number of samples with reference (Y1/MT1) haplotype
--n2: number of samples with nonreference (Y2/MT2) haplotype
--p: unadjusted p-value from association test
--adj_p: p-value after multiple test correction has been applied
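
As an illustration of how these columns might be used, the sketch below keeps
the rows of a gene expression test file whose adjusted p-value falls below a
threshold. It is a minimal Python example. The 0.05 threshold, the function name
significant_rows, and the example rows are assumptions made for illustration,
and it assumes every row has a numeric adj_p.

```python
import csv
import io


def significant_rows(handle, threshold=0.05):
    """Return rows with adj_p below the threshold, preserving file order."""
    return [row for row in csv.DictReader(handle)
            if float(row["adj_p"]) < threshold]


# Made-up rows using the column headers listed above.
example = io.StringIO(
    "ensembl_id,tissue,chr,n1,n2,p,adj_p\n"
    "GENE_A,Brain,chr1,40,35,0.0001,0.002\n"
    "GENE_B,Brain,chr2,40,35,0.03,0.3\n"
)
hits = significant_rows(example)
```

Because the files are sorted by unadjusted p-value, the retained rows stay in
that order.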

Technical details:

The versions of all software used to generate the above-mentioned files are listed
below. Any dependencies associated with these packages can be found via their
respective documentation.

Genotyping: See References. Notably, for low-coverage samples,
alignment was done by BWA-MEM v0.7.17, and imputation by STITCH v1.6.6.
GRM production: PLINK v1.90b3.31 (64 bit)
GWAS: GCTA v1.26.0
Gene expression file production: Munro et al. 2022
doi:10.17504/protocols.io.rm7vzyk92lx1/v1
Gene expression association: R v4.2.3, edgeR v3.40.2
Calculating sequencing depth: SAMtools v1.3

References:

Gileta AF, Gao J, Chitre AS, Bimschleger HV, St. Pierre CL, Gopalakrishnan S, 
Palmer AA. 2020. Adapting Genotyping-by-Sequencing and Variant Calling for 
Heterogeneous Stock Rats. G3 (Bethesda). 10(7):2195–2205. 
doi:10.1534/g3.120.401325.

Chen D, Chitre A, Cheng R, Peng B, Polesskaya O, Palmer A. 2023 Oct 20. 
Palmer Lab High Coverage WGS DeepVariant Genotyping Pipeline. 
doi:10.5281/zenodo.10027133. [accessed 2023 Nov 28]. 
https://zenodo.org/records/10027133.

Chen D, Chitre A, Cheng R, Peng B, Polesskaya O, Palmer A. 2023 Oct 20. 
Palmer Lab Heterogeneous Stock Rats Genotyping Pipeline. 
doi:10.5281/zenodo.10002191. [accessed 2023 Nov 29]. 
https://zenodo.org/records/10002191.