[["index.html", "Allele-specific transcriptomics tutorial 1 Introduction 1.1 Contents 1.2 Data", " Allele-specific transcriptomics tutorial Sean T. Bresnahan 2024-07-15 1 Introduction This tutorial demonstrates best practices for processing whole genome sequence (WGS) and mRNA sequencing (mRNA-seq) reads for allele-specific transcriptomics analysis. 1.1 Contents Sequence reads to SNP-level parent-specific read counts SNP-level parent-specific read counts to allele-specific transcription 1.2 Data Sequencing data for this tutorial are described in Bresnahan et al., 2024, “Intragenomic conflict underlies extreme phenotypic plasticity in queen-worker caste determination in honey bees (Apis mellifera)”, bioRxiv. This study used instrumental insemination to create reciprocal crosses between F1 honey bees derived from different F0 genetic stocks (referred to as “lineage A” and “lineage B”) in two distinct genetic Blocks, and manipulated the F2 larvae resulting from these crosses to induce development of either the worker or queen caste fate. Whole-genome sequencing (WGS) of genomic DNA isolated from the F1 males and females used to make these crosses, and mRNA-seq of RNA isolated from the F2 larvae resulting from these crosses, was performed for quantifying and comparing allele-specific transcription between worker and queen-destined larvae. In this tutorial, sequencing data from one genetic Block of these crosses (n=4 150x150bp paired-end WGS libraries and n=20 50x50bp paired-end stranded mRNA-seq libraries) will be used in order to reproduce Figure 4 from the study. Figure 4. Queen-destined larvae show enriched paternal allele-biased transcription relative to worker-destined larvae. Allele-specific transcriptomes were assessed in F2 worker-destined larvae (WL) and queen-destined larvae (QL) collected from a reciprocal cross between different stocks of European honey bees. The x-axis represents, for each transcript, the proportion of lineage A reads in larvae with a lineage B mother and lineage A father (p1). The y-axis represents, for each transcript, the proportion of lineage A reads in larvae with a lineage A mother and lineage B father (p2). Each color represents a transcript which is significantly biased at all tested SNP positions: black is maternal (mat), green is lineage A, gold is lineage B, blue is paternal (pat), and grey is not significant. Center table: the number of transcripts showing each category of allelic bias and p-values for Chi-squared tests of independence for comparisons between the castes are indicated (NS = not significant). Significance of allele-biased transcription was determined using the overlap between two statistical tests: a general linear mixed model (GLIMMIX), and a Storer-Kim binomial exact test along with thresholds of p1<0.4 and p2>0.6 for maternal bias, p1>0.6 and p2<0.4 for paternal bias, p1<0.4 and p2<0.4 for lineage B bias and p1>0.6 and p2>0.6 for lineage A bias. "],["sequence-reads-to-snp-level-parent-specific-read-counts-1.html", "2 Sequence reads to SNP-level parent-specific read counts 2.1 Overview 2.2 Setup 2.3 Generation of F1 genomes and transcriptomes 2.4 Quantification of F2 allele-specific transcription", " 2 Sequence reads to SNP-level parent-specific read counts 2.1 Overview This bash tutorial demonstrates best practices for processing whole genome sequence (WGS) and mRNA sequencing (mRNA-seq) reads for allele-specific transcriptomics analysis. 
Sequencing data for this tutorial are described in Bresnahan et al., 2024, “Intragenomic conflict underlies extreme phenotypic plasticity in queen-worker caste determination in honey bees (Apis mellifera)”, bioRxiv. See the tutorial introduction for more details. F1 WGS reads are trimmed with fastp to remove adapter sequences and filter low-quality reads, then aligned to the A. mellifera Amel_HAv3.1 reference genome assembly with BWA-MEM. Alignments are coordinate sorted and filtered using samtools. Duplicate alignments are removed with GATK MarkDuplicates. Variants against the A. mellifera reference genome assembly are detected using freebayes to account for differences in ploidy between diploid females (queens) and haploid males (drones). The variant calls are then filtered for quality and read depth with VCFtools, and all variants except homozygous SNPs are filtered with samtools BCFtools. The high-confidence homozygous SNPs are then integrated into the A. mellifera reference genome assembly with GATK FastaAlternativeReferenceMaker, and intersected using bedtools to identify SNPs that are unique to each parent but shared between the crosses of a reciprocal cross pair. Parent-specific SNPs are then intersected with the A. mellifera reference genome annotation with bedtools to identify positions within the longest transcript for each gene. F2 mRNA-seq reads are trimmed with fastp to remove adapter sequences, filter low-quality reads, and generate quality control metrics, and then aligned to each respective parent genome using STAR. Alignments are then filtered using samtools. For each sample, read coverage is calculated at each F1 SNP using bedtools intersect in strand-aware mode. 2.2 Setup 2.2.1 Command-line tools Mamba Several command-line tools are used in this tutorial. These are easily installed with the package manager Mamba. Follow the Miniforge distribution documentation for installation, then create a new environment called “AST-bioinfo” for installing the necessary tools. mamba create --name AST-bioinfo sra-tools The SRA Toolkit and SDK from NCBI is a collection of tools and libraries for retrieving data in the INSDC Sequence Read Archives. (Documentation) In this tutorial, sra-tools is used to retrieve Apis mellifera WGS & mRNA-seq reads from SRA BioProject accession PRJNA1106847. mamba install -n AST-bioinfo bioconda::sra-tools fastp fastp is a tool designed to provide fast all-in-one preprocessing for FastQ files. (Documentation) In this tutorial, fastp is used to trim any sequencing adapter contamination from the reads and assess quality control metrics. mamba install -n AST-bioinfo bioconda::fastp BWA BWA is a software package for mapping DNA sequences against a large reference genome. (Documentation) In this tutorial, the BWA-MEM algorithm which is designed for Illumina sequence reads ranged from 70bp to a few megabases and is faster and more accurate than other alignment algorithms, is used for alignment of the F1 (parental) WGS reads to the Apis mellifera reference genome. mamba install -n AST-bioinfo bioconda::bwa STAR Spliced Transcripts Alignment to a Reference (STAR) is one of the best performing tools for alignment of mRNA-seq reads to a reference genome. (Documentation) In this tutorial, STAR is used to align the F2 mRNA-seq reads to their respective F1 reference genomes. 
mamba install -n AST-bioinfo bioconda::star freebayes freebayes is a Bayesian genetic variant detector designed to find small polymorphisms, specifically SNPs (single-nucleotide polymorphisms), indels (insertions and deletions), MNPs (multi-nucleotide polymorphisms), and complex events (composite insertion and substitution events) smaller than the length of a short-read sequencing alignment. (Documentation) In this tutorial, freebayes is used to identify SNPs in the F1 WGS reads with reference to the Apis mellifera reference genome. mamba install -n AST-bioinfo bioconda::freebayes samtools samtools is a suite of tools for manipulating next-generation sequencing data. (Documentation) In this tutorial, samtools is used to process alignments produced by BWA-MEM & STAR and SNP calls produced by freebayes. mamba install -n AST-bioinfo bioconda::samtools BEDTools The bedtools utilities are a swiss-army knife of tools for a wide-range of genomics analysis tasks. The most widely-used tools enable genome arithmetic: that is, set theory on the genome. For example, bedtools allows one to intersect, merge, count, complement, and shuffle genomic intervals from multiple files in widely-used genomic file formats such as BAM, BED, GFF/GTF, VCF. (Documentation) In this tutorial, BEDTools is used to identify SNPs that differ between F1 parents and to annotate the Apis mellifera transcript annotations with the positions of these SNPs. mamba install -n AST-bioinfo bioconda::bedtools VCFtools VCFtools is a program package designed for working with VCFs. (Documentation) In this tutorial, VCFtools is used to filter SNPs called by freebayes. mamba install -n AST-bioinfo bioconda::vcftools GATK The Genome Analysis Toolkit is a collection of command-line tools for analyzing high-throughput sequencing data with a primary focus on variant discovery. (Documentation) In this tutorial, the MarkDuplicates tool is used to filter duplicate WGS reads, and the FastaAlternateReferenceMaker tool is used to integrate F1 SNPs into the Apis mellifera reference genome to generate parental reference genomes. mamba install -n AST-bioinfo bioconda::gatk FastQC FastQC aims to provide a simple way to do some quality control checks on raw sequence data coming from high throughput sequencing pipelines. It provides a modular set of analyses which you can use to give a quick impression of whether your data has any problems of which you should be aware before doing any further analysis. (Documentation) In this tutorial, FastQC is used for an initial quality control check of the sequencing reads prior to pre-processing steps. mamba install -n AST-bioinfo bioconda::fastqc MultiQC MultiQC searches a given directory for analysis logs and compiles a HTML report. It’s a general use tool, perfect for summarising the output from numerous bioinformatics tools. (Documentation) In this tutorial, MultiQC is used to aggregate reports from FastQC and fastp to assess quality control metrics of the sequencing reads before and after filtering. mamba install -n AST-bioinfo bioconda::multiqc Activate mamba environment Upon opening a new terminal, you must activate your mamba environment to access these tools: mamba activate AST-bioinfo Project directory structure It is important to create a cohesive and consistent directory structure for your bioinformatics projects. 
For this tutorial, create these directories with mkdir: AST-tutorial └───RAW │ └───WGS │ └───mRNA └───TRIM │ └───WGS │ └───mRNA └───ALIGN_GENOME │ └───UNFILTERED │ └───FILTERED └───VARIANTS │ └───UNFILTERED │ └───FILTERED └───ANALYSIS │ └───INTERMEDIATE │ └───RESULTS └───ANALYSIS_SETS └───PARENT_GENOMES └───ALIGN_PARENT_GENOMES │ └───UNFILTERED │ └───FILTERED └───COUNT └───INDEX └───SCRIPTS └───REPORTS │ └───WGS │ └───mRNA 2.2.2 Amel_HAv3.1 reference genome assembly and annotation files Retrieve the RefSeq assembly files from NCBI via ftp. DIR_INDEX="AST-tutorial/INDEX" wget -O ${DIR_INDEX}/Amel_HAv3.1.fasta.gz https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/003/254/395/GCF_003254395.2_Amel_HAv3.1/GCF_003254395.2_Amel_HAv3.1_genomic.fna.gz gunzip ${DIR_INDEX}/Amel_HAv3.1.fasta.gz wget -O ${DIR_INDEX}/Amel_HAv3.1.gff.gz https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/003/254/395/GCF_003254395.2_Amel_HAv3.1/GCF_003254395.2_Amel_HAv3.1_genomic.gff.gz gunzip ${DIR_INDEX}/Amel_HAv3.1.gff.gz wget -O ${DIR_INDEX}/Amel_HAv3.1.gtf.gz https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/003/254/395/GCF_003254395.2_Amel_HAv3.1/GCF_003254395.2_Amel_HAv3.1_genomic.gtf.gz gunzip ${DIR_INDEX}/Amel_HAv3.1.gtf.gz 2.2.3 A note on parallel processing in high-performance computing clusters The code chunks throughout the tutorial are written to run each FILE in a FILES array through serial loops. If you are running this tutorial on a high-performance computing (HPC) cluster with a batch submission system (e.g., SLURM on Penn State ROAR Collab), you can instead submit a “parent” job that submits individual “child” jobs for each file. For example: Parent job #!/bin/bash #SBATCH --nodes=1 #SBATCH --ntasks=1 #SBATCH --time=0:10:00 #SBATCH --mem-per-cpu=1gb #SBATCH --partition=open SCRIPTS="path/to/SCRIPTS" THREADS=4 DIR_OUT="path/to/OUTPUT_DIRECTORY" DIR_RAW="path/to/RAW_DATA" FILES=(FILE1 FILE2 etc...) for FILE in "${FILES[@]}" do sbatch ${SCRIPTS}/script_to_run.sh ${FILE} ${THREADS} ${DIR_OUT} ${DIR_RAW} done Child job #!/bin/bash #SBATCH --nodes=1 #SBATCH --ntasks=8 #SBATCH --time=8:00:00 #SBATCH --mem-per-cpu=8gb #SBATCH --partition=open # To source miniconda source /storage/home/stb5321/work/miniconda3/etc/profile.d/conda.sh # Assign arguments from sbatch command called in parent script to variables FILE="$1" THREADS="$2" DIR_OUT="$3" DIR_RAW="$4" commandX -t ${THREADS} \\ ${DIR_RAW}/${FILE}_1.fastq ${DIR_RAW}/${FILE}_2.fastq \\ -o ${DIR_OUT}/${FILE}.out Note that depending on how your HPC is set up, you may need to source your conda installation in each child script. For more information on the #SBATCH commands in the headers of these scripts, see the Slurm Workload Manager documentation. Specifically for PSU, see the ROAR Collab documentation. 2.3 Generation of F1 genomes and transcriptomes 2.3.1 WGS read pre-processing 2.3.1.1 Retrieve WGS reads from SRA SRA Cross Parent SRR28865844 A Male SRR28865843 A Female SRR28865842 B Male SRR28865841 B Female DIR_RAW="AST-tutorial/RAW/WGS" FILES=(SRR28865844 SRR28865843 SRR28865842 SRR28865841) for FILE in "${FILES[@]}" do # Retrieve the .sra file prefetch -O ${DIR_RAW} ${FILE} # Extract .fastq from .sra fasterq-dump -O ${DIR_RAW} ${DIR_RAW}/${FILE}.sra # Delete the .sra file rm ${DIR_RAW}/${FILE}.sra done 2.3.1.2 Assess raw quality control metrics Here, FastQC reports for each library are generated. These can be viewed individually, or aggregated into a single report with MultiQC. This tutorial will demonstrate the latter approach. 
DIR_RAW="AST-tutorial/RAW/WGS" DIR_REPORTS="AST-tutorial/REPORTS/WGS" THREADS="integer number of threads for parallel processing, usually 4 is fine" FILES=(SRR28865844 SRR28865843 SRR28865842 SRR28865841) for FILE in "${FILES[@]}" do fastqc \\ -t ${THREADS} \\ ${DIR_RAW}/${FILE}_1.fastq ${DIR_RAW}/${FILE}_2.fastq \\ --outdir ${DIR_REPORTS} done 2.3.1.3 Trim and filter reads fastp will detect common sequencing adapters and trim them from the ends of sequencing reads, in addition to filter low-quality reads (where the average base call quality score is below some threshold and/or where the read is less than n nucleotides in length). Here, run fastp using the default settings to clean up the reads. DIR_RAW="AST-tutorial/RAW/WGS" DIR_TRIM="AST-tutorial/TRIM/WGS" DIR_REPORTS="AST-tutorial/REPORTS/WGS" THREADS="integer number of threads for parallel processing, usually 8 is fine" FILES=(SRR28865844 SRR28865843 SRR28865842 SRR28865841) for FILE in "${FILES[@]}" do fastp \\ -w ${THREADS} \\ -i ${DIR_RAW}/${FILE}_1.fastq -I ${DIR_RAW}/${FILE}_2.fastq \\ -o ${DIR_TRIM}/${FILE}_1.fastq -O ${DIR_TRIM}/${FILE}_2.fastq \\ -j ${DIR_REPORTS}/${FILE}_fastp.json done 2.3.1.4 Aggregate QC reports with MultiQC Quality control metrics of the WGS reads before and after trimming/filtering can be visualized in one html report. Read more about interpreting NGS QC metrics from these reports here from the Harvard Chan Bioinformatics Core. An example MultiQC report of WGS reads post-trimming can be found here. DIR_REPORTS="AST-tutorial/REPORTS/WGS" multiqc ${DIR_REPORTS}/. \\ -o ${DIR_REPORTS} \\ -n WGS.html 2.3.2 WGS read alignment and post-processing 2.3.2.1 Generate BWA alignment index for the Amel_HAv3.1 reference genome assembly DIR_INDEX="AST-tutorial/INDEX" bwa index ${DIR_INDEX}/Amel_HAv3.1.fasta 2.3.2.2 Align reads to the Amel_HAv3.1 reference genome assembly with BWA-MEM DIR_TRIM="AST-tutorial/TRIM/WGS" DIR_INDEX="AST-tutorial/INDEX" DIR_ALIGN="AST-tutorial/ALIGN_GENOME/UNFILTERED" THREADS="integer number of threads for parallel processing, usually 8 is fine" FILES=(SRR28865844 SRR28865843 SRR28865842 SRR28865841) for FILE in "${FILES[@]}" do bwa mem \\ -t ${THREADS} \\ ${DIR_INDEX}/Amel_HAv3.1.fasta \\ ${FILE}_1.fastq ${FILE}_2.fastq \\ > ${DIR_ALIGN}/${FILE}.sam done 2.3.2.3 Sort and filter alignments with samtools and filter duplicates with GATK Here, alignments are coordinate sorted and filtered to remove unmapped reads, reads with unmapped mates, secondary, supplementary, and duplicate alignments (see here for information on SAM flags). These are important to remove for variant calling steps as they can contribute to both false positive and negative calls. For more information about these practices, see the GATK best practices workflow for variant discovery and Koboldt (2020), Genome Med, https://doi.org/10.1186/s13073-020-00791-w. 
DIR_ALIGN="AST-tutorial/ALIGN_GENOME/UNFILTERED" DIR_ALIGN_FILTERED="AST-tutorial/ALIGN_GENOME/FILTERED" DIR_REPORTS="AST-tutorial/REPORTS/WGS" THREADS="integer number of threads for parallel processing, usually 6 is fine" FILES=(SRR28865844 SRR28865843 SRR28865842 SRR28865841) for FILE in "${FILES[@]}" do # Remove alignments with any flags in 2316 # Convert SAM to BAM and sort using 4Gb memory per thread samtools view \\ -@ ${THREADS} -F 2316 -O bam ${DIR_ALIGN}/${FILE}.sam \\ | samtools sort -@ ${THREADS} -m 4g -O bam - \\ > ${DIR_ALIGN}/${FILE}.bam # Remove duplicates and sort by coordinate gatk MarkDuplicates \\ -I ${DIR_ALIGN}/${FILE}.bam \\ -O ${DIR_ALIGN_FILTERED}/${FILE}_rmvdup.bam \\ -M ${DIR_REPORTS}/${FILE}_rmvdup_metrics.txt \\ --ASSUME_SORT_ORDER coordinate \\ --REMOVE_DUPLICATES true \\ --ADD_PG_TAG_TO_READS false done 2.3.3 Variant discovery and filtration Here, variants (SNPs, MNPs, & indels) are identified using freebayes to account for differences in ploidy between DIPLOID and HAPLOID samples. After variant discovery, the variant call files (VCFs) are then subset to homozygous SNPs and filtered by quality (QUAL), minimum depth of coverage (MINDP) and maximum depth of coverage (MAXDP) using samtools and VCFtools, compressed with bgzip and indexed with tabix (samtools). For more information about these practices, see this how-to guide on Filtering and handling VCFs for population genomics from Mark Ravinet & Joana Meier. QUAL="an integer, 30 is ideal" MINDP="an integer, usually 10 is fine" MAXDP="an integer, usually 50 is fine" DIR_ALIGN_FILTERED="AST-tutorial/ALIGN_GENOME/FILTERED" DIR_VARIANTS="AST-tutorial/VARIANTS/UNFILTERED" DIR_VARIANTS_FILTERED="AST-tutorial/VARIANTS/FILTERED" DIR_INDEX="AST-tutorial/INDEX" DIPLOID=(SRR28865843 SRR28865841) for FILE in "${DIPLOID[@]}" do freebayes \\ -p 2 -f ${DIR_INDEX}/Amel_HAv3.1.fasta \\ ${DIR_ALIGN_FILTERED}/${FILE}.bam > ${DIR_VARIANTS}/${FILE}.vcf done HAPLOID=(SRR28865844 SRR28865842) for FILE in "${HAPLOID[@]}" do freebayes \\ -p 1 -f ${DIR_INDEX}/Amel_HAv3.1.fasta \\ ${DIR_ALIGN_FILTERED}/${FILE}.bam > ${DIR_VARIANTS}/${FILE}.vcf done FILES=(SRR28865844 SRR28865843 SRR28865842 SRR28865841) for FILE in "${FILES[@]}" do # Remove variants that do not meet criteria # Retain only high-quality homozygous SNPs vcftools \\ --vcf ${DIR_VARIANTS}/${FILE}.vcf \\ --minQ ${QUAL} \\ --min-meanDP ${MINDP} --max-meanDP ${MAXDP} \\ --minDP ${MINDP} --maxDP ${MAXDP} \\ --recode --stdout \\ | bcftools filter -e 'GT="het"' - \\ | bcftools filter -i 'TYPE="snp"' - \\ > ${DIR_VARIANTS_FILTERED}/${FILE}_homozygous_snps.vcf # Compress bgzip -c ${DIR_VARIANTS_FILTERED}/${FILE}_homozygous_snps.vcf \\ > ${DIR_VARIANTS_FILTERED}/${FILE}_homozygous_snps.vcf.gz # Index the compressed file tabix -p vcf ${DIR_VARIANTS_FILTERED}/${FILE}_homozygous_snps.vcf.gz done 2.3.4 Genome and transcriptome construction 2.3.4.1 Construct F1 genomes with GATK FastaAlternateReferenceMaker First, create a “sequence dictionary” from the Amel_HAv3.1 reference genome for use with GATK. Then, integrate the high-quality homozygous SNPs into the Amel_HAv3.1 reference genome for each parent to generate individual genomes. Finally, reformat the numeric headers of the F1 genome .fasta files produced by GATK to match the nucleotide accessions in the headers of the Amel_HAv3.1 reference genome fasta file. 
DIR_VARIANTS_FILTERED="AST-tutorial/VARIANTS/FILTERED" DIR_INDEX="AST-tutorial/INDEX" DIR_PARENT_GENOMES="AST-tutorial/PARENT_GENOMES" gatk CreateSequenceDictionary \\ -R ${DIR_INDEX}/Amel_HAv3.1.fasta \\ -O ${DIR_INDEX}/Amel_HAv3.1.dict FILES=(SRR28865844 SRR28865843 SRR28865842 SRR28865841) for FILE in "${FILES[@]}" do gatk FastaAlternateReferenceMaker \\ -R ${DIR_INDEX}/Amel_HAv3.1.fasta \\ -O ${DIR_PARENT_GENOMES}/${FILE}.fasta \\ -V ${DIR_VARIANTS_FILTERED}/${FILE}_homozygous_snps.vcf.gz done # Get all sequence headers from one of the F1 genomes (= "bad") grep ">" ${DIR_PARENT_GENOMES}/SRR28865844.fasta \\ | sed 's/>//g' > ${DIR_INDEX}/bad_headers.txt # Get all sequence headers from the reference genome (= "good") grep ">" ${DIR_INDEX}/Amel_HAv3.1.fasta \\ | sed 's/\\s.*$//' | sed 's/>//g' > ${DIR_INDEX}/good_headers.txt # Combine the "bad" and "good" headers into one .tsv paste -d"\\t" ${DIR_INDEX}/bad_headers.txt ${DIR_INDEX}/good_headers.txt \\ > ${DIR_INDEX}/replace_headers.tsv for FILE in "${FILES[@]}" do # Use replace_headers.tsv as a lookup table for replacement awk 'FNR==NR{a[">"$1]=$2;next}$1 in a{sub(/>/,">"a[$1]"|",$1)}1' \\ ${DIR_INDEX}/replace_headers.tsv ${DIR_PARENT_GENOMES}/${FILE}.fasta \\ | sed 's/:.*//' > ${DIR_PARENT_GENOMES}/${FILE}_fixed.fasta done This will replace the sequential numeric sequence headers in the F1 genome .fasta files produced by GATK, e.g., >1 ATCCTCCACCT.... >2 GGGAATTGCCA.... with the nucleotide accessions in the sequence headers of the Amel_HAv3.1 reference genome fasta file, e.g., >NC_037638.1 ATCCTCCACCT.... >NC_037639.1 GGGAATTGCCA.... This step is critical for the final mRNA-seq allele-specific read counting step, as the chromosome field of the annotation file used to assign counts to transcripts must match the chromosome field of the alignment files used for counting. 2.3.4.2 Construct F1 transcriptomes for STAR THREADS="integer number of threads for parallel processing, usually 8 is fine" DIR_INDEX="AST-tutorial/INDEX" DIR_PARENT_GENOMES="AST-tutorial/PARENT_GENOMES" FILES=(SRR28865844 SRR28865843 SRR28865842 SRR28865841) for FILE in "${FILES[@]}" do STAR \\ --runThreadN ${THREADS} \\ --runMode genomeGenerate \\ --genomeDir ${DIR_PARENT_GENOMES}/${FILE} \\ --genomeFastaFiles ${DIR_PARENT_GENOMES}/${FILE}_fixed.fasta \\ --sjdbGTFfile ${DIR_INDEX}/Amel_HAv3.1.gtf done 2.3.4.3 Annotate F1 SNPs within genes Here, the high-quality homozygous SNPs for each reciprocal cross are intersected to find those that are unique to each parent but shared between the crosses. For example, consider these hypothetical SNPs reported by a variant calling tool: Parent Cross Chr Pos Ref Alt Male A 1 100 A T Female B 1 100 A T In this example, in Cross A, the “A>T” SNP is present in the male parent but absent in the female parent (i.e., the “Alt” allele for the female parent is the same as the “Ref” allele and thus was not reported as a SNP). In Cross B, this SNP is present in the female parent but absent in the male parent. Thus, this SNP is unique to the parents of each cross, but shared between the two crosses of the reciprocal cross pair. Therefore, this SNP can be used to differentiate reads originating from each parent in each cross, allowing for assessment of genotype-specific or parent-specific bias within the reciprocal cross design. Get gene annotations First, subset the Amel_HAv3.1 reference genome feature file (gff) to gene features, and modify column 9 (the info column) to keep only the RefSeq GeneID (these will be used in all downstream analyses). 
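# Reminder on GFF3 structure for the commands below: GFF3 is tab-delimited with
# nine columns (seqid, source, type, start, end, score, strand, phase, attributes).
# In the RefSeq annotation the ninth (attributes) column contains a
# Dbxref=GeneID:<number> entry, which the grep/cut chain extracts and prefixes
# with "LOC" to rebuild the gene identifier used throughout the rest of the tutorial.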
DIR_INDEX="AST-tutorial/INDEX" DIR_VARIANTS="AST-tutorial/VARIANTS/UNFILTERED" DIR_VARIANTS_FILTERED="AST-tutorial/VARIANTS/FILTERED" # Subset Amel_HAv3.1.gff to rows with "gene"" in column 3 # Print columns 1-8 awk '$3 == "gene" { print $0 }' ${DIR_INDEX}/Amel_HAv3.1.gff \\ | awk -v OFS="\\t" '{print $1, $2, $3, $4, $5, $6, $7, $8}' \\ > ${DIR_INDEX}/Amel_HAv3.1_genes.txt # Extract the "GeneID" from column 9 awk '$3 == "gene" { print $0 }' ${DIR_INDEX}/Amel_HAv3.1.gff \\ | awk '{print $9}' \\ | grep -o 'GeneID[^\\s]*' \\ | cut -d':' -f2 \\ | cut -d';' -f1 \\ | cut -d',' -f1 \\ | sed -e 's/^/LOC/' \\ > ${DIR_INDEX}/Amel_HAv3.1_geneIDs.txt # Combine the two files to create a new gene-only .gff paste -d'\\t' ${DIR_INDEX}/Amel_HAv3.1_genes.txt \\ ${DIR_INDEX}/Amel_HAv3.1_geneIDs.txt \\ > ${DIR_INDEX}/Amel_HAv3.1_genes.gff3 The resultant Amel_HAv3.1_genes.gff3 file should look like this: NC_037638.1 Gnomon gene 9273 12174 . - . LOC551580 NC_037638.1 Gnomon gene 10792 17180 . + . LOC551555 NC_037638.1 Gnomon gene 17090 23457 . - . LOC726347 Identify informative F1 SNPs Next, identify the informative SNPs by taking the difference of each respective parental SNP call set (i.e., SNPs unique to each parent) and finding the intersection between crosses. DIR_VARIANTS_FILTERED="AST-tutorial/VARIANTS/FILTERED" DIR_ANALYSIS="AST-tutorial/ANALYSIS_SETS" CrossApat="SRR28865844" CrossAmat="SRR28865843" CrossBpat="SRR28865842" CrossBmat="SRR28865841" # Get SNPs in -a that are not in -b bedtools intersect -header -v \\ -a ${DIR_VARIANTS_FILTERED}/${CrossApat}_homozygous_snps.vcf \\ -b ${DIR_VARIANTS_FILTERED}/${CrossAmat}_homozygous_snps.vcf \\ > ${DIR_VARIANTS_FILTERED}/${CrossApat}_homozygous_snps_outer.vcf bedtools intersect -header -v \\ -a ${DIR_VARIANTS_FILTERED}/${CrossAmat}_homozygous_snps.vcf \\ -b ${DIR_VARIANTS_FILTERED}/${CrossApat}_homozygous_snps.vcf \\ > ${DIR_VARIANTS_FILTERED}/${CrossAmat}_homozygous_snps_outer.vcf bedtools intersect -header -v \\ -a ${DIR_VARIANTS_FILTERED}/${CrossBpat}_homozygous_snps.vcf \\ -b ${DIR_VARIANTS_FILTERED}/${CrossBmat}_homozygous_snps.vcf \\ > ${DIR_VARIANTS_FILTERED}/${CrossBpat}_homozygous_snps_outer.vcf bedtools intersect -header -v \\ -a ${DIR_VARIANTS_FILTERED}/${CrossBmat}_homozygous_snps.vcf \\ -b ${DIR_VARIANTS_FILTERED}/${CrossBpat}_homozygous_snps.vcf \\ > ${DIR_VARIANTS_FILTERED}/${CrossBmat}_homozygous_snps_outer.vcf # Combine Cross A SNPs from above to single .vcf grep -v '^#' \\ ${DIR_VARIANTS_FILTERED}/${CrossApat}_homozygous_snps_outer.vcf \\ | cat ${DIR_VARIANTS_FILTERED}/${CrossAmat}_homozygous_snps_outer.vcf - \\ > ${DIR_VARIANTS_FILTERED}/CrossA_homozygous_snps_outer.vcf # Compress the combined Cross A .vcf bgzip -c \\ ${DIR_VARIANTS_FILTERED}/CrossA_homozygous_snps_outer.vcf \\ > ${DIR_VARIANTS_FILTERED}/CrossA_homozygous_snps_outer.vcf.gz # Combine Cross B SNPs from above to single .vcf grep -v '^#' \\ ${DIR_VARIANTS_FILTERED}/${CrossBpat}_homozygous_snps_outer.vcf \\ | cat ${DIR_VARIANTS_FILTERED}/${CrossBmat}_homozygous_snps_outer.vcf - \\ > ${DIR_VARIANTS_FILTERED}/CrossB_homozygous_snps_outer.vcf # Compress the combined Cross B .vcf bgzip -c \\ ${DIR_VARIANTS_FILTERED}/CrossB_homozygous_snps_outer.vcf \\ > ${DIR_VARIANTS_FILTERED}/CrossB_homozygous_snps_outer.vcf.gz # Intersect the combined Cross A & Cross B SNPs bedtools intersect -header -u \\ -a ${DIR_VARIANTS_FILTERED}/CrossA_homozygous_snps_outer.vcf.gz \\ -b ${DIR_VARIANTS_FILTERED}/CrossB_homozygous_snps_outer.vcf.gz \\ > ${DIR_ANALYSIS}/Analysis_SNP_Set.vcf # Create a basic 
.bed file from the intersected SNPs # Use the genomic position as the start and end coordinates # Label each variant sequentially (i.e., SNP_1, SNP_2, etc.) grep -v '^#' ${DIR_ANALYSIS}/Analysis_SNP_Set.vcf \\ | awk -v OFS="\\t" '{print $1, $2, $2}' \\ | awk -v OFS="\\t" '$4=(FNR FS $4)' \\ | awk -v OFS="\\t" '{print $1, $2, $3, "snp_"$4}' \\ > ${DIR_ANALYSIS}/Analysis_SNP_Set.bed The resultant Analysis_SNP_Set.bed file should look like this: NC_037638.1 155356 155356 snp_1 NC_037638.1 174473 174473 snp_2 NC_037638.1 183350 183350 snp_3 Intersect informative F1 SNPs with genes Finally, annotate the genes with the positions of informative F1 SNPs. In Section 3, these SNPs will be subset to those within the longest transcript for each gene. Mitochondrial genes (on the chromosome with nucleotide accession “NC_001566.1”) are filtered, as these are innapropriate for assessing parent-of-origin effects (as all mitochondrial genes will be maternally expressed). DIR_ANALYSIS="AST-tutorial/ANALYSIS_SETS" DIR_INDEX="AST-tutorial/INDEX" # Intersect the gene annotations and SNP set # Combine the SNP_ID and GeneIDs in column 4 (name) as SNP_ID:GeneID bedtools intersect -wb \\ -a ${DIR_INDEX}/Amel_HAv3.1_genes.gff3 \\ -b ${DIR_ANALYSIS}/Analysis_SNP_Set.bed \\ | awk -v OFS="\\t" '{print $10, $11, $12, $13 ":" $9, $6, $7}' \\ | grep -v '^NC_001566.1' | sort -k1,1 -k2,2n \\ > ${DIR_ANALYSIS}/SNPs_for_Analysis.bed sort --parallel=8 -k1,1 -k2,2n \\ ${DIR_ANALYSIS}/SNPs_for_Analysis.bed \\ > ${DIR_ANALYSIS}/SNPs_for_Analysis_Sorted.bed The resultant SNPs_for_Analysis_Sorted.bed bed should look like this: NC_037638.1 54751 54751 snp_99616:LOC107964061 . + NC_037638.1 91396 91396 snp_99617:LOC113219112 . + NC_037638.1 147182 147182 snp_99618:LOC726544 . - 2.4 Quantification of F2 allele-specific transcription 2.4.1 mRNA-seq read pre-processing 2.4.1.1 Retrieve mRNA-seq reads from SRA SRA Cross Phenotype SRR28865816 A Worker SRR28865815 A Worker SRR28865811 A Queen SRR28865812 A Queen SRR28865777 A Queen SRR28865776 A Worker SRR28865775 A Queen SRR28865774 A Worker SRR28865773 A Queen SRR28865772 B Queen SRR28865770 B Queen SRR28865769 B Worker SRR28865768 B Queen SRR28865771 B Queen SRR28865767 B Worker SRR28865808 B Worker SRR28865806 B Worker SRR28865804 A Worker SRR28865803 B Worker SRR28865802 B Queen DIR_RAW="AST-tutorial/RAW/mRNA" FILES=(SRR28865816 SRR28865815 SRR28865811 SRR28865812 \\ SRR28865777 SRR28865776 SRR28865775 SRR28865774 \\ SRR28865773 SRR28865772 SRR28865770 SRR28865769 \\ SRR28865768 SRR28865771 SRR28865767 SRR28865808 \\ SRR28865806 SRR28865804 SRR28865803 SRR28865802) for FILE in "${FILES[@]}" do prefetch -O ${DIR_RAW} ${FILE} fasterq-dump -O ${DIR_RAW} ${DIR_RAW}/${FILE}.sra rm ${DIR_RAW}/{FILE}.sra done 2.4.1.2 Assess raw quality control metrics DIR_RAW="AST-tutorial/RAW/mRNA" DIR_REPORTS="AST-tutorial/REPORTS/mRNA" THREADS="integer number of threads for parallel processing, usually 4 is fine" FILES=(SRR28865816 SRR28865815 SRR28865811 SRR28865812 \\ SRR28865777 SRR28865776 SRR28865775 SRR28865774 \\ SRR28865773 SRR28865772 SRR28865770 SRR28865769 \\ SRR28865768 SRR28865771 SRR28865767 SRR28865808 \\ SRR28865806 SRR28865804 SRR28865803 SRR28865802) for FILE in "${FILES[@]}" do fastqc \\ -t ${THREADS} \\ ${DIR_RAW}/${FILE}_1.fastq ${DIR_RAW}/${FILE}_2.fastq \\ --outdir ${DIR_REPORTS} done 2.4.1.3 Trim and filter reads DIR_RAW="AST-tutorial/RAW/mRNA" DIR_TRIM="AST-tutorial/TRIM/mRNA" DIR_REPORTS="AST-tutorial/REPORTS/mRNA" THREADS="integer number of threads for parallel processing, 
usually 8 is fine" FILES=(SRR28865816 SRR28865815 SRR28865811 SRR28865812 \\ SRR28865777 SRR28865776 SRR28865775 SRR28865774 \\ SRR28865773 SRR28865772 SRR28865770 SRR28865769 \\ SRR28865768 SRR28865771 SRR28865767 SRR28865808 \\ SRR28865806 SRR28865804 SRR28865803 SRR28865802) for FILE in "${FILES[@]}" do fastp \\ -w ${THREADS} \\ -i ${DIR_RAW}/${FILE}_1.fastq -I ${DIR_RAW}/${FILE}_2.fastq \\ -o ${DIR_TRIM}/${FILE}_1.fastq -O ${DIR_TRIM}/${FILE}_2.fastq \\ -j ${DIR_REPORTS}/${FILE}_fastp.json done 2.4.1.4 Aggregate QC reports with MultiQC DIR_REPORTS="AST-tutorial/REPORTS/mRNA" multiqc ${DIR_REPORTS}/. \\ -o ${DIR_REPORTS} \\ -n mRNA.html 2.4.2 mRNA-seq read alignment to F1 transcriptomes Here, the mRNA-seq reads for each F2 sample are aligned to their respective parental (F1) transcriptomes. DIR_TRIM="AST-tutorial/TRIM/mRNA" DIR_PARENT_GENOMES="AST-tutorial/PARENT_GENOMES" DIR_ALIGN="AST-tutorial/ALIGN_PARENT_GENOMES/UNFILTERED" DIR_INDEX="AST-tutorial/INDEX" THREADS="integer number of threads for parallel processing, usually 8 is fine" CrossApat="SRR28865844" CrossAmat="SRR28865843" CrossA=(SRR28865816 SRR28865815 SRR28865811 SRR28865812 \\ SRR28865777 SRR28865776 SRR28865775 SRR28865774 \\ SRR28865773 SRR28865804) for FILE in "${CrossA[@]}" do STAR \\ --outSAMtype BAM SortedByCoordinate \\ --runThreadN ${THREADS} \\ --sjdbGTFfile ${DIR_INDEX}/Amel_HAv3.1.gtf \\ --genomeDir ${DIR_PARENT_GENOMES}/${CrossApat} \\ --readFilesIn ${DIR_TRIM}/${FILE}_1.fastq ${DIR_TRIM}/${FILE}_2.fastq \\ --outFileNamePrefix ${DIR_ALIGN}/${CrossApat}_${FILE} \\ --outSAMattributes NH HI AS nM NM STAR \\ --outSAMtype BAM SortedByCoordinate \\ --runThreadN ${THREADS} \\ --sjdbGTFfile ${DIR_INDEX}/Amel_HAv3.1.gtf \\ --genomeDir ${DIR_PARENT_GENOMES}/${CrossAmat} \\ --readFilesIn ${DIR_TRIM}/${FILE}_1.fastq ${DIR_TRIM}/${FILE}_2.fastq \\ --outFileNamePrefix ${DIR_ALIGN}/${CrossAmat}_${FILE} \\ --outSAMattributes NH HI AS nM NM done CrossBpat="SRR28865842" CrossBmat="SRR28865841" CrossB=(SRR28865772 SRR28865770 SRR28865769 SRR28865768 \\ SRR28865771 SRR28865767 SRR28865808 SRR28865806 \\ SRR28865803 SRR28865802) for FILE in "${CrossB[@]}" do STAR \\ --outSAMtype BAM SortedByCoordinate \\ --runThreadN ${THREADS} \\ --sjdbGTFfile ${DIR_INDEX}/Amel_HAv3.1.gtf \\ --genomeDir ${DIR_PARENT_GENOMES}/${CrossBpat} \\ --readFilesIn ${DIR_TRIM}/${FILE}_1.fastq ${DIR_TRIM}/${FILE}_2.fastq \\ --outFileNamePrefix ${DIR_ALIGN}/${CrossBpat}_${FILE} \\ --outSAMattributes NH HI AS nM NM STAR \\ --outSAMtype BAM SortedByCoordinate \\ --runThreadN ${THREADS} \\ --sjdbGTFfile ${DIR_INDEX}/Amel_HAv3.1.gtf \\ --genomeDir ${DIR_PARENT_GENOMES}/${CrossBmat} \\ --readFilesIn ${DIR_TRIM}/${FILE}_1.fastq ${DIR_TRIM}/${FILE}_2.fastq \\ --outFileNamePrefix ${DIR_ALIGN}/${CrossBmat}_${FILE} \\ --outSAMattributes NH HI AS nM NM done 2.4.3 mRNA-seq alignment post-processing and allele-specific read counting This final step seems a bit convoluted at first, but after testing many other read counting tools, I found this is the only approach that allows for counting strand-specific reads at the SNP-level without making any additional assumptions. 
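For reference, the key filters in the counting step below are:
# samtools view -e '[NM]==0' keeps only alignments with zero mismatches, so a read
#   is counted only against the parental genome it matches exactly
# -F 260 excludes unmapped reads (4) and secondary alignments (256); 4 + 256 = 260
# bedtools intersect -S counts alignments whose strand is opposite to that of the
#   SNP interval (which carries the strand of its gene); whether -S or -s is
#   appropriate depends on the strandedness protocol of the mRNA-seq library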
DIR_ANALYSIS="AST-tutorial/ANALYSIS_SETS" DIR_INDEX="AST-tutorial/INDEX" DIR_ALIGN="AST-tutorial/ALIGN_PARENT_GENOMES/UNFILTERED" DIR_ALIGN_FILTERED="AST-tutorial/ALIGN_PARENT_GENOMES/FILTERED" DIR_COUNT="AST-tutorial/COUNT" THREADS="integer number of threads for parallel processing, usually 8 is fine" ASET="${DIR_ANALYSIS}/SNPs_for_Analysis_Sorted.bed" Parents=("SRR28865844" "SRR28865843") CrossA=(SRR28865816 SRR28865815 SRR28865811 SRR28865812 \\ SRR28865777 SRR28865776 SRR28865775 SRR28865774 \\ SRR28865773 SRR28865804) for FILE in "${CrossA[@]}" do for PARENT in "${Parents[@]}" do # Keep only mapped, primary alignments with 0 mismatches samtools view -@ ${THREADS} -m 6G -e '[NM]==0' -F 260 -O BAM \\ ${DIR_ALIGN}/${PARENT}_${FILE}Aligned.sortedByCoord.out.bam \\ > ${DIR_ALIGN_FILTERED}/${PARENT}_${FILE}_sorted.bam # Index the alignments samtools index ${DIR_ALIGN_FILTERED}/${PARENT}_${FILE}_sorted.bam \\ ${DIR_ALIGN_FILTERED}/${PARENT}_${FILE}_sorted.bam.bai # Convert BAM to BED bedtools bamtobed -i ${DIR_ALIGN_FILTERED}/${PARENT}_${FILE}_sorted.bam \\ > ${DIR_ALIGN_FILTERED}/${PARENT}_${FILE}.bed # Sort BED by coordinate sort --parallel=${THREADS} \\ -k1,1 -k2,2n ${DIR_ALIGN_FILTERED}/${PARENT}_${FILE}.bed \\ > ${DIR_ALIGN_FILTERED}/${PARENT}_${FILE}_sorted.bed # Count reads by SNP accounting for strand of alignment bedtools intersect -S -sorted -c -a ${ASET} \\ -b ${DIR_ALIGN_FILTERED}/${PARENT}_${FILE}_sorted.bed \\ > ${DIR_COUNT}/${PARENT}_${FILE}.txt done done Parents=("SRR28865842" "SRR28865841") CrossB=(SRR28865772 SRR28865770 SRR28865769 SRR28865768 \\ SRR28865771 SRR28865767 SRR28865808 SRR28865806 \\ SRR28865803 SRR28865802) for FILE in "${CrossB[@]}" do for PARENT in "${Parents[@]}" do # Keep only mapped, primary alignments with 0 mismatches samtools view -@ ${THREADS} -m 6G -e '[NM]==0' -F 260 -O BAM \\ ${DIR_ALIGN}/${PARENT}_${FILE}Aligned.sortedByCoord.out.bam \\ > ${DIR_ALIGN_FILTERED}/${PARENT}_${FILE}_sorted.bam # Index the alignments samtools index ${DIR_ALIGN_FILTERED}/${PARENT}_${FILE}_sorted.bam \\ ${DIR_ALIGN_FILTERED}/${PARENT}_${FILE}_sorted.bam.bai # Convert BAM to BED bedtools bamtobed -i ${DIR_ALIGN_FILTERED}/${PARENT}_${FILE}_sorted.bam \\ > ${DIR_ALIGN_FILTERED}/${PARENT}_${FILE}.bed # Sort BED by coordinate sort --parallel=${THREADS} \\ -k1,1 -k2,2n ${DIR_ALIGN_FILTERED}/${PARENT}_${FILE}.bed \\ > ${DIR_ALIGN_FILTERED}/${PARENT}_${FILE}_sorted.bed # Count reads by SNP accounting for strand of alignment bedtools intersect -S -sorted -c -a ${ASET} \\ -b ${DIR_ALIGN_FILTERED}/${PARENT}_${FILE}_sorted.bed \\ > ${DIR_COUNT}/${PARENT}_${FILE}.txt done done Resultant counts files should look identical to SNPs_for_Analysis_Sorted.bed with an additional column containing the read counts: NC_037638.1 54751 54751 snp_99616:LOC107964061 . + 8 NC_037638.1 91396 91396 snp_99617:LOC113219112 . + 7 NC_037638.1 147182 147182 snp_99618:LOC726544 . - 9 In Section 3: SNP-level parent-specific read counts to allele-specific transcription, the resultant count files will be combined into a single matrix, and various statistical methods will be used to assess and compare allele-specific transcription between phenotypes. 
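Before moving on, it can be useful to spot-check one of these count files from R. A minimal sketch is shown below (the file name is one of the counts files produced above; the column names are assigned here only for readability, since bedtools does not write a header):
# Read one SNP-level count file and confirm the expected seven columns:
# chr, start, end, SNP_ID:GeneID, score, strand, count
counts_check <- read.table("AST-tutorial/COUNT/SRR28865843_SRR28865816.txt",
                           header = FALSE, sep = "\\t",
                           col.names = c("chr", "start", "end", "SNP_gene",
                                         "score", "strand", "count"))
head(counts_check)
# Total number of reads assigned to SNPs in this library
sum(counts_check$count)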
"],["snp-level-parent-specific-read-counts-to-allele-specific-transcription-1.html", "3 SNP-level parent-specific read counts to allele-specific transcription 3.1 Overview 3.2 Setup 3.3 Generate sample x SNP:gene count matrix 3.4 Prepare sample x SNP:gene counts for analysis 3.5 Conduct statistical tests 3.6 Allele-specific transcription analysis 3.7 Data visualization 3.8 Session info", " 3 SNP-level parent-specific read counts to allele-specific transcription 3.1 Overview This R tutorial follows quantification of SNP-level parent-specific mRNA-seq read counts and documents a statistical method for comparing allele-specific transcription between phenotypes in offspring derived from reciprocal crosses. Sequencing data for this tutorial are described in Bresnahan et al., 2024, “Intragenomic conflict underlies extreme phenotypic plasticity in queen-worker caste determination in honey bees (Apis mellifera)”, bioRxiv. See the tutorial README for more details. The honey bee gene annotations and the F1 SNPs intersecting honey bee genes generated in Section 2 are imported. SNPs are assigned to the longest transcript for each gene. SNPs intersecting \\(n > 2\\) genes or \\(n = 2\\) genes on the same strand are discarded, as it is impossible to assign SNP-level counts to genes with transcripts that overlap on the same strand. Additionally, SNPs within miRNA, tRNA, and repetitive pseudogenes are discarded as these are inappropriate for allele-specific analysis using short sequencing reads (for details, see: Wang & Clark, 2014, Heredity.) The F2 read counts at F1 SNPs generated in Section 2 are imported. SNPs with \\(n < 1\\) count in any Lineage are discarded. In cases where the distance between SNPs is shorter than the average read length (and thus the read counts are exactly the same for both SNPs), one SNP is chosen at random. Finally, genes with \\(n < 2\\) SNPs are discarded, as this method requires at least two independent observations per gene for statistical analysis. To adjust for differences in sequencing depth between libraries, library size factors are estimated and the read counts are normalized using the median of ratios normalization (MRN) method from DESeq2. For each SNP, a Storer-Kim binomial exact test of two proportions is conducted using the MRN counts to test the hypothesis that the proportion of maternal and paternal read counts are statistically different. A general linear mixed-effects model with interaction terms (GLIMMIX) is fit for each gene to assess the effects of Parent, Lineage, and their interaction on the raw read counts at each SNP, using the log of the library size factors as an offset to adjust for variation in sequencing depth between libraries. A Wald test is performed to assess the statistical significance of the effects. For a gene to be considered as showing parent-of-origin or lineage-of-origin effects, all SNPs are required to exhibit the same directional bias in the Storer-Kim test, and a strictly parent-of-origin or lineage-of-origin effect in the GLIMMIX. To avoid identifying genes with parent-of-origin effects influenced by lineage-of-origin effects, or lineage-of-origin genes influenced by parent-of-origin effects, genes with a significant interaction effect are considered unbiased. 
Additionally, the proportions of Lineage A reads in samples with a Lineage B mother and Lineage A father (proportion 1, or p1), and the proportion of Lineage A reads in samples with a Lineage A mother and Lineage B father (p2), are subjected to a threshold test following methods described in Wang & Clark, 2014, Heredity. 3.2 Setup 3.2.1 Tutorial files Download all files used in this tutorial here or via the command line: wget https://github.com/sbresnahan/AST-tutorial/blob/d99afee8a9b735b8ff927a921cf75ec2997fe322/AST-tutorial.zip 3.2.2 Packages Several CRAN packages are required for this tutorial. Use pacman to install those that are not already on your machine. if (!require("pacman")) install.packages("pacman") pacman::p_load("tidyverse", "plyr", "Rfast", "tryCatchLog", "lmerTest", "lme4", "car", "gridExtra", "doParallel", "ggpubr", "grid", "tagcloud", "ggprism") This tutorial also uses a few Bioconductor packages. Use BiocManager to install them. if (!require("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install(c("DESeq2","genomation","GenomicFeatures")) Load all packages. pkgs <- c("tidyverse", "plyr", "Rfast", "tryCatchLog", "lmerTest", "lme4", "car", "gridExtra", "doParallel", "ggpubr", "grid", "tagcloud", "ggprism","DESeq2","genomation","GenomicFeatures") invisible(lapply(pkgs, function(x) library(x, character.only=TRUE))) 3.2.3 Custom functions For efficiency, most of the code detailed throughout this tutorial has been compiled as custom functions, which are included in the project repository as AST-tutorial/AST_functions.R. These can be loaded into R during setup: source("AST-tutorial/AST_functions.R") In this tutorial, each function will first be written explicitly to document how it works before demonstrating its use. 3.2.4 Sample metadata Many of the custom functions described above reference the sample metadata, which must follow this specific format: sample.id parent block phenotype individual lineage PARENT_FILE M or F integer character FILE A or B Note: for sample.id, PARENT_FILE corresponds to the file name suffix assigned to the SNP-level read count files generated in Section 2, with PARENT being the respective F1 genome to which the F2 mRNA-seq read FILE was aligned. Sample metadata for the sequencing data described in Bresnahan et al., 2024, bioRxiv used in this tutorial has been included in the project repository as AST-tutorial/metadata.csv: metadata <- read.csv("AST-tutorial/metadata.csv") 3.3 Generate sample x SNP:gene count matrix 3.3.1 Filter SNPs by transcript The honey bee gene annotations saved as $AST-tutorial/{DIR_INDEX}/Amel_HAv3.1_genes.gff3 and the F1 SNPs intersecting honey bee genes saved as $AST-tutorial/{DIR_ANALYSIS}/SNPs_for_Analysis_Sorted.bed generated in Section 2 are imported. SNPs are assigned to the longest transcript for each gene, and SNPs within miRNA, tRNA, and repetitive pseudogenes and are discarded. A list of these genes has been included in the project repository as AST-tutorial/genelist_filter.csv. 
# Load in filter list filterlist <- read.csv("AST-tutorial/genelist_filter.csv",header=F)[,c(1)] # Make txdb from the honey bee gene annotations txdb <- makeTxDbFromGFF("AST-tutorial/INDEX/Amel_HAv3.1.gff") transcripts <- transcripts(txdb) Function: filter_SNPs filter_SNPs <- function(SNP_gene_bed,exons,filterlist){ ## SNP_gene_bed = bed file of SNPs intersecting genes ## exons = GRanges object containing exon regions ## filterlist = list of gene IDs to filter # Load in the SNPs BED file SNPs <- readBed(SNP_gene_bed) # Get SNPs that overlap with transcripts SNPs.exons <- findOverlaps(SNPs,exons) SNPs <- SNPs[queryHits(SNPs.exons)] SNPs <- data.frame(SNPs)[,c(1,2,7)] names(SNPs) <- c("chr","pos","SNP_gene") # Create SNP and geneID columns by splitting SNP_gene on ":" SNPs$SNP <- as.character(map(strsplit(SNPs$SNP_gene,split = ":"), 1)) SNPs$geneID <- as.character(map(strsplit(SNPs$SNP_gene,split = ":"), 2)) # Filter SNPs in genes within filterlist SNPs <- SNPs[!SNPs$geneID%in%filterlist,] # Delete any duplicate rows and clean up SNPs <- SNPs[!duplicated(SNPs),] rm(SNPs.exons) # Output return(SNPs) } # Execute SNPs <- filter_SNPs("AST-tutorial/ANALYSIS_SETS/SNPs_for_Analysis_Sorted.bed", transcripts,filterlist) # Save output write.csv(SNPs,"AST-tutorial/ANALYSIS/INTERMEDIATES/SNPs.csv",row.names=F) Example rows: chr pos SNP_gene SNP geneID NC_037638.1 54752 snp_99616:LOC107964061 snp_99616 LOC107964061 NC_037638.1 91397 snp_99617:LOC113219112 snp_99617 LOC113219112 NC_037638.1 147183 snp_99618:LOC726544 snp_99618 LOC726544 NC_037638.1 148247 snp_99619:LOC726544 snp_99619 LOC726544 NC_037638.1 148360 snp_99620:LOC726544 snp_99620 LOC726544 NC_037638.1 148370 snp_99621:LOC726544 snp_99621 LOC726544 3.3.2 Merge count files to matrix The F2 read counts at F1 SNPs generated in Section 2 are imported. In cases where the distance between SNPs is shorter than the average read length (and thus the read counts are exactly the same for both SNPs), one SNP is chosen at random. 
Function: make_ASE_counts_matrix make_ASE_counts_matrix <- function(DIR_COUNTS,cores,SNPs){ ## DIR_COUNTS = name of subdirectory containing count files ## cores = number of threads for parallel processing ## SNPs = dataframe generated by filter_SNPs2 # Read in data ## List files ending in .txt in DIR_COUNTS SNP_counts <- data.frame(matrix(ncol=0,nrow=length(SNPs$SNP_gene))) SNP_counts$SNP_gene <- SNPs$SNP_gene files.counts <- list.files(path=DIR_COUNTS, pattern="*.txt", full.names=TRUE, recursive=FALSE) ## Left join count files by SNP_ID:geneID for(i in 1:length(files.counts)){ print(i) tmp.name <- strsplit(strsplit(files.counts[[i]], split = "/")[[1]][length(strsplit(files.counts[[i]], split = "/")[[1]])], split = "[.]")[[1]][1] if(tmp.name%in%metadata$sample.id){ tmp <- read.table(files.counts[[i]],header=F)[,c(4,7)] names(tmp) <- c("SNP_gene",tmp.name) SNP_counts <- SNP_counts %>% left_join(tmp, by = c('SNP_gene' = 'SNP_gene')) } } rm(tmp,files.counts,tmp.name) # Clean up dataframe row.names(SNP_counts) <- SNP_counts$SNP_gene SNP_counts$SNP_gene <- NULL SNP_counts[is.na(SNP_counts)] <- 0 SNP_counts$gene <- as.character(map(strsplit(row.names(SNP_counts), split = ":"), 2)) # Remove duplicate rows within genes genelist <- unique(SNP_counts$gene) delete.rows <- list() ## Use DoParallel to perform loop operation in parallel registerDoParallel(cores=cores) delete.rows <- foreach(i=1:length(genelist)) %dopar% { tmp <- SNP_counts[SNP_counts$gene==genelist[i],] d <- row.names(tmp[duplicated(tmp),]) delete.rows <- return(d) } SNP_counts <- SNP_counts[!row.names(SNP_counts)%in%unlist(delete.rows),] # Output return(SNP_counts) } # Execute SNP_counts <- make_ASE_counts_matrix("AST-tutorial/COUNT",4,SNPs) # Save output write.csv(SNP_counts,"AST-tutorial/ANALYSIS/INTERMEDIATES/SNP_gene_counts.csv") Example rows: B3D_CMG0017_S9 B3D_CMG0018_S10 snp_99616:LOC107964061 8 7 snp_99617:LOC113219112 9 6 snp_99618:LOC726544 7 11 snp_99619:LOC726544 3 7 snp_1:LOC726544 3 7 snp_99623:LOC726544 3 8 3.4 Prepare sample x SNP:gene counts for analysis 3.4.1 Normalize by library size In spite of every effort to standardize the library preparation and sequencing procedures, there are innumerable sources of between-sample variation that cannot be controlled for which will result in variation in sequencing depth between libraries. This is apparent here: To adjust for differences in sequencing depth between libraries, library size factors are estimated and the read counts are normalized using the median of ratios normalization (MRN) method from DESeq2. 
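As a concrete illustration of what the median-of-ratios method computes, here is a toy sketch of the arithmetic behind DESeq2's estimateSizeFactors() (the real implementation also handles genes with zero counts and works on the full data set, so treat this purely as intuition):
# Toy counts matrix: rows = SNPs, columns = libraries
counts_toy <- matrix(c(10, 20, 5,
                       15, 30, 9,
                       100, 180, 60),
                     nrow = 3, byrow = TRUE)
# 1. Geometric mean of each row (a pseudo-reference library)
geo_means <- exp(rowMeans(log(counts_toy)))
# 2. Ratio of each library's counts to the pseudo-reference
ratios <- counts_toy / geo_means
# 3. Size factor for each library = median of its ratios
size_factors_toy <- apply(ratios, 2, median)
# 4. Normalized counts = raw counts divided by the library's size factor
normalized_toy <- sweep(counts_toy, 2, size_factors_toy, "/")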
First, sum the parent-specific counts by library and estimate library size factors: Function: calcSizeFactors calcSizeFactors <- function(counts,mdata){ ## counts = counts matrix generated by make_ASE_counts_matrix ## mdata = dataframe containing sample metadata # Merge parent-specific counts by library counts <- counts[,names(counts)%in%mdata$sample.id] counts_merged <- data.frame(matrix(ncol=0,nrow=length(row.names(counts)))) samples <- unique(mdata$individual) for(i in 1:length(samples)){ tmp <- counts[,names(counts)%in%mdata[mdata$individual==samples[i],"sample.id"]] tmp.merged <- rowSums(tmp) counts_merged <- cbind(counts_merged,tmp.merged) } names(counts_merged) <- samples mdata.merged <- mdata[,c("individual", "phenotype", "lineage")] mdata.merged <- mdata.merged[!duplicated(mdata.merged),] # Estimate library size factors using the median of ratios normalization method from DESeq2 dds <- DESeqDataSetFromMatrix(countData = counts_merged, colData = mdata.merged, design = ~ lineage + phenotype) dds <- estimateSizeFactors(dds) sizeFactors <- sizeFactors(dds) sF2meta <- data.frame(individual=names(sizeFactors)) sF2meta$sF <- as.numeric(sizeFactors) sF2meta <- left_join(sF2meta,mdata[,c("individual", "sample.id")], multiple = "all") sF2meta <- sF2meta[,c(3,2)] sF2meta <- sF2meta[match(names(counts), sF2meta$sample.id),] sFs <- sF2meta$sF names(sFs) <- sF2meta$sample.id # Output return(sFs) } # Execute size_factors <- calcSizeFactors(SNP_counts,metadata) Second, perform the MRN method on the parent-specific counts using the library size factors calculated above. Function: normalizeASReadCounts normalizeASReadCounts <- function(counts,size_factors){ ## counts = counts matrix generated by make_ASE_counts_matrix ## size_factors = array from calcSizeFactors # Normalize counts for each library by parent mdata <- metadata[metadata$sample.id%in%names(counts),] counts <- counts[,names(counts)%in%mdata$sample.id] as.dds <- DESeqDataSetFromMatrix(countData = counts, colData = mdata, design = ~ lineage+phenotype) sizeFactors(as.dds) = size_factors counts_normalized <- data.frame(counts(as.dds, normalized=TRUE)) # Output return(counts_normalized) } # Execute SNP_counts_normalized <- normalizeASReadCounts(SNP_counts,size_factors) # Save output write.csv(SNP_counts_normalized,"AST-tutorial/ANALYSIS/INTERMEDIATES/SNP_gene_counts_normalized.csv") 3.4.2 Split by phenotype and filter low count SNPs Further filtering steps are performed after splitting the count matrix by phenotype. This is important as the set of transcribed genes is expected to vary between phenotypes. For the remainder of the tutorial, these phenotypes will be referred to in the code as “WL” and “QL” to match the sample metadata. WL.IDs <- metadata[metadata$phenotype=="WL","sample.id"] WL_counts_normalized <- SNP_counts_normalized[,names(SNP_counts_normalized)%in%WL.IDs] WL_counts <- SNP_counts[row.names(SNP_counts)%in%row.names(WL_counts_normalized), names(SNP_counts)%in%WL.IDs] QL.IDs <- metadata[metadata$phenotype=="QL","sample.id"] QL_counts_normalized <- SNP_counts_normalized[,names(SNP_counts_normalized)%in%QL.IDs] QL_counts <- SNP_counts[row.names(SNP_counts)%in%row.names(QL_counts_normalized), names(SNP_counts)%in%QL.IDs] SNPs with \\(n < 1\\) count in any Lineage are discarded. Finally, genes with \\(n < 2\\) SNPs are discarded, as this method requires at least two independent observations per gene for statistical analysis. Note: here is where the utility of functions really becomes apparent. 
For the remainder of the tutorial, the same functions will be applied to the “WL” and “QL” count matrices. Function: filter_counts filter_counts <- function(counts,lcf){ # counts = phenotype-specific counts matrix # lcf = low count filter threshold (integer) # Remove rows with < lcf counts counts by Lineage LA <- metadata[metadata$lineage=="A","sample.id"] LB <- metadata[metadata$lineage=="B","sample.id"] counts <- counts[rowSums(counts[,names(counts)%in%LA])>lcf,] counts <- counts[rowSums(counts[,names(counts)%in%LB])>lcf,] # Flag rows with greater than 10000 counts ## Different functions are used for SK tests with < and > 10000 counts for computational efficiency counts$SUM <- rowSums(counts) counts$SKrow <- F counts[counts$SUM<10000,"SKrow"] <- T counts$SUM <- NULL # Remove genes with < 2 SNPs counts$gene <- as.character(map(strsplit(row.names(counts), split = ":"), 2)) genelist <- unique(counts$gene) delete.rows <- list() for(i in 1:length(genelist)){ tmp <- counts[counts$gene==genelist[i],] tmp <- tmp[!duplicated(tmp),] if(length(row.names(tmp))<2){ delete.rows <- append(delete.rows,genelist[i]) } } counts <- counts[!counts$gene%in%unlist(delete.rows),] counts$gene <- NULL # Return filtered counts return(counts) } # Execute WL_counts_normalized <- filter_counts(WL_counts_normalized,0) WL_counts <- WL_counts[row.names(WL_counts)%in%row.names(WL_counts_normalized),] QL_counts_normalized <- filter_counts(QL_counts_normalized,0) QL_counts <- QL_counts[row.names(QL_counts)%in%row.names(QL_counts_normalized),] # Save output write.csv(WL_counts_normalized,"AST-tutorial/ANALYSIS/INTERMEDIATES/G53xY39_WL_counts_normalized.csv") write.csv(WL_counts,"AST-tutorial/ANALYSIS/INTERMEDIATES/G53xY39_WL_counts.csv") write.csv(QL_counts_normalized,"AST-tutorial/ANALYSIS/INTERMEDIATES/G53xY39_QL_counts_normalized.csv") write.csv(QL_counts,"AST-tutorial/ANALYSIS/INTERMEDIATES/G53xY39_QL_counts.csv") 3.5 Conduct statistical tests The data are now ready for statistical analysis of allele-specific transcription. Storer-Kim, GLIMMIX, and threshold tests will be performed for each phenotype, separately, and then a Chi-squared test will be performed to compare the distribution of genes showing parent- and lineage-biased transcription between phenotypes. 3.5.1 SNP-level Storer-Kim (SK) binomial exact tests For each SNP, a Storer-Kim binomial exact test of two proportions is conducted using the MRN counts to test the hypothesis that the proportion of maternal and paternal read counts are statistically different. Function: twobinom This test is performed using a modified version of the twobinom function from WRS2. For computational efficiency, the outer() command has been replaced with Rfast::Outer(). 
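# In outline, twobinom() below computes an exact two-sample binomial p-value:
# the two samples are pooled to estimate a common success probability (phat),
# and the p-value is the sum of the joint binomial probabilities of every pair
# of outcomes whose difference in proportions is at least as large as the
# observed difference (r1/n1 - r2/n2).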
twobinom<-function(r1,n1,r2,n2,alpha=.05){ # r1 = success in group 1 # n1 = total in group 1 # r2 = success in group 2 # r2 = total in group 2 n1p<-n1+1 n2p<-n2+1 n1m<-n1-1 n2m<-n2-1 q <- r1/n1 p <- r2/n2 if(is.na(q)){q <- 0} if(is.na(p)){p <- 0} chk<-abs(q-p) x<-c(0:n1)/n1 y<-c(0:n2)/n2 phat<-(r1+r2)/(n1+n2) m1<-t(Outer(x,y,"-")) m2<-matrix(1,n1p,n2p) flag<-(abs(m1)>=chk) m3<-m2*flag rm(m1,m2,flag) xv<-c(1:n1) yv<-c(1:n2) xv1<-n1-xv+1 yv1<-n2-yv+1 dis1<-c(1,pbeta(phat,xv,xv1)) dis2<-c(1,pbeta(phat,yv,yv1)) pd1<-NA pd2<-NA for(i in 1:n1){pd1[i]<-dis1[i]-dis1[i+1]} for(i in 1:n2){pd2[i]<-dis2[i]-dis2[i+1]} pd1[n1p]<-phat^n1 pd2[n2p]<-phat^n2 m4<-t(Outer(pd1,pd2,"*")) test<-sum(m3*m4) rm(m3,m4) list(p.value=test,p1=q,p2=p,est.dif=q-p) } Function: AST.SK This is a wrapper function to perform the Storer-Kim test comparing the proportion of maternal to paternal read counts on each SNP in the count matrix. The Storer-Kim test has greater statistical power for low count observations, but is computationally expensive to perform and crashes when the total counts for a SNP are n > 10,000. Therefore, filter_counts flagged these rows, and AST.SK will instead perform a Fisher’s Exact test of two proportions, which is equivalent to a Storer-Kim test for high count observations. The test p-value is reported for each SNP. Note: because this test is computationally expensive, executing AST.SK will take approximately 2hrs per 10,000 SNPs. AST.SK <- function(counts,phenotype,cores){ # counts = phenotype-specific counts matrix # phenotype = corresponding phenotype of counts # cores = # of threads for multithreaded search # Split data by pat and mat pat.exp <- counts[,metadata[metadata$parent%in%c("M")& metadata$phenotype==phenotype,"sample.id"]] mat.exp <- counts[,metadata[metadata$parent%in%c("F")& metadata$phenotype==phenotype,"sample.id"]] # Set up for DoParallel SKrows=counts$SKrow registerDoParallel(cores=cores) i.len=length(row.names(pat.exp)) # For each row, conduct an SK test and return the p-value return.df <- foreach(i=1:i.len, .combine=rbind, .export=ls(globalenv()),.packages="Rfast") %dopar% { SNP_gene=row.names(pat.exp[i,]) p1.s=sum(pat.exp[i,]) p2.s=sum(mat.exp[i,]) p.o=sum(p1.s,p2.s) if(SKrows[i]==T){ test=twobinom(r1=p1.s,n1=p.o, r2=p2.s,n2=p.o)$p.value }else{ test=fisher.test(matrix(c(p1.s,p2.s, p2.s,p1.s),ncol = 2))$p.value } return.append=data.frame(SNP_gene=SNP_gene,p=test) return(return.append) } return.df=return.df[match(row.names(pat.exp), return.df$SNP_gene),] # Return return(return.df) } # Execute WL.SK <- AST.SK(WL_counts_normalized,"WL",6) QL.SK <- AST.SK(QL_counts_normalized,"QL",6) # Save output write.csv(WL.SK,"AST-tutorial/ANALYSIS/RESULTS/WLSK.csv", row.names=F) write.csv(QL.SK,"AST-tutorial/ANALYSIS/RESULTS/QLSK.csv", row.names=F) 3.5.2 Gene-level general linear mixed models (GLIMMIX) A general linear mixed-effects model with interaction terms (GLIMMIX) is fit for each gene to assess the effects of Parent, Lineage, and their interaction on the raw read counts at each SNP, using the log of the library size factors as an offset to adjust for variation in sequencing depth between libraries. Function: AST.GLIMMIX This is a wrapper function to fit a GLIMMIX for each gene in the count matrix and perform Wald tests to assess the statistical significance of the effects (parent, lineage, and their interaction). Wald test p-values are reported for each effect for each gene. Genes for which a model could not be fit due to errors are returned with a p-value of “1” for each effect. 
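# The model fit for each gene below is:
#   count ~ parent + lineage + parent:lineage
#           + (1 | SNP_gene) + (1 | individual) + offset(log(sizeFactor))
# fit with lmer() from lme4/lmerTest; the reported p-values are taken from rows
# 2-4, column 5 of summary()$coefficients (the parent, lineage, and
# parent x lineage terms, respectively).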
Function: AST.GLIMMIX

This is a wrapper function to fit a GLIMMIX for each gene in the count matrix and perform Wald tests to assess the statistical significance of the effects (parent, lineage, and their interaction). Wald test p-values are reported for each effect for each gene. Genes for which a model could not be fit due to errors are returned with a p-value of “1” for each effect.

AST.GLIMMIX <- function(counts,size_factors,cores){
  # counts = phenotype-specific counts matrix
  # size_factors = object from calcSizeFactors
  # cores = number of threads for parallel execution

  sizeFactors <- data.frame(t(size_factors[names(size_factors)%in%names(counts)]))
  sizeFactors <- gather(sizeFactors, sample.id, sizeFactor,
                        names(sizeFactors),factor_key=FALSE)
  counts$SNP_gene <- row.names(counts)
  counts$geneID <- as.character(unlist(map(strsplit(counts$SNP_gene, split = ":"), 2)))
  genelist <- unique(counts$geneID)
  registerDoParallel(cores=cores)
  i.len <- length(genelist)
  df.out <- foreach(i=1:i.len,.combine=rbind) %dopar% {
    counts.sub <- counts[counts$geneID==genelist[i],]
    counts.sub$geneID <- NULL
    counts.sub <- gather(counts.sub, sample.id, count,
                         names(counts.sub), -SNP_gene, factor_key=TRUE)
    counts.sub <- join(counts.sub, metadata, by = "sample.id")
    counts.sub <- join(counts.sub,sizeFactors,by="sample.id")
    counts.sub$parent <- as.factor(str_sub(counts.sub$parent,-1,-1))
    counts.sub$SNP_gene <- as.factor(counts.sub$SNP_gene)
    counts.sub$lineage <- as.factor(counts.sub$lineage)
    counts.sub$individual <- as.factor(counts.sub$individual)
    testfail <- F
    test <- "null"
    tryCatchLog(test <- lmer(count~parent+lineage+parent*lineage+(1|SNP_gene)+
                               (1|individual)+offset(log(sizeFactor)),data=counts.sub),
                error = function(e) {testfail <<- T})
    if(class(test)=="character"){testfail <- T}
    if(testfail==F){
      test <- summary(test)
      parent.p.list <- test[["coefficients"]][2,5]
      Lineage.p.list <- test[["coefficients"]][3,5]
      parent.Lineage.p.list <- test[["coefficients"]][4,5]
    }else{
      parent.p.list <- 1
      Lineage.p.list <- 1
      parent.Lineage.p.list <- 1
    }
    return(data.frame(ID=genelist[i],
                      parent.p=parent.p.list,
                      Lineage.p=Lineage.p.list,
                      parentXLineage.p=parent.Lineage.p.list))
  }
  return(df.out)
}

# Execute
WL.GLIMMIX <- AST.GLIMMIX(WL_counts,size_factors,6)
QL.GLIMMIX <- AST.GLIMMIX(QL_counts,size_factors,6)

# Save output
write.csv(WL.GLIMMIX,"AST-tutorial/ANALYSIS/RESULTS/WLGLIMMIX.csv", row.names=F)
write.csv(QL.GLIMMIX,"AST-tutorial/ANALYSIS/RESULTS/QLGLIMMIX.csv", row.names=F)

3.6 Allele-specific transcription analysis

3.6.1 FDR correction and threshold tests

The False Discovery Rate (FDR) is now computed for the Storer-Kim and GLIMMIX (Wald) test results, requiring FDR < 0.05 for significance. For a gene to be considered as showing parent-of-origin or lineage-of-origin effects, all SNPs are required to exhibit the same directional bias (i.e., maternal or paternal) in the Storer-Kim test, and a strictly parent-of-origin or lineage-of-origin effect in the GLIMMIX. To avoid identifying genes with parent-of-origin effects influenced by lineage-of-origin effects, or lineage-of-origin genes influenced by parent-of-origin effects, genes with a significant interaction effect are considered unbiased.

Additionally, the proportion of Lineage A reads in samples with a Lineage B mother and Lineage A father (proportion 1, or p1), and the proportion of Lineage A reads in samples with a Lineage A mother and Lineage B father (p2), are subjected to a threshold test following methods described in Wang & Clark, 2014, Heredity. Specifically, thresholds of p1<0.4 and p2>0.6 are required for maternal bias, p1>0.6 and p2<0.4 for paternal bias, p1<0.4 and p2<0.4 for lineage B bias, and p1>0.6 and p2>0.6 for lineage A bias.
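These rules can be collected into a small helper for reference. The sketch below is illustrative only (the pipeline applies the same logic inside AST.Analysis, shown next); it assumes an FDR-adjusted Storer-Kim p-value plus p1 and p2 for a single SNP.

# Illustrative helper (not used by the pipeline): map an FDR-adjusted
# Storer-Kim p-value and the p1/p2 proportions for one SNP to a bias call
classify_bias <- function(padj, p1, p2, alpha = 0.05){
  if(is.na(padj) || padj >= alpha) return("NA")
  if(p1 > 0.6 & p2 < 0.4) return("pat")
  if(p1 < 0.4 & p2 > 0.6) return("mat")
  if(p1 < 0.4 & p2 < 0.4) return("Lineage B")
  if(p1 > 0.6 & p2 > 0.6) return("Lineage A")
  "NA"
}
classify_bias(0.01, p1 = 0.85, p2 = 0.12)   # "pat"
classify_bias(0.20, p1 = 0.85, p2 = 0.12)   # "NA" (not significant)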
Function: AST.Analysis

AST.Analysis <- function(counts,phenotype,SK,GLIMMIX){
  # counts = phenotype-specific count matrix
  # phenotype = corresponding phenotype of counts
  # SK = object from AST.SK
  # GLIMMIX = object from AST.GLIMMIX

  # Split count matrices by lineage and parent for plotting
  counts <- counts[,names(counts)%in%metadata$sample.id]
  p1.pat <- counts[,metadata[metadata$parent%in%c("M")&metadata$lineage=="B"&metadata$phenotype==phenotype,"sample.id"]]
  p1.mat <- counts[,metadata[metadata$parent%in%c("F")&metadata$lineage=="B"&metadata$phenotype==phenotype,"sample.id"]]
  p2.pat <- counts[,metadata[metadata$parent%in%c("M")&metadata$lineage=="A"&metadata$phenotype==phenotype,"sample.id"]]
  p2.mat <- counts[,metadata[metadata$parent%in%c("F")&metadata$lineage=="A"&metadata$phenotype==phenotype,"sample.id"]]

  # Set up a data.frame to plot p1 and p2 for each SNP
  p1.plot <- data.frame(rowSums(p1.pat)/(rowSums(p1.mat)+rowSums(p1.pat)))
  names(p1.plot) <- c("p1")
  p1.plot[is.nan(p1.plot$p1),"p1"] <- 0
  p2.plot <- data.frame(rowSums(p2.mat)/(rowSums(p2.mat)+rowSums(p2.pat)))
  names(p2.plot) <- c("p2")
  p2.plot[is.nan(p2.plot$p2),"p2"] <- 0
  plot <- cbind(p1.plot,p2.plot)

  # Join results of Storer-Kim tests
  plot <- plot[row.names(plot)%in%SK$SNP_gene,]
  plot$SK.p <- SK$p
  plot$SNP_gene <- row.names(plot)
  plot$gene <- as.character(map(strsplit(plot$SNP_gene, split = ":"), 2))

  # Reformat output from GLIMMIX models
  GLIMMIX.biased <- data.frame(gene=GLIMMIX$ID,
                               parent.p=GLIMMIX$parent.p,
                               Lineage.p=GLIMMIX$Lineage.p,
                               parentXLineage.p=GLIMMIX$parentXLineage.p)

  # Correct for multiple testing
  plot$SK.padj <- p.adjust(plot$SK.p,"BH")
  plot$bias <- "NA"
  GLIMMIX$parent.padj <- p.adjust(GLIMMIX$parent.p,"BH")
  GLIMMIX$Lineage.padj <- p.adjust(GLIMMIX$Lineage.p,"BH")
  GLIMMIX$parentXLineage.padj <- p.adjust(GLIMMIX$parentXLineage.p,"BH")
  GLIMMIX.biased <- GLIMMIX[GLIMMIX$parent.padj<0.05|GLIMMIX$Lineage.padj<0.05,1]
  GLIMMIX.biased <- setdiff(GLIMMIX.biased,GLIMMIX[GLIMMIX$parentXLineage.padj<0.05,1])

  # For each gene, check whether all SNPs are biased in the same direction at established thresholds
  ## Genes with parentXLineage effects are flagged as unbiased
  for(i in 1:length(row.names(plot))){
    p <- plot[i,"SK.padj"]
    p1 <- plot[i,"p1"]
    p2 <- plot[i,"p2"]
    if(p<0.05&p1>0.6&p2<0.4){plot[i,"bias"] <- "pat"}
    if(p<0.05&p1<0.4&p2>0.6){plot[i,"bias"] <- "mat"}
    if(p<0.05&p1<0.4&p2<0.4){plot[i,"bias"] <- "Lineage B"}
    if(p<0.05&p1>0.6&p2>0.6){plot[i,"bias"] <- "Lineage A"}
  }
  biaslist <- data.frame(matrix(ncol=2,nrow=0))
  names(biaslist) <- c("gene","bias")
  genelist <- unique(plot$gene)
  for(i in 1:length(genelist)){
    tmp <- unique(plot[plot$gene==genelist[i],"bias"])
    if(length(tmp)>1){
      if(length(tmp)==2){
        if(any(tmp%in%"NA")){
          bias <- tmp[!tmp%in%"NA"]
        }else{bias <- "NA"}
      }else{
        bias <- "NA"
      }
    }else{bias <- tmp}
    biaslist <- rbind(biaslist,data.frame(gene=genelist[[i]],
                                          bias=bias))
  }
  plot <- plot %>% left_join(biaslist, by = c('gene' = 'gene'))
  names(plot)[c(7:8)] <- c("xbias","bias")
  plot$bias.plot <- "NA"
  for(i in 1:length(row.names(plot))){
    p1 <- plot$p1[i]
    p2 <- plot$p2[i]
    bias <- plot$bias[i]
    if(!bias=="NA"){
      if(bias=="pat"){if(p1>0.6&p2<0.4){plot[i,"bias.plot"] <- "pat"}}
      if(bias=="mat"){if(p1<0.4&p2>0.6){plot[i,"bias.plot"] <- "mat"}}
      if(bias=="Lineage B"){if(p1<0.4&p2<0.4){plot[i,"bias.plot"] <- "Lineage B"}}
      if(bias=="Lineage A"){if(p1>0.6&p2>0.6){plot[i,"bias.plot"] <- "Lineage A"}}
    }
  }
  plot[!plot$gene%in%GLIMMIX.biased,"bias.plot"] <- "NA"
  plot <- rbind(plot[plot$bias.plot%in%c("NA"),],
                plot[plot$bias.plot%in%c("mat","Lineage A","Lineage B","pat"),])
  plot$bias.plot <- factor(plot$bias.plot,
                           levels = c("NA","mat","Lineage A","Lineage B","pat"))

  # Return
  return(plot)
}

# Execute
WL.plot <- AST.Analysis(WL_counts_normalized,"WL",WL.SK,WL.GLIMMIX)
QL.plot <- AST.Analysis(QL_counts_normalized,"QL",QL.SK,QL.GLIMMIX)

# Save output
write.csv(WL.plot,"AST-tutorial/ANALYSIS/RESULTS/WLAST.csv",row.names=F)
write.csv(QL.plot,"AST-tutorial/ANALYSIS/RESULTS/QLAST.csv",row.names=F)

The last three columns of the dataframe produced by AST.Analysis are informative of SNP- and gene-level allelic bias:

xbias = SNP-level bias given 1) the threshold test applied to p1 and p2, and 2) the FDR-corrected SK-test p-value (SK.padj)

bias = Gene-level bias given consistency of the xbias value across all SNPs in the gene

bias.plot = Gene-level bias given bias and the FDR-corrected GLIMMIX (Wald test) p-values. This is the reported value of allelic bias for each gene, and will be used to quantify the number of genes showing each category of allelic bias for performing Chi-squared tests.

3.6.2 Chi-squared tests

Finally, allele-specific transcription is compared between phenotypes. Specifically, Chi-squared tests are performed for each category of allelic bias to test the hypothesis that the distributions of biased and unbiased genes in each phenotype are statistically different. For these tests, we will use the honey bee gene set (saved as ${DIR_INDEX}/Amel_HAv3.1_geneIDs.txt in Section 2), minus the genes in filterlist, as the “gene universe”, i.e., all possible genes that could have been expressed in these samples and tested for allele-specific transcription.

allgenes <- read.table("AST-tutorial/INDEX/Amel_HAv3.1_geneIDs.txt",header=F)[,1]
allgenes <- setdiff(allgenes,filterlist)

Function: AST.chisq

This is a wrapper function to perform these tests and format a table for plotting.
AST.chisq <- function(pheno1.label,pheno1.plot,
                      pheno2.label,pheno2.plot,
                      allgenes){
  # pheno1.label = label for phenotype 1
  # pheno1.plot = object from AST.Analysis
  # pheno2.label = label for phenotype 2
  # pheno2.plot = object from AST.Analysis
  # allgenes = list containing background set of genes

  # Collapse pheno plots by gene
  pheno1.plot <- pheno1.plot[,c("gene","bias.plot")]
  pheno1.plot <- pheno1.plot[!duplicated(pheno1.plot),]
  pheno2.plot <- pheno2.plot[,c("gene","bias.plot")]
  pheno2.plot <- pheno2.plot[!duplicated(pheno2.plot),]

  # Quantify genes in each category of bias and combine into a table
  ## (the placeholder column names are replaced with the phenotype labels below)
  gmid.df <- data.frame(
    Unresponsive=c(length(unique(pheno1.plot[pheno1.plot$bias.plot=="mat","gene"])),
                   length(unique(pheno1.plot[pheno1.plot$bias.plot=="Lineage A","gene"])),
                   length(unique(pheno1.plot[pheno1.plot$bias.plot=="Lineage B","gene"])),
                   length(unique(pheno1.plot[pheno1.plot$bias.plot=="pat","gene"]))),
    Bias=c("mat","Lineage A","Lineage B","pat"),
    Responsive=c(length(unique(pheno2.plot[pheno2.plot$bias.plot=="mat","gene"])),
                 length(unique(pheno2.plot[pheno2.plot$bias.plot=="Lineage A","gene"])),
                 length(unique(pheno2.plot[pheno2.plot$bias.plot=="Lineage B","gene"])),
                 length(unique(pheno2.plot[pheno2.plot$bias.plot=="pat","gene"]))))

  # Perform Chi-squared tests
  mat.test <- chisq.test(data.frame(Success=c(gmid.df[1,1],gmid.df[1,3]),
                                    Failure=c(length(allgenes)-gmid.df[1,1],
                                              length(allgenes)-gmid.df[1,3]),
                                    row.names=c(pheno1.label,pheno2.label)),
                         correct=F)$p.value
  LineageA.test <- chisq.test(data.frame(Success=c(gmid.df[2,1],gmid.df[2,3]),
                                         Failure=c(length(allgenes)-gmid.df[2,1],
                                                   length(allgenes)-gmid.df[2,3]),
                                         row.names=c(pheno1.label,pheno2.label)),
                              correct=F,simulate.p.value = TRUE)$p.value
  LineageB.test <- chisq.test(data.frame(Success=c(gmid.df[3,1],gmid.df[3,3]),
                                         Failure=c(length(allgenes)-gmid.df[3,1],
                                                   length(allgenes)-gmid.df[3,3]),
                                         row.names=c(pheno1.label,pheno2.label)),
                              correct=F,simulate.p.value = TRUE)$p.value
  pat.test <- chisq.test(data.frame(Success=c(gmid.df[4,1],gmid.df[4,3]),
                                    Failure=c(length(allgenes)-gmid.df[4,1],
                                              length(allgenes)-gmid.df[4,3]),
                                    row.names=c(pheno1.label,pheno2.label)),
                         correct=F)$p.value

  # Build table for plotting
  names(gmid.df) <- c(pheno1.label,"Bias",pheno2.label)
  gmid.df$p <- c(mat.test,LineageA.test,LineageB.test,pat.test)
  gmid.df <- gmid.df[,c(4,1,2,3)]
  nsrows <- row.names(gmid.df[gmid.df$p>0.05,])
  gmid.df$p <- formatC(gmid.df$p, format = "e", digits = 2)
  gmid.df[nsrows,"p"] <- "(ns)"
  gmid.df <- gmid.df[,c(2,3,4,1)]

  # Output
  return(gmid.df)
}

# Execute
chisq.df <- AST.chisq("WL",WL.plot,"QL",QL.plot,allgenes)

We can see from this table that the QL group has more genes with paternal allele-biased transcription than the WL group, while the other categories of allelic bias are comparable between groups:

WL    Bias        QL    p
20    mat         11    (ns)
1     Lineage A   0     (ns)
3     Lineage B   2     (ns)
51    pat         92    5.84e-04
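For intuition, each row of this table reduces to a single 2×2 chi-squared test of biased vs. non-biased genes in each caste against the gene universe. Using the paternal-bias counts from the table above and the allgenes object defined earlier, that comparison can be checked directly:

# Sanity check of the paternal-bias row: 51 (WL) vs 92 (QL) biased genes
# out of the gene universe, without continuity correction (as in AST.chisq)
chisq.test(data.frame(Success = c(51, 92),
                      Failure = c(length(allgenes) - 51,
                                  length(allgenes) - 92),
                      row.names = c("WL","QL")),
           correct = FALSE)$p.value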
3.7 Data visualization

In this final section, the SNP-level counts are aggregated by gene to calculate and plot allelic transcription on p1 and p2. Data are plotted for each phenotype separately, and then combined into a single plot separated by the Chi-squared test result table generated in the previous section.

First, reorder the data generated by AST.Analysis so that unbiased genes appear first:

WL.plot <- rbind(WL.plot[WL.plot$bias.plot%in%c("NA"),],
                 WL.plot[WL.plot$bias.plot%in%c("mat","Lineage A","Lineage B","pat"),])
WL.plot$bias.plot <- factor(WL.plot$bias.plot,
                            levels = c("NA","mat","Lineage A","Lineage B","pat"))
QL.plot <- rbind(QL.plot[QL.plot$bias.plot%in%c("NA"),],
                 QL.plot[QL.plot$bias.plot%in%c("mat","Lineage A","Lineage B","pat"),])
QL.plot$bias.plot <- factor(QL.plot$bias.plot,
                            levels = c("NA","mat","Lineage A","Lineage B","pat"))

Function: AST.collapse

Next, use this function to collapse the data generated by AST.Analysis and calculate p1 and p2 by gene:

AST.collapse <- function(data.plot,data.counts,phenotype){
  # data.plot = object from AST.Analysis
  # data.counts = phenotype-specific counts matrix
  # phenotype = corresponding phenotype of data.plot and data.counts

  # Wrangle data
  data.counts <- data.counts[,names(data.counts)%in%metadata$sample.id]
  data.counts$SNP_gene <- row.names(data.counts)
  data <- data.plot %>% left_join(data.counts, by = c('SNP_gene' = 'SNP_gene'))
  genelist <- unique(data$gene)
  p1.mean <- list()
  p2.mean <- list()
  biaslist <- list()
  altbias <- list()

  # Collapse SNPs by gene
  for(i in 1:length(genelist)){
    tmp <- data[data$gene==genelist[i],]
    altbias[i] <- as.character(tmp$bias)[1]
    if(!any(tmp$bias=="NA") & length(tmp[!tmp$bias.plot=="NA","p1"])>0){
      tmp.sub <- tmp[!tmp$bias.plot=="NA",]
      # Split count matrices by Lineage and parent of origin for plotting
      p1.pat <- tmp.sub[,metadata[metadata$parent%in%c("M")&metadata$lineage=="B"&metadata$phenotype==phenotype,"sample.id"]]
      p1.mat <- tmp.sub[,metadata[metadata$parent%in%c("F")&metadata$lineage=="B"&metadata$phenotype==phenotype,"sample.id"]]
      p2.pat <- tmp.sub[,metadata[metadata$parent%in%c("M")&metadata$lineage=="A"&metadata$phenotype==phenotype,"sample.id"]]
      p2.mat <- tmp.sub[,metadata[metadata$parent%in%c("F")&metadata$lineage=="A"&metadata$phenotype==phenotype,"sample.id"]]
      p1.mean.x <- mean(sum(p1.pat)/(sum(p1.mat)+sum(p1.pat)))
      if(is.nan(p1.mean.x)){p1.mean.x <- 0}
      if(is.infinite(p1.mean.x)){p1.mean.x <- 1}
      p1.mean[i] <- p1.mean.x
      p2.mean.x <- mean(sum(p2.mat)/(sum(p2.mat)+sum(p2.pat)))
      if(is.nan(p2.mean.x)){p2.mean.x <- 0}
      if(is.infinite(p2.mean.x)){p2.mean.x <- 1}
      p2.mean[i] <- p2.mean.x
      biaslist[i] <- as.character(tmp.sub$bias.plot[1])
    }else{
      p1.mean[i] <- mean(tmp$p1)
      p2.mean[i] <- mean(tmp$p2)
      biaslist[i] <- "NA"
    }
  }
  return.data <- data.frame(gene=unlist(genelist),
                            bias.plot=unlist(biaslist),
                            p1=unlist(p1.mean),
                            p2=unlist(p2.mean),
                            altbias=unlist(altbias))
  return.data$bias.plot <- factor(return.data$bias.plot,
                                  levels = c("NA","mat","Lineage A","Lineage B","pat"))
  return.data <- return.data[order(return.data$bias.plot),]

  # Output
  return(return.data)
}

# Execute
WL.collapse <- AST.collapse(WL.plot,WL_counts_normalized,"WL")
QL.collapse <- AST.collapse(QL.plot,QL_counts_normalized,"QL")

# Save output
write.csv(WL.collapse,"AST-tutorial/ANALYSIS/RESULTS/WLAST_collapse.csv",row.names=F)
write.csv(QL.collapse,"AST-tutorial/ANALYSIS/RESULTS/QLAST_collapse.csv",row.names=F)

Finally, generate scatter plots of allele-specific transcription joined by the table of Chi-squared test results.

Function: AST.scatter

This function generates a scatter plot of each gene by p1 and p2 for each phenotype. The points in these plots are assigned a color to represent their expression status: black is maternal, green is lineage A, gold is lineage B, blue is paternal, and grey is not significant.
For consistency across studies, the colors for maternal, paternal, and unbiased genes are fixed, but the colors for each lineage can be customized. To help with visualizing the local density of points, each point is assigned its color value on a gradient between darker and lighter tones, indicating higher and lower local densities, respectively.

AST.scatter <- function(data,title,
                        LineageA.color.dark="#058762",LineageA.color.light="#4dc4a2",
                        LineageB.color.dark="#f7af05",LineageB.color.light="#f7ca60"){
  # data = object from AST.collapse
  # title = string, title of plot

  get_density <- function(x, y, ...){
    dens <- MASS::kde2d(x, y, ...)
    ix <- findInterval(x, dens$x)
    iy <- findInterval(y, dens$y)
    ii <- cbind(ix, iy)
    return(dens$z[ii])
  }
  data$bias.plot <- factor(data$bias.plot)
  biases <- levels(data$bias.plot)
  data$color <- NA
  for(i in 1:length(biases)){
    data.sub <- data[data$bias.plot==biases[i],]
    if(length(data.sub$gene)>2){
      density <- get_density(data.sub$p1, data.sub$p2, n = 100)
      if(biases[i]=="NA"){pal <- colorRampPalette(colors = c("grey90", "grey70"))(60)}
      if(biases[i]=="pat"){pal <- colorRampPalette(colors = c("#a1d2ed", "#02a0f5"))(60)}
      if(biases[i]=="mat"){pal <- colorRampPalette(colors = c("grey50", "black"))(60)}
      if(biases[i]=="Lineage A"){pal <- colorRampPalette(colors = c(LineageA.color.light, LineageA.color.dark))(60)}
      if(biases[i]=="Lineage B"){pal <- colorRampPalette(colors = c(LineageB.color.light, LineageB.color.dark))(60)}
      data[data$bias.plot==biases[i],"color"] <- smoothPalette(density,pal=pal)
    }
    if(length(data.sub$gene)<=2){
      if(biases[i]=="NA"){pal <- "grey90"}
      if(biases[i]=="pat"){pal <- "#a1d2ed"}
      if(biases[i]=="mat"){pal <- "grey50"}
      if(biases[i]=="Lineage A"){pal <- LineageA.color.light}
      if(biases[i]=="Lineage B"){pal <- LineageB.color.light}
      data[data$bias.plot==biases[i],"color"] <- pal
    }
  }
  data <- data[,c(3,4,6)]
  data <- data[!duplicated(data),]

  # Generate plot
  g <- ggplot(data, aes(x=p1, y=p2, color=color)) +
    geom_point(size=3) +
    theme_prism() +
    xlab(expression(bold(paste("% A allele in ",B[mother], " x ",A[father],sep="")))) +
    ylab(expression(bold(paste("% A allele in ",A[mother], " x ",B[father],sep="")))) +
    ggtitle(title) +
    theme(text = element_text(size=18),
          plot.title = element_text(hjust = 0.5)) +
    guides(alpha=F, color=F) +
    scale_x_continuous(limits = c(0, 1), breaks = seq(0, 1, .2)) +
    scale_y_continuous(limits = c(0, 1), breaks = seq(0, 1, .2)) +
    scale_colour_identity()

  # Output
  return(g)
}

# Execute for WL
WL.scatter <- AST.scatter(WL.collapse,"192hpf WL transcription",
                          LineageA.color.dark="#427447",LineageA.color.light="#C1C9B5",
                          LineageB.color.dark="#d09d6f",LineageB.color.light="#f7dcc3")
WL.scatter

# Execute for QL
QL.scatter <- AST.scatter(QL.collapse,"192hpf QL transcription",
                          LineageA.color.dark="#427447",LineageA.color.light="#C1C9B5",
                          LineageB.color.dark="#d09d6f",LineageB.color.light="#f7dcc3")
QL.scatter
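Because AST.scatter returns standard ggplot objects, the individual panels can also be written to disk on their own if standalone copies are wanted; the output paths and dimensions below are illustrative, not part of the original pipeline.

# Optional: save each panel separately (paths and sizes are illustrative)
ggsave("AST-tutorial/ANALYSIS/RESULTS/WL_scatter.pdf", WL.scatter, width = 6, height = 6)
ggsave("AST-tutorial/ANALYSIS/RESULTS/QL_scatter.pdf", QL.scatter, width = 6, height = 6)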
Function: AST.scatter.table

Plot the Chi-squared results table generated by AST.chisq.

AST.scatter.table <- function(chisq.df,
                              LineageA.color="#009e73",
                              LineageB.color="#e69f00"){
  # chisq.df = table generated by AST.chisq

  # Build table plot
  gmid.df <- chisq.df
  cols <- matrix("black", nrow(gmid.df), ncol(gmid.df))
  cols[1,2] <- "#000000"
  cols[2,2] <- LineageA.color
  cols[3,2] <- LineageB.color
  cols[4,2] <- "#56b4e9"
  ccols <- matrix("white", nrow(gmid.df), ncol(gmid.df))
  ccols[1:4,3] <- "#f4efea"
  ccols[1:4,1] <- "#f4efea"
  ccols[1:4,2] <- "#e4d8d1"
  cfonts <- matrix("bold", nrow(gmid.df), ncol(gmid.df))
  gmid.df[2,2] <- "A"
  gmid.df[3,2] <- "B"
  tt <- ttheme_default(core=list(fg_params = list(col = cols, cex = 1, fontface = cfonts),
                                 bg_params = list(col=NA, fill = ccols),
                                 padding.h=unit(2, "mm")),
                       rowhead=list(bg_params = list(col=NA)),
                       colhead=list(bg_params = list(fill =c("#f4efea", "#e4d8d1", "#f4efea", "white")),
                                    fg_params = list(rot=90, cex = 1,col=c("black", "black", "black", "white"))))

  # Generate plot
  gmid <- tableGrob(gmid.df, rows = NULL, theme=tt)

  # Output
  return(gmid)
}

# Execute
chisq.table <- AST.scatter.table(chisq.df,
                                 LineageA.color="#427447",
                                 LineageB.color="#d09d6f")
plot(chisq.table)

Combine the plots into a single figure to replicate Figure 4 from Bresnahan et al., 2024, “Intragenomic conflict underlies extreme phenotypic plasticity in queen-worker caste determination in honey bees (Apis mellifera)”, bioRxiv.

fig4 <- arrangeGrob(WL.scatter, chisq.table, QL.scatter, widths=c(5,2.5,5))
grid.draw(fig4)

Figure 4. Queen-destined larvae show enriched paternal allele-biased transcription relative to worker-destined larvae. Allele-specific transcriptomes were assessed in F2 worker-destined larvae (WL) and queen-destined larvae (QL) collected from a reciprocal cross between different stocks of European honey bees. The x-axis represents, for each transcript, the proportion of lineage A reads in larvae with a lineage B mother and lineage A father (p1). The y-axis represents, for each transcript, the proportion of lineage A reads in larvae with a lineage A mother and lineage B father (p2). Each color represents a transcript which is significantly biased at all tested SNP positions: black is maternal (mat), green is lineage A, gold is lineage B, blue is paternal (pat), and grey is not significant. Center table: the number of transcripts showing each category of allelic bias and p-values for Chi-squared tests of independence for comparisons between the castes are indicated (NS = not significant). Significance of allele-biased transcription was determined using the overlap between two statistical tests: a general linear mixed model (GLIMMIX), and a Storer-Kim binomial exact test along with thresholds of p1<0.4 and p2>0.6 for maternal bias, p1>0.6 and p2<0.4 for paternal bias, p1<0.4 and p2<0.4 for lineage B bias and p1>0.6 and p2>0.6 for lineage A bias.
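To keep a copy of the assembled figure, draw the combined grob into a graphics device; the file path and dimensions here are illustrative rather than part of the original pipeline.

# Optional: write the combined figure to disk (path and size are illustrative)
pdf("AST-tutorial/ANALYSIS/RESULTS/figure4.pdf", width = 12.5, height = 5)
grid.draw(fig4)
dev.off()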
3.8 Session info sessionInfo() ## R version 4.3.2 (2023-10-31) ## Platform: aarch64-apple-darwin20 (64-bit) ## Running under: macOS Ventura 13.1 ## ## Matrix products: default ## BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib ## LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0 ## ## locale: ## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8 ## ## time zone: America/New_York ## tzcode source: internal ## ## attached base packages: ## [1] stats4 grid parallel stats graphics grDevices utils ## [8] datasets methods base ## ## other attached packages: ## [1] GenomicFeatures_1.54.4 AnnotationDbi_1.64.1 ## [3] genomation_1.34.0 DESeq2_1.42.1 ## [5] SummarizedExperiment_1.32.0 Biobase_2.62.0 ## [7] MatrixGenerics_1.14.0 matrixStats_1.1.0 ## [9] GenomicRanges_1.54.1 GenomeInfoDb_1.38.1 ## [11] IRanges_2.36.0 S4Vectors_0.40.2 ## [13] BiocGenerics_0.48.1 ggprism_1.0.4 ## [15] tagcloud_0.6 ggpubr_0.6.0 ## [17] doParallel_1.0.17 iterators_1.0.14 ## [19] foreach_1.5.2 gridExtra_2.3 ## [21] car_3.1-2 carData_3.0-5 ## [23] lmerTest_3.1-3 lme4_1.1-35.1 ## [25] Matrix_1.6-3 tryCatchLog_1.3.1 ## [27] Rfast_2.1.0 RcppParallel_5.1.7 ## [29] RcppZiggurat_0.1.6 Rcpp_1.0.11 ## [31] plyr_1.8.9 lubridate_1.9.3 ## [33] forcats_1.0.0 stringr_1.5.1 ## [35] dplyr_1.1.4 purrr_1.0.2 ## [37] readr_2.1.4 tidyr_1.3.0 ## [39] tibble_3.2.1 ggplot2_3.4.4 ## [41] tidyverse_2.0.0 kableExtra_1.3.4 ## ## loaded via a namespace (and not attached): ## [1] splines_4.3.2 BiocIO_1.12.0 ## [3] bitops_1.0-7 filelock_1.0.2 ## [5] XML_3.99-0.15 lifecycle_1.0.4 ## [7] rstatix_0.7.2 lattice_0.22-5 ## [9] MASS_7.3-60 backports_1.4.1 ## [11] magrittr_2.0.3 sass_0.4.7 ## [13] rmarkdown_2.25 jquerylib_0.1.4 ## [15] yaml_2.3.7 plotrix_3.8-4 ## [17] DBI_1.1.3 minqa_1.2.6 ## [19] RColorBrewer_1.1-3 abind_1.4-5 ## [21] zlibbioc_1.48.0 rvest_1.0.3 ## [23] RCurl_1.98-1.13 rappdirs_0.3.3 ## [25] GenomeInfoDbData_1.2.11 svglite_2.1.2 ## [27] codetools_0.2-19 DelayedArray_0.28.0 ## [29] xml2_1.3.5 tidyselect_1.2.0 ## [31] futile.logger_1.4.3 farver_2.1.1 ## [33] BiocFileCache_2.10.1 webshot_0.5.5 ## [35] GenomicAlignments_1.38.0 jsonlite_1.8.7 ## [37] systemfonts_1.0.5 tools_4.3.2 ## [39] progress_1.2.2 glue_1.6.2 ## [41] SparseArray_1.2.2 xfun_0.41 ## [43] withr_2.5.2 numDeriv_2016.8-1.1 ## [45] formatR_1.14 fastmap_1.1.1 ## [47] boot_1.3-28.1 fansi_1.0.5 ## [49] digest_0.6.33 timechange_0.2.0 ## [51] R6_2.5.1 seqPattern_1.34.0 ## [53] colorspace_2.1-0 biomaRt_2.58.2 ## [55] RSQLite_2.3.3 utf8_1.2.4 ## [57] generics_0.1.3 data.table_1.14.8 ## [59] rtracklayer_1.62.0 prettyunits_1.2.0 ## [61] httr_1.4.7 S4Arrays_1.2.0 ## [63] pkgconfig_2.0.3 gtable_0.3.4 ## [65] blob_1.2.4 impute_1.76.0 ## [67] XVector_0.42.0 htmltools_0.5.7 ## [69] bookdown_0.38 scales_1.2.1 ## [71] png_0.1-8 knitr_1.45 ## [73] lambda.r_1.2.4 rstudioapi_0.15.0 ## [75] tzdb_0.4.0 reshape2_1.4.4 ## [77] rjson_0.2.21 nlme_3.1-163 ## [79] curl_5.1.0 nloptr_2.0.3 ## [81] cachem_1.0.8 KernSmooth_2.23-22 ## [83] restfulr_0.0.15 pillar_1.9.0 ## [85] vctrs_0.6.4 dbplyr_2.4.0 ## [87] evaluate_0.23 cli_3.6.1 ## [89] locfit_1.5-9.9 compiler_4.3.2 ## [91] futile.options_1.0.1 Rsamtools_2.18.0 ## [93] rlang_1.1.2 crayon_1.5.2 ## [95] ggsignif_0.6.4 labeling_0.4.3 ## [97] stringi_1.8.2 viridisLite_0.4.2 ## [99] gridBase_0.4-7 BiocParallel_1.36.0 ## [101] munsell_0.5.0 Biostrings_2.70.1 ## [103] BSgenome_1.70.2 hms_1.1.3 ## [105] bit64_4.0.5 
KEGGREST_1.42.0 ## [107] highr_0.10 broom_1.0.5 ## [109] memoise_2.0.1 bslib_0.6.0 ## [111] bit_4.0.5