1 Introduction

The purpose of this project is to develop a reproducible workflow specific to my research, entitled “Exploring sagebrush microbial metagenomes from deep, host-derived sequencing”.

The objectives of this study are to profile the microbial taxa associated with big sagebrush leaves, reconstruct MAGs where possible, and explore the potential functions of the resulting MAGs. We ask the following questions:

  1. Which microbial taxa can we profile from deeply sequenced sagebrush shotgun reads?

  2. How does host genotype affect microbial composition and diversity?

  3. Are there differences in community composition between greenhouse-grown and wild plants?

1.1 Why this study?

The traditional method for characterizing microbes is the isolation of individual strains from laboratory cultures (Gulati and Plosky, 2020; Andersen and Schluter, 2021). However, many microbes are difficult to culture because their natural habitat is too complex to reproduce in the laboratory or because they rely on other species to grow successfully. This has limited the scope of microbial studies and has left many microbes uncharacterized (Bharti and Grimm, 2019; Berg et al., 2020). In contrast, metagenomic sequencing data provide a culture-independent approach to studying complex communities of microbes (Browne et al., 2016; Nayfach et al., 2019). This culture-independent approach includes taxonomic and functional profiling of microbes, as well as reconstruction of metagenome-assembled genomes (MAGs). A MAG refers to a collection of similar scaffolds, grouped together from a metagenome assembly based on tetranucleotide frequencies (TNFs), abundances, complementary marker genes (Lin and Liao, 2016), taxonomic alignments (Wang et al., 2019), and codon usage (Yu et al., 2018), that together represent a microbial genome.

1.2 Why Sagebrush?

Big sagebrush (Artemisia tridentata) is an important shrub species that dominates much of the western United States’ intermountain basins, shrub steppes, and deserts (USDA). The broad ecological amplitude of sagebrush makes it a foundational species affecting community composition, ecosystem processes, and wildlife habitat over vast portions of western North America (Remington et al., 2021). Its evergreen leaves offer nutrients and shade that facilitate the establishment of diverse understory plants even in arid environments. Sagebrush foliage is relatively unpalatable to livestock and other large herbivores due to high levels of terpenes and tannins, thereby buffering more palatable understory plants from intensive grazing (Remington et al., 2021). These grazing refuges help maintain native herbaceous diversity and allow sensitive species like perennial bunchgrasses to persist. Greater sage-grouse, pygmy rabbits, pronghorn, and mule deer rely on sagebrush for shelter and as their primary winter food source (Shipley et al., 2006). Certain songbirds, such as Brewer’s sparrow and sage thrasher, require intact sagebrush stands for nesting and rearing offspring. Sagebrush flowers are a vital nectar source for native pollinators including bees, wasps, flies, butterflies, and hummingbirds. Unfortunately, sagebrush ecosystems are critically threatened by wildfires, invasive species, climate change, land conversion, and urban development (Miller, 2010), which together have resulted in the loss of more than 50% of the historic sagebrush range (Rigge et al., 2020). The major ecological role of sagebrush underscores the need to protect the remaining intact sagebrush communities through ecological restoration and sustainable management in order to conserve biodiversity and ecosystem services across the western U.S. (Remington et al., 2021).

2 Sampling

2.1 Genomic Data

We used sequence data from previous studies on sagebrush at Boise State University (Melton et al., 2022). These data are divided into three categories based on the environment in which the plants were grown prior to sequencing.

| Category | Sequencing Type | Sequencing Depth | Number of Samples |
|---|---|---|---|
| Magenta box | Illumina paired-end | ~160X | 2 |
| Greenhouse | Illumina paired-end | ~20X | 12 |
| Field/Wild | Illumina paired-end | ~10-12X | 14 |

3 Bioinformatic Workflow

The workflow described here is specific to one category of samples (Field/Wild) that was analyzed for this project; however, the same workflow was used for all the other categories. As shown in Figure 3.1, the bioinformatic workflow can be described in five sections:

  1. Preprocessing

  2. Taxonomic assignment of short reads

  3. Assembly and binning

  4. Classification of Metagenome Assembled Genomes

  5. Gene Prediction and Functional profiling


Figure 3.1: Overview of the bioinformatic workflow for the recovery of metagenome-assembled genomes from host-derived sequences of big sagebrush Artemisia tridentata subsp. tridentata

3.1 Data structure and organisation of files

An overview of the project structure is shown in Figure 3.2, and the data visualization in Figure 3.3 clarifies how data flow through the code.


Figure 3.2: Overview of the project structure

  • 01_Bowtie: Contains the .bam files generated from a previous preliminary mapping against the host genome using Bowtie2.

  • 02_BMTAGGER: Contains the output of the further clean-up with BMTagger.

  • 03_KBASE_Analysis: Contains the outputs from the KBase analysis platform.

  • 04_Kraken: Working folder for the Kraken analysis.

  • 05_DIAMOND: Working folder for the DIAMOND analysis.

  • 06_Func_annotation: Working folder for the functional analysis.

Figure 3.3: Data/Project Visualization

3.2 Bioinformatic Tools

Analyses presented in this report require a UNIX-based operating system with ssh, sbatch, and tmux available. Check this website for more details and documentation on how to connect to remote computers using the Secure Shell Protocol.

In addition, R and RStudio were used to tidy and format the outputs and to produce figures. All R analyses were performed using R version 4.2.1 (2022-06-23). The software can be downloaded here.

KBase (web-based analyses) was accessed using Google Chrome Version 119.0.6045.105 (Official Build) (x86_64). You can install Google Chrome using this link.

Finally, dependencies associated with each analysis are presented alongside the code.

3.3 Setting Up!

# 1. Remotely connect to HPC computer using ssh (adjust with your credentials/IP)

ssh userID@IP

# 2. Start a dev-session

dev-session bsu

# 3. Start a tmux session and name it Sagebrush_MAGS

tmux new -s Sagebrush_MAGS

# Note: Use Ctrl+b then d to safely detach from a tmux session

# 4. Navigate to the project location (adjust path)

cd /path/to/Project_folder/

3.3.1 Create a metagenomics conda environment

Create a conda environment from the Data Carpentry tutorial and name it “metagenomics”. This environment contains several software packages that will be used throughout this project.

wget https://raw.githubusercontent.com/carpentries-lab/metagenomics-analysis/gh-pages/files/spec-file-Ubuntu22.txt

# Then create an environment called 'metagenomics' using the spec file that you downloaded (spec-file-Ubuntu22.txt)

conda create --name metagenomics --file spec-file-Ubuntu22.txt

# Activate the metagenomics environment

conda activate metagenomics
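
After activating the environment, it can help to confirm that the core tools are actually on your PATH. A minimal, optional sanity check, assuming the Data Carpentry spec file includes fastqc and samtools (adjust the tool names to whatever your environment is expected to contain):

# Confirm that key tools from the environment are available
for tool in fastqc samtools; do
    command -v "$tool" > /dev/null || echo "WARNING: $tool not found in this environment"
done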

3.4 Preprocessing

The sequence data in this study are host-derived, as they were extracted from plant tissues. Because our focus is on the associated microbial reads, we consider host genomes contaminants that could affect our analyses and therefore need to be removed. Raw sequence data from previous projects had already been mapped against the sagebrush (Artemisia tridentata subsp. tridentata) genome to generate the .bam files stored in the 01_Bowtie folder. See Figure 3.1.

3.4.1 Separate ‘unmapped’ reads from ‘mapped’ reads

3.4.1.1 Data requirement

  • Input: .bam files from Bowtie2

  • Output: Unmapped, non-host fastq.gz files.

3.4.1.2 Dependencies

This step uses samtools, which can be installed from Bioconda here:

conda install -c bioconda samtools
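
A quick, optional check that the installation succeeded:

# Confirm samtools is installed and report its version
samtools --version | head -n 1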

3.4.1.3 Code

Here, we used a single ‘for loop’ that performs all of the steps described below:

  1. Remove the host reads from the bam files

  2. Retrieve the unmapped, non-host reads

  3. Sort the resulting bam file

  4. Split the unmapped_sorted.bam file into R1 and R2

  5. Delete the unwanted intermediate “unmapped.bam” and “unmapped_sorted.bam” files

# Step 1: Remove the host reads from the bam files

for F in /path/to/Project_folder/01_Bowtie/*.bam; do
    PROJECT_DIR=/path/to/Project_folder    # avoid overwriting the shell's built-in $PWD
    BASENAME=${F##*/}
    SAMPLE=${BASENAME%.bam}_IDT3_NC
    OUTPUT_DIR=$PROJECT_DIR/01_Bowtie/Non_host_reads
    mkdir -p $OUTPUT_DIR


# Step 2: Get the unmapped non-host reads
# -f 12 keeps pairs where both mates are unmapped; -F 256 drops secondary alignments

samtools view -b -f 12 -F 256 $F  > $OUTPUT_DIR/${SAMPLE}_unmapped.bam &&

# Step 3: Make it a sorted bam (-n sorts by read name, as required by samtools fastq)

samtools sort -n -m 5G -@ 2 $OUTPUT_DIR/${SAMPLE}_unmapped.bam -o $OUTPUT_DIR/${SAMPLE}_unmapped_sorted.bam &&

# Step 4: Split the unmapped_sorted.bam file into R1 and R2

samtools fastq -@ 8 $OUTPUT_DIR/${SAMPLE}_unmapped_sorted.bam \
  -1 $OUTPUT_DIR/${SAMPLE}_host_removed_R1.fastq.gz \
  -2 $OUTPUT_DIR/${SAMPLE}_host_removed_R2.fastq.gz \
  -0 /dev/null -s /dev/null -n &&

# Step 5: Remove the intermediate "unmapped.bam" and "unmapped_sorted.bam" files

rm $OUTPUT_DIR/${SAMPLE}_unmapped.bam &&
rm $OUTPUT_DIR/${SAMPLE}_unmapped_sorted.bam

done
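
As a quick check on the output, you can count how many read pairs survived host removal; a small sketch, assuming the file naming from the loop above (each fastq record spans four lines):

# Count read pairs remaining after host removal
for F in /path/to/Project_folder/01_Bowtie/Non_host_reads/*_host_removed_R1.fastq.gz; do
    echo "$F: $(( $(zcat "$F" | wc -l) / 4 )) read pairs"
done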

3.4.2 Removal of ‘host’ reads using BMTagger

Downstream analysis of the “unmapped non-host” reads from Bowtie2 showed that sagebrush reads remained in our “non-host” reads. Therefore, we used BMTagger, a tool originally designed to remove the human genome from metagenomic datasets that can be adapted to remove other host genomes. See Figure 3.1.

To improve the clean-up process, we indexed reference genomes from Artemisia tridentata, Artemisia annua, and the human hg38 genome, all downloaded from NCBI, and mapped our datasets against them.

3.4.2.1 Dependencies

For this, we used one of the metaWRAP modules, read_qc.sh. We preferred this module because it does the following all at once:

  1. Assesses read quality using FastQC.

  2. Trims low-quality sequence data using Trim Galore.

  3. Removes host reads using BMTagger by mapping against the host’s prebuilt .bitmask and .srprism files.

Install metaWRAP and all of its dependencies using the code below. Visit the metaWRAP GitHub repository for more information, including how to build the ‘host’ BMTagger database (a sketch follows the code block).

conda config --add channels defaults
conda config --add channels conda-forge
conda config --add channels bioconda
conda config --add channels ursky

# Unix/Linux only
mamba create -y -n metawrap-env python=2.7

conda activate metawrap-env

# Install the metaWRAP package itself (from the ursky channel added above)
mamba install -y -c ursky metawrap-mg
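
The read_qc runs below assume prebuilt BMTagger indices named A_tridentata, A_annua, and hg38. A minimal sketch of how one such index can be built, following the metaWRAP usage tutorial; the FASTA file name, index folder, and the -M memory value (in MB) are assumptions to adjust for your system:

# Build BMTagger index files for one host genome (repeat for each genome)
mkdir -p BMTAGGER_INDEX
cd BMTAGGER_INDEX

# Bitmask of 18-mers present in the host genome
bmtool -d A_tridentata.fa -o A_tridentata.bitmask -A 0 -w 18

# srprism index used for the alignment step
srprism mkindex -i A_tridentata.fa -o A_tridentata.srprism -M 7168

# Finally, point metaWRAP at this folder by setting BMTAGGER_DB in its config-metawrap file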

3.4.2.2 Create the BMTagger folders

Set up the folders for the BMTagger clean-up.

# Create a BMTAGGER folder 

mkdir /path/to/Project_folder/02_BMTAGGER       # This will be the working directory (PWD)

# Reference: https://github.com/bxlab/metaWRAP/blob/master/Usage_tutorial.md

# Change to the BMTAGGER folder

cd /path/to/Project_folder/02_BMTAGGER

# Make a new folder called 01_RAW_READS

mkdir /path/to/Project_folder/02_BMTAGGER/01_RAW_READS

cd /path/to/Project_folder/02_BMTAGGER/01_RAW_READS

# Copy the raw non-host reads in '/path/to/Project_folder/01_Bowtie/Non_host_reads' into '/path/to/Project_folder/02_BMTAGGER/01_RAW_READS'

cp /path/to/Project_folder/01_Bowtie/Non_host_reads/* /path/to/Project_folder/02_BMTAGGER/01_RAW_READS

# Decompress the files in 01_RAW_READS (from .fastq.gz to .fastq)

cd /path/to/Project_folder/02_BMTAGGER/01_RAW_READS

gunzip *.gz

3.4.2.3 Clean with the Artemisia tridentata subsp. tridentata genome

See Figure 3.1.

# Create a new tmux session and change to the working directory

tmux new -s bmtagger

cd /path/to/Project_folder/02_BMTAGGER

# Activate the conda environment in this new tmux session

conda activate metawrap-env


# Run metaWRAP-Read_qc to trim the reads, remove host (A_tridentata) reads

# Make a directory for the results

mkdir /path/to/Project_folder/02_BMTAGGER/02_READ_QC


# Process all samples at the same time with a parallel for loop (especially if you have many samples)

# Note: my PWD is /path/to/Project_folder/02_BMTAGGER 

# Raw reads are found in /path/to/Project_folder/02_BMTAGGER/01_RAW_READS

for F in 01_RAW_READS/*_R1.fastq; do 
    R=${F%_*}_R2.fastq
    BASE=${F##*/}
    SAMPLE=${BASE%_host*}
    metawrap read_qc -1 $F -2 $R -t 20 -x A_tridentata -o 02_READ_QC/$SAMPLE --skip-trimming &
done

# Wait for the backgrounded read_qc jobs to finish before moving the outputs
wait


# change to /path/to/Project_folder/02_BMTAGGER
    
cd /path/to/Project_folder/02_BMTAGGER


# move over the final QC'ed reads into a new folder but first make a new folder that will store the reads

mkdir /path/to/Project_folder/02_BMTAGGER/03_CLEAN_READS


for i in 02_READ_QC/*; do 
 b=${i#*/}
 mv ${i}/final_pure_reads_1.fastq 03_CLEAN_READS/${b}_1.fastq
 mv ${i}/final_pure_reads_2.fastq 03_CLEAN_READS/${b}_2.fastq
done
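
To quantify how much was removed at this stage (these counts feed into Table 3.1 later), a small sketch, assuming the file naming used above; the sample names are placeholders:

# Compare read-pair counts before and after the A. tridentata clean-up for each sample
for F in 03_CLEAN_READS/*_1.fastq; do
    SAMPLE=$(basename "$F" _1.fastq)
    RAW=01_RAW_READS/${SAMPLE}_host_removed_R1.fastq
    echo "$SAMPLE: raw=$(( $(wc -l < "$RAW") / 4 )) clean=$(( $(wc -l < "$F") / 4 )) read pairs"
done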

3.4.2.4 Clean with Artemisia annua genome

See Figure 3.1.

The A. annua genome that was used was downloaded from NCBI under Bioproject PRJNA416223.

The reads in /path/to/Project_folder/02_BMTAGGER/03_CLEAN_READS have now been cleaned of A. tridentata reads. Next, we clean them against A. annua for further clean-up, given the similarity of these genomes.

These reads in /path/to/Project_folder/02_BMTAGGER/03_CLEAN_READS serve as the input for this new cleaning step.

# Make a folder that will store the output

mkdir /path/to/Project_folder/02_BMTAGGER/02_READ_QC-A_annua


for F in 03_CLEAN_READS/*_1.fastq; do 
 R=${F%_*}_2.fastq
 BASE=${F##*/}
 SAMPLE=${BASE%_*}
 metawrap read_qc -1 $F -2 $R -t 5 -x A_annua -o 02_READ_QC-A_annua/$SAMPLE --skip-trimming &
done

# Wait for the backgrounded read_qc jobs to finish before moving the outputs
wait




## move over the final QC'ed reads into a new folder

mkdir /path/to/Project_folder/02_BMTAGGER/03_CLEAN_READS-A_annua


for i in 02_READ_QC-A_annua/*; do 
 b=${i#*/}
 mv ${i}/final_pure_reads_1.fastq 03_CLEAN_READS-A_annua/${b}_1.fastq
 mv ${i}/final_pure_reads_2.fastq 03_CLEAN_READS-A_annua/${b}_2.fastq
done

3.4.2.5 Clean with the human (Homo sapiens) genome

See Figure 3.1.

The reads in /path/to/Project_folder/02_BMTAGGER/03_CLEAN_READS-A_annua have been cleaned of A. tridentata and A. annua reads. Finally, we clean them against the human (Homo sapiens) genome to remove any potential contaminants carried over from the DNA extraction process.

The human genome used here was downloaded from https://hgdownload.soe.ucsc.edu/goldenPath/hg38/chromosomes/ using wget over FTP: wget ftp://hgdownload.soe.ucsc.edu/goldenPath/hg38/chromosomes/*fa.gz. Check this link for details.

These reads in /path/to/Project_folder/02_BMTAGGER/03_CLEAN_READS-A_annua serve as the input for this new cleaning step.

# Make a folder that will store the output

mkdir /path/to/Project_folder/02_BMTAGGER/02_READ_QC-human


for F in 03_CLEAN_READS-A_annua/*_1.fastq; do 
 R=${F%_*}_2.fastq
 BASE=${F##*/}
 SAMPLE=${BASE%_*}
 metawrap read_qc -1 $F -2 $R -t 5 -x hg38 -o 02_READ_QC-human/$SAMPLE --skip-trimming &
done

# Wait for the backgrounded read_qc jobs to finish before moving the outputs
wait


## move over the final QC'ed reads into a new folder

mkdir /path/to/Project_folder/02_BMTAGGER/03_CLEAN_READS-human


for i in 02_READ_QC-human/*; do 
 b=${i#*/}_host_removed
 mv ${i}/final_pure_reads_1.fastq 03_CLEAN_READS-human/${b}_1.fastq
 mv ${i}/final_pure_reads_2.fastq 03_CLEAN_READS-human/${b}_2.fastq
done


# Note that we appended the phrase "host_removed" to the final cleaned read names

3.4.2.6 Output

Table 3.1 shows the result of the cleaning process.

Table 3.1: Approximate size of ‘host’ reads that were removed from the sequencing data at each cleaning step.

| Sample ID | Category | Initial read size | A. tridentata reads removed | A. annua reads removed | Human reads removed | Cleaned reads (%) | Final read size | No. of cleaned PE raw reads in base pairs |
|---|---|---|---|---|---|---|---|---|
| IDT3 | Magenta | 38.0 Gb | 24.0 Gb | 4.4 Gb | 970 Mb | 77.28947 | 8.4 Gb | 25,738,784 |
| UTT2 | Magenta | 58.0 Gb | 34.0 Gb | 4.6 Gb | 1.31 Gb | 68.81034 | 18.2 Gb | 55,350,386 |
| 29-ID | Greenhouse | 4.4 Gb | 2.2 Gb | 658 Mb | 0 Mb | 64.95455 | 1.48 Gb | 4,207,818 |
| 143-ID | Greenhouse | 3.6 Gb | 1.83 Gb | 550 Mb | 0 Mb | 66.11111 | 1.29 Gb | 3,683,012 |
| 163-ID | Greenhouse | 3.4 Gb | 1.84 Gb | 540 Mb | 0 Mb | 70.00000 | 1.11 Gb | 3,158,346 |
| 192-ID | Greenhouse | 3.4 Gb | 1.66 Gb | 456 Mb | 0 Mb | 62.23529 | 1.31 Gb | 3,717,942 |
| 246-NV | Greenhouse | 4.2 Gb | 2.20 Gb | 602 Mb | 0 Mb | 66.71429 | 1.58 Gb | 4,490,656 |
| 326-NV | Greenhouse | 3.6 Gb | 1.98 Gb | 576 Mb | 0 Mb | 71.00000 | 1.15 Gb | 3,267,482 |
| 379-NV | Greenhouse | 3.8 Gb | 2.02 Gb | 560 Mb | 0 Mb | 67.89474 | 1.29 Gb | 3,666,774 |
| 399-NV | Greenhouse | 4.0 Gb | 2.40 Gb | 580 Mb | 0 Mb | 74.50000 | 1.21 Gb | 3,440,062 |
| 422-UT | Greenhouse | 5.8 Gb | 3.20 Gb | 780 Mb | 0 Mb | 68.62069 | 2.02 Gb | 5,794,018 |
| 443-UT | Greenhouse | 4.0 Gb | 2.02 Gb | 562 Mb | 0 Mb | 64.55000 | 1.47 Gb | 4,186,566 |
| 541-UT | Greenhouse | 4.0 Gb | 2.20 Gb | 620 Mb | 0 Mb | 70.50000 | 1.25 Gb | 3,558,232 |
| 579-UT | Greenhouse | 3.8 Gb | 1.98 Gb | 564 Mb | 0 Mb | 66.94737 | 1.30 Gb | 3,696,426 |
| 2-Wild | Field/Wild | 3.2 Gb | 1.8 Gb | 464 Mb | 4.4 Mb | 70.88750 | 911 Mb | 2,601,426 |
| 3-Wild | Field/Wild | 3.4 Gb | 2.0 Gb | 474 Mb | 3.6 Mb | 72.87059 | 907 Mb | 2,590,152 |
| 8-Wild | Field/Wild | 4.0 Gb | 2.0 Gb | 524 Mb | 2.6 Mb | 63.16500 | 2.3 Gb | 6,519,242 |
| 14-Wild | Field/Wild | 3.0 Gb | 2.0 Gb | 502 Mb | 1.0 Mb | 83.43333 | 853 Mb | 2,436,802 |
| 15-Wild | Field/Wild | 4.0 Gb | 2.2 Gb | 588 Mb | 48 Mb | 70.90000 | 1.2 Gb | 3,604,450 |
| 21-Wild | Field/Wild | 4.0 Gb | 2.2 Gb | 540 Mb | 12.8 Mb | 68.82000 | 1.3 Gb | 3,559,682 |
| 24-Wild | Field/Wild | 3.0 Gb | 1.7 Gb | 470 Mb | 1.9 Mb | 72.39667 | 800 Mb | 2,283,504 |
| 26-Wild | Field/Wild | 3.6 Gb | 2.2 Gb | 548 Mb | 13.4 Mb | 76.70556 | 1.0 Gb | 2,960,114 |
| 29-Wild | Field/Wild | 3.6 Gb | 2.2 Gb | 448 Mb | 2.2 Mb | 73.61667 | 853 Mb | 2,433,302 |
| 32-Wild | Field/Wild | 3.4 Gb | 2.2 Gb | 510 Mb | 1.5 Mb | 79.75000 | 911 Mb | 2,590,152 |
| 33-Wild | Field/Wild | 3.8 Gb | 2.2 Gb | 512 Mb | 3.6 Mb | 71.46316 | 1.2 Gb | 3,528,676 |
| 34-Wild | Field/Wild | 3.4 Gb | 1.9 Gb | 488 Mb | 8.4 Mb | 70.48235 | 1.1 Gb | 3,125,510 |
| 35-Wild | Field/Wild | 4.0 Gb | 1.8 Gb | 234 Mb | 556 Mb | 64.75000 | 1.3 Gb | 3,839,712 |
| 36-Wild | Field/Wild | 3.6 Gb | 2.0 Gb | 518 Mb | 3.0 Mb | 70.02778 | 965 Mb | 2,754,048 |

3.5 Taxonomic assignment of short reads

As shown in Figure 3.1, the Department of Energy Systems Biology Knowledgebase (KBase) was used for our downstream metagenomic analyses (Arkin et al., 2018), with default parameters for all software unless otherwise stated. You can sign up for KBase using this link.

Figure 3.4 below shows the KBase pipeline used in this study. You can watch a short video on how to get started with KBase here.


Figure 3.4: Overview of Kbase Pipeline

3.5.1 Import into Kbase

Start by importing the sets of paired-end reads in FASTQ format, as described in Figure 3.5. The Import App creates a PairedEndLibrary object that we can then run through FastQC to further check the quality of the reads.


Figure 3.5: Screenshot on how to Import Data into Kbase

3.5.2 Run FastQC

3.5.2.1 Data Requirement

  • Input: PairedEndLibrary that was generated using the Import Application

  • Output: Quality Report in HTML format.

3.5.2.2 Dependencies

We used Assess Read Quality with FastQC v0.12.1 to further check the quality of our data. An example of the result is shown in Figure 3.6.


Figure 3.6: Screenshot of a FastQC report for one of the samples

3.5.3 Taxonomic Assignment of short reads with Kaiju

3.5.3.1 Data Requirement

  • Input: PairedEndLibrary that was generated using the Import Application

  • Output: Taxonomic visualization plots. It also produces a taxonomic report that can be exported into other programs to generate better quality plots.

3.5.3.2 Dependencies

We used Classify Taxonomy of Metagenomic Reads with Kaiju v1.9.0 on KBase to classify the short Illumina reads. The software was configured as follows:

Taxonomic Level           = ALL
Reference DB              = NCBI BLAST nr + euk
Low Abundance Filter      = 0.5
Subsample percent         = 100

The outputs of Kaiju were exported out of KBase and used to plot bar graphs. Figure 3.7 is a barplot generated using the outputs from KBase.


Figure 3.7: Taxonomic barplot showing the relative abundance of the top taxa across A) Magenta samples B) Greenhouse samples C) Wild/Field Samples at the genus level.

3.5.4 Taxonomic Assignment of short reads with Kraken2 and DIAMOND

We also used other classification tools, Kraken2 and DIAMOND/MEGAN (Buchfink et al., 2015; Wood et al., 2019), to compare against the Kaiju results, and found that the results were very similar. To keep this project concise, we do not describe these tools in detail.

3.6 Assembly and binning

3.6.1 Assembly

3.6.1.1 Data Requirement

  • Input: PairedEndLibrary that was generated using the Import Application, or a merged read library if you have multiple samples.

  • Output: ContigSet object.

3.6.1.2 Dependencies

We built co-assemblies of sequence data from plants in the same category; i.e., microbial sequences from all greenhouse-grown plants were co-assembled, and those from all field-grown plants were co-assembled.

KBase provides three assembly tools: metaSPAdes v3.15.3 (Nurk et al., 2017), MEGAHIT v1.2.9 (Li et al., 2015), and IDBA-UD v1.1.3 (Peng et al., 2012), and we used all three of these programs to assemble the metagenomic reads into contiguous sequences. See Figure 3.4.

Assembling with multiple tools is a common way to find which one produces the best results. Using the Compare Assembled Contig Distributions v1.1.2 app on KBase, we saw that metaSPAdes had the best N50 and N75 and produced the longest contigs. See Figure 3.8.


Figure 3.8: Result of Contig Comparison

3.6.2 Binning

3.6.2.1 Data Requirement

  • Input: The assembled ContigSet, together with the PairedEndLibrary that was generated using the Import Application.

  • Output: BinnedContig in .fasta format.

3.6.2.2 Dependencies

Just like the assembly process, we used three binning programs for MAG reconstruction: MaxBin2 v2.2.4 (Wu et al., 2015), MetaBAT2 v1.7 (Kang et al., 2019), and CONCOCT v1.1 (Alneberg et al., 2014). The individual outputs (bins) from these programs were passed into DAS Tool v1.1.2 to produce high-quality, non-redundant MAGs. See Figure 3.4.

3.7 Classification of Metagenome Assembled Genomes

3.7.0.1 Data Requirement

  • Input: BinnedContig in .fasta format

  • Output: CheckM quality table in .csv format. Classification plot generated with SpeciesTree.

3.7.0.2 Dependencies

Prior to classification, CheckM v1.0.18 was used to assess the quality of the MAGs, and Bin.003, which had >10% contamination, was removed (see Table 3.2). A sketch of how such a filter could be applied on the command line follows the table.

Table 3.2: CheckM quality assessment of metagenome-assembled genomes that were recovered from greenhouse plants.

| Bin Name | Marker Lineage | Genome Size (bp) | No. of Markers | No. of Marker Sets | Completeness | Contamination |
|---|---|---|---|---|---|---|
| Bin.001 | o__Sphingomonadales | 3972965 | 569 | 293 | 98.62 | 0.90 |
| Bin.002 | c__Alphaproteobacteria | 5311195 | 349 | 230 | 100.00 | 0.87 |
| Bin.003 | k__Archaea | 21065390 | 149 | 107 | 70.02 | 23.30 |

a The red highlighted row shows the low-quality bin that was filtered out using the set thresholds of >50% completeness and <10% contamination.
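
If you export the CheckM summary as a TSV, the same filter can be applied programmatically. A minimal sketch, assuming a tab-separated file with completeness and contamination in columns 6 and 7 (as laid out in Table 3.2); adjust the file name and column numbers to match your export:

# Keep bins with >50% completeness and <10% contamination (header row preserved)
awk -F'\t' 'NR == 1 || ($6 > 50 && $7 < 10)' checkm_quality.tsv > checkm_quality.filtered.tsv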

The Genome Taxonomy Database toolkit, GTDB-Tk v1.7.0 (Chaumeil et al., 2019), was used to classify the other two bins. See Figure 3.9.


Figure 3.9: Classification Tree generated by SpeciesTree - v2.2.0 on KBase

In addition to the GTDB classification, we exported the fasta files of these bins from KBase and classified them using Kraken2 and DIAMOND. We found that the Kraken2 and DIAMOND results were consistent with the GTDB-generated taxonomy, but we will not go into the details of these tools.

3.8 Gene Prediction and Functional profiling

Gene prediction helps characterize the genetic potential of the recovered metagenome-assembled genomes.

3.8.1 DRAM Annotation on KBase

We used the Annotate and Distill Assemblies with DRAM app on KBase; the output can be seen in Figure 3.10.


Figure 3.10: DRAM Annotation of MAGS

3.8.2 Gene Prediction

We also annotated the MAGs using Prodigal (Hyatt et al., 2010). The predicted proteins were then passed to KofamScan (https://github.com/takaram/kofam_scan), a command-line interface to the web-based KofamKOALA (Aramaki et al., 2019), which assigns gene functions based on KEGG Orthology and hidden Markov models. Finally, MinPath (Ye and Doak, 2009) was used to reconstruct parsimonious KEGG (Kanehisa and Goto, 2000) biological pathways from the protein family predictions, providing comprehensive reports on the metabolic modules and pathways found in the MAGs.

3.8.2.1 Prodigal

3.8.2.1.1 Data Requirement
  • Input: Genomes of the recovered bins in FASTA format (either .fa or .fasta).

  • Output: .fna, .gff, and .protein.faa files. The protein file (.protein.faa) will be used as input for KofamScan.

3.8.2.1.2 Dependencies

If you installed the metagenomics environment described in Section 3.3.1, just go ahead and activate it.

conda activate metagenomics

If you did not install that environment and don’t want to, you can install Prodigal directly:

conda install -c bioconda prodigal
3.8.2.1.3 Code
# Start a new tmux session

tmux new -s Functional.Annotation

# Activate the environment

conda activate metagenomics


# Make a new folder for the functional annotation

mkdir /path/to/Project_folder/06_Func_annotation

cd /path/to/Project_folder/06_Func_annotation


# Make a folder to store the FASTA files

mkdir /path/to/Project_folder/06_Func_annotation/01_bins.fasta

# Download the bins from KBase and move them into /path/to/Project_folder/06_Func_annotation/01_bins.fasta using FileZilla



# Run Prodigal on the bins 
# Run the program in normal mode. See https://github.com/hyattpd/prodigal/wiki/Advice-by-Input-Type#draft-genomes for more details.



# Create a for loop that runs the analysis for all of the bins  


for F in /path/to/Project_folder/06_Func_annotation/01_bins.fasta/*.fa; do
SAMPLE_NAME=$(basename "${F}" .fa)

OUTPUT_DIR=/path/to/Project_folder/06_Func_annotation/${SAMPLE_NAME}_output

mkdir -p $OUTPUT_DIR

# -a writes the predicted proteins, -d the gene nucleotide sequences,
# and -f gff makes -o a true GFF file (Prodigal defaults to GenBank format otherwise)
prodigal -a $OUTPUT_DIR/${SAMPLE_NAME}.protein.faa -d $OUTPUT_DIR/${SAMPLE_NAME}.contigs.fna -i $F -f gff -o $OUTPUT_DIR/${SAMPLE_NAME}.gff

done
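
A quick way to sanity-check the Prodigal runs is to count the predicted protein-coding genes per bin (each protein record starts with a ">" header):

# Count predicted genes in each bin's protein file (prints file:count per bin)
grep -c ">" /path/to/Project_folder/06_Func_annotation/*_output/*.protein.faa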

3.8.2.2 KOFAM Scan

3.8.2.2.1 Data Requirement
  • Input: Protein (.faa) files from Prodigal

  • Output: KEGG Mapper file in .txt format.

3.8.2.2.2 Dependencies
## Run KOFAM

#Ref: https://taylorreiter.github.io/2019-05-11-kofamscan/

# Make a directory at /path/to/Project_folder/06_Func_annotation

mkdir /path/to/Project_folder/06_Func_annotation/02_KOFAM_Scan

cd /path/to/Project_folder/06_Func_annotation/02_KOFAM_Scan

wget ftp://ftp.genome.jp/pub/db/kofam/ko_list.gz        # download the ko list 
wget ftp://ftp.genome.jp/pub/db/kofam/profiles.tar.gz       # download the hmm profiles
wget ftp://ftp.genome.jp/pub/tools/kofam_scan/kofam_scan-1.3.0.tar.gz # download kofamscan tool
wget ftp://ftp.genome.jp/pub/tools/kofamscan/README.md      # download README


# unzip and untar the relevant files:

gunzip ko_list.gz

tar xf profiles.tar.gz

tar xf kofam_scan-1.3.0.tar.gz

## Make a directory that will store the outputs

mkdir /path/to/Project_folder/06_Func_annotation/02_KOFAM_Scan/01_output

# Make a conda environment using miniconda

mamba create -n kofamscan

# activate the environment

mamba activate kofamscan

# Now, install ruby into the new environment

mamba install -c conda-forge ruby

# Also install kofamscan, hmmer, and parallel

mamba install kofamscan hmmer parallel
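
Before running the scan, it can help to confirm that the executables referenced in config.yml are actually available in the environment; a small optional check:

# Confirm the required executables are on PATH in the kofamscan environment
command -v hmmsearch parallel ruby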




# Make a config.yml file that lists the required paths; paste the template below and update it with your own directories

vim config.yml

####PASTE BELOW INTO NEW YML AND UPDATE WITH NEW PATHS
# Path to your KO-HMM database
# A database can be a .hmm file, a .hal file or a directory in which
# .hmm files are. Omit the extension if it is .hal or .hmm file
profile: /path/to/Project_folder/06_Func_annotation/02_KOFAM_Scan/profiles

# Path to the KO list file
ko_list: /path/to/Project_folder/06_Func_annotation/02_KOFAM_Scan/ko_list

# Path to an executable file of hmmsearch
# You do not have to set this if it is in your $PATH
hmmsearch: ~/miniconda3/envs/kofamscan/bin/hmmsearch

# Path to an executable file of GNU parallel
# You do not have to set this if it is in your $PATH
parallel: ~/miniconda3/envs/kofamscan/bin/parallel

# Number of hmmsearch processes to be run parallelly
cpu: 48
####
3.8.2.2.3 Code
# Use a for loop

for F in /path/to/Project_folder/06_Func_annotation/*Bin*/*.faa; do

SAMPLE_NAME=$(basename "${F}" .faa)

OUTPUT_DIR=/path/to/Project_folder/06_Func_annotation/02_KOFAM_Scan/01_output

# Note: adjust the exec_annotation path to match your own kofamscan install location
~/miniconda3/pkgs/kofamscan-1.3.0-hdfd78af_2/bin/exec_annotation -c /path/to/Project_folder/06_Func_annotation/02_KOFAM_Scan/config.yml $F -o $OUTPUT_DIR/${SAMPLE_NAME}.txt

done
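
Before reformatting for MinPath, it is worth inspecting the head of one output file: in the default KofamScan format the first two lines are headers and significant hits are marked with an asterisk, which is exactly what the clean-up steps in the next section deal with. The file name below is a placeholder:

# Inspect the first lines of one KofamScan output file
head -n 5 /path/to/Project_folder/06_Func_annotation/02_KOFAM_Scan/01_output/Bin.001.protein.txt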

3.8.2.3 MinPath

3.8.2.3.1 Data Requirement
  • Input: KEGG Mapper file from KofamScan

  • Output: KEGG Output file with annotated genes.

3.8.2.3.2 Dependencies
# Change to the folder

cd /path/to/Project_folder/06_Func_annotation

# Make a new folder for the MinPath annotation

mkdir 04_MinPath

cd 04_MinPath


## Clone the git repository that contains the MinPath executable (https://github.com/mgtools/MinPath)

git clone https://github.com/mgtools/MinPath


# Use the information on this webpage to install glpk (https://github.com/mgtools/MinPath/blob/master/glpk-4.6/INSTALL)

# Move into the glpk folder

cd /path/to/Project_folder/06_Func_annotation/04_MinPath/MinPath/glpk-4.6

# Configure the package

./configure


# Compile the package

make

# Check the package

make check

# Install the package. Note: By default, 'make install' will install the
# package's files in '/usr/local/bin' and '/usr/local/lib'. If needed, specify
# an installation prefix other than '/usr/local'

make prefix=~/miniconda3/bin install

### Make sure that you can call the glpsol command under glpk-4.6/examples

cd /path/to/Project_folder/06_Func_annotation/04_MinPath/MinPath/glpk-4.6/examples

./glpsol -h
3.8.2.3.3 Code
## Now run the analysis
# Ref: https://github.com/mgtools/MinPath/blob/master/readme

# Make a folder for the input and output files

mkdir /path/to/Project_folder/06_Func_annotation/04_MinPath/01_input

mkdir /path/to/Project_folder/06_Func_annotation/04_MinPath/02_output


# Copy the .txt files in "/path/to/Project_folder/06_Func_annotation/02_KOFAM_Scan/01_output" into "/path/to/Project_folder/06_Func_annotation/04_MinPath/01_input"


cp /path/to/Project_folder/06_Func_annotation/02_KOFAM_Scan/01_output/*txt /path/to/Project_folder/06_Func_annotation/04_MinPath/01_input

# Change to /path/to/Project_folder/06_Func_annotation/04_MinPath/01_input


cd /path/to/Project_folder/06_Func_annotation/04_MinPath/01_input

Format the KEGG Mapper files into a form that the MinPath program can use:

for F in /path/to/Project_folder/06_Func_annotation/04_MinPath/01_input/*.txt; do

SAMPLE_NAME=$(basename "${F}" .txt)

OUTPUT_DIR=/path/to/Project_folder/06_Func_annotation/04_MinPath/01_input

##### convert this output to a form that can be used with KEGG decoder

awk '{print $1, $2}' $F > $OUTPUT_DIR/${SAMPLE_NAME}.mapper.txt


# Delete the first two header rows using the tail command

tail -n +3 $OUTPUT_DIR/${SAMPLE_NAME}.mapper.txt > $OUTPUT_DIR/${SAMPLE_NAME}.mapper2.txt

# Some rows contain an asterisk "*" in the first column; remove those rows with awk

awk '$1 != "*" { print }' $OUTPUT_DIR/${SAMPLE_NAME}.mapper2.txt > $OUTPUT_DIR/${SAMPLE_NAME}.mapper3.txt


## Turn the file into a tab-separated file

# Use either sed or awk (both produce the same tab-separated result)

sed 's/ /\t/g' $OUTPUT_DIR/${SAMPLE_NAME}.mapper3.txt > $OUTPUT_DIR/${SAMPLE_NAME}.mapper4.txt


awk 'BEGIN {FS=" "; OFS="\t"} {$1=$1}1' $OUTPUT_DIR/${SAMPLE_NAME}.mapper3.txt > $OUTPUT_DIR/${SAMPLE_NAME}.mapper4.txt


# In the second column, replace any hyphen "-" with nothing "", keeping the rows

awk '{sub(/-/, "", $2)}1' $OUTPUT_DIR/${SAMPLE_NAME}.mapper4.txt > $OUTPUT_DIR/${SAMPLE_NAME}.mapper5.txt

done

Alternatively, you can make the code more concise by chaining the commands above with pipes (|).

for F in /path/to/Project_folder/06_Func_annotation/04_MinPath/01_input/*.txt; do

SAMPLE_NAME=$(basename "${F}" .txt)

OUTPUT_DIR=/path/to/Project_folder/06_Func_annotation/04_MinPath/01_input

awk '{print $1, $2}' $F | tail -n +3 | awk '$1 != "*" { print }' | awk 'BEGIN {FS=" "; OFS="\t"} {$1=$1}1' | awk '{sub(/-/, "", $2)}1' > $OUTPUT_DIR/${SAMPLE_NAME}.mapperfile.txt

done

Now that the Mapper files have been formatted appropriately, let’s run the program. Check this link for more details.

for F in /path/to/Project_folder/06_Func_annotation/04_MinPath/01_input/*.mapperfile.txt; do

SAMPLE_NAME=$(basename "${F}" .txt)

OUTPUT_DIR=/path/to/Project_folder/06_Func_annotation/04_MinPath/02_output

python /path/to/Project_folder/06_Func_annotation/04_MinPath/MinPath/MinPath.py \
    -ko $F \
    -report $OUTPUT_DIR/${SAMPLE_NAME}.minpath \
    -details $OUTPUT_DIR/${SAMPLE_NAME}.minpath.details

done

You will get a KEGG Output file with annotated genes that you can import into other programs.
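
The MinPath report lists every candidate pathway with a flag indicating whether MinPath kept it. A quick, hedged way to pull out only the reconstructed pathways, assuming the report lines follow the "... naive 1 minpath 1 ... name <pathway>" layout described in the MinPath readme:

# List only the pathways that MinPath kept (flagged "minpath 1" in the report)
grep "minpath 1" /path/to/Project_folder/06_Func_annotation/04_MinPath/02_output/*.minpath | sed 's/.*name[[:space:]]*//'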

4 References

Alneberg, J., B.S. Bjarnason, I. de Bruijn, M. Schirmer, J. Quick, U.Z. Ijaz, L. Lahti, et al. 2014. Binning metagenomic contigs by coverage and composition. Nat Methods 11: 1144–1146.
Andersen, S.B., and J. Schluter. 2021. A metagenomics approach to investigate microbiome sociobiology. Proceedings of the National Academy of Sciences 118: e2100934118. Available at: https://www.pnas.org/doi/abs/10.1073/pnas.2100934118.
Aramaki, T., R. Blanc-Mathieu, H. Endo, K. Ohkubo, M. Kanehisa, S. Goto, and H. Ogata. 2019. KofamKOALA: KEGG Ortholog assignment based on profile HMM and adaptive score threshold. Bioinformatics 36: 2251–2252. Available at: https://doi.org/10.1093/bioinformatics/btz859.
Arkin, A.P., R.W. Cottingham, C.S. Henry, N.L. Harris, R.L. Stevens, S. Maslov, P. Dehal, et al. 2018. KBase: The united states department of energy systems biology knowledgebase. Nature Biotechnology 36: 566–569. Available at: https://doi.org/10.1038/nbt.4163.
Berg, G., D. Rybakova, D. Fischer, T. Cernava, M.-C.C. Vergès, T. Charles, X. Chen, et al. 2020. Microbiome definition re-visited: Old concepts and new challenges. Microbiome 8: 103. Available at: https://doi.org/10.1186/s40168-020-00875-0.
Bharti, R., and D.G. Grimm. 2019. Current challenges and best-practice protocols for microbiome analysis. Briefings in Bioinformatics 22: 178–193. Available at: https://api.semanticscholar.org/CorpusID:209409553.
Browne, H.P., S.C. Forster, B.O. Anonye, N. Kumar, B.A. Neville, M.D. Stares, D. Goulding, and T.D. Lawley. 2016. Culturing of “unculturable” human microbiota reveals novel taxa and extensive sporulation. Nature 533: 543–546. Available at: https://doi.org/10.1038/nature17645.
Buchfink, B., C. Xie, and D.H. Huson. 2015. Fast and sensitive protein alignment using DIAMOND. Nature Methods 12: 59–60. Available at: https://doi.org/10.1038/nmeth.3176.
Chaumeil, P.-A., A.J. Mussig, P. Hugenholtz, and D.H. Parks. 2019. GTDB-Tk: a toolkit to classify genomes with the Genome Taxonomy Database. Bioinformatics 36: 1925–1927. Available at: https://doi.org/10.1093/bioinformatics/btz848.
Gulati, M., and B. Plosky. 2020. As the microbiome moves on toward mechanism. Molecular Cell 78: 567. Available at: https://www.sciencedirect.com/science/article/pii/S1097276520303075.
Hyatt, D., G.-L. Chen, P.F. LoCascio, M.L. Land, F.W. Larimer, and L.J. Hauser. 2010. Prodigal: Prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics 11: 119. Available at: https://doi.org/10.1186/1471-2105-11-119.
Kanehisa, M., and S. Goto. 2000. KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Research 28: 27–30. Available at: https://doi.org/10.1093/nar/28.1.27.
Kang, D.D., F. Li, E. Kirton, A. Thomas, R. Egan, H. An, and Z. Wang. 2019. MetaBAT 2: An adaptive binning algorithm for robust and efficient genome reconstruction from metagenome assemblies. PeerJ 7: e7359.
Li, D., C.-M. Liu, R. Luo, K. Sadakane, and T.-W. Lam. 2015. MEGAHIT: an ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph. Bioinformatics 31: 1674–1676. Available at: https://doi.org/10.1093/bioinformatics/btv033.
Lin, H.-H., and Y.-C. Liao. 2016. Accurate binning of metagenomic contigs via automated clustering sequences using information of genomic signatures and marker genes. Scientific Reports 6: 24175. Available at: https://doi.org/10.1038/srep24175.
Melton, A.E., A.W. Child, J. Beard Richard S, C.D.C. Dumaguit, J.S. Forbey, M. Germino, M.-A. de Graaff, et al. 2022. A haploid pseudo-chromosome genome assembly for a keystone sagebrush species of western North American rangelands. G3 Genes|Genomes|Genetics 12: jkac122. Available at: https://doi.org/10.1093/g3journal/jkac122.
Miller, R. 2010. Characteristics of sagebrush habitats and limitations to long-term conservation.
Nayfach, S., Z.J. Shi, R. Seshadri, K.S. Pollard, and N.C. Kyrpides. 2019. New insights from uncultivated genomes of the global human gut microbiome. Nature 568: 505–510. Available at: https://doi.org/10.1038/s41586-019-1058-x.
Nurk, S., D. Meleshko, A. Korobeynikov, and P.A. Pevzner. 2017. metaSPAdes: A new versatile metagenomic assembler. Genome Res 27: 824–834.
Peng, Y., H.C.M. Leung, S.M. Yiu, and F.Y.L. Chin. 2012. IDBA-UD: a de novo assembler for single-cell and metagenomic sequencing data with highly uneven depth. Bioinformatics 28: 1420–1428. Available at: https://doi.org/10.1093/bioinformatics/bts174.
Remington, T.E., P.A. Deibert, S.E. Hanser, D.M. Davis, L.A. Robb, and J.L. Welty. 2021. Sagebrush conservation strategy—challenges to sagebrush conservation. U. S. G. Survey [ed.],. Reston, VA. Available at: https://doi.org/10.3133/ofr20201125.
Rigge, M., C. Homer, L. Cleeves, D.K. Meyer, B. Bunde, H. Shi, G. Xian, et al. 2020. Quantifying western U.S. rangelands as fractional components with multi-resolution remote sensing and in situ data. Remote Sensing 12: Available at: https://www.mdpi.com/2072-4292/12/3/412.
Shipley, L.A., T.B. Davila, N.J. Thines, and B.A. Elias. 2006. Nutritional requirements and diet choices of the pygmy rabbit (Brachylagus idahoensis): A sagebrush specialist. Journal of Chemical Ecology 32: 2455–2474. Available at: https://doi.org/10.1007/s10886-006-9156-2.
Wang, Z., Z. Wang, Y.Y. Lu, F. Sun, and S. Zhu. 2019. SolidBin: improving metagenome binning with semi-supervised normalized cut. Bioinformatics 35: 4229–4238. Available at: https://doi.org/10.1093/bioinformatics/btz253.
Wood, D.E., J. Lu, and B. Langmead. 2019. Improved metagenomic analysis with Kraken 2. Genome Biology 20: 257. Available at: https://doi.org/10.1186/s13059-019-1891-0.
Wu, Y.-W., B.A. Simmons, and S.W. Singer. 2015. MaxBin 2.0: an automated binning algorithm to recover genomes from multiple metagenomic datasets. Bioinformatics 32: 605–607. Available at: https://doi.org/10.1093/bioinformatics/btv638.
Ye, Y., and T.G. Doak. 2009. A parsimony approach to biological pathway reconstruction/inference for genomes and metagenomes. PLoS Comput Biol 5: e1000465.
Yu, G., Y. Jiang, J. Wang, H. Zhang, and H. Luo. 2018. BMC3C: binning metagenomic contigs using codon usage, sequence composition and read coverage. Bioinformatics 34: 4172–4179. Available at: https://doi.org/10.1093/bioinformatics/bty519.

Appendices

A Appendix 1

Citations of all R packages used to generate this report.

[1] J. Allaire, Y. Xie, C. Dervieux, et al. rmarkdown: Dynamic Documents for R. R package version 2.25. 2023. https://github.com/rstudio/rmarkdown.

[2] S. M. Bache and H. Wickham. magrittr: A Forward-Pipe Operator for R. R package version 2.0.3. 2022. https://magrittr.tidyverse.org.

[3] C. Boettiger. knitcitations: Citations for Knitr Markdown Files. R package version 1.0.12. 2021. https://github.com/cboettig/knitcitations.

[4] J. Cheng, C. Sievert, B. Schloerke, et al. htmltools: Tools for HTML. R package version 0.5.7. 2023. https://github.com/rstudio/htmltools.

[5] R. Francois and D. Hernangómez. bibtex: Bibtex Parser. R package version 0.5.1. 2023. https://github.com/ropensci/bibtex.

[6] C. Glur. data.tree: General Purpose Hierarchical Data Structure. R package version 1.1.0. 2023. https://github.com/gluc/data.tree.

[7] R. Iannone. DiagrammeR: Graph/Network Visualization. R package version 1.0.10. 2023. https://github.com/rich-iannone/DiagrammeR.

[8] Y. Qiu. prettydoc: Creating Pretty Documents from R Markdown. R package version 0.4.1. 2021. https://github.com/yixuan/prettydoc.

[9] R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing. Vienna, Austria, 2022. https://www.R-project.org/.

[10] K. Ren and K. Russell. formattable: Create Formattable Data Structures. R package version 0.2.1. 2021. https://renkun-ken.github.io/formattable/.

[11] H. Wickham, J. Bryan, M. Barrett, et al. usethis: Automate Package and Project Setup. R package version 2.2.2. 2023. https://usethis.r-lib.org.

[12] H. Wickham, R. François, L. Henry, et al. dplyr: A Grammar of Data Manipulation. R package version 1.1.3. 2023. https://dplyr.tidyverse.org.

[13] H. Wickham, J. Hester, W. Chang, et al. devtools: Tools to Make Developing R Packages Easier. R package version 2.4.5. 2022. https://devtools.r-lib.org/.

[14] Y. Xie. bookdown: Authoring Books and Technical Documents with R Markdown. Boca Raton, Florida: Chapman and Hall/CRC, 2016. ISBN: 978-1138700109. https://bookdown.org/yihui/bookdown.

[15] Y. Xie. bookdown: Authoring Books and Technical Documents with R Markdown. R package version 0.36. 2023. https://github.com/rstudio/bookdown.

[16] Y. Xie. Dynamic Documents with R and knitr. 2nd. ISBN 978-1498716963. Boca Raton, Florida: Chapman and Hall/CRC, 2015. https://yihui.org/knitr/.

[17] Y. Xie. “knitr: A Comprehensive Tool for Reproducible Research in R”. In: Implementing Reproducible Computational Research. Ed. by V. Stodden, F. Leisch and R. D. Peng. ISBN 978-1466561595. Chapman and Hall/CRC, 2014.

[18] Y. Xie. knitr: A General-Purpose Package for Dynamic Report Generation in R. R package version 1.45. 2023. https://yihui.org/knitr/.

[19] Y. Xie, J. Allaire, and G. Grolemund. R Markdown: The Definitive Guide. Boca Raton, Florida: Chapman and Hall/CRC, 2018. ISBN: 9781138359338. https://bookdown.org/yihui/rmarkdown.

[20] Y. Xie, C. Dervieux, and E. Riederer. R Markdown Cookbook. Boca Raton, Florida: Chapman and Hall/CRC, 2020. ISBN: 9780367563837. https://bookdown.org/yihui/rmarkdown-cookbook.

[21] H. Zhu. kableExtra: Construct Complex Table with kable and Pipe Syntax. R package version 1.3.4. 2021. http://haozhu233.github.io/kableExtra/.

B Appendix 2

Version information about R, the operating system (OS), and attached or loaded R packages. This appendix was generated using sessionInfo().

## R version 4.2.1 (2022-06-23)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur ... 10.16
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] DiagrammeR_1.0.10    data.tree_1.1.0      devtools_2.4.5      
##  [4] usethis_2.2.2        bibtex_0.5.1         knitcitations_1.0.12
##  [7] htmltools_0.5.7      prettydoc_0.4.1      magrittr_2.0.3      
## [10] dplyr_1.1.3          kableExtra_1.3.4     formattable_0.2.1   
## [13] bookdown_0.36        rmarkdown_2.25       knitr_1.45          
## 
## loaded via a namespace (and not attached):
##  [1] httr_1.4.7         tidyr_1.3.0        sass_0.4.8         pkgload_1.3.3     
##  [5] jsonlite_1.8.8     viridisLite_0.4.2  bslib_0.6.1        shiny_1.8.0       
##  [9] highr_0.10         yaml_2.3.8         remotes_2.4.2.1    sessioninfo_1.2.2 
## [13] pillar_1.9.0       backports_1.4.1    glue_1.7.0         digest_0.6.33     
## [17] RColorBrewer_1.1-3 promises_1.2.1     rvest_1.0.3        RefManageR_1.4.0  
## [21] colorspace_2.1-0   httpuv_1.6.12      plyr_1.8.9         pkgconfig_2.0.3   
## [25] purrr_1.0.2        xtable_1.8-4       scales_1.3.0       webshot_0.5.5     
## [29] processx_3.8.2     svglite_2.1.2      later_1.3.1        timechange_0.2.0  
## [33] tibble_3.2.1       generics_0.1.3     ellipsis_0.3.2     withr_2.5.2       
## [37] cachem_1.0.8       cli_3.6.2          crayon_1.5.2       mime_0.12         
## [41] memoise_2.0.1      evaluate_0.23      ps_1.7.5           fs_1.6.3          
## [45] fansi_1.0.5        xml2_1.3.5         pkgbuild_1.4.2     profvis_0.3.8     
## [49] tools_4.2.1        prettyunits_1.2.0  formatR_1.14       lifecycle_1.0.4   
## [53] stringr_1.5.1      V8_4.4.0           munsell_0.5.0      callr_3.7.3       
## [57] compiler_4.2.1     jquerylib_0.1.4    systemfonts_1.0.5  rlang_1.1.3       
## [61] rstudioapi_0.15.0  visNetwork_2.1.2   htmlwidgets_1.6.2  miniUI_0.1.1.1    
## [65] curl_5.1.0         R6_2.5.1           lubridate_1.9.3    fastmap_1.1.1     
## [69] utf8_1.2.4         DiagrammeRsvg_0.1  stringi_1.8.3      Rcpp_1.0.11       
## [73] vctrs_0.6.5        tidyselect_1.2.0   xfun_0.41          urlchecker_1.0.1