4. Prepare abundance tables for downstream analyses

So, after classifying the reads of each sample with Kraken2 we estimate the abudances with Bracken. The output files with the taxa abundances were saved in the sample.bracken files. In the following steps we will use some packages in R to explore the microbial datasets and create charts from the abundance tables generated with Bracken.

4.1. Parse the kraken/bracken outputs to abundance tables

First of all, we have to create an abundance table comprehensive of all the taxa abundances for every sample. To do that we will use a custom R script, brackenToAbundanceTable_v2.R, which merges the species aundances contained in the bracken output of each sample in one table. The script needs as argument the path to the folder containing the bracken results. Move to the folder with the bracken results and type the following command:

brackenToAbundanceTable_v2.R .

The script will generate the abundance table taxa_abundance_bracken_names_IDs.txt, which contains the species names and the NCBI IDs in each row, and the samples in the columns.

Note

You can generate abundance tables for various datasets (with bracken files contained in dedicated folders) and rename the final tables based on the folder (i.e. dataset) name. Try the following commands in the the folder containing all the datasets-subfolders:

for i in $(find -name "*names_IDs.txt" -type f | sort); do tag=$(basename $(dirname "$i")); cp $i ${FOLDER}/${tag}_abundance_bracken_names_IDs.txt; done

We will compare the oral microbiome of the Gombe chimps reconstructed from their dental calculus with other microbiomes. To do that we will merge the species abundance table of the chimps with the abundance table refDataset1_abundance_table_IDs.merged containing species abundances of various microbiomes:

Human gut microbiome from present-day agriculturalist communities in Peru and an urban-industrialized community from the US.

Human skin microbiome from children with atopic dermatitis.

Soil microbiome from Arctic tundra in Alaska and grassland in Oklahoma.

Human supragingival plaque microbiome from the Human Microbiome Project Database.

This table (download here) was generated by running the raw sequencing data of each dataset against the customkraken2_2Gb_Feb2023_db database that we are using with Kraken2. Do not compare datasets generated from different databases. To merge the chimp and reference abundance tables use the abundanceTablesMerger_v2.R script. The output is a file named abundance_table_IDs.merged, which we will use for downstream analyses.

abundanceTablesMerger_v2.R taxa_abundance_bracken_names_IDs.txt refDataset1_abundance_table_IDs.merged

If you have multiple datasets, you can merge them by listing each file one after the other:

abundanceTablesMerger_v2.R dataset1.txt dataset2.txt dataset3.txt

4.2. Normalization of the abundance table for genome length

Since the Kraken database was built from complete Bacterial, Archaeal and Viral genomes, we must make a normalization for the genome leghts of each species. This normalization is important when we want to analyse relative taxa abundances to characterise full microbiomes. To do that will use a Python script that takes three arguments:

The abundance table with names
The table of genome lengths (parsed from the NCBI genome browser)
The name of the output file

The script searches the names of the species in the file with the list of genome lengths of species in the NCBI and divides the abundance of each species by the length of the assembly available in the NCBI. You can put the table of the genome lenghts inside the folder containing your species abundance table, or you can assign it a variable with the path. Run the normalization with gL-normalizer-lite_v3.py and type the command:

TABLE=/path/to/folder/gL-prokaryotes-viruses-Jun2022.table
python gL-normalizer-lite_v3.py abundance_table_IDs.merged $TABLE abundance_table_IDs.merged.norm

The output table is named abundance_table_IDs.merged.norm. The table reports the Species, the NCBI IDs, the Assembly stage of the genome in NCBI, the Length in Megabases, the kind of Match, exact, species (when a peculiar strain is not found in the list of lenghts), and genus (when a species is not found). Some species may not be contained in the table of the genome lengths (for example if the taxonomy names have been updated or have ambiguities in the species name). The species not normalized are removed from the table, you can check them and their frequency (they are normally very rare) in the gLnotNormalized.log file.

sort -t$'\t' -g -k3,3 gLnotNormalized.log

4.3. Retrieve full taxonomic ranks of the abundance table

In the next step, you will retrieve the full taxonomic ranks of the species in your abundance table. This is useful in the downstream steps to make analyses at taxonomic ranks higher than species (e.g. genus, phylum) in Phyloseq (see below). To retrieve the full taxonomic ranks you must have the program Taxaranks installed.

Run the following script, which incorporates Taxaranks, to attach the full taxonomy to every species ID in your normalized table. Taxaranks takes the NCBI ID corresponding to each species in the table and searches the full taxonomy associated to that ID.

getFullTaxaranks_v2.sh -i abundance_table_IDs.merged.norm

The script generates the following files:

a file.taxonomy - the full taxonomic ranks (up to Kingdom) of the species entries.
a file.taxonomy.err - the species for which no taxonomic ranks could be found.
a file.taxonomy.final - full taxonomy ranks are included in the original abundance table.

Note

The file.taxonomy.err should be empty in principle. If not, then the full taxonomic ranks could not be retrieved for the species listed. In that case, try to update the NCBI taxonomy database as reported in the Taxaranks website.

Note

Depending on the composition of the Kraken DB that you used, you may want to remove unwanted taxa. To do that use grep, for example to remove Eukaryota and Viruses:

grep -v "Viruses\|Eukaryota\|Fungi" abundance_table_IDs.merged.norm.taxonomy.final > abundance_table_IDs.merged.norm.taxonomy.final.Archaea_Bacteria

If you want you could also generate a new table (without Eukaryota and Viruses) including only the species names, without the full taxonomy.

cut -f 20- abundance_table_IDs.merged.norm.final.Archaea_Bacteria > abundance_table_IDs.merged.norm.final.Archaea_Bacteria.species