2. Quality filtering of reads

2.1. Reads quality control

The reads generated by a sequencing platform are delivered in fastq format. The first step of the analysis of raw sequences is the quality-control. To do that, we will use FastQC, which provides a modular set of analyses that you can use to have a first impression of whether your data has any problems of which you should be aware before doing any further analysis. To run FastQC type the following command:

fastqc filename.fastq.gz

To analyze multiple fastq files you can run FastQC as follows:

fastqc *.fastq.gz

At the end of the analysis, FastQC generates for each input file a summary report, like in the screenshot below:

Note

If you are working on a server, you can download the reports of FastQC (and any other file) in your laptop with the command scp (Secure Copy), which allows files to be copied to, from, or between different hosts. It uses ssh for data transfer and provides the same authentication and same level of security as ssh. For example, to copy from a remote host (our server) to your computer:
```
scp username@remotehost:/full_path_to_file /some/local/directory
```

To copy a folder you need to call the option -r

scp -r username@remotehost:/full_path_to_file /some/local/directory

If you are using a pem file to connect to the server, you have to use in order to download the files:
```
scp -i filename.pem -r username@remotehost:/full_path_to_file /some/local/directory
```

2.2. Reads quality filtering

Reads filtering is a crucial step as it will affect all downstream analyses. One of the important things to do is to trim the adapters that were used during the preparation of the genomic libraries. For this step we will use the program AdapterRemoval, which performs adapter trimming and quality filtering of the reads. When working with degraded samples (such as ancient DNA), in case of paired-end sequencing, is it recommended to merge (collapse) the paired reads with negative insert sizes (an overlap between two sequencing reads derived from a single DNA fragment) into single collapsed reads. Here, we will analyse the DNA reads generated by the sequencing of dental calculus samples from skeletal material of chimpanzees from the Gombe National Park (Tanzania). With the following command, the trimmed and merged reads are saved in the file basename.collapsed.gz, where basename is the prefix selected for saving the files. The basename.collapsed.truncated.gz contains merged reads that have been trimmed due to the –trimns or –trimqualities options. The files basename.pair1.truncated.gz and basename.pair2.truncated.gz contain trimmed pairs of reads which were not collapsed, basename.singleton.truncated contains reads where one mate was discarded.

AdapterRemoval --file1 SRR8899212_1.fastq.gz --file2 SRR8899212_2.fastq.gz --basename SRR8899212 --minlength 30 --trimns --trimqualities --minquality 25 --collapse --gzip

Here some of the options of AdapterRemoval:

Option	Function
-file1 string	Forward reads input file(s) in fastq(.gz) file format. Required option (single-end reads).
-file2 string	Reverse reads input file(s) in fastq(.gz) file format.
–basename string	Default prefix for all output files for which no filename was explicitly set [current: your_output]
–adapter1 sequence	Adapter sequence expected to be found in mate 1 reads [current: AGATCGGAAGAGCACACGTCTGAACTCCAGTCACNNNNNNATCTCGTATGCCGTCTTCTGCTTG]
–adapter2 sequence	Adapter sequence expected to be found in mate 2 reads [current: AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT]
–trimns	If set, trim ambiguous bases (N) at 5’/3’ termini [current: off]
–trimqualities	If set, trim bases at 5’/3’ termini with quality scores <= to –minquality value [current: off]
–minquality integer	PHRED inclusive minimum values; see –trimqualities for details [current: 2]
–minlength integer	Reads shorter than this length are discarded following trimming [current: 15].
–collapse	When set, paired ended read alignments of –minalignmentlength or more bases are combined into a single consensus sequence, representing the complete insert
–minalignmentlength integer	If –collapse is set, paired reads must overlap at least this number of bases to be collapsed, and single-ended reads must overlap at least this number of bases with the adapter to be considered complete template molecules [current: 11].

To analyse multiple samples one after the other, use a for loop and a text file (tab-delimited) where you list the file names and the basename. These will be fed to the command using cat. Here is a sample sheet file for the analysis of the 8 chimp dental calculus samples.

cat samples_sheet_AdapterRemoval.txt | while read line; do
  R1="$(echo $line | awk '{ print $1 }')"
  R2="$(echo $line | awk '{ print $2 }')"
  ID="$(echo $line | awk '{ print $3 }')"
  echo "Analysing sample: $ID"
  AdapterRemoval --file1 $R1 --file2 $R2 --basename $ID --minlength 30 --trimns --trimqualities --minquality 25 --collapse --gzip
done

Note

Here we are using a paired-end library, for single-end libraries use the following command. The quality-filtered reads are saved in the file basename.truncated.gz, the discarded FASTQ reads are written to basename.discarded.gz, and settings and summary statistics are written to basename.settings.
```
AdapterRemoval --file1 filename_R1.fastq --basename filename --minlength 30 --trimns --trimqualities --minquality 25 --gzip
```
Several tools can be used for reads pre-processing and filtering, for example: ClipAndMerge, leeHom, Atropos, fastp.

After reads filtering open your adapter-trimmed fastq file again in FastQC and see the differences before (the two original paired-end reads files) and after (the collapsed reads file) adapter trimming.