Illumina fastq format / quality score confusion

My notes on the sequence/quality data from Illumina genome analyser.

Phred

Phred is a program that takes the trace files produced by traditional DNA sequencing, calls the bases and assigns a quality value to each called base.

It calculates a score for each base as:

Qphred = -10 x log10(1/$error_prob)

Where $error_prob is the probability of the base call being wrong.

Note that if you see this written in Perl, it will look like:

Qphred = -10 x log($error_prob) / log(10)

Because Perl’s log function takes natural log (and as you’ve probably forgotten from GCSE maths, log_n(x) = log_e(x)/log_e(n)).

It encodes the base quality scores for a given sequence read as a series of ASCII characters by using, for each score, the ASCII character corresponding to Qphred + 33 (ASCII characters 0-31 are non-printing control characters, 32 is a blank space)


FASTQ

FASTQ files store the called base sequence and the associated quality scores in a single file. Its format is:

@seqname
seq
seqname
quality

For the regex literate, the following should match:

seqname  [A-Za-z0-9_.:-]+
seq	      [A-Za-z\n\.~]+
quality     [!-~\n]+

Illumina Genome Analyzer Pipeline Output Files

Illumina’s Next-Gen Sequencing platform (Genome Analyzer II) works in a different way to traditional Sanger sequencing but it still has to call bases and give them a quality score. The base-calling bit of Illumina’s analysis pipeline is called Bustard.

Pipeline v1.3 generates a *_qseq.txt file which contains the sequence and quality data for each tile in each lane of the flow cell. Pipeline v1.0 generates three separate files (_seq.txt, _sig2.txt and _prb.txt) to contain the same info. As far as I know, you can ask GERALD to spit out fastq formatted files, one per lane – which is the kind of thing most alignment tools want as input.

The sequence and quality info is also in the GERALD output files *export.txt and *sorted.txt. These contain the ELAND alignment results, one file per lane.


Caveats when dealing with Illumina files

Quality Filtering

Illumina calculates a Chastity Score as “the ratio of the highest of the four (base type) intensities to the sum of highest two”.
This score is used to remove clusters with low signal to noise ratio (often caused by clusters being too close to each other so their signals bleed into one another). The default cut-off for filtering is >=0.6.

But only the GERALD sorted.txt file has actually been filtered. Other files contain information about all the reads and just have a flag to indicate whether they passed the filtering test. So you may want to remove the failures before using the file for anything.

Where to find the filtering results:

* GERALD-generated *.fastq files have sequence IDs terminating in either a “Y” or a “N” indicating whether or not that sequence passed the quality filter.
* Pipeline v1.3 *_qseq files store the same information as 1 or 0 in their final column.
* Pipeline v1.0 I’m not sure about. It’s different to v1.3 as filtering is not done by Bustard but by GERALD. Don’t have the v1.0 docs to see what the filtering results file looks like.

Scoring Scheme

I don’t understand why, but Solexa originally decided not to use the phred calculation for their quality scores, instead using:

Qsolexa = -10 x log10($error_prob/(1-$error_prob))

They then decided that instead of using the phred ASCII conversion, they would use the ASCII character corresponding to Qsolexa+64

As of Pipeline v1.3, Illumina decided they would use the more standard phred scores. Which is great, except that to encode them as ASCII they’re using Qphred+64.

So, in summary there are now 3 different types of fastq files:

Standard fastq:   ASCII( Qphred + 33 )
Illumina pre v1.3 :  ASCII( Qsolexa + 64 )
Illumina post v1.3: ASCII( Qphred+64 )

Discriminating between fastq-like file types

min Qphred = 4
min QSolexa = -5

(according to www.phrap.com and the Bioc-sig-seq mailing list)

So:

Filetype                 min ASCII char code
Standard fastq:      37 ( 4 + 33 )
Illumina pre v1.3 :   59 ( -5 + 64 )
Illumina post v1.3:  68 ( 4 + 64 )

So presumably:

Chars 37-58 are only seen in standard fastq:

& ' ( ) * + , - . / 0 2 3 4 5 6 7 8 9 :

If none of the above characters are present, data is probably Illumina so:

Chars 59-68 are not seen in the Illumina files post v1.3:

; < = > ? @ A B C D

I haven’t checked the above is actually true yet. Use with caution đŸ™‚

Any Base Code

The standard IUB Base Code for any base is N. The Illumina Pipeline uses a dot. So, AGTGTANNAGT would be AGTGTA..AGT

Conversion

Most alignment tools expect standard fastq files, although some will let you use Illumina formats if you specifically tell them to. It’s fairly straightforward to convert from the Illumina types to standard fastq though.

For each of the sequence lines, all you need to do is swap all the dots for Ns:

#for sequence lines, swap dots for Ns.
$line =~s/\./N/g

v1.3 qualities lines just need to have 31 subtracted from their ASCII values (ord and chr docs):

$line = join '',
          map {chr(ord($_)-31)}
            split '', $line;

v1.0 is a little more complicated as we need to convert the scores as well as the ASCII representation.

Rearrange the formula for Qsolexa to get a value for 1/p:

pic1

pic2

pic3

And substitute this into the Qphred calculation:

pic4

pic5

So, to convert a solexa quality line in perl you can do something like:

$line = join '',
          map {chr(33 + 10* ( log(1+10**((ord($_)-64)/10)) / log(10) ))}
            split '', $line;

If you’ve got v1.0 files, you might want to have a look at the fq_all2std.pl script which is provided by the Maq.

Advertisements

3 responses to “Illumina fastq format / quality score confusion

  1. Useful info – thanks – keep up the posting.

  2. Very helpful. I’ve been trying to learn about the output files by reading the Illumina manual but hadn’t got this far yet.

  3. Hi Maria!

    Glad it’s useful. Just be warned that I’m not 100% sure it’s all correct – it’s just what I’ve gleaned from the various bits of documentation I can get my hands on and a couple of frustrating days Googling.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s