Macs Peak Finding

My notes on how Macs works. Probably riddled with errors – check out the links below for official docs:

Website
Genome Biology Paper

We’ve used ChIP to grab bits of DNA that correspond to the region bound by the transcription factor (or marked by the histone modification, or whatever). So the data looks something like this:

[Figure: full-length ChIP fragments overlapping the binding site]

Except that the reads are not full length: only the ends of the DNA fragments are sequenced, which gives a biased picture more like:

[Figure: sequenced tags piling up at the ends of the fragments, on either side of the binding site]

Macs deals with this problem by sliding a window (of 2 x the user-defined bandwidth) along the genome to locate regions whose tags are reliably enriched (more than a user-defined mfold enrichment relative to the background model).

[Figure: window sliding along the genome over the tags]
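Here is a rough Python sketch of that window scan, just to make the idea concrete. It is not the Macs source: the function name, the half-window step size and the uniform "expected tags per window" background are all my own assumptions for illustration.

```python
from bisect import bisect_left


def enriched_windows(tag_positions, genome_size, bandwidth, mfold):
    """Return (start, end, tag_count) for windows enriched more than mfold-fold.

    The window is 2 x bandwidth wide. The background here is simply the number
    of tags expected in a window if tags were spread evenly over the genome
    (an assumption made for this sketch).
    """
    positions = sorted(tag_positions)
    window = 2 * bandwidth
    expected = len(positions) * window / genome_size

    hits = []
    # Step by one bandwidth so that neighbouring windows overlap by half
    # (the step size is a choice made for this sketch).
    for start in range(0, genome_size - window + 1, bandwidth):
        end = start + window
        # Count the tags falling in [start, end) with two binary searches.
        count = bisect_left(positions, end) - bisect_left(positions, start)
        if count > mfold * expected:
            hits.append((start, end, count))
    return hits


# Toy data: a cluster of 20 tags plus two stray tags on a 10 kb "genome".
toy_tags = list(range(1000, 1020)) + [5000, 9000]
toy_hits = enriched_windows(toy_tags, genome_size=10000, bandwidth=100, mfold=10)
```

With the toy data only the two windows that cover the cluster pass the threshold; the windows holding a single stray tag do not.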

It then selects 1000 of these reliably enriched regions at random, separates their forward-strand and reverse-strand tags, and aligns them by their mid-points:

[Figure: sampled regions aligned by their mid-points]

Macs uses this alignment to estimate the end bias: the distance, d, between the peaks of the forward-strand and reverse-strand tag distributions.

[Figure: offset between the forward-strand and reverse-strand tag distributions]

It then shifts all of the tags in the 3′ direction (by d/2, half the estimated fragment size) so that they are over the binding site, rather than the ends of the DNA fragments:

[Figure: tags shifted onto the binding site]
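A minimal Python sketch of the estimate-and-shift step, assuming tags are stored as (position, strand) pairs and that the end bias (call it d) is taken as the distance between the most common forward-strand and reverse-strand positions in the alignment. The function names and the data layout are hypothetical, not anything from Macs itself.

```python
from collections import Counter


def estimate_d(forward_offsets, reverse_offsets):
    """Distance between the modes of the two strand distributions.

    Offsets are tag positions relative to the aligned region mid-points.
    """
    forward_mode = Counter(forward_offsets).most_common(1)[0][0]
    reverse_mode = Counter(reverse_offsets).most_common(1)[0][0]
    return reverse_mode - forward_mode


def shift_tags(tags, d):
    """Move every tag d/2 towards its 3' end.

    Forward-strand tags move right, reverse-strand tags move left,
    so both end up over the middle of the fragment.
    """
    half = d // 2
    shifted = []
    for position, strand in tags:
        if strand == "+":
            shifted.append((position + half, strand))
        else:
            shifted.append((position - half, strand))
    return shifted


# Toy alignment: forward tags cluster at -50, reverse tags at +50.
d = estimate_d([-50, -50, -48, -52, -50], [50, 50, 49, 51, 50])
shifted = shift_tags([(100, "+"), (200, "-")], d)
```

In the toy example the two tags sit 100 bp apart on opposite strands, and after shifting they land on the same position, which is the whole point of the correction.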

Once the tags have been corrected for end bias, Macs can then use them to determine the location of peaks of significant enrichment that correspond to binding sites (or locations of histone modification, or whatever you’re looking at).

To call peaks, Macs first scales the total control tag count to match the total sample tag count and removes duplicate tags in excess of those expected by chance (these duplicates typically arise from amplification biases). It then slides a window of width 2d, where d is the end-bias distance estimated above, across the genome to identify candidate peaks with a significant (user-defined p-value) tag enrichment over background.
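The two clean-up steps can be sketched in Python as below. Note the big simplification: I just cap the number of identical tags at a fixed `max_dup`, which is a made-up parameter for this example, whereas Macs works out how many duplicates are plausible by chance. The function names are mine too.

```python
from collections import Counter


def control_scale_factor(n_chip_tags, n_control_tags):
    """Factor that scales control counts to the ChIP sequencing depth."""
    return n_chip_tags / n_control_tags


def cap_duplicates(tags, max_dup=1):
    """Keep at most max_dup copies of each identical (position, strand) tag."""
    seen = Counter()
    kept = []
    for tag in tags:
        seen[tag] += 1
        if seen[tag] <= max_dup:
            kept.append(tag)
    return kept


# A control with twice as many tags as the ChIP sample is scaled by 0.5.
factor = control_scale_factor(2000, 4000)

# Three identical forward tags at position 10: only two survive with max_dup=2.
deduped = cap_duplicates([(10, "+"), (10, "+"), (10, "-"), (10, "+")], max_dup=2)
```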

Background is modeled with a Poisson distribution, which expresses the probability of a given number of events happening in a fixed interval (of time or, in this case, distance along the genome). It takes a single parameter, lambda (λ), the expected number of events in that interval.

$$P(X = k) = \frac{\lambda^{k} e^{-\lambda}}{k!}, \qquad k = 0, 1, 2, \ldots$$
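To see how a Poisson background turns into a p-value, here is a small pure-Python sketch. It computes the probability of seeing at least k tags in a window when λ tags are expected. The function names are my own, and I am assuming the enrichment test is this upper-tail probability; check the official docs for exactly what Macs computes.

```python
import math


def poisson_pmf(k, lam):
    """P(X = k) for a Poisson distribution with mean lam (lam > 0)."""
    # Work in log space so that large k does not overflow.
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))


def poisson_upper_tail(k, lam):
    """P(X >= k): chance of seeing k or more tags when lam are expected."""
    total = 0.0
    i = k
    while True:
        term = poisson_pmf(i, lam)
        total += term
        # Past the mean the terms only shrink, so stop once they stop mattering.
        if i > lam and term <= total * 1e-16:
            break
        i += 1
    return total


# 30 tags in a window where the background predicts 10.
p_value = poisson_upper_tail(30, 10.0)
```

Summing the tail directly, rather than taking one minus the lower tail, keeps very small p-values from being lost to rounding.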

The background level varies across the genome so Macs uses a local background, calculated from the control sample as:

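$$\lambda_{\text{local}} = \max(\lambda_{BG},\ [\lambda_{1k},]\ \lambda_{5k},\ \lambda_{10k})$$

where λBG is the background estimated from the whole genome, and λ1k, λ5k and λ10k are λ estimated from the 1 kb, 5 kb or 10 kb window centered at the peak. If there is no control sample, then the experimental sample is used to calculate the background, but the 1 kb window is not used (hence the square brackets around λ1k).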
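As a sketch of the local background, the Python below takes the largest of the genome-wide λ and the λ values from the 1 kb, 5 kb and 10 kb windows, skipping the 1 kb window when there is no control. Converting each window's tag count to an expected count for the candidate region by simple proportional scaling is my assumption, as are the function and argument names.

```python
def window_lambda(tag_count, window_bp, region_bp):
    """Expected tags in a region of region_bp, given tag_count tags in window_bp."""
    return tag_count * region_bp / window_bp


def local_lambda(lambda_bg, window_counts, region_bp, has_control=True):
    """Largest of the genome-wide and local background estimates.

    window_counts maps window size in bp (1000, 5000, 10000) to the number
    of background tags found in that window centred on the peak.
    """
    candidates = [lambda_bg]
    for window_bp, tag_count in window_counts.items():
        if window_bp == 1000 and not has_control:
            # Without a control the 1 kb window would mostly contain the
            # ChIP signal itself, so it is left out.
            continue
        candidates.append(window_lambda(tag_count, window_bp, region_bp))
    return max(candidates)


counts = {1000: 10, 5000: 20, 10000: 30}
with_control = local_lambda(0.5, counts, region_bp=200)
without_control = local_lambda(0.5, counts, region_bp=200, has_control=False)
```

Taking the maximum makes the test conservative: a candidate peak sitting in a locally noisy region has to beat the noisiest of the estimates.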

Overlapping enriched peaks are merged and the location with the highest pileup is predicted as the precise binding site.
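Merging and picking the highest point can be sketched like this in Python, assuming peaks are (start, end) intervals and the pileup is a dict from position to tag depth. Both of those representations, and the function names, are just choices for this example.

```python
def merge_overlapping(peaks):
    """Merge (start, end) intervals that overlap or touch."""
    merged = []
    for start, end in sorted(peaks):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous interval: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def highest_pileup(pileup, start, end):
    """Position in [start, end) with the deepest pileup (first one on ties)."""
    return max(range(start, end), key=lambda position: pileup.get(position, 0))


merged_peaks = merge_overlapping([(150, 300), (100, 200), (400, 500)])
binding_site = highest_pileup({120: 3, 180: 7, 250: 5}, 100, 300)
```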

If a control sample is being used, Macs calculates the FDR empirically. At each p-value, Macs uses the same parameters to find ChIP peaks over control and control peaks over ChIP (that is, a sample swap). The empirical FDR is defined as (number of control peaks) / (number of ChIP peaks).
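The sample-swap FDR is easy to write down in Python. The sketch below assumes you already have a p-value for every peak called in each direction (ChIP over control, and control over ChIP); the function name and the input lists are hypothetical.

```python
def empirical_fdr(chip_peak_pvalues, control_peak_pvalues, cutoff):
    """Control peaks / ChIP peaks among peaks passing the p-value cutoff.

    Returns None when no ChIP peak passes, since the ratio is undefined.
    """
    n_chip = sum(p <= cutoff for p in chip_peak_pvalues)
    n_control = sum(p <= cutoff for p in control_peak_pvalues)
    if n_chip == 0:
        return None
    return n_control / n_chip


chip_pvalues = [1e-10, 1e-8, 1e-6, 1e-5]
control_pvalues = [1e-6, 1e-3]  # peaks found after swapping the samples

fdr_loose = empirical_fdr(chip_pvalues, control_pvalues, cutoff=1e-5)
fdr_strict = empirical_fdr(chip_pvalues, control_pvalues, cutoff=1e-7)
```

Tightening the cutoff drops the one control peak, so the estimated FDR falls from 0.25 to 0.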
