My notes on how Macs works. Probably riddled with errors – check out the links below for official docs:
We’ve used ChIP to grab bits of DNA that correspond to the region bound by the transcription factor (or marked by the histone modification, or whatever). So the data looks something like this:
Except that the reads are not full length: only the ends of the DNA fragments are sequenced, giving a biased situation more like:
Macs deals with this problem by sliding a window (twice the user-defined bandwidth wide) along the genome to locate a set of reliably enriched tags (those showing a user-defined fold enrichment, mfold, relative to the background model).
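A rough sketch of that window scan, just to make the idea concrete. It assumes a uniform genome-wide background and a step of one bandwidth between windows; both are simplifications of mine, and the function and argument names are made up rather than taken from Macs:

```python
import bisect

def enriched_windows(positions, genome_size, bandwidth, mfold):
    """Return (start, end, count) for windows with >= mfold enrichment.

    Assumes a uniform background: the expected count per window is the
    genome-wide tag density multiplied by the window width.
    """
    width = 2 * bandwidth
    expected = len(positions) * width / genome_size
    positions = sorted(positions)
    hits = []
    # Step by one bandwidth so consecutive windows overlap by half.
    for start in range(0, genome_size - width + 1, bandwidth):
        lo = bisect.bisect_left(positions, start)
        hi = bisect.bisect_left(positions, start + width)
        count = hi - lo
        if count >= mfold * expected:
            hits.append((start, start + width, count))
    return hits

tags = [1050, 1060, 1070, 1080, 1090, 5000]
windows = enriched_windows(tags, genome_size=10000, bandwidth=100, mfold=10)
```

Here the cluster of five tags near position 1050 is picked up by the two windows that cover it, while the lone tag at 5000 is not.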
Macs then selects 1000 of these reliable tags at random and aligns them by their midpoints:
Macs uses this alignment to estimate the end bias:
And shifts all of the tags in the 3′ direction so that they are over the binding site, rather than the ends of the DNA fragments:
Once the tags have been corrected for end bias, Macs can then use them to determine the location of peaks of significant enrichment that correspond to binding sites (or locations of histone modification, or whatever you’re looking at).
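The end-bias correction described above can be sketched roughly as follows. This assumes the bias is summarized as a single distance d between the most common forward-strand and reverse-strand offsets, and that each tag is then moved d/2 toward its 3′ end, as the MACS paper describes. The function names and the crude "most common offset" estimate are my own simplifications, not Macs's actual code:

```python
from collections import Counter

def estimate_d(plus_offsets, minus_offsets):
    """Estimate d as the gap between the most common + and - strand offsets.

    Offsets are tag positions relative to the aligned midpoints.
    """
    plus_mode = Counter(plus_offsets).most_common(1)[0][0]
    minus_mode = Counter(minus_offsets).most_common(1)[0][0]
    return minus_mode - plus_mode

def shift_tags(tags, d):
    """Shift each (position, strand) tag d // 2 bases toward its 3' end."""
    half = d // 2
    shifted = []
    for pos, strand in tags:
        # 3' means larger coordinates on '+', smaller coordinates on '-'.
        shifted.append(pos + half if strand == '+' else pos - half)
    return shifted

d = estimate_d([-60, -50, -50, -40], [40, 50, 50, 60])
centered = shift_tags([(100, '+'), (300, '-')], 200)
```

In the toy call, a forward tag at 100 and a reverse tag at 300 from the same 200 bp fragment both end up at 200, the middle of the fragment.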
Scale the total control tag count to the total sample tag count and remove duplicate tags in excess of those expected by chance (these duplicates typically arise from amplification biases). Now slide a window of width 2d (where d is the fragment length estimated from the end bias) across the genome to identify candidate peaks with a significant (user-defined p-value) tag enrichment over background.
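A small sketch of the two clean-up steps, scaling and duplicate removal. The cap on duplicates (max_dup) is a hypothetical parameter here; Macs works out how many duplicates to tolerate itself, and I have not reproduced that calculation:

```python
from collections import Counter

def scale_factor(n_sample_tags, n_control_tags):
    """Factor to multiply control counts by so the totals match the sample."""
    return n_sample_tags / n_control_tags

def cap_duplicates(tags, max_dup=1):
    """Keep at most max_dup copies of each identical (position, strand) tag."""
    seen = Counter()
    kept = []
    for tag in tags:
        if seen[tag] < max_dup:
            kept.append(tag)
        seen[tag] += 1
    return kept

factor = scale_factor(1000, 2000)
deduped = cap_duplicates([(10, '+'), (10, '+'), (10, '-'), (20, '+')])
```

Note that tags at the same position but on opposite strands are treated as different tags, so only the repeated (10, '+') is dropped.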
Background is modeled with a Poisson distribution, which gives the probability of a given number of events happening in a fixed interval (of time, or in this case distance along the genome). It takes a single parameter, lambda (λ), the expected number of events in that interval.
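To make the Poisson test concrete, here is a minimal sketch of the upper-tail probability, i.e. the chance of seeing k or more tags in a window when the background predicts lam on average. The function name is mine, and it uses the simple "one minus the lower tail" form, which loses precision for very small p-values:

```python
import math

def poisson_upper_tail(k, lam):
    """P(X >= k) for X ~ Poisson(lam)."""
    if k <= 0:
        return 1.0
    term = math.exp(-lam)  # P(X = 0)
    lower = 0.0
    for i in range(k):
        lower += term          # add P(X = i)
        term *= lam / (i + 1)  # step to P(X = i + 1)
    return max(0.0, 1.0 - lower)

p = poisson_upper_tail(3, 1.0)
```

A window would be called enriched when this probability falls below the user-defined p-value cutoff.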
The background level varies across the genome, so Macs uses a local background, calculated from the control sample as:
λlocal = max(λBG, [λ1k,] λ5k, λ10k)
where λBG is the genome-wide background estimate and λ1k, λ5k and λ10k are λ estimated from the 1 kb, 5 kb or 10 kb window centered at the peak. If there is no control sample, then the experimental sample is used to calculate the background, but the 1 kb window is not used.
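A minimal sketch of picking the local background, assuming (as in the MACS paper) that it is the largest of the genome-wide estimate and the windowed estimates. The function name and arguments are made up for illustration; leaving lam_1k out mimics the no-control case:

```python
def local_lambda(lam_bg, lam_5k, lam_10k, lam_1k=None):
    """Return the most conservative (largest) background estimate.

    lam_1k is optional: it is skipped when there is no control sample.
    """
    candidates = [lam_bg, lam_5k, lam_10k]
    if lam_1k is not None:
        candidates.append(lam_1k)
    return max(candidates)

with_control = local_lambda(0.5, 1.2, 0.9, lam_1k=2.0)
without_control = local_lambda(0.5, 1.2, 0.9)
```

Taking the maximum means a locally noisy region raises the bar for calling a peak there.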
Overlapping enriched peaks are merged and the location with the highest pileup is predicted as the precise binding site.
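The merge-and-summit step might look something like this. It assumes peaks are half-open (start, end) intervals and that the pileup is available as a position-to-count mapping; both representations are my own choice for the sketch:

```python
def merge_peaks(peaks):
    """Merge overlapping (start, end) intervals into single peaks."""
    merged = []
    for start, end in sorted(peaks):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous peak: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(p) for p in merged]

def summit(pileup, start, end):
    """Position in [start, end) with the highest pileup (first one on ties)."""
    return max(range(start, end), key=lambda pos: pileup.get(pos, 0))

peaks = merge_peaks([(100, 200), (150, 300), (500, 600)])
best = summit({120: 3, 180: 7, 250: 5}, 100, 300)
```

The first two candidate peaks overlap and become one peak from 100 to 300, and its predicted binding site is position 180, where the pileup is highest.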
If a control sample is being used, Macs calculates the FDR empirically. At each p-value, Macs uses the same parameters to find ChIP peaks over control and control peaks over ChIP (that is, a sample swap). The empirical FDR is defined as Number of control peaks / Number of ChIP peaks.
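As a sketch of the sample-swap FDR, assume we already have a p-value for every ChIP-over-control peak and every control-over-ChIP peak. The list inputs and function name are hypothetical; Macs does this bookkeeping internally:

```python
def empirical_fdr(chip_pvalues, control_pvalues, threshold):
    """Control peaks / ChIP peaks among peaks passing the p-value threshold.

    Returns None when no ChIP peak passes, since the ratio is undefined.
    """
    n_chip = sum(1 for p in chip_pvalues if p <= threshold)
    n_control = sum(1 for p in control_pvalues if p <= threshold)
    if n_chip == 0:
        return None
    return n_control / n_chip

fdr = empirical_fdr([1e-8, 1e-6, 1e-5, 1e-3], [1e-5, 1e-2], 1e-5)
```

In this toy case three ChIP peaks and one control peak pass the cutoff, so roughly one in three of the ChIP peaks at that threshold would be expected to be false.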