These are just my notes and may not be entirely accurate. Feel free to correct me in the comments.
Using RNAseq for:
quantification of messages
discovery and mapping of new messages
novel splice isoforms
improving gene models, better identification of promoters, UTRs etc.
identification of RNA editing
Allele specific expression
Main talk content – RNA editing. A bit on ChIPseq in muscle differentiation at the end.
Brief mention of the difficulties caused by the large (multi-log) differences in concentrations of different transcripts. Think it was implied that they had ideas on how to deal with this, but no details given yet.
RPKM: “Reads Per Kilobase of message per Million mapped reads”.
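A minimal sketch of the RPKM arithmetic, for my own reference. The function and argument names are mine, not anything shown in the talk, and it assumes you already have a read count per transcript and the total number of mapped reads in the library.

```python
def rpkm(read_count, transcript_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    if transcript_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("length and library size must be positive")
    kilobases = transcript_length_bp / 1_000          # normalise for transcript length
    millions = total_mapped_reads / 1_000_000         # normalise for sequencing depth
    return read_count / (kilobases * millions)


# Hypothetical numbers: 500 reads on a 2 kb message in a library of 10 million mapped reads.
print(rpkm(500, 2_000, 10_000_000))
```

Dividing by both length and depth is what lets you compare a long message with a short one, or one lane with another.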
Currently using 75bp reads from Illumina GA paired reads, older data down to 25bp.
Not sure I picked this up correctly, but I think their approach to identifying RNA editing is to deep sequence the genome of a cell line and call SNPs, deep sequence the transcriptome of the same cell line and call SNPs there too, then take the SNPs which disagree between the genome and the transcriptome as candidate RNA editing sites.
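If I understood the approach, the core comparison is a set difference. This is only a sketch under that assumption: the tuple layout `(chrom, pos, ref, alt)` and the function name are mine, the example sites are made up, and it ignores the obvious caveat that a site needs decent genomic coverage before "not seen in the genome" means anything.

```python
def candidate_editing_sites(rna_snps, genome_snps):
    """Return variants called in the transcriptome but not in the genome.

    Each variant is a (chrom, pos, ref, alt) tuple.
    """
    return sorted(set(rna_snps) - set(genome_snps))


# Made-up calls for illustration only.
genome_snps = {("chr1", 1000, "A", "G"), ("chr2", 500, "C", "T")}
rna_snps = {("chr1", 1000, "A", "G"), ("chr3", 42, "A", "G"), ("chr4", 7, "T", "C")}

for site in candidate_editing_sites(rna_snps, genome_snps):
    print(site)
```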
They compared read lengths (2x75bp vs 1x32bp) in the ENCODE tier 1 cell lines K562 (a BCR-ABL positive erythroleukaemia line) and GM12878 (an EBV transformed lymphoblastoid cell line) and found that the longer reads are, unsurprisingly, better at identifying alternatively spliced isoforms of messages. Also, a lot of the noise you see from the shorter reads is cleared up using the longer ones, presumably because ambiguous genome/transcriptome mappings are disambiguated.
ADAR: “Adenosine Deaminase Acting on RNA”. Catalyses A->Inosine in the transcript, which is then read as G during reverse transcription and amplification for sequencing. (I think?!)
Important function – mouse ADAR K/O dies in mid-gestation and haematopoiesis fails.
ADAR and I-modified RNA expression levels are tissue specific – lots in brain.
Used ERANGE to call expressed SNPs in Human ES Cells.
Require >=4 uniquely mappable and non-identical (start/stop) reads to agree on mismatch identity.
Require >=25% unique reads covering the position support the SNP.
Only use reads with at least 70/75 positions mapping exactly in the initial alignment (…I think)
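To make the three thresholds concrete for myself, here is a rough sketch of how they could be combined. This is not ERANGE code. The `Read` class, its fields and the function name are all hypothetical, and treating "non-identical" as "distinct start/stop coordinates" is my reading of the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Read:
    start: int                 # alignment start
    end: int                   # alignment end
    base_at_site: str          # base this read shows at the candidate position
    matched_positions: int     # positions matching the reference in the initial alignment
    uniquely_mapped: bool = True


def passes_snp_filter(reads, alt_base, min_support=4, min_fraction=0.25, min_matched=70):
    """Apply the three criteria from the talk to the reads covering one position."""
    # Keep uniquely mapped reads with at least 70 of 75 positions matching.
    usable = [r for r in reads if r.uniquely_mapped and r.matched_positions >= min_matched]
    # Collapse reads sharing start, stop and base so duplicates count once.
    distinct = {(r.start, r.end, r.base_at_site) for r in usable}
    if not distinct:
        return False
    supporting = sum(1 for _, _, base in distinct if base == alt_base)
    return supporting >= min_support and supporting / len(distinct) >= min_fraction


# Made-up pile-up: four distinct reads showing G, three showing the reference A.
reads = [Read(s, s + 75, "G", 74) for s in (100, 110, 120, 130)]
reads += [Read(s, s + 75, "A", 75) for s in (90, 105, 115)]
print(passes_snp_filter(reads, "G"))
```

Collapsing on start/stop is there so that PCR duplicates of a single molecule cannot satisfy the four-read requirement on their own.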
Don’t have deep sequencing of genome for ES cells, so from the SNPs, use the asymmetry of A->G vs G->A to indicate RNA editing?
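A sketch of what that asymmetry check might look like, assuming you have a list of (reference base, observed base) pairs for the expressed SNP calls. The function name is mine, and the working assumption (my guess at the reasoning) is that ordinary genomic SNPs would show A->G and G->A at broadly similar rates, so a clear excess of A->G hints at editing. It also ignores strand, where an edit on the opposite strand would show up as T->C.

```python
from collections import Counter


def a_to_g_excess(snps):
    """Count A>G and G>A calls and return (a_to_g, g_to_a, ratio)."""
    counts = Counter(f"{ref}>{alt}" for ref, alt in snps)
    a_to_g = counts["A>G"]
    g_to_a = counts["G>A"]
    ratio = a_to_g / g_to_a if g_to_a else float("inf")
    return a_to_g, g_to_a, ratio


# Made-up calls for illustration only.
calls = [("A", "G")] * 3 + [("G", "A")] + [("C", "T")] * 2
print(a_to_g_excess(calls))
```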
At this point, I got a bit lost: Compare RNA only SNPs with GM line genomic DNA SNPs also in RNA. Most detected in genome, of remainder some in dbSNP, others (~20-%) are new. Looking at the stuff with multiple events per gene as they occur in clusters?
VISA gene – lots of edits in the 3' UTR in repeats – good place for ADAR action as repeats can fold into the 3D structures needed for ADAR targeting.
Stuff ADAR editing is targeting: miRNA genes. RNAs encoding protein regulators of the interferon response. Regulators of apoptosis/survival (might have misheard this, bit rushed.)
Editing pri-miRNA to affect target specificity. Also feasible that targets are being edited to control their miRNA-ability.
RNA editing overlap between ES & GM (Bcell) line. Some overlap, lots of cell line specificity. Less from editing specificity than from differences in gene expression between the lines.
Validation still required.
Sensitivity of edit detection: level of expression, depth of sequencing, mappability of regions.
Muscle differentiation in dish. Myoblast->Myotube. Various transcription factors including MyoD, Myogenin.
Again, this was quite quick, so not sure I got the details right, but the gist was that they can differentiate muscles in a dish and they’re looking at the binding of various TFs (MyoD, Myogenin etc), RNA pol, various histone modifications, CTCF insulator, REST/NRSF during the differentiation process.
ChIPseq Myogenin: 14712 high confidence sites (defined by?)
For motif: CAGSTG (Note to self Ebox. Binds with E2A. Same as that for Mash1 (E47). Overlap in targets?) there are 2.2 million motif matches in the genome and 0.5% occupied.
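For reference, S in the motif is the IUPAC code for C or G, so CAGSTG covers CAGCTG and CAGGTG, and on the opposite strand CAGGTG reads as CACCTG. A rough sketch of scanning a sequence for these, plus the occupancy arithmetic; the function names and the test sequence are mine, and I don't know whether the 2.2 million figure counted one strand or both.

```python
import re

# CAGSTG on the forward strand, plus CACCTG (reverse complement of CAGGTG).
# CAGCTG is its own reverse complement, so it is only listed once.
# The lookahead allows overlapping matches to be reported.
EBOX = re.compile(r"(?=(CAG[CG]TG|CACCTG))")


def find_ebox_sites(sequence):
    """Return (0-based start, matched hexamer) for every E-box match."""
    return [(m.start(), m.group(1)) for m in EBOX.finditer(sequence.upper())]


def occupied_sites(total_matches, occupied_fraction):
    """Number of motif matches that are actually bound."""
    return round(total_matches * occupied_fraction)


print(find_ebox_sites("TTCAGCTGAACACCTGGGCAGGTGA"))
print(occupied_sites(2_200_000, 0.005))
```

So 0.5% of 2.2 million matches works out at about 11,000 occupied motifs.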
Known muscle enhancers have MyoD and Myogenin sites and most have Mef2 and p300, which is as you’d expect, but as you move out only 20Kb from the known enhancer regions, you still get the same number of muscle-specific genes with occupancy, but you also start to get a lot of non-muscle genes having occupancy.
All well studied positive functional sites confirm. Well studied negatives confirm too. But there are also >10000 equally prominent binding events that are not adjacent to genes with classic muscle expression. Combinations of factors are also often not close to a gene with the expected pattern. Conservation enriches a bit for muscle-specific genes, but not much.
Sort genes into expression pattern groups: genes that aren’t expressed don’t have nearly as much binding as anything else, but among the expressed groups (undiff, diff muscle, other stuff) it doesn’t make all that much difference.
Integration of everything (didn’t get the details here at all – anyone care to fill in the gaps?):
Lay all (TFs, histone mods etc) on genome. Segment the genome – enriched regions define segment boundaries, Min seg size=40bp. Now take all 39 datasets across the segmented genome and analyse resulting segmentation densities in some way that involves PCA (erm…gap.). This approach seems to be better at separating the muscle-specific stuff.
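I can at least sketch the segmentation step as I understood it: pool the enriched regions from every dataset, use their edges as cut points, and don't allow segments shorter than 40bp. Everything here is my own guess at the mechanics. In particular, the function name is mine, and dropping a boundary that would create an undersized segment is an assumption, since how small segments were handled wasn't stated. The PCA step isn't covered because I didn't catch what it was run on.

```python
def segment_genome(enriched_regions, chrom_length, min_size=40):
    """Split [0, chrom_length) at the edges of enriched regions.

    enriched_regions: iterable of (start, end) intervals pooled from all datasets.
    Returns a list of (start, end) segments, each at least min_size long
    (unless the whole chromosome is shorter than min_size).
    """
    cuts = {0, chrom_length}
    for start, end in enriched_regions:
        cuts.add(max(0, start))
        cuts.add(min(chrom_length, end))
    ordered = sorted(cuts)

    kept = [ordered[0]]
    for cut in ordered[1:-1]:
        # Assumption: skip any boundary that would make a segment under min_size.
        if cut - kept[-1] >= min_size:
            kept.append(cut)
    # Make sure the final segment is not undersized either.
    while len(kept) > 1 and ordered[-1] - kept[-1] < min_size:
        kept.pop()
    kept.append(ordered[-1])
    return list(zip(kept[:-1], kept[1:]))


# Made-up enriched regions on a 1 kb toy chromosome.
print(segment_genome([(100, 200), (150, 220), (210, 500)], 1000))
```

Each dataset's signal could then be summarised per segment, giving a segments-by-datasets table, which is presumably the kind of thing the PCA was applied to.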
Needed: 3C-ChIP measurements of true long distance interactions genome wide (Someone at GIS doing this?)
‘easy rider’ hypothesis: positive acting factors play their specific role at muscle specific genes (~2000) and while they’re at it, they encourage expression of loads of other stuff (~10000) (not specific? Not Ebox?)
Or maybe the gating for transcription is much more combinatoric than yet measured? Maybe we’re just not measuring enough.
Or maybe more cleverness is happening at the promoter than we thought – perhaps only some promoters can respond to the binding?
Lots of differences between mouse & human targets.