Tag Archives: Bioinformatics

Dockerised BRAT annotation tool

Been playing with Docker this week. First attempt is a Dockerised version of the brat annotation tool (brat.nlplab.org).

Dockerfile and config bits are on GitHub: https://github.com/cassj/brat-docker

The image is on Docker Hub, so you can just run it, filling in your own username, password and email:


docker run -d -p 8000:80 \
-e BRAT_USERNAME=your_username \
-e BRAT_PASSWORD=your_password \
-e BRAT_EMAIL=you@example.com \
cassj/brat

Next generation sequencing information management and analysis system for Galaxy « Blue Collar Bioinformatics

HowTo: Get a list of all species in Ensembl

From the ensembl-dev mailing list: advice on how to get the current list of all Ensembl species.

The list lives in the compara database and can be retrieved directly with something like:


mysql -u anonymous -h ensembldb.ensembl.org -P 5306 ensembl_compara_52 \
-e "SELECT name FROM genome_db";

Or you can do essentially the same query via the Perl API:


use strict;
use warnings;
use Bio::EnsEMBL::Registry;

# Connect to the public Ensembl server and load the available databases.
my $reg = "Bio::EnsEMBL::Registry";
$reg->load_registry_from_db(
    -host => 'ensembldb.ensembl.org',
    -user => 'anonymous'
);

# Fetch every genome known to compara and print its name.
my $genomes = $reg->get_adaptor("Multi", "compara", "GenomeDB")->fetch_all;
print $_->name."\n" foreach (@$genomes);

# or in asciibetical order:
print $_."\n" foreach (sort {$a cmp $b} map {$_->name} @$genomes );

If you can’t / don’t want to use compara, you can achieve the same thing by retrieving all of the core DBAdaptors and asking them what species they are:


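use strict;
use warnings;
use Bio::EnsEMBL::Registry;

# Connect to the public Ensembl server and load the available databases.
my $registry = 'Bio::EnsEMBL::Registry';
$registry->load_registry_from_db(
    -host => 'ensembldb.ensembl.org',
    -user => 'anonymous'
);

# Ask each core database adaptor which species it belongs to.
my @db_adaptors = @{ $registry->get_all_DBAdaptors(-GROUP=>'core') };
print $_->species() ."\n" foreach (@db_adaptors);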

biomaRt mapping from human Illumina ID to mouse Ensembl ID

Problem:

Dataset 1: ChIPseq-derived transcription factor binding sites. Mouse. Mapped to nearest Ensembl Gene ID.

Dataset 2: Human Illumina Ref6 expression array data (GPL6097, I think) from various cell lines with varying amounts of said transcription factor.

Question:

What are the targets of the transcription factor doing in the expression datasets?

Quick Mapping

As a quick approximation, map the Illumina human IDs to their human Ensembl IDs, then grab the Ensembl IDs of the homologous mouse genes and filter to include only those annotated as nearest to a binding site. After that it’s easy to pull out the expression of the binding targets (listing the myriad reasons why the results could well be biologically meaningless is left as an exercise for the reader…).
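
To make the chain of lookups concrete, here is a rough sketch on toy data using nothing but awk. Everything in it is made up for illustration: the file names, the IDs and the expression values are hypothetical, and the real mapping tables would of course come from Ensembl rather than from printf.

```shell
#!/bin/sh
# Toy illustration of the mapping chain; all IDs and values are invented.
tmp=$(mktemp -d)

# Illumina probe -> human Ensembl gene
printf 'ILMN_1\tENSG_A\nILMN_2\tENSG_B\nILMN_3\tENSG_C\n' > "$tmp/probe2human.tsv"
# human Ensembl gene -> homologous mouse Ensembl gene
printf 'ENSG_A\tENSMUSG_A\nENSG_B\tENSMUSG_B\nENSG_C\tENSMUSG_C\n' > "$tmp/human2mouse.tsv"
# mouse genes annotated as nearest to a binding site
printf 'ENSMUSG_A\nENSMUSG_C\n' > "$tmp/targets.txt"
# expression values: probe ID, then one column per cell line
printf 'ILMN_1\t5.1\t7.2\nILMN_2\t6.3\t8.4\nILMN_3\t9.5\t4.6\n' > "$tmp/expr.tsv"

# Load the three lookup tables, then keep only the expression rows whose
# probe maps (via human gene, then mouse homolog) to a binding-site target.
result=$(awk -F'\t' '
  FILENAME == ARGV[1] { target[$1] = 1; next }    # binding-site targets
  FILENAME == ARGV[2] { mouse[$1] = $2; next }    # human gene to mouse gene
  FILENAME == ARGV[3] { human[$1] = $2; next }    # probe to human gene
  { m = mouse[human[$1]]; if (m in target) print m "\t" $0 }
' "$tmp/targets.txt" "$tmp/human2mouse.tsv" "$tmp/probe2human.tsv" "$tmp/expr.tsv")

printf '%s\n' "$result"
rm -rf "$tmp"
```

This prints the expression rows for ILMN_1 and ILMN_3, each prefixed with its mouse gene ID, and drops ILMN_2 because its mouse homolog is not in the target list. The sketch assumes one-to-one mappings at every step, which real probe and homology data will not give you.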

Reason I ❤ biomaRt


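library(biomaRt)

# Assumes two character vectors are already defined:
#   human.illumina.ids             - probe IDs from the expression array
#   tfbs.nearest.mouse.ensembl.ids - mouse Ensembl gene IDs nearest to a binding site
ensembl.human = useMart("ensembl", dataset="hsapiens_gene_ensembl")
ensembl.mouse = useMart("ensembl", dataset="mmusculus_gene_ensembl")

# Linked query: human probes -> human genes -> homologous mouse genes,
# restricted to the mouse genes in the binding-site list.
homologs <- getLDS(
  attributes=c('illumina_v1', 'ensembl_gene_id', 'chromosome_name'),
  filters='illumina_v1',
  values=human.illumina.ids,
  mart=ensembl.human,
  attributesL='ensembl_gene_id',
  filtersL='ensembl_gene_id',
  valuesL=tfbs.nearest.mouse.ensembl.ids,
  martL=ensembl.mouse,
  uniqueRows=TRUE
)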