Setting up Galaxy

At ISMB, the Galaxy guys were talking about using Galaxy as an interface for analysing NGS data. I’m having a go at getting it up and running on EC2. Notes are really for my own reference, but I thought I’d post them in case they were of use to anyone else.

AWS Setup

Obviously, you need an AWS EC2 account in order to get this working. Once you’ve got your account set up, install the AWS command line tools from here (ignore the negative reviews, they work fine on Linux). There are a few environment variables to set too: EC2_HOME points at the directory where the tools live, and you’ll need to add its bin subdirectory to your PATH. You’ll also need to tell the tools where your AWS private key and X.509 certificate files are (files that should have been generated during AWS registration but, if not, see the AWS x509 docs). Something like the following in your ~/.profile should do the trick:

[bash]
export EC2_HOME=/home/cassj/.ec2
export PATH=$PATH:$EC2_HOME/bin
export EC2_PRIVATE_KEY=/home/cassj/.ec2/amazon-pk.pem
export EC2_CERT=/home/cassj/.ec2/amazon-x509.pem
[/bash]

You’ll also need to create and register a key-pair to access your EC2 instances. The easiest way to do this is via the AWS management console – it’s fairly self-explanatory. Save the .pem file somewhere on your local machine (I’m using cassj.pem).
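One gotcha worth flagging: ssh will refuse to use a key file that other users can read, so lock down the permissions on the saved .pem. (The touch below just creates a stand-in file so the commands can be shown end-to-end; run the chmod on wherever you actually saved yours.)

[bash]
# Stand-in for the cassj.pem you saved from the management console.
touch cassj.pem
# ssh won't use a key with open permissions; owner read-only is safest.
chmod 400 cassj.pem
[/bash]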

Start Instance

Create a security group for the Galaxy server so that we can open the appropriate ports, then run an instance. I’m using the official Ubuntu Intrepid x86 server AMI as a base: ami-5059be39. I’m also using availability zone us-east-1b because that’s where the EBS volume with all my ChIP-seq data lives.

[bash]
ec2-add-group galaxy -d 'Group for Galaxy Server'
ec2-run-instances ami-5059be39 --region us-east-1 --availability-zone us-east-1b --key cassj --group galaxy --instance-type m1.small --instance-count 1
[/bash]

Connect to instance

Open up the ssh port. (I’m just opening it to everyone; alternatively, you can restrict the source IP addresses using CIDR notation.)

[bash]
ec2-authorize galaxy -Ptcp -p22 -s 0.0.0.0/0
[/bash]

Run ec2din to check your instance is running and get its address, then ssh in using your keypair, something like:
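ec2din prints whitespace-separated records, and on the INSTANCE line the public DNS name is (as far as I can tell) the fourth field, so you can pull it out with awk rather than eyeballing it. The sample line below stands in for real ec2din output:

[bash]
# Sample INSTANCE line, standing in for real ec2-describe-instances output.
sample='INSTANCE i-12345678 ami-5059be39 ec2-174-129-174-125.compute-1.amazonaws.com ip-10-251-1-1.ec2.internal running cassj'
# Field 4 of the INSTANCE record is the public DNS name.
echo "$sample" | awk '$1 == "INSTANCE" { print $4 }'
# Against a live instance: ec2din | awk '$1 == "INSTANCE" { print $4 }'
[/bash]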

[bash]
ssh -i cassj.pem ubuntu@ec2-174-129-174-125.compute-1.amazonaws.com
[/bash]

Install Galaxy

Install Galaxy on your running instance. The following will grab the latest version from the repository and stick it in /galaxy.

[bash]
sudo apt-get install mercurial
cd /
sudo hg clone http://www.bx.psu.edu/hg/galaxy galaxy
sudo chown -R ubuntu:ubuntu galaxy
cd galaxy
sudo sh setup.sh
sh run.sh
[/bash]

Then modify the file universe_wsgi.ini so that the host is set to your instance’s public DNS name, e.g.

[bash]
host = ec2-174-129-166-230.compute-1.amazonaws.com
[/bash]
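If you’d rather script that edit than open the file by hand, a one-line sed does it. The real file lives in /galaxy; the snippet below works on a stand-in copy with Galaxy’s default host setting, just to show the invocation end-to-end:

[bash]
# Stand-in copy of universe_wsgi.ini with a default host setting.
printf 'host = 127.0.0.1\nport = 8080\n' > universe_wsgi.ini
# Point the host at the instance's public DNS name (substitute your own).
sed -i 's/^host = .*/host = ec2-174-129-166-230.compute-1.amazonaws.com/' universe_wsgi.ini
grep '^host' universe_wsgi.ini
[/bash]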

Well, that was easy. You seem to need to run run.sh as root initially, but after that it seems to be ok if you run as user ubuntu.

Install Apache for static files

By default Galaxy runs on port 8080. We’ll set up Apache on port 80, have it serve requests for static files directly (to take that load off the Galaxy process), and have it proxy everything else through to Galaxy. So, install Apache2 and enable mod_rewrite, mod_proxy and mod_proxy_http:

[bash]
sudo apt-get install apache2
sudo a2enmod rewrite
sudo a2enmod proxy
sudo a2enmod proxy_http
[/bash]

In /etc/apache2/sites-available/default, this rule proxies anything handed to Apache through to Galaxy:

[xml]
<IfModule mod_rewrite.c>
RewriteEngine On
RewriteRule ^/(.*) http://ec2-174-129-166-230.compute-1.amazonaws.com:8080/$1 [P]
[/xml]

And these rules handle the limited set of static files that we want Apache to serve itself (they need to sit before the catch-all proxy rule above, which would otherwise swallow every request):

[xml]
RewriteRule ^/static/style/(.*) /galaxy/test/static/june_2007_style/blue/$1 [L]
RewriteRule ^/static/(.*) /galaxy/test/static/$1 [L]
RewriteRule ^/images/(.*) /galaxy/test/static/images/$1 [L]
RewriteRule ^/favicon.ico /galaxy/test/static/favicon.ico [L]
RewriteRule ^/robots.txt /galaxy/test/static/robots.txt [L]
</IfModule>
[/xml]
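Assembled into one piece, the whole VirtualHost might look something like the following. Note the static [L] rules come before the catch-all [P] rule; the hostname and the /galaxy/test paths are the ones from above, so substitute your own:

[xml]
<VirtualHost *:80>
  <IfModule mod_rewrite.c>
    RewriteEngine On
    # Static files served directly by Apache
    RewriteRule ^/static/style/(.*) /galaxy/test/static/june_2007_style/blue/$1 [L]
    RewriteRule ^/static/(.*) /galaxy/test/static/$1 [L]
    RewriteRule ^/images/(.*) /galaxy/test/static/images/$1 [L]
    RewriteRule ^/favicon.ico /galaxy/test/static/favicon.ico [L]
    RewriteRule ^/robots.txt /galaxy/test/static/robots.txt [L]
    # Everything else is proxied through to the Galaxy process on 8080
    RewriteRule ^/(.*) http://ec2-174-129-166-230.compute-1.amazonaws.com:8080/$1 [P]
  </IfModule>
</VirtualHost>
[/xml]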

More info on installing Galaxy can be found on the wiki.

Restart apache with sudo /etc/init.d/apache2 restart.

Authorize Apache and Galaxy Ports

[bash]
ec2-authorize galaxy -Ptcp -p8080 -s 0.0.0.0/0
ec2-authorize galaxy -Ptcp -p80 -s 0.0.0.0/0
[/bash]

Now if you go to http://<Your AWS URL> you should see your Galaxy installation.
It’s not going to be totally functional because we haven’t installed all of the underlying bioinformatics binaries, but my plan is to have separate instances doing the actual analysis anyway. That’s tomorrow’s problem though…


2 responses to “Setting up Galaxy”

  1. I remember we were talking about the potential of AWS at that Perl night… You said you were worried about the sheer data transfer/storage costs for chipseq data. Did it turn out to be fairly cost effective after all?

  2. They gave us a grant, so it’s currently free. I need to work out how we’re going to use it in the long term though.

    For doing data analysis, it’s much easier than booking time on a cluster at work, as I can install exactly what I want without having to go through the cluster admin and I don’t have to wait in line. Unless we start doing a lot more data analysis, I think it’s still going to work out much cheaper than buying in our own hardware and it’s certainly going to be less hassle. I’m envisaging just running instances of the Galaxy and analysis AMIs when we’re actually doing the alignment, peak finding and so on. I’ll probably keep the short read data as BAM or BioHDF on S3 so people can get at them and run their own analysis on EC2 if they want to.

    I also need to set up a LIMS-type server for managing and sharing our data. Probably with some kinda REST API so other stuff (network building tools etc.) can query the data (eg. “What experiments have you done that involve NRSF?” as RDF, or “Give me all the features between these chr co-ords for NRSF binding in NS5 cell lines” as BED or SAM or something). We can host this kind of thing at work and it doesn’t really need to scale much. So I imagine I’ll have the analysis pipeline hand me back a list of peaks (genome pos, plus score, p-value etc.) and some metadata describing the analysis workflow, and dump that into my LIMS thing.

    I’ll bring my laptop to the biogeeks thing if you want to have a play with the AWS stuff.

    I guess I’ll probably just stick the short read data onto EC2 for analysis and have it spit out the binding peaks and analysis metadata back to me.
