Computational Lab: 2-4-11
- Start with the files acghTable.dat and probeList.dat.
Using the probeIDs as unique identifiers, match each probe's value to
its position and transform the data for sample "0001" into lff format.
- Take the scores from column "0001" and insert them into the 'score' field of your lff.
- Filter out any probes with no value (or "NA").
- Use "+" for each strand value.
- The phase, qstart, and qstop fields should be filled in with dots (".").
- Create an attribute-value pair in the 13th column called "sample" with the value "0001".
- deliverables: the ruby script you use to create the lff file
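As a starting point for step 1, the join-and-filter logic might be sketched as below. The layouts of both input files are assumptions (a tab-delimited acghTable.dat with a header row of "probeID" followed by sample names, and a tab-delimited probeList.dat with probeID, chromosome, start, stop and no header), so check them against the real files first. The class, type, and subtype values ("aCGH", "acgh", and the sample name) are placeholders, and the method names load_positions and to_lff are made up for this illustration.

```ruby
# Sketch only: the column layouts below are ASSUMPTIONS, not taken from the lab files.
# Assumed acghTable.dat: tab-delimited, header row "probeID<TAB>0001<TAB>...".
# Assumed probeList.dat: tab-delimited, no header: probeID, chrom, start, stop.

SAMPLE = "0001"

# Build a probeID => [chrom, start, stop] lookup from the probe-list lines.
def load_positions(lines)
  positions = {}
  lines.each do |line|
    id, chrom, start, stop = line.chomp.split("\t")
    next if id.nil? || stop.nil?   # skip blank or short lines
    positions[id] = [chrom, start, stop]
  end
  positions
end

# Turn the table lines into tab-delimited lff records for one sample.
def to_lff(table_lines, positions, sample = SAMPLE)
  header = table_lines.first.chomp.split("\t")
  col = header.index(sample)
  raise ArgumentError, "no column named #{sample}" if col.nil?

  records = table_lines.drop(1).map do |line|
    fields = line.chomp.split("\t")
    id = fields[0]
    score = fields[col]
    # Drop probes with no value or "NA", and probes with no known position.
    next if score.nil? || score.strip.empty? || score.strip == "NA"
    next unless positions.key?(id)

    chrom, start, stop = positions[id]
    # class, name, type, subtype, chrom, start, stop, strand, phase,
    # score, qstart, qstop, attributes (13th column)
    ["aCGH", id, "acgh", sample, chrom, start, stop,
     "+", ".", score.strip, ".", ".", "sample=#{sample};"].join("\t")
  end
  records.compact
end

# Only touches the file system when both paths are given on the command line.
if ARGV.length == 2
  positions = load_positions(File.readlines(ARGV[1]))
  puts to_lff(File.readlines(ARGV[0]), positions)
end
```

If the script were saved as make_lff.rb (a hypothetical name), it could be run as `ruby make_lff.rb acghTable.dat probeList.dat > sample0001.lff`.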
- Upload this data into Genboree (you may either use the API or
upload it manually). Then, use the Segmentation tool (under Tools >
Plugins) to select regions of high copy-number variance. Require that
each segment contain at least 3 probes, and that the score of each
segment exceed two standard deviations from the mean.
- Combine the resulting track with the segmented data from the other 185 tumors (all185.acgh.lff.gz), then use the Annotation selector tool to select only those segments that represent gains on chromosome 12.
- note: you don't have to unzip the file before uploading to Genboree
- Now, upload the file refSeq.blocked.noSplice.lff.gz
to your database. It will create a track called "RefSeq:Blocked". This
is the RefSeq genes track with intronic sequences treated as part of
the gene, and all of the splice variants removed. Use the Attribute
Lifter tool in Genboree to lift in the sample names from the chr 12 gains
that hit these genes.
- Click on the track name and use the tabular view to create a table
with two columns - the gene name, and a comma-separated list of matching
samples. Download this table, then write a small ruby script that
parses this table, and outputs only the few genes that are altered in
more than 20 samples.
- note: the "numIntersects" field is not a reliable
indicator of how many samples match, only how many distinct annotations
match. You'll have to write a script to count the entries in the
samples attribute.
- deliverables: the ruby script that parses your tabular output, plus a
two-column list: the first column contains the genes on chromosome 12
that have gains in more than 20 samples, and the second column the exact
number of samples in which each gene is altered.
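The counting in step 5 might be sketched roughly as follows. It assumes the downloaded table is tab-delimited with a single header line, the gene name in the first column, and the comma-separated sample list in the second; the real download may differ. The method name count_samples is made up for this illustration, and the choice to count each sample once per gene (via uniq) is also an assumption about what "number of samples" should mean.

```ruby
# Sketch only: assumes a tab-delimited download with ONE header line,
# gene name in column 1 and a comma-separated sample list in column 2.
MIN_SAMPLES = 20  # keep genes altered in MORE than this many samples

# Return a hash of gene => sample count for genes above the threshold.
def count_samples(lines, min_samples = MIN_SAMPLES)
  counts = {}
  lines.drop(1).each do |line|
    gene, samples = line.chomp.split("\t")
    next if gene.nil? || samples.nil?   # skip blank or incomplete rows
    # uniq guards against one sample being listed twice for the same gene
    # (for example, two segments from one tumor hitting that gene).
    n = samples.split(",").map(&:strip).reject(&:empty?).uniq.length
    counts[gene] = n if n > min_samples
  end
  counts
end

# Only reads a file when a path is given on the command line.
if ARGV.length == 1
  count_samples(File.readlines(ARGV[0])).each do |gene, n|
    puts "#{gene}\t#{n}"
  end
end
```

Counting the entries in the samples attribute directly, rather than trusting "numIntersects", is what the note in this step asks for.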
All deliverables:
- ruby script from step 1 that creates your lff file
- ruby script from step 5 that parses your table
- two-column output from step 5
This assignment will be due on Feb 18, 2010.
Zip the files up, title the zip with your name, and send them to chiachiw@bcm.edu.
Feel free to contact me if you're having any problems. Email is
usually the best way, and I'll almost always respond within an hour or
two. We can also arrange a meeting - email me and we'll work out the
details.
I'll look over early submissions and if there are major problems,
I'll return them to you and give you a chance to resubmit. Assignments
completed closer to the due date may not get this opportunity.