or how I learned to stop freaking out and love the data deluge
This document is a chatty, blog-like entry about some of the stuff I went through while analyzing some of the Forney lab data. Many of the best things I've learned came from looking over someone's shoulder while they did their voodoo. My intention is to let you look over my shoulder while I use the IBEST systems to actually analyze some data.
The scenario: the Forney lab gets a ton of 454 sequence data. A BIG TON. More like a tonne. I want to learn how to analyze just this sort of data. I get a copy of the data. What do I do now?
My rough plan of attack is:
- find out what format the data are in and how they were generated
- find out why the data were gathered, and what scientific questions/hypotheses drove the work (note: just looking through data with no goal in mind--naive "data mining"--is a waste of time, in my opinion.)
- reorganize the data so it will be easier to process on our systems, specifically on the clusters
- analyze for specific hypotheses/questions
- slay any ogres that get in the way (write scripts, do math, whatever it takes)
Tip on getting started: your biggest problem will be keeping track of all the files, and remembering what you have done to each. I recommend having a README file in every directory, documenting what you've done in that directory. Also, it really helps to know some basic bash scripting tricks (like for ... do ... done). My howto page has some tips.
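To make the tip concrete, here is a minimal sketch of a for ... do ... done loop that also writes down what it did in the directory's README. The function name log_read_counts, the .fna extension, and the sample file names are all made up for illustration, so swap in whatever your own data look like.

```shell
#!/bin/bash
# Sketch only: the function name, the .fna extension, and the sample files
# are assumptions made for illustration.

# Count FASTA records in every .fna file in a directory and note the
# result in that directory's README.
log_read_counts() {
    local dir=$1 f n
    for f in "$dir"/*.fna; do
        [ -e "$f" ] || continue          # skip if the glob matched nothing
        n=$(grep -c '^>' "$f")           # FASTA records start with ">"
        echo "$(date +%F) counted reads in $(basename "$f"): $n" >> "$dir/README"
    done
}

# Try it in a scratch directory so nothing real is touched.
demo=$(mktemp -d)
printf '>read1\nACGT\n' > "$demo/sampleA.fna"
printf '>read1\nGGCC\n>read2\nTTAA\n' > "$demo/sampleB.fna"
log_read_counts "$demo"
cat "$demo/README"
```

Appending with >> rather than overwriting with > means the README grows into a running log, one dated line per file touched.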
For what it's worth, I'm also building a howto page with lots of tricks and tips.
Note: I am beginning to think "we" really need to put all the data in a real database first, so that we can pull out individual items (such as the 14-character name, its 10-character equivalent, sequences, alignments, and quality scores) and never lose the relationships between them. That would also make it easier to do on-the-fly analysis between arbitrary subgroups of data, and to archive a project once it is finished. Perhaps I'll have time to work on this someday.
(last major update james 14:30, 17 June 2009 (PDT))

