Project 3

BIOL/CMPU 353 - Bioinformatics
Smith and Schwarz
Spring 2012

Assigned: Tue, Feb 28
Due: Tue, Mar 20 (after spring break)

Prediction of 5' UTRs in Aiptasia

We just spent a field trip at CSHL learning about the role of 3’UTRs in regulating gene expression and about bioinformatic approaches for identifying genes whose 3’UTRs are targeted by miRNAs. Now let’s think about the role of 5’UTRs. These regions of the mRNA are likely involved in regulating gene expression and translation, but little is known about how they function. Let’s say that we are interested in isolating the 5’UTRs from a transcriptome so that we can study them. How might we design an approach for identifying them? We can use BLAST searches to help us.

Imagine you’ve just Blasted your large set of sequences against an even larger database, and after many hours (or days) running on your high-performance parallel computing cluster, you have a still larger Blast results file! What now?

  • Blast search previously executed. File located at
    /home/joschwarz/public/AiptasiaTranscriptome_v_SwissProt.blastx
    • This is the file that your program should read from
    • Please do NOT copy this file due to its large size (almost 0.6 GB)
  • Background needed
    • PEDNA Ch 8: Reading text files (fasta, blast results, any text file)
    • Finite State Machines: a framework for designing parsing algorithms
  • Starter code: assign3v1.pl.txt
    • modify this code to read Jodi’s blast results file
    • add additional states and state transitions
    • remember: we want to print only those hits whose alignments begin with an M (methionine, the translation start), where that M is in the first position of the Subject: sequence.
  • Output format notes:
    • columns should be tab-delimited to make importing into a spreadsheet easier
    • our algorithm found 4370 predicted 5’ UTRs (corrected from 4341; see note 1 below), too many to reasonably list!
    • here are the first 15 predictions from our output
      EST	EST Length	Query Start Position
      Locus_89379_Transcript_1	3623	2244
      Locus_82449_Transcript_1	3961	421
      Locus_82448_Transcript_1	4162	421
      Locus_82446_Transcript_1	3999	407
      Locus_97156_Transcript_1	4598	4489
      Locus_21618_Transcript_1	1175	29
      Locus_124359_Transcript_1	2118	659
      Locus_125402_Transcript_3	1999	92
      Locus_125403_Transcript_1	1880	92
      Locus_129236_Transcript_1	5316	73
      Locus_129235_Transcript_1	5557	73
      Locus_129234_Transcript_1	5212	73
      Locus_52816_Transcript_1	2177	1123
      Locus_52819_Transcript_1	1860	759
      Locus_52818_Transcript_1	1212	164
1) With thanks to Daniel for helping us identify a problem with our solution’s parser: EST lengths over 9999 are possible, and our regex did not account for the commas in them.
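The parser problem in note 1 can be illustrated with a small Python sketch. It assumes the length appears on a line of the form "(N letters)", with N possibly containing commas; that layout and the names NAIVE, TOLERANT, and parse_length are assumptions made for illustration, not taken from our solution.

```python
import re

# A digits-only pattern silently misses lengths written with commas.
NAIVE = re.compile(r"\((\d+) letters\)")
# Allowing commas inside the number catches them.
TOLERANT = re.compile(r"\(\s*([\d,]+)\s+letters\)")


def parse_length(line, pattern=TOLERANT):
    """Return the sequence length on this line as an int, or None."""
    m = pattern.search(line)
    if m is None:
        return None
    # Strip the commas before converting to an integer.
    return int(m.group(1).replace(",", ""))
```

With the digits-only pattern, a record whose length contains a comma yields no match at all, so a parser that waits for the length before moving on can drop that record without any error message.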
courses/cs353-201201/assigns/assign03.txt · Last modified: 2012/03/12 23:11 by mlsmith