Final Project

Instructions and Requirements for the Final Project
to be submitted by August 8, 2019

Title Page with the following information:
The name of your final project, your name, ID number, department, tel. #, and email.
Start the final project with an abstract describing the biological purpose of your analyses (not a summary of your work, more of an introduction).
For each type of analysis used, give a short description of what you did, what program you used (including what parameters were used), and most importantly, analyze and explain the meaning of your results (do not explain how the programs work, unless it is relevant to the discussion of your results!).
At the end (of the whole project) include the full output of the programs you ran (with the exception of the database search, details below). Please print the files directly from the programs - do NOT edit them or paste them into Word! For all outputs from the web (such as database searches) use the Courier font.
DO NOT include the outputs in your discussion (refer to the results, do NOT copy/paste them). The text should include all the information necessary to understand your project, and the outputs should be in an appendix at the end.
Please print the final project on a Laser Printer (do not submit a handwritten paper).

The project is due NO LATER THAN Thursday, August 8, 2019, by 4pm. Bring it to the Levine building, Room 111. (You can bring it before the deadline too.)

If you have a problem with the deadline (for example: army reserve duty (miluim), out of the country - not "I have another exam") speak to Shifra BY July 31. (Telephone x2470).

The computer classroom will be reserved for us after the end of the semester, through the due date at the regularly scheduled hours.

For the Project

Choose any mRNA (your favorite gene, a project from your lab, something that has always interested you...). The species must be one that is in the UCSC genome browser. Bacterial and viral sequences are not acceptable; if you want to work with plants, ask us. To make life easier, try not to take a long gene!
Run a search with your sequence against the appropriate genome database.
Please report which chromosome and which strand your gene is located on, whether the sequence is draft or finished (if your genome has draft sequence), the exon/intron structure of your gene (as far as you can tell from your results), and any splice variants. Don't forget to explain ALL the hits (even those that are unexpected).
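If you save your BLAT result in PSL format, the strand and the exon/intron structure can be read from the block columns. The following is only a minimal sketch, assuming the standard 21-column, tab-separated PSL layout for a nucleotide query; the function name blocks_from_psl and the example line are hypothetical and are not part of the assignment. Note that small gaps between blocks may be insertions or deletions rather than true introns, so check them against the genome browser display.

```python
def blocks_from_psl(psl_line):
    """Return (chromosome, strand, aligned blocks, gaps between blocks).

    Assumes one data line of a standard 21-column PSL file (no header).
    Coordinates are converted from PSL's 0-based starts to 1-based,
    inclusive positions on the target (genome) sequence.
    """
    fields = psl_line.rstrip("\n").split("\t")
    if len(fields) < 21:
        raise ValueError("expected 21 tab-separated PSL columns")
    strand, chrom = fields[8], fields[13]
    # blockSizes and tStarts are comma-separated lists with a trailing comma.
    sizes = [int(x) for x in fields[18].split(",") if x]
    t_starts = [int(x) for x in fields[20].split(",") if x]
    blocks = [(start + 1, start + size) for start, size in zip(t_starts, sizes)]
    # Gaps between consecutive blocks are candidate introns (assumption:
    # large gaps are introns, very small ones may just be indels).
    gaps = [(blocks[i][1] + 1, blocks[i + 1][0] - 1)
            for i in range(len(blocks) - 1)]
    return chrom, strand, blocks, gaps


# Hypothetical two-block hit, made up for illustration only.
example = "\t".join([
    "150", "0", "0", "0", "0", "0", "1", "900", "+", "myGene", "150",
    "0", "150", "chr1", "1000000", "1000", "2050", "2",
    "100,50,", "0,100,", "1000,2000,",
])
chrom, strand, blocks, gaps = blocks_from_psl(example)
print(chrom, strand, blocks, gaps)
```

In this made-up example the two blocks would correspond to two candidate exons, and the single gap between them to a candidate intron.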
Take the mRNA and run a translation program - find the open reading frame and translate it to a Protein.
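To make clear what a translation program does at this step, here is a minimal sketch, not a replacement for the program you use in the project. It assumes the standard genetic code, scans only the three forward reading frames (the input is an mRNA), and counts only open reading frames that start with ATG and end with a stop codon. The function name longest_orf is hypothetical.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def longest_orf(mrna):
    """Return (0-based start position, protein) of the longest ATG..stop ORF.

    Returns (-1, "") if no complete ORF is found. Codons containing
    ambiguous bases are translated as "X".
    """
    seq = mrna.upper().replace("U", "T")
    best_start, best_protein = -1, ""
    for frame in range(3):
        start, protein = -1, ""
        for i in range(frame, len(seq) - 2, 3):
            aa = CODON_TABLE.get(seq[i:i + 3], "X")
            if start < 0:
                if aa == "M":          # open a new ORF at the first ATG
                    start, protein = i, "M"
            elif aa == "*":            # stop codon closes the current ORF
                if len(protein) > len(best_protein):
                    best_start, best_protein = start, protein
                start, protein = -1, ""
            else:
                protein += aa
    return best_start, best_protein


print(longest_orf("GGATGGCCAAATAGCC"))
```

In your report, compare the open reading frame you obtain from your translation program with the annotated protein for your gene.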
Database similarity search
Run a similarity search of your Protein sequence against a Protein database using BLAST.
Please describe in your report:
Which program, database, and scoring matrix you used, and whether you used the filter.
Look at the hits list: describe the distribution of the hits across the various E-values.
Look at the Alignments and report:
For the top 10 hits: comment on their length and % identity (and similarity), in addition to E-value, organism, related proteins, etc.
A summary of the rest of the results (which organisms they come from, whether they are from the same family, etc.).
Pairwise comparison
You need to test the validity of the last hit on the hits list from the database similarity search.
Please describe in your report:
Which sequences you compared, which program you used, and the % similarity and % identity.
What the alignment looks like and how it compares to the database search that found it.
Don't forget that the algorithms have to match! (database search and pairwise - global or local)
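To show how % identity and % similarity are derived from an alignment, here is a minimal sketch that works on two already-aligned sequences (gaps written as "-"). It makes two assumptions that you should state if you use anything like it: the percentages are taken over the full alignment length, gaps included, and "similar" is defined by a small set of illustrative residue groups rather than by the scoring matrix your program actually used. Different programs use different denominators and matrices, so the numbers may not match your program's output exactly.

```python
# Illustrative similarity groups (an assumption, NOT a real scoring matrix).
SIMILAR_GROUPS = ["ILVM", "FYW", "KRH", "DE", "ST", "NQ"]
GROUP_OF = {aa: index
            for index, group in enumerate(SIMILAR_GROUPS) for aa in group}


def identity_and_similarity(aligned_a, aligned_b):
    """Return (% identity, % similarity) over the full alignment length."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_a:
        raise ValueError("alignment is empty")
    identical = similar = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == "-" or b == "-":
            continue                   # gap columns count in the length only
        if a == b:
            identical += 1
            similar += 1               # identical residues are also similar
        elif a in GROUP_OF and GROUP_OF.get(b) == GROUP_OF[a]:
            similar += 1
    length = len(aligned_a)
    return 100 * identical / length, 100 * similar / length


identity, similarity = identity_and_similarity("MKV-LD", "MRVALE")
print(f"identity {identity:.1f}%, similarity {similarity:.1f}%")
```

Because a local alignment covers only the best-matching region while a global alignment covers both sequences end to end, the same pair of sequences can give quite different percentages under the two approaches, which is why the pairwise algorithm should match the one used in the database search.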
Multiple Alignment
When choosing the sequences for multiple alignment, choose sequences that are 80% similar or less (if possible - if you have a particular reason why you want to use more similar sequences, discuss it with us first). (If you don't get less similar sequences from your database search, redo it with more hits!)
Use at least 5 sequences in addition to yours.
Choose the method for the multiple alignment according to how similar the sequences are to your query sequence. You only have to use ONE method (you can use either ClustalW, Clustal Omega or MUSCLE).
Please describe in your report:
Which sequences were used for the alignment, how similar they are to your sequence, and which program was used.
Describe the results; in particular, point out regions which are conserved and regions which are variable.
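One simple way to locate conserved and variable regions is to score each alignment column by how often its most common residue appears. The sketch below is only an illustration under stated assumptions: the function names, the 0.8 threshold and the minimum run length are arbitrary choices, and the conservation annotation produced by your alignment program is based on residue properties and will not be identical.

```python
from collections import Counter


def column_conservation(alignment):
    """Fraction of sequences sharing the most common residue, per column.

    `alignment` is a list of equal-length aligned sequences; "-" is a gap.
    Gaps count against conservation; all-gap columns score 0.0.
    """
    if len({len(seq) for seq in alignment}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    scores = []
    for column in zip(*alignment):
        residues = [c for c in column if c != "-"]
        if not residues:
            scores.append(0.0)
            continue
        top_count = Counter(residues).most_common(1)[0][1]
        scores.append(top_count / len(column))
    return scores


def conserved_regions(scores, threshold=0.8, min_length=3):
    """Return 1-based, inclusive (start, end) runs of conserved columns."""
    regions, start = [], None
    # The -inf sentinel closes a run that reaches the end of the alignment.
    for i, score in enumerate(scores + [float("-inf")]):
        if score >= threshold:
            if start is None:
                start = i
        elif start is not None:
            if i - start >= min_length:
                regions.append((start + 1, i))
            start = None
    return regions


# Toy alignment, made up for illustration only.
toy = ["MKVLAG", "MKVLSG", "MKVL-G", "MRVLTG"]
toy_scores = column_conservation(toy)
print(toy_scores)
print(conserved_regions(toy_scores, threshold=0.8, min_length=2))
```

The positions reported are alignment columns, not positions in your query sequence; to relate them to your protein you need to subtract the gaps that precede them in your sequence's row.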
Use InterProScan to look for motifs in your Protein sequence.
Describe your results. (Don't forget to list which databases, what the hits are, where they are in your sequence, and the sequence signatures, if they are available.)
Can you say that the "motifs" found in your sequence are represented in the multiple alignment? (If so, how well are they conserved? If not, why not?)
Summarize your findings.

Your outputs should include:

1) your sequence
2) a printout of the genomic viewer with your BLAT hits (the BLAT output itself, in other words the full list of hits and the alignment, plus a printout of the genome browser with your hits visible on it for the best match to your gene)
3) the output of your translation program
4) The full hits list and at least the top ten alignments of your database search (you should also include the alignments of any other sequences you use later on - particularly the sequence you use for pairwise analysis)
5) your pairwise alignment output
6) your multiple alignment
7) a printout of your InterPro results (the graphical view is enough, not all of the internal pages!)

For questions or suggestions please contact:

Shifra shifra.ben-dor@weizmann.ac.il