Assignment 4: Database similarity searching
In this assignment you should prepare a report with the answers to various questions.
This report can be written in any text editor or word processor.

In the report, give JUST the answers. DO NOT print out the questions and write the answers in.

READ THROUGH THE WHOLE ASSIGNMENT BEFORE YOU START. It will help you understand the purpose of the questions!!!!

Add your name and e-mail address on top of the page!!!


This assignment will deal with the various ways you can change parameters to fine tune a database search.

Our target gene is Lip1, a single-pass transmembrane protein that is required for de novo ceramide (sphingolipid) synthesis in yeast. According to the local campus bioinformatics expert, the gene is limited to a very specific branch of fungi. As the other genes involved in ceramide synthesis are highly conserved in almost all eukaryotes, you find this strange. Your job is to see if what was reported is correct, and to try to find the gene in other species (particularly mammals) as well.

Go to the NCBI web site (this will open a new window): http://www.ncbi.nlm.nih.gov/ and retrieve the protein. The accession number is: NP_014027

Go to the protein BLAST page (Blast at NCBI) and run a search with most of the default parameters. Go to the bottom of the page and click on the 'Algorithm parameters' blue bar. Change the Expect threshold (E value) to 10, to allow for more distant sequences.

  1. How many hits are there?
  2. Open the search summary (link near the top of the page, under the header bars): what is the matrix and word size?
  3. What species family are most of the hits from? (Use the Taxonomy tab at the top of the results.) How many hits are not from that family?
  4. How many of the hits are not statistically significant?
  5. Please make a chart of the last 5 hits with the following columns: 1) description 2) accession number 3) percent similarity 4) percent identity 5) length of alignment 6) e-value. Note: you will have to go to the alignments tab to get some of this information.
  6. Look at the alignments of the sequences in your chart. Which do you think are truly Lip1 proteins? ("No" and "I can't tell" are legitimate answers.)
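
If you copy the hit table into a tab-separated text file rather than transcribing values by hand, the chart for question 5 and the count for question 4 can be put together with a few lines of Python. The sketch below is only an illustration. The column order, the accession strings and every number in `rows` are made-up placeholders, not real BLAST output, and the 0.05 cutoff used for "statistically significant" is an assumption, not a value set by this assignment.

```python
import csv
import io

# Placeholder data only: these rows are NOT real BLAST results.
# Assumed column order: description, accession, % similarity,
# % identity, alignment length, E-value.
rows = """\
hypothetical protein A\tACC_0001\t55.0\t38.0\t150\t2e-20
hypothetical protein B\tACC_0002\t48.0\t30.0\t140\t3e-08
hypothetical protein C\tACC_0003\t40.0\t25.0\t90\t0.004
hypothetical protein D\tACC_0004\t39.0\t24.0\t60\t0.8
hypothetical protein E\tACC_0005\t37.0\t22.0\t45\t4.1
hypothetical protein F\tACC_0006\t35.0\t21.0\t40\t9.6
"""

FIELDS = ["description", "accession", "similarity", "identity", "length", "evalue"]


def read_hits(text):
    """Parse tab-separated hit lines into a list of dicts."""
    reader = csv.DictReader(io.StringIO(text), fieldnames=FIELDS, delimiter="\t")
    hits = []
    for row in reader:
        row["evalue"] = float(row["evalue"])  # compare E-values as numbers
        hits.append(row)
    return hits


def last_hits(hits, n=5):
    """Return the last n hits in the order they were listed."""
    return hits[-n:]


def count_not_significant(hits, cutoff=0.05):
    """Count hits whose E-value is above the (assumed) cutoff."""
    return sum(1 for h in hits if h["evalue"] > cutoff)


hits = read_hits(rows)
for h in last_hits(hits):
    print("\t".join(str(h[f]) for f in FIELDS))
print("not significant:", count_not_significant(hits))
```

Replace `rows` with your own values and pick the cutoff you can justify in your report.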

Now that we have a baseline, we'll start to change parameters. Click "Edit and Resubmit" at the top of the page; do not use the browser's Back button, as it doesn't work in all browsers. Be sure to change the title of each search so that you can keep track of what you've done; this can be done in the 'Job Title' line under the box where you paste the sequence. The first thing we'll do is change the word size to 2, keeping the E value at 10, and run the BLAST search again.

  1. How many hits are there now?
  2. Look at the bottom ten hits and make a chart similar to the one before (question 5).
  3. Compare the two charts. What has changed? Has anything stayed the same? Why?
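
BLAST starts by looking for short "words" shared between the query and a database sequence before it attempts a longer alignment. The following sketch is a deliberate simplification: it only counts exact shared words, whereas protein BLAST also accepts similar words that score above a threshold. Both sequences are invented toy strings, not Lip1. It is meant only to show how the word size setting affects the number of starting points available for extension.

```python
def words(seq, size):
    """Return the set of all overlapping words of the given size."""
    return {seq[i:i + size] for i in range(len(seq) - size + 1)}


def shared_words(query, subject, size):
    """Words of the given size that occur in both sequences."""
    return words(query, size) & words(subject, size)


# Toy sequences made up for illustration (not real proteins).
query = "MKTLLVAGAVLLSA"
subject = "MRTLLIAGAVILSA"

for size in (2, 6):
    common = shared_words(query, subject, size)
    print(size, len(common), sorted(common))
```

With these toy strings, a word size of 2 finds several shared words, while a word size of 6 finds none.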

Now we are going to change databases. Click "Edit and Resubmit" again. Change the word size back to 6, leave the E value at 10, and change the database to swissprot.

  1. How many hits are there?
  2. Make a chart of the hits as before (question 5).
  3. Find the hit from the last species in your previous results (you may have to go back to the results pages). What has changed?
  4. How are these results different from the others (describe the differences), and why?
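
When comparing E values between searches, it can help to recall how an E value is commonly related to the bit score of an alignment: roughly, query length times database length times 2 to the power of minus the bit score. The sketch below uses that simplified relation. NCBI BLAST works with corrected "effective" lengths, so real values will differ, and the query length, bit score and database sizes here are hypothetical numbers chosen only to show the scaling.

```python
def expect_value(bit_score, query_length, database_length):
    """Approximate E value: query length * database length * 2^(-bit score)."""
    return query_length * database_length * 2.0 ** (-bit_score)


# Hypothetical numbers chosen only to illustrate the scaling.
query_length = 150   # residues in the query (assumed)
bit_score = 40.0     # bit score of one alignment (assumed)

for database_length in (1e8, 1e10):
    e = expect_value(bit_score, query_length, database_length)
    print(f"database residues {database_length:.0e}: E = {e:.3g}")
```

The same alignment, with the same bit score, is reported with a different E value depending on the size of the database searched.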

Now we'll see what changing the matrix will do to the search. Click "Edit and Resubmit" again. Change the database back to nr, and change the matrix to PAM250 (the word size should be 6 and the E value 10).

  1. How many hits are there?
  2. How many hits do we have that are not statistically significant? What species are they from?
  3. Make a chart of the last 5 hits as before.

One last time, we'll change the matrix again, this time to PAM30.

  1. How many hits are there? How many are not statistically significant?
  2. How many hits are not from the same family (from question 3)?
  3. Make a chart of the last 5 hits.
  4. Why are these results different from the previous runs?

To summarize:

  1. Look back over your various charts. Take the bottom two sequences from the very first run, and follow them through the four nr searches (ignore the swissprot search). Look at the alignments, the species, and the descriptions. Do you think they are true Lip1 proteins or not? Explain your reasoning.
  2. What can we do to test whether these sequences are true hits or not?
  3. We started this research to see if there is a mammalian equivalent. What do you think? Use your results to explain.
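
Following the same sequences through several searches is easier if each run's hits are stored under its job title. The following is a minimal sketch, assuming you have noted the accession number and E value of each hit per run. The run titles, accessions and E values below are placeholders, not real results.

```python
# Placeholder records: job title -> list of (accession, E value) in rank order.
runs = {
    "baseline": [("ACC_0001", 1e-30), ("ACC_0005", 4.1), ("ACC_0006", 9.6)],
    "word size 2": [("ACC_0001", 1e-30), ("ACC_0005", 4.0), ("ACC_0006", 8.9)],
    "PAM250": [("ACC_0001", 3e-25), ("ACC_0005", 6.2)],
    "PAM30": [("ACC_0001", 5e-28)],
}


def follow(accession, runs):
    """Report rank and E value of one accession in every run (None if absent)."""
    trail = {}
    for title, hits in runs.items():
        trail[title] = None
        for rank, (acc, evalue) in enumerate(hits, start=1):
            if acc == accession:
                trail[title] = (rank, evalue)
                break
    return trail


for title, found in follow("ACC_0006", runs).items():
    print(title, found)
```

A `None` entry means the sequence did not appear among the hits of that run, which is itself worth noting in your answer.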

Hand in the report with all the answers: Assignment #4