Motif, Pattern, and Profile Assignment
In this assignment you should prepare a report with the answers to various questions.
This report can be written in any text editor or word processor.
Do not hand in printouts from the various programs; just answer the questions!
Add your name and e-mail address on the front page!!!
READ THROUGH the assignment before you begin! It will help you understand what you are supposed to do.
Some background:

In this exercise you will practice using various online databases, all of which have something to do with motifs, patterns or profiles.

The idea behind searching this kind of database is to use the power of multiple alignments to determine the biological function of a sequence. We would like you to develop a feel for how to use this kind of database in general, to get a better understanding of the different kinds of information these databases provide, and to understand the benefit of using them.


Last week, we worked on kisspeptin, which functions as a neuropeptide. We also worked on the Beethoven gene, which is related to deafness. This week we move to another brain and sensory system, vision. Our gene of interest, IMPG1, came up in a screen for genes that cause Vitelliform Macular Dystrophy (VMD), in which there is impaired vision. It is your job, as the lab bioinformatician, to find out what domains the protein has, and what is known about the proteins that have these domains. The human protein has the accession number NP_001554, and the sequence is given here:

MYLETRRAIFVFWIFLQVQGTKDISINIYHSETKDIDNPPRNETTESTEKMYKMSTMRRIFDLAKHRTKR
SAFFPTGVKVCPQESMKQILDSLQAYYRLRVCQEAVWEAYRIFLDRIPDTGEYQDWVSICQQETFCLFDI
GKNFSNSQEHLDLLQQRIKQRSFPDRKDEISAEKTLGEPGETIVISTDVANVSLGPFPLTPDDTLLNEIL
DNTLNDTKMPTTERETEFAVLEEQRVELSVSLVNQKFKAELADSQSPYYQELAGKSQLQMQKIFKKLPGF
KKIHVLGFRPKKEKDGSSSTEMQLTAIFKRHSAEAKSPASDLLSFDSNKIESEEVYHGTMEEDKQPEIYL
TATDLKRLISKALEEEQSLDVGTIQFTDEIAGSLPAFGPDTQSELPTSFAVITEDATLSPELPPVEPQLE
TVDGAEHGLPDTSWSPPAMASTSLSEAPPFFMASSIFSLTDQGTTDTMATDQTMLVPGLTIPTSDYSAIS
QLALGISHPPASSDDSRSSAGGEDMVRHLDEMDLSDTPAPSEVPELSEYVSVPDHFLEDTTPVSALQYIT
TSSMTIAPKGRELVVFFSLRVANMAFSNDLFNKSSLEYRALEQQFTQLLVPYLRSNLTGFKQLEILNFRN
GSVIVNSKMKFAKSVPYNLTKAVHGVLEDFRSAAAQQLHLEIDSYSLNIEPADQADPCKFLACGEFAQCV
KNERTEEAECRCKPGYDSQGSLDGLEPGLCGPGTKECEVLQGKGAPCRLPDHSENQAYKTSVKKFQNQQN
NKVISKRNSELLTVEYEEFNHQDWEGN

Our first step is to use a general tool for pattern and domain searching; afterwards we'll move on to more specific databases. Run a search against InterPro, a tool that searches many databases with different profile definitions at once.

Paste in your sequence, and run with all of the default parameters.

  1. How many InterPro hits are there? What are their names, accession numbers and entry types?
  2. All but one of these hits are related to the same family (in other words, for the purposes of this question, you can ignore the hit which belongs to a different family). Since we have multiple definitions of the same family, which of the various Entry types of this family is the most specific (in terms of function, evolutionary spread, etc.)? (Click through the IPR links and the pages inside to see. Look at the definitions, the subfamilies if there are any, the number of sequences, the range of species... and give details in your answer.)

Now let's define all of the "sub" hits. Go to the Detailed signature matches section and answer the following:

  1. For each of the hits (except for the unintegrated ones), write which database found it, its name, its accession number, and its actual positions on the protein (this is best done in a table).
  2. Please summarize this - how many domains does the protein have?
Now we have to evaluate the results, and see how much evidence there is that each domain is real.
  1. How many databases find each domain from question 4?
  2. Which domain looks problematic? Why? (use the previous questions to help answer this...)
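
If your table of signature matches is long, the counting in the two questions above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the rows below are hypothetical placeholders (not real InterPro results), the function name `group_overlapping` is made up for this example, and treating two hits as "the same domain" when they overlap by at least half of the shorter hit is an arbitrary choice that you should judge for yourself.

```python
# Hypothetical placeholder rows: (database, signature name, start, end).
# Replace them with the rows from your own table; these are NOT real results.
hits = [
    ("DB_A", "domain_x", 10, 120),
    ("DB_B", "domain_x_like", 15, 118),
    ("DB_C", "domain_x", 12, 125),
    ("DB_A", "domain_y", 300, 340),
]


def group_overlapping(hits, min_overlap=0.5):
    """Greedily group hits whose ranges overlap by at least min_overlap
    of the shorter range; return one dict per putative domain."""
    groups = []
    # Process hits from the N-terminus to the C-terminus.
    for db, name, start, end in sorted(hits, key=lambda h: h[2]):
        for g in groups:
            overlap = min(end, g["end"]) - max(start, g["start"]) + 1
            shorter = min(end - start + 1, g["end"] - g["start"] + 1)
            if overlap >= min_overlap * shorter:
                # Same region: widen the group and record the database.
                g["start"] = min(g["start"], start)
                g["end"] = max(g["end"], end)
                g["databases"].add(db)
                break
        else:
            # No existing group overlaps enough: start a new one.
            groups.append({"start": start, "end": end, "databases": {db}})
    return groups


domains = group_overlapping(hits)
for d in domains:
    print(d["start"], d["end"], len(d["databases"]), sorted(d["databases"]))
```

A region supported by a single database is not necessarily wrong, but it deserves a closer look than one that several independent databases agree on.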

Now we'll run this sequence against the Pfam database. It can be run here. Click on Sequence Search to get the input box.

  1. How many hits are there? (give details: name, accession number, location in protein).
  2. Describe the hits (answer this for each hit separately, look at the alignments, the scores...)
  3. According to this search, how many domains does the protein have?
Click on the family name link.

Click on the "HMM logo" link from the left side-bar.

  1. Which amino acid is most conserved? Which part of the domain is more conserved (make sure you scroll to the side so you can see the whole domain)?

Click on the "Domain organisation" link from the left side-bar. Answer all questions from the first page that comes up only.

  1. How many proteins have only this domain (ignoring transmembrane domains)?
  2. How many have the same combination of domains as our protein according to InterPro, with no additional domains? (Hint: do a text search on the page for the name of the other domain.)

Now we'll go to Prosite, for three reasons:
A) We got the most hits in InterPro from Prosite, so it's a good idea to look at the original source
B) This is the only database which found the domain in question 6, and we may be able to get more information on it
C) We can look for common patterns that occur often, which are defined only in Prosite

Open a Prosite window. Go to the "Quick Scan mode of ScanProsite" header (or look for the big white box) and paste the sequence in. As we want as much information as possible, we'll use the option for frequent sites (in other words, uncheck the box under the sequence entry box). Press the "Scan" button.

Answer the following questions (here, "motifs" refers to both patterns and profiles):

  1. How many hits are there?
  2. How many different profiles/patterns did the program find?
  3. What general type of motif is the most common (in terms of the number of motifs and hits together)?
  4. Why?
  5. Which motif(s) is (or are) most likely to represent a family of proteins?

Now we'll try to understand the answer to question 6, the problematic domain. First we have to learn more about this family.

Follow the PS##### link of the profile from that family (it is in the banner line at the top of the hit, next to the profile name). The page that opens should describe the whole family (and have "Documentation PDOC#####" at the top).

Read through the whole page to learn about the family and the various motifs that were defined.

  1. What is the function of this domain?
  2. What information do we have about the important amino acid residues (the critical ones for domain function)?
  3. How many signatures (patterns or profiles) does this document link to (don't forget to give accession numbers and names)?
  4. Go back to the search results. How many of the patterns or profiles from the previous question were found in our protein? How good a match is the one that was found?
In the problematic domain, not all the patterns were found. We are going to try to understand why, and how reliable the results are.

Enter the PS##### page of the profile that WAS found in our protein (from the PDOC page, at the bottom). Look at the numerical results section.

  1. Based on this section, how good are the chances that our protein belongs to this family? Why? (use the numbers to prove your point, don't just quote the statistics)
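
When arguing from the numbers, it can help to turn the raw hit counts into two ratios: precision (what fraction of reported hits are true family members) and recall (what fraction of true family members are found). The short sketch below shows the arithmetic only. The counts are hypothetical placeholders, not values from any Prosite entry, and the function name is made up for this example; take the true positive, false positive and false negative counts from the page you are reading.

```python
def precision_recall(true_pos, false_pos, false_neg):
    """Return (precision, recall) computed from raw hit counts."""
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return precision, recall


# Hypothetical counts, for illustration only.
p, r = precision_recall(true_pos=90, false_pos=10, false_neg=30)
print(f"precision = {p:.2f}, recall = {r:.2f}")
```

A signature with high precision rarely matches proteins outside the family, so a match is strong evidence; a signature with low recall misses many real members, so the absence of a match says little.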

Now go back to the family description page (PDOC), to the patterns that are missing in our problem region. Click on the PS##### to enter the page (if there is more than one, do it for all).

  1. Please write down the consensus patterns that were missed here.
  2. Why didn't the pattern(s) find our problem domain? Compare the pattern(s) (otherwise known as the signature, from the previous question) to your input sequence. (You can also go back to your original results for help with this.)
  3. Look at the numerical results sections of the patterns, and the answers to the previous two questions. How good are the chances that our protein belongs to this family? Why? (give a detailed answer)

Note - in Prosite patterns, [ ] means "any one of the amino acids inside", ( ) gives the number of times the element directly before it is repeated, { } means "any amino acid except the ones inside", and x means "any amino acid".
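
To compare a pattern with your sequence by hand, it can help to translate the notation above into a regular expression. The following is a minimal sketch, not the way ScanProsite itself works: the function names are made up for this example, and the translation only covers the basic syntax described in the note (plus "-" as a separator and "<" / ">" as sequence-end anchors), so unusual patterns may need manual adjustment. The example uses the N-glycosylation site pattern N-{P}-[ST]-{P} and a short fragment taken from the first line of the sequence given earlier.

```python
import re


def prosite_to_regex(pattern):
    """Translate a basic Prosite pattern into a Python regular expression."""
    regex = pattern.strip().rstrip(".")   # drop a trailing period, if any
    regex = regex.replace("-", "")        # "-" only separates elements
    # "{ABC}" (anything except A, B, C) becomes "[^ABC]"
    regex = regex.replace("{", "[^").replace("}", "]")
    # "(n)" or "(n,m)" (repeat counts) become "{n}" or "{n,m}"
    regex = regex.replace("(", "{").replace(")", "}")
    regex = regex.replace("x", ".")       # "x" is any amino acid
    # "<" and ">" anchor the pattern to the sequence ends
    regex = regex.replace("<", "^").replace(">", "$")
    return regex


def find_matches(pattern, sequence):
    """Return (1-based start, matched text) for every match, including
    overlapping ones (hence the lookahead)."""
    regex = prosite_to_regex(pattern)
    return [(m.start() + 1, m.group(1))
            for m in re.finditer(f"(?=({regex}))", sequence)]


fragment = "DNPPRNETTESTEK"  # short fragment from the first sequence line
print(prosite_to_regex("N-{P}-[ST]-{P}"))
print(find_matches("N-{P}-[ST]-{P}", fragment))
```

Positions are reported relative to the string you pass in, so run it on the full protein sequence if you want coordinates you can compare with the ScanProsite output.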

Now you go back to the lab to try to plan some experiments.....


Hand in the report with all the answers as Assignment #6