Assignment #6

Motif, Pattern, and Profile Assignment

In this assignment you should prepare a report with the answers to various questions.
This report can be written in any text editor or word processor.
Do not hand in printouts of the various programs, just answer the questions!
Add your name and e-mail address on the front page!!!
READ THROUGH the assignment before you begin! It will help you understand what you are supposed to do.

Some background:

In this exercise you will practice using various online databases, all of which have something to do with motifs, patterns or profiles.

The idea behind searching this kind of database is to use the power of multiple alignments to determine the biological function of a sequence. We would like you to develop a feel for how to use such databases in general, to better understand the different kinds of information they provide, and to appreciate the benefit of using them.

Last week, we worked on kisspeptin, which functions as a neuropeptide. This week we move to another brain and sensory system, vision. Our gene of interest, IMPG1, came up in a screen for genes that cause Vitelliform Macular Dystrophy (VMD), in which there is impaired vision. It is your job, as the lab bioinformatician, to find out what domains the protein has, and what is known about the proteins that have these domains. The human protein has the accession number NP_001554, and the sequence is given here:

MYLETRRAIFVFWIFLQVQGTKDISINIYHSETKDIDNPPRNETTESTEKMYKMSTMRRIFDLAKHRTKR
SAFFPTGVKVCPQESMKQILDSLQAYYRLRVCQEAVWEAYRIFLDRIPDTGEYQDWVSICQQETFCLFDI
GKNFSNSQEHLDLLQQRIKQRSFPDRKDEISAEKTLGEPGETIVISTDVANVSLGPFPLTPDDTLLNEIL
DNTLNDTKMPTTERETEFAVLEEQRVELSVSLVNQKFKAELADSQSPYYQELAGKSQLQMQKIFKKLPGF
KKIHVLGFRPKKEKDGSSSTEMQLTAIFKRHSAEAKSPASDLLSFDSNKIESEEVYHGTMEEDKQPEIYL
TATDLKRLISKALEEEQSLDVGTIQFTDEIAGSLPAFGPDTQSELPTSFAVITEDATLSPELPPVEPQLE
TVDGAEHGLPDTSWSPPAMASTSLSEAPPFFMASSIFSLTDQGTTDTMATDQTMLVPGLTIPTSDYSAIS
QLALGISHPPASSDDSRSSAGGEDMVRHLDEMDLSDTPAPSEVPELSEYVSVPDHFLEDTTPVSALQYIT
TSSMTIAPKGRELVVFFSLRVANMAFSNDLFNKSSLEYRALEQQFTQLLVPYLRSNLTGFKQLEILNFRN
GSVIVNSKMKFAKSVPYNLTKAVHGVLEDFRSAAAQQLHLEIDSYSLNIEPADQADPCKFLACGEFAQCV
KNERTEEAECRCKPGYDSQGSLDGLEPGLCGPGTKECEVLQGKGAPCRLPDHSENQAYKTSVKKFQNQQN
NKVISKRNSELLTVEYEEFNHQDWEGN

Our first step is to use a general tool for pattern and domain searching, and then we'll go to more specific databases: Run a search against InterPro. InterPro is a tool that searches many databases with different profile definitions at once.

Paste in your sequence, and run with all of the default parameters.

How many InterPro hits are there (InterPro integrated entries, with accession numbers starting with IPR)? What are their names, accession numbers and entry types (the entry type can be seen from the section of the results page, or from the colored letter in the hit when you mouse over it)?
Since we have multiple IPR definitions of the same family (family, domain, homologous superfamily), which of the entry types for this family is the most specific (in terms of function, evolutionary spread, etc.)? Click through the accession-number links on the right-hand side of the table (in the Domain section, use only the IPR hit that covers most of the protein) and the pages inside them. Look at the definitions, the subfamilies if there are any, the number of sequences and the range of species, and give details in your answer.

Now let's define all of the "sub" hits in the Domain section. You can mouse over each hit to answer the following:

For each of the hits (except for the unintegrated ones, if there are any), write which database found it, its name, its accession number, and its actual positions on the protein. This should be done in a table in your answer file (don't hand in a separate file). If you find it easier, you can download the results in table format by going to the blue export button on top of the results table and saving as TSV. TSV files can be opened in Excel or Numbers. There will be a lot of extra information there, so you will have to be careful when using the results this way. It may be preferable to do it by hand.
Please summarize this information - how many domains do you think the protein has, what name would you give them, and from where to where are they?
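
If you do work from the exported TSV file, a short script can strip out the extra information for you. The following is only a sketch: the column names in COLUMNS are hypothetical (check the header row of your own export and adjust them), and the rows in EXAMPLE are made-up placeholders, not real results for our protein.

```python
import csv
import io

# Hypothetical column names: check the header row of your own export
# and change these to match before using the script.
COLUMNS = ("database", "accession", "name", "start", "end")


def read_hits(tsv_text, columns=COLUMNS):
    """Return one dict per row, keeping only the columns of interest,
    sorted by position on the protein."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    hits = []
    for row in reader:
        hit = {col: row[col] for col in columns}
        hit["start"] = int(hit["start"])
        hit["end"] = int(hit["end"])
        hits.append(hit)
    return sorted(hits, key=lambda h: (h["start"], h["end"]))


# Made-up placeholder rows, NOT real results for IMPG1.
EXAMPLE = (
    "database\taccession\tname\tstart\tend\tcomment\n"
    "DB_B\tACC0002\tExample domain\t120\t180\textra\n"
    "DB_A\tACC0001\tExample domain\t10\t95\textra\n"
)

hits = read_hits(EXAMPLE)
for hit in hits:
    print(hit["database"], hit["accession"], hit["name"],
          hit["start"], hit["end"], sep="\t")
```

To use it on a real export, read the file contents into a string and pass that to read_hits instead of EXAMPLE. You still need to check the resulting table against the results page by eye.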

Now we have to evaluate the results, and see how much evidence there is that each domain is real.

How many databases find each domain from question 4?
Which domain looks problematic? Why? (Use the previous questions to help answer this...)
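
One way to think about "how many databases find each domain" is to merge hits whose positions overlap into a single region and then count the distinct databases in each region. Here is a minimal sketch of that idea. The function name group_by_region and the hit list are illustrative assumptions (the database names and positions are placeholders, not real results), and "any overlap at all" is only one possible rule for deciding that two hits describe the same domain.

```python
def group_by_region(hits):
    """Merge hits whose positions overlap.

    hits is a list of (database, start, end) tuples.
    Returns a list of (start, end, sorted_database_names), one per region.
    """
    regions = []
    for database, start, end in sorted(hits, key=lambda h: h[1]):
        if regions and start <= regions[-1][1]:
            # Overlaps the current region: extend it and record the database.
            current = regions[-1]
            current[1] = max(current[1], end)
            current[2].add(database)
        else:
            # No overlap: start a new region.
            regions.append([start, end, {database}])
    return [(start, end, sorted(dbs)) for start, end, dbs in regions]


# Placeholder hits, NOT real results for IMPG1.
EXAMPLE_HITS = [
    ("DB_A", 10, 95),
    ("DB_B", 15, 90),
    ("DB_C", 20, 99),
    ("DB_B", 300, 340),
]

regions = group_by_region(EXAMPLE_HITS)
for start, end, databases in regions:
    print(f"{start}-{end}: found by {len(databases)} database(s): {', '.join(databases)}")
```

A region supported by a single database is not necessarily wrong, but it deserves a closer look than one that several independent databases agree on.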

Now we'll run this sequence against the Pfam database. It can be run here. Click on Sequence Search to get the input box.

How many hits are there (all Pfam-A hits)? Give details for each: name, accession number, location in the protein (which here is called Alignment), and length of the domain. The part of the domain that was matched is given as HMM from-to, and the length of the domain is given in HMM length.
Describe the hits (answer this for each hit separately; look at the alignments by clicking the Show button at the end of the line, and at the scores).
According to this search, how many domains do you think the protein has? Explain.
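
When describing a hit, it can help to work out what fraction of the domain model was actually matched, using the HMM from, HMM to and HMM length values. The sketch below shows the arithmetic; the function name hmm_coverage and the numbers in the example calls are made up for illustration and are not taken from the real results.

```python
def hmm_coverage(hmm_from, hmm_to, hmm_length):
    """Fraction of the domain model (HMM) covered by a match.

    Positions are 1-based and inclusive, as in the results table.
    """
    if not 1 <= hmm_from <= hmm_to <= hmm_length:
        raise ValueError("expected 1 <= hmm_from <= hmm_to <= hmm_length")
    return (hmm_to - hmm_from + 1) / hmm_length


# Made-up numbers: a match to positions 1-90 of a 100-position model,
# and a match that only covers the last part of the same model.
nearly_full = hmm_coverage(1, 90, 100)
partial = hmm_coverage(40, 100, 100)
print(f"nearly full match: {nearly_full:.0%}")
print(f"partial match:     {partial:.0%}")
```

A match that covers only part of the model may be a truncated or divergent copy of the domain, so coverage should be read together with the score and the alignment, not on its own.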

Click on the family name link of the significant hits.

Click on the "HMM logo" link from the left side-bar.

Which amino acid is most conserved (give position and amino acid)? Which part of the domain is more conserved (make sure you scroll to the side so you can see the whole domain)?

Click on the "Domain organisation" link from the left side-bar. Answer all questions from the first page that comes up only.

How many different architectures have only this domain (ignoring transmembrane domains, and others with no name)? List them.
Click on "show next 200" at the bottom of the page. How many architectures have the same combination of domains as our protein according to InterPro (and no additional domains)? (Hint: do a text search on the page for the name of the other domain.)

Now we'll go to Prosite, for three reasons:
A) We got the most hits in InterPro from Prosite, so it's a good idea to look at the original source
B) This is the only database which found the domain in question 6, and we may be able to get more information on it
C) We can look for common patterns that occur often, which are defined only in Prosite

Open a Prosite window. Go to the "Quick Scan mode of ScanProsite" header (or look for the big white box) and paste the sequence in. As we want as much information as possible, we'll include frequent sites in the search (in other words, uncheck the box under the sequence entry box). Press the "Scan" button.

Answer the following questions (in the following questions, motifs refers to both patterns and profiles):

How many hits are there?
How many different profiles/patterns did the program find?
What general type of motif is the most common (in terms of the number of motifs and hits together)?
Why?
Which motif(s) is (or are) most likely to represent a family of proteins?

Now we'll try to understand the answer to question 6, the problematic domain. First we have to learn more about this family (All of the remaining questions deal with this family only!).

Enter the PS##### link of the profile from that family (from the banner line at the top of the hit, with the profile name). The page that opens up should describe the whole family (and have "Documentation PDOC#####" on top).

Read through the whole page to learn about the family and the various motifs that were defined.

What is the function of this domain?
What information do we have about the important amino acid residues (the critical ones for domain function)?
How many signatures (patterns or profiles) does this document link to (don't forget to give accession numbers and names)?
Go back to the search results. How many of the patterns or profiles from the previous question were found in our protein? How good a match is the one that was found?

In the problematic domain, not all the patterns were found. We are going to try to understand why, and how reliable the results are.

Enter the PS##### of the profile that WAS found in our protein (from the PDOC page, on the bottom). Look at the numerical results section.

Based on this section, how good are the chances that our protein belongs to this family? Why? (use the numbers to prove your point, don't just quote the statistics)

Now go back to the family description page (PDOC), to the patterns that are missing in our problem region. Click on the PS##### of the pattern to enter its page (if there is more than one, do this for all of them).

Please write down the consensus patterns that were missed here.
Why didn't the pattern(s) find the domain in our protein? Compare the pattern(s) (otherwise known as the signature, from the previous question) to your input sequence. (you can also go back to your original results for help in this)
Look at the numerical results sections of the patterns, and the answers to the previous two questions. How good are the chances that our protein belongs to this family? Why? (give a detailed answer)

Note - in Prosite patterns, [ ] means "any one of the amino acids inside", ( ) gives the number of times the element directly before it is repeated, { } means "any amino acid except those inside" and x means "any amino acid".
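
If you want to check by computer why a pattern does or does not match, the notation in the note above can be translated into a regular expression. The following is a minimal sketch, not the program that ScanProsite itself uses: the function names prosite_to_regex and find_matches are our own, only the basic notation is handled (elements separated by "-", x, [ ], { }, repeat counts in ( ), and the "<" and ">" end anchors), and the test sequence is a made-up toy, not our protein.

```python
import re


def prosite_to_regex(pattern):
    """Translate a basic Prosite-style pattern into a Python regex string."""
    pattern = pattern.strip().rstrip(".")
    parts = []
    for element in pattern.split("-"):
        prefix = suffix = repeat = ""
        if element.startswith("<"):       # must start at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):         # must end at the C-terminus
            suffix, element = "$", element[:-1]
        if element.endswith(")"):         # repeat count: x(3) or x(2,4)
            element, _, count = element[:-1].partition("(")
            repeat = "{" + count + "}"
        if element == "x":                # any amino acid
            core = "."
        elif element.startswith("{"):     # any amino acid except these
            core = "[^" + element[1:-1] + "]"
        else:                             # single letter, or [ABC]
            core = element
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)


def find_matches(pattern, sequence):
    """Return (start, end, matched_text) for every match, 1-based inclusive.

    A lookahead is used so that overlapping matches are also reported.
    """
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [
        (m.start() + 1, m.start() + len(m.group(1)), m.group(1))
        for m in regex.finditer(sequence)
    ]


# Toy sequence and example pattern, for illustration only.
TOY_SEQUENCE = "AANGSAANPTAANKTA"
matches = find_matches("N-{P}-[ST]-{P}", TOY_SEQUENCE)
print(prosite_to_regex("N-{P}-[ST]-{P}"))
for start, end, text in matches:
    print(start, end, text)
```

In the toy sequence, the stretch NPTA is not reported because its second position is P, which {P} forbids. Checking a missed pattern against your own sequence position by position in the same way shows exactly which residue breaks the match.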

Now you go back to the lab to try to plan some experiments.....

Hand in the report with all the answers as Assignment #6