Assignment #3

Pairwise Alignment

In this assignment you should prepare a report with the answers to various questions.
This report can be written in any text editor or word processor.

NOTE - Do NOT hand in the program output - just answer the questions!

Add your name and e-mail address on the front page!!!

Don't forget! Read through the assignment first, so you get a better picture of what it's all about!

In this assignment, we will be working with the Needle, Matcher and Water programs (from the EMBOSS package) to compare two sequences on the DNA and protein levels. We will also play with the parameters, to understand better how pairwise alignment works.

NOTE: The order of the sequences makes a difference! Make sure you are consistent whenever you copy and paste!

NOTE: The links in the assignment will open new windows, so if you need a new one for a given program or database, go back to where it was first given, and it will open another one for you.

(Emiliania huxleyi)

Your lab has been looking for the gene that causes the ocean smell. The reaction has been known for a while, and an enzyme (DMSP lyase) has been found in bacteria, but not in eukaryotes. The bacterial activity isn't enough to explain all of the DMS (the product of the reaction) found in the ocean. You suspect that the alga Emiliania huxleyi is a major producer, but it has no homolog of the bacterial gene. You perform biochemical fractionation of the cells in order to isolate the active fraction. After a lot of work, including transcriptomics and proteomics, the enzyme is found and named Alma1. (For more details see: https://pubmed.ncbi.nlm.nih.gov/26113722/)

The new question: is Emiliania huxleyi the only DMS producer in the ocean, or are there other eukaryotic species that have homologs?

You perform various database searches, and find an interesting candidate in a coral symbiont, Symbiodinium. Now we have to compare the sequences, and see how similar it really is.

Take the Emiliania huxleyi protein sequence (accession number: AKO62592.1) from NCBI in FASTA format. If you give it a simple name, like the species, it will be easier to tell things apart afterwards. You can change the name after you paste, or you can take the sequence without the header line and create your own. Make sure to keep a window open with each sequence; you will have to copy/paste them a few times.

The Symbiodinium sequence comes from the TSA (Transcriptome Shotgun Assembly) database of next-generation transcript assemblies, and the translation isn't in NCBI. You can get the translation here: symbio.pep

First of all, we'd like to see how similar these sequences are overall, so we'll run a global alignment program, Needle. We'll use the EBI website for this: NEEDLE.

Make sure the program is set to protein, and check the parameters (but use the defaults):

How similar are these sequences?
Over what length?
How many gaps are there (percentage)?
From where to where (give amino acid coordinates) are the hits (the aligned segments) in both sequences?
Are there any areas that you think are aligned better? From where to where (give approximate amino acid coordinates) in the two sequences?
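
To make the idea of a global alignment concrete, here is a minimal Python sketch of the Needleman-Wunsch dynamic programming that Needle is based on. It is only an illustration and will NOT reproduce Needle's output: it assumes a simple match/mismatch score and one linear gap penalty (the values are made up), whereas Needle uses a substitution matrix and separate gap open and gap extend penalties. The function name global_align and the toy sequences are our own.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-2):
    """Toy Needleman-Wunsch: align ALL of a against ALL of b.

    Assumed scoring (not Needle's): +1 match, -1 mismatch, -2 per gap position.
    Returns (score, aligned_a, aligned_b).
    """
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    # The first row and column are all gaps, so they are penalised.
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + s,  # align a[i] with b[j]
                              score[i - 1][j] + gap,    # gap in b
                              score[i][j - 1] + gap)    # gap in a
    # Trace back from the bottom-right corner to the top-left corner.
    i, j = n, m
    top, bottom = [], []
    while i > 0 or j > 0:
        s = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + s:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))


if __name__ == "__main__":
    total, row_a, row_b = global_align("ACGT", "AGT")
    print(total)
    print(row_a)
    print(row_b)
```

Note that the alignment always runs from the first to the last position of both sequences, even where they have little in common. Keep this in mind when you look at the ends of the Needle alignment.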

Now we will try the same sequences in a local alignment program, Water.

Open a window at EBI by clicking here

Make sure you choose protein. Paste in the sequences, and run using the default parameters (Open them so that you know what they are).

How many hits are there? (Note, if there is more than one hit, answer the following questions separately for each one!)
How similar are these sequences?
Over what length?
How many gaps are there (percentage)?
From where to where (give amino acid coordinates) are the hits (the aligned segments) in both sequences?
Are there any areas that you think are aligned better? From where to where (give approximate amino acid coordinates) in the two sequences?
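
For comparison, here is a minimal Python sketch of the Smith-Waterman local alignment idea that Water is based on. Again, this is an illustration under made-up assumptions (simple match/mismatch scores and a linear gap penalty instead of Water's substitution matrix and gap open/extend penalties), and the name local_align and the toy sequences are our own. The one change from a global alignment is that a cell score is never allowed to drop below zero, so the alignment can start and stop anywhere.

```python
def local_align(a, b, match=2, mismatch=-1, gap=-2):
    """Toy Smith-Waterman: find the best-scoring pair of SUBsequences.

    Assumed scoring (not Water's): +2 match, -1 mismatch, -2 per gap position.
    Returns (score, (start_a, end_a), (start_b, end_b)), 1-based inclusive.
    """
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(0,                          # start a new alignment here
                              score[i - 1][j - 1] + s,    # align a[i] with b[j]
                              score[i - 1][j] + gap,      # gap in b
                              score[i][j - 1] + gap)      # gap in a
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    # Trace back from the best cell until the score falls to zero.
    i, j = best_pos
    end_a, end_b = i, j
    while i > 0 and j > 0 and score[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + s:
            i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    return best, (i + 1, end_a), (j + 1, end_b)


if __name__ == "__main__":
    # Only the shared core "GATTACA" should be reported, not the flanks.
    print(local_align("TTTGATTACATTT", "CCGATTACACC"))
```

The returned coordinates are the same kind of "from where to where" information that you are asked to read off the Water output.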

Now we'll look at another local alignment program, Matcher. It is slightly different, and so are its parameters (and yes, this can affect the alignment, but we are more interested in another parameter at this point).

Open a Matcher window here

Make sure you choose protein, and in the parameters, set alternative matches to 3.

How many hits are there? (Note, if there is more than one hit, answer the following questions separately for each one!)
How similar are these sequences?
Over what length?
How many gaps are there (percentage)?
From where to where (give amino acid coordinates) are the hits (the aligned segments) in both sequences?
Compare your answer here to your answers to the last question in each of the previous two sections (Needle and Water). Did all the programs find the same best region?
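
If you want to check how the identity and gap percentages in the program output are derived, the Python sketch below counts them from two aligned rows. It assumes that both percentages are taken relative to the total alignment length (including gap columns), which is how we understand the EMBOSS reports; the function name alignment_stats and the toy alignment are our own. Similarity is not computed here, because it depends on the substitution matrix.

```python
def alignment_stats(top, bottom):
    """Count identities and gaps in a pairwise alignment given as two rows.

    Both rows must have the same length, with '-' marking a gap.
    Percentages are relative to the alignment length (an assumption).
    """
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have the same length")
    length = len(top)
    identical = sum(1 for x, y in zip(top, bottom) if x == y and x != "-")
    gaps = sum(1 for x, y in zip(top, bottom) if x == "-" or y == "-")
    return {
        "length": length,
        "identical": identical,
        "gaps": gaps,
        "identity_pct": 100.0 * identical / length if length else 0.0,
        "gap_pct": 100.0 * gaps / length if length else 0.0,
    }


if __name__ == "__main__":
    stats = alignment_stats("ACGT-ACGT", "ACGTTAC-T")
    print(stats["identical"], "identical and", stats["gaps"],
          "gap columns out of", stats["length"])
```

Remember that you should still report the values from the program output in your answers, not values computed by a script like this.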

Now we'll start changing the parameters.

Go back to Matcher, and try to get the parameters as close as possible to the others (Gap open of 10, gap extend of 1).

What happened to the alignments? Describe based on the types of information you had to collect before.

Now we'll really play with the parameters, using Water.

Start by trying to make the parameters more stringent, like the Matcher default: Gap open of 15, gap extend of 5.

What happened to the alignment? How do the two Water runs compare with the two Matcher runs?

Let's go even higher and lower: run Water twice more, once with gap open of 50 and gap extend of 5, and once with gap open of 1 and gap extend of 0.1.

What happens now? Which direction has a greater effect? Why?
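
To see what the gap open and gap extend values mean in terms of score, the Python sketch below computes the penalty for a single gap under an affine gap model. It assumes the convention that the first gap position costs the gap open value and every further position costs the gap extend value; we believe this matches EMBOSS, but check the program documentation if the exact numbers matter. The function name gap_penalty is our own.

```python
def gap_penalty(length, gap_open, gap_extend):
    """Affine penalty for ONE gap of the given length.

    Assumed convention: open + (length - 1) * extend.
    """
    if length < 1:
        return 0.0
    return gap_open + (length - 1) * gap_extend


if __name__ == "__main__":
    # The four (gap open, gap extend) settings used in this assignment.
    settings = [(10, 1), (15, 5), (50, 5), (1, 0.1)]
    for gap_open, gap_extend in settings:
        costs = [gap_penalty(n, gap_open, gap_extend) for n in (1, 5, 10)]
        print("open", gap_open, "extend", gap_extend,
              "-> cost of a gap of length 1, 5, 10:", costs)
```

Compare these costs with the score a single matched pair of residues can contribute, and think about when the program will prefer opening a gap over accepting mismatches or ending the alignment.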

Now we'll look at the DNA sequences. Both are available on NCBI (accession numbers: KR703620.1 and GAKY01006775.1).

We'll use all three programs with the default parameters. Make sure the programs are set to DNA!

The standard questions for each program...

How identical are these sequences?
How similar are they?
How long are the alignments?
Describe the alignments. Are there regions which are a 'stronger' match? Do these areas match from program to program? Is this similar or different from the protein alignments with the default parameters?
How can you improve the alignment (and what would you consider an improvement)?

Now we'll compare the various alignments, DNA and protein, across the various programs:

In your opinion, are these proteins related? (Yes or No isn't enough, explain!)
Explain the advantages and disadvantages of a global vs. a local search, and of DNA vs. protein alignments.

(for anyone interested in the real answer, you can read the paper mentioned in the beginning....)