Assignment #1

Assignment 1: Text Searching in Web-based Databases

In this assignment you should prepare a report with the answers to various questions.
This report can be written in any text editor or word processor.

In the report JUST give the answers - DO NOT print out the questions and write the answers in.

READ THROUGH THE WHOLE ASSIGNMENT BEFORE YOU START. It will help you understand the purpose of the questions!!!!

Add your name and e-mail address at the top of the page!!!

(Johann Sebastian Bach, aged 61, by EG Haussmann)

Welcome! You've just joined a lab looking for genes related to musical talent. One gene that caught the attention of the group is Bach1. Your job as the new person in the lab is to find out what you can about this protein, and report to the group.

We know that the gene is called bach1, so we'll start our search with that.

Enter the NCBI web site (this will open a new window): http://www.ncbi.nlm.nih.gov/

Enter the search term bach1 and search all databases.

How many hits are there in PubMed? Gene? Nucleotide? Protein?

That's a bit much for us to deal with, and we know that we want to concentrate on only the mouse sequences. In the search window at the top of the page, type bach1 AND mouse

How many hits are there now in the various databases (the same ones you looked at earlier)?

That's still a lot of sequences. Let's go into the Gene database to see what's going on.

Which organisms are the entries from (open as a tree, give a general description, not an exact list)? Why did this happen?
Click on the human BACH1 entry. What is the status of this gene? What is the function of this gene? What additional information do you see here that can partially explain the search results?

Go back to the main search page (you can also click the 'All Databases' link from the small black bar in the top part of the page). Now modify the search term to read bach1 AND mouse[organism]

Now how many hits are there (for the same 4 databases)?
Which database didn't change? Why?
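The three search terms used so far can also be compared side by side outside the browser. The following is only a sketch, assuming you want to use the NCBI E-utilities ESearch service rather than the web page; the helper name esearch_url is made up for this example, and the code only builds the request URLs without contacting NCBI. Opening one of the printed URLs in a browser should return a small XML document containing a Count value.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities ESearch service.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# Entrez database names for PubMed, Gene, Nucleotide and Protein.
DATABASES = ["pubmed", "gene", "nuccore", "protein"]

def esearch_url(db, term):
    """Build (but do not send) an ESearch URL asking only for the hit count.

    esearch_url is a hypothetical helper written for this handout.
    """
    params = {"db": db, "term": term, "rettype": "count"}
    return ESEARCH + "?" + urlencode(params)

# The three search terms from the assignment, from broadest to narrowest.
terms = ["bach1", "bach1 AND mouse", "bach1 AND mouse[organism]"]

for term in terms:
    for db in DATABASES:
        print(esearch_url(db, term))
```

Note that the [organism] tag is part of the search term itself: without it, the word mouse is searched in every text field of a record, not only in the organism field.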

Go back to the Gene database.

Are all the hits from mouse?

We still have a number of hits here, and we're only interested in one gene. Let's try to understand what happened in our search. Use only the results from Mus musculus.

How many genes have been called bach1? What are their gene symbols? Which is our gene of interest?
How many other genes (aside from the one in the previous question) are there, and why did they show up (hint: look inside the entries)?

Now let's look at our gene - click on the gene symbol of the correct gene.

What are the various names of this gene?
What is the accession number of its mRNA RefSeq? Of its protein RefSeq? Which chromosome is it on?
What is the function of this protein? Where in the cell is it located?
What domain or domains does it have (give name, accession number, and location of the domain in the protein)?
How many mRNAs are linked to this gene? What are their accession numbers?

Now we'll open a new window, and look at the nucleotide hits. http://www.ncbi.nlm.nih.gov/

Run a nucleotide search with bach1 AND mus musculus[organism]. (Hint: you ran it correctly if you get the same number of hits that you wrote down above.)

Change the viewing options so that you can see all of the mRNAs on one page. Now we are going to classify them.

Look at the accession numbers:

Classify all of the entries into the following categories. How many of each are there? (Note: every sequence should be put into exactly one group.)

RefSeq
MGC
mRNAs from high-throughput full length cloning projects
EST
Genomic sequence: WGS, GSS (not RefSeq)
regular mRNA/cDNA/DNA and others (whatever is left)

(if you're not sure what category something belongs to, open it up and read the annotation!)
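Only the first category can be recognized from the accession number alone: RefSeq accessions have a two-letter prefix followed by an underscore (for example NM_ for mRNA and NP_ for protein), while ordinary GenBank accessions have no underscore. A rough sketch of that check is shown here; the function names and the accessions NM_000000 and AB000000 are made-up placeholders used to show the format, not answers to the assignment. The other categories (MGC, EST, WGS, GSS and so on) cannot be decided this way and still require reading the annotation.

```python
import re

# Two capital letters, an underscore, then the rest of the accession,
# optionally followed by ".version" (e.g. NM_000000.2).
REFSEQ_PATTERN = re.compile(r"^[A-Z]{2}_[A-Z0-9]+(\.\d+)?$")

def is_refseq(accession):
    """Return True if the accession has the RefSeq 'XX_' prefix format."""
    return bool(REFSEQ_PATTERN.match(accession))

def split_accession(accession):
    """Split 'ACCESSION.VERSION' into (accession, version).

    The version is returned as an int, or None if no version is given.
    """
    base, _, version = accession.partition(".")
    return base, (int(version) if version else None)

# Placeholder accessions, for illustration of the format only.
for acc in ["NM_000000.3", "AB000000.1"]:
    base, version = split_accession(acc)
    group = "RefSeq" if is_refseq(acc) else "not RefSeq (read the annotation)"
    print(base, version, group)
```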

Now we're going to try a different type of classification on the same search results:

Which other genes were found? Why?
Are all of the mRNAs in the Entrez Gene entry (question 14) found here? Why or why not?

Enter the mouse RefSeq mRNA entry for our gene of interest:

What is the accession number?
What version of the sequence is this?
What is the status of this RefSeq record?
Which sequence(s) is the RefSeq based on?
How long are the 5'UTR, coding region, and 3'UTR?
How many references are there? What can we learn from them about our protein (give details)?
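The UTR and coding region lengths can be worked out from two pieces of information in the record: the total sequence length and the position of the CDS feature, which GenBank-style records give as start..end (1-based, both ends included, stop codon included). The sketch below shows the arithmetic; the function name and the numbers 3000, 201 and 2000 are invented for illustration and are not taken from the Bach1 record.

```python
def utr_lengths(seq_length, cds_start, cds_end):
    """Return (5'UTR, coding region, 3'UTR) lengths in nucleotides.

    cds_start and cds_end are 1-based inclusive coordinates,
    as written in a 'CDS start..end' feature line.
    """
    five_prime_utr = cds_start - 1          # everything before the CDS
    coding_region = cds_end - cds_start + 1 # both end positions count
    three_prime_utr = seq_length - cds_end  # everything after the CDS
    return five_prime_utr, coding_region, three_prime_utr

# Invented example: a 3000 nt mRNA with 'CDS 201..2000'.
five, coding, three = utr_lengths(3000, 201, 2000)
print(five, coding, three)  # 200 1800 1000
```

A quick sanity check is that the three lengths add up to the total sequence length, and that the coding region length is a multiple of 3.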

Go to the Revision History (or use any other method of your choice to answer these questions...there are several ways to do it.....)

When did this sequence first appear in GenBank?
What sequences are the previous sequence versions (do NOT go into all of the text versions!) of this RefSeq based on?
What happened to the nucleotide length in the various versions? How did the sequence change?
What happened to the protein length in the various versions?

Now we'll look at proteins, and compare RefSeq with UniProt (a bit).

Enter UniProt to search for protein information (this will open in a new window): http://www.uniprot.org

Go to the Text Search box on the top of the page and enter bach1 AND mouse

How many hits were found in Swiss-Prot? TrEMBL?
Compare this result (from Swiss-Prot only) to the result we got from the Entrez Gene search (questions 8 and 9). Don't forget to explain the surprises....
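For those who want to repeat the Swiss-Prot versus TrEMBL comparison outside the web page, the same text search can be written as a URL. This is a sketch under assumptions: it assumes the UniProt REST search service at rest.uniprot.org and its reviewed field (reviewed entries are Swiss-Prot, unreviewed entries are TrEMBL); the helper name uniprot_url is made up, and the code only builds the URLs without sending them.

```python
from urllib.parse import urlencode

# Assumed address of the UniProtKB search service.
UNIPROT_SEARCH = "https://rest.uniprot.org/uniprotkb/search"

def uniprot_url(query, reviewed=None):
    """Build (but do not send) a UniProtKB search URL.

    reviewed=True  restricts the search to Swiss-Prot entries,
    reviewed=False restricts it to TrEMBL entries,
    reviewed=None  searches both.
    uniprot_url is a hypothetical helper written for this handout.
    """
    if reviewed is not None:
        flag = "true" if reviewed else "false"
        query = f"({query}) AND reviewed:{flag}"
    return UNIPROT_SEARCH + "?" + urlencode({"query": query, "format": "tsv"})

print(uniprot_url("bach1 AND mouse", reviewed=True))   # Swiss-Prot only
print(uniprot_url("bach1 AND mouse", reviewed=False))  # TrEMBL only
```

As with the NCBI search, this is a plain text search, so an entry can match because the words appear anywhere in the record, not only in the gene name or organism.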

Enter the mouse entry from Swiss-Prot for our gene of interest

What is the entry name? accession number?
How long is the protein?
Which mRNA sequences is it linked to?
What kind of posttranslational modifications does this protein have? Where are they located (if known)?
How many isoforms does this protein have? What are the differences between them?
What version of the sequence is this?
How long did it take to get from TrEMBL to Swiss-Prot (in years)?
What is the function of this gene?
Compare the previous answer to your answers to questions #4 and #12. What are the differences and similarities (don't forget to discuss the ease of finding information, sources of information, species....)?

You now go back to the lab for years of research, and try to figure out the connection to music......

Hand in the report with all the answers.