Assignment 1: Text Searching in Web-based Databases
This assignment will give you a 'guided tour' of the NCBI Gene and Nucleotide databases, with a bit of UniProt for comparison at the end. It is very detail-oriented (apologies), but it will take you through the perils and pitfalls of text searching.

In this assignment you should prepare a file (Word or PDF) with the answers to the questions below.
This report can be written in any text editor or word processor.

In the file JUST give the answers - DO NOT print out the questions and write the answers in.

READ THROUGH THE WHOLE ASSIGNMENT BEFORE YOU START. It will help you understand the purpose of the questions!!!!

Add your name and e-mail address on top of the page!!!


The Israeli Organization for stxbp1 disorders

You've just found out that someone you know has a child who was diagnosed with an STXBP1 disorder. As the 'local expert', you're asked to find out as much about the gene as you can. It is known that the mutation is sporadic (not inherited) and that it causes seizures, developmental delay, and motor difficulties. The gene is involved in communication between neurons. You would like to make a mouse model of the disease, but first you have to find out which sequences are available in mouse.

We know that the gene is called STXBP1, so we'll start our search with that.

Enter the NCBI web site (this will open a new window): http://www.ncbi.nlm.nih.gov/

Enter the search term stxbp1 and search all databases

  1. How many hits are there in PubMed? Gene? Nucleotide? Protein?

That's a bit much for us to deal with. Since we want to build an animal model, we'll concentrate only on the mouse sequences. In the search window at the top of the page, type stxbp1 AND mouse

  2. How many hits are there now in the various databases (the same ones you looked at earlier)?

That's still a lot of sequences. Let's go into the Gene database to see what's going on.

  3. Which organisms are the entries from (open as a tree, give a general description, not an exact list)? Why did this happen?
  4. Click on the human STXBP1 entry. What is the status of this gene? What is the function of this gene?

Go back to the main search page (or you can click the 'All Databases' link from the pull down menu on the gray bar in the top part of the page). Now modify the search term to read stxbp1 AND mouse[organism]

  5. Now how many hits are there (for the same 4 databases)?
  6. Which database didn't change? Why?
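
For those who want to repeat these counts outside the browser, the same search terms can be sent to NCBI's E-utilities ESearch service. The following is only a sketch: the helper names are invented for this example, `nuccore` is assumed to be the E-utilities name of the Nucleotide database, and nothing is actually sent over the network here. It only shows how one term is turned into one request per database and how the hit count would be read from a JSON reply.

```python
import json
from urllib.parse import urlencode

# Base address of the ESearch service (part of NCBI E-utilities).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_url(db, term):
    """Build an ESearch URL for one database and one search term."""
    return ESEARCH + "?" + urlencode({"db": db, "term": term, "retmode": "json"})


def hit_count(response_text):
    """Read the number of hits from an ESearch JSON reply."""
    return int(json.loads(response_text)["esearchresult"]["count"])


# The same term, addressed to each of the four databases in this assignment.
# "nuccore" is assumed here to be the E-utilities name for Nucleotide.
urls = {
    db: esearch_url(db, "stxbp1 AND mouse[organism]")
    for db in ("pubmed", "gene", "nuccore", "protein")
}

for db, url in urls.items():
    print(db, url)
```

Opening one of the printed URLs in a browser should return a small JSON document; `hit_count` expects the text of that document. The web interface remains the reference for this assignment, so treat any difference between the two as something to investigate, not as an error in your answer.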

Go back to the Gene database.

  7. Are all the hits from mouse?

We still have a number of hits here, and we're only interested in one gene. Let's try to understand what happened in our search. Use only the results from Mus musculus.

  8. How many other genes (aside from ours) are there, and why did they show up (hint: if you're not sure, check out some of the other entries)?

Now let's look at our gene - click on the gene symbol of the correct gene.

  9. What are the various names of this gene?
  10. Which chromosome is the gene on? How many isoforms (RefSeq) does it have? What is the difference between the isoforms?
  11. For each isoform, what is the accession number of its mRNA RefSeq? Its protein RefSeq?
  12. What is the function of this protein? Where in the cell is it located?
  13. What domain or domains does it have (give name, accession number, and location of the domain in the protein)?
  14. How many mRNAs are linked to this gene? What are their accession numbers?

Now we'll open a new window, and look at the nucleotide hits. http://www.ncbi.nlm.nih.gov/

Run a nucleotide search with stxbp1 AND mus musculus[organism]. (hint: you can double check yourself to see if you ran it correctly if you get the same number you wrote down above....)

Change the viewing options so that you can see all of the mRNAs on one page (if necessary).

  15. How many sequences do you get? Which other gene(s) were found? Why?
  16. How many of the mRNAs in the Entrez Gene entry (question 14) were found here? Why?

Now we'll expand the search a bit. Try entering syntaxin binding protein 1 AND mus musculus[organism]

  17. Now what happened to the results, and why?

As we want to classify the results, we'll try something else: "syntaxin binding protein 1" AND mus musculus[organism]

  18. Now what happened to the results and why (compare to the previous searches)? How many of the sequences from question #14 do you have? Why not all of them?
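
One way to compare the three phrasings is to look at how Entrez itself expanded each query. The web page shows this in its search details, and an ESearch JSON reply carries it in a `querytranslation` field. The sketch below only extracts that field from a reply. The function name is invented for this example, and the sample reply is a placeholder, not real NCBI output.

```python
import json

# The three phrasings used so far in this assignment.
QUERIES = [
    'stxbp1 AND mus musculus[organism]',
    'syntaxin binding protein 1 AND mus musculus[organism]',
    '"syntaxin binding protein 1" AND mus musculus[organism]',
]


def query_translation(response_text):
    """Return Entrez's expansion of the query, or "" if the reply has none."""
    result = json.loads(response_text).get("esearchresult", {})
    return result.get("querytranslation", "")


# Placeholder reply, written by hand for illustration only.
sample_reply = '{"esearchresult": {"count": "0", "querytranslation": "PLACEHOLDER"}}'

print(query_translation(sample_reply))
```

Reading the three translations side by side is usually enough to explain why the hit counts differ.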

Look at the accession numbers:

  19. Classify all of the entries in the following categories (how many of each are there? Note: every sequence should be put into one group)
(if you're not sure what category something belongs to, open it up and read the annotation! The answer can be zero for a given category. Don't forget you can use the filters on the side to make this easier)
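
The accession prefix already tells you a lot about a nucleotide record. As a rough aid, the sketch below sorts accessions by their RefSeq prefix and counts each group. The group labels are this example's own and are not the categories of the question, the accessions in the usage lines are made up, and anything without a RefSeq prefix still has to be opened and read.

```python
from collections import Counter

# Standard RefSeq prefixes for nucleotide records.
REFSEQ_PREFIXES = {
    "NM_": "RefSeq mRNA",
    "NR_": "RefSeq non-coding RNA",
    "XM_": "RefSeq predicted (model) mRNA",
    "XR_": "RefSeq predicted (model) non-coding RNA",
    "NC_": "RefSeq chromosome or complete genomic molecule",
    "NG_": "RefSeq genomic region",
    "NT_": "RefSeq genomic contig or scaffold",
    "NW_": "RefSeq genomic contig or scaffold",
}


def classify(accession):
    """Return a coarse label based only on the accession prefix."""
    return REFSEQ_PREFIXES.get(accession[:3], "not RefSeq (read the annotation)")


def tally(accessions):
    """Count how many accessions fall under each label."""
    return Counter(classify(acc) for acc in accessions)


# Made-up accessions, used only to show the call.
example = ["NM_000000.1", "XM_000000.2", "AB000000.1"]
print(tally(example))
```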

Enter the mouse RefSeq mRNA entry of transcript variant 2 for our gene of interest:

  20. What is the accession number?
  21. What version of the sequence is this?
  22. What is the status of this RefSeq record?
  23. Which sequence(s) is the RefSeq based on?
  24. How long are the 5' UTR, coding region, and 3' UTR?
  25. How many references are there? What can we learn from them about our protein (give details)?
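
Two of the questions above come down to simple arithmetic on what the record shows. A minimal sketch, with made-up numbers: the version is the part after the dot in ACCESSION.VERSION, and the UTR lengths follow from the total sequence length and the CDS coordinates in the feature table (1-based, both ends included). The function names are invented for this example.

```python
def split_accession(acc_version):
    """Split 'ACCESSION.VERSION' into (accession, version as int or None)."""
    accession, _, version = acc_version.partition(".")
    return accession, int(version) if version else None


def region_lengths(seq_length, cds_start, cds_end):
    """Return (5' UTR, CDS, 3' UTR) lengths in nucleotides.

    cds_start and cds_end are 1-based and inclusive, as in a feature
    table line such as 'CDS 101..1900'. The CDS includes the stop codon.
    """
    five_prime = cds_start - 1
    cds = cds_end - cds_start + 1
    three_prime = seq_length - cds_end
    return five_prime, cds, three_prime


# Made-up values, used only to show the calls.
print(split_accession("NM_000000.3"))
print(region_lengths(2000, 101, 1900))
```

The three lengths should always add up to the total length of the record, which is a quick check on your answer.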

Go to the Revision History (or use any other method of your choice to answer these questions...there are several ways to do it.....)

  26. When did this sequence first appear in GenBank?
  27. What happened to the nucleotide length in the various versions? How did the sequence change?
  28. What happened to the protein length in the various versions?

Now we'll look at proteins, and compare RefSeq with UniProt (a bit).

Enter UniProt to search for protein information (this will open in a new window): http://www.uniprot.org

Go to the Text Search box on the top of the page and enter syntaxin binding protein 1 AND mouse[organism]

  29. How many hits were found in Swiss-Prot? TrEMBL?
  30. Which species do we have hits from in Swiss-Prot? What happened?

Go back up to the search bar and click on "advanced" (at the end of the bar). Type in stxbp1 in the top box, and in the second line, choose AND in the first pull down menu, organism from the second pull down menu and type in mouse. Throw away the others.
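
The advanced search form builds a fielded query for you. As a sketch of what such a query can look like when written by hand, the code below assembles one for the UniProt REST service. It assumes the field names `gene`, `organism_id`, and `reviewed` and the search address shown are valid for the current UniProt website. The helper names are invented for this example, and 10090 is the NCBI Taxonomy identifier for Mus musculus.

```python
from urllib.parse import urlencode

# Assumed address of the UniProtKB search service.
UNIPROT_SEARCH = "https://rest.uniprot.org/uniprotkb/search"
MOUSE_TAXON_ID = 10090  # NCBI Taxonomy ID for Mus musculus


def uniprot_query(gene, taxon_id, reviewed=None):
    """Combine field:value pairs with AND; reviewed=True means Swiss-Prot only."""
    parts = [f"gene:{gene}", f"organism_id:{taxon_id}"]
    if reviewed is not None:
        parts.append(f"reviewed:{str(reviewed).lower()}")
    return " AND ".join(f"({part})" for part in parts)


def uniprot_url(query):
    """Build a search URL that asks for JSON output."""
    return UNIPROT_SEARCH + "?" + urlencode({"query": query, "format": "json"})


query = uniprot_query("stxbp1", MOUSE_TAXON_ID)
print(query)
print(uniprot_url(query))
```

Note that the field syntax differs from the `[organism]` tags used at NCBI, so a query written for one site cannot simply be pasted into the other.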

Enter the mouse entry from Swiss-Prot for our gene of interest

  31. What is the entry name? Accession number?
  32. How long is the protein?
  33. What is the structure based on?
  34. What kind of posttranslational modifications does this protein have? Where are they located (if known)?
  35. How many isoforms does this protein have? What are the differences between them?
  36. Which isoform is equivalent to which NCBI isoform?
  37. What version of the sequence is this?
  38. What is the function of this gene?
  39. Compare the previous answer to your answers to questions #4 and #12. What are the differences and similarities (don't forget to discuss the ease of finding information, sources of information, species....)?

You now go back to the lab for years of research.....


Hand in the report with all the answers to Assignment #1.