Experiment 5: Bioinformatics

Description

You are going to investigate a gene found in humans: its function, its structure, and its relationship to other genes in humans and in other species. You will be using various online tools from the National Center for Biotechnology Information (NCBI) and the European Bioinformatics Institute (EBI). A link to the NCBI website is given below.

https://www.ncbi.nlm.nih.gov

Inside the database, the gene you will be looking for is: o51e1_human

Bioinformatics/genomics Lab for BIOS 308 01/2016

This information sheet will guide you through the process. I am going to work through this exercise with a specific gene (acm1_human) so you can see what must be done. Each student will be assigned a unique human gene. I want you to complete the answer sheet, replacing the sample answers and screenshots with your own data.

The genes we will investigate belong to a superfamily called G-protein coupled receptors (GPCRs). Why are they interesting? Over 40% of drugs on the market target GPCRs. Mutations and misregulation of GPCRs can cause various diseases, including cancers.

The bioinformatics database GPCRDB has compiled a list of GPCR protein identification (ID) and accession (AC) numbers (http://files.gpcrdb.org/uniprot_mapping.txt). These identifiers can be searched at both NCBI and EBI. We will focus on GPCRs from humans.

Every protein has a unique ID and AC. The first identifier is the AC, which is for computer use only. The second identifier is the ID, such as acm1_human. The first part, acm1, indicates function. The second part, human, indicates species.

Q1: What is your protein’s ID and AC?
A1: ID is acm1_human and AC is P11229

Part I: Using the NCBI database tool

We are going to search for acm1_human at the NCBI website. Go to http://www.ncbi.nlm.nih.gov/, select Protein from the pull-down list, copy/paste acm1_human into the search box, and hit Search.

You will see a page in GenPept format, which provides a lot of information about the searched protein. The more general name for GenPept format is GenBank format, which defines what information should be in which field. See http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html for details.

Q2: What is the name of your protein?
A2: Muscarinic acetylcholine receptor M1

Answer the following questions from the GenPept page.

Q1: What are your protein’s ID and AC?

Q3: What is the protein length?
A3: 460 aa

The ID is also known as the LOCUS ID at NCBI. One protein can have other identifiers, such as VERSION and GI. They are all unique.

A protein is encoded by a gene, and the Gene page shows all the information about that gene. On the GenPept page, scroll to “Gene” in the right-hand menu. Answer the following questions from the Gene page, scrolling to the required sections as needed.

Q4: What is the Gene name and what are the synonyms?
A4: Gene name is CHRM1 and synonyms include M1, HM1 and M1R

Why are there synonyms? One gene may be studied by many different labs and described in different research papers, each of which may give it a different name.

Scroll to Genomic context.

What is a cytogenetic location? See http://ghr.nlm.nih.gov/handbook/howgeneswork/genelocation

Q5: What is the cytogenetic location of the gene?
A5: 11q13

Q6: What is the chromosomal location of the gene?
A6: NC_000011.10 (62908679..62921861, complement)

Q7: On what strand is the gene located?
A7: Negative strand

DNA has two strands, but when a sequence is deposited into the database, only the sequence of one strand is given. “Complement” means the gene is located on the negative strand (the reverse strand of the DNA given in the database).

Q8: What are the Gene names of the immediate upstream and downstream genes?
A8: Upstream LOC105369333 and downstream SLC3A2. Note that the gene is on the negative strand.

Scroll to Genomic regions, transcripts, and products.

Q9: In NCBI annotation, how many known transcripts does the gene have?
Q10: How many exons does the gene have?

• The answers to these two questions change with time, as new evidence may overturn older answers, and different databases may give different answers.
• Here we see two database answers, one from NCBI annotation, the other from Ensembl annotation (a database from Europe’s EBI).
For the human genome, Ensembl is considered to have better annotation. We will look at only the NCBI annotation here.

This window shows the gene structure annotation; a link on the page opens it in a separate window.

How to understand the genome browser:

• Zoomed-in region: where the transcripts are located on the chromosome
• Chromosome scale
• Gene regions (green dots)
• Transcripts: green indicates a protein-coding gene. Coding regions (exons) are in darker green, and untranslated regions are in light green.

NCBI annotation: two transcripts are shown. The 1st transcript (mRNA ID: XM_011544742.1, protein ID: XP_011543044.1) has 2 exons, and the 2nd transcript (mRNA: NM_000738.2, protein: NP_000729.2) also has 2 exons. The first exon of the 2nd transcript does not overlap with any exon of the 1st transcript, while the second exons of the two transcripts overlap. Total exons: 3

Click on one of the NCBI transcripts to expand it. The following information is color-coded:

• Black box: expanded view of NCBI annotation
• Green: gene
• Blue: mRNA transcript
• Red: protein product

Q9: In NCBI annotation, how many known transcripts does the gene have?
A9: There are two transcripts (XM_011544742.1 and NM_000738.2).

Q10: In NCBI, how many exons does the gene have? Include a screenprint of the transcripts and use arrows to indicate the exons. (Dark green may not be evident for small exons; only the light green may be shown.)
A10: Three exons

Q11: How many PubMed publications are associated with this gene?
A11: 111

Q12: What is the citation info of the earliest publication?

Come back to the Gene page and scroll to Bibliography: there are a total of 108 publications sorted in chronological order. To find the earliest one, click on the link.

After the human genome was sequenced in 2000, determining the functions of all the genes has required much more effort from biologists, in the form of research publications.

The link opens the PubMed page.
Click on Sort by, select Publication date, then click on Last. Scroll to the end of the list; the earliest paper was published in Science in 1987.

Q12: What is the citation info of the earliest publication?
A12: Bonner et al., Science. 1987 Jul 31;237(4814):527-32

Q13: What disease/phenotype information is available for this gene?

Come back to the Gene page, scroll to the Phenotypes section, and click on the link.

eQTL (expression quantitative trait loci) analysis is a statistical tool that links phenotype (e.g., disease) to genotype (a gene locus carrying mutations, e.g., point mutations/SNPs and structural variations).

On this page, scroll to the Association Results table. Mutations occurring in the intergenic regions of CHRM1 (our gene) are associated with the following traits (diseases). Click on a trait to open a pop-up window that explains it.

Q13: What disease/phenotype information is available for this gene?
A13: Myocardial Infarction, Heart Failure, Stroke

Such eQTL analysis is based on large-scale population genomic data, i.e., the genome sequences of thousands of human individuals, in which mutations are identified and then linked to the health information of these individuals (i.e., association analysis).

Q14: What sequence variation types are known for this gene?
Q15: How many SNPs are known for this gene?

Go back to the Gene page and scroll to the Variation section. Click on the links: dbVar for structural variations and SNP Geneview Report for point mutations.

This is the dbVar page. dbVar is NCBI’s database of genomic structural variations. The variation type is in the 2nd column. For the count of a given type, see the left-hand menu, which can be expanded to show the full list.

Q14: What sequence variation types are known for this gene?
A14: There are 27 copy number variations, 10 inversions, 5 insertions, 3 translocations and 1 tandem duplication known for this gene.

Return to the Gene page and scroll again to the Variation section.
Select See SNP Geneview Report.

This is the dbSNP page. dbSNP is NCBI’s database of point mutations. This gene has two mRNA transcripts according to NCBI’s annotation. This page shows the first one (XM_011544742.1); a link on the page shows the second one. There are 203 SNPs (point mutations) found in the coding region of XM_011544742.1. Green regions show synonymous mutations, while red regions show other mutations.

Q15: How many SNPs are known for this gene?
A15: Transcript XM_011544742.1 has 203 SNPs in the coding region and transcript NM_000738.2 also has 203 SNPs in the coding region.

Go back to the Gene page and scroll to the Interactions section to find the “Other Gene” column. Like people, different proteins may work together in protein complexes or interact with each other in regulatory or metabolic pathways.

Q16: What other genes work/interact with our gene?
A16: GPRASP1, GPRASP2, GNAI2, CDC14A and Dlg4

Q17: What is the best hit protein in mouse?

Go back to the GenPept page and scroll to Analyze this sequence in the right-hand menu. Click Run BLAST.

In the BLAST page, the query protein AC is automatically filled in the top box. Scroll to Choose Search Set and type “Mus musculus” as the organism. A drop-down box will populate with different taxid selections. Select Mus musculus (taxid:10090) to search against all mouse proteins in the nr database. Then go to the bottom of the page and hit the BLAST button.

In the BLAST result page, scroll to the Descriptions section. The “best” hit protein in mouse is the first one on the list, the one most similar to our query protein: NP_031724.2, which is 99% identical to the query. Clicking on the description takes you to the alignment.

In the alignment, Query is the human protein and subject (Sbjct) is the mouse hit protein.

Q17: What is the best hit protein in mouse?
A17: The best mouse hit protein is NP_031724.2, which is 99% identical to the query.
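The percent identity that BLAST reports is the share of alignment columns in which Query and Sbjct carry the same residue. The sketch below shows that arithmetic in Python, assuming the two aligned strings have been copied from the alignment view with gaps written as "-". The function name and the toy sequences are hypothetical and are not taken from the CHRM1 alignment.

```python
def percent_identity(query_aln: str, sbjct_aln: str) -> float:
    """Percent of alignment columns where both sequences have the same residue.

    Columns containing a gap ("-") count toward the alignment length
    but never count as a match.
    """
    if len(query_aln) != len(sbjct_aln):
        raise ValueError("aligned sequences must have the same length")
    if not query_aln:
        raise ValueError("alignment is empty")
    matches = sum(
        1 for q, s in zip(query_aln, sbjct_aln)
        if q == s and q != "-"
    )
    return 100.0 * matches / len(query_aln)


# Toy alignment (invented for illustration): 9 of 10 columns match.
query = "MNTSAPPAVS"
sbjct = "MNTSVPPAVS"
print(f"{percent_identity(query, sbjct):.1f}% identical")
```

For a real hit, the Identities figure on the BLAST alignment page remains the value to record on the answer sheet; the sketch only illustrates where that number comes from.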
Q18: What are the homologous proteins in other species?

Go to the Gene page again and scroll to the General gene information section. Click on the first link under Homology.

HomoloGene is NCBI’s database of homologous proteins. Proteins of different organisms are grouped based on BLAST search results. In the HomoloGene column we see the pre-computed homologs of the query gene in the HomoloGene database.

Scroll to Protein Alignments and click on Multiple Alignments. You will see the precomputed multiple sequence alignments of these homologous proteins in numerous species.

Q18: What are the homologous proteins in other species?
A18:
• XP_508508.2 (CHRM1) in P. troglodytes
• NP_001028117.1 (CHRM1) in M. mulatta
• XP_005631654.1 (CHRM1) in C. lupus
• NP_001231538.1 (CHRM1) in B. taurus
• NP_001106167.1 (Chrm1) in M. musculus
• NP_542951.1 (Chrm1) in R. norvegicus
• NP_726440.1 (mAcR-60C) in D. melanogaster
• XP_314486.1 (GPRMAC1) in A. gambiae
• NP_001024236.1 (gar-3) in C. elegans
• XP_004913718.1 (LOC100490434) in X. tropicalis

Part II: Using the Ensembl database tool

In addition to NCBI, a gene can be described in other websites/databases. The See related links provide access to these external resources. Go back to the Gene page and click on the Ensembl link under the Summary section or, if it is not under Summary, scroll to Links to other resources in the right-hand menu.

This is the Ensembl page of the CHRM1 gene. The bottom of the page shows the Ensembl annotated transcripts for the gene. Note that these transcripts are marked protein-coding. These are the mRNA transcripts. As we saw earlier in the NCBI diagram, there were 2 mRNAs with a total of three non-overlapping exons.

This is the same diagram as on the former page, but here I point to a non-coding RNA gene, called an antisense gene, that overlaps with our protein-coding transcripts. Note that one of its exons does not overlap with the four protein-coding exons.
This antisense gene is transcribed along with the three protein-coding transcripts. What is its purpose? It regulates protein-coding gene expression. A number of other features may also be transcribed alongside the protein-coding gene, including additional genes, pseudogenes, and miRNA.

Q19: In Ensembl annotation, how many known transcripts does the gene have?
A19: There are three transcripts (ENST00000306960, ENST00000543973 and ENST00000536524).

Q20: In Ensembl, how many exons does the gene have?
A20: There are four exons.

Return to the top of the page and select Sequence from the left-hand menu to answer the next question:

Q21: What are the intron sequences (if there are any)?

You must determine which sequences belong to the protein-coding transcripts of your gene (the exons and the introns) and ignore the exons of any additional genes or non-coding RNA segments.

Code format:
• Exons: dark red letters shaded pink
• Introns: gray letters
• Non-coding RNA exon: gray letters shaded pink

[Figure: sequence view showing only the exon regions (shaded areas), labeled Exon 1, Exon 2, Noncoding RNA exon, Exon 3 and Exon 4.]

The gene is on the reverse strand, so we have to work backwards.

Introns are the regions between two adjacent exons. There are four exons, so there are three introns in between. We should ignore the exon from the non-coding RNA gene, as it is on a different strand and belongs to a different gene. For your own genes you will also ignore any transcripts other than the protein-coding transcripts.

Copy the region between exon 1 and exon 2 to get the intron 1 sequence.

Q21: What are the intron sequences (if there are any)?
A21: For intron sequences, paste the intron 1 sequence here, and repeat for all introns. Format:
Intron 1: sequence
Intron 2: sequence
…

Splicing junctions are the nucleotide bases where splicing takes place (the exon-intron boundaries). Each intron has two ends (5’ and 3’); the two bases at each end define a splicing junction.
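The intron and junction steps can also be sketched in a few lines of Python. This is only an illustration under stated assumptions: the gene sequence is taken to be already oriented 5' to 3' along the gene's own strand, exon positions are given as 0-based, end-exclusive coordinates, and the toy sequence, coordinates, and function names are invented rather than taken from CHRM1 or Ensembl.

```python
def introns_from_exons(gene_seq: str, exons: list[tuple[int, int]]) -> list[str]:
    """Return the sequences lying between consecutive exons.

    exons: (start, end) pairs, 0-based, end-exclusive, sorted 5' to 3'.
    N exons give N - 1 introns.
    """
    introns = []
    for (_, prev_end), (next_start, _) in zip(exons, exons[1:]):
        introns.append(gene_seq[prev_end:next_start])
    return introns


def splice_junctions(intron: str) -> tuple[str, str]:
    """Return the first two and last two bases of an intron (5' and 3' junctions)."""
    return intron[:2].upper(), intron[-2:].upper()


# Toy gene (invented): exons in upper case, introns in lower case.
gene = "ATGGCC" + "gtaagtttcag" + "GGTACC" + "gtgagaccag" + "TTCTGA"
exons = [(0, 6), (17, 23), (33, 39)]

for number, intron in enumerate(introns_from_exons(gene, exons), start=1):
    five_prime, three_prime = splice_junctions(intron)
    print(f"Intron {number}: 5'{five_prime} and {three_prime}3'  ({intron})")
```

With the toy data, both introns start with GT and end with AG, which is the pattern shown in the sample answer format. For the assignment itself, the intron sequences still have to be copied from the Ensembl Sequence page.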
[Figure: Exon 1, the intron 1 junctions, and Exon 2 marked on the sequence.]

Q22: What are the splicing junctions?
A22:
Intron 1: 5’GT and AG3’
Intron 2: …
…

Q23: In what body parts is the gene expressed?

At the Ensembl page, click on Gene expression.

In the Ensembl expression page, we see the expression profile of our gene in different experiments across all major body parts. Here we only look at the 53 GTEx experiment. If 53 GTEx isn’t available, use whatever experiment is listed. Mouse over the chart to see expression values. A link on the page downloads a tsv file. (It can be opened in Excel on some computers; on others, only with Notepad.)

[Figure labels: Male, female and brain]

Large-scale whole-genome gene expression experiments have been done by many different labs around the world; the data are deposited in online gene expression databases and analyzed and presented on the web by NCBI and EBI.

Q23: In what body parts is the gene expressed?
A23: According to the 53 GTEx experiment, the Tissue and Value expression of this gene is listed below. Assemble your answer from the Excel file, or manually create your list, using the information from the Ensembl chart.
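If the tsv file will not open in Excel, the Tissue and Value list for A23 can be assembled with a short Python script instead. The sketch below assumes a simple two-column, tab-separated layout with a header row named Tissue and Value. The real download may contain extra header lines or different column names, and the numbers shown are placeholders, not GTEx data.

```python
import csv
import io

# Placeholder content standing in for the downloaded file (values are invented).
SAMPLE_TSV = "Tissue\tValue\ncerebral cortex\t12.0\nliver\t0.0\nheart\t0.5\n"


def expressed_tissues(tsv_text: str, threshold: float = 0.0) -> list[tuple[str, float]]:
    """Return (tissue, value) pairs with value above threshold, highest first."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    rows = [(row["Tissue"], float(row["Value"])) for row in reader]
    kept = [pair for pair in rows if pair[1] > threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


for tissue, value in expressed_tissues(SAMPLE_TSV):
    print(f"{tissue}\t{value}")
```

To use a downloaded file, read its text with open(path, newline="") and pass the result to expressed_tissues, adjusting the column names to match the actual header.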