Author: Benison P. Zerrudo
November 28, 2021
BIOL 480 – Bioinformatics (CSUSM) with Prof. Sujal Phadke, Ph.D
Homologous proteins can be detected by sequence similarity searching, most commonly with NCBI BLAST (Pearson, 2013). One goal of similarity searching is to identify sequences that share a common ancestor. Phylogenetic studies, in turn, explore the evolutionary relationships among organisms; they are used to discover new species, reconstruct evolutionary history, infer common ancestors, and study species distribution. However, finding homologous proteins and performing phylogenetic analysis can be an intimidating and complex process that requires significant expertise and experience (Hall, 2013). The Molecular Evolutionary Genetics Analysis (MEGA-X) software is a tool used to compare gene sequences from different species to determine evolutionary relationships and patterns of DNA/protein evolution (Kumar et al., 2008). MEGA-X also includes tools for visualizing results as phylogenetic trees and evolutionary distance matrices. Here, we used MEGA-X to align nucleotide and protein sequences and to determine phylogenetic relationships between different species. We also used NCBI BLAST to identify the homologous proteins of the MEGA-X aligned protein sequences and to determine the species name for a set of gene sequences.
We retrieved three gene sequences for each of the 5 species from the NCBI nucleotide database using the following accession numbers: Gene 1: AY582799, XM_042480808, NM_001101683, NM_031144, BC138614; Gene 2: BC047869, XM_042465885, NM_001361490, J05425, AH003587; Gene 3: SU13680, U28410, M22585, U07177, X04752. The start codon of each gene sequence was identified from the coding sequence (CDS) start position listed in the sequence record of the NCBI database. Prior to multiple sequence alignment, we manually located the start codon in each gene sequence and removed the nucleotides upstream of it using the MEGA-X alignment explorer. For each gene, multiple sequence alignment of the 5 species was performed using ClustalW, run as a plugin in MEGA-X, with a gap opening penalty of 3.00 and a gap extension penalty of 0.50 for genes 1 and 3 and 0.20 for gene 2. Gene trees were constructed from all 3 multiple nucleotide sequence alignments using the MEGA-X phylogeny analysis tool (type = maximum likelihood tree, test of phylogeny = bootstrap method, number of bootstrap replications = 100, model = general time reversible). The three genes of each species were manually concatenated in a text editor to produce a longer sequence, which we treated as a stand-in for the entire genome. Multiple sequence alignment of the 5 assumed genomes was performed using ClustalW (gap opening penalty = 3.00, gap extension penalty = 0.50). A species tree was constructed from the genome multiple sequence alignment using the MEGA-X phylogeny analysis tool (type = maximum likelihood tree, test of phylogeny = bootstrap method, number of bootstrap replications = 100, model = general time reversible).
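The trimming and concatenation steps were done by hand in MEGA-X and a text editor. As an illustration only, the same two operations could be scripted roughly as follows. The function names and the toy sequences are hypothetical and are not the data used in this report; the check for an ATG start is an added assumption that would need relaxing for records with non-standard start codons.

```python
def trim_to_cds(sequence, cds_start):
    """Drop the nucleotides upstream of the start codon.

    cds_start is 1-based, matching the CDS start position shown in an
    NCBI nucleotide record.
    """
    if cds_start < 1 or cds_start > len(sequence):
        raise ValueError("cds_start is outside the sequence")
    trimmed = sequence[cds_start - 1:].upper()
    # Assumption: the coding sequence begins with the standard ATG codon.
    if not trimmed.startswith("ATG"):
        raise ValueError("trimmed sequence does not begin with ATG")
    return trimmed


def concatenate_genes(trimmed_genes):
    """Join the trimmed genes of one species into one longer sequence."""
    return "".join(trimmed_genes)


# Toy records, made up for illustration: (raw sequence, 1-based CDS start).
toy_records = [
    ("ccgagATGGATGACtaa", 6),
    ("ATGTTGGCTtag", 1),
    ("gATGGCAACTtga", 2),
]

pseudo_genome = concatenate_genes(
    trim_to_cds(seq, start) for seq, start in toy_records
)
print(pseudo_genome)
```

Scripting the step mainly helps with reproducibility: the same CDS start positions always yield the same trimmed input for alignment.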
Prior to protein sequence alignment, the trimmed gene sequences (with the nucleotides upstream of the start codon removed) were imported into the MEGA-X alignment explorer. For gene 1, the 5 species-specific genes were translated into protein sequences and aligned using ClustalW (gap opening penalty = 3.00, gap extension penalty = 0.50). For both gene 2 and gene 3, the 5 species-specific genes were translated into protein sequences and aligned using ClustalW (gap opening penalty = 3.00, gap extension penalty = 0.20). Gene trees were constructed from all 3 multiple protein sequence alignments using the MEGA-X phylogeny analysis tool (type = maximum likelihood tree, test of phylogeny = bootstrap method, number of bootstrap replications = 100, model = JTT). All 3 multiple protein sequence alignments were exported as FASTA files, and a consensus protein sequence was produced from each file using EMBOSS cons (default settings). We submitted each consensus protein sequence as a query to NCBI BLASTP to determine the homologous protein for each of the 3 original gene sets. We then searched for the homologous protein ID in the UniProt database to determine the functional role of the protein. The 5 assumed genomes built from concatenated genes were used as queries in NCBI BLASTN to identify the species names.
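To make the idea of a consensus sequence concrete, the sketch below derives one from an aligned set of protein sequences using a plain majority rule. This is not the algorithm used by EMBOSS cons, which applies a more elaborate scoring scheme; the function name, the use of "X" for unresolved positions, and the toy alignment are assumptions made for illustration.

```python
from collections import Counter


def majority_consensus(aligned, gap="-", unknown="X"):
    """Return a simple majority-rule consensus of aligned sequences.

    A residue is kept only if it occurs in more than half of all
    sequences at that column; otherwise `unknown` is emitted.
    Columns that contain only gaps are skipped.
    """
    if len({len(seq) for seq in aligned}) != 1:
        raise ValueError("aligned sequences must all have the same length")
    consensus = []
    for column in zip(*aligned):
        counts = Counter(res for res in column if res != gap)
        if not counts:
            continue  # all-gap column contributes nothing
        residue, top = counts.most_common(1)[0]
        consensus.append(residue if top > len(aligned) / 2 else unknown)
    return "".join(consensus)


# Toy alignment of 5 sequences (made up, not the report's data).
toy_alignment = ["MDDDI", "MDEDI", "MDDD-", "MEDDI", "MDDDI"]
print(majority_consensus(toy_alignment))  # prints MDDDI
```

A consensus built this way keeps only positions on which most sequences agree, which is why it works reasonably well as a single BLASTP query standing in for the whole gene set.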
The maximum likelihood tree for the gene 1 nucleotide sequences shows bootstrap support of 100% for two clades, M. musculus/R. norvegicus and O. cuniculus/H. sapiens, with S. undulatus being the most distant from the other 4 species (Figure 1). For the gene 1 protein sequences, bootstrap support of 82% was observed for the clade R. norvegicus/S. undulatus and M. musculus, with H. sapiens being the most distant from the other 4 species (Figure 2). The maximum likelihood trees for both the gene 2 nucleotide and protein sequences show bootstrap support of 100% for the clade O. cuniculus/M. musculus (Figures 3 and 4). For the gene 3 nucleotide sequences, bootstrap support for both clades R. norvegicus/M. musculus and S. undulatus/O. cuniculus is 100% (Figure 5), while for the gene 3 protein sequences, bootstrap support of 100% was observed only for the clade R. norvegicus/M. musculus (Figure 6). The maximum likelihood tree for the 5 assumed genomes showed that H. sapiens is most closely related to O. cuniculus, with bootstrap support of 100% (Figure 7).
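For readers unfamiliar with how a bootstrap support value arises, the sketch below shows the two bookkeeping steps around tree building: resampling alignment columns with replacement to create a pseudo-replicate, and counting how many replicate trees contain a given clade. The tree inference itself, which MEGA-X performs internally, is deliberately left out. All names and the toy data are hypothetical.

```python
import random


def resample_columns(alignment, rng):
    """Build one bootstrap pseudo-replicate of an alignment.

    alignment maps taxon name -> aligned sequence (all equal length).
    Columns are drawn with replacement, so the replicate has the same
    length as the original alignment.
    """
    names = list(alignment)
    length = len(alignment[names[0]])
    picks = [rng.randrange(length) for _ in range(length)]
    return {
        name: "".join(alignment[name][i] for i in picks) for name in names
    }


def clade_support(clade, replicate_clades):
    """Percent of replicate trees that contain the given clade.

    replicate_clades holds one set of clades per replicate tree, each
    clade being a frozenset of taxon names.
    """
    target = frozenset(clade)
    hits = sum(1 for clades in replicate_clades if target in clades)
    return 100.0 * hits / len(replicate_clades)


# Toy example: the clade {Mm, Rn} appears in 3 of 4 replicate trees.
toy_replicates = [
    {frozenset({"Mm", "Rn"})},
    {frozenset({"Mm", "Hs"})},
    {frozenset({"Rn", "Mm"})},
    {frozenset({"Mm", "Rn"})},
]
print(clade_support({"Mm", "Rn"}, toy_replicates))  # prints 75.0
```

With 100 replicates, as used here, a support value of 82% means the clade was recovered in 82 of the 100 resampled alignments.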
A consensus protein sequence was produced for each gene set and searched using NCBI BLASTP. The gene 1 consensus protein sequence points strongly to the ACTB protein, which polymerizes to form the network of filaments in the cytoplasm of a cell (Figure 8). The gene 2 consensus protein sequence points strongly to cytochrome c oxidase subunit 4 isoform 1, which takes part in oxidative phosphorylation in mitochondria (Figure 9). The gene 3 consensus protein sequence points strongly to L-lactate dehydrogenase, which catalyzes the interconversion of lactate and pyruvate (Figure 10). A summary of this analysis is shown in Table 1.
The BLASTN results for each assumed genome built from concatenated genes are as follows, where QC is query coverage and PI is percent identity: species 1: Homo sapiens (QC = 57%, PI = 100%); species 2: Sceloporus undulatus (QC = 41%, PI = 100%); species 3: Oryctolagus cuniculus (QC = 43%, PI = 100%); species 4: Rattus norvegicus (QC = 69%, PI = 100%); species 5: Mus musculus (QC = 55%, PI = 100%). See Figures 11 to 15 for the BLASTN result pages.
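The combination of partial query coverage with 100% identity is what one would expect if each top hit matches only one of the three concatenated genes. The sketch below illustrates how the two statistics are conventionally computed; it is a simplified reading of the BLAST output columns, and the function names and numbers are illustrative assumptions, not values taken from the result pages.

```python
def query_coverage(hsp_ranges, query_length):
    """Percent of query positions covered by at least one aligned segment.

    hsp_ranges is a list of (start, end) query coordinates, 1-based and
    inclusive. Overlapping segments are counted only once.
    """
    covered = 0
    current_end = 0
    for start, end in sorted(hsp_ranges):
        start = max(start, current_end + 1)  # skip already-counted bases
        if end >= start:
            covered += end - start + 1
            current_end = end
    return 100.0 * covered / query_length


def percent_identity(identical_positions, alignment_length):
    """Percent of aligned positions that are identical."""
    return 100.0 * identical_positions / alignment_length


# Hypothetical query: three 1000-nt genes concatenated into 3000 nt.
# A database record containing only the first gene aligns perfectly to
# positions 1-1000 and to nothing else.
print(round(query_coverage([(1, 1000)], 3000), 1))  # prints 33.3
print(percent_identity(1000, 1000))                 # prints 100.0
```

Under this reading, query coverage mostly reflects how much of the concatenated query a single database record spans, while percent identity reflects how well the matched portion agrees.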
We determined the homologous protein for each gene set after performing multiple sequence alignment on the protein sequences, and the assignments appear to be accurate. According to BLASTP, the gene 1 set points to the actin beta protein, the gene 2 set points to cytochrome c oxidase subunit 4 isoform 1, and the gene 3 set points to L-lactate dehydrogenase, either the A or the C chain. The BLASTP result for the gene 2 set is of particular interest: the top 12 results show only 12 percent query coverage, yet all of these significant alignments point to a single homologous protein. We speculate that the gene sequence for cytochrome c oxidase subunit 4 isoform 1 contains the most conserved region of all the gene sequences. All BLASTP results suggest that the three gene sequences we used are highly conserved across different species.
Using the assumed genomes built from concatenated genes, each species was identified by BLASTN with high percent identity. Although the top hits may differ in the gene they name, a query made of concatenated genes from a single individual can identify its species with high percent identity. We were able to build phylogenetic trees with ease for each gene set from both nucleotide and protein sequences; however, the trees built from nucleotide and protein sequences do not show significant similarities. We expected this result because the phylogenetic trees were constructed using different models: the general time reversible model was used for the nucleotide multiple sequence alignments, while the Jones-Taylor-Thornton (JTT) model was used for the protein multiple sequence alignments. Another possible explanation is that the gene sequences we used are highly conserved across species, which may make it difficult for MEGA-X to resolve the phylogenetic relationships between them. This limitation is also reflected in the phylogenetic trees from the other gene sequences.
The phylogenetic tree constructed from the concatenated gene sequences shows that M. musculus is closely related to R. norvegicus and that H. sapiens is the most distant from the other four species. However, this result may not be accurate because we used gene sequences that are highly conserved across different species. For a more accurate estimate of species-level relatedness, we suggest building this tree from the entire genome of each species rather than from only three concatenated gene sequences. Overall, we demonstrated the ease and accuracy of using MEGA-X and NCBI BLAST for homologous protein determination and phylogenetic analysis. These tools do not require advanced computing knowledge or programming skills, and biologists and research scientists may not need significant time to learn them because most of the calculations and models are built in. MEGA-X and NCBI BLAST are sophisticated tools for analyzing nucleotide and protein sequence data, yet they remain user-friendly for a wide range of life science researchers.
Hall, B. G. (2013). Building phylogenetic trees from molecular data with MEGA. Molecular Biology and Evolution, 30(5), 1229-1235. https://doi.org/10.1093/molbev/mst012. PMID: 23486614.
Kumar, S., Nei, M., Dudley, J., & Tamura, K. (2008). MEGA: a biologist-centric software for evolutionary analysis of DNA and protein sequences. Briefings in Bioinformatics, 9(4), 299-306. https://doi.org/10.1093/bib/bbn017
Pearson, W. R. (2013). An introduction to sequence similarity ("homology") searching. Current Protocols in Bioinformatics, Chapter 3, Unit 3.1. https://doi.org/10.1002/0471250953.bi0301s42