Objective:

In most organisms, about 10 percent of the coding genome is devoted to transport proteins. It is therefore important to identify and then analyze all of the possible transporters encoded within the genome of an organism of interest. The quantities, types and characteristics of the transport proteins can reveal a wealth of information about the organism. In many cases a unique lifestyle correlates with the types of genes present. For example, parasitic and symbiotic organisms usually have reduced genome sizes. The fish parasite Ichthyophthirius multifiliis (ICH) has many members of the Voltage-gated Ion Channel Superfamily (VIC) that lack voltage sensors. These attributes can vary substantially even between closely related organisms such as Tetrahymena thermophila and Paramecium tetraurelia. In summary, our goal is to use various methods to first find the potential transporter proteins of an organism of interest and then analyze and organize them. At each stage we take steps to remove errors, find trends, and identify unique proteins.

 

Setup:

The analysis begins by finding the most up-to-date proteome in FASTA format for the organism in question. Finding the most recent proteome can be challenging for some organisms; a good starting point is simply to search the web for the organism’s name. For example, the most up-to-date proteome for Paramecium tetraurelia can be found at http://paramecium.cgm.cnrs-gif.fr/. This site presents the most current information about Paramecium tetraurelia, with numerous links and resources including a download link for the proteome. To tell the genome and proteome files apart, look for keywords such as “nucleotide” and “peptide”, which indicate the genome and the proteome, respectively. If no such site exists for the query organism, NCBI can be used to generate a list of known proteins. Enter the organism’s full name in the search box at the top of the page (example: Paramecium tetraurelia). This brings up a list of databases. Click on the link “protein”, which usually brings up an enormous list of proteins. Click on “send to” and then select “FASTA format”. A program called Gblast is then used to sift through the proteome and extract all of the potential transporters that are homologous to proteins found in TCDB.
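When the download page does not label its files clearly, the nucleotide/peptide distinction can also be checked from the file contents. The following is a minimal Python sketch, not part of the lab's tooling: the function names and the 0.9 threshold are illustrative assumptions. It parses FASTA records and reports whether the residues look like nucleotides or amino acids.

```python
from collections import Counter

# Letters expected in a nucleotide FASTA file (N marks an unknown base).
NUCLEOTIDE_LETTERS = set("ACGTUN")


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def guess_sequence_type(records, threshold=0.9):
    """Return 'nucleotide' if at least `threshold` of all residues are
    A/C/G/T/U/N, otherwise 'protein'. The threshold is an assumption."""
    counts = Counter()
    for _, seq in records:
        counts.update(seq.upper())
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no sequence data found")
    nucleotide_fraction = sum(counts[c] for c in NUCLEOTIDE_LETTERS) / total
    return "nucleotide" if nucleotide_fraction >= threshold else "protein"


# Small in-memory example; for a real file, pass an open file handle
# to read_fasta() instead of a list of lines.
example = [">demo_protein", "MKTLLVAGLL", "FWYHRDE"]
print(guess_sequence_type(list(read_fasta(example))))
```

Counting the records returned by `read_fasta` also gives a quick check that the download is complete before it is copied to the server.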

 

Using Gblast:

Once a good proteome is obtained in FASTA format, Gblast can be used to extract all potential transport proteins from the list. To do this, a program called “Fugu” is used to connect to the server that hosts the Gblast program and to transfer the proteome to the correct directory. Click on Fugu (icon shaped like a puffer fish) and log into server 132.239.144.51, which contains the program (ask a senior member of the Saier lab for the password). Once you are logged in, click on the directory named “genome” and copy and paste your proteome into the folder. Remember to delete your proteome from the directory once the Gblast search is complete.

The next step is to execute Gblast by using a program called “Terminal”. Open the Terminal program (black box icon with “>_” text) and type in “ssh 132.239.144.51”. Enter the password (again, you must obtain it from a senior member of the lab) and then type in “cd genome”. To run the program, type in “perl Gblast.pl <your input file name>”. The program can take anywhere from 1 hour to a day depending on the size of your proteome.

Once the program is finished, use Fugu to log in to the same server, and then go to the “genome” folder followed by the “results” folder. Your results will be in a file called “clean.tsv”. Copy this file (rename it to something relevant) to the appropriate area on your computer (ideally a folder with your name on it in “put your files here”). Open this file with a program called “BBEdit” first and then save it again. Open this newly saved file with Excel. If Excel cannot open the file, use BBEdit. Your results will be presented as shown in the table below.

 

Gblast information:

A program called Gblast is used to BLAST your proteome against the TCDB database. The results are tabulated into an Excel file that shows each query (a protein from your proteome of interest) with its top hit from TCDB. The query name and description derive from the proteome. The query TMSs are obtained using the WHAT program, which predicts hydrophobicity and amphipathicity along the length of the protein using a window of 19 residues for α-helices and 9 for β-strands. All of the information regarding the hit protein is obtained from TCDB. The following table illustrates the information obtained.

 

Column              Description
Query               Name of the protein according to the query proteome
Hit                 Accession number of the top hit in TCDB
TCID                TCDB hit protein ID number
Qry. Description    The description of the query protein obtained via the query proteome
Hit Description     The description of the hit protein in TCDB
Qry. TMS            The number of predicted TMSs in the query protein using the WHAT program
Hit TMS             The number of predicted TMSs in the hit protein in TCDB, also using the WHAT program
Qry. Length         Length of the query protein (number of amino acids)
Hit length          Length of the hit protein (number of amino acids)
Qry. Region         The region of the query protein that is similar to the hit protein
Hit Region          The region of the hit protein that is similar to the query protein
Percent Aln.        The aligned region of the hit as a percentage of hit protein length
e-Value             The e-value obtained for the alignment
Family Name         The family name of the hit protein according to TCDB
Matching Qry. TMSs  Any query TMSs found to match the hit
Matching Hit TMSs   Any hit TMSs found to match the query
Query TMS Region    The length of the hydrophobic region in the query sequence
Query Hit Region    The length of the hydrophobic region in the hit sequence
TM Domain(s)        Any domain(s) predicted to be transmembrane
Additional Domains  Any domains not found to be transmembrane, but recognized by CDD
Notes               Any unusual or interesting details about the results
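
As a worked illustration of the “Percent Aln.” column, the arithmetic can be sketched in Python. This assumes, without confirmation from Gblast itself, that a region is written as “start-end” with 1-based inclusive positions; the function name is hypothetical.

```python
def percent_aligned(hit_region, hit_length):
    """Aligned part of the hit as a percentage of the full hit length.

    Assumes `hit_region` looks like "51-250" (1-based, inclusive);
    the actual format in the Gblast output may differ.
    """
    start_text, end_text = hit_region.split("-")
    start, end = int(start_text), int(end_text)
    if not 1 <= start <= end <= hit_length:
        raise ValueError("region does not fit inside the hit protein")
    return 100.0 * (end - start + 1) / hit_length


# Residues 51 through 250 of a 400-residue hit cover 200 residues.
print(percent_aligned("51-250", 400))  # 50.0
```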

 

Analysis:

The results of Gblast should now be presented in an Excel file as shown above. As mentioned previously, the file displays each query protein against its top hit protein from TCDB, along with all of the relevant data. With these data, we can begin our analyses. Our goal is to locate the region of homology between the query and hit proteins and to determine whether that homology lies within the transmembrane domain. We can do this by finding out which transmembrane segments (TMSs), if any, are shared by the hit and query proteins. This is important because integral transport proteins are found in membranes and therefore should contain TMSs (transport systems may also include proteins that have no TMSs but facilitate the transport of substrates). This step can be crucial in determining whether the query protein retains transport qualities. We can accomplish this by using a program called “WHAT” or “HMMTOP”, located in biotools, that predicts the hydropathy of the protein in question.
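
To make the idea of a hydropathy prediction concrete, the sketch below computes a sliding-window average over a sequence in Python. It is not the WHAT or HMMTOP algorithm: it assumes the Kyte-Doolittle hydropathy scale, borrows the 19-residue window mentioned for α-helices, and uses a cutoff of 1.6 that is commonly quoted for that scale. The function names are illustrative.

```python
# Kyte-Doolittle hydropathy values (assumed scale; WHAT may use another).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(sequence, window=19):
    """Return the mean hydropathy of every window along the sequence.

    Entry i covers residues i .. i + window - 1 (0-based).
    Unknown residue letters are scored as 0.0.
    """
    seq = sequence.upper()
    if window < 1 or window > len(seq):
        raise ValueError("window must be between 1 and the sequence length")
    scores = [KYTE_DOOLITTLE.get(aa, 0.0) for aa in seq]
    running = sum(scores[:window])
    profile = [running / window]
    for i in range(window, len(scores)):
        # Slide the window one residue to the right.
        running += scores[i] - scores[i - window]
        profile.append(running / window)
    return profile


def candidate_tms_windows(sequence, window=19, cutoff=1.6):
    """Return start indices of windows whose mean hydropathy >= cutoff."""
    profile = hydropathy_profile(sequence, window)
    return [i for i, value in enumerate(profile) if value >= cutoff]


# Toy sequence: a 19-leucine hydrophobic stretch flanked by aspartates.
toy = "D" * 10 + "L" * 19 + "D" * 10
print(candidate_tms_windows(toy))
```

Runs of adjacent candidate windows mark hydrophobic stretches that could be TMSs. Comparing where such stretches fall in the query and in the hit is the kind of check the dedicated programs perform with far more refinement.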

The next step is to see which domains are shared by the query and hit proteins. A position-specific iterated BLAST (PSI-BLAST) search is used to find out which domains, if any, are shared. This can confirm whether the query protein contains the corresponding TMSs. For example, the only homology shared between two proteins may be a CAP_ED domain (as revealed by CDD), which may correspond to a cAMP-binding domain. In a few cases, no recognized domains will be shared between the proteins; however, it is possible that the shared domain has not yet been characterized. Results with an e-value greater than 0.01 can be removed, but keep in mind that Excel rounds very small e-values to 0, so a displayed value of 0 is a very strong hit rather than a missing value.
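
The e-value cutoff can also be applied outside Excel, which sidesteps the rounding issue because the values are read exactly as written in the file. The following Python sketch assumes the results are tab-separated with a header row and that the e-value column is named “e-Value” as in the table above; the real header in “clean.tsv” may differ, and the function name is hypothetical.

```python
import csv
import io


def filter_by_evalue(tsv_text, max_evalue=0.01, column="e-Value"):
    """Keep rows whose e-value is at most `max_evalue`.

    `column` is an assumed header name. Rows with a missing or
    non-numeric e-value are skipped.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    if reader.fieldnames is None or column not in reader.fieldnames:
        raise KeyError(f"column {column!r} not found in header")
    kept = []
    for row in reader:
        try:
            evalue = float(row[column])
        except (TypeError, ValueError):
            continue
        if evalue <= max_evalue:
            kept.append(row)
    return kept


# Made-up rows for illustration only; for a real file use
# filter_by_evalue(open("clean.tsv").read()).
demo = "Query\te-Value\nq1\t1e-120\nq2\t0.5\nq3\t0.01\n"
for row in filter_by_evalue(demo):
    print(row["Query"], row["e-Value"])
```

Because each row keeps its original text, an e-value such as 1e-120 stays visible as written instead of collapsing to 0.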