pHMM-Tree
(c) Yin Lab (NIU) and Zhang Lab (NKU) 2016

Citation: If you use our tool, please cite Huo et al., Bioinformatics 2017
(https://www.ncbi.nlm.nih.gov/pubmed/28062446).

Contents:
I.   Installation
II.  Input
III. Running
IV.  Output
V.   Examples
VI.  FAQ

=========================================================================
I. Installation

Compiling pHMM-Tree follows the standard GNU installation process and
requires a C++11 compliant compiler. Run 'configure', then 'make', and then
'make install'. See INSTALL for further instructions.

Example:
    tar zxf SOFTWARE.tar.gz
    cd pHMM-Tree                        #: change to installation folder
    sudo ./configure                    #: if you are a root/superuser
    ./configure --prefix=/some_path/    #: if you are not a root/superuser
    make
    make install

NOTE: Please make sure that the required programs have executable permission
and are on your PATH.

Prerequisites:
1. HMMER    http://hmmer.org/
2. USEARCH  http://www.drive5.com/usearch/download.html
   Please rename the binary file to 'usearch' and either add its location to
   your PATH variable or put the renamed binary in the same folder as
   pHMM-Tree.
       mv ./some_path/usearch-xxx-xxxx ./usearch
   or
       sudo mv ./some_path/usearch-xxx-xxxx /usr/bin/usearch
3. MAFFT    http://mafft.cbrc.jp/alignment/software/linux.html
   Needed by the '-uals' run style in both modes.
4. PRC      http://supfam.org/PRC/
   Needed by '-prc' mode. Please rename the program to 'prc' and either put
   it in the same folder as the pHMM-Tree binary or add its location to your
   PATH variable.
       mv ./some_path/prc-xxx-xxxx ./prc
   or
       sudo mv ./some_path/prc-xxx-xxxx /usr/bin/prc
5. hhsuite  https://github.com/soedinglab/hh-suite
   Needed by '-hhsuite' mode. Please move all the executable files in the
   bin folder into the same folder as the pHMM-Tree binary, or add the bin
   folder to your PATH variable.

------------------------------------------------------------------------
II. Input

pHMM-Tree supports different kinds of input:
1. Unaligned sequences (uals)
   The input file must be in FASTA format, have a name ending in '.fasta',
   and contain at least 9 sequences.
2. Aligned sequences (als)
   There must be at least 3 input files in FASTA format, each with a name
   ending in '.fasta'.
3. Profile HMMs in HMMER format (hmms)
   There must be at least 3 input files in HMMER2.x or HMMER3.x format, each
   with a name ending in '.hmm'.
4. Profile HMMs in HHM format (hhms)
   There must be at least 3 input files in HHsuite format, each with a name
   ending in '.hhm'. See
   https://github.com/soedinglab/hh-suite/blob/master/hhsuite-userguide.pdf.
5. Mixture of aligned sequences and profile HMMs: als + hmms (als_phmms)
6. Mixture of aligned sequences and profile HMMs: als + hhms (als_phhms)

-------------------------------------------------------------------------
III. Running

pHMM-Tree has several required arguments, in addition to some optional ones.

    ./Program_name <mode> <option> [-id] [-prc_hit] [hmm_acc] [file_path / file_path_name]

<mode>:   -prc:      Use PRC software to compare the profile HMMs in HMMER format.
<mode>:   -hhsuite:  Use hhsuite software to compare the profile HMMs in HHM format.

<option>: -uals:
    The input file must contain at least 9 unaligned sequences in FASTA
    format. This option requires the '-id' argument, a value between 0.1 and
    1.0 that sets the USEARCH identity used in the clustering step, and the
    file name must be included in the [file_path_name] argument.
    Users can also set the PRC hits parameter by giving the [prc_hit] option
    ('-prc_hit') a non-negative value; the pHMM-Tree default is 10; set it to
    0 to use the PRC default value (100).
    If the HMM files have 'ACC' keys, users may set the [hmm_acc] option
    ('-acc') to use the 'ACC' key in the matrix and tree files; the default
    key is 'NAME'.
    Use '-lib' to run PRC in library style; the default is pairwise style.
    Cmd Examples:
        ./Program_name -prc -uals -id 0.9 ./some_path/a.fasta
        ./Program_name -prc -uals -id 0.9 -prc_hit 15 ./some_path/a.fasta
        ./Program_name -prc -uals -id 0.9 -prc_hit 15 -acc ./some_path/a.fasta
        ./Program_name -hhsuite -uals -id 0.9 ./some_path/a.fasta

<option>: -als:
    The input should be a folder with at least three files containing
    aligned sequences in FASTA format.
    The [prc_hit] and [hmm_acc] options also work with this option; see the
    '-uals' option above for details.
    Cmd Examples:
        ./Program_name -prc -als ./some_path/
        ./Program_name -prc -als -prc_hit 15 ./some_path/
        ./Program_name -prc -als -prc_hit 15 -acc ./some_path/
        ./Program_name -hhsuite -als ./some_path/

<option>: -hmms:
    The input should be a folder with at least three profile HMM files in
    HMMER2.x or HMMER3.x format.
    The options are the same as for the '-als' option.
    Cmd Examples:
        ./Program_name -prc -hmms ./some_path/
        ./Program_name -prc -hmms -prc_hit 15 ./some_path/
        ./Program_name -prc -hmms -prc_hit 15 -acc ./some_path/

<option>: -hhms:
    The input should be a folder with at least three profile HMM files in
    HHM format.
    The options are the same as for the '-als' option.
    Cmd Examples:
        ./Program_name -hhsuite -hhms ./some_path/

<option>: -als_phmms:
    The input should be two folders, one with profile HMM files and the
    other with files containing aligned sequences in FASTA format, with > 3
    files in total across the two folders.
    The options are the same as for the '-als' option.
    Cmd Examples:
        ./Program_name -prc -als_phmms ./some_hmms_path/ ./some_als_path/
        ./Program_name -prc -als_phmms -prc_hit 15 ./some_hmms_path/ ./some_als_path/
        ./Program_name -prc -als_phmms -prc_hit 15 -acc ./some_hmms_path/ ./some_als_path/

<option>: -als_phhms:
    The input should be two folders, one with profile HHM files and the
    other with files containing aligned sequences in FASTA format, with > 3
    files in total across the two folders.
    The options are the same as for the '-als' option.
    Cmd Examples:
        ./Program_name -hhsuite -als_phhms ./some_hhms_path/ ./some_als_path/

--------------------------------------------------------------------------
IV. Output

pHMM-Tree outputs several folders:

<option>: -uals:
1. Folder 'tree_files' contains the tree data files:
   (1) 'f-m_fitch_outtree' and 'f-m_fitch_outfile': output of the PHYLIP
       'fitch' program using the 'Fitch-Margoliash' method
   (2) 'f-m_kitsch_outtree' and 'f-m_kitsch_outfile': output of the PHYLIP
       'kitsch' program using the 'Fitch-Margoliash' method
   (3) 'min_fitch_outtree' and 'min_fitch_outfile': output of the PHYLIP
       'fitch' program using the 'Minimum Evolution' method
   (4) 'min_kitsch_outtree' and 'min_kitsch_outfile': output of the PHYLIP
       'kitsch' program using the 'Minimum Evolution' method
   (5) 'neighbor_neighbor_outtree' and 'neighbor_neighbor_outfile': output
       of the PHYLIP 'neighbor' program using the 'Neighbor-Joining' method
   (6) 'upgma_upgma_outtree' and 'upgma_upgma_outfile': output of the PHYLIP
       'neighbor' program using the 'UPGMA' method
2. Folder 'matrixs' contains the matrix files:
   (1) 'file_dist_matrix_out_phylip.txt': the distance matrix in PHYLIP
       matrix format
   (2) 'old_new_names_list.txt': PHYLIP programs truncate names longer than
       10 characters; this file lists each original name paired with its
       truncated name
   (3) 'file_dist_matrix_out_mega.meg': the distance matrix in MEGA format
3. Folder 'unalign_seqs' contains the valid unaligned clusters of the
   sequences
4. Folder 'invalid_clusters' contains the invalid unaligned clusters of the
   sequences
5. Folder 'aligned' contains the valid aligned clusters of the sequences
6. Folder 'hmms' contains the profile HMM files in HMMER2.x format
7. Folder 'hmmer3' contains the profile HMM files in HMMER3.x format, if
   HMMER3.x is installed on the computer or the input files are in HMMER3.x
   format

<option>: -als: see above for details
1. Folder 'tree_files' contains the tree data files
2. Folder 'matrixs' contains the matrix files
3. Folder 'hmms' contains the profile HMM files in HMMER2.x format
4. Folder 'hmmer3' contains the profile HMM files in HMMER3.x format, if
   HMMER3.x is installed on the computer or the input files are in HMMER3.x
   format

<option>: -hmms: see above for details
1. Folder 'tree_files' contains the tree data files
2. Folder 'matrixs' contains the matrix files
3. Folder 'hmms' contains the profile HMM files in HMMER2.x format
4. Folder 'hmmer3' contains the profile HMM files in HMMER3.x format, if
   HMMER3.x is installed on the computer or the input files are in HMMER3.x
   format

<option>: -hhms: see above for details
1. Folder 'tree_files' contains the tree data files
2. Folder 'matrixs' contains the matrix files
3. Folder 'hhms' contains the profile HMM files in HHM format

<option>: -als_phmms: see above for details
1. Folder 'tree_files' contains the tree data files
2. Folder 'matrixs' contains the matrix files
3. Folder 'hmms' contains the profile HMM files in HMMER2.x format
4. Folder 'hmmer3' contains the profile HMM files in HMMER3.x format, if
   HMMER3.x is installed on the computer or the input files are in HMMER3.x
   format
5. Folder 'hmms_from_als' contains the profile HMM files built from the als
   input folder

<option>: -als_phhms: see above for details
1. Folder 'tree_files' contains the tree data files
2. Folder 'matrixs' contains the matrix files
3. Folder 'hhms' contains the profile HMM files in HHM format
4. Folder 'hhms_from_als' contains the profile HMM files in HHM format built
   from the als input folder

--------------------------------------------------------------------------
V. Examples

This package includes five example folders (if you have pHMM-Tree installed
in the pHMM-Tree folder, the examples folder is one level up):

The unaligned sequences: ../examples/unalign_seqs/1471-2229-9-99-s3105.fasta
Run command:
    ./pHMM-Tree -prc -uals -id 0.9 ../examples/unalign_seqs/1471-2229-9-99-s3105.fasta
    ./pHMM-Tree -hhsuite -uals -id 0.9 ../examples/unalign_seqs/1471-2229-9-99-s3105.fasta

The aligned sequences folder: ../examples/aligned_seqs/
Run command:
    ./pHMM-Tree -prc -als ../examples/aligned_seqs/
    ./pHMM-Tree -hhsuite -als ../examples/aligned_seqs/

The profile HMM files with 'ACC' key in a folder: ../examples/profile_hmms_acc/
Run command:
    ./pHMM-Tree -prc -hmms -acc ../examples/profile_hmms_acc/

The profile HMM files without 'ACC' key in a folder: ../examples/profile_hmms_name/
Run command:
    ./pHMM-Tree -prc -hmms ../examples/profile_hmms_name/

The profile HMM files in HHM format in a folder: ../examples/hhms/
Run command:
    ./pHMM-Tree -hhsuite -hhms ../examples/hhms/

The profile HMM folder and the aligned sequence folder:
Run command:
    ./pHMM-Tree -prc -als_phmms ../examples/profile_hmms_name/ ../examples/aligned_seqs/
    ./pHMM-Tree -hhsuite -als_phhms ../examples/hhms/ ../examples/aligned_seqs/

--------------------------------------------------------------------------
VI. FAQ

1. About USEARCH
   The 32-bit version of USEARCH is freely available; the 64-bit version is
   not free. According to http://www.drive5.com/usearch/download.html, the
   32-bit version works on both 32-bit and 64-bit operating systems but is
   limited to a maximum of 4 GB of RAM. This should meet the needs of most
   users.

2. About PRC
   PRC has two running modes: pairwise mode and library mode. pHMM-Tree uses
   pairwise mode, so each pHMM is compared to every other pHMM in the input
   to compute a score, which is then converted to a distance.
   By default, we use the default parameters for all the prerequisite
   programs except PRC.
   To save time, we use PRC's '-hits' option to limit the running time,
   because we do not need the full PRC result output. Users can change this
   value with the '-prc_hit' option to obtain a more accurate result, at the
   cost of a longer running time.

3. About HHsuite
   HHsuite is a software package containing many programs. pHMM-Tree uses
   the hhalign program to compare each pHMM against every other pHMM in the
   input (i.e. equivalent to PRC's pairwise mode) to compute a score, which
   is then converted to a distance. In contrast, Pfam uses hhsearch, which
   searches each pHMM against all other pHMMs (i.e. the library mode) to
   compute E-values, and uses clanviewer to represent the relationships
   among families of a clan
   (e.g. http://pfam.xfam.org/clan/CL0014#tabview=tab3).

4. About PRC vs. HHsuite
   We have two reasons for using PRC as the default in pHMM-Tree:
   (i) in our own experience, PRC (pairwise mode) is faster than hhsuite
   (hhalign) when comparing pHMMs; (ii) PRC can process the HMMER pHMM
   format, which Pfam uses and which is more popular than the hhsuite pHMM
   format (HHM). We are aware that hhsearch is considered to be faster than
   PRC library mode (https://toolkit.tuebingen.mpg.de/hhpred/help_ov);
   neither is part of pHMM-Tree's default pairwise comparison.

5. About phylogeny graphs
   On our web server, we use the R package ggtree to plot the Newick format
   tree file as a tree graph
   (https://bioconductor.org/packages/release/bioc/html/ggtree.html).
   Users can choose other tree graph programs (e.g. iTOL and MEGA), which
   may generate better looking graphs. Tree graphs are sometimes clipped in
   ggtree-generated images. We have tried our best to prevent this by
   modifying our ggtree R script, but users who still see clipped images can
   always load the Newick format file into another graph program.

6. About Phylip
   We use three different distance programs of Phylip to compute Newick
   trees from a pHMM distance matrix: fitch, kitsch, and neighbor. Each
   program implements two different algorithms.
   Six trees are therefore computed for each input matrix:
   f-m_fitch_outtree, f-m_kitsch_outtree, min_fitch_outtree,
   min_kitsch_outtree, neighbor_neighbor_outtree, upgma_upgma_outtree.

7. About negative branch length
   Users sometimes see strange trees with negative branch lengths from the
   Minimum Evolution method (min_fitch_outtree, min_kitsch_outtree). This is
   discussed in
   http://evolution.genetics.washington.edu/phylip/doc/kitsch.html.

8. About running time
   The rate-limiting step when running pHMM-Tree is the pHMM comparison
   step. We recorded the running time of pHMM-Tree on the 254 Pfam clans.
   As the number of pHMMs increases, the running time grows exponentially.
   For example, it takes 2 minutes to compute the tree for the CL0058 clan
   (Glyco_hydro_tim, 53 families) but 6 hours to compute the tree for the
   CL0123 clan (HTH, 254 families). We extrapolate that pHMM-Tree would take
   about a week to compute a tree of 500 pHMMs, which could be the practical
   maximum that pHMM-Tree can handle. See our paper for details.

9. About the unaligned sequence input mode
   pHMM-Tree is not a protein subclassification tool. For unaligned sequence
   input it only performs a simple similarity-based clustering and treats
   each cluster as a subfamily. For specialized protein family
   subclassification tools, please see SCI-PHY and GeMMA. Subfamilies
   generated by these programs can each be aligned with MAFFT and then fed
   into pHMM-Tree.
   In '-uals' mode, please make sure there are enough sequences for the
   program to run normally. Building a profile HMM requires at least 3
   sequences in one cluster, so building a normal tree requires at least 9
   sequences. Similarly, '-als' mode needs at least 3 aligned files, each
   containing at least 3 sequences, and the '-hmms' and '-hhms' modes need
   at least 3 profile HMM files.
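
The sequence-count requirement for '-uals' input can be checked before a run.
The following Python sketch is only an illustration and is not part of
pHMM-Tree: the names MIN_UALS_SEQUENCES, count_fasta_records and
enough_for_uals are hypothetical, and the demo sequences are made up. It
assumes a plain FASTA file in which every record starts with a '>' header
line.

```python
import io

# Minimum number of sequences this README states for '-uals' input.
MIN_UALS_SEQUENCES = 9


def count_fasta_records(handle):
    """Count FASTA records by counting header lines that start with '>'."""
    return sum(1 for line in handle if line.startswith(">"))


def enough_for_uals(handle, minimum=MIN_UALS_SEQUENCES):
    """Return True if the FASTA stream holds at least `minimum` records."""
    return count_fasta_records(handle) >= minimum


# Demo on an in-memory FASTA text (made-up sequences, not real data).
demo = "".join(f">seq{i}\nMKTAYIAK\n" for i in range(9))
print(enough_for_uals(io.StringIO(demo)))
```

For a real file, pass an open file handle instead, for example
enough_for_uals(open("./some_path/a.fasta")). Passing this check does not
guarantee a successful run, because the clustering step must still produce
enough clusters with at least 3 sequences each.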