Orthologs are homologs separated by speciation events. Paralogs are
homologs separated by duplication events. Detection of orthologs is
becoming much more important with the rapid progress in genome
sequencing.

OrthoMCL is a genome-scale algorithm for grouping orthologous protein
sequences. It provides not only groups shared by two or more
species/genomes, but also groups representing species-specific gene
expansion families, so it serves as an important utility for automated
eukaryotic genome annotation. OrthoMCL starts with reciprocal best
hits within each genome as potential in-paralog/recent paralog pairs
and reciprocal best hits across any two genomes as potential ortholog
pairs. Related proteins are interlinked in a similarity graph. Then
MCL (Markov Clustering algorithm, Van Dongen 2000; www./mcl)
is invoked to split mega-clusters. This process is analogous to the
manual review in COG construction. MCL clustering is based on weights
between each pair of proteins, so to correct for differences in
evolutionary distance the weights are normalized before running MCL.

OrthoMCL is similar to the INPARANOID algorithm (Remm, Storm et al.
2001), but is extended to cluster orthologs from multiple
species. OrthoMCL clusters are coherent with groups identified by EGO
(Lee, Sultana et al. 2002), and an analysis using EC number suggests
a high degree of reliability (Li, Stoeckert et al. 2003).

In a recent assessment (Chen et al. 2007), the performance of seven
widely used orthology detection algorithms, representing three
kinds of strategies (phylogeny-based, evolutionary distance-based
and BLAST-based), was evaluated using the statistical
technique Latent Class Analysis (LCA). LCA is useful when there
are large data sets available but no gold standard. The results
show an overall trade-off between sensitivity and specificity
among these algorithms, with INPARANOID and OrthoMCL as the two
best methods having both False Positive (FP) and False Negative
(FN) error rates lower than 20%.

Installation and usage

Configure the environment variables once and for all

PATH: create the directory ~/bin and add its full path to PATH by appending this line to ~/.bashrc:

    export PATH=${PATH}:~/bin

PERL5LIB: create the directory ~/perl5lib and add its full path to PERL5LIB by appending this line to ~/.bashrc:

    export PERL5LIB=${PERL5LIB}:~/perl5lib/

Reload the configuration:

    source ~/.bashrc
Installing mcl

    wget http://www./mcl/src/mcl-latest.tar.gz
    tar xvzf mcl-latest.tar.gz
    cd mcl-latest
    ./configure --prefix=`pwd`/../mcl_bin
    make
    make install
    ln -s `pwd`/../mcl_bin/bin/* ~/bin/
Installing OrthoMCL

    wget http:///common/downloads/software/v2.0/orthomclSoftware-v2.0.9.tar.gz
    tar xvzf orthomclSoftware-v2.0.9.tar.gz
    cd orthomclSoftware-v2.0.9
    ln -s `pwd`/bin/* ~/bin/
    ln -s `pwd`/lib/perl/* ~/perl5lib
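After the symlinks are in place, it can be worth confirming that the programs used later in this walkthrough are actually visible on PATH. The following is a minimal sketch: the tool list is only collected from the commands mentioned in this post, and the helper name `missing_tools` is made up for illustration.

```python
import shutil

# Executables mentioned in this walkthrough; adjust to your own pipeline.
REQUIRED_TOOLS = [
    "mcl",
    "makeblastdb",
    "blastp",
    "orthomclInstallSchema",
    "orthomclFilterFasta",
    "orthomclBlastParser",
    "orthomclMclToGroups",
]


def missing_tools(tools, which=shutil.which):
    """Return the tools that cannot be found on the current PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools(REQUIRED_TOOLS)
    if missing:
        print("Not on PATH:", ", ".join(missing))
    else:
        print("All required tools were found.")
```

The `which` parameter exists only so the lookup can be swapped out when testing; in normal use the default `shutil.which` is enough.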
Configuring the MySQL database

Install MySQL:

    yum install mysql mysql-server

Set the MySQL root password. Log in to MySQL with

    mysql -uroot

and enter these SQL statements at the mysql prompt:

    SET PASSWORD=PASSWORD("passwd");
    FLUSH PRIVILEGES;

OrthoMCL needs a lot of storage while it runs, and my root partition did not have enough space, so I had to move the database directory. If your root partition has enough space, skip this part.

Check the MySQL service:

    service mysqld status

Stop the MySQL service:

    service mysqld stop

Move the database directory to its new location:

    mkdir ~/mysql; chown mysql:mysql ~/mysql
    mv /var/lib/mysql/* ~/mysql/

Edit /etc/my.cnf and change datadir to ~/mysql (writing it as an absolute path is the safer choice):

    [mysqld]
    datadir=~/mysql
    #[OPTIMIZATION]
    ##Set this value to 50% of available RAM if your environment permits.
    myisam_sort_buffer_size=60G
    ##[OPTIMIZATION]
    ##This value should be at least 50% of free hard drive space. Use
    #caution if setting it to 100% of free space however. Your hard disk
    #may fill up!
    myisam_max_sort_file_size=200G
    ##[OPTIMIZATION]
    ##Our default of 2G is probably fine for this value. Change this value
    #only if you are using a machine with little resources available.
    read_buffer_size=2G

Start the MySQL service:

    service mysqld start

If the service fails to start, look at the error messages in /var/log/mysqld.log, for example:

    /usr/libexec/mysqld: Can't change dir to [Error code 13]

Make sure every parent directory of datadir has the x (execute) permission. If the service still does not start, run setenforce 0 in a terminal to turn off SELinux enforcement.

Create the database and the user

Create a database named orthomcl:

    CREATE DATABASE orthomcl;

Create a user orthomcl with password 152108 and full privileges on the orthomcl database:

    GRANT SELECT,INSERT,UPDATE,DELETE,CREATE VIEW, CREATE,INDEX,DROP on orthomcl.* TO 'orthomcl'@'localhost' IDENTIFIED BY '152108';
    FLUSH PRIVILEGES;

On CentOS 7, MariaDB replaces MySQL, but all commands are run the same way (this paragraph can be ignored if it does not apply to you):

    yum install mariadb mariadb-server
    systemctl start mariadb      # start MariaDB
    systemctl enable mariadb     # start automatically at boot
    mysql_secure_installation    # set the root password and related options
    mysql -uroot -pPASSWD        # test the login
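The comments in the my.cnf fragment above give two rules of thumb: about 50% of RAM for `myisam_sort_buffer_size` and about 50% of free disk space for `myisam_max_sort_file_size`. As a rough sketch, those rules can be turned into concrete values like this. The function name and the rounding down to whole gigabytes are my own assumptions, not something OrthoMCL or MySQL prescribes.

```python
import shutil


def suggest_myisam_settings(total_ram_bytes, free_disk_bytes):
    """Apply the 50% rules of thumb quoted in the my.cnf comments.

    Values are rounded down to whole GiB, with a floor of 1G
    (both choices are assumptions made for this sketch).
    """
    gib = 1024 ** 3
    sort_buffer_gib = max(1, total_ram_bytes // 2 // gib)
    max_sort_file_gib = max(1, free_disk_bytes // 2 // gib)
    return {
        "myisam_sort_buffer_size": f"{sort_buffer_gib}G",
        "myisam_max_sort_file_size": f"{max_sort_file_gib}G",
    }


if __name__ == "__main__":
    # Hypothetical machine with 120 GiB of RAM; free space is measured
    # on the file system that holds the current directory.
    ram = 120 * 1024 ** 3
    free = shutil.disk_usage(".").free
    for key, value in suggest_myisam_settings(ram, free).items():
        print(f"{key}={value}")
```

In practice the free space should be measured on the file system that holds datadir, so pass that directory to `shutil.disk_usage` instead of `"."`.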
Configuring the OrthoMCL working files

Create a directory (~/orthmcl_work) to hold the OrthoMCL configuration file. Copy orthomclSoftware-v2.0.9/doc/OrthoMCLEngine/Main/orthomcl.config.template into ~/orthmcl_work, rename it to orthomcl.config, and change its contents to:

    # this config assumes a mysql database named 'orthomcl'.
    # Adjust according to your situation.
    dbVendor=mysql
    #Database name: orthomcl
    dbConnectString=dbi:mysql:orthomcl
    #Database username
    dbLogin=orthomcl
    #Database password
    dbPassword=152108
    # Change strings as you like
    similarSequencesTable=SimilarSequences
    orthologTable=Ortholog
    inParalogTable=InParalog
    coOrthologTable=CoOrtholog
    #Standards
    interTaxonMatchView=InterTaxonMatch
    percentMatchCutoff=50
    evalueExponentCutoff=-5
    oracleIndexTblSpc=NONE

Generate the database tables:

    orthomclInstallSchema orthomcl.config inst_schema.log species
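A typo in orthomcl.config tends to surface only when a later step fails, so a quick sanity check of the file can save time. The sketch below reads the key=value lines and reports missing keys. The set of expected keys is simply the set that appears in the config shown in this post; whether OrthoMCL itself treats every one of them as mandatory is an assumption.

```python
# Keys taken from the orthomcl.config shown in this post (assumed complete).
EXPECTED_KEYS = {
    "dbVendor", "dbConnectString", "dbLogin", "dbPassword",
    "similarSequencesTable", "orthologTable", "inParalogTable",
    "coOrthologTable", "interTaxonMatchView",
    "percentMatchCutoff", "evalueExponentCutoff", "oracleIndexTblSpc",
}


def parse_orthomcl_config(text):
    """Parse key=value lines, skipping blank lines and # comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"not a key=value line: {raw!r}")
        settings[key.strip()] = value.strip()
    return settings


def missing_keys(settings):
    """Return the expected keys that are absent, in sorted order."""
    return sorted(EXPECTED_KEYS - settings.keys())


if __name__ == "__main__":
    sample = "dbVendor=mysql\n#Database username\ndbLogin=orthomcl\n"
    print(missing_keys(parse_orthomcl_config(sample)))
```

To check a real file, read it first, for example `parse_orthomcl_config(open("orthomcl.config").read())`.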
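Creating the OrthoMCL input files

OrthoMCL takes FASTA files as input, and each sequence name has the form >taxoncode|unique_prot_id. The name consists of two fields separated by a vertical bar: the first is a 3- to 4-letter species code, and the second is a unique ID for the protein sequence. Usually one representative protein sequence is chosen per gene. All of these files share the suffix .fasta and are stored in a single directory, orthlMCL (this directory may contain only FASTA sequence files, otherwise orthomclBlastParser reports an error).

Filter the sequences, allowing a minimum protein length of 10 and at most 20% stop codons. By default this produces goodProteins.fasta:

    orthomclFilterFasta orthlMCL 10 20

Merge goodProteins.fasta with the OrthoMCL data to obtain orthoMCL.fa. In practice we usually prepare the protein sequences of the species under study together with several closely related or representative species, so there is no need to merge with the protein sequences in the OrthoMCL database: our own goodProteins.fasta can be used directly as orthoMCL.fa.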
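To make the naming rule and the two thresholds concrete, here is a small sketch that checks headers against the >taxoncode|unique_prot_id pattern and applies a length and stop-codon filter. It only illustrates the kind of filtering described above; it is not the implementation of orthomclFilterFasta. Counting `*` as the stop symbol, and the function names, are assumptions.

```python
import re

# 3 to 4 letters, a vertical bar, then an ID without whitespace.
HEADER_RE = re.compile(r"^>([A-Za-z]{3,4})\|(\S+)$")


def read_fasta(lines):
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line, []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def is_good_protein(header, seq, min_length=10, max_stop_percent=20):
    """Check the header format, minimum length and stop-codon share."""
    if HEADER_RE.match(header) is None:
        return False
    if not seq or len(seq) < min_length:
        return False
    stop_percent = 100.0 * seq.count("*") / len(seq)  # '*' assumed as stop
    return stop_percent <= max_stop_percent


if __name__ == "__main__":
    demo = [">Aco|Aco000153.1", "MKVLLAGGSTPQ", ">Aco|short", "MKV"]
    for header, seq in read_fasta(demo):
        print(header, is_good_protein(header, seq))
```

Running a check like this on the input directory before filtering makes it easier to spot headers that still carry the original database names.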
Sequence BLAST

    makeblastdb -in orthoMCL.fa -dbtype prot -title orthomcl \
        -out orthomcl -logfile orthomcl.log
    blastp -db orthomcl -query goodProteins.fasta -seg yes \
        -out orthomcl.blastout -evalue 1e-5 -outfmt 7 -num_threads 70
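The all-against-all BLAST output is what the reciprocal-best-hit step from the introduction works on. The sketch below shows the idea on tabular BLAST lines (default columns, bit score in column 12, `#` comment lines skipped). It is a simplification under stated assumptions: it ranks hits by raw bit score and ignores the weight normalization and the MCL step that OrthoMCL applies, and all function names are made up for illustration.

```python
def taxon(protein_id):
    """Return the taxon code from a 'taxoncode|unique_prot_id' name."""
    return protein_id.split("|", 1)[0]


def best_hits_per_taxon(blast_lines):
    """Map (query, subject taxon) to the best (subject, bit score)."""
    best = {}
    for line in blast_lines:
        if not line.strip() or line.startswith("#"):
            continue  # -outfmt 7 writes comment lines starting with '#'
        fields = line.rstrip("\n").split("\t")
        query, subject, bitscore = fields[0], fields[1], float(fields[11])
        if query == subject:
            continue  # ignore self hits
        key = (query, taxon(subject))
        if key not in best or bitscore > best[key][1]:
            best[key] = (subject, bitscore)
    return best


def reciprocal_best_hits(blast_lines):
    """Return (potential ortholog pairs, potential in-paralog pairs)."""
    best = best_hits_per_taxon(blast_lines)
    orthologs, in_paralogs = set(), set()
    for (query, _), (subject, _) in best.items():
        back = best.get((subject, taxon(query)))
        if back is None or back[0] != query:
            continue  # the best hit is not returned by the subject
        pair = tuple(sorted((query, subject)))
        if taxon(query) == taxon(subject):
            in_paralogs.add(pair)
        else:
            orthologs.add(pair)
    return orthologs, in_paralogs
```

A file handle can be passed directly, for example `reciprocal_best_hits(open("orthomcl.blastout"))`, since the function only iterates over lines.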
The remaining steps are not described one by one here; they are all wrapped in a single bash script.

The integrated analysis script orthoMcl.sh
Usage:
/MPATHB/self/NGS/orthoMcl.sh options
Function:
This script is used to perform orthoMcl analysis using MySql, MCL and orthomcl.
Before running this script, one must have one mysql database and a mysql user which can perform operation on this database.
OPTIONS:
    -d  Mysql database name (using user_name as prefix to avoid duplication) [Necessary]
    -u  Mysql database username [Necessary]
    -p  Mysql database password [Necessary]
    -s  Target species of this analysis (Any representing string is OK, the shorter the better) [Necessary]
    -D  A directory containing FASTA files for all proteins. [Necessary]
    -S  Sequences downloaded from orthMCL website. [Optional, not used anymore]
    -t  Number of threads for blast. [Default 50]
Program description:
This is designed to parse orthmcl results.
Input file format:
cluster_name<colon><any blank>spe1<vertical_line>prot1<any blank>spe2<vertical_line>prot2<any blank>.....
C10000: Aco|Aco000153.1 Aco|Aco004369.1 Aco|Aco010005.1
C10001: Aco|Aco000153.1 Cla|Cla004369.1 Dec|Dec010005.1
Tasks:
1. Get a matrix showing the number of proteins in each cluster.
2. Extract single gene clusters and their sequences in all given
species. In the output nucleotide file, the trailing stop codon (TAA,
TAG, TGA) will be removed for compatibility with
`translatorx_vLocal.pl` and `trimal`.
3. Extract species specific clusters for given species.
4. Extract gene-expansion clusters for given species.
5. Extract multiple-species specific clusters.
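As an illustration of tasks 1 to 3, the groups format above can be parsed and summarized in a few lines. This is a minimal sketch and not the code of `parseOrthoMclResult.py`. In particular, treating a "single gene cluster" as one with exactly one protein from every listed species and no other species is my assumption.

```python
from collections import Counter


def parse_groups(lines):
    """Return {cluster_name: [(species, protein), ...]}."""
    groups = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        name, _, members = line.partition(":")
        groups[name.strip()] = [
            tuple(member.split("|", 1)) for member in members.split()
        ]
    return groups


def count_matrix(groups):
    """Task 1: number of proteins per species in each cluster."""
    return {
        name: Counter(species for species, _ in members)
        for name, members in groups.items()
    }


def single_copy_clusters(counts, species):
    """Task 2: clusters with exactly one protein from every species."""
    wanted = set(species)
    return [
        name for name, row in counts.items()
        if set(row) == wanted and all(row[sp] == 1 for sp in wanted)
    ]


def species_specific_clusters(counts, target):
    """Task 3: clusters whose proteins all come from the target species."""
    return [name for name, row in counts.items() if set(row) == {target}]


if __name__ == "__main__":
    demo = [
        "C10000: Aco|Aco000153.1 Aco|Aco004369.1 Aco|Aco010005.1",
        "C10001: Aco|Aco000153.1 Cla|Cla004369.1 Dec|Dec010005.1",
    ]
    counts = count_matrix(parse_groups(demo))
    print(single_copy_clusters(counts, ["Aco", "Cla", "Dec"]))
    print(species_specific_clusters(counts, "Aco"))
```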
Usage: parseOrthoMclResult.py -i file
Options:
-h, --help show this help message and exit
-i FILEIN, --input-file=FILEIN
Output of `orthomclMclToGroups`.
-t MAIN_SPE, --target-species=MAIN_SPE
Specify the `species` name used for extracting species
specific clusters or specially expanded clusters.
-E EXCLUDE_WHEN_READING, --exclude-all=EXCLUDE_WHEN_READING
Comma or blank separated strings representing species
excluded when reading in the result. It will affect
all tasks. Default including all species.
-e EXCLUDE_SINGLE_CONSERVE, --exclude-2=EXCLUDE_SINGLE_CONSERVE
Comma or blank separated strings representing species
should not be considered when performing task <2>.
Default including all species.
-s SPECIFIC_MULTIPLE, --specific-multiple-5=SPECIFIC_MULTIPLE
Comma or blank separated strings representing multiple
species used for task <5>. Default muting task 5.
-P DIR_PROT, --directory-prot=DIR_PROT
Directory containing all protein sequences used for
`orthoMcl.sh`. All sequences have a suffix `.fasta`.
-N DIR_NUCL, --directory-nucl=DIR_NUCL
Directory containing all nucleotide sequences used for
`orthoMcl.sh`. All sequences have a suffix `.fasta`.
-o OUTP, --output-prefix=OUTP
Prefix for output files.
-v, --verbose Show process information
-d, --debug Debug the program
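Several options above take "comma or blank separated" species lists, either to exclude species while reading (`-E`) or to select a set of species for task 5 (`-s`). The following sketch shows one way such options could be handled. It is not taken from the script: the function names are hypothetical, and defining a multiple-species specific cluster as one whose species set equals the given set exactly is an assumption.

```python
import re


def split_species(option_value):
    """Split a comma or blank separated option string into names."""
    return [name for name in re.split(r"[,\s]+", option_value.strip()) if name]


def drop_species(groups, excluded):
    """Remove members of excluded species; drop clusters left empty."""
    excluded = set(excluded)
    kept = {}
    for name, members in groups.items():
        remaining = [(sp, prot) for sp, prot in members if sp not in excluded]
        if remaining:
            kept[name] = remaining
    return kept


def multi_species_specific(groups, species):
    """Clusters whose species set is exactly the given set (assumed rule)."""
    wanted = set(species)
    return [
        name for name, members in groups.items()
        if {sp for sp, _ in members} == wanted
    ]


if __name__ == "__main__":
    groups = {
        "C1": [("Aco", "a1"), ("Cla", "c1")],
        "C2": [("Aco", "a2"), ("Cla", "c2"), ("Dec", "d1")],
    }
    print(multi_species_specific(groups, split_species("Aco, Cla")))
```

Here `groups` is a plain dictionary mapping each cluster name to a list of (species, protein) pairs, which is easy to build from the output of `orthomclMclToGroups`.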