
Cell: A Head-to-Head Comparison of 20 Metagenomic Taxonomic Classification Tools

Posted by QAb6ICvc, 2019-08-12

Benchmarking Metagenomics Tools for Taxonomic Classification

Cell [impact factor: 36.216]

2019-08-08 Review

DOI: 10.1016/j.cell.2019.07.010

The full text is open access: https://www./cell/fulltext/S0092-8674(19)30775-5

First author: Simon H. Ye1,2,*

Corresponding author: Simon H. Ye1,2,*

Other authors: Katherine J. Siddle, Daniel J. Park, Pardis C. Sabeti

Affiliations:

1 Harvard-MIT Health Sciences and Technology, Massachusetts Institute of Technology, Cambridge, MA 02139, USA

2 Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA

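Daily Brief

Many software tools are available for taxonomic classification of metagenomic data, but a systematic evaluation has been lacking;

This paper reviews the current mainstream approaches to metagenomic analysis and systematically evaluates 20 classifiers;

It also describes the key metrics used in the evaluation, providing a framework for benchmarking additional classifiers;

The assessment of resources consumed by the database indexing step helps users decide whether to build their own index or use one already built by others;

The assessment of memory, thread count, and run time helps users choose software and an analysis plan that suit their hardware, and estimate how long a project will take.

Editor's comment: Metagenomic sequencing is revolutionizing the detection and characterization of microbial species, but the sheer number of available tools makes it hard to choose among them. Cell recently published a systematic evaluation of taxonomic classification software. Its results can guide researchers toward the analysis plan that best fits their own hardware, so that they obtain better results. It also gives developers of such software a framework for systematically evaluating performance.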

Abstract

Metagenomic sequencing is revolutionizing the detection and characterization of microbial species, and a wide variety of software tools are available to perform taxonomic classification of these data. The fast pace of development of these tools and the complexity of metagenomic data make it important that researchers are able to benchmark their performance. Here, we review current approaches for metagenomic analysis and evaluate the performance of 20 metagenomic classifiers using simulated and experimental datasets. We describe the key metrics used to assess performance, offer a framework for the comparison of additional classifiers, and discuss the future of metagenomic data analysis.

Main Results

Figure 1 Processing Steps to Go from a Complex Metagenomic Sample to an Abundance Profile of Sample Content

Figure 2 Metrics Used for Evaluating Classifier Performance

AUPR (area under the precision-recall curve) and L2 distance (the straight-line distance between the observed and true abundance vectors) are two complementary metrics that provide insight into the accuracy of a classifier’s precision-recall and abundance estimates, respectively. Considered together, they provide a readily interpretable picture of classifier performance and can be used to compare classifiers.
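As a rough illustration of the two metrics, the Python sketch below computes both for a single made-up sample. It is not the paper's code: the abundance profiles are invented, the function names are hypothetical, and the AUPR here is a simple step-wise sum over species calls ranked by reported abundance, which is one common convention and an assumption on our part.

```python
import math


def l2_distance(observed, truth):
    """Euclidean distance between two abundance profiles ({species: fraction})."""
    species = set(observed) | set(truth)
    return math.sqrt(
        sum((observed.get(s, 0.0) - truth.get(s, 0.0)) ** 2 for s in species)
    )


def aupr(observed, truth_species):
    """Area under the precision-recall curve, ranking calls by descending abundance."""
    ranked = sorted(observed, key=observed.get, reverse=True)
    true_positives = 0
    area = 0.0
    prev_recall = 0.0
    for rank, species in enumerate(ranked, start=1):
        if species in truth_species:
            true_positives += 1
        precision = true_positives / rank
        recall = true_positives / len(truth_species)
        # Step-wise integration: each gain in recall is weighted by the
        # precision at the cutoff where that gain happens.
        area += (recall - prev_recall) * precision
        prev_recall = recall
    return area


# Invented example: three true species, one false positive ("X").
truth = {"A": 0.5, "B": 0.3, "C": 0.2}
observed = {"A": 0.5, "B": 0.25, "X": 0.15, "C": 0.1}

example_aupr = aupr(observed, set(truth))
example_l2 = l2_distance(observed, truth)
print(f"AUPR = {example_aupr:.3f}, L2 = {example_l2:.3f}")
```

For this toy sample the AUPR is about 0.917 (the false positive outranks one true species) and the L2 distance is about 0.187. A higher AUPR and a lower L2 distance are better.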

Table 1 A List of Benchmarked Classifiers and Their Various Characteristics

The table mainly covers whether the database can be customized, whether an abundance profile can be produced, memory consumption, and run time.

“Custom databases” refers to the ability for the end user to create a custom database. The time and memory requirements are for a 5.7 million-read dataset with the database and input already cached in memory. Some methods (marked as “varies”) have the ability to flexibly decrease their memory usage (at the cost of a massive increase in run time).

a: The latest version of PathSeq now allows the user to create and specify a custom database, but this option was not available when benchmarking studies were performed; thus, it was excluded from those analyses.

Figure 3 Benchmark AUPR Scores

(A) Area under the precision-recall curve (AUPR) scores for each classifier at the species level (a higher value is better). Each plot point represents the score for a (classifier, dataset) combination. Classifiers are grouped and colored by their target class (blue for DNA, orange for protein, red for DNA markers).

(B) AUPR for the uniform RefSeq CG database instead of default databases. Missing entries on the RefSeq CG plot are classifiers that cannot create custom databases. With the same database, the results differ little from one tool to another.

For additional information, see Figures S1–S4.

Figure 4 Benchmark L2 Distances

(A) Distance between the species abundance profile for each classifier compared with the true composition (a lower value is better). Each plot point represents the L2 distance for a (classifier, dataset) combination. Classifiers are grouped and colored by their target class.

(B) Abundance distance using the uniform RefSeq CG database. Missing entries are classifiers that cannot create custom databases.

(C) Median pairwise L2 abundance norms between classifiers across simulated datasets, hierarchically clustered. Non-black cluster link colors are groups at a 0.09 similarity threshold. Colored boxes correspond to the method type: DNA, protein, and marker classifiers. The “k” annotation indicates k-mer-based methods.

For additional information, see Figure S6.
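The pairwise comparison behind panel (C) can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' pipeline: the classifier names (`clf_a`, `clf_b`, `clf_c`), datasets, and abundances are invented, and the code stops at the median pairwise distance matrix. It only flags pairs under the 0.09 threshold rather than performing the hierarchical clustering itself.

```python
import math
from itertools import combinations
from statistics import median


def l2_distance(a, b):
    """Euclidean distance between two abundance profiles ({species: fraction})."""
    species = set(a) | set(b)
    return math.sqrt(sum((a.get(s, 0.0) - b.get(s, 0.0)) ** 2 for s in species))


def median_pairwise_l2(profiles):
    """profiles: {classifier: {dataset: {species: abundance}}}.

    Returns {(classifier_1, classifier_2): median L2 over shared datasets}.
    """
    result = {}
    for a, b in combinations(sorted(profiles), 2):
        shared = sorted(set(profiles[a]) & set(profiles[b]))
        distances = [l2_distance(profiles[a][d], profiles[b][d]) for d in shared]
        if distances:  # skip pairs with no dataset in common
            result[(a, b)] = median(distances)
    return result


# Invented profiles for three hypothetical classifiers on two datasets.
profiles = {
    "clf_a": {"d1": {"A": 0.6, "B": 0.4}, "d2": {"A": 0.5, "B": 0.5}},
    "clf_b": {"d1": {"A": 0.6, "B": 0.4}, "d2": {"A": 0.5, "B": 0.5}},
    "clf_c": {"d1": {"A": 1.0}, "d2": {"B": 1.0}},
}

dist = median_pairwise_l2(profiles)
# Pairs whose median distance falls under the 0.09 threshold quoted in the caption.
close_pairs = [pair for pair, d in dist.items() if d < 0.09]
print(close_pairs)
```

A matrix like `dist` is what a hierarchical clustering routine would take as input; here only `clf_a` and `clf_b`, which report identical profiles, fall under the threshold.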

Figure 5 Proportion of Abundance Classified at the Species Rank

(A) Proportion of sample abundance classified at the species rank with default databases.

(B) Using uniform RefSeq CG databases. Only programs allowing custom databases are shown.

For additional information, see Figure S5.

Figure 6 Number of Species Classified versus Minimum Abundance Threshold Detected in ATCC Even Sample Datasets

The truth abundance of 20 species at 0.05 abundance each is depicted as a black dotted line. For additional information, see Figures S7–S9.
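The relationship plotted in Figure 6 is simple to reproduce for any abundance profile. The sketch below is an illustration only: the profile is invented (20 species reported at 0.045 each plus 10 low-abundance false positives), and the function name and thresholds are hypothetical choices, not values from the paper.

```python
def species_detected(profile, thresholds):
    """Count species whose reported abundance is at or above each threshold."""
    return {
        t: sum(1 for abundance in profile.values() if abundance >= t)
        for t in thresholds
    }


# Invented profile: 20 "real" species slightly under-estimated at 0.045 each,
# plus 10 spurious low-abundance calls at 0.01 each.
profile = {f"true_{i}": 0.045 for i in range(20)}
profile.update({f"spurious_{i}": 0.01 for i in range(10)})

counts = species_detected(profile, thresholds=[0.001, 0.02, 0.05])
print(counts)
```

In this toy case a very low threshold reports 30 species, a threshold of 0.02 recovers exactly the 20 real ones, and a threshold at the true abundance of 0.05 drops them all because each was slightly under-estimated. This is why the choice of minimum abundance cutoff matters when interpreting species counts.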

Figure 7 Benchmark of Computational Resources

(A) Time required to process a sample containing 5.7 million reads on a first run versus a second run immediately after the first. This second run is faster for many classifiers because sample reads and database files are cached in memory. Bracken is not plotted because it requires negligible time and memory.

(B) The maximum memory utilized by each classifier during execution, the on-disk database size, and the average number of CPUs utilized out of 32 available.

(C) Time taken and memory used to create the RefSeq CG database using various methods. Classifiers are sorted by increasing time taken. MMseqs2 and DIAMOND do not index the genomes during database construction but, rather, index on the fly during sample classification.
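For readers who want to repeat the first-run versus second-run comparison on their own hardware, a minimal timing harness could look like the sketch below. It is an assumption-laden illustration rather than the authors' benchmarking setup: the command is a placeholder that merely starts a Python interpreter, and the first run is only truly "cold" if the operating system has not already cached the relevant files.

```python
import subprocess
import sys
import time


def time_runs(command, repeats=2):
    """Run `command` back to back and return the wall-clock seconds of each run."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        subprocess.run(
            command,
            check=True,  # raise if the command exits with a non-zero status
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        timings.append(time.perf_counter() - start)
    return timings


# Placeholder command; substitute the classifier invocation you want to measure.
timings = time_runs([sys.executable, "-c", "pass"])
print([f"{t:.3f} s" for t in timings])
```

Comparing the first and second entries of `timings` gives a rough sense of how much a classifier benefits from having its database and input already cached in memory.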

Reference

https://www./cell/fulltext/S0092-8674(19)30775-5

Ye, S.H., Siddle, K.J., Park, D.J., and Sabeti, P.C. (2019). Benchmarking Metagenomics Tools for Taxonomic Classification. Cell 178, 779-794.

