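Key terms: differential gene expression analysis; differentially expressed gene (DEG); volcano plot.

Fold change

Fold change is simply the ratio between two expression values. If gene A has an expression value of 1 and gene B has a value of 3, then B's expression is 3 times that of A. Expression is usually measured as counts, TPM, or FPKM, so expression values are never negative, and the fold change between two positive values lies in (0, +∞).

Why do DEG tables so often use negative numbers for down-regulation and positive numbers for up-regulation? Because they report the log2 fold change. When expr(A) < expr(B), the fold change of B relative to A is greater than 1 and the log2 fold change is greater than 0 (see the figure below), so B is up-regulated relative to A. When expr(A) > expr(B), the fold change of B relative to A is less than 1 and the log2 fold change is less than 0. To avoid undefined or infinite values when an expression value is 0, we usually add 1 (or a very small number) to each expression value before taking the log, that is, log2(B+1) - log2(A+1). (This needs a little background on logarithms.)

Why not use the plain difference in expression, which already has a sign? Suppose A is expressed at 1, B at 8, and C at 64. By plain difference, B is up by 7 relative to A and C is up by 56 relative to B. By log2 fold change, B is up by 3 relative to A and C is also up by only 3 relative to B. Sequencing shows that expression levels of different genes in a cell differ enormously, so the plain difference is clearly unsuitable; the log2 fold change captures the relative trend much better.

Although everyone uses log2 fold change, it has obvious drawbacks. First, which is the bigger change, 5 to 10 or 100 to 120? Second, a change from 5 to 10 may be caused by technical error alone. So when a gene's overall expression is low, the log2 fold change becomes less trustworthy, especially close to 0.

A disadvantage and serious risk of using fold change in this setting is that it is biased[7] and may misclassify differentially expressed genes with large differences (B − A) but small ratios (B/A), leading to poor identification of changes at high expression levels. Furthermore, when the denominator is close to zero, the ratio is not stable, and the fold change value can be disproportionately affected by measurement noise.

Significance of the difference (P-value)

This is where statistics comes in: significance is computed from a hypothesis test. A hypothesis test starts from a hypothesis. We assume that A and B do not differ in expression (H0, the null hypothesis). Under that assumption, a t test (taking RT-PCR data as the example) gives the probability of observing a difference at least as large as the one we measured, and that probability is the P-value. If P-value < 0.05, an event that would be rare under H0 has occurred, so we reject the null hypothesis and conclude that A and B are expressed differently, that is, the difference is significant.

Significance only tells us that the difference in our data is statistically significant. To see whether a gene is up- or down-regulated, go back to the fold change.

This covers only the most basic principle; the algorithms inside real tools such as DESeq2 are far more complex.

The volcano plot takes the negative log of the q-value (the adjusted p-value), so the more significant a gene is, the larger its negative log. In a volcano plot, the "lava" farthest out is therefore the most significant and also shows the largest differences.

Readers who only need to interpret DEG results can stop here; those who want to go deeper can read on. Further questions are listed later in this post.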
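The fold change arithmetic described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `log2_fold_change` and its `pseudocount` parameter are made up for this example and do not come from DESeq2 or any other tool, and the expression values are the toy numbers A = 1, B = 8, C = 64 used in the text.

```python
import math


def log2_fold_change(a, b, pseudocount=1.0):
    """log2 fold change of b relative to a, adding a pseudocount to both values.

    Hypothetical helper for illustration only; not taken from any DEG tool.
    """
    if a < 0 or b < 0:
        raise ValueError("expression values must be non-negative")
    return math.log2(b + pseudocount) - math.log2(a + pseudocount)


# Toy expression values from the text.
expr_a, expr_b, expr_c = 1.0, 8.0, 64.0

# Plain differences grow with the expression level.
diff_ba = expr_b - expr_a  # 7.0
diff_cb = expr_c - expr_b  # 56.0

# The plain log2 ratio (pseudocount 0) is the same for both steps.
lfc_ba = log2_fold_change(expr_a, expr_b, pseudocount=0.0)  # 3.0
lfc_cb = log2_fold_change(expr_b, expr_c, pseudocount=0.0)  # 3.0

# With a pseudocount of 1, an expression value of 0 no longer breaks the log:
# log2(3 + 1) - log2(0 + 1) = 2.0
lfc_from_zero = log2_fold_change(0.0, 3.0)

print(diff_ba, diff_cb, lfc_ba, lfc_cb, lfc_from_zero)
```

Swapping the two arguments flips the sign, which is why negative values read as down-regulation. Without a pseudocount, `math.log2(0)` raises a `ValueError` in Python, which is the situation the added 1 is meant to avoid.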
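Preface

When analyzing data in biology, physiology, biochemistry, or bioinformatics, the phrase you hear most often is probably "differential (expression) gene analysis". From RT-PCR at the start, to microarrays, then RNA-seq, and now single cell RNA-seq, everything revolves around differentially expressed genes. (A speculation: the next step will probably be measuring the dynamic expression of specific genes in specific locations inside a cell.)

Expression level: we assume that the amount of mRNA produced when a gene is transcribed reflects the gene's activity and also affects downstream proteins and metabolites. The focus here is gene expression, not gene structure and not isoforms.

Why is differential expression analysis so popular? First, the central dogma is well established, and gene expression is one of its core steps, determining the downstream proteome and metabolome. Second, library preparation and sequencing have become widespread, so obtaining gene expression levels is now easy.

In a living organism, gene expression changes dynamically all the time and does not necessarily follow a uniform distribution. It varies with time, developmental stage, tissue, and environmental stimuli.

Differential gene expression analysis is mainly applied in: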
Questions for deeper discussion:

How much do we currently understand about genes and the transcriptome?
What are the basic library preparation methods? Library preparation directly determines which sequences we can detect, and therefore which analyses we can do.
What normalization methods exist for gene expression?
What are type I and type II errors?
How are multiple tests corrected? (FDR)

Explanation of the 10x pipeline

The mean UMI counts per cell of this gene in cluster i

The differential expression analysis seeks to find, for each cluster, genes that are more highly expressed in that cluster relative to the rest of the sample. Here a differential expression test was performed between each cluster and the rest of the sample for each gene. The Log2 fold-change (L2FC) is an estimate of the log2 ratio of expression in a cluster to that in all other cells. A value of 1.0 indicates 2-fold greater expression in the cluster of interest. The p-value is a measure of the statistical significance of the expression difference and is based on a negative binomial test. The p-value reported here has been adjusted for multiple testing via the Benjamini-Hochberg procedure. In this table you can click on a column to sort by that value. Also, in this table genes were filtered by (Mean UMI counts > 1.0) and the top N genes by L2FC for each cluster were retained. Genes with L2FC < 0 or adjusted p-value >= 0.10 were grayed out. The number of top genes shown per cluster, N, is set to limit the number of table entries shown to 10000; N=10000/K^2 where K is the number of clusters. N can range from 1 to 50. For the full table, please refer to the 'differential_expression.csv' files produced by the pipeline.

Comparison of single-cell DEG identification tools

For data with a high level of multimodality, methods that consider the behavior of each individual gene, such as DESeq2, EMDomics, Monocle2, DEsingle, and SigEMD, show better TPRs.

These tools are highly sensitive: they miss few true DEGs, but they also tend to include many false ones.

Time-course DEG analysis

Comparative analysis of differential gene expression tools for RNA sequencing time course data

References:
Question: How to calculate 'fold changes' in gene expression?
Exact Negative Binomial Test with edgeR
Differential gene expression analysis
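As an appendix to the multiple-testing correction mentioned in the 10x description above, here is a minimal sketch of the Benjamini-Hochberg adjustment in plain Python. It is an illustration only and makes no claim about how the 10x pipeline or any DEG tool implements the step: the function name `benjamini_hochberg` and the raw p-values are hypothetical, and only the 0.10 cutoff is taken from the quoted text.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values, in the input order.

    Illustrative sketch; not the code used by any particular pipeline.
    """
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw p-values for four genes.
raw_p = [0.01, 0.04, 0.03, 0.20]
adj_p = benjamini_hochberg(raw_p)

# Flag genes whose adjusted p-value is below the 0.10 cutoff quoted above.
significant = [q < 0.10 for q in adj_p]
print(adj_p, significant)
```

Each raw p-value is scaled by m / rank, where m is the number of tests and rank is its position in ascending order; the running minimum keeps a larger raw p-value from ending up with a smaller adjusted value than a smaller one.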