Friendly reminder: this post covers many concepts, so readers are advised to focus on the passages highlighted in blue.

TF

Transcription factor (TF): a protein that regulates gene transcription by binding specifically to DNA sequences in regulatory regions. A single transcription factor can regulate several genes at once:

In molecular biology, a transcription factor (TF) (or sequence-specific DNA-binding factor) is a protein that controls the rate of transcription of genetic information from DNA to messenger RNA, by binding to a specific DNA sequence. TFs are key regulators of biological processes that function by binding to transcriptional regulatory regions (e.g., promoters, enhancers) to control the expression of their target genes.
The human genome encodes more than 2,000 TFs.

Transcription factor binding site (TFBS): the DNA sequence that a transcription factor binds to, usually 5 to 20 bp long. The binding sites of one transcription factor on different genes are conserved to some degree but are not identical:

Transcription factor binding motifs (TFBMs) are genomic sequences that specifically bind to transcription factors. The consensus sequence of a TFBM is variable, and there are a number of possible bases at certain positions in the motif, whereas other positions have a fixed base.
Transcription factor binding motif (TFBM): the terms "binding site" and "binding motif" are often used interchangeably. For the distinction, see the following description from one paper:

A single TF can recognize dozens to hundreds of DNA binding site sequences over a range of binding affinities. Hence, the TF binding specificity (i.e., preferential binding of specific sequences) cannot be adequately represented using any one DNA sequence. Instead, TF binding specificities are often represented as binding site motifs, which summarize the collection of preferentially bound sequences. These motifs can be used to scan sequences of interest (e.g., genomic regions) to predict TF binding sites.
In other words, a motif summarizes all the possible binding sites (TFBSs) of a TF and is used to describe the specificity of those binding sites.

motif

Motifs are a more practical representation of consensus elements in biological sequences, allowing for a more detailed description of the variability at each site. Common types of motifs that are responsible for binding to DNA can be found in different transcription factors. Each TF typically recognizes a collection of similar DNA sequences, which can be represented as binding site motifs using models such as position weight matrices (PWMs).
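A motif can be represented with a variety of methods and models. As an example, suppose the binding-site sequences of a transcription factor are as follows:

[Figure: six aligned 6-bp binding-site sequences; not reproduced here]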
The most basic representation is the consensus sequence:

A collection of DNA binding sites, typically referred to as a DNA binding motif, can be represented by a consensus sequence. Given a set of sequences, a consensus sequence (also called canonical sequence) is the sequence obtained by taking the most frequent residues of nucleic acids / amino acids at each position.
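That is, from a given set of sequences, the consensus is the sequence made up of the most frequent base at each position. In this example it is AAGAAA.

https://www.commonlounge.com/discussion/912b207972304bf3a337e5473eca32ac

Simple as it is, this representation clearly trades away accuracy and overgeneralizes: the final sequence says nothing about the other bases that can occur at a position. You can use IUPAC codes to represent two or more possible bases (for example, the second position can be A or T, which IUPAC writes as W), but that still cannot express information such as how likely each base is.

http://www.bioinformatics.org/sms2/iupac.html

A more accurate model is therefore needed to represent a motif.

1. Position Frequency Matrix (PFM), also called Position Count Matrix (PCM). Each value is the number of times a base occurs at a given position across all sequences. The number of columns equals the sequence length, and each column sums to 6 (there are 6 sequences in total). For example, the first base of every sequence is A, so in the first column A is 6 and every other base is 0.

2. Position Probability Matrix (PPM). Each value is the frequency of a base at a given position (the count of that base divided by the column total).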
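Because the example figures are not reproduced here, the following is a minimal sketch of how a consensus sequence, an IUPAC consensus, a PFM and a PPM could be derived from an alignment. The six sequences in `SITES` are an assumption: they were chosen only because they reproduce the column counts quoted in the text (all A at position 1, A or T at position 2, and so on) and may differ from the original figure. All function names are illustrative.

```python
from collections import Counter

# ASSUMPTION: hypothetical alignment, chosen to match the column counts
# quoted in the text; the original figure is not available.
SITES = ["AAGAAT", "ATCATA", "AAGTAA", "AACAAA", "ATTAAA", "AAGAAT"]
BASES = "ACGT"

# IUPAC nucleotide codes, keyed by the set of bases each code stands for.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V",
    frozenset("ACGT"): "N",
}


def position_frequency_matrix(sites):
    """Return the PFM as one Counter of base counts per alignment column."""
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all sites must have the same length")
    return [Counter(s[i] for s in sites) for i in range(length)]


def consensus(pfm):
    """Most frequent base per column; ties go to the first base in BASES."""
    return "".join(max(BASES, key=lambda b: col[b]) for col in pfm)


def iupac_consensus(pfm):
    """One IUPAC code per column, covering every base observed there."""
    return "".join(
        IUPAC[frozenset(b for b in BASES if col[b] > 0)] for col in pfm
    )


def position_probability_matrix(pfm):
    """Divide each count by its column total."""
    return [{b: col[b] / sum(col.values()) for b in BASES} for col in pfm]


pfm = position_frequency_matrix(SITES)
ppm = position_probability_matrix(pfm)

print(consensus(pfm))        # AAGAAA
print(iupac_consensus(pfm))  # AWBWWW
for base in BASES:
    counts = [col[base] for col in pfm]
    probs = [round(col[base], 2) for col in ppm]
    print(base, counts, probs)
```

Storing the matrix column by column keeps each position's counts together, which is convenient because every later step (probabilities, scores) works one position at a time.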
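Each column sums to 1, and the columns are independent of one another. From the probability of each base at each position you can infer the probability of a given sequence. For example, the probability of AAGAAA is about 15% (= 1 × 0.67 × 0.5 × 0.83 × 0.83 × 0.67). If there are only a few starting sequences, the PPM will contain many zeros, which can be corrected by adding pseudocounts.

3. Position Weight Matrix (PWM), also called position-specific weight matrix (PSWM), position-specific scoring matrix (PSSM), or log-odds scoring matrix (LSM). A PWM is made up of Score values:

Each column provides a score per nucleotide representing the relative preference for the given base at that position in the binding site.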
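The most common way to compute the Score is to normalize the observed base frequency by the background frequency of that base (its chance of occurring at random) and then take the logarithm:

Score = log2(ProbN / background probability of N)

Here ProbN is the probability of base N at the given position, taken from the PPM.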
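It follows from this formula that the Score is positive when a base is more likely than background at that position and negative otherwise. Assuming a background probability of 0.25 for every base, the PWM for this example can be computed position by position.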
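As a sketch of that calculation, the snippet below builds the log-odds matrix from a PPM with a uniform background of 0.25. The sequences in `SITES` are hypothetical (picked to agree with the column frequencies quoted in the text, since the original figure is not available), and the function names are illustrative.

```python
import math

# ASSUMPTION: hypothetical alignment consistent with the frequencies
# quoted in the text; not taken from the original figure.
SITES = ["AAGAAT", "ATCATA", "AAGTAA", "AACAAA", "ATTAAA", "AAGAAT"]
BASES = "ACGT"
BACKGROUND = {base: 0.25 for base in BASES}


def ppm_from_sites(sites):
    """Per-column base frequencies (count / number of sequences)."""
    n = len(sites)
    return [
        {b: sum(s[i] == b for s in sites) / n for b in BASES}
        for i in range(len(sites[0]))
    ]


def pwm_from_ppm(ppm, background):
    """Score = log2(probability / background); probability 0 gives -inf."""
    return [
        {
            b: math.log2(col[b] / background[b]) if col[b] > 0 else -math.inf
            for b in BASES
        }
        for col in ppm
    ]


pwm = pwm_from_ppm(ppm_from_sites(SITES), BACKGROUND)
for base in BASES:
    print(base, [round(col[base], 3) for col in pwm])
```

Bases that never occur at a position are given a score of negative infinity explicitly, because `math.log2(0)` raises an error rather than returning `-inf`.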
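Taking the T at the second position as an example: Score = log2((2/6) / 0.25) ≈ 0.415. The Score of a specific sequence can be computed in the same way, by adding up the Score at each position:

In order to score a sequence, add up the score for the letters at the specific positions.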
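For the sequence AAGAAA: Score = 2 + 1.415 + 1 + 1.737 + 1.737 + 1.415 = 9.304.

As with the PPM, the matrix clearly contains many negative-infinity values (-Inf), so some sequences end up with a Score of negative infinity (AAAAAA, for example). That rules the sequence out entirely and may lose important information. So a pseudocount correction can likewise be applied to ProbN.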
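The effect of the correction can be sketched as follows. The pseudocount formula used here, (count + pseudocount) / (N + 4 × pseudocount), is one common choice and an assumption; it is not necessarily the formula the original figure showed. The sequences in `SITES` are likewise hypothetical, chosen to match the frequencies quoted in the text.

```python
import math

# ASSUMPTION: hypothetical alignment consistent with the text.
SITES = ["AAGAAT", "ATCATA", "AAGTAA", "AACAAA", "ATTAAA", "AAGAAT"]
BASES = "ACGT"


def pwm_from_sites(sites, pseudocount=0.0, background=0.25):
    """Log-odds PWM; `pseudocount` is added to every base count per column."""
    n = len(sites)
    pwm = []
    for i in range(len(sites[0])):
        column = {}
        for base in BASES:
            count = sum(s[i] == base for s in sites)
            # ASSUMPTION: Laplace-style smoothing over the four bases.
            prob = (count + pseudocount) / (n + len(BASES) * pseudocount)
            column[base] = (
                math.log2(prob / background) if prob > 0 else -math.inf
            )
        pwm.append(column)
    return pwm


def score(sequence, pwm):
    """Sum the per-position scores of a sequence as long as the motif."""
    if len(sequence) != len(pwm):
        raise ValueError("sequence length must equal motif length")
    return sum(column[base] for base, column in zip(sequence, pwm))


raw = pwm_from_sites(SITES)
smoothed = pwm_from_sites(SITES, pseudocount=1.0)

print(round(score("AAGAAA", raw), 3))  # 9.304
print(score("AAAAAA", raw))            # -inf
print(round(score("AAAAAA", smoothed), 3))
```

With the pseudocount, AAAAAA gets a finite (lower) score instead of being excluded outright, while AAGAAA still scores highest.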
As this shows, the matrix models above can easily be converted into one another.
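The chain of conversions can be written compactly. The notation below is introduced here for illustration and does not appear in the original: c is a PFM count, N the number of sequences, p a PPM probability, q the background probability, s a PWM score, and x a sequence of length L.

```latex
p_{b,i} = \frac{c_{b,i}}{N}, \qquad
s_{b,i} = \log_2 \frac{p_{b,i}}{q_b}, \qquad
\sum_{i=1}^{L} s_{x_i,i}
  = \log_2 \frac{\prod_{i=1}^{L} p_{x_i,i}}{\prod_{i=1}^{L} q_{x_i}}
```

The last identity ties the two worked examples together: for AAGAAA with q = 0.25, 2^9.304 × 0.25^6 ≈ 0.154, which is the roughly 15% probability obtained from the PPM.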
Genes regulated by TFs

Once the motif of a TF has been determined and represented as a PWM, one usually also wants to identify the genes regulated by that TF. Putative target genes can be found by checking whether a gene's promoter region contains the motif bound by the TF:

In addition to determine the sequence specificities of a TF and represent this specificities as a PWM, one usually wants to identify genes being regulated by this TF. Putative targets of a TF can be determined by finding genes whose promoter region contains the motif bound by that TF.
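A minimal sketch of such a scan is shown below: slide a window of motif length along a promoter sequence, score each window with the PWM, and keep windows above a cutoff. Everything specific here is an assumption for illustration: the `SITES` alignment, the made-up `promoter` string, the pseudocount of 1 and the threshold of 4.0. Only the forward strand is scanned; a real analysis would normally also scan the reverse complement and choose the threshold more carefully.

```python
import math

# ASSUMPTION: hypothetical alignment consistent with the text.
SITES = ["AAGAAT", "ATCATA", "AAGTAA", "AACAAA", "ATTAAA", "AAGAAT"]
BASES = "ACGT"


def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Log-odds PWM with a pseudocount so that no score is -inf."""
    n = len(sites)
    total = n + len(BASES) * pseudocount
    return [
        {
            b: math.log2(
                (sum(s[i] == b for s in sites) + pseudocount)
                / total / background
            )
            for b in BASES
        }
        for i in range(len(sites[0]))
    ]


def scan(sequence, pwm, threshold):
    """Yield (0-based start, window, score) for windows scoring >= threshold."""
    width = len(pwm)
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if any(base not in BASES for base in window):
            continue  # skip windows containing N or other ambiguity codes
        window_score = sum(col[base] for base, col in zip(window, pwm))
        if window_score >= threshold:
            yield start, window, window_score


# ASSUMPTION: made-up promoter sequence and an arbitrary cutoff.
promoter = "GCGCAAGAAACGTTGCATTAAAGC"
hits = list(scan(promoter, build_pwm(SITES), threshold=4.0))
for start, window, window_score in hits:
    print(start, window, round(window_score, 3))
```

Genes whose promoters yield at least one hit would then be reported as putative targets, which remain predictions until confirmed experimentally.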

[Figure: schematic of the promoter region]
In genetics, a promoter is a region of DNA that initiates transcription of a particular gene. Promoters are located near the transcription start sites (TSS) of genes, on the same strand and upstream on the DNA.
The promoter region is located relative to the transcription start site (TSS) and is generally defined as the 2 kb upstream of it:

As promoters are typically immediately adjacent to the gene in question, positions in the promoter are designated relative to the transcriptional start site, where transcription of DNA begins for a particular gene (i.e., positions upstream are negative numbers counting back from -1, for example -100 is a position 100 base pairs upstream). Promoters can be about 100–1000 base pairs long.
https://en.wikipedia.org/wiki/Promoter_(genetics)
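The coordinate convention can be sketched in code. The helper names below are hypothetical, coordinates are assumed to be 1-based and inclusive, and the 2 kb default follows the working definition given above. On the minus strand "upstream" means larger genomic coordinates, which is the part that is easy to get wrong.

```python
def promoter_interval(tss, strand, upstream=2000):
    """Return the 1-based inclusive (start, end) of the region upstream of the TSS.

    The TSS itself is not included. On the plus strand the start is clamped
    to 1; the minus-strand end is not clamped because the chromosome length
    is not known here.
    """
    if strand == "+":
        return max(1, tss - upstream), tss - 1
    if strand == "-":
        return tss + 1, tss + upstream
    raise ValueError("strand must be '+' or '-'")


def relative_position(pos, tss, strand):
    """Position relative to the TSS: the TSS is +1, the base before it is -1."""
    offset = pos - tss if strand == "+" else tss - pos
    # There is no position 0 in this convention, so shift downstream by one.
    return offset + 1 if offset >= 0 else offset


print(promoter_interval(10_000, "+"))          # (8000, 9999)
print(promoter_interval(10_000, "-"))          # (10001, 12000)
print(relative_position(9_900, 10_000, "+"))   # -100
print(relative_position(10_100, 10_000, "-"))  # -100
```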