
GWAS | Principles and workflow | Genome-wide association analysis, Manhattan plot (Manhattan_plot) | QQ p...

 生物狗在求救 2021-05-07

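Terminology and basic questions:

Association analysis: this is the "AS" in GWAS (genome-wide association study). It uses the millions of single nucleotide polymorphisms (SNPs) in the genome as molecular genetic markers and runs case-control or correlation analyses at the whole-genome level, to find by comparison the genetic variants that influence complex traits. Variants are genotyped genome-wide, the frequency of each variant is compared between the case and control groups, the strength of association between each variant and the target trait is tested statistically, and the most strongly associated variants are selected for validation. The validation results finally confirm their association with the trait.

Linkage disequilibrium (LD): if there is no LD, the two loci are independent and their alleles combine at random, so P(AB) = P(A) * P(B), where P(AB) is the frequency at which the haplotype carrying both A and B is actually observed in the population. In general, P(AB) = D + P(A) * P(B), where D measures the degree of LD between the two loci.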

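Manhattan plot: in biology and statistics, when looking at frequency statistics, mutation distributions or GWAS association results, we often see a Manhattan plot, which shows the distribution and values of candidate sites at a glance. It needs site coordinates and p-values. The map file contains at least three columns: chromosome, SNP name and SNP physical position. The assoc file contains the SNP name and p-value. Haploview can draw the plot from these.

What is a SNP, fundamentally? In the broad sense it is a variant: the most common type of genetic variation, alongside indels, CNVs and SVs. Each SNP represents a difference in a single DNA building block, called a nucleotide. In the narrow sense it is a marker: a biological marker. Because a SNP is a single base, it is also a site that marks one position on a chromosome. Most of the genome (more than 99%) is identical between people; SNP sites are the positions that vary across the population. These differences/markers can be used to study disease: statistical methods find the sites most strongly associated with a disease and so identify its risk alleles.

How does a SNP array work? A SNP array does not read a single base; it calls alleles. The genotype at each SNP therefore takes one of three values (1 - AA; 2 - AB; 3 - BB), which may also be coded 0, 1, 2.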

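How does linkage disequilibrium (LD) differ from pairwise correlation?

How do you tell somatic from germline mutations? In multicellular organisms, mutations can be classed as either somatic or germ-line. Telling them apart usually requires sequencing trios or matched healthy tissue. The clearest example is cancer, where most variations are somatic.

What is the difference between SNP, variant and mutation? By convention, "SNP" usually implies a neutral change, while "mutation" is usually used for a change linked to disease. The second difference is frequency: SNPs are frequent in the population, mutations are rare. "Variant" and "variation" are synonyms and are the neutral, broader term; a SNP is one kind of variant, although in GWAS the two words are often used interchangeably.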

Why do we still need haplotypes? What motivated the HapMap project? The HapMap is valuable by reducing the number of SNPs required to examine the entire genome for association with a phenotype from the 10 million SNPs that exist to roughly 500,000 tag SNPs.

What distinguishes a common variant from a rare variant? How should "common" and "rare" be read in a paper? If a variant is a SNP and a SNP is a site, how can a site be common or uncommon? This is a little counterintuitive. "Common" here describes the minor allele, that is, the second most common allele. Take the SNP rs78601809: its position is known, its allele frequency in different populations is known, and its overall MAF is 0.39 (T). A SNP with MAF < 1% is a rare variant; intuitively, the base at that site rarely differs between people. Cutoffs vary between studies, for example: rare variants (MAF < 0.05) appeared more frequently in coding regions than common variants (MAF > 0.05) in this population

Genetic variants that are outside the reach of the most statistically powered association studies [13] are thought to contribute to the missing heritability of many human traits, including common variants (here denoted by minor allele frequency [MAF] >5%) of very weak effect, low-frequency (MAF 1–5%) and rare variants (MAF <1%) of small to modest effect, or a combination of both, with several possible scenarios all deemed plausible in simulation studies [14]. 
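The frequency classes in the quote above can be made concrete with a short sketch. It assumes genotypes arrive as two-letter strings such as "CT" with None for missing calls; the function names and the input layout are illustrative assumptions, not part of any tool mentioned in this post.

```python
from collections import Counter


def minor_allele_frequency(genotypes):
    """Return (minor_allele, maf) for one biallelic SNP.

    genotypes: diploid calls as two-character strings ("CC", "CT", ...);
    None marks a missing call and is skipped (assumed input layout).
    """
    counts = Counter()
    for call in genotypes:
        if call is None:
            continue
        counts.update(call)  # adds both allele characters of the call
    total = sum(counts.values())
    if total == 0 or len(counts) < 2:
        return None, 0.0  # no data, or the site is monomorphic
    # The minor allele is the second most common allele.
    minor, minor_count = counts.most_common(2)[1]
    return minor, minor_count / total


def frequency_class(maf):
    """Classify by the cutoffs quoted above: >5%, 1-5%, <1%."""
    if maf > 0.05:
        return "common"
    if maf >= 0.01:
        return "low-frequency"
    return "rare"


calls = ["CC", "CT", "TT", "CC", "CT", None, "CC", "CC"]
allele, maf = minor_allele_frequency(calls)
print(allele, round(maf, 3), frequency_class(maf))
```

Note that the MAF is a property of the population sample, so the same SNP can be common in one population and rare in another.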

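Why is genetics so fixated on MAF?

Because, from an evolutionary perspective, risk alleles are more likely to be minor alleles, owing to natural selection. This is not absolute, but there is an enrichment. See the article: Are minor alleles more likely to be risk alleles?

Common variants together account for a small proportion of heritability estimated from family studies. Common variants usually lie in non-coding regions, make up a small fraction of all variants, and have relatively small effect sizes.

What do "small effect" and "large effect" mean for a SNP? They refer to effect size.

Terms that are very easy to confuse: SNP, mutation, variant, allele, genotype; allele frequency, genotype frequency, alternative allele frequency, MAF. You must be able to tell these apart quickly, otherwise the statistical genetics you do will not be sound.

What are gene-based rare-variant burden tests for? See: Increased Burden of Rare Variants Among S-HSCR.

What are epistatic effects?

Why is L-HSCR described as autosomal dominant? It is hard to call the relationship fully linear; dominance is complicated, with incomplete dominance and dosage effects.

How should alleles and dominance be viewed at the DNA sequence level? An allele is one version of the sequence at a gene; the classical notion of an allele is very abstract. Dominant vs. recessive: we are diploid, so we carry two alleles of every gene. In a heterozygote the two sequences differ, so the proteins they express can differ too, and the two alleles have a complex dominance relationship. Conventional gene-level expression analysis is therefore quite coarse; ideally expression would be measured at the isoform level, since there is still some distance between a gene and its protein. A large part of the reason this is not yet done at the isoform level is that our understanding of proteins is still limited.

A new research question: how did human populations worldwide gradually diverge into those of today, and which core genetic factors determine the phenotypic differences between them? Next, why do some diseases show markedly different frequencies across populations, and why is the incidence of HSCR higher in Asians? What role do genetic factors play?

Genetic effects: Additive genetic effects occur when two or more genes source a single contribution to the final phenotype, or when alleles of a single gene (in heterozygotes) combine so that their combined effects equal the sum of their individual effects.[1][2] Non-additive genetic effects involve dominance (of alleles at a single locus) or epistasis (of alleles at different loci). In other words, under an additive model disease risk is proportional to the number of risk alleles.

How many variants/SNPs are there in the human genome? The 1000 Genomes data give about 84 million. This is a conservative figure, because it covers only 2504 people, roughly 100 per population; that is reasonably representative, but the true number is certainly higher. A rough guess of 300 million would mean about 10% of the roughly 3 billion bases, or one variant site per 10 bases. For a single individual the number is around 3 million, about one per thousand bases.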

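First, an intuitive view of how GWAS works:

The core is the association between SNPs and phenotype. For each genomic site, if a SNP tends to appear together with a disease, so that the SNP and the phenotype vary together, we can infer that the SNP is very likely associated with that phenotype (disease).

More formally, we ask whether a SNP shows a significant difference in allele frequency between the case and control populations.

In practice, sample sizes are limited, cases and controls are sometimes unbalanced, samples differ by sex and ancestry, and a statistical test has to be run at every one of millions of sites across a genome of about 3 billion bases.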
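The case/control comparison of allele frequencies can be sketched as a chi-square test on a 2x2 table of allele counts. This is a minimal illustration of the idea, not the test any particular tool runs by default; the function name and the example counts are invented for the sketch, and it assumes no cell of the table is zero.

```python
import math


def allelic_chi_square(case_alt, case_ref, ctrl_alt, ctrl_ref):
    """Pearson chi-square test (1 df) on allele counts, cases vs controls.

    Each argument is a count of alleles (two per diploid individual).
    Returns (chi2, p_value, odds_ratio). Assumes all counts are > 0.
    """
    table = [[case_alt, case_ref], [ctrl_alt, ctrl_ref]]
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [table[0][j] + table[1][j] for j in range(2)]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (table[i][j] - expected) ** 2 / expected
    # For 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    odds_ratio = (case_alt * ctrl_ref) / (case_ref * ctrl_alt)
    return chi2, p_value, odds_ratio


# Hypothetical SNP: 1000 cases and 1000 controls, i.e. 2000 alleles each.
chi2, p, odds = allelic_chi_square(600, 1400, 500, 1500)
print(f"chi2={chi2:.2f} p={p:.2e} OR={odds:.2f}")
```

Because this test is repeated for every SNP, a single p-value below 0.05 means little on its own; the threshold has to be corrected for the number of tests.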

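Which metrics should we design to evaluate the association between a SNP and a phenotype?

Think: what if one site has several SNPs and only one of them is associated with the disease? That premise is a misconception: a genomic site can hold only one SNP, although that SNP can have several alleles.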


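Remember: each point in a Manhattan plot is a SNP, not a sample.

Think: in a Manhattan plot, significant SNPs do not stick out alone; they seem to be lifted up by their neighbors, like a skyscraper rising gradually from its base. Each tower is a group of SNPs that are linked together, in high LD with each other.

Think: although each point in a Manhattan plot is a SNP, the most significant SNP is usually assigned to a gene, because what people care about most is the causal basis of the SNP. Found this way, however, only coding-region SNPs get identified.

Note: the top SNP is very likely not the causal SNP; it is merely near the causal SNP. So how do we find the causal SNP? Fine mapping.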

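Basic background

What is a SNP? A single-base mutation that arose at random during evolution and is stably inherited in the population.

What is allele frequency in a population? Every variable genomic site has two or more alleles, and different alleles differ clearly in frequency. A simple way to think of it is the frequencies of A and a, except that here they refer to a base position rather than a trait gene.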

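Prerequisites for a GWAS

The sample size must be large enough. Anyone who has studied statistics knows that sample size affects power, and without enough power you cannot draw correct conclusions. GWAS usually needs many samples: a few thousand is standard, a few hundred is too few, and some studies now reach tens or hundreds of thousands.

A common misconception is that GWAS uses whole-genome sequencing (WGS). It usually does not, because that is too expensive. Most studies use a DNA chip (properly, a SNP array), which covers only about 10^6 common SNPs. With a bigger budget, WES gives all coding-region SNPs. With the biggest budget, WGS detects everything: coding and non-coding, common and rare. This is what makes the 1000 Genomes data so valuable.

That covers the general principle. There is also the statistical theory, which we skip for now to go straight to practice.

How do you run a GWAS with PLINK? The YouTube video "GWAS in Plink" comes with a paper, example data and code to download, so you can run it to get familiar.

References: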

Analysis of Microarray Data

Single nucleotide polymorphism arrays: a decade of biological, computational and technological advances

Genotype Calling (CRLMM) and Copy Number Analysis tool for Affymetrix SNP 5.0 and 6.0 and Illumina arrays

Discriminating somatic and germline mutations in tumor DNA samples without matching normals

The impact of rare and low-frequency genetic variants in common disease


A published GWAS pipeline: A tutorial on conducting genome‐wide association studies: Quality control and statistical analysis.

GitHub repository

The following focuses on the operational details of this pipeline:

It covers four main analyses:

  1. All essential GWAS QC steps along with scripts for data visualization.

  2. Dealing with population stratification, using 1000 genomes as a reference.

  3. Association analyses of GWAS data.

  4. Polygenic risk score (PRS) analyses.

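First, the PLINK text file formats:

ped: rows are individuals; columns are the phenotype and the SNP genotype data;

map: per-SNP information;

The binary format has three files:

Essentially, the ped file is split into fam and bed, and map becomes bim.

Covariate analysis is usually needed, so there is also a covariate file.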


QC:

| Step | Command | Function |
| --- | --- | --- |
| 1: Missingness of SNPs and individuals | --geno | Excludes SNPs that are missing in a large proportion of the subjects. In this step, SNPs with low genotype calls are removed. |
| | --mind | Excludes individuals who have high rates of genotype missingness. In this step, individuals with low genotype calls are removed. |
| 2: Sex discrepancy | --check-sex | Checks for discrepancies between sex of the individuals recorded in the dataset and their sex based on X chromosome heterozygosity/homozygosity rates. |
| 3: Minor allele frequency (MAF) | --maf | Includes only SNPs above the set MAF threshold. |
| 4: Hardy–Weinberg equilibrium (HWE) | --hwe | Excludes markers which deviate from Hardy–Weinberg equilibrium. |
| 5: Heterozygosity | For an example script see https://github.com/MareesAT/GWA_tutorial/ | Excludes individuals with high or low heterozygosity rates. |
| 6: Relatedness | --genome | Calculates identity by descent (IBD) of all sample pairs. |
| | --min | Sets threshold and creates a list of individuals with relatedness above the chosen threshold. Meaning that subjects who are related at, for example, pi-hat >0.2 (i.e., second degree relatives) can be detected. |
| 7: Population stratification | --genome | Calculates identity by descent (IBD) of all sample pairs. |
| | --cluster --mds-plot k | Produces a k-dimensional representation of any substructure in the data, based on IBS. |
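To make the per-SNP filters in steps 1, 3 and 4 concrete, here is a rough sketch of the three checks on one SNP. It assumes genotypes are coded as 0/1/2 copies of one allele with None for missing. The HWE check uses a simple chi-square approximation, whereas PLINK's --hwe uses an exact test, so numbers will differ. The function name and default thresholds are illustrative only.

```python
import math


def snp_qc(calls, geno=0.05, maf=0.05, hwe=1e-4):
    """Apply missingness, MAF and HWE filters to one SNP.

    calls: one genotype per sample, coded 0/1/2 copies of an allele,
    None = missing (assumed coding). Returns a dict with the statistics
    and a 'keep' flag.
    """
    observed = [c for c in calls if c is not None]
    n = len(observed)
    missing_rate = 1 - n / len(calls)
    n_aa, n_ab, n_bb = (observed.count(k) for k in (0, 1, 2))
    p = (2 * n_bb + n_ab) / (2 * n) if n else 0.0  # frequency of counted allele
    minor_freq = min(p, 1 - p)
    hwe_p = 1.0
    if 0 < p < 1:
        # Expected genotype counts under Hardy-Weinberg proportions.
        expected = (n * (1 - p) ** 2, 2 * n * p * (1 - p), n * p ** 2)
        chi2 = sum((o - e) ** 2 / e for o, e in zip((n_aa, n_ab, n_bb), expected))
        hwe_p = math.erfc(math.sqrt(chi2 / 2))  # chi-square, 1 df
    keep = missing_rate <= geno and minor_freq >= maf and hwe_p >= hwe
    return {"missing": missing_rate, "maf": minor_freq, "hwe_p": hwe_p, "keep": keep}


# A SNP in perfect Hardy-Weinberg proportions for allele frequency 0.3.
print(snp_qc([0] * 49 + [1] * 42 + [2] * 9))
# A SNP where every sample is heterozygous, a typical genotyping artifact.
print(snp_qc([1] * 100))
```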

fine mapping

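As background, GWAS only became widespread around 2007, which is why the well-known review on ten years of GWAS came out in 2017; fine mapping arose after GWAS.

The lab started working on fine mapping early: 2009 - Fine mapping of the 9q31 Hirschsprung’s disease locus

From the introduction: what is fine mapping?

The goal is simple: most variants found by GWAS are not the causal variants, and fine mapping fills this gap.

After GWAS gives a rough set of SNPs, two further analyses are needed:

The first assigns each SNP a probability of causality; this is fine-mapping. The second uses functional annotation to connect the SNP to the gene whose perturbation is likely to alter disease risk.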

The first is to assign well-calibrated probabilities of causality to candidate variants, known as fine-mapping. The second step is to try to connect these variants to likely genes whose perturbation leads to altered disease risk by functional annotation. 

Basic principles:

Strategies for fine-mapping complex traits - 

Integrative modeling of eQTLs and cis-regulatory elements suggests mechanisms underlying cell type specificity of eQTLs

Although eQTLs are increasingly used to provide mechanistic interpretations for human disease associations, the cell type specificity of eQTLs presents a problem. Because the cell type from which a given physiological phenotype arises may not be known, and because eQTL data exist for a limited number of cell types, it is critical to quantify and understand the mechanisms generating cell type specific eQTLs. For example, if a GWAS identifies a set of SNPs associated with risk of type II diabetes, the researcher must choose a target cell type to develop a mechanistic model of the molecular phenotype that causes the gross physiological change. One can imagine that the relevant cell type might be adipose tissue, liver, pancreas, or another hormone-regulating tissue. Furthermore, if the GWAS SNP produces a molecular phenotype (i.e., is an eQTL) in lymphoblastoid cell lines (LCLs), it is not necessarily the case that the SNP will generate a similar molecular phenotype in the cell type of interest. Furthermore, there are many examples of cell types with particular relevance to common diseases, for example dopaminergic neurons and Parkinson's disease, that lack comprehensive eQTL data or catalogs of CREs. The utility of eQTLs for complex trait interpretation will therefore be improved by a more thorough annotation of their cell type specificity.

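The biggest problem with eQTLs is still insufficient cell-type specificity; the key is to define cell types precisely enough.


GWAS is by now a fairly old technique, and it has hit a serious bottleneck: a bare association between SNP and phenotype is no longer enough. What is needed is a concrete biological explanation of how these SNPs lead to disease.

Moreover, for most diseases GWAS finds not a few significant SNPs but many, and most of them fall in non-coding regions, which makes them very hard to interpret.

For a classic interpretation, see this article in the New England Journal of Medicine: FTO Obesity Variant Circuitry and Adipocyte Browning in Humans

A GWAS has two core outputs, the Manhattan plot and the QQ-plot; being able to read them is enough.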
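A QQ-plot compares the observed p-values with what a uniform distribution would give under the null hypothesis of no association. The sketch below only computes the point coordinates, leaving the drawing to any plotting library; the (rank - 0.5) / n plotting position is one common convention chosen here as an assumption, and the function name is illustrative.

```python
import math


def qq_points(p_values):
    """Return (expected, observed) pairs of -log10(p) for a QQ-plot.

    Under the null, sorted p-values should follow a uniform distribution,
    so the i-th smallest of n is expected near (i - 0.5) / n.
    All p-values must be > 0.
    """
    n = len(p_values)
    points = []
    for rank, p in enumerate(sorted(p_values), start=1):
        expected_p = (rank - 0.5) / n
        points.append((-math.log10(expected_p), -math.log10(p)))
    return points


for expected, observed in qq_points([0.5, 0.05, 0.005, 0.9]):
    print(f"expected={expected:.3f} observed={observed:.3f}")
```

Points lying on the diagonal indicate no inflation; a systematic lift of the whole curve usually points to confounding such as population stratification, while a lift only in the tail is what true associations look like.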

Simply being able to run a GWAS pipeline is no longer worth much; the emphasis is now on downstream analysis. A few hot topics:

  • Polygenic risk score (PRS) analyses

  • meta-analysis

The International HapMap Project (http://hapmap.ncbi.nlm.nih.gov/; Gibbs et al., 2003) described the patterns of common SNPs within the human DNA sequence whereas the 1000 Genomes (1KG) project (http://www./; Altshuler et al., 2012) provided a map of both common and rare SNPs.

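Common and rare are defined by allele frequency, but there seems to be no single fixed cutoff.

HapMap used arrays, so everything it measured was a hand-picked site, hence common SNPs; 1000 Genomes used WGS, so it covers all sites, common and rare alike.

The core of GWAS is LD. Most GWAS today genotype with arrays because they are cheap.

GWAS misses many sites, which is why fine-mapping exists; imputation based on haplotypes fills in some of them.

Linkage disequilibrium (LD): alleles at different loci occur in the population at certain frequencies. LD is the phenomenon in which, within a population, two alleles at different loci appear on the same chromosome more often than expected by chance. (That is, Mendelian assortment is not random: the closer two alleles are on a chromosome, the more they tend to stay together, which is a physical constraint.)

For example, take two neighboring loci with alleles A/a and B/b. If the two loci were inherited independently, the expected frequency of the haplotype AB in the offspring population would be P(A) * P(B). The frequency actually observed in the population is P(AB). The disequilibrium is measured as D = P(AB) - P(A) * P(B).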
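The formula for D is easy to compute directly. The sketch below also returns two standard normalizations that the text does not define, D' (D divided by its maximum possible magnitude given the allele frequencies) and r² (the squared correlation between the two loci), since raw D depends heavily on allele frequencies. The function name and the example frequencies are made up for illustration.

```python
def ld_stats(p_ab, p_a, p_b):
    """Return (D, |D'|, r2) from haplotype and allele frequencies.

    p_ab: observed frequency of the haplotype carrying both A and B
    p_a, p_b: frequencies of allele A and allele B
    """
    d = p_ab - p_a * p_b
    # Largest magnitude D can reach given the allele frequencies.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r2 = d * d / denom if denom > 0 else 0.0
    return d, d_prime, r2


# Independent loci: P(AB) equals P(A) * P(B), so every statistic is 0.
print(ld_stats(0.25, 0.5, 0.5))
# Haplotype AB seen more often than expected: positive D, partial LD.
print(ld_stats(0.40, 0.5, 0.5))
```

The r² value is also the link to the earlier question about LD versus pairwise correlation: r² is the squared Pearson correlation between the allele indicators at the two loci.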

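In practice, one can genotype a large number of genetic markers (SNPs) spread across the genome, or markers near candidate genes, to find sites that show association with disease because they lie close enough to the causal site. This is the basic idea of allelic association analysis, or gene mapping by linkage disequilibrium.

Paper to read: Strategies for fine-mapping complex traits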

assign well-calibrated probabilities of causality to candidate variants, known as fine-mapping.

Some other very important concepts:

effect size

power: statistical power; power analyses

Underestimated Effect Sizes in GWAS: Fundamental Limitations of Single SNP Analysis for Dichotomous Phenotypes

In context: One explanation of the missing heritability is that complex diseases are caused by a large number of causal variants with small effect sizes.

PRS combines the effect sizes of multiple SNPs into a single aggregated score that can be used to predict disease risk
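In its simplest form, the aggregated score is a weighted sum: each SNP's effect size multiplied by the number of effect alleles the person carries. The sketch below shows only that sum; the SNP IDs and effect sizes are hypothetical, and real PRS methods add steps such as LD clumping and p-value thresholding that are not shown.

```python
def polygenic_risk_score(dosages, effect_sizes):
    """Sum of effect_size * effect-allele count over SNPs.

    dosages: {snp_id: 0/1/2 copies of the effect allele, or None if missing}
    effect_sizes: {snp_id: beta from GWAS summary statistics}
    SNPs that are missing for this person are skipped.
    """
    score = 0.0
    for snp, beta in effect_sizes.items():
        dosage = dosages.get(snp)
        if dosage is None:
            continue
        score += beta * dosage
    return score


# Hypothetical effect sizes and one person's genotypes.
betas = {"rs1": 0.2, "rs2": -0.1, "rs3": 0.05}
person = {"rs1": 2, "rs2": 1, "rs3": None}
print(polygenic_risk_score(person, betas))
```

A raw score has no meaning on its own; it is interpreted relative to the distribution of scores in a reference population.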

Haplotype phasing

Positions with 00 and 11 are called homozygous positions. Positions with 10 or 01 are called heterozygous positions. We note that the reference genome is neither the paternal nor the maternal genome but the genome of an un-related human (or more precisely the mixture of genomes of a few individuals). An individual’s haplotype is the set of variations in that individual’s chromosomes. We note that as any two human haplotypes are 99.9% similar, the mapping problem can be solved quite easily.

Haplotype phasing is the problem of inferring information about an individual’s haplotype. To solve this problem, there are many methods.

Lecture 10: Haplotype Phasing - Community Recovery


Reference: PLINK | File format reference

vcftools

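Main functions of plink: data management, basic QC statistics, population stratification analysis, basic single-locus association analysis, transmission disequilibrium tests for family data, multipoint linkage analysis, haplotype association analysis, copy number variation analysis, meta-analysis, and more.

First, you need to know plink's three binary files: bed, fam and bim. (Note: this bed is completely different from the BED interval files used for genomic regions.)

The formats plink needs can generally be converted from a VCF file (a chance to look at the ped and map formats as well):

PED: Original standard text format for sample pedigree information and genotype calls. Normally must be accompanied by a .map file. Pedigree information and genotype information. Each line is one person.

MAP: Variant information file accompanying a .ped text pedigree + genotype table. Variant information. Each line is one variant (SNP).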

# PED
     1 1 0 0 1  0    G G    2 2    C C
     1 2 0 0 1  0    A A    0 0    A C
     1 3 1 2 1  2    0 0    1 2    A C
     2 1 0 0 1  0    A A    2 2    0 0
     2 2 0 0 1  2    A A    2 2    0 0
     2 3 1 2 1  2    A A    2 2    A A

# MAP
     1 snp1 0 1
     1 snp2 0 2
     1 snp3 0 3

# Convert VCF to ped and map
plink --vcf file.vcf --recode --out file

# Convert ped and map to bed, bim and fam
plink --file test --make-bed --out test

Official descriptions of the three formats

.bed file (a real .bed file is binary and hard to read)

bed: Primary representation of genotype calls at biallelic variants. Must be accompanied by .bim and .fam files. Loaded with --bfile; generated in many situations, most notably when the --make-bed command is used. Do not confuse this with the UCSC Genome Browser's BED format, which is totally different. Genotype information. After conversion it is a matrix: each row is an individual and each column is a variant. The values 0, 1 and 2 correspond to aa, Aa (or aA) and AA. The actual bases are ignored, since we do not care which of A/T/G/C changed.

fam: Sample information file accompanying a .bed binary genotype table. Sample information. Each line is one sample.

bim: Extended variant information file accompanying a .bed binary genotype table. Each line is one variant with its annotation.

             rs4970383 rs3748592 rs9442373 rs1571150 rs6687029
2431:NA19916         2         0         0         0         1
2424:NA19835         1         0         1         2         0
2469:NA20282         1         0         1         0         1
2368:NA19703         0         0         0         2         0
2425:NA19901         1         0         1         2         2

OR

# xxd -b test.bed
00000000: 01101100 00011011 00000001 11011100 00001111 11100111  l.....
00000006: 00001111 01101011 00000001  .k.

  • First two bytes 01101100 00011011 for PLINK v1.00 BED file

  • Third byte is 00000001 (SNP-major) or 00000000 (individual-major)

  • Genotype data, either in SNP-major or individual-major order

  • New 'row' always starts a new byte

  • Each byte encodes up to 4 genotypes

  • 10 indicates missing genotype, otherwise 0 and 1 point to allele 1 or allele 2 in the BIM file, respectively

  • Bits in each byte read in reverse order
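The byte layout in the list above can be checked with a short decoder. This sketch handles only SNP-major files and takes the sample count as an argument (in practice it comes from the .fam file). Each byte is read from its low-order bit pair upward; read that way, the pair values are 00 = homozygous allele 1, 01 = missing, 10 = heterozygous, 11 = homozygous allele 2, which is the same convention as the "reverse order" description above. The function name and the 0/1/2 output coding are choices made for this illustration.

```python
MAGIC = bytes([0b01101100, 0b00011011])
# Two-bit code -> copies of allele 2 from the .bim file (None = missing).
CODE = {0b00: 0, 0b01: None, 0b10: 1, 0b11: 2}


def decode_bed(raw, n_samples):
    """Decode SNP-major PLINK .bed bytes into one list of calls per variant."""
    if raw[:2] != MAGIC:
        raise ValueError("not a PLINK v1 .bed file")
    if raw[2] != 1:
        raise ValueError("only SNP-major files are handled in this sketch")
    bytes_per_variant = (n_samples + 3) // 4  # each new variant starts a new byte
    body = raw[3:]
    if len(body) % bytes_per_variant:
        raise ValueError("file size does not match the sample count")
    variants = []
    for start in range(0, len(body), bytes_per_variant):
        chunk = body[start:start + bytes_per_variant]
        calls = []
        for i in range(n_samples):
            # Sample i sits in byte i // 4, at bit pair i % 4 from the low end.
            two_bits = (chunk[i // 4] >> (2 * (i % 4))) & 0b11
            calls.append(CODE[two_bits])
        variants.append(calls)
    return variants


# The nine bytes from the xxd dump above; the matching PED example has 6 samples.
raw = bytes([0x6C, 0x1B, 0x01, 0xDC, 0x0F, 0xE7, 0x0F, 0x6B, 0x01])
for calls in decode_bed(raw, 6):
    print(calls)
```

Decoding the dump this way reproduces the six-sample, three-SNP PED example shown earlier, including its missing calls.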

.fam file

1 2431 NA19916  0  0  1
2 2424 NA19835  0  0  2
3 2469 NA20282  0  0  2
4 2368 NA19703  0  0  1
5 2425 NA19901  0  0  2

OR

1 1 0 0 1 0
1 2 0 0 1 0
1 3 1 2 1 2
2 1 0 0 1 0
2 2 0 0 1 2
2 3 1 2 1 2

.bim file

1  1 rs4970383  0  828418  A
2  1 rs3748592  0  870101  A
3  1 rs9442373  0 1052501  C
4  1 rs1571150  0 1464167  A
5  1 rs6687029  0 1508931  C

OR

1       snp1    0       1       G       A
1       snp2    0       2       1       2
1       snp3    0       3       A       C

Running PLINK

plink --bfile  --pheno  --pheno-name t16 --linear hide-covar --covar  --covar-name AGE,SEX,PC1,PC2,PC3,PC4 --ci 0.95 --out

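--bfile: load the binary fileset (.bed, .bim, .fam) with the given prefix

--pheno: read the phenotype file prepared earlier

--pheno-name t16: the phenotype to analyze is the column named t16

--linear hide-covar: fit a linear regression model; hide-covar leaves the covariate-specific rows out of the report

--covar  --covar-name AGE,SEX,PC1,PC2,PC3,PC4: add the chosen covariates to the linear regression model; here they are AGE, SEX, PC1, PC2, PC3 and PC4

--ci 0.95: report 95% confidence intervals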
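What the linear model estimates for each SNP can be illustrated with ordinary least squares on a single predictor, the allele count. This is a deliberately reduced sketch: it has no covariates (the command above adjusts for AGE, SEX and four PCs through multiple regression), and it builds the interval from a normal approximation with z = 1.96 rather than the t distribution. The function name and the toy data are invented for illustration.

```python
import math


def linear_assoc(dosages, phenotypes, z=1.96):
    """Regress a quantitative phenotype on allele count for one SNP.

    Returns (beta, standard_error, (ci_low, ci_high)).
    Needs at least 3 samples and some variation in the dosages.
    """
    n = len(dosages)
    mean_x = sum(dosages) / n
    mean_y = sum(phenotypes) / n
    sxx = sum((x - mean_x) ** 2 for x in dosages)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(dosages, phenotypes))
    beta = sxy / sxx  # change in phenotype per extra allele copy
    intercept = mean_y - beta * mean_x
    rss = sum((y - intercept - beta * x) ** 2 for x, y in zip(dosages, phenotypes))
    se = math.sqrt(rss / (n - 2) / sxx)
    return beta, se, (beta - z * se, beta + z * se)


# Toy data: the phenotype rises by about 1 unit per allele copy.
beta, se, ci = linear_assoc([0, 0, 1, 1, 2, 2], [1.0, 1.2, 1.9, 2.1, 3.0, 3.2])
print(f"beta={beta:.3f} se={se:.3f} ci=({ci[0]:.3f}, {ci[1]:.3f})")
```

The beta here is the effect size discussed earlier: the expected change in the trait for each additional copy of the allele under an additive model.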

SNP filtering

Filtering with vcftools:

1. Remove sites with MAF < 0.05

vcftools --vcf test.vcf --maf 0.05 --recode --out XX

2. Keep sites with a call rate of at least 90%

vcftools --vcf test.vcf --max-missing 0.9 --recode --out XX

3. Keep sites with a mean depth of at least 5

vcftools --vcf test.vcf --min-meanDP 5 --recode --out xx

Note:

For gzip-compressed input, --gzvcf reads the file directly in place of --vcf.

Filtering with plink

1. Convert the VCF to plink format

vcftools --vcf test.vcf --plink --out xxx

2. plink --noweb --file xxx --geno 0.05 --maf 0.05 --hwe 0.0001 --make-bed

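There is also a tutorial on the official site that needs no coding. Teaching material: Resources available for download. It is very accessible and a good way to start.

.ped file: pedigree information and genotypes;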

Contains no header line, and one line per sample with 2V+6 fields where V is the number of variants. The first six fields are the same as those in a .fam file.

The seventh and eighth fields are allele calls for the first variant in the .map file ('0' = no call); the 9th and 10th are allele calls for the second variant; and so on.

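The first six columns are the same as in the .fam file: family ID, within-family ID, father's ID, mother's ID, sex and phenotype.

The remaining columns come in pairs; for example, columns 7 and 8 are the two alleles of the first SNP in the map file (a person has two copies of each chromosome; each DNA molecule is double-stranded, but the second strand is ignored because it is determined by complementary base pairing).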
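The 2V+6 layout described above can be unpacked with a few lines. This is a sketch that assumes whitespace-separated fields and single-token alleles, with '0' as the no-call marker as stated in the format description; the function name and the returned dictionary keys are illustrative.

```python
def parse_ped_line(line):
    """Split one .ped line into its six sample fields and genotype pairs."""
    fields = line.split()
    if len(fields) < 6 or (len(fields) - 6) % 2:
        raise ValueError("expected 6 sample fields plus 2 alleles per variant")
    fid, iid, father, mother, sex, phenotype = fields[:6]
    alleles = fields[6:]
    genotypes = []
    for i in range(0, len(alleles), 2):
        pair = (alleles[i], alleles[i + 1])
        # '0' marks a missing allele call.
        genotypes.append(None if "0" in pair else pair)
    return {
        "fid": fid, "iid": iid, "father": father, "mother": mother,
        "sex": sex, "phenotype": phenotype, "genotypes": genotypes,
    }


# Third line of the PED example shown earlier.
print(parse_ped_line("1 3 1 2 1  2    0 0    1 2    A C"))
```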

.fam file: sample information;

  1. Family ID ('FID')

  2. Within-family ID ('IID'; cannot be '0')

  3. Within-family ID of father ('0' if father isn't in dataset)

  4. Within-family ID of mother ('0' if mother isn't in dataset)

  5. Sex code ('1' = male, '2' = female, '0' = unknown)

  6. Phenotype value ('1' = control, '2' = case, '-9'/'0'/non-numeric = missing data if case/control)

.map file: variant information;

  1. Chromosome code. PLINK 1.9 also permits contig names here, but most older programs do not.

  2. Variant identifier

  3. Position in morgans or centimorgans (optional; also safe to use dummy value of '0')

  4. Base-pair coordinate

.bim file: extended variant information;

  1. Chromosome code (either an integer, or 'X'/'Y'/'XY'/'MT'; '0' indicates unknown) or name

  2. Variant identifier

  3. Position in morgans or centimorgans (safe to use dummy value of '0')

  4. Base-pair coordinate (normally 1-based, but 0 ok; limited to 2^31-2)

  5. Allele 1 (corresponding to clear bits in .bed; usually minor)

  6. Allele 2 (corresponding to set bits in .bed; usually major)

MAF, minor allele frequency: SNPs with a minor allele frequency of 0.05 or greater were targeted by the HapMap project.

QC

The SNPs are currently coded according to NCBI build 36 coordinates on the forward strand. 

Data quality control in genetic case-control association studies

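plink can QC-filter SNPs on several criteria, such as MAF.

You need to understand what plink produces:

1. Converting the text ped and map files into the binary bed, bim and fam files;

2. Association results: each person is assigned a phenotype and the association analysis is run, giving the association between every SNP and the phenotype, expressed as a p-value, from which a Manhattan plot can be drawn;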
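Going from association results to a Manhattan plot is mostly a coordinate transformation: chromosomes are laid end to end on the x-axis and each p-value becomes -log10(p) on the y-axis. The sketch below computes those coordinates plus a Bonferroni line and leaves the drawing to a plotting library. The input tuple layout and the function names are assumptions, and chromosome length is approximated by the largest position seen.

```python
import math


def manhattan_points(results):
    """Turn a list of (chrom, bp, p) tuples into (x, y, chrom) plot points.

    chrom is an integer, bp a base-pair position, p a p-value > 0.
    """
    chrom_len = {}
    for chrom, bp, _ in results:
        chrom_len[chrom] = max(chrom_len.get(chrom, 0), bp)
    # Offset of each chromosome = summed lengths of the ones before it.
    offsets, running = {}, 0
    for chrom in sorted(chrom_len):
        offsets[chrom] = running
        running += chrom_len[chrom]
    return [(offsets[c] + bp, -math.log10(p), c) for c, bp, p in results]


def bonferroni_line(n_tests, alpha=0.05):
    """Height of the significance line on the -log10(p) axis."""
    return -math.log10(alpha / n_tests)


hits = [(1, 100, 0.5), (1, 500, 1e-8), (2, 50, 0.01)]
for x, y, chrom in manhattan_points(hits):
    print(chrom, x, round(y, 2))
print("threshold:", round(bonferroni_line(len(hits)), 2))
```

With about a million independent tests, the Bonferroni line lands near the conventional genome-wide threshold of 5e-8.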

Source: https://www.cnblogs.com/leezx/p/9013615.html
