The NHGRI-EBI Catalog of published genome-wide association studies
A database, maintained by EBI, that collects published GWAS studies.
Catalog stats
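Basic search methods
Search by phenotype: e.g. breast carcinoma. This returns well-standardized phenotype information. EFO, much like GO, is a classification scheme for phenotypes. The search also returns the genes associated with the phenotype.
Search by SNP: e.g. rs7329174. This returns detailed information about the variant and the corresponding genes.
Search by author name: e.g. Yao. This returns the related publications.
Search by chromosomal position: e.g. 2q37.1 (a cytogenetic region).
Search by gene: e.g. HBS1L.
Search by region: e.g. 6:16000000-25000000.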
It is called a database, but it is really just one table. It can be downloaded from the site and is under 100 MB.
The table contains the following columns:
DATE ADDED TO CATALOG* +: Date a study is published in the catalog
PUBMEDID* +: PubMed identification number
FIRST AUTHOR* +: Last name and initials of first author
DATE* +: Publication date (online (epub) date if available)
JOURNAL* +: Abbreviated journal name
LINK* +: PubMed URL
STUDY* +: Title of paper
DISEASE/TRAIT* +: Disease or trait examined in study
INITIAL SAMPLE DESCRIPTION* +: Sample size and ancestry description for stage 1 of GWAS (summing across multiple Stage 1 populations, if applicable)
REPLICATION SAMPLE DESCRIPTION* +: Sample size and ancestry description for subsequent replication(s) (summing across multiple populations, if applicable)
REGION*: Cytogenetic region associated with rs number
CHR_ID*: Chromosome number associated with rs number
CHR_POS*: Chromosomal position associated with rs number
REPORTED GENE(S)*: Gene(s) reported by author
MAPPED GENE(S)*: Gene(s) mapped to the strongest SNP. If the SNP is located within a gene, that gene is listed. If the SNP is intergenic, the upstream and downstream genes are listed, separated by a hyphen.
UPSTREAM_GENE_ID*: Entrez Gene ID for nearest upstream gene to rs number, if not within gene
DOWNSTREAM_GENE_ID*: Entrez Gene ID for nearest downstream gene to rs number, if not within gene
SNP_GENE_IDS*: Entrez Gene ID, if rs number within gene; multiple genes denotes overlapping transcripts
UPSTREAM_GENE_DISTANCE*: Distance in kb for nearest upstream gene to rs number, if not within gene
DOWNSTREAM_GENE_DISTANCE*: Distance in kb for nearest downstream gene to rs number, if not within gene
STRONGEST SNP-RISK ALLELE*: SNP(s) most strongly associated with trait + risk allele (? for unknown risk allele). May also refer to a haplotype.
SNPS*: Strongest SNP; if a haplotype it may include more than one rs number (multiple SNPs comprising the haplotype)
MERGED*: Denotes whether the SNP has been merged into a subsequent rs record (0 = no; 1 = yes)
SNP_ID_CURRENT*: Current rs number (will differ from strongest SNP when merged = 1)
CONTEXT*: SNP functional class
INTERGENIC*: Denotes whether SNP is in intergenic region (0 = no; 1 = yes)
RISK ALLELE FREQUENCY*: Reported risk/effect allele frequency associated with strongest SNP in controls (if not available among all controls, among the control group with the largest sample size). If the associated locus is a haplotype the haplotype frequency will be extracted.
P-VALUE*: Reported p-value for strongest SNP risk allele (linked to dbGaP Association Browser). Note that p-values are rounded to 1 significant digit (for example, a published p-value of 4.8 x 10^-7 is rounded to 5 x 10^-7).
PVALUE_MLOG*: -log(p-value)
P-VALUE (TEXT)*: Information describing context of p-value (e.g. females, smokers).
OR or BETA*: Reported odds ratio or beta-coefficient associated with strongest SNP risk allele. Note that if an OR <1 is reported this is inverted, along with the reported allele, so that all ORs included in the Catalog are >1. Appropriate unit and increase/decrease are included for beta coefficients.
95% CI (TEXT)*: Reported 95% confidence interval associated with strongest SNP risk allele, along with unit in the case of beta-coefficients. If 95% CIs are not published, we estimate these using the standard error, where available.
PLATFORM (SNPS PASSING QC)*: Genotyping platform manufacturer used in Stage 1; also includes notation of pooled DNA study design or imputation of SNPs, where applicable
CNV*: Study of copy number variation (yes/no)
ASSOCIATION COUNT+: Number of associations identified for this study
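Since the catalog is a single table, it can be read with ordinary tabular tooling. The sketch below is only an illustration: it assumes the download is tab-separated with a header row, that PVALUE_MLOG means -log10, and it uses the conventional 5e-8 cutoff, which is my assumption and not something the catalog prescribes. The two rows and their rs numbers are invented.

```python
import csv
import io
import math

# Toy stand-in for the downloaded catalog file (assumed tab-separated with a
# header row). The rows and rs numbers are invented for illustration only.
SAMPLE_TSV = (
    "DISEASE/TRAIT\tSNPS\tP-VALUE\tOR or BETA\n"
    "Example trait A\trs0000001\t5E-8\t1.25\n"
    "Example trait B\trs0000002\t3E-6\t1.10\n"
)


def load_associations(handle):
    """Return one dict per association row, adding -log10(p) as 'mlog_p'."""
    rows = []
    for row in csv.DictReader(handle, delimiter="\t"):
        row["mlog_p"] = -math.log10(float(row["P-VALUE"]))
        rows.append(row)
    return rows


def significant(rows, threshold=5e-8):
    """Keep associations whose p-value is at or below the threshold.

    The default threshold is an assumed convention, not a catalog rule.
    """
    return [r for r in rows if float(r["P-VALUE"]) <= threshold]


rows = load_associations(io.StringIO(SAMPLE_TSV))
hits = significant(rows)
print([r["SNPS"] for r in hits])
```

For the real file, replace the `io.StringIO(...)` handle with an opened file object; the column names used here are taken from the list above.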
Some questions:
What is genotyping technology?
What is an Experimental Factor Ontology trait?
What is a cytogenetic region? (See karyotype.)
What is "trait + risk allele"? Keep the concepts of SNP and allele apart here: a SNP is a locus, while an allele is the base found at that locus. Also keep in mind the two DNA strands and polyploidy.
What is the risk/effect allele frequency?
What kind of measure is the odds ratio in GWAS? From Wikipedia:
The odds ratio is the ratio of two odds, which in the context of GWA studies are the odds of case for individuals having a specific allele and the odds of case for individuals who do not have that same allele. As an example, suppose that there are two alleles, T and C. The number of individuals in the case group having allele T is represented by 'A' and the number of individuals in the control group having allele T is represented by 'B'. Similarly, the number of individuals in the case group having allele C is represented by 'X' and the number of individuals in the control group having allele C is represented by 'Y'. In this case the odds ratio for allele T is A:B (meaning 'A to B', in standard odds terminology) divided by X:Y, which in mathematical notation is simply (A/B)/(X/Y). When the allele frequency in the case group is much higher than in the control group, the odds ratio is higher than 1, and vice versa for lower allele frequency. Additionally, a P-value for the significance of the odds ratio is typically calculated using a simple chi-squared test. Finding odds ratios that are significantly different from 1 is the objective of the GWA study because this shows that a SNP is associated with disease.
What is MAF? The frequency of the minor allele.
What annotations can GWAS data carry? Phenotype annotation, population and linkage disequilibrium (LD) information.
What are CP loci? An effective region associated with at least two phenotypes.
What is genotype-calling?
What are the most basic QC steps for GWAS? Two references:
Quality Control Procedures for Genome Wide Association Studies
Data quality control in genetic case-control association studies
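The odds ratio definition quoted above, (A/B)/(X/Y), is easy to check by hand. Here is a minimal sketch using only the standard library; the function names and the counts are made up for illustration, and the p-value comes from a plain Pearson chi-squared test with 1 degree of freedom, which is one simple choice among several.

```python
import math


def allelic_odds_ratio(a, b, x, y):
    """Odds ratio for allele T: (A/B)/(X/Y).

    a: cases with T, b: controls with T, x: cases with C, y: controls with C.
    """
    return (a / b) / (x / y)


def chi_squared_test(a, b, x, y):
    """Pearson chi-squared test (1 df) on the 2x2 allele-count table."""
    n = a + b + x + y
    chi2 = n * (a * y - b * x) ** 2 / ((a + b) * (x + y) * (a + x) * (b + y))
    # For 1 degree of freedom the upper tail of the chi-squared
    # distribution equals erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Invented counts: allele T is enriched among cases.
odds_ratio = allelic_odds_ratio(60, 40, 40, 60)
chi2, p_value = chi_squared_test(60, 40, 40, 60)
print(odds_ratio, chi2, p_value)
```

With these counts the odds ratio comes out at about 2.25 and the chi-squared statistic at 8, so the association would be called significant at the 0.01 level.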
What is the Experimental Factor Ontology?
What is LD information (r2 and D' values)?
Mathematical properties of the r2 measure of linkage disequilibrium: r2 is the square of the correlation coefficient between two indicator variables, one representing the presence or absence of a particular allele at the first locus and the other representing the presence or absence of a particular allele at the second locus. Note the frequency dependence of r2: the values r2 can take depend on the allele frequencies (MAF) at the two loci.
Introduction to different measures of linkage disequilibrium (LD) and their calculation: covers the two common ways of computing it.
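To make r2 and D' concrete, here is a small sketch that computes both from haplotype and allele frequencies for two biallelic loci. The function name and the example frequencies are illustrative; real tools estimate the haplotype frequency from genotype data, which this sketch does not attempt.

```python
def ld_measures(p_ab, p_a, p_b):
    """Return (D, D', r2) for two biallelic loci.

    p_ab: frequency of the haplotype carrying allele A (locus 1) and B (locus 2)
    p_a, p_b: frequencies of allele A and allele B
    Both loci must be polymorphic (0 < p_a < 1 and 0 < p_b < 1).
    """
    if not (0 < p_a < 1 and 0 < p_b < 1):
        raise ValueError("both loci must be polymorphic")
    d = p_ab - p_a * p_b
    # D is normalised by the largest value it could take given the
    # allele frequencies; the bound depends on the sign of D.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max
    r2 = d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))
    return d, d_prime, r2


# Equal allele frequencies.
print(ld_measures(0.4, 0.5, 0.5))
# Very different allele frequencies: D' reaches 1 while r2 stays small.
print(ld_measures(0.1, 0.5, 0.1))
```

The second call illustrates the frequency dependence mentioned above: B only ever occurs together with A, so D' is at its maximum, yet r2 is only about 0.11 because the two allele frequencies differ so much.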
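NLM Catalog
Journal names are abbreviated both in NCBI and in this database. How do we convert them to full names? Download the corresponding journal information from the NLM Catalog at NCBI, then clean up the format a little in Sublime to get the abbreviation-to-full-name mapping.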
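The same cleanup can be scripted instead of done by hand in an editor. The sketch below assumes the downloaded journal list consists of "Key: value" records separated by dashed lines, with fields named `JournalTitle` and `MedAbbr`; that layout and those field names are my assumption about the file, so check them against the actual download. The sample records are invented.

```python
# Invented sample in the ASSUMED layout: "Key: value" lines, with records
# separated by lines of dashes.
SAMPLE = """\
--------------------------------------------------------
JrId: 1
JournalTitle: Example Journal of Genetics
MedAbbr: Ex J Genet
--------------------------------------------------------
JrId: 2
JournalTitle: Another Example Journal
MedAbbr: Another Ex J
"""


def abbreviation_map(text, abbr_key="MedAbbr", title_key="JournalTitle"):
    """Map each journal abbreviation to its full title."""
    mapping = {}
    record = {}
    # A trailing separator is appended so the last record is flushed too.
    for line in text.splitlines() + ["-----"]:
        if line.startswith("-----"):
            if abbr_key in record and title_key in record:
                mapping[record[abbr_key]] = record[title_key]
            record = {}
        elif ":" in line:
            key, _, value = line.partition(":")
            record[key.strip()] = value.strip()
    return mapping


journal_names = abbreviation_map(SAMPLE)
print(journal_names["Ex J Genet"])
```

The resulting dict can then be used to look up the full name for each value in the JOURNAL column.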
How do we compute the LD score of these variants within a specific population?
There is a ready-made resource for this, LDlink: "LDlink is a suite of web-based applications designed to easily and efficiently interrogate linkage disequilibrium in population groups."
There is also an R package that can call it directly, LDlinkR: Access LDlink API with R.
Questions:
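How do we learn to pose questions and then test them with statistics and simulation?
One question matters most: could the result we observed have arisen by chance?
To answer it we need to run simulations and shuffle our observations.
This part is very important, and also a lot of fun.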
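As one concrete form of "shuffle the observations", here is a minimal permutation test sketch. It asks how often a difference in group means at least as large as the observed one appears when the group labels are shuffled. The function name, the choice of test statistic, the number of permutations and the data are all illustrative assumptions.

```python
import random
import statistics


def permutation_p_value(group_a, group_b, n_perm=2000, seed=0):
    """Two-sided permutation test on the difference in group means."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n_a = len(group_a)
    observed = abs(statistics.mean(group_a) - statistics.mean(group_b))
    pooled = list(group_a) + list(group_b)
    at_least_as_extreme = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # break any real link between value and label
        diff = abs(statistics.mean(pooled[:n_a]) - statistics.mean(pooled[n_a:]))
        if diff >= observed - 1e-12:  # small tolerance for float rounding
            at_least_as_extreme += 1
    # Add one to numerator and denominator so the estimate is never zero.
    return (at_least_as_extreme + 1) / (n_perm + 1)


# Invented measurements for two groups.
group_a = [5.1, 4.9, 5.3, 5.0, 5.2, 5.4]
group_b = [4.0, 4.2, 3.9, 4.1, 4.3, 3.8]
p_value = permutation_p_value(group_a, group_b)
print(p_value)
```

A small p-value means that shuffled data rarely look as extreme as the real data, which is the evidence that the observed result is unlikely to be random.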
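How to filter SNPs in the 1000 Genomes data: quality control (QC)
The 1000 Genomes database uses the VCF 4.1 format. To quickly batch-download the files in an FTP directory:
To convert VCF to PLINK format:
PLINK documentation - Whole genome association analysis toolset
In the VCF, AF (Estimated allele frequency in the range (0,1)) is the overall allele frequency, which is what the MAF is derived from.
How to read a VCF with random access: Introduction to vcfR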
In fact the three metrics above are not that simple, and you need to compute them yourself:
"Minor allele frequency (MAF) is the frequency at which the second most common allele occurs in a given population. They play a surprising role in heritability since MAF variants which occur only once, known as "singletons," drive an enormous amount of selection. Single nucleotide polymorphisms (SNPs) with a minor allele frequency of 0.05 (5%) or greater were targeted by the HapMap project."
How can I get the allele frequency of my variant? You can calculate the frequency by dividing AC (allele count) by AN (allele number).
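The AC/AN calculation can be sketched directly on a VCF INFO string. This is an illustration that assumes a biallelic site, so AC holds a single number; the helper names and the example INFO strings are made up. For a biallelic site the minor allele frequency is then the smaller of AF and 1 - AF.

```python
def parse_info(info):
    """Parse a VCF INFO string such as 'AC=3;AN=10;DB' into a dict."""
    fields = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True  # flag entries have no '='
    return fields


def allele_frequency(info):
    """Alternate allele frequency as AC / AN (biallelic sites only)."""
    fields = parse_info(info)
    return int(fields["AC"]) / int(fields["AN"])


def minor_allele_frequency(af):
    """For a biallelic site the minor allele is the rarer of the two."""
    return min(af, 1 - af)


# Invented INFO strings.
af_rare = allele_frequency("AC=3;AN=10;DB")
af_common = allele_frequency("AC=8;AN=10")
print(af_rare, minor_allele_frequency(af_rare))
print(af_common, minor_allele_frequency(af_common))
```

The second example shows why AF and MAF are not interchangeable: when the alternate allele is the common one, the MAF is 1 - AF.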
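This is a Perl script. It is too old and the results it produced were not good, so I do not use it. It cost me a lot of time; the official method above is more convenient.
The method recommended on the official site; here is its web version: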
It is time to tie together the basics of statistical genetics:
Reference genome
Alignment
SNP
Allele
Population
Allele frequency
Genotype frequency
How allele frequency and MAF are related, and how they differ
Next: how do we get the HWE p-value of a SNP?
A genome-wide study of Hardy-Weinberg equilibrium with next generation sequence data
Evolution and the tree of life - a good open course on genetics
Allele Frequencies and Hardy-Weinberg Equilibrium - explains HWE quite thoroughly
Quality control analysis of the 1000 Genomes Project Omni2.5 genotypes - a useful hands-on reference
"It is therefore of interest to test whether a population is in HWE at a locus. We will discuss the two most popular ways of testing HWE."
Hardy-Weinberg Assumptions
The point of filtering on HWE is to avoid picking SNPs that are wildly off. In other words, we only want SNPs that roughly fit the HWE assumptions.
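To see what an HWE p-value measures, here is a sketch of the simple chi-squared version of the test, computed from the three genotype counts at one SNP. This is the textbook approximation with 1 degree of freedom, shown for illustration; it is not necessarily what a given tool reports (PLINK, for example, uses an exact test). The genotype counts are invented.

```python
import math


def hwe_chi_squared(n_aa, n_ab, n_bb):
    """Chi-squared test of Hardy-Weinberg equilibrium for one biallelic SNP.

    n_aa, n_ab, n_bb: observed counts of the three genotypes.
    Returns (chi2, p_value) using 1 degree of freedom.
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Upper tail of the chi-squared distribution with 1 df.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Counts that match HWE exactly for p = 0.5.
print(hwe_chi_squared(25, 50, 25))
# Strong heterozygote deficit: a SNP like this would be filtered out.
print(hwe_chi_squared(40, 20, 40))
```

The first SNP gives a p-value of 1, the second a p-value far below any usual cutoff, which is the kind of SNP the HWE filter is meant to drop.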
How do we get the missing genotype rate? It is the complement of the proportion of non-missing genotypes (the call rate). "Call rates were calculated using PLINK1.90."
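The relationship between call rate and missing rate is simple enough to sketch. The example assumes VCF-style genotype strings in which a missing call is written as `./.` (or `.|.` or `.`); the function names and the genotypes are made up.

```python
MISSING_CODES = ("./.", ".|.", ".")


def call_rate(genotypes, missing=MISSING_CODES):
    """Fraction of samples with a non-missing genotype at one SNP."""
    called = sum(1 for g in genotypes if g not in missing)
    return called / len(genotypes)


def missing_rate(genotypes, missing=MISSING_CODES):
    """Fraction of samples with a missing genotype: 1 - call rate."""
    return 1 - call_rate(genotypes, missing)


# Invented genotypes for four samples at one SNP.
genotypes = ["0/0", "0/1", "./.", "1/1"]
print(call_rate(genotypes), missing_rate(genotypes))
```

Here three of four samples are called, so the call rate is 0.75 and the missing genotype rate is 0.25.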
install.packages("LDlinkR")
It can only be installed on R 3.5.2 or later. It is best to run the queries in parallel, otherwise this is really too slow.
SNP filtering can be done with a single command.
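Conceptually, such a filter applies three thresholds to every SNP: MAF, HWE p-value and missing genotype rate. The sketch below shows only that logic, on per-SNP summaries that are assumed to be computed already. The field names, the threshold values and the three SNP records are illustrative assumptions, not the settings used in this post.

```python
def passes_qc(snp, min_maf=0.05, min_hwe_p=1e-6, max_missing=0.05):
    """Return True if a SNP summary passes all three filters.

    snp: dict with hypothetical keys 'maf', 'hwe_p' and 'missing'.
    The default thresholds are assumed values for illustration.
    """
    return (
        snp["maf"] >= min_maf
        and snp["hwe_p"] >= min_hwe_p
        and snp["missing"] <= max_missing
    )


# Invented per-SNP summaries.
snps = [
    {"id": "snp_a", "maf": 0.20, "hwe_p": 0.40, "missing": 0.01},
    {"id": "snp_b", "maf": 0.01, "hwe_p": 0.60, "missing": 0.00},  # too rare
    {"id": "snp_c", "maf": 0.30, "hwe_p": 1e-9, "missing": 0.02},  # fails HWE
]
kept = [s["id"] for s in snps if passes_qc(s)]
print(kept)
```

Only `snp_a` survives here; the other two are dropped for low MAF and for departing from HWE respectively.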
To be continued~