【原】肺腺癌單細(xì)胞數(shù)據(jù)集GSE189357復(fù)現(xiàn)(一)：數(shù)據(jù)下載整理、降維聚類與分群

健明 2024-10-07 發(fā)布于廣東

展開全文

前言

Hello小伙伴們大家好，我是生信技能樹的小學(xué)徒”我才不吃蛋黃“。繼胃癌單細(xì)胞復(fù)現(xiàn)系列完結(jié)之后，經(jīng)過了短暫的休整，我又回來了。之前每一期的推文，我都有認(rèn)真準(zhǔn)備。推文發(fā)出之后，也仔細(xì)閱讀了每一條留言。非常感謝大家的鼓勵(lì)。

接下來的一段時(shí)間里，我將再次開啟一個(gè)新的學(xué)徒分享系列，給大家系統(tǒng)整理肺腺癌單細(xì)胞測序的代碼。

文章的具體內(nèi)容，二由老師已在生信菜鳥團(tuán)公眾號發(fā)布，請大家移步閱讀：單細(xì)胞測序+空間轉(zhuǎn)錄組描繪從癌前病變到浸潤性肺腺癌的動(dòng)態(tài)演變。

本系列包括但不限于以下內(nèi)容：數(shù)據(jù)下載與讀??；質(zhì)控和去批次；降維聚類；分群注釋；差異分析；富集分析；亞群細(xì)分；inferCNV;擬時(shí)序分析；細(xì)胞通訊；SCENIC轉(zhuǎn)錄因子分析等。

為了加深對代碼的理解，除了閱讀推文，可以加入單細(xì)胞月更群，也可以配合觀看生信技能樹B站系列教學(xué)視頻。

如果你是的單細(xì)胞新手，推薦報(bào)名參加生信馬拉松課程（生信入門&數(shù)據(jù)挖掘線上直播課10月班），讓你快速掌握生信數(shù)據(jù)挖掘技能。

本系列推文預(yù)計(jì)12篇左右，歡迎大家持續(xù)追更，關(guān)注公眾號，點(diǎn)贊+評論+收藏+在看，您的鼓勵(lì)將是我更新的動(dòng)力，同時(shí)歡迎大家批評指正。

數(shù)據(jù)處理流程

首先清除系統(tǒng)環(huán)境變量，加載R包：

rm(list=ls())
options(stringsAsFactors = F) 
getwd()
setwd("../GSE189357-LUAD-scRNA-ST")
source('scRNA_scripts/lib.R')
library(Seurat)
library(ggplot2)
library(clustree)
library(cowplot)
library(data.table)
library(dplyr)

當(dāng)我們在建立數(shù)據(jù)框的時(shí)候，R語言將會(huì)默認(rèn)把字符型（character）當(dāng)成因子（factor）。stringsAsFactors = F意味著，“在讀入數(shù)據(jù)時(shí)，遇到字符串之后，不將其轉(zhuǎn)換為factors，仍然保留為字符串格式”。

在R語言中，stringsAsFactors是一個(gè)全局選項(xiàng)，它控制著當(dāng)讀取數(shù)據(jù)時(shí)，字符串類型的列是否自動(dòng)轉(zhuǎn)換為因子(factors)類型。因子類型在R中是一種特殊的數(shù)據(jù)類型，用于表示分類數(shù)據(jù)。

在R的早期版本中，默認(rèn)情況下，讀取數(shù)據(jù)時(shí)字符串會(huì)被轉(zhuǎn)換為因子。但是，因子類型有其特定的屬性和行為，這在某些情況下可能不是用戶所期望的。例如，因子類型會(huì)將數(shù)據(jù)視為分類變量，并且會(huì)存儲數(shù)據(jù)的唯一值作為內(nèi)部整數(shù)編碼，這可能會(huì)增加內(nèi)存使用量，并且在進(jìn)行某些類型的數(shù)據(jù)分析時(shí)可能會(huì)引起混淆。

在單細(xì)胞數(shù)據(jù)分析中，數(shù)據(jù)通常包含大量的基因表達(dá)信息，這些信息是以字符串形式存在的。在進(jìn)行數(shù)據(jù)分析之前，設(shè)置stringsAsFactors = F的目的通常是為了：

避免不必要的內(nèi)存消耗：將字符串轉(zhuǎn)換為因子類型會(huì)增加內(nèi)存的使用，因?yàn)橐蜃有枰鎯σ粋€(gè)內(nèi)部的整數(shù)編碼和對應(yīng)的唯一值列表。在處理大規(guī)模的單細(xì)胞數(shù)據(jù)集時(shí)，這可能會(huì)成為一個(gè)問題。
保持?jǐn)?shù)據(jù)的一致性：在數(shù)據(jù)分析過程中，保持?jǐn)?shù)據(jù)類型的一致性是很重要的。如果某些操作期望數(shù)據(jù)是字符串類型，而數(shù)據(jù)被自動(dòng)轉(zhuǎn)換為因子，這可能會(huì)導(dǎo)致錯(cuò)誤或者不一致的結(jié)果。
提高代碼的可讀性和可維護(hù)性：明確地設(shè)置stringsAsFactors = F可以讓代碼的讀者知道數(shù)據(jù)不會(huì)被自動(dòng)轉(zhuǎn)換為因子，這有助于理解代碼的意圖和行為。
避免因子類型帶來的潛在問題：因子類型在某些統(tǒng)計(jì)分析中可能會(huì)引起問題，比如在進(jìn)行機(jī)器學(xué)習(xí)或者數(shù)據(jù)挖掘時(shí)，因子類型的數(shù)據(jù)可能需要額外的處理才能被正確使用。

因此，在讀取單細(xì)胞數(shù)據(jù)之前設(shè)置stringsAsFactors = F是一種良好的編程實(shí)踐，它有助于確保數(shù)據(jù)以正確的形式被處理，并且可以減少后續(xù)分析中可能出現(xiàn)的問題。

step1:導(dǎo)入數(shù)據(jù)

數(shù)據(jù)存儲在“GSE189357_RAW”文件夾中：

dir='GSE189357_RAW/' 
fs=list.files('GSE189357_RAW/','^GSM')
fs
library(tidyverse)
samples=str_split(fs,'_',simplify = T)[,1]

處理數(shù)據(jù)，將原始文件分別整理為barcodes.tsv.gz，features.tsv.gz和matrix.mtx.gz到各自的文件夾。批量將文件名改為 Read10X() 函數(shù)能夠識別的名字：

if(F){
lapply(unique(samples),function(x){
  # x = unique(samples)[1]
  y=fs[grepl(x,fs)]
  folder=paste0("GSE189357_RAW/", paste(str_split(y[1],'_',simplify = T)[,1:2], collapse = "_"))
  dir.create(folder,recursive = T)
  #為每個(gè)樣本創(chuàng)建子文件夾
  file.rename(paste0("GSE189357_RAW/",y[1]),file.path(folder,"barcodes.tsv.gz"))
  #重命名文件，并移動(dòng)到相應(yīng)的子文件夾里
  file.rename(paste0("GSE189357_RAW/",y[2]),file.path(folder,"features.tsv.gz"))
  file.rename(paste0("GSE189357_RAW/",y[3]),file.path(folder,"matrix.mtx.gz"))
})
}

自行創(chuàng)建函數(shù)function，內(nèi)置Read10X()函數(shù)和CreateSeuratObject()函數(shù)，批量讀取文件原始文件創(chuàng)建Seurat對象：

dir='GSE189357_RAW/'
samples=list.files( dir )
samples 
sceList = lapply(samples,function(pro){ 
  # pro=samples[1] 
  print(pro)  
  tmp = Read10X(file.path(dir,pro )) 
  if(length(tmp)==2){
 ct = tmp[[1]] 
  }else{ct = tmp}
  sce =CreateSeuratObject(counts =  ct ,
  project =  pro  ,
  min.cells = 5,
  min.features = 300 )
  return(sce)
})

這里我們得到的sceList實(shí)際上包含了9個(gè)樣本的Seurat對象，接著，我們需要使用Seurat包的merge函數(shù)，將九個(gè)Seurat合并成一個(gè)Seurat對象：

do.call(rbind,lapply(sceList, dim))
sce.all=merge(x=sceList[[1]],
  y=sceList[ -1 ],
  add.cell.ids = samples  ) 
names(sce.all@assays$RNA@layers)
#> [1] "counts.GSM5699777_TD1" "counts.GSM5699778_TD2" "counts.GSM5699779_TD3"
#> [4] "counts.GSM5699780_TD4" "counts.GSM5699781_TD5" "counts.GSM5699782_TD6"
#> [7] "counts.GSM5699783_TD7" "counts.GSM5699784_TD8" "counts.GSM5699785_TD9"

此時(shí)，雖然我們已經(jīng)完成了Seurat對象的創(chuàng)建，但是可以看到，九個(gè)樣本仍然有9個(gè)layers。如果不進(jìn)一步處理，后續(xù)在提取counts時(shí)數(shù)據(jù)不完整，分析會(huì)一直出錯(cuò)。因此我們需要使用JoinLayers函數(shù)對layers進(jìn)行合并。

sce.all[["RNA"]]$counts 
# Alternate accessor function with the same result
LayerData(sce.all, assay = "RNA", layer = "counts")

看看合并前后的sce變化：

sce.all
sce.all <- JoinLayers(sce.all)
sce.all
dim(sce.all[["RNA"]]$counts )

在合并Seurat和layers后，終于得到了一個(gè)完整的Seurat對象”sce.all“。我們可以查看”sce.all“內(nèi)部的一些信息，以此來檢查數(shù)據(jù)是否完整。

as.data.frame(sce.all@assays$RNA$counts[1:10, 1:2])
head(sce.all@meta.data, 10)
table(sce.all$orig.ident) 
length(sce.all$orig.ident)
# fivenum(sce.all$nFeature_RNA)
# table(sce.all$nFeature_RNA>800) 
# sce.all=sce.all[,sce.all$nFeature_RNA>800]
# sce.all

在成功構(gòu)建Seurat對象”sce.all“后，我們還需要給樣本添加meta.data分組信息，以便后續(xù)做不同分組之間的對比以及提取亞組后進(jìn)行進(jìn)一步分析。首先，我們查看現(xiàn)有的meta.data信息有哪些：

library(stringr)
phe = sce.all@meta.data
table(phe$orig.ident)
phe$patient = str_split(phe$orig.ident,'[_]',simplify = T)[,2]
table(phe$patient)

按照患者來源，肺腺癌（LUAD）從原位腺癌（AIS）進(jìn)展到微浸潤性腺癌（MIA）和隨后的浸潤性腺癌（IAC）

phe$sample = phe$patient
table(phe$sample)
phe$sample = gsub("TD5|TD7|TD8", "AIS", phe$sample) 
phe$sample = gsub("TD3|TD4|TD6", "MIA", phe$sample) 
phe$sample = gsub("TD1|TD2|TD9", "IAC", phe$sample) 
table(phe$sample)
sce.all@meta.data = phe

step2: QC質(zhì)控

質(zhì)控（quality control, QC）的目的是為了去除質(zhì)量較差細(xì)胞，低質(zhì)量的細(xì)胞會(huì)形成獨(dú)特的亞群，使分群結(jié)果變得復(fù)雜；在主成分分析過程中，前幾個(gè)主要成分將捕獲質(zhì)量差異而不是生物學(xué)差異，從而降低降維效果。因此，低質(zhì)量的細(xì)胞可能會(huì)導(dǎo)致下游分析出現(xiàn)誤導(dǎo)性結(jié)果。為了避免上述情況的發(fā)生，我們需要在下游分析開始前排除掉這些低質(zhì)量細(xì)胞。QC主要的指標(biāo)有nCount_RNA(每個(gè)細(xì)胞的UMI數(shù)目)和nFeature_RNA（每個(gè)細(xì)胞中檢測到的基因數(shù)量）以及"percent_mito"（表示細(xì)胞中線粒體基因的比例）這三個(gè)指標(biāo)。此外，還可以納入”percent_ribo“（核糖體基因比例）和”percent_hb“（紅血細(xì)胞基因比例）兩個(gè)指標(biāo)。細(xì)胞的UMI分子數(shù)和基因數(shù)反映細(xì)胞的質(zhì)量，數(shù)量太低可能是細(xì)胞碎片，太高則可能是兩個(gè)或多個(gè)細(xì)胞結(jié)團(tuán)的情況；線粒體基因高表達(dá)的細(xì)胞，可能是處于凋亡狀態(tài)或者裂解狀態(tài)的細(xì)胞；核糖體基因高表達(dá)的細(xì)胞，細(xì)胞內(nèi)出現(xiàn)RNA降解時(shí)；血紅蛋白基因高表達(dá)的細(xì)胞通常為紅細(xì)胞，而紅細(xì)胞本身是不含有細(xì)胞核的，且攜帶的基因少，其攜帶的基因與疾病以及生物發(fā)育等過程沒有太大的聯(lián)系，所以可以直接剔除掉。在這里，我們要利用PercentageFeatureSet函數(shù)分別計(jì)算每個(gè)細(xì)胞的"percent_mito"，”percent_ribo“和”percent_hb“，具體可見scripts代碼

sp='human'
dir.create("./1-QC")
setwd("./1-QC")
# 如果過濾的太狠，就需要去修改這個(gè)過濾代碼
source('../scRNA_scripts/qc.R')
sce.all.filt = basic_qc(sce.all)
print(dim(sce.all))
print(dim(sce.all.filt))
##細(xì)胞減少了一點(diǎn)
setwd('../')
getwd()
fivenum(sce.all.filt$percent_ribo)
table(sce.all.filt$nFeature_RNA> 5)

qc函數(shù)如下：

basic_qc <- function(input_sce){
  #計(jì)算線粒體基因比例
  mito_genes=rownames(input_sce)[grep("^MT-", rownames(input_sce),ignore.case = T)] 
  print(mito_genes) #可能是13個(gè)線粒體基因
  #input_sce=PercentageFeatureSet(input_sce, "^MT-", col.name = "percent_mito")
  input_sce=PercentageFeatureSet(input_sce, features = mito_genes, col.name = "percent_mito")
  fivenum(input_sce@meta.data$percent_mito)
  
  #計(jì)算核糖體基因比例
  ribo_genes=rownames(input_sce)[grep("^Rp[sl]", rownames(input_sce),ignore.case = T)]
  print(ribo_genes)
  input_sce=PercentageFeatureSet(input_sce,  features = ribo_genes, col.name = "percent_ribo")
  fivenum(input_sce@meta.data$percent_ribo)
  
  #計(jì)算紅血細(xì)胞基因比例
  Hb_genes=rownames(input_sce)[grep("^Hb[^(p)]", rownames(input_sce),ignore.case = T)]
  print(Hb_genes)
  input_sce=PercentageFeatureSet(input_sce,  features = Hb_genes,col.name = "percent_hb")
  fivenum(input_sce@meta.data$percent_hb)
  
  #可視化細(xì)胞的上述比例情況
  feats <- c("nFeature_RNA", "nCount_RNA", "percent_mito",
 "percent_ribo", "percent_hb")
  feats <- c("nFeature_RNA", "nCount_RNA")
  p1=VlnPlot(input_sce, group.by = "orig.ident", features = feats, pt.size = 0, ncol = 2) + 
 NoLegend()
  p1 
  w=length(unique(input_sce$orig.ident))/3+5;w
  ggsave(filename="Vlnplot1.pdf",plot=p1,width = w,height = 5)
  
  feats <- c("percent_mito", "percent_ribo", "percent_hb")
  p2=VlnPlot(input_sce, group.by = "orig.ident", features = feats, pt.size = 0, ncol = 3, same.y.lims=T) + 
 scale_y_continuous(breaks=seq(0, 100, 5)) +
 NoLegend()
  p2 
  w=length(unique(input_sce$orig.ident))/2+5;w
  ggsave(filename="Vlnplot2.pdf",plot=p2,width = w,height = 5)
  
  p3=FeatureScatter(input_sce, "nCount_RNA", "nFeature_RNA", group.by = "orig.ident", pt.size = 0.5)
  ggsave(filename="Scatterplot.pdf",plot=p3)
  
  #根據(jù)上述指標(biāo)，過濾低質(zhì)量細(xì)胞/基因
  #過濾指標(biāo)1:最少表達(dá)基因數(shù)的細(xì)胞&最少表達(dá)細(xì)胞數(shù)的基因
  # 一般來說，在CreateSeuratObject的時(shí)候已經(jīng)是進(jìn)行了這個(gè)過濾操作
  # 如果后期看到了自己的單細(xì)胞降維聚類分群結(jié)果很詭異，就可以回過頭來看質(zhì)量控制環(huán)節(jié)
  # 先走默認(rèn)流程即可
  if(F){
 selected_c <- WhichCells(input_sce, expression = nFeature_RNA > 500)
 selected_f <- rownames(input_sce)[Matrix::rowSums(input_sce@assays$RNA$counts > 0 ) > 3]
 input_sce.filt <- subset(input_sce, features = selected_f, cells = selected_c)
 dim(input_sce) 
 dim(input_sce.filt) 
  }
  
  input_sce.filt =  input_sce
  
  # par(mar = c(4, 8, 2, 1))
  # 這里的C 這個(gè)矩陣，有一點(diǎn)大，可以考慮隨抽樣 
  C=subset(input_sce.filt,downsample=100)@assays$RNA$counts
  dim(C)
  C=Matrix::t(Matrix::t(C)/Matrix::colSums(C)) * 100

  most_expressed <- order(apply(C, 1, median), decreasing = T)[50:1]
  
  pdf("TOP50_most_expressed_gene.pdf",width=14)
  boxplot(as.matrix(Matrix::t(C[most_expressed, ])),
 cex = 0.1, las = 1, 
 xlab = "% total count per cell", 
 col = (scales::hue_pal())(50)[50:1], 
 horizontal = TRUE)
  dev.off()
  rm(C)
  
  #過濾指標(biāo)2:線粒體/核糖體基因比例(根據(jù)上面的violin圖)
  selected_mito <- WhichCells(input_sce.filt, expression = percent_mito < 25)
  selected_ribo <- WhichCells(input_sce.filt, expression = percent_ribo > 3)
  selected_hb <- WhichCells(input_sce.filt, expression = percent_hb < 1 )
  length(selected_hb)
  length(selected_ribo)
  length(selected_mito)
  
  input_sce.filt <- subset(input_sce.filt, cells = selected_mito)
  #input_sce.filt <- subset(input_sce.filt, cells = selected_ribo)
  input_sce.filt <- subset(input_sce.filt, cells = selected_hb)
  dim(input_sce.filt)
  
  table(input_sce.filt$orig.ident) 
  
  #可視化過濾后的情況
  feats <- c("nFeature_RNA", "nCount_RNA")
  p1_filtered=VlnPlot(input_sce.filt, group.by = "orig.ident", features = feats, pt.size = 0, ncol = 2) + 
 NoLegend()
  w=length(unique(input_sce.filt$orig.ident))/3+5;w 
  ggsave(filename="Vlnplot1_filtered.pdf",plot=p1_filtered,width = w,height = 5)
  
  feats <- c("percent_mito", "percent_ribo", "percent_hb")
  p2_filtered=VlnPlot(input_sce.filt, group.by = "orig.ident", features = feats, pt.size = 0, ncol = 3) + 
 NoLegend()
  w=length(unique(input_sce.filt$orig.ident))/2+5;w 
  ggsave(filename="Vlnplot2_filtered.pdf",plot=p2_filtered,width = w,height = 5) 
  return(input_sce.filt) 
  
}

可視化

step3: harmony整合多個(gè)單細(xì)胞樣品

細(xì)胞篩選之后，我們需要進(jìn)行harmony整合。第一期我們在創(chuàng)建總的Seurat對象時(shí)，使用了merge函數(shù)對多個(gè)Seurat進(jìn)行了簡單的合并。merge只是按照行和列進(jìn)行了合并，并未對數(shù)據(jù)進(jìn)行其他處理。

在拿到下游單細(xì)胞矩陣前，樣本經(jīng)歷了多個(gè)實(shí)驗(yàn)環(huán)節(jié)的處理，時(shí)間、處理人員、試劑以及技術(shù)平臺等變量都會(huì)成為混雜因素。以上因素混合到一起，就會(huì)導(dǎo)致數(shù)據(jù)產(chǎn)生批次效應(yīng)（batch effect）。為了盡可能避免混雜因素，我們可以嚴(yán)格把控測序的技術(shù)流程，同時(shí)也需要在下游分析中進(jìn)行事后補(bǔ)救（算法去批次）。目前單細(xì)胞測序常用的去批次算法有Harmony，CCA，RPCA,FastMNN,scVI等。在這里，我們使用Harmony進(jìn)行演示。

set.seed(10086)
table(sce.all.filt$orig.ident)
if(T){
  dir.create("2-harmony")
  getwd()
  setwd("2-harmony")
  source('../scRNA_scripts/harmony.R')
  # 默認(rèn) ScaleData 沒有添加"nCount_RNA", "nFeature_RNA"
  # 默認(rèn)的
  sce.all.int = run_harmony(sce.all.filt)
  setwd('../')
}

harmony函數(shù)如下：

run_harmony <- function(input_sce){
  
  print(dim(input_sce))
  input_sce <- NormalizeData(input_sce, 
  normalization.method = "LogNormalize",
  scale.factor = 1e4) 
  input_sce <- FindVariableFeatures(input_sce)
  input_sce <- ScaleData(input_sce)
  input_sce <- RunPCA(input_sce, features = VariableFeatures(object = input_sce))
  seuratObj <- RunHarmony(input_sce, "orig.ident")
  names(seuratObj@reductions)
  seuratObj <- RunUMAP(seuratObj,  dims = 1:15, 
  reduction = "harmony")
  
  # p = DimPlot(seuratObj,reduction = "umap",label=T ) 
  # ggsave(filename='umap-by-orig.ident-after-harmony',plot = p)
  
  input_sce=seuratObj
  input_sce <- FindNeighbors(input_sce, reduction = "harmony",
  dims = 1:15) 
  input_sce.all=input_sce
  
  #設(shè)置不同的分辨率，觀察分群效果(選擇哪一個(gè)？)
  for (res in c(0.01, 0.05, 0.1, 0.2, 0.3, 0.5,0.8,1)) {
 input_sce.all=FindClusters(input_sce.all, #graph.name = "CCA_snn", 
 resolution = res, algorithm = 1)
  }
  colnames(input_sce.all@meta.data)
  apply(input_sce.all@meta.data[,grep("RNA_snn",colnames(input_sce.all@meta.data))],2,table)
  
  p1_dim=plot_grid(ncol = 3, DimPlot(input_sce.all, reduction = "umap", group.by = "RNA_snn_res.0.01") + 
   ggtitle("louvain_0.01"), DimPlot(input_sce.all, reduction = "umap", group.by = "RNA_snn_res.0.1") + 
   ggtitle("louvain_0.1"), DimPlot(input_sce.all, reduction = "umap", group.by = "RNA_snn_res.0.2") + 
   ggtitle("louvain_0.2"))
  ggsave(plot=p1_dim, filename="Dimplot_diff_resolution_low.pdf",width = 14)
  
  p1_dim=plot_grid(ncol = 3, DimPlot(input_sce.all, reduction = "umap", group.by = "RNA_snn_res.0.8") + 
   ggtitle("louvain_0.8"), DimPlot(input_sce.all, reduction = "umap", group.by = "RNA_snn_res.1") + 
   ggtitle("louvain_1"), DimPlot(input_sce.all, reduction = "umap", group.by = "RNA_snn_res.0.3") + 
   ggtitle("louvain_0.3"))
  ggsave(plot=p1_dim, filename="Dimplot_diff_resolution_high.pdf",width = 18)
  
  p2_tree=clustree(input_sce.all@meta.data, prefix = "RNA_snn_res.")
  ggsave(plot=p2_tree, filename="Tree_diff_resolution.pdf")
  table(input_sce.all@active.ident) 
  saveRDS(input_sce.all, "sce.all_int.rds")
  return(input_sce.all)
  
}

可視化

結(jié)語

本期我們整理了肺腺癌單細(xì)胞數(shù)據(jù)集GSE189357個(gè)的10X格式的單細(xì)胞測序數(shù)據(jù)，在批量讀取了10X文件后，進(jìn)行了合并并成功構(gòu)建Seurat對象，在此基礎(chǔ)上將患者的臨床信息添加到meta.data分組信息中。在此基礎(chǔ)上走Seurat V5標(biāo)準(zhǔn)流程，對Seurat對象進(jìn)行QC質(zhì)控、并利用harmony整合去批次、并按標(biāo)準(zhǔn)流程進(jìn)行降維聚類分群。下一期，我們將進(jìn)行單細(xì)胞測序數(shù)據(jù)分析最重要（個(gè)人認(rèn)為）的一步：細(xì)胞注釋。下一期咱們不見不散！