library(tidyverse)
library(limpa)
library(ruv)
library(ruvms)
library(matrixStats)
library(RColorBrewer)
library(DT)
library(ggvenn)
theme_set(theme_bw())
# utils
source("/Users/li.me/Documents/Proteomics/250128-utils.R")
Example analysis of two calibration datasets using limpa and limma
1 Overview
In-house global proteomics experiments were performed on total protein lysates from normal human adult bladder (BL), frontal lobe (FL) and cerebellum (CB) total protein (BioChain).
Pilot was prepared and processed in 2023 and the second experiment on pure tissue lysates were analysed in 2024. In particular, the pilot was acquired on timsTOF Pro (Bruker) and the second batch was acquired on the Orbitrap Astral. Both experiments were searched by DIA-NN (1.8).
The main aim of this document is to demonstrate the limpa-limma
workflow for differential expression (DE) analysis on mass spectrometry-based proteomics data. In the first part of this document, a standard limpa-limma
DE analysis is performed on the second batch of data to compare the proteomes of different tissues. A pathway analysis and GO analysis is also demonstrated.
In the second part of this document, since batch correction/normalisation is another often debated topic, I am also including some basic assessment and treatment strategies into the discussion here. What I present here is not and should not be considered as the “gold standard recipe” to follow for batch correction; it is a rather stringent (but safer) option, and I only aim to demonstrate some of the functionalities limpa
and limma
have to help investigate and deal with batch effects. This document entails only what I would do in a preliminary analysis given these datasets. It is important to bear in mind that every experiment is different and should be examined with the aid of expert domain knowledge.
Nevertheless, here are a few options that I would recommend to consider for normalisation and batch correction:
If clear batches present, include a
batch
covariate into the design matrix, and use thelimma::removeBatchEffect()
function for visualisation. This is one of the more reliable approaches when batches are clearly identified. Yet, you will see later in this document that this approach does not fit universally to every dataset. Some datasets require bespoke analysis.For experiments with larger sample sizes, it is often observed in the samples the instrument drifting effects, i.e., the sample run order effect. Based on my own experience,
limma::normalizeCyclicLoess()
is often effective enough to remove such technical variation.For some more challenging datasets, one may also consider
ruv::RUV4()
on the protein level.RUV4
was designed for microarray data but the idea of applying it on protein-level data is worthy of a try.Sometimes, there are no multiple batches but only some “bad quality samples”. In that case, estimating samples weights using
dpcDE(..., sample.weights = TRUE)
might be another good option. “Bad quality samples” will then be downweighted in the DE analysis.Relative-log-abundance (RLA) plots are useful in visualising batch effects and/or unwanted variation in data.
When your data look sensible enough, or if there are special biological contexts (e.g., with enrichments), there is no need to always force normalisation on your data.
On the wet lab side:
Talk to the statisticians and bioinformaticians before carrying out the very expensive experiments.
INCLUDE TECHNICAL REPLICATES!! when you know there are going to be multiple batches. That is the only way to evaluate your batch correction performances. Include (as many as possible) biological replicates when you can.
What I hope for this document/demo is to provoke new discussions on better practices for MS-based proteomics data analysis, and in particular, careful exploratory data analysis (EDA) is compulsory before jumping to any conclusions; and to point out that data normalisation/batch correction remains a challenging problem for MS proteomics data.
Disclaimers: The pilot data are proprietary hence not public. The pure tissue data will be publicly available on PRIDE soon.
To cite this: Li M, Cobbold SA, Smyth GK (2025). Quantification and differential analysis of mass spectrometry proteomics data with probabilistic recovery of information from missing values. bioRxiv. doi: https://doi.org/10.1101/2025.04.28.651125
Acknowlegements
Gordon Smyth (WEHI).
Pilot: Anna Quaglieri (Mass Dynamics) and Toby Dite (CSL Limited.)
Pure tissue by Orbitrap Astral: Simon Cobbold (WEHI).
The WEHI Proteomics Facility.
2 Load libraries
3 Part 1: human tissue proteomes measured by the Orbitrap Astral
3.1 Read data
<- "/Users/li.me/Documents/Proteomics/datasets/_mixture/data/2024-09-14_3_lysate_mix/B_CB_FL_combined_wMBR/"
path_pure <- readDIANN(paste0(path_pure, "report.tsv"))
pure
<- filterNonProteotypicPeptides(pure) |>
pure filterCompoundProteins()
dim(pure)
[1] 135528 15
1:2, 1:2] pure[
An object of class "EList"
$E
FL_1 FL_2
AAAAAAAAAAAPPAPPEGASPGDSAR3 15.17844 15.24111
AAAAAAAAAVSR2 NA NA
$genes
Protein.Group Protein.Names Genes
AAAAAAAAAAAPPAPPEGASPGDSAR3 Q8WXD9 CSKI1_HUMAN CASKIN1
AAAAAAAAAVSR2 Q96JP5 ZFP91_HUMAN ZFP91
Precursor.Id
AAAAAAAAAAAPPAPPEGASPGDSAR3 AAAAAAAAAAAPPAPPEGASPGDSAR3
AAAAAAAAAVSR2 AAAAAAAAAVSR2
Sample information
<- data.frame(Run = colnames(pure)) |>
pure_samples mutate(Condition = str_sub(Run, end = -3)) |>
arrange(Condition, Run) |>
mutate(Replicate = str_sub(Run, start = -1)) |>
mutate(Condition = ifelse(Condition == "B", "BL", Condition)) |>
mutate(Condition = factor(Condition, levels = c("BL", "CB", "FL")),
Batch = "Second",
User = "User3") |>
select(Run, User, Condition, Batch) |>
arrange(factor(Run, levels = colnames(pure)))
datatable(pure_samples, rownames = FALSE)
3.2 Basic exploratory analysis
Basic statistics
dim(pure)
[1] 135528 15
length(unique(pure$genes$Protein.Group))
[1] 9278
Overall proportion of missing values:
sum(is.na(pure$E)) / length(pure$E)
[1] 0.3346236
Missing values in each sample
# make cond.cols for consistency
<- brewer.pal(6, "Spectral")[1:3]
cond.cols
mutate(pure_samples, NAproportion = colMeans( is.na(pure$E) )) |>
ggplot(aes(x = Run, y = NAproportion, fill = Condition)) +
geom_col() +
facet_wrap(~ User, ncol = 1, scales = "free_x") +
labs(y = "Missing value proportion") +
scale_fill_manual(values = cond.cols) +
theme(axis.text.x = element_text(angle = 90))
MDS plot
plotMDS(pure,
bg = cond.cols[as.numeric(pure_samples$Condition)],
pch = 21, cex = 2.5,
main = "MDS plot, precursor level")
legend("topright", legend = levels(pure_samples$Condition),
col = cond.cols, pch = 16, pt.cex = 2, horiz = TRUE)
Box plots of log-intensities
boxplot(pure$E, outline = FALSE,
boxfill = cond.cols[as.numeric(pure_samples$Condition)], xaxt = "none",
main = "Box plots of log2-intensities")
legend("topleft",
legend = sort(unique(pure_samples$Condition)),
fill = cond.cols,
bty = "n", horiz = TRUE, x.intersp = 0.6)
FL and CB samples are more similar compared to FL.
Densities of log-intensities
plotDensities(pure$E, col = cond.cols[as.numeric(pure_samples$Condition)])
Again - we see a shift in the log-intensities in the bladder samples compared to the two brain tissues.
The relative-log-abundance (RLA) plot
RLA(t(pure$E),
repCol = cond.cols[as.numeric(pure_samples$Condition)],
repLabel = unique(pure_samples$Condition),
guides = seq(-3, 3, 1),
title = "RLA plot, precursor level"
)
RLA plots are a useful visualisation tool to inspect for batch effects and/or unwanted variation in data, although for proteomics data, because of the missing values, I prefer to use RLA plots as a guideline joint with other diagnostic plots, but not treat them as a definitive diagnostic criterion by themselves.
3.3 Detection probability curve (DPC)
<- dpc(pure)
pure_dpcfit plotDPC(pure_dpcfit)
$dpc pure_dpcfit
beta0 beta1
-4.9974474 0.4286703
Group-wise DPC
par(mfrow = c(2, 2))
for (gg in unique(pure_samples$Condition)) {
<- pure$E[, pure_samples$Condition == gg]
subset <- dpc(subset, b1.upper = 1.5)
dpc_subset plotDPC(dpc_subset, ylim = c(0, 1.1), main = gg)
title(sub = paste0(c("b0 = ", "b1 = "), round(dpc_subset$dpc, 2), collapse = ", "))
}
8717 peptides are completely missing in all samples.
16571 peptides are completely missing in all samples.
52342 peptides are completely missing in all samples.
The three tissues are in fact very different biologically. Here we observe bigger but similar slopes in group-wise DPCs. Based our past experiences on a few in-house biological experiments from the Orbitrap Astral, I will perform a bespoke analysis for this dataset where I will specify a DPC slope of 0.8 in the next section. However, note that DPC-Quant results are quite robust to the DPC.
3.4 Protein quantification by DPC-Quant
<- dpcQuant(pure, dpc.slope = 0.8) pureProt
DPC intercept estimated as -9.005
Estimating hyperparameters ...
Quantifying proteins ...
Proteins: 1000 Peptides: 14268
Proteins: 2000 Peptides: 32667
Proteins: 3000 Peptides: 50954
Proteins: 4000 Peptides: 67969
Proteins: 5000 Peptides: 82912
Proteins: 6000 Peptides: 94015
Proteins: 7000 Peptides: 106315
Proteins: 8000 Peptides: 117511
Proteins: 9000 Peptides: 131293
Proteins: 9278 Peptides: 135528
MDS plot after protein quantification
plotMDSUsingSEs(pureProt,
bg = cond.cols[as.numeric(pure_samples$Condition)],
pch = 21, cex = 2.5,
main = "MDS plot, DPC-Quant")
legend("topright", legend = levels(pure_samples$Condition),
col = cond.cols, pch = 16, pt.cex = 2, horiz = TRUE)
3.5 DE analysis
Design matrix
<- model.matrix(~ 0 + Condition, data = pure_samples)
pure_design colnames(pure_design) <- str_remove(colnames(pure_design), "Condition")
<- makeContrasts(CBvsFL = CB - FL,
pure_contr CBvsBL = CB - BL,
FLvsBL = FL - BL,
levels = pure_design)
<- dpcDE(pureProt, pure_design) fit
<- contrasts.fit(fit, contrasts = pure_contr)
fit <- eBayes(fit, robust = TRUE)
fit
summary(fit$df.prior)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.3535 6.4637 6.4637 6.4095 6.4637 6.4637
summary(decideTests(fit))
CBvsFL CBvsBL FLvsBL
Down 4015 2211 2116
NotSig 2730 2086 1590
Up 2533 4981 5572
3.5.1 Take CB vs FL as an example and visualise the results
Top table
<- topTable(fit, coef = "CBvsFL", number = Inf) |>
res signifTopTable()
datatable(res[1:50, ], rownames = FALSE, caption = "Top 50 DE proteins comparing CB versus FL")
plotVolcano(fit, coef = "CBvsFL", col_symbol = "Protein.Names",
label.P.neg = 1e-15, label.P.pos = 1e-15)
Warning: ggrepel: 111 unlabeled data points (too many overlaps). Consider
increasing max.overlaps
Or, the mean-difference(MD) plot which we prefer because it gives also the average expression
plotMD(fit, column = "CBvsFL",
status = decideTests(fit)[, "CBvsFL"],
hl.cex = 0.5,
main = "Mean-difference plot, coef = CBvsFL")
Plot the top 5 DE proteins
par(mfrow = c(3, 2))
for (ii in 1:5) {
<- res$Protein.Group[ii]
ii_pg ::plotProtein(pureProt, ii_pg,
limpacol = cond.cols[as.numeric(pure_samples$Condition)],
cex = 1,
main = ii_pg)
}
Pathway analysis
## sometimes the UniProtKB to gene name mapping is not up to date
## update UniProtID to gene name mapping
<- read_delim("/Users/li.me/Documents/Proteomics/datasets/Uniprot/250303-HUMAN_9606_idmapping.dat.txt", delim = "\t", col_names = FALSE, show_col_types = FALSE)
db <- filter(db, X2 %in% c("UniProtKB-ID", "Gene_Name"))
db <- pivot_wider(db, id_cols = X1, names_from = X2, values_from = X3, values_fn = \(x) paste0(x, collapse = ";")) |>
db rename(Protein.Group = X1)
$genes <- left_join(fit$genes, db, by = "Protein.Group")
fit
# proteins mapped to multiple gene names are disregarded
$genes$Gene_Name[str_detect(fit$genes$Gene_Name, ";")] <- NA_character_
fit
# map names to entrez
<- "/Users/li.me/Documents/Proteomics/datasets/Uniprot/250423-Homo_sapiens.gene_info.txt"
gene_info_file $genes <- mutate(fit$genes, alias2SymbolUsingNCBI(Gene_Name, gene.info.file = gene_info_file)) |>
fitselect(-Gene_Name)
<- fit$genes genes_anno
KEGG analysis
<- kegga(fit, coef = "CBvsFL", species = "Hs", geneid = "GeneID", trend = TRUE)
keggs
topKEGG(keggs, number = 10)
Pathway N Up Down
hsa00190 Oxidative phosphorylation 98 83 7
hsa01100 Metabolic pathways 1104 444 400
hsa03040 Spliceosome 111 71 25
hsa05415 Diabetic cardiomyopathy 156 88 37
hsa03010 Ribosome 121 73 33
hsa05020 Prion disease 203 106 71
hsa05012 Parkinson disease 195 103 67
hsa05016 Huntington disease 229 114 79
hsa05208 Chemical carcinogenesis - reactive oxygen species 164 87 47
hsa04932 Non-alcoholic fatty liver disease 107 62 26
P.Up P.Down
hsa00190 3.671350e-29 1.0000000
hsa01100 4.788417e-20 1.0000000
hsa03040 1.427665e-15 0.9999996
hsa05415 6.015432e-13 1.0000000
hsa03010 1.029153e-12 0.9999959
hsa05020 4.933256e-12 0.9997452
hsa05012 5.899400e-12 0.9998710
hsa05016 1.531455e-11 0.9998772
hsa05208 4.756629e-11 0.9999983
hsa04932 3.412203e-10 0.9999990
Thanks to ChatGPT 4o:
- Oxidative phosphorylation (hsa00190) is a core mitochondrial pathway responsible for ATP production.
- The cerebellum is known to have higher mitochondrial activity and oxidative metabolism compared to the frontal cortex, reflecting its role in coordinating fine motor activity and maintaining high firing rates in Purkinje cells.
- In contrast, the frontal lobe (especially the prefrontal cortex) shows different metabolic preferences and synaptic protein enrichment patterns, often with a greater proportion of aerobic glycolysis.
Gene ontology analysis:
<- goana(fit, coef = "CBvsFL", species = "Hs", geneid = "GeneID", trend = TRUE)
gos
topGO(gos, number = 10, ontology = "BP")
Term Ont N Up
GO:0006139 nucleobase-containing compound metabolic process BP 2463 855
GO:0042773 ATP synthesis coupled electron transport BP 72 61
GO:0042775 mitochondrial ATP synthesis coupled electron transport BP 72 61
GO:0019646 aerobic electron transport chain BP 67 58
GO:0006396 RNA processing BP 571 254
GO:0034654 nucleobase-containing compound biosynthetic process BP 1944 693
GO:0006119 oxidative phosphorylation BP 111 80
GO:0051641 cellular localization BP 2278 533
GO:0022904 respiratory electron transport chain BP 85 66
GO:0008104 protein localization BP 1734 388
Down P.Up P.Down
GO:0006139 952 1.610816e-22 1.000000e+00
GO:0042773 7 7.391754e-22 1.000000e+00
GO:0042775 7 7.391754e-22 1.000000e+00
GO:0019646 5 7.539051e-22 1.000000e+00
GO:0006396 177 2.228147e-21 1.000000e+00
GO:0034654 709 2.402925e-21 1.000000e+00
GO:0006119 18 1.798419e-20 1.000000e+00
GO:0051641 1185 9.999999e-01 2.179603e-20
GO:0022904 11 8.801173e-20 1.000000e+00
GO:0008104 926 1.000000e+00 1.955675e-19
You can then go on and visualise the results etc.
4 Part 2: Two datasets analysed together
4.1 Load data from a previous experiement (the pilot)
<- "/Users/li.me/Documents/Proteomics/datasets/mixture_pilot/data/DIANN_libfree/"
path_pilot <- readDIANN(paste0(path_pilot, "P4455_libfree.tsv")) pilot
Some filtering on the precursors
<- filterNonProteotypicPeptides(pilot) |>
pilot filterCompoundProteins()
dim(pilot)
[1] 51579 42
4.2 Add sample information
<- read_csv(paste0(path_pilot, "../P4455-design.csv"), col_types = cols())
pilot_samples
<- data.frame(Run = colnames(pilot)) |>
pilot_samples mutate(ID = str_sub(Run, end = 8)) |>
left_join(pilot_samples, by = "ID") |>
arrange(ID) |>
mutate(Condition = str_remove_all(Condition, "ondition ")) |>
mutate(Condition = as.factor(Condition)) |>
mutate(RunOrder = str_sub(Run, start = -4)) |>
mutate(RunOrder = as.numeric(as.factor(as.numeric(RunOrder)))) |>
mutate(User = factor(User, labels = c("User1", "User2")))
Focus on the pure samples only here.
<- filter(pilot_samples, Condition %in% c("C5", "C6", "C7")) |>
pilot_samples mutate(Condition = case_when(Condition == "C5" ~ "BL",
== "C6" ~ "FL",
Condition == "C7" ~ "CB")) |>
Condition mutate(Condition = factor(Condition, levels = c("BL", "CB", "FL")),
Batch = "Pilot") |>
select(Run, User, Condition, Batch)
datatable(pilot_samples, rownames = FALSE)
<- pilot[, pilot_samples$Run] pilot
Also filter out precursors completely missing in the kept samples
<- rowMeans(is.na(pilot$E)) < 1
kp <- pilot[kp, ]
pilot
dim(pilot)
[1] 51486 18
4.3 Basic exploratory analysis for the pilot
4.3.1 Missing values
Overall proportion of missing values:
sum(is.na(pilot$E)) / length(pilot$E)
[1] 0.4157624
Missing values in each sample
# make cond.cols.pilot for consistency
<- brewer.pal(6, "Spectral")[4:6]
cond.cols.pilot
mutate(pilot_samples, NAproportion = colMeans( is.na(pilot$E) )) |>
ggplot(aes(x = Run, y = NAproportion, fill = Condition)) +
geom_col() +
facet_wrap(~ User, ncol = 1, scales = "free_x") +
labs(y = "Missing value proportion") +
scale_fill_manual(values = cond.cols.pilot) +
theme(axis.text.x = element_text(angle = 90))
4.3.2 Detection probability curves
<- dpc(pilot)
pilot_dpcfit plotDPC(pilot_dpcfit)
$dpc pilot_dpcfit
beta0 beta1
-4.5086055 0.4737302
Group-wise DPC (advanced option, future extension)
par(mfrow = c(2, 2))
for (gg in unique(pilot_samples$Condition)) {
<- pilot$E[, pilot_samples$Condition == gg]
subset <- dpc(subset)
dpc_subset plotDPC(dpc_subset, ylim = c(0, 1.1), main = gg)
title(sub = paste0(c("b0 = ", "b1 = "), round(dpc_subset$dpc, 2), collapse = ", "))
}
22027 peptides are completely missing in all samples.
8157 peptides are completely missing in all samples.
3260 peptides are completely missing in all samples.
We will use a DPC slope of 0.7 in DPC-Quantification later.
MDS plot
plotMDS(pilot,
bg = cond.cols.pilot[as.numeric(pilot_samples$Condition)],
pch = 21, cex = 2.5,
main = "MDS plot, precursor level")
legend("topleft", legend = levels(pilot_samples$Condition),
col = cond.cols.pilot, pch = 16, pt.cex = 2, horiz = TRUE)
Let us see how the pure samples look when the two experiments are combined. Concatenating two datasets:
length(pilot$genes$Precursor.Id)
[1] 51486
length(pure$genes$Precursor.Id)
[1] 135528
<- intersect(pilot$genes$Precursor.Id, pure$genes$Precursor.Id)
shared_precursor length(shared_precursor)
[1] 41856
For simplicity, will create a combined dataset by keeping only those shared precursors.
<- cbind(pure[shared_precursor, ], pilot[shared_precursor, ])
cmbn
<- add_row(pure_samples, pilot_samples) |>
cmbn_samples mutate(CondBatch = paste(Condition, Batch, sep = "_")) |>
mutate(CondBatch = factor(CondBatch, levels = sort(unique(CondBatch))))
<- cmbn[, cmbn_samples$Run]
cmbn dim(cmbn)
[1] 41856 33
length(unique(cmbn$genes$Protein.Group))
[1] 4935
MDS plot
<- c(cond.cols, cond.cols.pilot)
cmbn.cols
plotMDS(cmbn,
bg = cmbn.cols[as.numeric(cmbn_samples$CondBatch)],
pch = 21, cex = 2,
main = "MDS plot, precursor level")
legend("bottom", legend = levels(cmbn_samples$CondBatch),
col = cmbn.cols, pch = 16, pt.cex = 2, cex = 0.8)
Two batches are clearly separated by the horizontal dimension.
RLA(t(cmbn$E),
repCol = cmbn.cols[as.numeric(cmbn_samples$CondBatch)],
repLabel = unique(cmbn_samples$CondBatch),
guides = seq(-5, 10, 5),
title = "RLA plot; shared precursors"
)
Two batches are clearly separated again.
4.4 Protein quantification by DPC-Quant
I will perform the protein quantification separately for each batch. Since I will be analysing the two batches together on the protein level, I will quantify only proteins that are shared between the two batches.
<- pure[shared_precursor, ]
pureShared <- pilot[shared_precursor, ] pilotShared
Use the DPCs fitted on all precursors.
<- dpcQuant(pureShared, dpc.slope = 0.8) pureProt
DPC intercept estimated as -8.95
Estimating hyperparameters ...
Quantifying proteins ...
Proteins: 1000 Peptides: 9045
Proteins: 2000 Peptides: 19840
Proteins: 3000 Peptides: 29235
Proteins: 4000 Peptides: 35395
Proteins: 4935 Peptides: 41856
<- dpcQuant(pilotShared, dpc.slope = 0.7) pilotProt
DPC intercept estimated as -6.406
Estimating hyperparameters ...
Quantifying proteins ...
Proteins: 1000 Peptides: 9045
Proteins: 2000 Peptides: 19840
Proteins: 3000 Peptides: 29235
Proteins: 4000 Peptides: 35395
Proteins: 4935 Peptides: 41856
4.5 Protein-level data exploration
par(mfrow = c(1, 2))
plotMDSUsingSEs(pureProt,
bg = cond.cols[as.numeric(pure_samples$Condition)],
pch = 21, cex = 2.5,
main = "MDS plot, DPC-Quant for main")
legend("bottom", legend = levels(pure_samples$Condition),
col = cond.cols, pch = 16, pt.cex = 2, horiz = TRUE)
plotMDSUsingSEs(pilotProt,
bg = cond.cols.pilot[as.numeric(pilot_samples$Condition)],
pch = 21, cex = 2.5,
main = "MDS plot, DPC-Quant for pilot")
legend("bottom", legend = levels(pilot_samples$Condition),
col = cond.cols.pilot, pch = 16, pt.cex = 2, horiz = TRUE)
Concatenate two batches
<- cbind(pureProt, pilotProt)
cmbnProt <- cmbnProt[, cmbn_samples$Run] cmbnProt
MDS plot
plotMDS(cmbnProt,
bg = cmbn.cols[as.numeric(cmbn_samples$CondBatch)],
pch = 21, cex = 2,
main = "MDS plot, DPC-Quant")
legend("bottom", legend = levels(cmbn_samples$CondBatch),
col = cmbn.cols, pch = 16, pt.cex = 2, horiz = FALSE)
Two batches are clearly separated by the horizontal dimension.
RLA plot
RLA(t(cmbnProt$E),
repCol = cmbn.cols[as.numeric(cmbn_samples$CondBatch)],
repLabel = unique(cmbn_samples$CondBatch),
title = "DPC-Quant summarisation of the two batches, simply concatenated"
)
Two batches clearly present themselves.
plotDensities(cmbnProt,
col = cmbn.cols[as.numeric(cmbn_samples$CondBatch)],
legend = FALSE,
main = "Log-intensity densities of shared proteins")
legend("topright", legend = levels(cmbn_samples$CondBatch),
col = cmbn.cols, pch = 16, pt.cex = 2)
Samples clearly distinguishable by both batch and tissue types. Also notice that the two batches of data have different dynamic ranges.
4.5.1 Try removeBatchEffect()
:
<- removeBatchEffect(cmbnProt,
cmbnProtNormed batch = cmbn_samples$Batch,
group = cmbn_samples$Condition)
plotMDS(cmbnProtNormed,
bg = cmbn.cols[as.numeric(cmbn_samples$CondBatch)],
pch = 21, cex = 2,
main = "MDS plot, protein level")
legend("bottom", legend = levels(cmbn_samples$CondBatch),
col = cmbn.cols, pch = 16, pt.cex = 2)
Batch effects were mitigated, but not entirely removed. Note that this examples is beyond simply “two batches” (they were not initially designed for evaluating normalisation methods after all). The two experiments were not only prepared and processed at two different times by different people, but were acquired on different instruments as well. The Orbitrap Astral run not only enabled much more protein IDs, but also covered a much bigger dynamic range. removeBatchEffect()
adjusts for the baseline batch effect for each protein, but does not adjust for the difference in dynamic ranges between the two experiments. To that end, more sophisticated normalisation methods are needed. Internal control proteins (if there are any; e.g., housekeeping proteins) would be helpful in achieving effective normalisation.
However, usually, the removeBatchEffect()
function gives good visual diagnostics. When that is the case, we add the batch
covariate into the design matrix when performing DE analysis as below.
<- model.matrix(~ 0 + Condition + Batch, data = cmbn_samples)
design_rmBatch head(design_rmBatch)
ConditionBL ConditionCB ConditionFL BatchSecond
1 0 0 1 1
2 0 0 1 1
3 0 0 1 1
4 0 0 1 1
5 0 0 1 1
6 0 1 0 1
You can then create your contrasts of interest (if needed) and perform the DE analysis.
4.5.2 Try cyclic loess
Alternatively, one can also try the cyclic loess normalisation.
normalizeCyclicLoess(cmbnProt$E) |>
plotMDS(bg = cmbn.cols[as.numeric(cmbn_samples$CondBatch)],
pch = 21, cex = 2,
main = "MDS plot, cyclic loess")
legend("bottom", legend = levels(cmbn_samples$CondBatch),
col = cmbn.cols, pch = 16, pt.cex = 2)
Although it did not “perfectly” correct for batch effects for this case study, the cyclic loess normalisation might still be useful and effective on other datasets. In fact, cyclic loess is very effective for normalisation of technical variations due to drifts in the instruments (run order effect).
4.5.3 Try ruv
First identify the factor of interest, which is tissue type
<- matrix(as.numeric(cmbn_samples$Condition) - 1, ncol = 1) X
Use lowly variable proteins as the negative controls
<- rowVars(cmbnProt$E)
var_protein <- var_protein <= quantile(var_protein, probs = 0.1) ctl
Run RUV4
<- RUV4(t(cmbnProt$E), X, ctl, k = 3) ruvout
Visualise RUV4-normalised data
<- model.matrix(~ 0 + Condition, data = cmbn_samples)
cmbn_design <- removeBatchEffect(cmbnProt, covariates = ruvout$W, design = cmbn_design)
cmbnProt_ruv4
plotMDS(cmbnProt_ruv4,
bg = cmbn.cols[as.numeric(cmbn_samples$CondBatch)],
pch = 21, cex = 2,
main = "MDS plot, RUV4 (k = 3)")
legend("top", legend = levels(cmbn_samples$CondBatch),
col = cmbn.cols, pch = 16, pt.cex = 2)
Seems good. I will proceed with RUV4-normalised data.
4.6 DE analysis
<- model.matrix(~ 0 + cmbn_samples$Condition + ruvout$W)
cmbn_design colnames(cmbn_design) <- str_remove(colnames(cmbn_design), "cmbn_samples\\$Condition|ruvout\\$")
head(cmbn_design)
BL CB FL W1 W2 W3
1 0 0 1 -0.2206172 -0.06748149 -0.44308012
2 0 0 1 -0.2233234 -0.06680479 -0.43701435
3 0 0 1 -0.2208496 -0.07859086 -0.43676195
4 0 0 1 -0.2192203 -0.07926116 -0.42090739
5 0 0 1 -0.2200015 -0.07943666 -0.43509509
6 0 1 0 -0.2102168 0.27764574 0.01756785
<- makeContrasts(CBvsFL = CB - FL,
cmbn_contr CBvsBL = CB - BL,
FLvsBL = FL - BL,
levels = cmbn_design)
<- dpcDE(cmbnProt, cmbn_design) cmbn_fit
<- contrasts.fit(cmbn_fit, contrasts = cmbn_contr)
cmbn_fit <- eBayes(cmbn_fit, robust = TRUE)
cmbn_fit
summary(cmbn_fit$df.prior)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.2202 6.7775 6.7775 6.7488 6.7775 6.7775
summary(decideTests(cmbn_fit))
CBvsFL CBvsBL FLvsBL
Down 657 1104 1725
NotSig 3439 2678 1552
Up 839 1153 1658
Check how many of the DE proteins comparing CB versus FL were identified in before
<- topTable(cmbn_fit, coef = "CBvsFL", number = Inf) |> signifTopTable()
cmbn_res
<- list(Pure = rownames(pureProt)[decideTests(fit)[, "CBvsFL"] != 0],
deps Cmbn = rownames(cmbnProt)[decideTests(cmbn_fit)[, "CBvsFL"] != 0])
ggvenn(deps)
Validation of the biological relevance of these results is yet need to be performed.
5 Key takeaways
- Data exploration and visualisation is a non-dispensible first step.
- Must have technical replicates if multiple batches are expected.
- There is no one-size-fits-all pipeline for all types of MS proteomics data.
- Several batch correction strategies are demonstrated here, but analysts will need to consider each dataset under its own context. The impact of normalisation methods needs to evaluated carefully while taking into account the corresponding biological contexts.
References
Li M, Cobbold SA, Smyth GK (2025). Quantification and differential analysis of mass spectrometry proteomics data with probabilistic recovery of information from missing values. bioRxiv. doi: https://doi.org/10.1101/2025.04.28.651125
Li M, Smyth GK (2023). Neither random nor censored: estimating intensity-dependent probabilities for missing values in label-free proteomics. Bioinformatics 39(5), btad200. doi:10.1093/bioinformatics/btad200
Li M (2024). Linear Models and Empirical Bayes Methods for Mass Spectrometry-based Proteomics Data. PhD Thesis, University of Melbourne. http://hdl.handle.net/11343/351600
Gandolfo, LC. and Speed, TP (2018). RLE plots: Visualizing unwanted variation in high dimensional data. PloS one, 13(2), p.e0191629. doi:10.1371/journal.pone.0191629
Gagnon-Bartsch, JA and Speed, TP (2012). Using control genes to correct for unwanted variation in microarray data. “Biostatistics”, 13(3), pp.539-552. doi:10.1093/biostatistics/kxr034
6 Session information
sessionInfo()
R version 4.5.0 (2025-04-11)
Platform: aarch64-apple-darwin20
Running under: macOS Sequoia 15.3.1
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.1
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
time zone: America/New_York
tzcode source: internal
attached base packages:
[1] grid stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] ggthemeML_1.0.1 colorspace_2.1-1 ggrepel_0.9.6 ggvenn_0.1.10
[5] DT_0.33 RColorBrewer_1.1-3 matrixStats_1.5.0 ruvms_0.99.0
[9] ruv_0.9.7.1 limpa_1.1.1 limma_3.64.0 lubridate_1.9.4
[13] forcats_1.0.0 stringr_1.5.1 dplyr_1.1.4 purrr_1.0.4
[17] readr_2.1.5 tidyr_1.3.1 tibble_3.2.1 ggplot2_3.5.2
[21] tidyverse_2.0.0
loaded via a namespace (and not attached):
[1] tidyselect_1.2.1 farver_2.1.2 blob_1.2.4
[4] Biostrings_2.76.0 fastmap_1.2.0 digest_0.6.37
[7] timechange_0.3.0 lifecycle_1.0.4 statmod_1.5.0
[10] KEGGREST_1.48.0 RSQLite_2.3.11 magrittr_2.0.3
[13] compiler_4.5.0 rlang_1.1.6 sass_0.4.10
[16] tools_4.5.0 yaml_2.3.10 data.table_1.17.0
[19] knitr_1.50 labeling_0.4.3 htmlwidgets_1.6.4
[22] bit_4.6.0 withr_3.0.2 BiocGenerics_0.54.0
[25] stats4_4.5.0 GO.db_3.21.0 scales_1.4.0
[28] cli_3.6.5 rmarkdown_2.29 crayon_1.5.3
[31] generics_0.1.3 RcppParallel_5.1.10 rstudioapi_0.17.1
[34] httr_1.4.7 tzdb_0.5.0 DBI_1.2.3
[37] cachem_1.1.0 parallel_4.5.0 AnnotationDbi_1.70.0
[40] XVector_0.48.0 BiasedUrn_2.0.12 vctrs_0.6.5
[43] jsonlite_2.0.0 IRanges_2.42.0 hms_1.1.3
[46] S4Vectors_0.46.0 bit64_4.6.0-1 crosstalk_1.2.1
[49] jquerylib_0.1.4 glue_1.8.0 stringi_1.8.7
[52] gtable_0.3.6 GenomeInfoDb_1.44.0 UCSC.utils_1.4.0
[55] pillar_1.10.2 htmltools_0.5.8.1 GenomeInfoDbData_1.2.14
[58] R6_2.6.1 zigg_0.0.2 vroom_1.6.5
[61] evaluate_1.0.3 Biobase_2.68.0 png_0.1-8
[64] Rfast_2.1.5.1 memoise_2.0.1 bslib_0.9.0
[67] Rcpp_1.0.14 gridExtra_2.3 org.Hs.eg.db_3.21.0
[70] xfun_0.52 pkgconfig_2.0.3