Example analysis of two calibration datasets using limpa and limma

Author

Mengbo Li

Published

May 9, 2025

1 Overview

In-house global proteomics experiments were performed on total protein lysates from normal human adult bladder (BL), frontal lobe (FL) and cerebellum (CB) total protein (BioChain).

Pilot was prepared and processed in 2023 and the second experiment on pure tissue lysates were analysed in 2024. In particular, the pilot was acquired on timsTOF Pro (Bruker) and the second batch was acquired on the Orbitrap Astral. Both experiments were searched by DIA-NN (1.8).

The main aim of this document is to demonstrate the limpa-limma workflow for differential expression (DE) analysis on mass spectrometry-based proteomics data. In the first part of this document, a standard limpa-limma DE analysis is performed on the second batch of data to compare the proteomes of different tissues. A pathway analysis and GO analysis is also demonstrated.

In the second part of this document, since batch correction/normalisation is another often debated topic, I am also including some basic assessment and treatment strategies into the discussion here. What I present here is not and should not be considered as the “gold standard recipe” to follow for batch correction; it is a rather stringent (but safer) option, and I only aim to demonstrate some of the functionalities limpa and limma have to help investigate and deal with batch effects. This document entails only what I would do in a preliminary analysis given these datasets. It is important to bear in mind that every experiment is different and should be examined with the aid of expert domain knowledge.

Nevertheless, here are a few options that I would recommend to consider for normalisation and batch correction:

If clear batches present, include a batch covariate into the design matrix, and use the limma::removeBatchEffect() function for visualisation. This is one of the more reliable approaches when batches are clearly identified. Yet, you will see later in this document that this approach does not fit universally to every dataset. Some datasets require bespoke analysis.
For experiments with larger sample sizes, it is often observed in the samples the instrument drifting effects, i.e., the sample run order effect. Based on my own experience, limma::normalizeCyclicLoess() is often effective enough to remove such technical variation.
For some more challenging datasets, one may also consider ruv::RUV4() on the protein level. RUV4 was designed for microarray data but the idea of applying it on protein-level data is worthy of a try.
Sometimes, there are no multiple batches but only some “bad quality samples”. In that case, estimating samples weights using dpcDE(..., sample.weights = TRUE) might be another good option. “Bad quality samples” will then be downweighted in the DE analysis.
Relative-log-abundance (RLA) plots are useful in visualising batch effects and/or unwanted variation in data.
When your data look sensible enough, or if there are special biological contexts (e.g., with enrichments), there is no need to always force normalisation on your data.

On the wet lab side:

Talk to the statisticians and bioinformaticians before carrying out the very expensive experiments.
INCLUDE TECHNICAL REPLICATES!! when you know there are going to be multiple batches. That is the only way to evaluate your batch correction performances. Include (as many as possible) biological replicates when you can.

What I hope for this document/demo is to provoke new discussions on better practices for MS-based proteomics data analysis, and in particular, careful exploratory data analysis (EDA) is compulsory before jumping to any conclusions; and to point out that data normalisation/batch correction remains a challenging problem for MS proteomics data.

Disclaimers: The pilot data are proprietary hence not public. The pure tissue data will be publicly available on PRIDE soon.

To cite this: Li M, Cobbold SA, Smyth GK (2025). Quantification and differential analysis of mass spectrometry proteomics data with probabilistic recovery of information from missing values. bioRxiv. doi: https://doi.org/10.1101/2025.04.28.651125

Acknowlegements

Gordon Smyth (WEHI).

Pilot: Anna Quaglieri (Mass Dynamics) and Toby Dite (CSL Limited.)

Pure tissue by Orbitrap Astral: Simon Cobbold (WEHI).

The WEHI Proteomics Facility.

2 Load libraries

library(tidyverse)
library(limpa)
library(ruv)
library(ruvms)
library(matrixStats)

library(RColorBrewer)
library(DT)
library(ggvenn)

theme_set(theme_bw())

# utils
source("/Users/li.me/Documents/Proteomics/250128-utils.R")

3 Part 1: human tissue proteomes measured by the Orbitrap Astral

3.1 Read data

path_pure <- "/Users/li.me/Documents/Proteomics/datasets/_mixture/data/2024-09-14_3_lysate_mix/B_CB_FL_combined_wMBR/"
pure <- readDIANN(paste0(path_pure, "report.tsv"))

pure <- filterNonProteotypicPeptides(pure) |> 
  filterCompoundProteins()
dim(pure)

[1] 135528     15

pure[1:2, 1:2]

An object of class "EList"
$E
                                FL_1     FL_2
AAAAAAAAAAAPPAPPEGASPGDSAR3 15.17844 15.24111
AAAAAAAAAVSR2                     NA       NA

$genes
                            Protein.Group Protein.Names   Genes
AAAAAAAAAAAPPAPPEGASPGDSAR3        Q8WXD9   CSKI1_HUMAN CASKIN1
AAAAAAAAAVSR2                      Q96JP5   ZFP91_HUMAN   ZFP91
                                           Precursor.Id
AAAAAAAAAAAPPAPPEGASPGDSAR3 AAAAAAAAAAAPPAPPEGASPGDSAR3
AAAAAAAAAVSR2                             AAAAAAAAAVSR2

Sample information

pure_samples <- data.frame(Run = colnames(pure)) |> 
  mutate(Condition = str_sub(Run, end = -3)) |> 
  arrange(Condition, Run) |> 
  mutate(Replicate = str_sub(Run, start = -1)) |> 
  mutate(Condition = ifelse(Condition == "B", "BL", Condition)) |> 
  mutate(Condition = factor(Condition, levels = c("BL", "CB", "FL")), 
         Batch = "Second", 
         User = "User3") |> 
  select(Run, User, Condition, Batch) |> 
  arrange(factor(Run, levels = colnames(pure)))

datatable(pure_samples, rownames = FALSE)

3.2 Basic exploratory analysis

Basic statistics

dim(pure)

[1] 135528     15

length(unique(pure$genes$Protein.Group))

[1] 9278

Overall proportion of missing values:

sum(is.na(pure$E)) / length(pure$E)

[1] 0.3346236

Missing values in each sample

# make cond.cols for consistency
cond.cols <- brewer.pal(6, "Spectral")[1:3]

mutate(pure_samples, NAproportion = colMeans( is.na(pure$E) )) |> 
  ggplot(aes(x = Run, y = NAproportion, fill = Condition)) + 
  geom_col() + 
  facet_wrap(~ User, ncol = 1, scales = "free_x") +
  labs(y = "Missing value proportion") + 
  scale_fill_manual(values = cond.cols) +
  theme(axis.text.x = element_text(angle = 90))

MDS plot

plotMDS(pure, 
        bg = cond.cols[as.numeric(pure_samples$Condition)], 
        pch = 21, cex = 2.5, 
        main = "MDS plot, precursor level")
legend("topright", legend = levels(pure_samples$Condition), 
       col = cond.cols, pch = 16, pt.cex = 2, horiz = TRUE)

Box plots of log-intensities

boxplot(pure$E, outline = FALSE, 
        boxfill = cond.cols[as.numeric(pure_samples$Condition)], xaxt = "none", 
        main = "Box plots of log2-intensities")
legend("topleft", 
       legend = sort(unique(pure_samples$Condition)), 
       fill = cond.cols, 
       bty = "n", horiz = TRUE, x.intersp = 0.6)

FL and CB samples are more similar compared to FL.

Densities of log-intensities

plotDensities(pure$E, col = cond.cols[as.numeric(pure_samples$Condition)])

Again - we see a shift in the log-intensities in the bladder samples compared to the two brain tissues.

The relative-log-abundance (RLA) plot

RLA(t(pure$E), 
    repCol = cond.cols[as.numeric(pure_samples$Condition)], 
    repLabel = unique(pure_samples$Condition), 
    guides = seq(-3, 3, 1), 
    title = "RLA plot, precursor level"
    )

RLA plots are a useful visualisation tool to inspect for batch effects and/or unwanted variation in data, although for proteomics data, because of the missing values, I prefer to use RLA plots as a guideline joint with other diagnostic plots, but not treat them as a definitive diagnostic criterion by themselves.

3.3 Detection probability curve (DPC)

pure_dpcfit <- dpc(pure)
plotDPC(pure_dpcfit)

pure_dpcfit$dpc

     beta0      beta1 
-4.9974474  0.4286703

Group-wise DPC

par(mfrow = c(2, 2))

for (gg in unique(pure_samples$Condition)) {
  
  subset <- pure$E[, pure_samples$Condition == gg]
  dpc_subset <- dpc(subset, b1.upper = 1.5)
  plotDPC(dpc_subset, ylim = c(0, 1.1), main = gg)
  title(sub = paste0(c("b0 = ", "b1 = "), round(dpc_subset$dpc, 2), collapse = ", "))
  
}

8717 peptides are completely missing in all samples.

16571 peptides are completely missing in all samples.

52342 peptides are completely missing in all samples.

The three tissues are in fact very different biologically. Here we observe bigger but similar slopes in group-wise DPCs. Based our past experiences on a few in-house biological experiments from the Orbitrap Astral, I will perform a bespoke analysis for this dataset where I will specify a DPC slope of 0.8 in the next section. However, note that DPC-Quant results are quite robust to the DPC.

3.4 Protein quantification by DPC-Quant

pureProt <- dpcQuant(pure, dpc.slope = 0.8)

DPC intercept estimated as -9.005

Estimating hyperparameters ...

Quantifying proteins ...

Proteins: 1000 Peptides: 14268

Proteins: 2000 Peptides: 32667

Proteins: 3000 Peptides: 50954

Proteins: 4000 Peptides: 67969

Proteins: 5000 Peptides: 82912

Proteins: 6000 Peptides: 94015

Proteins: 7000 Peptides: 106315

Proteins: 8000 Peptides: 117511

Proteins: 9000 Peptides: 131293

Proteins: 9278 Peptides: 135528

MDS plot after protein quantification

plotMDSUsingSEs(pureProt, 
                bg = cond.cols[as.numeric(pure_samples$Condition)], 
                pch = 21, cex = 2.5, 
                main = "MDS plot, DPC-Quant")
legend("topright", legend = levels(pure_samples$Condition), 
       col = cond.cols, pch = 16, pt.cex = 2, horiz = TRUE)

3.5 DE analysis

Design matrix

pure_design <- model.matrix(~ 0 + Condition, data = pure_samples)
colnames(pure_design) <- str_remove(colnames(pure_design), "Condition")

pure_contr <- makeContrasts(CBvsFL = CB - FL, 
                            CBvsBL = CB - BL, 
                            FLvsBL = FL - BL, 
                            levels = pure_design)

fit <- dpcDE(pureProt, pure_design)

fit <- contrasts.fit(fit, contrasts = pure_contr)
fit <- eBayes(fit, robust = TRUE)

summary(fit$df.prior)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.3535  6.4637  6.4637  6.4095  6.4637  6.4637

summary(decideTests(fit))

       CBvsFL CBvsBL FLvsBL
Down     4015   2211   2116
NotSig   2730   2086   1590
Up       2533   4981   5572

3.5.1 Take CB vs FL as an example and visualise the results

Top table

res <- topTable(fit, coef = "CBvsFL", number = Inf) |> 
  signifTopTable()

datatable(res[1:50, ], rownames = FALSE, caption = "Top 50 DE proteins comparing CB versus FL")

plotVolcano(fit, coef = "CBvsFL", col_symbol = "Protein.Names", 
            label.P.neg = 1e-15, label.P.pos = 1e-15)

Warning: ggrepel: 111 unlabeled data points (too many overlaps). Consider
increasing max.overlaps

Or, the mean-difference(MD) plot which we prefer because it gives also the average expression

plotMD(fit, column = "CBvsFL", 
       status = decideTests(fit)[, "CBvsFL"], 
       hl.cex = 0.5, 
       main = "Mean-difference plot, coef = CBvsFL")

Plot the top 5 DE proteins

par(mfrow = c(3, 2))

for (ii in 1:5) {
  ii_pg <- res$Protein.Group[ii]
  limpa::plotProtein(pureProt, ii_pg, 
                     col = cond.cols[as.numeric(pure_samples$Condition)], 
                     cex = 1, 
                     main = ii_pg)
}

Pathway analysis

## sometimes the UniProtKB to gene name mapping is not up to date
## update UniProtID to gene name mapping 
db <- read_delim("/Users/li.me/Documents/Proteomics/datasets/Uniprot/250303-HUMAN_9606_idmapping.dat.txt", delim = "\t", col_names = FALSE, show_col_types = FALSE)
db <- filter(db, X2 %in% c("UniProtKB-ID", "Gene_Name"))
db <- pivot_wider(db, id_cols = X1, names_from = X2, values_from = X3, values_fn = \(x) paste0(x, collapse = ";")) |> 
  rename(Protein.Group = X1)

fit$genes <- left_join(fit$genes, db, by = "Protein.Group")

# proteins mapped to multiple gene names are disregarded
fit$genes$Gene_Name[str_detect(fit$genes$Gene_Name, ";")] <- NA_character_

# map names to entrez
gene_info_file <- "/Users/li.me/Documents/Proteomics/datasets/Uniprot/250423-Homo_sapiens.gene_info.txt"
fit$genes <- mutate(fit$genes, alias2SymbolUsingNCBI(Gene_Name, gene.info.file = gene_info_file)) |> 
  select(-Gene_Name)

genes_anno <- fit$genes

KEGG analysis

keggs <- kegga(fit, coef = "CBvsFL", species = "Hs", geneid = "GeneID", trend = TRUE)

topKEGG(keggs, number = 10)

                                                   Pathway    N  Up Down
hsa00190                         Oxidative phosphorylation   98  83    7
hsa01100                                Metabolic pathways 1104 444  400
hsa03040                                       Spliceosome  111  71   25
hsa05415                           Diabetic cardiomyopathy  156  88   37
hsa03010                                          Ribosome  121  73   33
hsa05020                                     Prion disease  203 106   71
hsa05012                                 Parkinson disease  195 103   67
hsa05016                                Huntington disease  229 114   79
hsa05208 Chemical carcinogenesis - reactive oxygen species  164  87   47
hsa04932                 Non-alcoholic fatty liver disease  107  62   26
                 P.Up    P.Down
hsa00190 3.671350e-29 1.0000000
hsa01100 4.788417e-20 1.0000000
hsa03040 1.427665e-15 0.9999996
hsa05415 6.015432e-13 1.0000000
hsa03010 1.029153e-12 0.9999959
hsa05020 4.933256e-12 0.9997452
hsa05012 5.899400e-12 0.9998710
hsa05016 1.531455e-11 0.9998772
hsa05208 4.756629e-11 0.9999983
hsa04932 3.412203e-10 0.9999990

Thanks to ChatGPT 4o:

Oxidative phosphorylation (hsa00190) is a core mitochondrial pathway responsible for ATP production.
The cerebellum is known to have higher mitochondrial activity and oxidative metabolism compared to the frontal cortex, reflecting its role in coordinating fine motor activity and maintaining high firing rates in Purkinje cells.
In contrast, the frontal lobe (especially the prefrontal cortex) shows different metabolic preferences and synaptic protein enrichment patterns, often with a greater proportion of aerobic glycolysis.

Gene ontology analysis:

gos <- goana(fit, coef = "CBvsFL", species = "Hs", geneid = "GeneID", trend = TRUE)

topGO(gos, number = 10, ontology = "BP")

                                                             Term Ont    N  Up
GO:0006139       nucleobase-containing compound metabolic process  BP 2463 855
GO:0042773               ATP synthesis coupled electron transport  BP   72  61
GO:0042775 mitochondrial ATP synthesis coupled electron transport  BP   72  61
GO:0019646                       aerobic electron transport chain  BP   67  58
GO:0006396                                         RNA processing  BP  571 254
GO:0034654    nucleobase-containing compound biosynthetic process  BP 1944 693
GO:0006119                              oxidative phosphorylation  BP  111  80
GO:0051641                                  cellular localization  BP 2278 533
GO:0022904                   respiratory electron transport chain  BP   85  66
GO:0008104                                   protein localization  BP 1734 388
           Down         P.Up       P.Down
GO:0006139  952 1.610816e-22 1.000000e+00
GO:0042773    7 7.391754e-22 1.000000e+00
GO:0042775    7 7.391754e-22 1.000000e+00
GO:0019646    5 7.539051e-22 1.000000e+00
GO:0006396  177 2.228147e-21 1.000000e+00
GO:0034654  709 2.402925e-21 1.000000e+00
GO:0006119   18 1.798419e-20 1.000000e+00
GO:0051641 1185 9.999999e-01 2.179603e-20
GO:0022904   11 8.801173e-20 1.000000e+00
GO:0008104  926 1.000000e+00 1.955675e-19

You can then go on and visualise the results etc.

4 Part 2: Two datasets analysed together

4.1 Load data from a previous experiement (the pilot)

path_pilot <- "/Users/li.me/Documents/Proteomics/datasets/mixture_pilot/data/DIANN_libfree/"
pilot <- readDIANN(paste0(path_pilot, "P4455_libfree.tsv"))

Some filtering on the precursors

pilot <- filterNonProteotypicPeptides(pilot) |> 
  filterCompoundProteins()
dim(pilot)

[1] 51579    42

4.2 Add sample information

pilot_samples <- read_csv(paste0(path_pilot, "../P4455-design.csv"), col_types = cols())

pilot_samples <- data.frame(Run = colnames(pilot)) |> 
  mutate(ID = str_sub(Run, end = 8)) |> 
  left_join(pilot_samples, by = "ID") |> 
  arrange(ID) |> 
  mutate(Condition = str_remove_all(Condition, "ondition ")) |> 
  mutate(Condition = as.factor(Condition)) |> 
  mutate(RunOrder = str_sub(Run, start = -4)) |> 
  mutate(RunOrder = as.numeric(as.factor(as.numeric(RunOrder)))) |> 
  mutate(User = factor(User, labels = c("User1", "User2")))

Focus on the pure samples only here.

pilot_samples <- filter(pilot_samples, Condition %in% c("C5", "C6", "C7")) |> 
  mutate(Condition = case_when(Condition == "C5" ~ "BL",
                               Condition == "C6" ~ "FL", 
                               Condition == "C7" ~ "CB")) |> 
  mutate(Condition = factor(Condition, levels = c("BL", "CB", "FL")), 
         Batch = "Pilot") |> 
  select(Run, User, Condition, Batch)

datatable(pilot_samples, rownames = FALSE)

pilot <- pilot[, pilot_samples$Run]

Also filter out precursors completely missing in the kept samples

kp <- rowMeans(is.na(pilot$E)) < 1
pilot <- pilot[kp, ]

dim(pilot)

[1] 51486    18

4.3 Basic exploratory analysis for the pilot

4.3.1 Missing values

Overall proportion of missing values:

sum(is.na(pilot$E)) / length(pilot$E)

[1] 0.4157624

Missing values in each sample

# make cond.cols.pilot for consistency
cond.cols.pilot <- brewer.pal(6, "Spectral")[4:6]

mutate(pilot_samples, NAproportion = colMeans( is.na(pilot$E) )) |> 
  ggplot(aes(x = Run, y = NAproportion, fill = Condition)) + 
  geom_col() + 
  facet_wrap(~ User, ncol = 1, scales = "free_x") +
  labs(y = "Missing value proportion") + 
  scale_fill_manual(values = cond.cols.pilot) +
  theme(axis.text.x = element_text(angle = 90))

4.3.2 Detection probability curves

pilot_dpcfit <- dpc(pilot)
plotDPC(pilot_dpcfit)

pilot_dpcfit$dpc

     beta0      beta1 
-4.5086055  0.4737302

Group-wise DPC (advanced option, future extension)

par(mfrow = c(2, 2))

for (gg in unique(pilot_samples$Condition)) {
  
  subset <- pilot$E[, pilot_samples$Condition == gg]
  dpc_subset <- dpc(subset)
  plotDPC(dpc_subset, ylim = c(0, 1.1), main = gg)
  title(sub = paste0(c("b0 = ", "b1 = "), round(dpc_subset$dpc, 2), collapse = ", "))
  
}

22027 peptides are completely missing in all samples.

8157 peptides are completely missing in all samples.

3260 peptides are completely missing in all samples.

We will use a DPC slope of 0.7 in DPC-Quantification later.

MDS plot

plotMDS(pilot, 
        bg = cond.cols.pilot[as.numeric(pilot_samples$Condition)], 
        pch = 21, cex = 2.5, 
        main = "MDS plot, precursor level")
legend("topleft", legend = levels(pilot_samples$Condition), 
       col = cond.cols.pilot, pch = 16, pt.cex = 2, horiz = TRUE)

Let us see how the pure samples look when the two experiments are combined. Concatenating two datasets:

length(pilot$genes$Precursor.Id)

[1] 51486

length(pure$genes$Precursor.Id)

[1] 135528

shared_precursor <- intersect(pilot$genes$Precursor.Id, pure$genes$Precursor.Id)
length(shared_precursor)

[1] 41856

For simplicity, will create a combined dataset by keeping only those shared precursors.

cmbn <- cbind(pure[shared_precursor, ], pilot[shared_precursor, ])

cmbn_samples <- add_row(pure_samples, pilot_samples) |> 
  mutate(CondBatch = paste(Condition, Batch, sep = "_")) |> 
  mutate(CondBatch = factor(CondBatch, levels = sort(unique(CondBatch))))

cmbn <- cmbn[, cmbn_samples$Run]
dim(cmbn)

[1] 41856    33

length(unique(cmbn$genes$Protein.Group))

[1] 4935

MDS plot

cmbn.cols <- c(cond.cols, cond.cols.pilot)

plotMDS(cmbn, 
        bg = cmbn.cols[as.numeric(cmbn_samples$CondBatch)], 
        pch = 21, cex = 2, 
        main = "MDS plot, precursor level")
legend("bottom", legend = levels(cmbn_samples$CondBatch), 
       col = cmbn.cols, pch = 16, pt.cex = 2, cex = 0.8)

Two batches are clearly separated by the horizontal dimension.

RLA(t(cmbn$E), 
    repCol = cmbn.cols[as.numeric(cmbn_samples$CondBatch)], 
    repLabel = unique(cmbn_samples$CondBatch), 
    guides = seq(-5, 10, 5), 
    title = "RLA plot; shared precursors"
    )

Two batches are clearly separated again.

4.4 Protein quantification by DPC-Quant

I will perform the protein quantification separately for each batch. Since I will be analysing the two batches together on the protein level, I will quantify only proteins that are shared between the two batches.

pureShared <- pure[shared_precursor, ]
pilotShared <- pilot[shared_precursor, ]

Use the DPCs fitted on all precursors.

pureProt <- dpcQuant(pureShared, dpc.slope = 0.8)

DPC intercept estimated as -8.95

Estimating hyperparameters ...

Quantifying proteins ...

Proteins: 1000 Peptides: 9045

Proteins: 2000 Peptides: 19840

Proteins: 3000 Peptides: 29235

Proteins: 4000 Peptides: 35395

Proteins: 4935 Peptides: 41856

pilotProt <- dpcQuant(pilotShared, dpc.slope = 0.7)

DPC intercept estimated as -6.406

Estimating hyperparameters ...

Quantifying proteins ...

Proteins: 1000 Peptides: 9045

Proteins: 2000 Peptides: 19840

Proteins: 3000 Peptides: 29235

Proteins: 4000 Peptides: 35395

Proteins: 4935 Peptides: 41856

4.5 Protein-level data exploration

par(mfrow = c(1, 2))

plotMDSUsingSEs(pureProt, 
                bg = cond.cols[as.numeric(pure_samples$Condition)], 
                pch = 21, cex = 2.5, 
                main = "MDS plot, DPC-Quant for main")
legend("bottom", legend = levels(pure_samples$Condition), 
       col = cond.cols, pch = 16, pt.cex = 2, horiz = TRUE)

plotMDSUsingSEs(pilotProt, 
                bg = cond.cols.pilot[as.numeric(pilot_samples$Condition)], 
                pch = 21, cex = 2.5, 
                main = "MDS plot, DPC-Quant for pilot")
legend("bottom", legend = levels(pilot_samples$Condition), 
       col = cond.cols.pilot, pch = 16, pt.cex = 2, horiz = TRUE)

Concatenate two batches

cmbnProt <- cbind(pureProt, pilotProt)
cmbnProt <- cmbnProt[, cmbn_samples$Run]

MDS plot

plotMDS(cmbnProt, 
        bg = cmbn.cols[as.numeric(cmbn_samples$CondBatch)], 
        pch = 21, cex = 2, 
        main = "MDS plot, DPC-Quant")
legend("bottom", legend = levels(cmbn_samples$CondBatch), 
       col = cmbn.cols, pch = 16, pt.cex = 2, horiz = FALSE)

Two batches are clearly separated by the horizontal dimension.

RLA plot

RLA(t(cmbnProt$E), 
    repCol = cmbn.cols[as.numeric(cmbn_samples$CondBatch)], 
    repLabel = unique(cmbn_samples$CondBatch), 
    title = "DPC-Quant summarisation of the two batches, simply concatenated"
    )

Two batches clearly present themselves.

plotDensities(cmbnProt, 
              col = cmbn.cols[as.numeric(cmbn_samples$CondBatch)], 
              legend = FALSE, 
              main = "Log-intensity densities of shared proteins")
legend("topright", legend = levels(cmbn_samples$CondBatch), 
       col = cmbn.cols, pch = 16, pt.cex = 2)

Samples clearly distinguishable by both batch and tissue types. Also notice that the two batches of data have different dynamic ranges.

4.5.1 Try `removeBatchEffect()`:

cmbnProtNormed <- removeBatchEffect(cmbnProt, 
                                    batch = cmbn_samples$Batch, 
                                    group = cmbn_samples$Condition)

plotMDS(cmbnProtNormed, 
        bg = cmbn.cols[as.numeric(cmbn_samples$CondBatch)], 
        pch = 21, cex = 2, 
        main = "MDS plot, protein level")
legend("bottom", legend = levels(cmbn_samples$CondBatch), 
       col = cmbn.cols, pch = 16, pt.cex = 2)

Batch effects were mitigated, but not entirely removed. Note that this examples is beyond simply “two batches” (they were not initially designed for evaluating normalisation methods after all). The two experiments were not only prepared and processed at two different times by different people, but were acquired on different instruments as well. The Orbitrap Astral run not only enabled much more protein IDs, but also covered a much bigger dynamic range. removeBatchEffect() adjusts for the baseline batch effect for each protein, but does not adjust for the difference in dynamic ranges between the two experiments. To that end, more sophisticated normalisation methods are needed. Internal control proteins (if there are any; e.g., housekeeping proteins) would be helpful in achieving effective normalisation.

However, usually, the removeBatchEffect() function gives good visual diagnostics. When that is the case, we add the batch covariate into the design matrix when performing DE analysis as below.

design_rmBatch <- model.matrix(~ 0 + Condition + Batch, data = cmbn_samples)
head(design_rmBatch)

  ConditionBL ConditionCB ConditionFL BatchSecond
1           0           0           1           1
2           0           0           1           1
3           0           0           1           1
4           0           0           1           1
5           0           0           1           1
6           0           1           0           1

You can then create your contrasts of interest (if needed) and perform the DE analysis.

4.5.2 Try cyclic loess

Alternatively, one can also try the cyclic loess normalisation.

normalizeCyclicLoess(cmbnProt$E) |> 
  plotMDS(bg = cmbn.cols[as.numeric(cmbn_samples$CondBatch)], 
          pch = 21, cex = 2, 
          main = "MDS plot, cyclic loess")
legend("bottom", legend = levels(cmbn_samples$CondBatch), 
       col = cmbn.cols, pch = 16, pt.cex = 2)

Although it did not “perfectly” correct for batch effects for this case study, the cyclic loess normalisation might still be useful and effective on other datasets. In fact, cyclic loess is very effective for normalisation of technical variations due to drifts in the instruments (run order effect).

4.5.3 Try ruv

First identify the factor of interest, which is tissue type

X <- matrix(as.numeric(cmbn_samples$Condition) - 1, ncol = 1)

Use lowly variable proteins as the negative controls

var_protein <- rowVars(cmbnProt$E)
ctl <- var_protein <= quantile(var_protein, probs = 0.1)

Run RUV4

ruvout <- RUV4(t(cmbnProt$E), X, ctl, k = 3)

Visualise RUV4-normalised data

cmbn_design <- model.matrix(~ 0 + Condition, data = cmbn_samples)
cmbnProt_ruv4 <- removeBatchEffect(cmbnProt, covariates = ruvout$W, design = cmbn_design)

plotMDS(cmbnProt_ruv4, 
        bg = cmbn.cols[as.numeric(cmbn_samples$CondBatch)], 
        pch = 21, cex = 2, 
        main = "MDS plot, RUV4 (k = 3)")
legend("top", legend = levels(cmbn_samples$CondBatch), 
       col = cmbn.cols, pch = 16, pt.cex = 2)

Seems good. I will proceed with RUV4-normalised data.

4.6 DE analysis

cmbn_design <- model.matrix(~ 0 + cmbn_samples$Condition + ruvout$W)
colnames(cmbn_design) <- str_remove(colnames(cmbn_design), "cmbn_samples\\$Condition|ruvout\\$")

head(cmbn_design)

  BL CB FL         W1          W2          W3
1  0  0  1 -0.2206172 -0.06748149 -0.44308012
2  0  0  1 -0.2233234 -0.06680479 -0.43701435
3  0  0  1 -0.2208496 -0.07859086 -0.43676195
4  0  0  1 -0.2192203 -0.07926116 -0.42090739
5  0  0  1 -0.2200015 -0.07943666 -0.43509509
6  0  1  0 -0.2102168  0.27764574  0.01756785

cmbn_contr <- makeContrasts(CBvsFL = CB - FL, 
                            CBvsBL = CB - BL, 
                            FLvsBL = FL - BL, 
                            levels = cmbn_design)

cmbn_fit <- dpcDE(cmbnProt, cmbn_design)

cmbn_fit <- contrasts.fit(cmbn_fit, contrasts = cmbn_contr)
cmbn_fit <- eBayes(cmbn_fit, robust = TRUE)

summary(cmbn_fit$df.prior)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.2202  6.7775  6.7775  6.7488  6.7775  6.7775

summary(decideTests(cmbn_fit))

       CBvsFL CBvsBL FLvsBL
Down      657   1104   1725
NotSig   3439   2678   1552
Up        839   1153   1658

Check how many of the DE proteins comparing CB versus FL were identified in before

cmbn_res <- topTable(cmbn_fit, coef = "CBvsFL", number = Inf) |> signifTopTable()

deps <- list(Pure = rownames(pureProt)[decideTests(fit)[, "CBvsFL"] != 0],
             Cmbn = rownames(cmbnProt)[decideTests(cmbn_fit)[, "CBvsFL"] != 0])

ggvenn(deps)

Validation of the biological relevance of these results is yet need to be performed.

5 Key takeaways

Data exploration and visualisation is a non-dispensible first step.
Must have technical replicates if multiple batches are expected.
There is no one-size-fits-all pipeline for all types of MS proteomics data.
Several batch correction strategies are demonstrated here, but analysts will need to consider each dataset under its own context. The impact of normalisation methods needs to evaluated carefully while taking into account the corresponding biological contexts.

References

Li M, Cobbold SA, Smyth GK (2025). Quantification and differential analysis of mass spectrometry proteomics data with probabilistic recovery of information from missing values. bioRxiv. doi: https://doi.org/10.1101/2025.04.28.651125

Li M, Smyth GK (2023). Neither random nor censored: estimating intensity-dependent probabilities for missing values in label-free proteomics. Bioinformatics 39(5), btad200. doi:10.1093/bioinformatics/btad200

Li M (2024). Linear Models and Empirical Bayes Methods for Mass Spectrometry-based Proteomics Data. PhD Thesis, University of Melbourne. http://hdl.handle.net/11343/351600

Gandolfo, LC. and Speed, TP (2018). RLE plots: Visualizing unwanted variation in high dimensional data. PloS one, 13(2), p.e0191629. doi:10.1371/journal.pone.0191629

Gagnon-Bartsch, JA and Speed, TP (2012). Using control genes to correct for unwanted variation in microarray data. “Biostatistics”, 13(3), pp.539-552. doi:10.1093/biostatistics/kxr034

6 Session information

sessionInfo()

R version 4.5.0 (2025-04-11)
Platform: aarch64-apple-darwin20
Running under: macOS Sequoia 15.3.1

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.1

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/New_York
tzcode source: internal

attached base packages:
[1] grid      stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] ggthemeML_1.0.1    colorspace_2.1-1   ggrepel_0.9.6      ggvenn_0.1.10     
 [5] DT_0.33            RColorBrewer_1.1-3 matrixStats_1.5.0  ruvms_0.99.0      
 [9] ruv_0.9.7.1        limpa_1.1.1        limma_3.64.0       lubridate_1.9.4   
[13] forcats_1.0.0      stringr_1.5.1      dplyr_1.1.4        purrr_1.0.4       
[17] readr_2.1.5        tidyr_1.3.1        tibble_3.2.1       ggplot2_3.5.2     
[21] tidyverse_2.0.0   

loaded via a namespace (and not attached):
 [1] tidyselect_1.2.1        farver_2.1.2            blob_1.2.4             
 [4] Biostrings_2.76.0       fastmap_1.2.0           digest_0.6.37          
 [7] timechange_0.3.0        lifecycle_1.0.4         statmod_1.5.0          
[10] KEGGREST_1.48.0         RSQLite_2.3.11          magrittr_2.0.3         
[13] compiler_4.5.0          rlang_1.1.6             sass_0.4.10            
[16] tools_4.5.0             yaml_2.3.10             data.table_1.17.0      
[19] knitr_1.50              labeling_0.4.3          htmlwidgets_1.6.4      
[22] bit_4.6.0               withr_3.0.2             BiocGenerics_0.54.0    
[25] stats4_4.5.0            GO.db_3.21.0            scales_1.4.0           
[28] cli_3.6.5               rmarkdown_2.29          crayon_1.5.3           
[31] generics_0.1.3          RcppParallel_5.1.10     rstudioapi_0.17.1      
[34] httr_1.4.7              tzdb_0.5.0              DBI_1.2.3              
[37] cachem_1.1.0            parallel_4.5.0          AnnotationDbi_1.70.0   
[40] XVector_0.48.0          BiasedUrn_2.0.12        vctrs_0.6.5            
[43] jsonlite_2.0.0          IRanges_2.42.0          hms_1.1.3              
[46] S4Vectors_0.46.0        bit64_4.6.0-1           crosstalk_1.2.1        
[49] jquerylib_0.1.4         glue_1.8.0              stringi_1.8.7          
[52] gtable_0.3.6            GenomeInfoDb_1.44.0     UCSC.utils_1.4.0       
[55] pillar_1.10.2           htmltools_0.5.8.1       GenomeInfoDbData_1.2.14
[58] R6_2.6.1                zigg_0.0.2              vroom_1.6.5            
[61] evaluate_1.0.3          Biobase_2.68.0          png_0.1-8              
[64] Rfast_2.1.5.1           memoise_2.0.1           bslib_0.9.0            
[67] Rcpp_1.0.14             gridExtra_2.3           org.Hs.eg.db_3.21.0    
[70] xfun_0.52               pkgconfig_2.0.3

1 Overview

Acknowlegements

2 Load libraries

3 Part 1: human tissue proteomes measured by the Orbitrap Astral

3.1 Read data

3.2 Basic exploratory analysis

3.3 Detection probability curve (DPC)

3.4 Protein quantification by DPC-Quant

3.5 DE analysis

3.5.1 Take CB vs FL as an example and visualise the results

4 Part 2: Two datasets analysed together

4.1 Load data from a previous experiement (the pilot)

4.2 Add sample information

4.3 Basic exploratory analysis for the pilot

4.3.1 Missing values

4.3.2 Detection probability curves

4.4 Protein quantification by DPC-Quant

4.5 Protein-level data exploration

4.5.1 Try removeBatchEffect():

4.5.2 Try cyclic loess

4.5.3 Try ruv

4.6 DE analysis

5 Key takeaways

References

6 Session information

4.5.1 Try `removeBatchEffect()`: