Supplementary results and figures
Mengbo Li
Bioinformatics Division, WEHIsupp.Rmd
Overview
On this page, the analysis workflow presented in the main text is applied on each of the four example datasets on the protein group level. Additionally, two more datasets were downloaded. The same analyses are also performed on both precursor- and protein-levels on these additional datasets.
Load packages
library(tidyverse)
library(protDP)
dfList <- seq(1, 5, 2)
lineColours <- RColorBrewer::brewer.pal(3, "Dark2")
Dataset A: Hybrid proteome data
LFQ intensities summarised by MaxLFQ (Cox et al. 2014) were extracted from the DIA-NN (Demichev et al. 2020) output on the protein group level. The log2-transformation is applied to LFQ intensities.
Empirical logistic splines for detected proportions
hyeProt <- gatherResults(dat)
for (i in 1:length(dfList)) {
if (i == 1)
plotEmpSplines(hyeProt$nuis, X = hyeProt$splineFits_params0[[i]]$X,
hyeProt$splineFits[[i]]$params, lineCol = lineColours[i],
point.cex = 0.15, ylim = c(0, 1.04))
if (i > 1)
plotEmpSplines(hyeProt$nuis, X = hyeProt$splineFits_params0[[i]]$X,
hyeProt$splineFits[[i]]$params, lineCol = lineColours[i],
newPlot = FALSE)
}
legend("bottomright", legend = paste("df = ", dfList), col = lineColours,
lwd = 2, lty = 1, cex = 0.8)
Reduced deviance compared to an intercept model
ggplot(slice(hyeProt$devs, 2:4), aes(x = df, y = percDevReduced)) +
geom_point() + geom_line() + geom_text(aes(label = signif(percDevReduced,
2)), vjust = -0.8) + scale_x_continuous(breaks = c(1, 3,
5)) + labs(x = "d.f.", y = "Reduced deviance(%)") + ylim(0.9,
1) + theme_classic()
Empirical logit-linear curve with capped probabilities
plotEmpSplines(hyeProt$nuis, X = hyeProt$splineFits_params0[[1]]$X,
hyeProt$cappedLinearFit$params, capped = TRUE, lineCol = lineColours[1],
ylim = c(0, 1.04), point.cex = 0.15)
The estimated parameters are
We see that the estimated \(\alpha\) value is 1.
Dataset B: Cell cycle proteomes
LFQ intensities summarised by MaxLFQ (Cox et al. 2014) were extracted from the DIA-NN (Demichev et al. 2020) report for protein groups. The log2-transformation is first applied to LFQ intensities.
Empirical logistic splines for detected proportions
scProt <- gatherResults(dat, b1.upper = Inf)
for (i in 1:length(dfList)) {
if (i == 1)
plotEmpSplines(scProt$nuis, X = scProt$splineFits_params0[[i]]$X,
scProt$splineFits[[i]]$params, lineCol = lineColours[i],
add.jitter = FALSE, point.cex = 0.15)
if (i > 1)
plotEmpSplines(scProt$nuis, X = scProt$splineFits_params0[[i]]$X,
scProt$splineFits[[i]]$params, lineCol = lineColours[i],
newPlot = FALSE)
}
legend("bottomright", legend = paste("df = ", dfList), col = lineColours,
lwd = 2, lty = 1, cex = 0.8)
Reduced deviance compared to an intercept model
ggplot(slice(scProt$devs, 2:4), aes(x = df, y = percDevReduced)) +
geom_point() + geom_line() + geom_text(aes(label = signif(percDevReduced,
2)), vjust = -0.8) + scale_x_continuous(breaks = c(1, 3,
5)) + labs(x = "d.f.", y = "Reduced deviance(%)") + ylim(0.9,
1) + theme_classic()
Empirical logit-linear curve with capped probabilities
plotEmpSplines(scProt$nuis, X = scProt$splineFits_params0[[1]]$X,
scProt$cappedLinearFit$params, capped = TRUE, lineCol = lineColours[1],
point.cex = 0.15)
Estimated parameters are
Detection probability curve assuming normal observed intensities
plotDPC(scProt$dpcFit, add.jitter = FALSE, point.cex = 0.1)
Parameters of the detection probability curve are as follows:
Dataset C: HepG2 technical replicate data
For the protein group level analysis, we use the
proteinGroups.txt
file from the MaxQuant output. The LFQ
intensities are first log2-transformed.
Empirical logistic splines for detected proportions
hepg2Prot <- gatherResults(dat)
for (i in 1:length(dfList)) {
if (i == 1)
plotEmpSplines(hepg2Prot$nuis, X = hepg2Prot$splineFits_params0[[i]]$X,
hepg2Prot$splineFits[[i]]$params, lineCol = lineColours[i],
jitter.amount = 1/ncol(dat)/2)
if (i > 1)
plotEmpSplines(hepg2Prot$nuis, X = hepg2Prot$splineFits_params0[[i]]$X,
hepg2Prot$splineFits[[i]]$params, lineCol = lineColours[i],
newPlot = FALSE)
}
legend("bottomright", legend = paste("df = ", dfList), col = lineColours,
lwd = 2, lty = 1, cex = 0.8)
Reduced deviance compared to an intercept model
ggplot(slice(hepg2Prot$devs, 2:4), aes(x = df, y = percDevReduced)) +
geom_point() + geom_line() + geom_text(aes(label = signif(percDevReduced,
2)), vjust = -0.8) + scale_x_continuous(breaks = c(1, 3,
5)) + labs(x = "d.f.", y = "Reduced deviance(%)") + ylim(0.9,
1) + theme_classic()
Empirical logit-linear curve with capped probabilities
plotEmpSplines(hepg2Prot$nuis, X = hepg2Prot$splineFits_params0[[1]]$X,
hepg2Prot$cappedLinearFit$params, capped = TRUE, lineCol = lineColours[1],
jitter.amount = 1/ncol(dat)/2)
The estimated parameters are
Dataset D: Human blood plasma proteome
For the protein group level analysis, we use the
proteinGroups.txt
file from the MaxQuant output downloaded
from the ProteomeXchange Consortium via the PRIDE partner repository
with the dataset identifier PXD014777.
Empirical logistic splines for detected proportions
ddaPlasmaProt <- gatherResults(dat)
for (i in 1:length(dfList)) {
if (i == 1)
plotEmpSplines(ddaPlasmaProt$nuis, X = ddaPlasmaProt$splineFits_params0[[i]]$X,
ddaPlasmaProt$splineFits[[i]]$params, lineCol = lineColours[i],
add.jitter = FALSE)
if (i > 1)
plotEmpSplines(ddaPlasmaProt$nuis, X = ddaPlasmaProt$splineFits_params0[[i]]$X,
ddaPlasmaProt$splineFits[[i]]$params, lineCol = lineColours[i],
newPlot = FALSE)
}
legend("bottomright", legend = paste("df = ", dfList), col = lineColours,
lwd = 2, lty = 1, cex = 0.8)
Reduced deviance compared to an intercept model
ggplot(slice(ddaPlasmaProt$devs, 2:4), aes(x = df, y = percDevReduced)) +
geom_point() + geom_line() + geom_text(aes(label = signif(percDevReduced,
2)), vjust = -0.8) + scale_x_continuous(breaks = c(1, 3,
5)) + labs(x = "d.f.", y = "Reduced deviance(%)") + ylim(0.9,
1) + theme_classic()
Empirical logit-linear curve with capped probabilities
plotEmpSplines(ddaPlasmaProt$nuis, X = ddaPlasmaProt$splineFits_params0[[1]]$X,
ddaPlasmaProt$cappedLinearFit$params, capped = TRUE, lineCol = lineColours[1],
add.jitter = FALSE)
Estimated parameters are
Detection probability curve assuming normal observed intensities
plotDPC(ddaPlasmaProt$dpcFit, add.jitter = FALSE)
Parameters for the fitted detection probability curve are
Supplementary dataset: Sydney heart bank data
Cryopreserved left ventricular myocardium samples from the human hearts were analysed. MS data were acquired in DIA mode and analysed by Spectronaunt v12 with a DDA spectral library generated from the pooled sample (Li et al. 2020). Details on sample preparation, LC-MS/MS workflow and data processing steps including the generation of the spectial library can be found in Li et al. (2020). Here we consider the healthy donor heart samples. Both precursor- and protein group-level data are log2-transformed before analysis.
Data summary
data("shbheart")
shbheart_prec <- shbheart$prec
dim(shbheart_prec)
[1] 42742 24
shbheart_prot <- shbheart$prot
dim(shbheart_prot)
[1] 3208 24
The overall proportion of missing data on the precursor-level is
While the overall proportion of missingness on the protein group-level is
Empirical logistic splines for detected proportions
The analysis workflow presented in the manuscript is applied on the dataset on both precursor- and protein group-level data:
res <- list(prec = gatherResults(shbheart_prec), prot = gatherResults(shbheart_prot,
b0.upper = Inf))
par(mfrow = c(2, 1))
for (res_i in 1:2) {
eachRes <- res[[res_i]]
for (i in 1:length(dfList)) {
if (i == 1)
plotEmpSplines(eachRes$nuis, X = eachRes$splineFits_params0[[i]]$X,
eachRes$splineFits[[i]]$params, lineCol = lineColours[i],
jitter.amount = 1/ncol(shbheart_prec)/2, point.cex = 0.15)
if (i > 1)
plotEmpSplines(eachRes$nuis, X = eachRes$splineFits_params0[[i]]$X,
eachRes$splineFits[[i]]$params, lineCol = lineColours[i],
newPlot = FALSE)
}
title(sub = c("Precursor-level", "Protein-level")[res_i])
legend("bottomright", legend = paste("df = ", dfList), col = lineColours,
lwd = 2, lty = 1, cex = 0.8)
}
Reduced deviance compared to an intercept model
devPlot1 <- ggplot(slice(res[[1]]$devs, 2:4), aes(x = df, y = percDevReduced)) +
geom_point() + geom_line() + geom_text(aes(label = signif(percDevReduced,
2)), vjust = -0.8) + scale_x_continuous(breaks = c(1, 3,
5)) + labs(x = "d.f.", y = "Reduced deviance(%)") + ylim(0.9,
1) + theme_classic() + ggtitle("Precursor-level")
devPlot2 <- ggplot(slice(res[[2]]$devs, 2:4), aes(x = df, y = percDevReduced)) +
geom_point() + geom_line() + geom_text(aes(label = signif(percDevReduced,
2)), vjust = -0.8) + scale_x_continuous(breaks = c(1, 3,
5)) + labs(x = "d.f.", y = "Reduced deviance(%)") + ylim(0.9,
1) + theme_classic() + ggtitle("Protein-level")
gridExtra::grid.arrange(devPlot1, devPlot2, ncol = 2)
Empirical logit-linear curve with capped probabilities
par(mfrow = c(2, 1))
for (res_i in 1:2) {
eachRes <- res[[res_i]]
plotEmpSplines(eachRes$nuis, X = eachRes$splineFits_params0[[1]]$X,
eachRes$cappedLinearFit$params, capped = TRUE, lineCol = lineColours[1],
jitter.amount = 1/ncol(shbheart_prec)/2, point.cex = 0.15)
title(sub = c("Precursor-level", "Protein-level")[res_i])
}
With estimated parameters on precursor-level data being:
and estimated parameters on protein-level data being:
Estimated \(\alpha\) values are equal to 1 on both precursor- and protein group-level data.
Detection probability curve assuming normal observed intensities
par(mfrow = c(2, 1))
for (res_i in 1:2) {
eachRes <- res[[res_i]]
plotDPC(eachRes$dpcFit, jitter.amount = 1/ncol(shbheart_prec)/2,
point.cex = 0.15)
title(sub = c("Precursor-level", "Protein-level")[res_i])
}
With estimated parameters on precursor-level data being:
and estimated parameters on protein-level data being:
Supplementary dataset: UPS1 spiked-in yeast extract
Three concentrations of UPS1 (25 fmol, 10 fmol and 5 fmol) were
spiked in yeast extract (Giai Gianetto et al.
2016). This is a DDA dataset and MS raw data were processed by
MaxQuant. Here we look at the dataset which compares 25 fmol to 10 fmol
spiked-ins. Processed data were downloaded from ProteomeXchange
Consortium via the PRIDE partner repository with the dataset identifier
PXD002370. For peptide-level data, we use the peptides.txt
file and for the protein-level analysis, we use the
proteinGroups.txt
file from the MaxQuant output.
Log2-transformation is first applied before analysis.
Data summary
data("ratio2.5")
usp1_prec <- log2(ratio2.5$prec)
dim(usp1_prec)
[1] 13186 6
usp1_prot <- log2(ratio2.5$prot)
dim(usp1_prot)
[1] 2342 6
The overall proportion of missing data for the precursor-level data is
The overall proportion of missing data for the protein-level data is
Empirical logistic splines for detected proportions
res <- list(prec = gatherResults(usp1_prec, b0.upper = Inf),
prot = gatherResults(usp1_prot, b0.upper = Inf))
par(mfrow = c(2, 1))
for (res_i in 1:2) {
eachRes <- res[[res_i]]
for (i in 1:length(dfList)) {
if (i == 1)
plotEmpSplines(eachRes$nuis, X = eachRes$splineFits_params0[[i]]$X,
eachRes$splineFits[[i]]$params, lineCol = lineColours[i],
jitter.amount = 1/ncol(usp1_prec)/2, point.cex = 0.15,
ylim = c(0, 1.08))
if (i > 1)
plotEmpSplines(eachRes$nuis, X = eachRes$splineFits_params0[[i]]$X,
eachRes$splineFits[[i]]$params, lineCol = lineColours[i],
newPlot = FALSE)
}
title(sub = c("Precursor-level", "Protein-level")[res_i])
legend("bottomright", legend = paste("df = ", dfList), col = lineColours,
lwd = 2, lty = 1, cex = 0.8)
}
Reduced deviance compared to an intercept model
devPlot1 <- ggplot(slice(res[[1]]$devs, 2:4), aes(x = df, y = percDevReduced)) +
geom_point() + geom_line() + geom_text(aes(label = signif(percDevReduced,
2)), vjust = -0.8) + scale_x_continuous(breaks = c(1, 3,
5)) + labs(x = "d.f.", y = "Reduced deviance(%)") + ylim(0.9,
1) + theme_classic() + ggtitle("Precursor-level")
devPlot2 <- ggplot(slice(res[[2]]$devs, 2:4), aes(x = df, y = percDevReduced)) +
geom_point() + geom_line() + geom_text(aes(label = signif(percDevReduced,
2)), vjust = -0.8) + scale_x_continuous(breaks = c(1, 3,
5)) + labs(x = "d.f.", y = "Reduced deviance(%)") + ylim(0.9,
1) + theme_classic() + ggtitle("Protein-level")
gridExtra::grid.arrange(devPlot1, devPlot2, ncol = 2)
The spline with 1 degree of freedom contributes the majority to the total amount of reduced deviance in both precursor- and protein group-level data.
Empirical logit-linear curve with capped probabilities
par(mfrow = c(2, 1))
for (res_i in 1:2) {
eachRes <- res[[res_i]]
plotEmpSplines(eachRes$nuis, X = eachRes$splineFits_params0[[1]]$X,
eachRes$cappedLinearFit$params, capped = TRUE, lineCol = lineColours[1],
jitter.amount = 1/ncol(usp1_prec)/2, point.cex = 0.15,
ylim = c(0, 1.08))
title(sub = c("Precursor-level", "Protein-level")[res_i])
}
With estimated parameters on precursor-level data being:
and estimated parameters on protein-level data being:
Estimated \(\alpha\) values are 1 for both precursor- and protein group-level data.
Detection probability curve assuming normal observed intensities
par(mfrow = c(2, 1))
for (res_i in 1:2) {
eachRes <- res[[res_i]]
plotDPC(eachRes$dpcFit, jitter.amount = 1/ncol(usp1_prec)/2,
point.cex = 0.15, ylim = c(0, 1.08))
title(sub = c("Precursor-level", "Protein-level")[res_i])
}
With estimated parameters for the fitted detection probability curve for precursor-level data being:
and estimated parameters for the fitted detection probability curve for the protein group-level data being:
References
Session information
sessionInfo()
R version 4.3.1 (2023-06-16)
Platform: x86_64-apple-darwin20 (64-bit)
Running under: macOS Ventura 13.5.2
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
time zone: Australia/Melbourne
tzcode source: internal
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] protDP_1.1.0 lubridate_1.9.3 forcats_1.0.0 stringr_1.5.0 dplyr_1.1.3 purrr_1.0.2
[7] readr_2.1.4 tidyr_1.3.0 tibble_3.2.1 ggplot2_3.4.3 tidyverse_2.0.0
loaded via a namespace (and not attached):
[1] sass_0.4.7 utf8_1.2.3 generics_0.1.3 stringi_1.7.12 hms_1.1.3
[6] digest_0.6.33 magrittr_2.0.3 RColorBrewer_1.1-3 timechange_0.2.0 evaluate_0.22
[11] grid_4.3.1 fastmap_1.1.1 rprojroot_2.0.3 jsonlite_1.8.7 limma_3.57.9
[16] gridExtra_2.3 formatR_1.14 fansi_1.0.5 scales_1.2.1 textshaping_0.3.7
[21] jquerylib_0.1.4 cli_3.6.1 rlang_1.1.1 splines_4.3.1 munsell_0.5.0
[26] withr_2.5.1 cachem_1.0.8 yaml_2.3.7 tools_4.3.1 tzdb_0.4.0
[31] memoise_2.0.1 colorspace_2.1-0 vctrs_0.6.3 R6_2.5.1 lifecycle_1.0.3
[36] fs_1.6.3 ragg_1.2.6 pkgconfig_2.0.3 desc_1.4.2 pkgdown_2.0.7
[41] pillar_1.9.0 bslib_0.5.1 gtable_0.3.4 glue_1.6.2 statmod_1.5.0
[46] systemfonts_1.0.5 xfun_0.40 tidyselect_1.2.0 rstudioapi_0.15.0 knitr_1.44
[51] farver_2.1.1 htmltools_0.5.6.1 labeling_0.4.3 rmarkdown_2.25 compiler_4.3.1