SmartPhos Explorer is an R Shiny app for streamlined analysis of proteomics and phosphoproteomics, featuring data preprocessing, exploratory data analysis, hypothesis testing, time-series analysis and functional interpretation.
The front page of the Shiny app looks as below:
The current tabs are:
In this vignette, we will walk you through the process of preparing your data and performing downstream analyses using different tabs in the Shiny app.
This tab contains the functions for preprocessing before diving into data analysis and interpretation. In this tab, users can start with uploading a MultiAssayExperiment object, generated by the SmartPhos R package or a zip file containing quantification tables from Spectronaut or MaxQuant. Output from DIA-NN is planned for the future update. After uploading the data, options for choosing the transformation and normalization methods, performing batch effect correction, and displaying the missing values are available. There is also an option for saving the uploaded data as R object (.rds file) for reusing in the future to save processing time.
Uploading is one of the most important steps in the smooth functioning of the app.
As shown in the figure, users have the option to decide between uploading a zip file and a MultiAssayExperiment object. The zip file contains the output files generated by Spectronaut or MaxQuant software. For users who are not experienced with R, this is the preferred option.
When using the zip file option, there are certain requirements the users need to follow for the error-free conversion of the data from these files into a multiAssayExperiment R object. One of the most important requirements for the zip file is to contain a properly formatted fileTable.txt file. The details of this file are explained in the next subsection.
This file should be a tab-separated tsv file that contains the information including name of the quantification table, sample ID and experimental details, such as whether a certain sample was enriched for phospho-peptides, whether the quantification was on protein groups or phosphorylation sites and so on. Optional annotations, such experimental conditions, time points and etc. can also be included in this table and some of them are required for specific downstream analysis tasks. An error message will be displayed if this file is missing or the format of file is incorrect. The mandatory columns in this table are as follows:
fileName: The names of quantification tables of protein groups or phosphorylation sites. They should be the standard output files from Spectronaut or MaxQuant. Make sure that it contains only the file names with the file extension (For example: “20240826_Protein_Report.xls”, “20240826_PTMReport.xls”) and not the relative or absolute path.
Important notes on the format of the quantification table from Spectronaut: The quantification table from Spectronaut should be in the “wide” format, with different samples in different columns. The table should be in plain text format, not Excel tables. The file extension does not matter. Spectronaut sometimes output .xls files but they are actually plain text files. The acceptable deliminators for the tables are: tab ( (preferred), comma (,), colon (:) and semicolon (;). For full proteome measurement, “PG.ProteinGroups”, “PG.Genes” and “[sample name]PG.Quantity” columns must be present. For phosphoproteome measurement, “[sample name]PTM.SiteProbability”, “[sample name]PTM.Quantity”, “PTM.CollapseKey”, “PG.UniProtIds”,“PG.Genes”, “PTM.SiteLocation”,“PTM.SiteAA” and “PTM.FlankingRegion” columns must be present.
id: Contains unique identifiers for sample (sample IDs). The sample IDs should correspond to the prefix of actual column names in the quantification table. For example: the ID value FullProteome_1stCrtl_0min_rep2 corresponds to one of the column names oecf4_220602_BH_ET_FullProteome_1stCrtl_0min_rep2.raw.PG.Quantity in the quantification table. SmartPhos will automatically recognize the columns in the quantification table that match a certain sample ID, as long as this sample ID is present in the columns that contain “PG.Quantity”.
type: Contains information on the type of search performed on the samples by Spectronaut or MaxQuant. There are two possibilities: proteome (quantification of protein abundance) and phosphoproteome (quantification of the abundance of phosphorylation sites). The two types are represented as two assays in the multiAssayExperiment object. Note that the app will still work if only one of the data type is present but some functions may not be available. For example, normalisation of phosphorylation sites by protein abundance will be not available of only phosphoproteome measurement is available.
sampleType: Contains information of whether enrichment for phosphorylated peptides is performed for a certain sample. The sampleType can be Phospho (or PP) for the phospho-enriched samples and FullProteome (or FP) for the non-enriched sample. Not that the app will still work if only one of the sample type is present but some functions may not be available. For example, the novel normalisation correction for phosphoproteomic data will not be available if full proteome without enrichment is absent.
The above mentioned columns are mandatory ones. Other than that, fileTable.txt can contain optional columns such as:
sampleName column is important for performing normalization correction or for normalizing Phosphoproteome samples by Full Proteome samples. This should be the sample id without FP_/PP_ or Phospho_/FullProteome_ suffix or prefix. For example, for the sample ID, FullProteome_1stCrtl_0min_rep1, the sampleName should be, “1stCrtl_0min_rep1”. Without this column, it would not be possible to perform normalization correction and normalizing phosphoproteome samples by full proteome samples.
subjectID with information like patientID, cell lines, etc. This column will be used if there are multiple samples from the same subject and users wish to include the subject ID as covariates in hypothesis testing (e.g. paired t-test) or time-series analysis.
timepoint with values like 0min, 20min, 6h, 24h, etc. This column is important for the time-series clustering tab and other tabs which make use of the results from the time-series clustering tab.
replicate with values rep1, rep2, replicate1 and so on.
treatment containing information about different treatments or experimental conditions.
Users can include any metadata for visualization or hypothesis testing in the fileTable.txt.
Here is an example of the fileTable.txt file:
NOTE: It’s better to avoid any special characters (‘(’, ‘)’, ‘#’, ‘%’, ‘@’, etc) in the column names in the quantification table output by Spectronaut/MaxQuant and same for the ID column of the fileTable.txt file. In addition, sample IDs should not start with numbers. Those can cause error in parsing the data.
Users have option to select the column annotations, which are provided by users in the fileTable.txt, when using the zip file method. If the upload of zip file is successful, another widget will appear which will allow the user to select from the different available annotation options. Multiple selections are possible simply by clicking on the list item.
For users with experience in R, they can parse data by using our SmartPhos package, which will create an multiAssayExperiment object. Users can save this object as an rds file and choose the “MultiAssayExperiment object” option for uploading the data.
Users can also first upload the zip file to SmartPhos Explorer and download the processed data as an rds file (see section 2.1.5) . This file can be uploaded using the “MultiAssayExperiment option** option in the future analysis, which will save some time as processing the quantification table can be slow if the file size is large.
If the processing of zip files or multiAssayExperiment object is successful, then the upload tab will show the total number of samples and features.
The user will have option to choose the assay and sample type as shown in the figure above. In case the upload is not successful, the app will display an error message. These error messages are:
Users have the options to download the parsed data as a .rds file for future use. There are also options of saving the parsed data in the local version of Shiny app as well as to load and delete previously saved data objects.
SmartPhos package performs some pre-filtering and preprocessing based on various threshold values during the generation of the multiAssayExperiment object. On top of that, this panel has different options for preprocessing of the selected assay. The options provided by the shiny app are as follows:
After selecting the options, press the Process button to perform the preprocessing.
The preprocessing tab has some more options:
The different outputs available to the users are:
This tab performs principal component analysis (PCA) on the imputed assay from the Preprocessing tab and then plot the principal components.
Simply click on Run PCA button to perform PCA. The different options available to the users after plotting are:
NOTE: If imputation is selected none in the preprocessing tab, then performing PCA is not possible. An error message will pop-up on screen.
This tabs allow the user to plot the heatmap of the imputed assay from the preprocessing tab. Users can select from the three choices available:
For all the above-mentioned options, users can add additional column annotations to the plotted heatmap. Adding more column annotations to the heatmap can be helpful to understand the heatmap better and see the patterns in the data more clearly. User can also download the heatmap as a PDF.
The tab also has error checks. If the user tries to plot the heatmap of differentially expressed genes or the genes from the selected time series cluster prior to performing the aforementioned analysis, an error message will be displayed.
This tab performs differential expression analysis on the transformed and normalized assay from the first tab. The goal of performing differential expression analysis is to quantify the expression levels of genes between different experimental conditions using statistical tests.
If the users want paired t-test on the patient IDs, cell lines, etc, then the users must have subjectID as one of the column in the fileTable.txt file with the relevant information. The subjectID column should also be selected in the additional column annotations before the generation of multiAssayExperiment object.
The two methods available for performing differential expression analysis are:
To use this tab, first the user has to select the metadata column for which they want to perform the analysis. Users need to select samples for the reference and target group, between which the differential expression analysis will be performed. This can be done by either selecting treatments or selecting sample IDs. In the former, time points should also be specified if the column timepoint is present in the fileTable.txt file. Multiple treatments and time points can be selected for each group. As the selection is updated for each group, the user interface informs users of the number of samples in each group as well as a warning if a sample is present in both group. The latter option allows users to directly choose which sample to include in each group, thereby allowing more flexibility in sample selection. The two figures below show examples of the user interface when using either option. Notice that a warning was given in the latter since one sample was present in both groups.
The output of differential expression analysis is a table which contains the differentially expressed genes arranged by the lowest p-values. The other information present in the table are: Uniprot ID, log2 fold change, t-statistic, adjusted p-value and some information if the data is phosphoproteomic data. A volcano plot is also plotted, highlighing the points with positive and negative log fold-change and above certain p value. The histogram of p-value is also plotted.
Users have options to filter the differentially expressed genes table based on p-value and log fold change value. There is also an option of using adjusted p-value instead.
If the user click on the row of the differential expression table, the corresponding point in the volcano plot is highlighted as star. Moreover, a box plot is plot to show the change in normalized intensities for the selected conditions. Also, if the user click on any coloured point in the volcano plot, the corresponding row in the differential expression table is highlighted. The user also has the option to download the differential expression table as a TSV file.
In case the differential expression analysis is not possible for the selected settings, an error message will appear.
This tab performs fuzzy c-means clustering to group proteins/phosphopeptides based on how their level changed over time. The algorithm considers the time-resolved trend, but not the expression levels. Thus, members of the same cluster would have a similar trend over time (e.g., all increasing or all decreasing), though their expression level can be different.
To use this tab, users must specify the timepoint column in the fileTable.txt file when preparing the data. The time points are either unit-less numbers (e.g., 1, 2, 3) or are in hour and/or minute, which must be typed as “h” and “min”, respectively (e.g., 1min, 2min, 3h). Please notice that mixing the two said options (e.g., 1, 2min, 3) or using other unit for time (e.g., 1hour, 2minute, 3day) will likely lead to wrong results.
The options for using this tab are as follow:
Users first need to choose the metadata column of interest for the analysis. Based on the selected metadata, user will be provided with the different conditions to choose from. Additionally, if the clustering is performed on either “logFC” or “two-condition expression”, a reference condition should also be chosen. The options to perform clustering on determine what it means for genes to be in the same cluster:
Users can choose to include or exclude certain time points by checking the boxes. The time points are updated when selecting the treatments.
The plot shows the time-course trend of genes in the clusters. If either expression or logFC was chosen, the color would indicate the genes’ membership probability:
If two-condition expression was chosen, the color is used to distinguish the treatments:
Accompanying the plot is a table show the details of the genes in each cluster. User can select the cluster to view with the drop-down menu on the upper left side of the table. The table can be downloaded as a tsv file. The selected cluster will be used as input for enrichment analysis and kinase activity inference.
Clicking on a gene will show a plot of its time-course expression level and a line showing the average expression level:
In case zero timepoint is not available for a selected treatment, user has the option to add zero timepoint to the treatment from one of the controls (or conditions with zero timepoint). This will be reflected in the plots showing the clustering results and also when a particular gene is selected for seeing the normalized expression level.
If no cluster was found, for instance due to cut-off values being too high, users will be informed with a pop-up window:
This tab performs enrichment analysis on genes or phosphosites that are differentially expressed or in a time series cluster. To use this tab, users would first need to perform either a differential expression analysis or time-series clustering. Users have three options to choose the source of gene list:
We offer the possibilities to perform enrichment analysis on either gene sets (gene-centric) or post-translational modification signature sets (site-centric). In the latter, each set contains PTM site names with direction of regulation (up- or down-regulated) instead of gene names. Only phosphorylation sites will be considered since this pipeline supports proteomic and phosphoproteomic data. The gene sets and PTM signature sets are derived from the Molecular Signatures Database (Subramanian, Tamayo et al.,2005; Liberzon et al., 2011, Liberzon et al., 2015) and the PTM Signature Database version 2.0.0 (Krug et al., 2019), respectively. Users are encouraged to consult the PTMsigDB website (https://proteomics.broadapps.org/ptmsigdb/) and the paper of Krug et al. for details on how PTM signature sets were curated and their annotation.
In gene-centric pathway enrichment, users can perform either Parametric Analysis of Gene Set Enrichment (PAGE) (Kim & Volsky, 2005) or Gene Set Enrichment Analysis (GSEA) (Subramanian, Tamayo et al.,2005) with Differential Expression analysis result. With Time series clustering result, we offer the Fisher’s exact test. For phospho-signature enrichment on Differential Expression analysis result, we offer PTM-Signature Enrichment Analysis (PTM-SEA), a method adapted from GSEA by Krug et al. (2019) to be applicable with the PTM Signature Database. On time series clustering result, we offer the Fisher’s exact test, in which each signature set is split into two, one containing upregulated and the other downregulated phosphosites.
Options for performing enrichment analysis are shown below:
If either differential expression analysis or time-series clustering has not been performed (depending on the selected Source of gene list), users would be informed of the error:
Otherwise, the result is shown in a table indicating the genesets’ names, number of genes in set, enrichment score (PAGE, GSEA, and PTM-SEA), and p-values. In PTM-SEA, each reported enrichment score is normalized by the mean of the null distribution’s enrichment score to control for differences in signature set size. Example results are shown in the figures below, along with explanation for each column.
(Pathway enrichment for Differential Expression analysis)
(Pathway enrichment for Time series clustering)
(Phospho-signature enrichment for Differential Expression analysis)
(Phospho-signature enrichment for Time series clustering)
Clicking on a set will open a table showing the details of genes/phosphopeptides in the set. We would like to notice that for pathway enrichment, a gene with more than one detected phosphosite would appear in several rows of this table, each corresponding to a phosphosite. Nevertheless, the analysis would only consider each gene once regardless of its number of phosphosite, thereby limiting the bias towards genes with several sites. For phospho-signature enrichment, the table has an additional column “PubMedID” (not shown in the figure below), which contains the PubMed identification or hyperlink of the source of the interaction. Multiple sources for a phosphosite are separated by semicolons.
Clicking on a gene/phosphosite in this table will highlight the sets containing it:
The tables above can be downloaded as tsv files. Clicking on a gene/phosphosite will also show its expression level either as a boxplot (differential expression analysis) or a scatterplot (time-series clustering).
If the user has selected All time-series cluster, then a dot plot is generated. The dot plot displays all the enriched pathways for each cluster. Clicking on the points inside the dot plot will give the list of genes or phosphosites associated with the particular pathway and the cluster. Clicking on the row of the list will plot a expression level trend for the selected genes or phosphosites.
This tab performs kinase activity inference based on phosphopeptides that are differentially expressed or in a cluster. Similar to enrichment analysis, users would first need to perform either a differential expression analysis or time-series clustering and select a cluster of interest. By combining prior knowledge about known kinase-phosphosite interactions and the data, the tab can infer the activity of the kinases responsible for the phosphopeptides being considered. The activity is estimated by an activity score computed with the package decoupleR (Badia-I-Mompel et al., 2022) following the authors’ tutorial on Kinase and Transcription Factor activity estimation. For time-series clustering, users can also estimate how likely the kinases are associated with phosphopeptides in the selected cluster.
A network of kinase-phosphosite interactions is constructed using the package OmnipathR (Türei et al., 2021). Users can choose to construct this network with prior knowledge from either Homo sapiens (taxonomy ID = 9606) or Mus musculus (taxonomy ID = 10090) by choosing the organism in Select reference species.
Users need to first select whether to perform the inference on result from either differential expression analysis or time-series clustering. This selection affects what type of analysis can be done, as is shown in the next two subsections.
When selecting differential expression, the list of phosphopeptides from the differential expression analysis are used for kinase activity inference. The options to perform the inference from differential expression analysis result are shown below:
The output is a table showing the kinases, the activity score, and (adjusted) p-values. Positive scores are highlighted pink while negative ones are highlighted blue. For kinases with the most positive or negative activity score (default 10), their scores are also plotted in a horizontal barplot. Kinases whose p-values under the threshold are colored red.
Interpreting the result: The kinase activity is computed from comparing two conditions, hence it is important to take into account the conditions when interpreting the result. A positive score means that the kinase is more active in the selected condition compared to the reference one, while a negative score means that the kinase is more active in the reference condition.
Selecting a kinase will open a table showing the phosphosites targeted by the kinase. The information shown in the table is from the differential expression analysis: log2FC, t-statistic, peptide sequence, and (adjusted) p-values. We would like to notice that these p-values are from the differential expression analysis and should not be confused with those derived from the kinase activity inference.
When selecting time-series clustering, the list of phosphopeptides in the selected cluster (in the Time series clustering tab) is used to either infer the kinase activity score or how likely the kinases are associated with the cluster. These analyses will be explained in more details in the following 2 subsections.
For this part, users would need to define the number of permutations for calculating the null distribution.
Then the option to perform clustering specified in the Time series clustering tab affects the analysis being done in this part. If the option expression was chosen, the fold-change in expression levels for consecutive time points would be used for the inference. Thus, the result would reflect how the kinase activity change over time. Below is an example of the result when performing the inference with 4 time points:
Clicking on a kinase will open a table showing the phosphosites targeted by the kinase:
If the logFC or two-condition expression option was chosen for time-series clustering, the fold-change in expression levels between the two treatments would be used for the inference. Thus, the result would reflect how the kinase activity differ between two treatments at each time point. The layout of the result is similar to that of the above-mentioned expression option:
If no kinase was found (wrong reference species or no overlap with prior knowledge), users will be informed of the error:
In addition to kinase activity, users can choose to determine whether the kinases are associated with the selected cluster, using either the Fisher’s exact test or the Fast Gene Set Enrichment Analysis algorithm (FGSEA). In FGSEA, the probability of each site to be in the cluster is used to rank the sites. Performing FGSEA is done by the decoupleR package (Badia-I-Mompel et al., 2022) and requires users to set the number of permutations for calculating the null distribution. Both methods use the kinase-phosphosite interaction network as their reference. Unlike the above part on estimating kinase activity, the choice of method for time series clustering in the Time series clustering tab does not affect the method in this part.
The result for the Fisher’s exact test is displayed in a table. Kinases whose (adjusted) p-values higher than the p-value threshold are excluded.
The result for the FGSEA is also displayed in a table showing the enrichment score.
Clicking a kinase will open a table showing the phosphosites targeted by the kinase in the cluster.
This tab displays the information about the user inputs. All the settings selected by the user are displayed in a table and can be downloaded as a TSV file. The table contains the following columns:
Badia-I-Mompel, P., Santiago, J. V., Braunger, J., Geiß, C., Dimitrov, D., Müller‐Dott, S., Tauš, P., Dugourd, A., Holland, C. H., Flores, R. O. R., & Sáez-Rodríguez, J. (2022). decoupleR: ensemble of computational methods to infer biological activities from omics data. Bioinformatics Advances, 2(1). https://doi.org/10.1093/bioadv/vbac016
Kim, SY., Volsky, D.J. PAGE: Parametric Analysis of Gene Set Enrichment. BMC Bioinformatics 6, 144 (2005). https://doi.org/10.1186/1471-2105-6-144
Krug, K., Mertins, P., Zhang, B., Hornbeck, P., Raju, R., Ahmad, R., Szucs, M. J., Mundt, F., Forestier, D., Jané‐Valbuena, J., Keshishian, H., Gillette, M. A., Tamayo, P., Mesirov, J. P., Jaffe, J. D., Carr, S. A., & Mani, D. R. (2019). A curated resource for phosphosite-specific signature analysis. Molecular & Cellular Proteomics, 18(3), 576–593. https://doi.org/10.1074/mcp.tir118.000943
Liberzon, A., Birger, C., Thorvaldsdóttir, H., Ghandi, M., Mesirov, J. P., & Tamayo, P. (2015). The Molecular Signatures Database (MSigDB) hallmark gene set collection. Cell systems, 1(6), 417–425. https://doi.org/10.1016/j.cels.2015.12.004
Liberzon, A., Subramanian, A., Pinchback, R. M., Thorvaldsdóttir, H., Tamayo, P., & Mesirov, J. P. (2011). Molecular signatures database (MSigDB) 3.0. Bioinformatics, 27(12), 1739–1740. https://doi.org/10.1093/bioinformatics/btr260
Subramanian, A., Tamayo, P., Mootha, V. K., Mukherjee, S., Ebert, B. L., Gillette, M. A., Paulovich, A. G., Pomeroy, S. L., Golub, T. R., Lander, E. S., & Mesirov, J. P. (2005). Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences of the United States of America, 102(43), 15545–15550. https://doi.org/10.1073/pnas.0506580102
Türei, D., Valdeolivas, A., Gul, L., Palacio-Escat, N., Klein, M., Ivanova, O., Ölbei, M., Gábor, A., Theis, F. J., Módos, D., Korcsmáros, T., & Sáez-Rodríguez, J. (2021). Integrated intra‐ and intercellular signaling knowledge for multicellular omics analysis. Molecular Systems Biology, 17(3). https://doi.org/10.15252/msb.20209923