Creating CompoundDb annotation resources
Source:vignettes/create-compounddb.Rmd
Authors: Johannes Rainer
Modified: 2023-04-12 09:49:59.722174
Compiled: Wed Apr 12 09:56:08 2023
Introduction
Chemical compound annotations can be retrieved from a variety of
sources, including HMDB, LipidMaps or ChEBI. The CompoundDb package
provides functionality to extract data relevant for (chromatographic)
peak annotation in metabolomics/lipidomics experiments from these
sources and to store it in a common format (i.e. a CompDb
object/database). This vignette describes how such CompDb objects can be
created, using package-internal test files that represent data subsets
from some annotation resources.
The R object representing the compound annotation is the CompDb object.
Each object (and its underlying database) is supposed to contain and
provide annotations from a single source (e.g. HMDB or LipidMaps), but
it is also possible to create cross-source databases.
Creating CompDb databases
CompDb databases can be created from existing data resources such as the
Human Metabolome Database (HMDB) by importing all of their data, or
alternatively be built by sequentially adding data and information to
the database. This section first explains how data can be loaded from
existing resources to create a CompDb database, and in the last
subsection (CompDb by sequentially filling with data) how an empty
CompDb can be sequentially and manually filled with annotation data.
The CompoundDb package provides the compound_tbl_sdf and the
compound_tbl_lipidblast functions to extract relevant compound
annotations from files in SDF (structure-data file) format or from the
json files from LipidBlast (http://mona.fiehnlab.ucdavis.edu/downloads).
CompoundDb can process SDF files from:
- Human Metabolome Database (HMDB): http://www.hmdb.ca
- Chemical Entities of Biological Interest (ChEBI): https://www.ebi.ac.uk/chebi
- LIPID MAPS Structure Database (LMSD): http://www.lipidmaps.org
- PubChem: https://pubchem.ncbi.nlm.nih.gov
- MoNa (Massbank of North America): http://mona.fiehnlab.ucdavis.edu (for MoNa import see the section CompDb from MoNA data below).
Note however that it is also possible to define such a table manually
and use that to create the database. A simple example for this is
provided in the section CompDb from custom data below; see also the help
page of createCompDb for more details.
CompDb from HMDB data
Below we use the compound_tbl_sdf function to extract compound
annotations from an SDF file representing a very small subset of the
HMDB database. To generate a database for the full HMDB we would have to
download the structures.sdf file containing all metabolites and load
that file instead.
library(CompoundDb)
## Locate the file
hmdb_file <- system.file("sdf/HMDB_sub.sdf.gz", package = "CompoundDb")
## Extract the data
cmps <- compound_tbl_sdf(hmdb_file)
By default the function returns a (data.frame-equivalent) tibble (from
the tidyverse’s tibble package).
cmps
## # A tibble: 9 × 8
## compound_id name inchi inchikey formula exactmass synonyms smiles
## <chr> <chr> <chr> <chr> <chr> <dbl> <named > <chr>
## 1 HMDB0000001 1-Methylhistidine InCh… BRMWTNU… C7H11N… 169. <chr> "CN1C…
## 2 HMDB0000002 1,3-Diaminopropa… InCh… XFNJVJP… C3H10N2 74.1 <chr> "NCCC…
## 3 HMDB0000005 2-Ketobutyric ac… InCh… TYEYBOS… C4H6O3 102. <chr> "CCC(…
## 4 HMDB0000008 2-Hydroxybutyric… InCh… AFENDNX… C4H8O3 104. <chr> "CCC(…
## 5 HMDB0000010 2-Methoxyestrone InCh… WHEUWNK… C19H24… 300. <chr> "[H][…
## 6 HMDB0000011 (R)-3-Hydroxybut… InCh… WHBMMWS… C4H8O3 104. <chr> "C[C@…
## 7 HMDB0000012 Deoxyuridine InCh… MXHRCPN… C9H12N… 228. <chr> "OC[C…
## 8 HMDB0004370 N-Methyltryptami… InCh… NCIKQJB… C11H14… 174. <chr> "CNCC…
## 9 HMDB0006719 5,6-trans-Vitami… InCh… QYSXJUF… C27H44O 384. <chr> "CC(C…
The tibble contains the columns:
- compound_id: the resource-specific ID of the compound. Can be an integer or a character.
- name: the name of the compound, mostly a generic or common name.
- inchi: the compound’s InChI.
- inchikey: the InChIKey.
- formula: the chemical formula of the compound.
- exactmass: the compound’s (monoisotopic) mass.
- synonyms: a list of aliases/synonyms for the compound.
- smiles: the SMILES of the compound.
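The column layout described above can be checked directly on the
imported table. The following is only a small sketch using the same
test file as above; it lists the column names and types and looks at the
synonyms list column, which holds one character vector per compound.

```r
library(CompoundDb)

## Same package-internal test file as used above
hmdb_file <- system.file("sdf/HMDB_sub.sdf.gz", package = "CompoundDb")
cmps <- compound_tbl_sdf(hmdb_file)

## Names and (first) class of each column
colnames(cmps)
vapply(cmps, function(z) class(z)[1L], character(1))

## synonyms is a list column: number of synonyms per compound
lengths(cmps$synonyms)
```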
To create a simple compound database, we could pass this tibble along
with the required metadata to the createCompDb function. In the present
example we however also want to add MS/MS spectrum data to the database.
We thus load below the MS/MS spectra for some of the compounds from the
respective xml files downloaded from HMDB. To this end we pass the path
to the folder in which the files are located to the msms_spectra_hmdb
function. The function identifies the xml files containing MS/MS spectra
based on their file name and loads the respective spectrum data. The
folder can therefore also contain other files, but the xml files from
HMDB should not be renamed or the function will not recognize them. Note
also that at present only MS/MS spectrum xml files from HMDB are
supported (one xml file per spectrum); these can be downloaded from HMDB
as the hmdb_all_spectra.zip file.
## Locate the folder with the xml files
xml_path <- system.file("xml", package = "CompoundDb")
spctra <- msms_spectra_hmdb(xml_path)
Here too, spectra information can be provided manually by adhering to
the expected structure of the data.frame (see ?createCompDb for
details).
Finally we have to create the metadata for the resource. The metadata of
a CompDb resource is crucial as it defines the origin of the annotations
and its version. This information should thus be carefully defined by
the user. Below we use the make_metadata helper function to create a
data.frame in the expected format. The organism should be provided in
the format "Hsapiens" for human or "Mmusculus" for mouse, i.e. a capital
first letter followed by lower case characters without whitespace.
metad <- make_metadata(source = "HMDB", url = "http://www.hmdb.ca",
source_version = "4.0", source_date = "2017-09",
organism = "Hsapiens")
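The naming convention for the organism can be expressed as a simple
pattern check. The helper below, is_valid_organism, is hypothetical and
not part of CompoundDb; it only mirrors the rule stated in the text and
makes no claim about how make_metadata itself validates its input.

```r
## Hypothetical helper (not a CompoundDb function): one upper case letter
## followed by one or more lower case letters, no whitespace
is_valid_organism <- function(x) {
    grepl("^[[:upper:]][[:lower:]]+$", x)
}

is_valid_organism(c("Hsapiens", "Mmusculus", "H sapiens", "hsapiens"))
```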
With all the required data ready we create the SQLite database for the
HMDB subset. With the path parameter we specify the directory in which
we want to save the database. This defaults to the current working
directory, but for this example we save the database into a temporary
folder.
db_file <- createCompDb(cmps, metadata = metad, msms_spectra = spctra,
path = tempdir())
The variable db_file now contains the file name of the SQLite database.
We can pass this file name to the CompDb function to get the CompDb
object acting as the interface to the database.
cmpdb <- CompDb(db_file)
cmpdb
## class: CompDb
## data source: HMDB
## version: 4.0
## organism: Hsapiens
## compound count: 9
## MS/MS spectra count: 4
To extract all compounds from the database we can use the compounds
function. The parameter columns allows choosing the database columns to
return. Columns from any of the database tables are supported. To get an
overview of the available database tables and their columns, the tables
function can be used:
tables(cmpdb)
## $ms_compound
## [1] "compound_id" "name" "inchi" "inchikey" "formula"
## [6] "exactmass" "smiles"
##
## $msms_spectrum
## [1] "original_spectrum_id" "compound_id" "polarity"
## [4] "collision_energy" "predicted" "splash"
## [7] "instrument_type" "instrument" "precursor_mz"
## [10] "spectrum_id" "msms_mz_range_min" "msms_mz_range_max"
##
## $msms_spectrum_peak
## [1] "spectrum_id" "mz" "intensity" "peak_id"
##
## $synonym
## [1] "compound_id" "synonym"
Below we extract only selected columns from the compounds table.
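The call producing the following table is not shown here. Judging from
the columns in the output and the columns parameter described above, it
was presumably along the lines of the sketch below. The block uses the
cmpdb object from above and rebuilds it only if it is run on its own.

```r
library(CompoundDb)

## Uses cmpdb from above; rebuilt here only if the block is run alone
if (!exists("cmpdb")) {
    out_dir <- tempfile()
    dir.create(out_dir)
    db_file <- createCompDb(
        compound_tbl_sdf(system.file("sdf/HMDB_sub.sdf.gz",
                                     package = "CompoundDb")),
        metadata = make_metadata(source = "HMDB", url = "http://www.hmdb.ca",
                                 source_version = "4.0",
                                 source_date = "2017-09",
                                 organism = "Hsapiens"),
        msms_spectra = msms_spectra_hmdb(system.file("xml",
                                                     package = "CompoundDb")),
        path = out_dir)
    cmpdb <- CompDb(db_file)
}

## Return only three columns of the compound table
res <- compounds(cmpdb, columns = c("name", "formula", "exactmass"))
res
```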
## name formula exactmass
## 1 (R)-3-Hydroxybutyric acid C4H8O3 104.0473
## 2 1,3-Diaminopropane C3H10N2 74.0844
## 3 1-Methylhistidine C7H11N3O2 169.0851
## 4 2-Hydroxybutyric acid C4H8O3 104.0473
## 5 2-Ketobutyric acid C4H6O3 102.0317
## 6 2-Methoxyestrone C19H24O3 300.1725
## 7 5,6-trans-Vitamin D3 C27H44O 384.3392
## 8 Deoxyuridine C9H12N2O5 228.0746
## 9 N-Methyltryptamine C11H14N2 174.1157
Analogously we can use the Spectra function to extract spectrum data
from the database. By default the function returns a Spectra object from
the Spectra package with all spectra metadata available as spectra
variables.
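The code for this step is missing here, while the following examples
rely on a variable named sps. A minimal sketch, assuming the cmpdb
object from above (rebuilt only if the block is run on its own), could
look as follows. Attaching the Spectra package explicitly is an
assumption made to ensure that its accessor functions are available.

```r
library(CompoundDb)
library(Spectra)

## Uses cmpdb from above; rebuilt here only if the block is run alone
if (!exists("cmpdb")) {
    out_dir <- tempfile()
    dir.create(out_dir)
    db_file <- createCompDb(
        compound_tbl_sdf(system.file("sdf/HMDB_sub.sdf.gz",
                                     package = "CompoundDb")),
        metadata = make_metadata(source = "HMDB", url = "http://www.hmdb.ca",
                                 source_version = "4.0",
                                 source_date = "2017-09",
                                 organism = "Hsapiens"),
        msms_spectra = msms_spectra_hmdb(system.file("xml",
                                                     package = "CompoundDb")),
        path = out_dir)
    cmpdb <- CompDb(db_file)
}

## Extract all MS/MS spectra of the database as a Spectra object
sps <- Spectra(cmpdb)
sps
```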
## MSn data (Spectra) with 4 spectra in a MsBackendCompDb backend:
## msLevel precursorMz polarity
## <integer> <numeric> <integer>
## 1 NA NA 1
## 2 NA NA 1
## 3 NA NA 1
## 4 NA NA 0
## ... 32 more variables/columns.
## Use 'spectraVariables' to list all of them.
## data source: HMDB
## version: 4.0
## organism: Hsapiens
The available spectra variables for the Spectra object can be retrieved
with spectraVariables:
spectraVariables(sps)
## [1] "msLevel" "rtime"
## [3] "acquisitionNum" "scanIndex"
## [5] "dataStorage" "dataOrigin"
## [7] "centroided" "smoothed"
## [9] "polarity" "precScanNum"
## [11] "precursorMz" "precursorIntensity"
## [13] "precursorCharge" "collisionEnergy"
## [15] "isolationWindowLowerMz" "isolationWindowTargetMz"
## [17] "isolationWindowUpperMz" "compound_id"
## [19] "name" "inchi"
## [21] "inchikey" "formula"
## [23] "exactmass" "smiles"
## [25] "original_spectrum_id" "predicted"
## [27] "splash" "instrument_type"
## [29] "instrument" "spectrum_id"
## [31] "msms_mz_range_min" "msms_mz_range_max"
## [33] "synonym"
Individual spectra variables can be accessed with the $ operator:
sps$collisionEnergy
## [1] 10 25 NA 20
And the actual m/z and intensity values with mz and intensity:
mz(sps)
## NumericList of length 4
## [[1]] 109.2 124.2 124.5 170.16 170.52
## [[2]] 83.1 96.12 97.14 109.14 124.08 125.1 170.16
## [[3]] 44.1 57.9 61.4 71.2 73.8 78.3 78.8 ... 142.9 144.1 157.6 158 175.2 193.2
## [[4]] 111.0815386 249.2587746 273.2587746 ... 367.3006394 383.3319396
## m/z of the 2nd spectrum
mz(sps)[[2]]
## [1] 83.10 96.12 97.14 109.14 124.08 125.10 170.16
Note that it is also possible to retrieve specific spectra, e.g. for a
given compound, or to add compound annotations to the Spectra object.
Below we use the filter expression ~ compound_id == "HMDB0000001" to get
only MS/MS spectra for the specified compound. In addition we ask for
the "name" and "inchikey" of the compound.
sps <- Spectra(cmpdb, filter = ~ compound_id == "HMDB0000001",
columns = c(tables(cmpdb)$msms_spectrum, "name",
"inchikey"))
sps
## MSn data (Spectra) with 2 spectra in a MsBackendCompDb backend:
## msLevel precursorMz polarity
## <integer> <numeric> <integer>
## 1 NA NA 1
## 2 NA NA 1
## ... 32 more variables/columns.
## Use 'spectraVariables' to list all of them.
## data source: HMDB
## version: 4.0
## organism: Hsapiens
The available spectra variables:
spectraVariables(sps)
## [1] "msLevel" "rtime"
## [3] "acquisitionNum" "scanIndex"
## [5] "dataStorage" "dataOrigin"
## [7] "centroided" "smoothed"
## [9] "polarity" "precScanNum"
## [11] "precursorMz" "precursorIntensity"
## [13] "precursorCharge" "collisionEnergy"
## [15] "isolationWindowLowerMz" "isolationWindowTargetMz"
## [17] "isolationWindowUpperMz" "compound_id"
## [19] "name" "inchi"
## [21] "inchikey" "formula"
## [23] "exactmass" "smiles"
## [25] "original_spectrum_id" "predicted"
## [27] "splash" "instrument_type"
## [29] "instrument" "spectrum_id"
## [31] "msms_mz_range_min" "msms_mz_range_max"
## [33] "synonym"
The compound’s name and InChIKey have thus also been added as spectra variables:
sps$inchikey
## [1] "BRMWTNUJHUMWMS-LURJTMIESA-N" "BRMWTNUJHUMWMS-LURJTMIESA-N"
To share or archive a CompDb database created this way, we can also
create a dedicated R package containing the annotation. To enable
reproducible research, each CompDb package should contain the version of
the originating data source in its file name (which is by default
extracted from the metadata of the resource). Below we create a CompDb
package from the generated database file. Additional information we have
to provide to the function are the package creator/maintainer and the
package version.
createCompDbPackage(
db_file, version = "0.0.1", author = "J Rainer", path = tempdir(),
maintainer = "Johannes Rainer <johannes.rainer@eurac.edu>")
## Creating package in /tmp/Rtmpc3LOpB/CompDb.Hsapiens.HMDB.4.0
The function creates a folder (in our case in a temporary directory)
that can be built and installed with R CMD build and R CMD INSTALL.
Special care should also be given to the license of the package, which
can be passed with the license parameter. The license of the package,
and whether and how the package can be distributed, will also depend on
the license of the originating resource.
CompDb from custom data
A CompDb database can also be created from custom, manually defined
annotations. To illustrate this we first create a data.frame with some
arbitrary compound annotations. According to the ?createCompDb help
page, the data frame needs to have the columns "compound_id", "name",
"inchi", "inchikey", "formula", "exactmass" and "synonyms". All columns
except "compound_id" can also contain missing values. It is also
possible to define additional columns. Below we thus create a data.frame
with some compound annotations as well as an additional column
("compound_group"). Note that all annotations in this example are for
illustration purposes only and are by no means real. Also, we don’t
provide any information for the columns "inchi", "inchikey" and
"formula", setting all values for these to NA.
cmps <- data.frame(
compound_id = c("CP_0001", "CP_0002", "CP_0003", "CP_0004"),
name = c("A", "B", "C", "D"),
inchi = NA_character_,
inchikey = NA_character_,
formula = NA_character_,
exactmass = c(123.4, 234.5, 345.6, 456.7),
compound_group = c("G01", "G02", "G01", "G03")
)
Next we also add synonyms for each compound. This column supports multiple values for each row.
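The code adding the synonyms is missing here. Because the column can
hold several values per row, one way to define it is as a list with one
character vector per compound, as sketched below. The synonym values are
invented for illustration, and the cmps data.frame is recreated only if
the block is run on its own.

```r
## Uses the cmps data.frame from above; recreated only if run alone
if (!exists("cmps") || !is.data.frame(cmps) || nrow(cmps) != 4L) {
    cmps <- data.frame(
        compound_id = c("CP_0001", "CP_0002", "CP_0003", "CP_0004"),
        name = c("A", "B", "C", "D"),
        inchi = NA_character_,
        inchikey = NA_character_,
        formula = NA_character_,
        exactmass = c(123.4, 234.5, 345.6, 456.7),
        compound_group = c("G01", "G02", "G01", "G03"))
}

## One character vector per row (made-up values)
cmps$synonyms <- list(
    c("a", "a1"),
    "b",
    c("c", "c1", "c2"),
    "d")

## Number of synonyms per compound
lengths(cmps$synonyms)
```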
We also need to define the metadata for our database, which we do with
the make_metadata function. With this information we can already create
a first rudimentary CompDb database that contains only compound
annotations. We thus create below our custom CompDb database in a
temporary directory. We also manually specify the name of our database
with the dbFile parameter; if not provided, the name of the database
will be constructed based on information from the metadata parameter. In
a real-case scenario, path and dbFile should be changed to something
more meaningful.
metad <- make_metadata(source = "manually defined", url = "",
source_version = "1.0.0", source_date = "2022-03-01",
organism = NA_character_)
db_file <- createCompDb(cmps, metadata = metad, path = tempdir(),
dbFile = "CompDb.test.sqlite")
We can now load this toy database using the CompDb function, providing
the full path to the database file. Note that we load the database in
read-write mode by specifying flags = RSQLite::SQLITE_RW. By default
CompDb loads databases in read-only mode, ensuring that the data within
the database cannot be compromised. In our case we would however like to
add more information to this database later and hence load it in
read-write mode.
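The call loading the database is not shown here, while the following
code uses a variable named cdb. Based on the description above it was
presumably similar to the sketch below. The block uses db_file from the
createCompDb call above; a small stand-in database is created only if
the block is run on its own.

```r
library(CompoundDb)

## Uses db_file from above; a stand-in is created only if run alone
if (!exists("db_file") || !file.exists(db_file)) {
    toy <- data.frame(
        compound_id = c("CP_0001", "CP_0002", "CP_0003", "CP_0004"),
        name = c("A", "B", "C", "D"),
        inchi = NA_character_,
        inchikey = NA_character_,
        formula = NA_character_,
        exactmass = c(123.4, 234.5, 345.6, 456.7),
        compound_group = c("G01", "G02", "G01", "G03"))
    toy$synonyms <- list(c("a", "a1"), "b", c("c", "c1", "c2"), "d")
    out_dir <- tempfile()
    dir.create(out_dir)
    db_file <- createCompDb(
        toy,
        metadata = make_metadata(source = "manually defined", url = "",
                                 source_version = "1.0.0",
                                 source_date = "2022-03-01",
                                 organism = NA_character_),
        path = out_dir, dbFile = "CompDb.test.sqlite")
}

## Open the database in read-write mode
cdb <- CompDb(db_file, flags = RSQLite::SQLITE_RW)
cdb
```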
## class: CompDb
## data source: manually defined
## version: 1.0.0
## organism: NA
## compound count: 4
We can now retrieve annotations from the database with the compounds
function.
compounds(cdb)
## name inchi inchikey formula exactmass compound_group
## 1 A <NA> <NA> <NA> 123.4 G01
## 2 B <NA> <NA> <NA> 234.5 G02
## 3 C <NA> <NA> <NA> 345.6 G01
## 4 D <NA> <NA> <NA> 456.7 G03
We can also search and filter the annotations.
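The filtering call itself is missing here. Since the output below lists
only compounds A and B, a filter on the compound name would give such a
result. The exact expression used originally is not known, so the
formula below is an assumption. The block uses cdb from above and builds
a small stand-in only if it is run on its own.

```r
library(CompoundDb)

## Uses cdb from above; a stand-in is created only if run alone
if (!exists("cdb")) {
    cdb <- emptyCompDb(tempfile())
    cdb <- insertCompound(
        cdb,
        data.frame(
            compound_id = c("CP_0001", "CP_0002", "CP_0003", "CP_0004"),
            name = c("A", "B", "C", "D"),
            exactmass = c(123.4, 234.5, 345.6, 456.7),
            compound_group = c("G01", "G02", "G01", "G03")),
        addColumns = TRUE)
}

## Keep only the compounds with the given names
res <- compounds(cdb, filter = ~ name %in% c("B", "A"))
res
```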
## name inchi inchikey formula exactmass compound_group
## 1 A <NA> <NA> <NA> 123.4 G01
## 2 B <NA> <NA> <NA> 234.5 G02
Next we would also like to add MS2 spectra data to the database. This
could be done either directly in the createCompDb call with parameter
msms_spectra, or with the insertSpectra function, which adds MS2 spectra
data provided as a Spectra object to an existing CompDb. We thus
manually create below a Spectra object with some arbitrary MS2 spectra.
Alternatively, Spectra objects can be imported from a variety of input
sources, including MGF or MSP files using e.g. the MsBackendMgf or
MsBackendMsp packages.
#' Define basic spectra variables
df <- DataFrame(msLevel = 2L, precursorMz = c(124.4, 124.4, 235.5))
#' Add m/z and intensity information for each spectrum
df$mz <- list(
c(3, 20, 59.1),
c(2, 10, 30, 59.1),
c(100, 206, 321.1))
df$intensity <- list(
c(10, 13, 45),
c(5, 8, 9, 43),
c(10, 20, 400))
#' Create the Spectra object
sps <- Spectra(df)
The Spectra object also needs to have a variable (column) called
"compound_id" which defines the existing compound in the database with
which each spectrum is associated.
compounds(cdb, "compound_id")
## compound_id
## 1 CP_0001
## 2 CP_0002
## 3 CP_0003
## 4 CP_0004
sps$compound_id <- c("CP_0001", "CP_0001", "CP_0002")
We can also add additional information to the spectra, such as the instrument.
sps$instrument <- "AB Sciex TripleTOF 5600+"
And we can now add these spectra to our existing toy CompDb. Parameter
columns specifies which of the spectra variables should be stored in the
database.
cdb <- insertSpectra(cdb, spectra = sps,
columns = c("compound_id", "msLevel",
"precursorMz", "instrument"))
cdb
## class: CompDb
## data source: manually defined
## version: 1.0.0
## organism: NA
## compound count: 4
## MS/MS spectra count: 3
We thus now have a CompDb database with compound annotations and 3 MS2
spectra. We could for example also retrieve the MS2 spectra for the
compound with the name A from the database with:
Spectra(cdb, filter = ~ name == "A")
## MSn data (Spectra) with 2 spectra in a MsBackendCompDb backend:
## msLevel precursorMz polarity
## <integer> <numeric> <integer>
## 1 2 124.4 NA
## 2 2 124.4 NA
## ... 31 more variables/columns.
## Use 'spectraVariables' to list all of them.
## data source: manually defined
## version: 1.0.0
## organism: NA
CompDb from MoNA data
MoNa (Massbank of North America) provides a large SDF file with all
spectra which can be used to create a CompDb object with compound
information and MS/MS spectra. Note however that MoNa is organized by
spectra and the annotation of the compounds is not consistent and
normalized. Spectra from the same compound can have their own compound
identifier and data that can differ e.g. in the chemical formula, the
precision of the exact mass or other fields.
Similar to the example above, compound annotations can be imported with
the compound_tbl_sdf function while spectrum data can be imported with
msms_spectra_mona. In the example below we however use the
import_mona_sdf function, which wraps both functions to read both
compound and spectrum data from an SDF file without having to import the
file twice. As an example we use a small subset from a MoNa SDF file
that contains only 7 spectra.
mona_sub <- system.file("sdf/MoNa_export-All_Spectra_sub.sdf.gz",
package = "CompoundDb")
mona_data <- import_mona_sdf(mona_sub)
## Warning: MoNa data can currently not be normalized and the compound table
## contains thus highly redundant data.
As a result we get a list with one data.frame each for compound and
spectrum information. These can be passed along to the createCompDb
function to create the database (see below).
metad <- make_metadata(source = "MoNa",
url = "http://mona.fiehnlab.ucdavis.edu/",
source_version = "2018.11", source_date = "2018-11",
organism = "Unspecified")
mona_db_file <- createCompDb(mona_data$compound, metadata = metad,
msms_spectra = mona_data$msms_spectrum,
path = tempdir())
We can now load and use this database, e.g. by extracting all compounds as shown below.
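The corresponding code is not shown here. A minimal sketch, assuming the
mona_db_file variable from above, is given below; the variable names
mona and mona_cmps are chosen for illustration only. The database is
rebuilt from the package-internal test file only if the block is run on
its own.

```r
library(CompoundDb)

## Uses mona_db_file from above; rebuilt here only if run alone
if (!exists("mona_db_file") || !file.exists(mona_db_file)) {
    out_dir <- tempfile()
    dir.create(out_dir)
    mona_data <- suppressWarnings(import_mona_sdf(
        system.file("sdf/MoNa_export-All_Spectra_sub.sdf.gz",
                    package = "CompoundDb")))
    mona_db_file <- createCompDb(
        mona_data$compound,
        metadata = make_metadata(source = "MoNa",
                                 url = "http://mona.fiehnlab.ucdavis.edu/",
                                 source_version = "2018.11",
                                 source_date = "2018-11",
                                 organism = "Unspecified"),
        msms_spectra = mona_data$msms_spectrum,
        path = out_dir)
}

## Load the database (read-only) and extract all compounds
mona <- CompDb(mona_db_file)
mona_cmps <- compounds(mona)
mona_cmps
```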
## name
## 1 Sulfachlorpyridazine
## 2 Sulfaclozine
## 3 Sulfadimidine
## 4 Sulfamethazine
## 5 Sulfamethazine
## inchi
## 1 InChI=1S/C10H9ClN4O2S/c11-9-5-6-10(14-13-9)15-18(16,17)8-3-1-7(12)2-4-8/h1-6H,12H2,(H,14,15)
## 2 InChI=1S/C10H9ClN4O2S/c11-9-5-13-6-10(14-9)15-18(16,17)8-3-1-7(12)2-4-8/h1-6H,12H2,(H,14,15)
## 3 InChI=1S/C12H14N4O2S/c1-8-7-9(2)15-12(14-8)16-19(17,18)11-5-3-10(13)4-6-11/h3-7H,13H2,1-2H3,(H,14,15,16)
## 4 InChI=1S/C12H14N4O2S/c1-8-7-9(2)15-12(14-8)16-19(17,18)11-5-3-10(13)4-6-11/h3-7H,13H2,1-2H3,(H,14,15,16)
## 5 InChI=1S/C12H14N4O2S/c1-8-7-9(2)15-12(14-8)16-19(17,18)11-5-3-10(13)4-6-11/h3-7H,13H2,1-2H3,(H,14,15,16)
## inchikey formula exactmass smiles
## 1 XOXHILFPRYWFOD-UHFFFAOYSA-N C10H9ClN4O2S 284.0135 <NA>
## 2 QKLPUVXBJHRFQZ-UHFFFAOYSA-N C10H9ClN4O2S 284.0135 <NA>
## 3 ASWVTGNCAZCNNR-UHFFFAOYSA-N C12H14N4O2S 278.0837 <NA>
## 4 ASWVTGNCAZCNNR-UHFFFAOYSA-N C12H14N4O2S 278.0837 <NA>
## 5 ASWVTGNCAZCNNR-UHFFFAOYSA-N C12H14N4O2S 278.0837 <NA>
As stated in the introduction of this section, the compound information
contains redundant data and the table has essentially one row per
spectrum. Feedback on how to reduce the redundancy in the ms_compound
table is highly appreciated.
CompDb by sequentially filling with data
As an alternative to creating a full database from an existing resource
it is also possible to create an empty CompDb database and sequentially
fill it with data. This could for example be used to create a
laboratory-specific annotation library with compound, ion and fragment
spectra of pure standards measured on a certain LC-MS setup. Below we
create an empty CompDb database, providing the file name of the
database. In the example we store the database to a temporary file but
in a real use case a meaningful file name and file path should be used
instead.
dbfile <- tempfile()
mydb <- emptyCompDb(dbfile)
mydb
## class: CompDb
## data source: NA
## version: NA
## organism: NA
## compound count: 0
We next define some first compound annotations we want to add to the
database. For compound annotations, the fields "compound_id" (an
arbitrary ID of the compound), "name" (the compound name), "inchi",
"inchikey", "formula" (the chemical formula) and "exactmass" (the
monoisotopic mass) are expected but, except for "compound_id", they can
also contain missing values or be completely omitted. Below we define a
data.frame with annotations for some compounds and add these annotations
to the database using the insertCompound function.
cmp <- data.frame(compound_id = c("1", "2"),
name = c("Caffeine", "Glucose"),
formula = c("C8H10N4O2", "C6H12O6"),
exactmass = c(194.080375584, 180.063388116))
mydb <- insertCompound(mydb, cmp)
mydb
## class: CompDb
## data source: NA
## version: NA
## organism: NA
## compound count: 2
We next add fragment spectra for the compounds. These could for example
represent MS2 spectra measured for the pure standard of the compound and
could be extracted from xcms result objects or other sources. Below we
load some fragment spectra for caffeine from an MGF file distributed
with this package. We use the MsBackendMgf package to import that data
into a Spectra object.
library(MsBackendMgf)
caf_ms2 <- Spectra(system.file("mgf", "caffeine.mgf", package = "CompoundDb"),
source = MsBackendMgf())
caf_ms2
## MSn data (Spectra) with 2 spectra in a MsBackendMgf backend:
## msLevel rtime scanIndex
## <integer> <numeric> <integer>
## 1 2 NA NA
## 2 2 NA NA
## ... 25 more variables/columns.
We can evaluate what spectra variables are available in the imported data.
spectraVariables(caf_ms2)
## [1] "msLevel" "rtime"
## [3] "acquisitionNum" "scanIndex"
## [5] "dataStorage" "dataOrigin"
## [7] "centroided" "smoothed"
## [9] "polarity" "precScanNum"
## [11] "precursorMz" "precursorIntensity"
## [13] "precursorCharge" "collisionEnergy"
## [15] "isolationWindowLowerMz" "isolationWindowTargetMz"
## [17] "isolationWindowUpperMz" "TITLE"
## [19] "original_spectrum_id" "compound_id"
## [21] "collision_energy" "predicted"
## [23] "splash" "spectrum_id"
## [25] "msms_mz_range_min" "msms_mz_range_max"
caf_ms2$rtime
## [1] NA NA
There are many variables available, but most of them, like for example
the retention time, are not defined as this information was not provided
in the MGF file. In order to associate these fragment spectra with the
caffeine compound we just added to the database, we need to assign them
the ID of the compound (in our case "1").
caf_ms2$compound_id <- "1"
We can then add the spectra to the database using the insertSpectra
function. With parameter columns we specify which of the spectra
variables we actually want to store in the database.
mydb <- insertSpectra(mydb, caf_ms2,
columns = c("compound_id", "msLevel", "splash",
"precursorMz", "collisionEnergy"))
mydb
## class: CompDb
## data source: NA
## version: NA
## organism: NA
## compound count: 2
## MS/MS spectra count: 2
We thus have now 2 compounds in the database and 2 fragment spectra:
compounds(mydb)
## name inchi inchikey formula exactmass
## 1 Caffeine <NA> <NA> C8H10N4O2 194.0804
## 2 Glucose <NA> <NA> C6H12O6 180.0634
sps <- Spectra(mydb)
sps$name
## [1] "Caffeine" "Caffeine"
With insertCompound and insertSpectra further compounds and fragment
spectra could be added to the database. Note that both functions also
support adding additional columns (fields or variables) to the database.
As an example we define below a compound with an arbitrary additional
column and add it to the database using parameter addColumns = TRUE.
cmps <- data.frame(compound_id = "3", name = "X003",
formula = "C5H2P3O", extra_field = "artificial compound")
mydb <- insertCompound(mydb, cmps, addColumns = TRUE)
compounds(mydb)
## name inchi inchikey formula exactmass extra_field
## 1 Caffeine <NA> <NA> C8H10N4O2 194.0804 <NA>
## 2 Glucose <NA> <NA> C6H12O6 180.0634 <NA>
## 3 X003 <NA> <NA> C5H2P3O NA artificial compound
The additional column is now available in the database. Existing entries
in a CompDb can also be deleted using the deleteCompound or
deleteSpectra functions. Both require as additional input the IDs of the
compound(s) (or spectra) to delete. Below we extract the IDs and names
of the compounds from our database.
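The call that lists the IDs and names is not shown here. Using the
columns parameter of the compounds function introduced earlier, it was
presumably similar to the following sketch. The block uses mydb from
above and creates a stand-in with the same three compounds only if it is
run on its own.

```r
library(CompoundDb)

## Uses mydb from above; a stand-in is created only if run alone
if (!exists("mydb")) {
    mydb <- emptyCompDb(tempfile())
    mydb <- insertCompound(
        mydb,
        data.frame(compound_id = c("1", "2", "3"),
                   name = c("Caffeine", "Glucose", "X003"),
                   formula = c("C8H10N4O2", "C6H12O6", "C5H2P3O")))
}

## IDs and names of all compounds in the database
ids <- compounds(mydb, columns = c("compound_id", "name"))
ids
```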
## compound_id name
## 1 1 Caffeine
## 2 2 Glucose
## 3 3 X003
We can now delete the compound "X003" with deleteCompound and the ID of
this compound.
mydb <- deleteCompound(mydb, ids = "3")
compounds(mydb)
## name inchi inchikey formula exactmass extra_field
## 1 Caffeine <NA> <NA> C8H10N4O2 194.0804 <NA>
## 2 Glucose <NA> <NA> C6H12O6 180.0634 <NA>
Note that deleting a compound with associated spectra (or ions) will
result in an error. It would thus not be possible to delete caffeine
from the database, because the database also contains MS2 spectra for
that compound. Using parameter recursive = TRUE in the deleteCompound
call would however allow deleting the compound along with all associated
spectra (and/or ions). Below we thus delete caffeine and the associated
MS2 spectra, which leaves us with a CompDb with a single compound and no
more MS2 spectra.
mydb <- deleteCompound(mydb, ids = "1", recursive = TRUE)
compounds(mydb)
## name inchi inchikey formula exactmass extra_field
## 1 Glucose <NA> <NA> C6H12O6 180.0634 <NA>
Spectra(mydb)
## MSn data (Spectra) with 0 spectra in a MsBackendCompDb backend:
Note that these functions can also be used to add or remove annotations
to/from any CompDb database, as long as the database is writeable
(i.e. the database is loaded by specifying flags = RSQLite::SQLITE_RW as
additional parameter to the CompDb call).
Session information
## R version 4.3.0 beta (2023-04-06 r84184)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.2 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] MsBackendMgf_1.7.5 Spectra_1.9.15 ProtGenerics_1.29.1
## [4] BiocParallel_1.33.12 CompoundDb_1.3.3 S4Vectors_0.37.5
## [7] BiocGenerics_0.45.3 AnnotationFilter_1.23.0 BiocStyle_2.27.1
##
## loaded via a namespace (and not attached):
## [1] tidyselect_1.2.0 dplyr_1.1.1 blob_1.2.4
## [4] bitops_1.0-7 fastmap_1.1.1 lazyeval_0.2.2
## [7] RCurl_1.98-1.12 digest_0.6.31 lifecycle_1.0.3
## [10] cluster_2.1.4 rsvg_2.4.0 RSQLite_2.3.1
## [13] magrittr_2.0.3 compiler_4.3.0 rlang_1.1.0
## [16] sass_0.4.5 tools_4.3.0 utf8_1.2.3
## [19] yaml_2.3.7 knitr_1.42 htmlwidgets_1.6.2
## [22] bit_4.0.5 xml2_1.3.3 purrr_1.0.1
## [25] desc_1.4.2 grid_4.3.0 fansi_1.0.4
## [28] colorspace_2.1-0 ggplot2_3.4.2 scales_1.2.1
## [31] MASS_7.3-58.4 cli_3.6.1 rmarkdown_2.21
## [34] ragg_1.2.5 generics_0.1.3 rjson_0.2.21
## [37] DBI_1.1.3 cachem_1.0.7 stringr_1.5.0
## [40] zlibbioc_1.45.0 parallel_4.3.0 BiocManager_1.30.20
## [43] XVector_0.39.0 base64enc_0.1-3 vctrs_0.6.1
## [46] jsonlite_1.8.4 bookdown_0.33 IRanges_2.33.1
## [49] bit64_4.0.5 clue_0.3-64 systemfonts_1.0.4
## [52] jquerylib_0.1.4 glue_1.6.2 pkgdown_2.0.7.9000
## [55] codetools_0.2-19 DT_0.27 stringi_1.7.12
## [58] gtable_0.3.3 GenomeInfoDb_1.35.16 GenomicRanges_1.51.4
## [61] munsell_0.5.0 tibble_3.2.1 pillar_1.9.0
## [64] htmltools_0.5.5 GenomeInfoDbData_1.2.10 R6_2.5.1
## [67] dbplyr_2.3.2 textshaping_0.3.6 rprojroot_2.0.3
## [70] evaluate_0.20 Biobase_2.59.0 png_0.1-8
## [73] memoise_2.0.1 bslib_0.4.2 MetaboCoreUtils_1.7.0
## [76] Rcpp_1.0.10 gridExtra_2.3 ChemmineR_3.51.1
## [79] xfun_0.38 fs_1.6.1 MsCoreUtils_1.11.5
## [82] pkgconfig_2.0.3