Working with `DataCombined` class

Introduction

In Modeling and Simulation (M&S) workflows, we often need to work with multiple observed and/or simulated datasets. Some of these datasets can be linked together in a group (e.g., the simulated and the corresponding observed data) for analysis and visualization.

The DataCombined class in ospsuite provides a container to store data, to group them, and to transform them (e.g. scale or offset). The class accepts data coming from SimulationResults or as a DataSet object (see Observed data).

Let’s first generate some simulation results and load observed data.

library(ospsuite)

# Simulation results
simFilePath <- system.file("extdata", "Aciclovir.pkml", package = "ospsuite")
sim <- loadSimulation(simFilePath)
simResults <- runSimulations(sim)[[1]]
outputPath <- "Organism|PeripheralVenousBlood|Aciclovir|Plasma (Peripheral Venous Blood)"

# observed data
obsData <- lapply(
  c("ObsDataAciclovir_1.pkml", "ObsDataAciclovir_2.pkml", "ObsDataAciclovir_3.pkml"),
  function(x) loadDataSetFromPKML(system.file("extdata", x, package = "ospsuite"))
)
# Name the elements of the list according to the names of the loaded `DataSet`.
names(obsData) <- lapply(obsData, function(x) x$name)

Typically, DataCombined is used to create figures using the plotXXX() functions. The different plot types are described in Visualizations with DataCombined. To visualize the different functionalities of DataCombined in this document, we will use the plotIndividualTimeProfile() function.

Creating `DataCombined` object

First, we create a new instance of DataCombined class.

myDataCombined <- DataCombined$new()

Then we add simulated results to this object using $addSimulationResults():

myDataCombined$addSimulationResults(
  simulationResults = simResults,
  quantitiesOrPaths = outputPath,
  names = "Aciclovir Plasma",
  groups = "Aciclovir PVB"
)

Next we add observed data to this object using $addDataSets():

myDataCombined$addDataSets(
  obsData$`Vergin 1995.Iv`,
  groups = "Aciclovir PVB"
)

Every data, be it simulated results or from DataSet, must have a unique name within DataCombined. If not specified by user, the path of simulated results or the $name property of the DataSet are used as the name. Alternatively, we can define the name when adding the data, as in the above example adding simulated results.

There are a few things to keep in mind here:

It doesn’t matter in which order observed or simulated datasets are added.
You can add multiple DataSet objects in $addDataSets() method call.
You can add only a single instance of SimulationResults in $addSimulationResults() method call.
If you add a dataset with the same name as existing (e.g., adding simulation results from multiple simulations and not specifying the name, so the path of the results is used as the name), the new data will replace the existing.

Grouping

Since the DataCombined object can store many datasets, some of them may naturally form a grouping and you would wish to track this in the object. Both $addDataSets() and $addSimulationResults() allow group specification via groups argument.

When being plotted, data sets without a grouping will appear with distinct color, line style (for simulated results), or symbol style (for observed data) with a separate entry in the legend, with the name of the data set being the legend entry.

Though it is possible to group data with different dimensions, it makes sense to only group data sets that have the same dimension (or dimensions that can be transformed into each other, like Amount and Mass) in order to be able to compare the data. Grouping data sets with different dimensions most likely will result in an error when trying to create plots or calculating residuals with such a DataCombined.

Let’s create a DataCombined and add simulation results and observed data without specifying their grouping and create a time profile:

myDataCombined <- DataCombined$new()

myDataCombined$addSimulationResults(
  simulationResults = simResults,
  quantitiesOrPaths = outputPath,
  names = "Aciclovir Plasma"
)

myDataCombined$addDataSets(
  obsData$`Vergin 1995.Iv`
)

plotIndividualTimeProfile(dataCombined = myDataCombined)

If you do not specify groups when you add datasets, and wish to update groupings later, you can use the $setGroups() method. All data within one group will get one entry in the legend with the name of the group, and will be plotted with the same color.

myDataCombined$setGroups(
  names = c("Aciclovir Plasma", obsData$`Vergin 1995.Iv`$name),
  groups = c("Aciclovir PVB", "Aciclovir PVB")
)

plotIndividualTimeProfile(dataCombined = myDataCombined)

At any point, you can check the current names and groupings with the following active field:

myDataCombined$groupMap
#> # A tibble: 2 × 3
#>   name             group         dataType 
#>   <chr>            <chr>         <chr>    
#> 1 Aciclovir Plasma Aciclovir PVB simulated
#> 2 Vergin 1995.Iv   Aciclovir PVB observed

Transformations

Sometimes the raw data included in DataCombined needs to be transformed using specified offset and scale factor values.

This is supported via $setDataTransformations() method, where you can specify the names of datasets, and offsets and scale factors. If these arguments are scalar (i.e., of length 1), then the same value will be applied to all specified names. Scale factors are unitless values by which raw data values will be multiplied, offsets are added to the raw values without any conversion and must therefore be in the same units as the raw data.

The internal data frame in DataCombined will be transformed with the specified parameters, with the new columns computed as:

For x values: newXValue = (rawXValue + xOffset) * xScaleFactor
For y values: newYValue = (rawYValue + yOffset) * yScaleFactor
For arithmetic error: newErrorValue = rawErrorValue * abs(yScaleFactor)
For geometric error: no transformation.

At any point, you can check the applied offsets and scale factors with the following active field:

myDataCombined$dataTransformations
#> # A tibble: 2 × 5
#>   name             xOffsets yOffsets xScaleFactors yScaleFactors
#>   <chr>               <dbl>    <dbl>         <dbl>         <dbl>
#> 1 Aciclovir Plasma        0        0             1             1
#> 2 Vergin 1995.Iv          0        0             1             1

Now lets take a closer look on possible situations where you would apply any transformations. Consider the observed data from the examples above. It reports concentrations of aciclovir after single intravenous administration.

myDataCombinedTranformations <- DataCombined$new()

myDataCombinedTranformations$addDataSets(
  obsData$`Vergin 1995.Iv`
)

plotIndividualTimeProfile(dataCombined = myDataCombinedTranformations)

However, we might want to use this data set with a simulation where aciclovir is administered 24 hours after simulation begin. To be able to compare simulation results with the data set, we can offset the observed data time by 24 hours. Keep in mind that the offset must be given in the same unit as the data set values are.

# Check the units of the observed time values
obsData$`Vergin 1995.Iv`$xUnit
#> [1] "h"

myDataCombinedTranformations$setDataTransformations(
  forNames = obsData$`Vergin 1995.Iv`$name,
  xOffsets = 24
)

plotIndividualTimeProfile(dataCombined = myDataCombinedTranformations)

In the next step, we want to normalize observed concentrations to a dose. We can easily achieve this with the scale factor. In the next example, we normalize observed values to 250 mg dose by setting the yScaleFactor to 1/250:

myDataCombinedTranformations$setDataTransformations(
  forNames = obsData$`Vergin 1995.Iv`$name,
  yScaleFactors = 1 / 250
)

plotIndividualTimeProfile(dataCombined = myDataCombinedTranformations)

Finally, offsetting the observation values might be useful when working with measurements of endogenous substrates, such as the hormone glucagon, and want to correct for the individual specific baseline levels of the hormone.

Extracting a combined data frame

The data frame (also sometimes called as a table) data structure is central to R-based workflows, and, thus, we may wish to extract a data frame for datasets present in the object.

Internally, DataCombined extracts data frames for observed and simulated datasets and combines them.

myDataCombined$toDataFrame()
#> # A tibble: 504 × 27
#>    IndividualId xValues name           yValues xDimension xUnit yDimension yUnit
#>           <int>   <dbl> <chr>            <dbl> <chr>      <chr> <chr>      <chr>
#>  1            0       0 Aciclovir Pla…    0    Time       min   Concentra… µmol…
#>  2            0       1 Aciclovir Pla…    3.25 Time       min   Concentra… µmol…
#>  3            0       2 Aciclovir Pla…    9.10 Time       min   Concentra… µmol…
#>  4            0       3 Aciclovir Pla…   15.0  Time       min   Concentra… µmol…
#>  5            0       4 Aciclovir Pla…   20.7  Time       min   Concentra… µmol…
#>  6            0       5 Aciclovir Pla…   26.2  Time       min   Concentra… µmol…
#>  7            0       6 Aciclovir Pla…   31.4  Time       min   Concentra… µmol…
#>  8            0       7 Aciclovir Pla…   36.4  Time       min   Concentra… µmol…
#>  9            0       8 Aciclovir Pla…   41.1  Time       min   Concentra… µmol…
#> 10            0       9 Aciclovir Pla…   45.5  Time       min   Concentra… µmol…
#> # ℹ 494 more rows
#> # ℹ 19 more variables: molWeight <dbl>, dataType <chr>, yErrorValues <dbl>,
#> #   yErrorType <chr>, yErrorUnit <chr>, lloq <dbl>, Source <chr>, File <chr>,
#> #   Sheet <chr>, Molecule <chr>, Species <chr>, Organ <chr>, Compartment <chr>,
#> #   `Study Id` <chr>, Gender <chr>, Dose <chr>, Route <chr>,
#> #   `Patient Id` <chr>, group <chr>

This function returns a tibble data frame. If you wish to modify how it is printed, you can have a look at the available options here. In fact, let’s change a few options and print the data frame again.

options(
  pillar.width = Inf, # show all columns
  pillar.min_chars = Inf # to turn off truncation of column titles
)

myDataCombined$toDataFrame()

#> # A tibble: 504 × 27
#>    IndividualId xValues name             yValues xDimension xUnit
#>           <int>   <dbl> <chr>              <dbl> <chr>      <chr>
#>  1            0       0 Aciclovir Plasma    0    Time       min  
#>  2            0       1 Aciclovir Plasma    3.25 Time       min  
#>  3            0       2 Aciclovir Plasma    9.10 Time       min  
#>  4            0       3 Aciclovir Plasma   15.0  Time       min  
#>  5            0       4 Aciclovir Plasma   20.7  Time       min  
#>  6            0       5 Aciclovir Plasma   26.2  Time       min  
#>  7            0       6 Aciclovir Plasma   31.4  Time       min  
#>  8            0       7 Aciclovir Plasma   36.4  Time       min  
#>  9            0       8 Aciclovir Plasma   41.1  Time       min  
#> 10            0       9 Aciclovir Plasma   45.5  Time       min  
#>    yDimension            yUnit  molWeight dataType  yErrorValues yErrorType
#>    <chr>                 <chr>      <dbl> <chr>            <dbl> <chr>     
#>  1 Concentration (molar) µmol/l      225. simulated           NA NA        
#>  2 Concentration (molar) µmol/l      225. simulated           NA NA        
#>  3 Concentration (molar) µmol/l      225. simulated           NA NA        
#>  4 Concentration (molar) µmol/l      225. simulated           NA NA        
#>  5 Concentration (molar) µmol/l      225. simulated           NA NA        
#>  6 Concentration (molar) µmol/l      225. simulated           NA NA        
#>  7 Concentration (molar) µmol/l      225. simulated           NA NA        
#>  8 Concentration (molar) µmol/l      225. simulated           NA NA        
#>  9 Concentration (molar) µmol/l      225. simulated           NA NA        
#> 10 Concentration (molar) µmol/l      225. simulated           NA NA        
#>    yErrorUnit  lloq Source File  Sheet Molecule Species Organ Compartment
#>    <chr>      <dbl> <chr>  <chr> <chr> <chr>    <chr>   <chr> <chr>      
#>  1 NA            NA NA     NA    NA    NA       NA      NA    NA         
#>  2 NA            NA NA     NA    NA    NA       NA      NA    NA         
#>  3 NA            NA NA     NA    NA    NA       NA      NA    NA         
#>  4 NA            NA NA     NA    NA    NA       NA      NA    NA         
#>  5 NA            NA NA     NA    NA    NA       NA      NA    NA         
#>  6 NA            NA NA     NA    NA    NA       NA      NA    NA         
#>  7 NA            NA NA     NA    NA    NA       NA      NA    NA         
#>  8 NA            NA NA     NA    NA    NA       NA      NA    NA         
#>  9 NA            NA NA     NA    NA    NA       NA      NA    NA         
#> 10 NA            NA NA     NA    NA    NA       NA      NA    NA         
#>    `Study Id` Gender Dose  Route `Patient Id` group        
#>    <chr>      <chr>  <chr> <chr> <chr>        <chr>        
#>  1 NA         NA     NA    NA    NA           Aciclovir PVB
#>  2 NA         NA     NA    NA    NA           Aciclovir PVB
#>  3 NA         NA     NA    NA    NA           Aciclovir PVB
#>  4 NA         NA     NA    NA    NA           Aciclovir PVB
#>  5 NA         NA     NA    NA    NA           Aciclovir PVB
#>  6 NA         NA     NA    NA    NA           Aciclovir PVB
#>  7 NA         NA     NA    NA    NA           Aciclovir PVB
#>  8 NA         NA     NA    NA    NA           Aciclovir PVB
#>  9 NA         NA     NA    NA    NA           Aciclovir PVB
#> 10 NA         NA     NA    NA    NA           Aciclovir PVB
#> # ℹ 494 more rows

ospsuite also provides a few helper functions to modify the data frame further.

When multiple (observed and/or simulated) datasets are present in a data frame, they are likely to have different units. convertUnits() function helps to convert them to a common unit. The function will not modify the DataCombined object but return a new data frame with unified units.¹.

convertUnits(
  myDataCombined,
  xUnit = ospUnits$Time$s,
  yUnit = ospUnits$`Concentration [mass]`$`µg/l`
)
#> # A tibble: 504 × 27
#>    IndividualId xValues name           yValues xDimension xUnit yDimension yUnit
#>           <int>   <dbl> <chr>            <dbl> <chr>      <chr> <chr>      <chr>
#>  1            0       0 Aciclovir Pla…      0  Time       s     Concentra… µg/l 
#>  2            0      60 Aciclovir Pla…    733. Time       s     Concentra… µg/l 
#>  3            0     120 Aciclovir Pla…   2050. Time       s     Concentra… µg/l 
#>  4            0     180 Aciclovir Pla…   3382. Time       s     Concentra… µg/l 
#>  5            0     240 Aciclovir Pla…   4668. Time       s     Concentra… µg/l 
#>  6            0     300 Aciclovir Pla…   5901. Time       s     Concentra… µg/l 
#>  7            0     360 Aciclovir Pla…   7077. Time       s     Concentra… µg/l 
#>  8            0     420 Aciclovir Pla…   8194. Time       s     Concentra… µg/l 
#>  9            0     480 Aciclovir Pla…   9249. Time       s     Concentra… µg/l 
#> 10            0     540 Aciclovir Pla…  10242. Time       s     Concentra… µg/l 
#> # ℹ 494 more rows
#> # ℹ 19 more variables: molWeight <dbl>, dataType <chr>, yErrorValues <dbl>,
#> #   yErrorType <chr>, yErrorUnit <chr>, lloq <dbl>, Source <chr>, File <chr>,
#> #   Sheet <chr>, Molecule <chr>, Species <chr>, Organ <chr>, Compartment <chr>,
#> #   `Study Id` <chr>, Gender <chr>, Dose <chr>, Route <chr>,
#> #   `Patient Id` <chr>, group <chr>

Further functionalities

Grouping of simulated results with observed data inside DataCombined also allows to calculate the error between simulated results and observations using the calculateResiduals() function. The function calculates the distance between each observed data point within the group and the simulated results. Simulation results interpolated if no simulated value is available for a certain observed time point. Observed data beyond simulated time is ignored, i.e., no extrapolation is performed.

There are two ways to calculate the distance (residuals) - linear and logarithmic - specified by the scaling argument. For linear scaling (scaling = tlf::Scaling$lin):

Residuals are calculated as: Simulation value - Observed value. This means that the residuals are defined by absolute differences.

For logarithmic scaling (scaling = tlf::Scaling$log):

Residuals are calculated as: log(Simulation value) - log(Observed value) = log (Simulation Value / Observed Value). This means that the ratio of values is considered which is independent of the magnitude of the value. But for very small observed values, in particular close to 0 values, this can lead to problems, because log(10E-N) = -N can becomes large.

# Linear residuals
calculateResiduals(myDataCombined, scaling = tlf::Scaling$lin)$residualValues
#>  [1]  9.49304275 -0.14538952  0.17965548  1.84599976  3.52552442  3.64160094
#>  [7]  3.15654735  2.38392729  1.48703530  0.50532597  0.08009793 -0.14892803
#> [13] -0.29395086

# Logarithmic residuals
calculateResiduals(myDataCombined, scaling = tlf::Scaling$log)$residualValues
#>  [1]  0.239719620 -0.007286928  0.012618336  0.155242835  0.384912455
#>  [6]  0.482210178  0.578799737  0.577914238  0.647582143  0.457679775
#> [11]  0.137321647 -0.418715711 -2.305882655

To quickly calculate the total error of the DataCombined, one can sum up the absolute values of the residuals:

# Linear residuals
totalError <- sum(abs(calculateResiduals(myDataCombined, scaling = tlf::Scaling$lin)$residualValues))

print(totalError)
#> [1] 26.88703

Visualizations with `DataCombined`

See Visualizations with DataCombined describing functions that can visualize data stored in DataCombined object.