# Multivariate Analysis¶

The nPYc-Toolbox provides the capacity to generate a PCA model of the data (via the pyChemometrics module), and subsequently, to use this to assess data quality, identify potential sample and feature outliers, and determine any potential analytical associations with the main sources of variance in the data.

## PCA Model¶

A PCA model can be generated using exploratoryAnalysisPCA():

PCAmodel = nPYc.multivariate.exploratoryAnalysisPCA(dataset, scaling=1)


There are a number of parameters which can be optimised depending on the dataset.

One key parameter is ‘scaling’, which divides each column in the data matrix by its respective standard deviation raised to a power of the scaling parameter. This parameter can range in value between 0 and 1, and recommended values are 0 for mean-centering only, 0.5 for Pareto scaling and 1 for unit variance (UV) scaling. The outcome of PCA model will vary based on the scaling method selected, and different scaling functions can be appropriate depending on the data itself and the question being asked of the data (van der Berg et al [1]).

The default scaling is unit variance (scaling=1), which scales every variable to have a variance of one, and thus all variables (despite their different magnitudes) become equally important in the model. For NMR, when smaller variables are more likely to be background noise, it may be that mean-centering the data only (scaling=0) can be appropriate.

Each model is cross-validated using 7-fold cross-validation and the recommended number of principal components is automatically estimated based on two criteria, when either one of these is met no more components will be added and the PCA model will be returned. There criteria are:

• minQ2: Q2 is the variance predicted by each component (from cross-validation), when adding a component does not improve Q2 by at least this value (default minQ2=0.05) then no more components will be added.
• maxComponents: this defines the maximum number of components (default maxComponents=10) returned by the model (regardless of Q2 increases).

If different parameters are required these can be specified as additional input arguments, for example, to generate a PCA model with a maximum of five components:

PCAmodel = nPYc.multivariate.exploratoryAnalysisPCA(dataset, scaling=1, maxComponents=5)


The main function parameters (which may be of interest to advanced users) are as follows:

nPYc.multivariate.exploratoryAnalysisPCA.exploratoryAnalysisPCA(npycDataset, scaling=1, maxComponents=10, minQ2=0.05, withExclusions=False, **kwargs)

Performs and exploratory analysis using PCA on the data contained in an Dataset.

Parameters: npycDataset (Dataset) – Dataset to model scaling – Choice of scaling. maxComponents (int) – Maximum number of components to fit. minQ2 – Minimum % of improvement in Q2Y over the previous component to add . withExclusions (Boolean) – If True, PCA will be fitted on the npyc_dataset after applying feature and sample Mask, if False the PCA is performed on whole dataset. Fitted PCA model ChemometricsPCA

## Multivariate Analysis Report¶

The analytical multivariate report provides visualisations summarising the largest sources of variance in the dataset (from a PCA model) with particular emphasis on any potential analytical sources, and can be generated using multivariateReport():

nPYc.reports.multivariateReport(dataset, PCAmodel)


These consist of:

• Scree plot of variance (Figure 1)
• Scores plots coloured by sample type (Figure 2)
• Strong sample outliers (Figure 3)
• DmodX sample outliers (Figure 4)
• Distribution of analytical parameters (Figure 6)
• Heatmap of potential associations between analytical parameters and the main sources of variance (Figures 7 and 8)
• Scores plots coloured by analytical parameters with potential association (Figures 9-11)

The following sections describe these in more detail:

Scree plot of variance

The scree plot (Figure 1) shows the percentage of variance (cumulative) both explained (R2) and predicted (Q2) by each principal component in the model.

This information can help to guide interpretation of the subsequent plots, for example, if separation is seen between QC samples in a given component, this would be much more serious if this component explained 50% of the variance than if the component only explained 3% of the variance.

If the variance is not predictable (negative or low Q2) this indicates that the model is not robust to different subsets of samples being are removed. This could be for a number of reasons, but it implies that there are likely no key analytical sources of variance, as these would inherently contribute throughout the run.

Scores plots coloured by sample type

The PCA scores represent the new location of the samples in each principal component. A typical way to look at these is to plot the scores values in two dimensions, corresponding to pairs of components. Each point in a scores plot therefore represents a sample, with samples close together being more similar to each other, and those further apart being more dissimilar. By colouring by sample type (Figure 2), we can check, firstly, the consistency of the QC samples (SR and LTR) i.e. that they are tightly clustered; and secondly, for the presence of any sample outliers (any samples which are very different to the others).

Outlying study samples in this plot are of interest (does this relate to biology, sample collection etc.), but not of particular concern. However, any QC samples with unusual behaviour, or unusual clustering within QC samples, should be checked. Further sections of the multivariate report can help to elucidate if this behaviour may result from an analytical source (see below).

Strong sample outliers

Strong sample outliers are those which are very different from the others, these are seen as far outside the Hotelling’s T ellipse (Figure 2) and with high values when you sum the total distance from origin across all components (Figure 3), these samples have high leverage and skew the model with their contribution to variance.

Outliers in study samples are common, owing for example, to the presence of strong drug or dietary related signals. If LTR samples are strong outliers this is not a concern, as they may have a significantly different composition to the study samples, similarly SR samples may be overly weighted with very high concentration metabolites in one or more study sample, so may also be located away from the origin.

It would be expected here for all LTR and all SR samples to have roughly the same total distance from origin (although the two groups may be different from each other), as above, any deviation from this should be investigated.

DmodX sample outliers

DmodX (also called distance to model) is a measure of how well each sample fits the model itself. Unlike the strong sample outliers, outliers in DmodX (Figure 4) do not skew the model, but are simply not well represented by the model.

The interpretation of the DmodX plot is exactly as for the strong sample outliers though, with any deviation from a consistent value for all SR or LTR samples worthy of further investigation.

The PCA loadings represent the weight of each original feature in forming the new PCA variables (scores). Thus there is a set of loadings (with values corresponding to each original feature) for each component in the model. The loadings can be plotted as for the scores, in pairs, however for improved interpretation here the loadings are plotted for each component separately. For LC-MS data, this takes the form of an ion map with points coloured by the model loadings (Figure 5A); for NMR data a standard (the median) spectrum, with each point coloured by it’s relative contribution to that particular component (Figure 5B); and for targeted datasets points corresponding to each named feature (Figure 5C).

These plots are useful in showing the regions of the spectrum with the most variance, and, by comparison with the scores plots (Figures 2 and 9-11) useful in determining any relationship to analytical sources (for example, variance in QC samples or association with specific analytical parameters).

Distributions of analytical parameters

Distributions of the sample metadata (in this case relating to recorded analytical parameters) are plotted in Figure 6 (in this example for the DEVSET LC-MS dataset).

Figure 6. Histograms of metadata distributions (plotted for fields with non-uniform values only).

This allows the user to check if the behaviour of each parameter is as expected.

Heatmap of potential associations between analytical parameters and the main sources of variance

Potential associations between the principal components and any analytical parameters is tested by calculating either the correlation (for continuous data) or a Kruskal-Wallis test (for categorical data) between each parameter and the PCA scores for each component. The returned values are displayed in Figures 7 and 8, for continuous and categorical data respectively.

Figure 7. Heatmap of correlation to PCA scores for suitable metadata fields.

These allow a quick assessment of whether any of the largest sources of variance in a dataset may have analytical sources. By default, any significant associations (correlation > 0.3 or Kruskal-Wallis p-value < 0.05) are plotted (see below), these default values can be amended by setting the “r_threshold” or “kw_threshold” when calling ‘multivariateReport’.

Scores plots coloured by analytical parameters with potential association

Additionally for any fields where the correlation or p-value (respectively) exceed the threshold (default thresholds r_threshold=0.3, kw_threshold=0.05) the PCA scores plots are generated with sample points coloured according to their values for the flagged analytical parameter (in this case Run Order):

Figure 9. PCA scores plots coloured by metadata (significance by correlation).

This allows quick identification and assessment of any analytical or pre-processing issues, and the subsequent action required depends on the analytical parameter flagged, for example, in this case batch and run-order correction would need to be applied, as there is a strong association with run order in PC3.

Further options

As for the main reports (see Quality Assessment Reports), there are a number of different options which can be specified by the user if required, some examples of which include:

# To save as html documents with static images to a specified location ('saveDir'):
saveDir = '/path to save outputs'
nPYc.reports.multivariateReport(dataset, PCAmodel, destinationPath=saveDir)

# To report on only those samples and features not set to be masked from the dataset, NOTE, PCAmodel must be generated with the same exlusion set:
PCAmodel = nPYc.multivariate.exploratoryAnalysisPCA(dataset, scaling=1, withExclusions=True)
nPYc.reports.multivariateReport(dataset, PCAmodel, withExclusions=True)

# To plot scores plots coloured by any analytical parameter with a correlation exceeding 0.4 (default value=0.3, see Figure 9 above):
nPYc.reports.multivariateReport(dataset, PCAmodel, r_threshold=0.4)


Full function parameters (which may be of interest to advanced users) are as follows:

class nPYc.reports.multivariateReport

PCA based analysis of a dataset. A PCA model is generated for the data object, then potential associations between the scores and any sample metadata determined by correlation (continuous data) or a Kruskal-Wallis test (categorical data).

The multivariateReport has three options for the reportType argument:

• ‘analytical’ Reports on analytical qualities of the data only (as defined in the relevant SOP).
• ‘biological’ Reports on biological qualities of the data only (all columns in sampleMetadata except those defined as analytical or skipped in the SOP).
• ‘all’ Reports on all qualities of the data (all columns in sampleMetadata except those defined as skipped in the SOP).
Parameters: dataTrue (Dataset) – Dataset to report on pcaModel (ChemometricsPCA) – PCA model object (scikit-learn based) reportType (str) – Type of sample metadata to report on, one of analytical, biological or all withExclusions (bool) – If True, only report on features and samples not masked by the sample and feature masks biologicalMeasurements (dict) – Dictionary of type of data contained in each biological sampleMetadata field. Keys are sampleMetadata column names, and values one of ‘categorical’, ‘continuous’, ‘date’ dModX_criticalVal (None or float) – Samples with a value in DModX space exceeding this critical value are listed as potential outliers dModX_criticalVal_type (None or str) – Type of critical value in DModX, one of Fcrit or Percentile scores_criticalVal (None or float) – Samples with a value in scores space exceeding this critical value are listed as potential outliers kw_threshold (None or float) – Fields with a Kruskal-Willis p-value greater than this are not deemed to have a significant association with the PCA score r_threshold (None or float) – Fields with a (absolute) correlation coefficient value less than this are not deemed to have a significant association with the PCA score hotellings_alpha (float) – Alpha value for plotting the Hotelling’s ellipse in scores plots (default = 0.05) excludeFields (None or list) – If not None, list of sample metadata fields to be additionally excluded from analysis destinationPath (None or str) – If None plot interactively, otherwise save report to the path specified

## Interactive Plots¶

Scores and loadings from a PCAmodel can also be explored interactively with the plotScoresInteractive() and plotLoadingsInteractive() functions.

For example, a scores plot for components 1 and 2, with sample points coloured by Run Order can be generated using:

data = nPYc.plotting.plotScoresInteractive(dataset, PCAmodel, 'Run Order', components=[1, 2])
iplot(data)


data = nPYc.plotting.plotLoadingsInteractive(dataset, PCAmodel, component=2)