Datasets

The nPYc-Toolbox is built around creating an object for each imported dataset. This object contains the metabolic profiling data itself, alongside all associated sample and feature metadata; various methods for generating, reporting and plotting important quality control parameters; and methods for pre-processing such as filtering poor quality features or correcting trends in batch and run-order.

The first step in creating an nPYc-Toolbox object is to import the acquired data, creating a Dataset specific for the data type:

For example, to import LC-MS data into a MSDataset object:

msData = nPYc.MSDataset('path to data')

Depending on the data type, the Dataset can be set up directly from the raw data, from common interchange formats, or from the outputs of popular data-processing tools. The supported data types are described in more detail in the data specific sections below.

When importing data, default parameters are loaded from the Configuration Files; these range from data-type-specific settings, such as the number of points to interpolate NMR data onto, to more general ones, such as the format in which to save figures. These parameters are stored in the Attributes dictionary and used throughout the rest of the pipeline.

For example, for NMR data, the nPYc-Toolbox contains two default configuration files, ‘GenericNMRurine’ and ‘GenericNMRblood’, for urine and blood datasets respectively. Therefore, to import NMR spectra from urine samples the sop parameter would be:

nmrData = nPYc.NMRDataset('path to data', sop='GenericNMRurine')

A full list of the parameters for each dataset type is given in the Built-in Configuration SOPs. If different values are required, these can be modified directly in the appropriate SOP file; alternatively, they can be set by the user by modifying the required ‘Attribute’, either at import or by direct modification later in the pipeline. For example, to set the line width threshold (LWFailThreshold) used to flag NMR spectra whose line widths do not meet this value:

# EITHER, set the required value (here 0.8) at import
nmrData = nPYc.NMRDataset(rawDataPath, pulseProgram='noesygppr1d', LWFailThreshold=0.8)

# OR, set the *Attribute* directly (after importing nmrData)
nmrData.Attributes['LWFailThreshold'] = 0.8

Dataset objects have several key attributes, including:

  • sampleMetadata: An \(n\) × \(p\) pandas dataframe of sample identifiers and sample-associated metadata (each row corresponds to a row in the intensityData matrix)
  • featureMetadata: An \(m\) × \(q\) pandas dataframe of feature identifiers and feature-associated metadata (each row corresponds to a column in the intensityData matrix)
  • intensityData: An \(n\) × \(m\) numpy matrix of measurements, where each element is the measured intensity of a specific feature in a specific sample (rows correspond to samples, columns to features)
  • sampleMask: An \(n\)-element numpy boolean vector where True and False flag samples for inclusion or exclusion respectively
  • featureMask: An \(m\)-element numpy boolean vector where True and False flag features for inclusion or exclusion respectively

Structure of the key attributes of a Dataset object. Of note, rows in the featureMetadata Dataframe correspond to columns in the intensityData matrix.
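The shape relationships described above can be sketched with standalone numpy and pandas objects (toy data; these variables mirror the Dataset attributes but are not a real Dataset object):

```python
import numpy as np
import pandas as pd

# A toy dataset: n = 3 samples, m = 4 features
intensityData = np.random.rand(3, 4)

# One row of sampleMetadata per ROW of intensityData
sampleMetadata = pd.DataFrame({'Sample File Name': ['s1', 's2', 's3']})

# One row of featureMetadata per COLUMN of intensityData
featureMetadata = pd.DataFrame({'Feature Name': ['f1', 'f2', 'f3', 'f4']})

# Boolean masks flagging samples and features for inclusion
sampleMask = np.ones(3, dtype=bool)
featureMask = np.ones(4, dtype=bool)

# The invariant every Dataset maintains
assert intensityData.shape == (len(sampleMetadata), len(featureMetadata))
```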

Once a Dataset has been created, you can query the number of features or samples it contains by running:

dataset.noFeatures
dataset.noSamples

Or directly inspect the sample or feature metadata, and the raw measurements:

dataset.sampleMetadata
dataset.featureMetadata
dataset.intensityData

For more details on using the sample and feature masks see Sample and Feature Masks.
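As a sketch of how the boolean masks select data (plain numpy indexing on toy values, not the toolbox API, which applies the masks internally):

```python
import numpy as np

intensityData = np.arange(12, dtype=float).reshape(3, 4)  # 3 samples x 4 features
sampleMask = np.array([True, False, True])          # exclude the second sample
featureMask = np.array([True, True, False, True])   # exclude the third feature

# Keep rows where sampleMask is True and columns where featureMask is True
retained = intensityData[sampleMask][:, featureMask]
print(retained.shape)  # (2, 3)
```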

It is possible to add additional study design parameters or sample metadata into the Dataset using the addSampleInfo() method (see Sample Metadata for details).

For full method specific details see Installation and Tutorials.

LC-MS Datasets

The toolbox is designed to be agnostic to the source of peak-picked profiling datasets, currently supporting the outputs of XCMS (Tautenhahn et al [1]), Bruker Metaboscape, and Progenesis QI, and is simply expandable to data from other peak-pickers. Current best practices in quality control of profiling LC-MS data (Want et al [2], Dunn et al [3], Lewis et al [4]) are applied, including utilising repeated injections of Study Reference samples to calculate the analytical precision of each feature measurement (Relative Standard Deviation), and a serial dilution of the reference sample to assess the linearity of response (Correlation to Dilution); for full details see Feature Summary Report: LC-MS Datasets.
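These two metrics follow standard definitions, which can be sketched with plain numpy (toy values; a sketch of the idea, not the toolbox's exact implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

# Intensities of one feature across 6 repeated Study Reference injections
reference = rng.normal(loc=1000.0, scale=50.0, size=6)

# Percentage relative standard deviation: analytical precision of the feature
rsd = np.std(reference, ddof=1) / np.mean(reference) * 100

# Intensities of the same feature across a serial dilution of the reference
dilution = np.array([1, 20, 40, 60, 80, 100], dtype=float)
response = dilution * 10 + rng.normal(scale=5.0, size=6)

# Pearson correlation to dilution: linearity of response
corrToDilution = np.corrcoef(dilution, response)[0, 1]
```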

Study Reference samples are also used (in conjunction with Long-Term Reference samples if available) to assess and correct trends in batch and run-order (Batch & Run-Order Correction). Additionally, both RSD and correlation to dilution are used to filter features to retain only those measured with a high precision and accuracy (Sample and Feature Masks).

NMR Datasets

The nPYc-Toolbox supports input of processed Bruker GmbH format 1D experiments. Upon import, each spectrum’s chemical shift axis is calibrated to a reference peak (Pearce et al [5]), and all spectra interpolated onto a common scale, with full parameters as per the NMRDataset Objects configuration SOPs. The toolbox supports automated calculation of the quality control metrics described previously (Dona et al [6]), including assessments of line-width, water suppression quality, and baseline stability, for full details see Feature Summary Report: NMR Datasets.
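The interpolation onto a common chemical shift scale can be sketched with numpy's one-dimensional interpolation (hypothetical axis sizes and a single synthetic peak; the toolbox's own procedure is configured by the SOP parameters):

```python
import numpy as np

# Two spectra acquired on slightly different chemical shift axes (hypothetical)
ppmA = np.linspace(-1, 10, 1100)
ppmB = np.linspace(-1, 10, 1300)
specA = np.exp(-((ppmA - 4.7) ** 2) / 0.01)  # synthetic peak at 4.7 ppm
specB = np.exp(-((ppmB - 4.7) ** 2) / 0.01)

# Common scale all spectra are interpolated onto
commonPpm = np.linspace(-1, 10, 20000)

# np.interp expects an increasing x-axis; real NMR axes run high-to-low ppm,
# so an actual implementation would reverse them first
interpA = np.interp(commonPpm, ppmA, specA)
interpB = np.interp(commonPpm, ppmB, specB)

X = np.vstack([interpA, interpB])  # all spectra now share one feature axis
```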

Targeted Datasets

The TargetedDataset represents quantitative datasets where compounds are already identified, the accuracy of the quantification can be established, units are known, and calibration curves or internal standards are employed (Lee et al [7]). It implements a set of reports and data consistency checks to assist analysts in assessing the presence of batch effects, applying limits of quantification (LOQ), standardising the linearity range over multiple batches, and determining and visualising the accuracy and precision of each measurement; for more details see Feature Summary Report: NMR Targeted Datasets.

The nPYc-Toolbox supports input of both MS-derived targeted datasets (tutorial and further documentation in progress), and two Bruker proprietary human biofluid quantification platforms (IVDr algorithms) that generate targeted outputs from the NMR profiling data, BI-LISA for quantification of Lipoproteins (blood samples only) and BIQUANT-PS and BIQUANT-UR for small molecule metabolites (for blood and urine respectively).

Dataset Specific Syntax and Parameters

The main function parameters (which may be of interest to advanced users) are as follows:

Note, the Dataset object serves as a common parent to MSDataset, TargetedDataset, and NMRDataset, and should not typically be instantiated independently.

class nPYc.objects.Dataset(sop='Generic', sopPath=None, **kwargs)

Base class for nPYc dataset objects.

Parameters:
  • sop (str) – Load configuration parameters from specified SOP JSON file
  • sopPath – By default, SOPs are loaded from the nPYc/StudyDesigns/SOP/ directory; if not None, the directory specified in sopPath= is searched before the built-in SOP directory.
featureMetadata = None

\(m\) × \(q\) pandas dataframe of feature identifiers and metadata

The featureMetadata table can include any datatype that can be placed in a pandas cell. However, the toolbox assumes certain prerequisites on the following columns in order to function:

Column       | dtype        | Usage
Feature Name | str or float | ID of the feature measured in this column. Each ‘Feature Name’ must be unique in the table. If ‘Feature Name’ is numeric, the columns should be sorted in ascending or descending order.
sampleMetadata = None

\(n\) × \(p\) dataframe of sample identifiers and metadata.

The sampleMetadata table can include any datatype that can be placed in a pandas cell. However, the toolbox assumes certain prerequisites on the following columns in order to function:

Column             | dtype             | Usage
Sample ID          | str               | ID of the sampling event generating this sample
AssayRole          | AssayRole         | Defines the role of this assay
SampleType         | SampleType        | Defines the type of sample acquired
Sample File Name   | str               | Unique file name for the analytical data
Sample Base Name   | str               | Common identifier that links analytical data to the Sample ID
Dilution           | float             | Where AssayRole is LinearityReference, the expected abundance is indicated here
Batch              | int               | Acquisition batch
Correction Batch   | int               | When detecting and correcting for batch and run-order effects, run-order effects are characterised within samples sharing the same Correction Batch, while batch effects are detected between distinct values
Acquired Time      | datetime.datetime | Date and time of acquisition of raw data
Run order          | int               | Order of sample acquisition
Exclusion Details  | str               | Details of reasoning if marked for exclusion
Metadata Available | bool              | Records which samples had metadata provided with the .addSampleInfo() method
featureMask = None

\(m\) element vector, with True representing features to be included in analysis, and False those to be excluded

sampleMask = None

\(n\) element vector, with True representing samples to be included in analysis, and False those to be excluded

AnalyticalPlatform = None

AnalyticalPlatform enum specifying the analytical platform used to acquire the data.

Attributes = None

Dictionary of object configuration attributes, including those loaded from SOP files.

Defined attributes are as follows:

Key             | dtype                     | Usage
‘dpi’           | positive int              | Raster resolution when plotting figures
‘figureSize’    | positive (float, float)   | Size to plot figures
‘figureFormat’  | str                       | Format to save figures in
‘histBins’      | positive int              | Number of bins to use when drawing histograms
‘Feature Names’ | Column in featureMetadata | ID of the primary feature name
intensityData

\(n\) × \(m\) numpy matrix of measurements

noSamples
Returns:Number of samples in the dataset (n)
Return type:int
noFeatures
Returns:Number of features in the dataset (m)
Return type:int
log

Return log entries as a string.

name

Returns or sets the name of the dataset. name must be a string

Normalisation

Normaliser object that transforms the measurements in intensityData.

validateObject(verbose=True, raiseError=False, raiseWarning=True)

Checks that all the attributes specified in the class definition are present and of the required class and/or values. Checks for the existence and type of attributes, and for the existence of mandatory columns, but does not check the column values (type or uniqueness). If ‘sampleMetadataExcluded’, ‘intensityDataExcluded’, ‘featureMetadataExcluded’ or ‘excludedFlag’ exist, their existence and the number of exclusions (based on ‘sampleMetadataExcluded’) are checked.

Parameters:
  • verbose (bool) – if True the result of each check is printed (default True)
  • raiseError (bool) – if True an error is raised when a check fails and the validation is interrupted (default False)
  • raiseWarning (bool) – if True a warning is raised when a check fails
Returns:

True if the Object conforms to basic Dataset

Return type:

bool

Raises:
  • TypeError – if the Object class is wrong
  • AttributeError – if self.Attributes does not exist
  • TypeError – if self.Attributes is not a dict
  • AttributeError – if self.Attributes[‘Log’] does not exist
  • TypeError – if self.Attributes[‘Log’] is not a list
  • AttributeError – if self.Attributes[‘dpi’] does not exist
  • TypeError – if self.Attributes[‘dpi’] is not an int
  • AttributeError – if self.Attributes[‘figureSize’] does not exist
  • TypeError – if self.Attributes[‘figureSize’] is not a list
  • ValueError – if self.Attributes[‘figureSize’] is not of length 2
  • TypeError – if self.Attributes[‘figureSize’][0] is not a int or float
  • TypeError – if self.Attributes[‘figureSize’][1] is not a int or float
  • AttributeError – if self.Attributes[‘figureFormat’] does not exist
  • TypeError – if self.Attributes[‘figureFormat’] is not a str
  • AttributeError – if self.Attributes[‘histBins’] does not exist
  • TypeError – if self.Attributes[‘histBins’] is not an int
  • AttributeError – if self.Attributes[‘noFiles’] does not exist
  • TypeError – if self.Attributes[‘noFiles’] is not an int
  • AttributeError – if self.Attributes[‘quantiles’] does not exist
  • TypeError – if self.Attributes[‘quantiles’] is not a list
  • ValueError – if self.Attributes[‘quantiles’] is not of length 2
  • TypeError – if self.Attributes[‘quantiles’][0] is not a int or float
  • TypeError – if self.Attributes[‘quantiles’][1] is not a int or float
  • AttributeError – if self.Attributes[‘sampleMetadataNotExported’] does not exist
  • TypeError – if self.Attributes[‘sampleMetadataNotExported’] is not a list
  • AttributeError – if self.Attributes[‘featureMetadataNotExported’] does not exist
  • TypeError – if self.Attributes[‘featureMetadataNotExported’] is not a list
  • AttributeError – if self.Attributes[‘analyticalMeasurements’] does not exist
  • TypeError – if self.Attributes[‘analyticalMeasurements’] is not a dict
  • AttributeError – if self.Attributes[‘excludeFromPlotting’] does not exist
  • TypeError – if self.Attributes[‘excludeFromPlotting’] is not a list
  • AttributeError – if self.VariableType does not exist
  • AttributeError – if self._Normalisation does not exist
  • TypeError – if self._Normalisation is not the Normaliser ABC
  • AttributeError – if self._name does not exist
  • TypeError – if self._name is not a str
  • AttributeError – if self._intensityData does not exist
  • TypeError – if self._intensityData is not a numpy.ndarray
  • AttributeError – if self.sampleMetadata does not exist
  • TypeError – if self.sampleMetadata is not a pandas.DataFrame
  • LookupError – if self.sampleMetadata does not have a Sample File Name column
  • LookupError – if self.sampleMetadata does not have an AssayRole column
  • LookupError – if self.sampleMetadata does not have a SampleType column
  • LookupError – if self.sampleMetadata does not have a Dilution column
  • LookupError – if self.sampleMetadata does not have a Batch column
  • LookupError – if self.sampleMetadata does not have a Correction Batch column
  • LookupError – if self.sampleMetadata does not have a Run Order column
  • LookupError – if self.sampleMetadata does not have a Sample ID column
  • LookupError – if self.sampleMetadata does not have a Sample Base Name column
  • LookupError – if self.sampleMetadata does not have an Acquired Time column
  • LookupError – if self.sampleMetadata does not have an Exclusion Details column
  • AttributeError – if self.featureMetadata does not exist
  • TypeError – if self.featureMetadata is not a pandas.DataFrame
  • LookupError – if self.featureMetadata does not have a Feature Name column
  • AttributeError – if self.sampleMask does not exist
  • TypeError – if self.sampleMask is not a numpy.ndarray
  • ValueError – if the elements of self.sampleMask are not bool
  • AttributeError – if self.featureMask does not exist
  • TypeError – if self.featureMask is not a numpy.ndarray
  • ValueError – if the elements of self.featureMask are not bool
  • AttributeError – if self.sampleMetadataExcluded does not exist
  • TypeError – if self.sampleMetadataExcluded is not a list
  • AttributeError – if self.intensityDataExcluded does not exist
  • TypeError – if self.intensityDataExcluded is not a list
  • ValueError – if self.intensityDataExcluded does not have the same number of exclusions as self.sampleMetadataExcluded
  • AttributeError – if self.featureMetadataExcluded does not exist
  • TypeError – if self.featureMetadataExcluded is not a list
  • ValueError – if self.featureMetadataExcluded does not have the same number of exclusions as self.sampleMetadataExcluded
  • AttributeError – if self.excludedFlag does not exist
  • TypeError – if self.excludedFlag is not a list
  • ValueError – if self.excludedFlag does not have the same number of exclusions as self.sampleMetadataExcluded
initialiseMasks()

Re-initialise featureMask and sampleMask to match the current dimensions of intensityData, and include all samples.

updateMasks(filterSamples=True, filterFeatures=True, sampleTypes=[SampleType.StudySample, SampleType.StudyPool, SampleType.ExternalReference, SampleType.MethodReference, SampleType.ProceduralBlank], assayRoles=[AssayRole.Assay, AssayRole.PrecisionReference, AssayRole.LinearityReference, AssayRole.Blank], **kwargs)

Update sampleMask and featureMask according to parameters.

updateMasks() sets sampleMask or featureMask to False for those items failing analytical criteria.

Note

To avoid reintroducing items excluded manually, this method only ever sets items to False; therefore, to move from more stringent criteria to a less stringent set, first reset the mask to all True using initialiseMasks().
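The False-only behaviour amounts to a logical AND between the current mask and each new criterion, sketched here on a toy mask:

```python
import numpy as np

featureMask = np.array([True, False, True, True])  # one feature already excluded

# A new criterion that, on its own, would keep the second feature
passesCriterion = np.array([True, True, True, False])

# AND-ing means an item can only ever move from True to False,
# so the manual exclusion is never reintroduced
featureMask &= passesCriterion
print(featureMask)  # [ True False  True False]
```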

Parameters:
  • filterSamples (bool) – If False don’t modify sampleMask
  • filterFeatures (bool) – If False don’t modify featureMask
  • sampleTypes (SampleType) – List of types of samples to retain
  • assayRoles (AssayRole) – List of assay roles to retain
applyMasks()

Permanently delete elements masked (those set to False) in sampleMask and featureMask, from featureMetadata, sampleMetadata, and intensityData.

addSampleInfo(descriptionFormat=None, filePath=None, filetype=None, **kwargs)

Load additional metadata and map it in to the sampleMetadata table.

Possible options:

  • ‘Basic CSV’ Joins the sampleMetadata table with the data in the csv file at filePath=, matching on the ‘Sample File Name’ column in both (see Sample Metadata).
  • ‘Filenames’ Parses sample information out of the filenames, based on the named capture groups in the regex passed in filenamespec
  • ‘Raw Data’ Extract analytical parameters from raw data files
  • ‘ISATAB’ ISATAB study designs
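The ‘Basic CSV’ join can be sketched as a left merge in pandas (toy tables; the hypothetical ‘Age’ column stands in for study metadata):

```python
import io
import pandas as pd

sampleMetadata = pd.DataFrame({'Sample File Name': ['run01', 'run02'],
                               'Run Order': [1, 2]})

# Additional metadata as it might appear in the csv file at filePath=
csvData = io.StringIO("Sample File Name,Age\nrun01,34\nrun02,57\n")
extra = pd.read_csv(csvData)

# Left join on 'Sample File Name', keeping every acquired sample
sampleMetadata = sampleMetadata.merge(extra, how='left', on='Sample File Name')
```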
Parameters:
  • descriptionFormat (str) – Format of metadata to be added
  • filePath (str) – Path to the additional data to be added
Raises:

NotImplementedError – if the descriptionFormat is not understood

addFeatureInfo(filePath=None, descriptionFormat=None, featureId=None, **kwargs)

Load additional metadata and map it in to the featureMetadata table.

Possible options:

  • ‘Reference Ranges’ JSON file specifying upper and lower reference ranges for a feature.
Parameters:
  • filePath (str) – Path to the additional data to be added
  • descriptionFormat (str) –
  • featureId (str) – Unique feature Id field in the metadata file provided to match with internal Feature Name
Raises:

NotImplementedError – if the descriptionFormat is not understood

excludeSamples(sampleList, on='Sample File Name', message='User Excluded')

Sets the sampleMask for the samples listed in sampleList to False to mask them from the dataset.

Parameters:
  • sampleList (list) – A list of sample IDs to be excluded
  • on (str) – name of the column in sampleMetadata to match sampleList against, defaults to ‘Sample File Name’
  • message (str) – append this message to the ‘Exclusion Details’ field for each sample excluded, defaults to ‘User Excluded’
Returns:

a list of IDs passed in sampleList that could not be matched against the sample IDs present

Return type:

list

excludeFeatures(featureList, on='Feature Name', message='User Excluded')

Masks the features listed in featureList from the dataset.

Parameters:
  • featureList (list) – A list of feature IDs to be excluded
  • on (str) – name of the column in featureMetadata to match featureList against, defaults to ‘Feature Name’
  • message (str) – append this message to the ‘Exclusion Details’ field for each feature excluded, defaults to ‘User Excluded’
Returns:

A list of IDs passed in featureList that could not be matched against the feature IDs present.

Return type:

list

exportDataset(destinationPath='.', saveFormat='CSV', isaDetailsDict={}, withExclusions=True, escapeDelimiters=False, filterMetadata=True)

Export the dataset object in a variety of formats for import into other software; the export is named according to the name attribute of the Dataset object.

Possible save formats include ‘CSV’ (the default) and ‘ISATAB’.

Parameters:
  • destinationPath (str) – Save data into the directory specified here
  • saveFormat (str) – File format for saved data, defaults to CSV
  • isaDetailsDict (dict) – Contains several key: value pairs required for exporting ISATAB

isaDetailsDict should have the format:

isaDetailsDict = {
    'investigation_identifier': "i1",
    'investigation_title': "Give it a title",
    'investigation_description': "Add a description",
    'investigation_submission_date': "2016-11-03",
    'investigation_public_release_date': "2016-11-03",
    'first_name': "Noureddin",
    'last_name': "Sadawi",
    'affiliation': "University",
    'study_filename': "my_ms_study",
    'study_material_type': "Serum",
    'study_identifier': "s1",
    'study_title': "Give the study a title",
    'study_description': "Add study description",
    'study_submission_date': "2016-11-03",
    'study_public_release_date': "2016-11-03",
    'assay_filename': "my_ms_assay"
}

Parameters:
  • withExclusions (bool) – If True, masked features and samples will be excluded
  • escapeDelimiters (bool) – If True remove characters commonly used as delimiters in csv files from metadata
  • filterMetadata (bool) – If True does not export the sampleMetadata and featureMetadata columns listed in self.Attributes[‘sampleMetadataNotExported’] and self.Attributes[‘featureMetadataNotExported’]
Raises:

ValueError – if saveFormat is not understood

getFeatures(featureIDs, by=None, useMasks=True)

Get a feature or list of features by name or ranges.

If VariableType is Discrete, getFeatures() expects either a single value or a list of values, and matching features are returned. If VariableType is Spectral, pass either a single (min, max) tuple or a list of them; the features returned will be a slice of the combined ranges. If the ranges passed overlap, their union will be returned.
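Taking the union of overlapping spectral ranges can be sketched by OR-ing boolean masks over the feature axis (toy axis values, not toolbox code):

```python
import numpy as np

ppm = np.linspace(0, 10, 11)  # toy feature axis: 0, 1, ..., 10

ranges = [(1.0, 3.0), (2.5, 5.0)]  # overlapping (min, max) ranges

# A feature is selected if it falls inside ANY of the ranges (their union)
mask = np.zeros(ppm.shape, dtype=bool)
for lo, hi in ranges:
    mask |= (ppm >= lo) & (ppm <= hi)

selected = ppm[mask]
print(selected)  # [1. 2. 3. 4. 5.]
```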

Parameters:
  • featureIDs – A single or list of feature IDs to return
  • by (None or str) – Column in featureMetadata to search in, None use the column defined in Attributes[‘Feature Names’]
Returns:

(featureMetadata, intensityData)

Return type:

(pandas.Dataframe, numpy.ndarray)

class nPYc.objects.MSDataset(datapath, fileType='QI', sop='GenericMS', **kwargs)

MSDataset extends Dataset to represent both peak-picked LC- or DI-MS datasets (discrete variables), and Continuum mode (spectral) DI-MS datasets.

Objects can be initialised from a variety of common data formats, currently peak-picked data from Progenesis QI or XCMS, and targeted Biocrates datasets.

  • Progenesis QI
    QI import operates on csv files exported via the ‘Export Compound Measurements’ menu option in QI. Import requires the presence of both the normalised and raw datasets, but only the raw measurements are imported.
  • XCMS
    XCMS import operates on the csv files generated by XCMS with the peakTable() method. By default, the csv is expected to have 14 columns of feature parameters, with the intensity values for the first sample in the 15th column. However, the number of columns to skip is dataset dependent and can be set with the noFeatureParams= keyword argument.
  • Biocrates
    Operates on spreadsheets exported from Biocrates MetIDQ. By default, data is loaded from the sheet named ‘Data Export’; this may be overridden with the sheetName= argument. If the number of sample metadata columns differs from the default, this can be overridden with the noSampleParams= argument.
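Splitting feature parameters from intensity columns, as the noFeatureParams= argument controls, can be sketched in pandas (a toy two-feature, two-sample peak table, not a real XCMS export):

```python
import io
import pandas as pd

# A toy peak table: 2 feature-parameter columns, then one column per sample
csv = io.StringIO("mz,rt,sample1,sample2\n"
                  "100.1,62.5,1200,1150\n"
                  "250.2,300.1,800,820\n")
peakTable = pd.read_csv(csv)

noFeatureParams = 2  # number of leading feature-parameter columns to skip

featureMetadata = peakTable.iloc[:, :noFeatureParams]
# Transpose so rows are samples and columns are features
intensityData = peakTable.iloc[:, noFeatureParams:].to_numpy().T
```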
correlationToDilution

Returns the correlation of features to dilution as calculated on samples marked as ‘Dilution Series’ in sampleMetadata, with dilution expressed in ‘Dilution’.

Returns:Vector of feature correlations to dilution
Return type:numpy.ndarray
artifactualLinkageMatrix

Gets overlapping artifactual features.

rsdSP

Returns percentage relative standard deviations for each feature in the dataset, calculated on samples with the Assay Role PrecisionReference and Sample Type StudyPool in sampleMetadata.

Returns:Vector of feature RSDs
Return type:numpy.ndarray
rsdSS

Returns percentage relative standard deviations for each feature in the dataset, calculated on samples with the Assay Role Assay and Sample Type StudySample in sampleMetadata.

Returns:Vector of feature RSDs
Return type:numpy.ndarray
applyMasks()

Permanently delete elements masked (those set to False) in sampleMask and featureMask, from featureMetadata, sampleMetadata, and intensityData.

Resets feature linkage matrix and feature correlations.

updateMasks(filterSamples=True, filterFeatures=True, sampleTypes=[SampleType.StudySample, SampleType.StudyPool, SampleType.ExternalReference, SampleType.MethodReference, SampleType.ProceduralBlank], assayRoles=[AssayRole.Assay, AssayRole.PrecisionReference, AssayRole.LinearityReference, AssayRole.Blank], featureFilters={'artifactualFilter': False, 'blankFilter': False, 'correlationToDilutionFilter': True, 'rsdFilter': True, 'varianceRatioFilter': True}, **kwargs)

Update sampleMask and featureMask according to QC parameters.

updateMasks() sets sampleMask or featureMask to False for those items failing analytical criteria.

Note

To avoid reintroducing items excluded manually, this method only ever sets items to False; therefore, to move from more stringent criteria to a less stringent set, first reset the mask to all True using initialiseMasks().

Parameters:
  • filterSamples (bool) – If False don’t modify sampleMask
  • filterFeatures (bool) – If False don’t modify featureMask
  • sampleTypes (SampleType) – List of types of samples to retain
  • assayRoles (AssayRole) – List of assays roles to retain
  • correlationThreshold (None or float) – Mask features with a correlation below this value. If None, use the value from Attributes[‘corrThreshold’]
  • rsdThreshold (None or float) – Mask features with an RSD above this value. If None, use the value from Attributes[‘rsdThreshold’]
  • varianceRatio (None or float) – Mask features where the RSD measured in study samples is below that measured in study reference samples multiplied by varianceRatio
  • withArtifactualFiltering (None or bool) – If None use the value from Attributes['artifactualFilter']. If False doesn’t apply artifactual filtering. If Attributes['artifactualFilter'] is set to False artifactual filtering will not take place even if withArtifactualFiltering is set to True.
  • deltaMzArtifactual (None or float) – Maximum allowed m/z distance between two grouped features. If None, use the value from Attributes[‘deltaMzArtifactual’]
  • overlapThresholdArtifactual (None or float) – Minimum peak overlap between two grouped features. If None, use the value from Attributes[‘overlapThresholdArtifactual’]
  • corrThresholdArtifactual (None or float) – Minimum correlation between two grouped features. If None, use the value from Attributes[‘corrThresholdArtifactual’]
  • blankThreshold (None, False, or float) – Mask features whose median intensity falls below blankThreshold x the level in the blank. If False, do not filter; if None, use the cutoff from Attributes[‘blankThreshold’]; otherwise use the cutoff scaling factor provided
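The blank filter logic can be sketched with toy median intensities (standalone numpy, a sketch of the comparison rather than the toolbox's implementation):

```python
import numpy as np

# Median feature intensities in study samples and in procedural blanks (toy values)
medianSample = np.array([500.0, 40.0, 900.0])
medianBlank = np.array([100.0, 50.0, 10.0])

blankThreshold = 1.1  # cutoff scaling factor

# Keep features whose sample median exceeds blankThreshold x the blank level
passesBlankFilter = medianSample > blankThreshold * medianBlank
print(passesBlankFilter)  # [ True False  True]
```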
saveFeatureMask()

Updates featureMask and saves as ‘Passing Selection’ in self.featureMetadata

addSampleInfo(descriptionFormat=None, filePath=None, filenameSpec=None, filetype='Waters .raw', **kwargs)

Load additional metadata and map it in to the sampleMetadata table.

Possible options:

  • ‘NPC LIMS’ NPC LIMS files mapping files names of raw analytical data to sample IDs
  • ‘NPC Subject Info’ Map subject metadata from a NPC sample manifest file (format defined in ‘PCSOP.082’)
  • ‘Raw Data’ Extract analytical parameters from raw data files
  • ‘ISATAB’ ISATAB study designs
  • ‘Filenames’ Parses sample information out of the filenames, based on the named capture groups in the regex passed in filenamespec
  • ‘Basic CSV’ Joins the sampleMetadata table with the data in the csv file at filePath=, matching on the ‘Sample File Name’ column in both.
Parameters:
  • descriptionFormat (str) – Format of metadata to be added
  • filePath (str) – Path to the additional data to be added
  • filenameSpec (None or str) – Only used if descriptionFormat is ‘Filenames’. A regular expression that extracts sample-type information into the following named capture groups: ‘fileName’, ‘baseName’, ‘study’, ‘chromatography’, ‘ionisation’, ‘instrument’, ‘groupingKind’, ‘groupingNo’, ‘injectionKind’, ‘injectionNo’, ‘reference’, ‘exclusion’, ‘reruns’, ‘extraInjections’, ‘exclusion2’. If None is passed, use the filenameSpec key in Attributes, loaded from the SOP JSON
Raises:

NotImplementedError – if the descriptionFormat is not understood

amendBatches(sampleRunOrder)

Creates a new batch starting at the sample index given in sampleRunOrder, and increments subsequent batch numbers in sampleMetadata[‘Correction Batch’]

Parameters:sampleRunOrder (int) – Index of first sample in new batch
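The renumbering performed by amendBatches() can be sketched on a toy ‘Correction Batch’ vector (ordered by run order):

```python
import numpy as np

correctionBatch = np.array([1, 1, 1, 1, 2, 2])  # batch per sample, in run order
newBatchStart = 2  # run-order index where the new batch begins

# Samples from the split point onwards move into a new batch;
# all later batch numbers shift up by one
correctionBatch[newBatchStart:] += 1
print(correctionBatch)  # [1 1 2 2 3 3]
```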
artifactualFilter(featMask=None)

Filter artifactual features, on top of the featureMask already present if none is given as input. Of each group of artifactual features, keep the feature with the highest intensity in the mean spectrum.

Parameters:featMask (numpy.ndarray or None) – A featureMask (True for inclusion), if None, use featureMask
Returns:Amended featureMask
Return type:numpy.ndarray
excludeFeatures(featureList, on='Feature Name', message='User Excluded')

Masks the features listed in featureList from the dataset.

Parameters:
  • featureList (list) – A list of feature IDs to be excluded
  • on (str) – name of the column in featureMetadata to match featureList against, defaults to ‘Feature Name’
  • message (str) – append this message to the ‘Exclusion Details’ field for each feature excluded, defaults to ‘User Excluded’
Returns:

A list of IDs passed in featureList that could not be matched against the feature IDs present.

Return type:

list

initialiseMasks()

Re-initialise featureMask and sampleMask to match the current dimensions of intensityData, and include all samples.

validateObject(verbose=True, raiseError=False, raiseWarning=True)

Checks that all the attributes specified in the class definition are present and of the required class and/or values.

Returns four booleans, ordered by inclusivity: is the object a Dataset, is it a basic MSDataset, does it have the parameters for QC, and does it have sample metadata.

To employ all class methods, the most inclusive (has the object sample metadata) must be successful:

  • ‘Basic MSDataset’ checks Dataset types and uniqueness as well as additional attributes.
  • ‘has parameters for QC’ is ‘Basic MSDataset’ + sampleMetadata[[‘SampleType’, ‘AssayRole’, ‘Dilution’, ‘Run Order’, ‘Batch’, ‘Correction Batch’, ‘Sample Base Name’]]
  • ‘has sample metadata’ is ‘has parameters for QC’ + sampleMetadata[[‘Sample ID’, ‘Subject ID’, ‘Matrix’]]

Column types in the pandas.DataFrames are established from the first sample when necessary. Does not check for uniqueness in sampleMetadata['Sample File Name']. Does not currently check the Attributes['Raw Data Path'] type. Does not currently check the corrExclusions type.

Parameters:
  • verbose (bool) – if True the result of each check is printed (default True)
  • raiseError (bool) – if True an error is raised when a check fails and the validation is interrupted (default False)
  • raiseWarning (bool) – if True a warning is raised when a check fails
Returns:

A dictionary of 4 boolean with True if the Object conforms to the corresponding test. ‘Dataset’ conforms to Dataset, ‘BasicMSDataset’ conforms to Dataset + basic MSDataset, ‘QC’ BasicMSDataset + object has QC parameters, ‘sampleMetadata’ QC + object has sample metadata information

Return type:

dict

Raises:
  • TypeError – if the Object class is wrong
  • AttributeError – if self.Attributes[‘rtWindow’] does not exist
  • TypeError – if self.Attributes[‘rtWindow’] is not an int or float
  • AttributeError – if self.Attributes[‘msPrecision’] does not exist
  • TypeError – if self.Attributes[‘msPrecision’] is not an int or float
  • AttributeError – if self.Attributes[‘varianceRatio’] does not exist
  • TypeError – if self.Attributes[‘varianceRatio’] is not an int or float
  • AttributeError – if self.Attributes[‘blankThreshold’] does not exist
  • TypeError – if self.Attributes[‘blankThreshold’] is not an int or float
  • AttributeError – if self.Attributes[‘corrMethod’] does not exist
  • TypeError – if self.Attributes[‘corrMethod’] is not a str
  • AttributeError – if self.Attributes[‘corrThreshold’] does not exist
  • TypeError – if self.Attributes[‘corrThreshold’] is not an int or float
  • AttributeError – if self.Attributes[‘rsdThreshold’] does not exist
  • TypeError – if self.Attributes[‘rsdThreshold’] is not an int or float
  • AttributeError – if self.Attributes[‘artifactualFilter’] does not exist
  • TypeError – if self.Attributes[‘artifactualFilter’] is not a bool
  • AttributeError – if self.Attributes[‘deltaMzArtifactual’] does not exist
  • TypeError – if self.Attributes[‘deltaMzArtifactual’] is not an int or float
  • AttributeError – if self.Attributes[‘overlapThresholdArtifactual’] does not exist
  • TypeError – if self.Attributes[‘overlapThresholdArtifactual’] is not an int or float
  • AttributeError – if self.Attributes[‘corrThresholdArtifactual’] does not exist
  • TypeError – if self.Attributes[‘corrThresholdArtifactual’] is not an int or float
  • AttributeError – if self.Attributes[‘FeatureExtractionSoftware’] does not exist
  • TypeError – if self.Attributes[‘FeatureExtractionSoftware’] is not a str
  • AttributeError – if self.Attributes[‘Raw Data Path’] does not exist
  • TypeError – if self.Attributes[‘Raw Data Path’] is not a str
  • AttributeError – if self.Attributes[‘Feature Names’] does not exist
  • TypeError – if self.Attributes[‘Feature Names’] is not a str
  • TypeError – if self.VariableType is not an enum ‘VariableType’
  • AttributeError – if self.corrExclusions does not exist
  • AttributeError – if self._correlationToDilution does not exist
  • TypeError – if self._correlationToDilution is not a numpy.ndarray
  • AttributeError – if self._artifactualLinkageMatrix does not exist
  • TypeError – if self._artifactualLinkageMatrix is not a pandas.DataFrame
  • AttributeError – if self._tempArtifactualLinkageMatrix does not exist
  • TypeError – if self._tempArtifactualLinkageMatrix is not a pandas.DataFrame
  • AttributeError – if self.fileName does not exist
  • TypeError – if self.fileName is not a str
  • AttributeError – if self.filePath does not exist
  • TypeError – if self.filePath is not a str
  • ValueError – if self.sampleMetadata does not have the same number of samples as self._intensityData
  • TypeError – if self.sampleMetadata[‘Sample File Name’] is not str
  • TypeError – if self.sampleMetadata[‘AssayRole’] is not an enum ‘AssayRole’
  • TypeError – if self.sampleMetadata[‘SampleType’] is not an enum ‘SampleType’
  • TypeError – if self.sampleMetadata[‘Dilution’] is not an int or float
  • TypeError – if self.sampleMetadata[‘Batch’] is not an int or float
  • TypeError – if self.sampleMetadata[‘Correction Batch’] is not an int or float
  • TypeError – if self.sampleMetadata[‘Run Order’] is not an int
  • TypeError – if self.sampleMetadata[‘Acquired Time’] is not a datetime
  • TypeError – if self.sampleMetadata[‘Sample Base Name’] is not str
  • LookupError – if self.sampleMetadata does not have a Matrix column
  • TypeError – if self.sampleMetadata[‘Matrix’] is not a str
  • LookupError – if self.sampleMetadata does not have a Subject ID column
  • TypeError – if self.sampleMetadata[‘Subject ID’] is not a str
  • TypeError – if self.sampleMetadata[‘Sample ID’] is not a str
  • ValueError – if self.featureMetadata does not have the same number of features as self._intensityData
  • TypeError – if self.featureMetadata[‘Feature Name’] is not a str
  • ValueError – if self.featureMetadata[‘Feature Name’] is not unique
  • LookupError – if self.featureMetadata does not have a m/z column
  • TypeError – if self.featureMetadata[‘m/z’] is not an int or float
  • LookupError – if self.featureMetadata does not have a Retention Time column
  • TypeError – if self.featureMetadata[‘Retention Time’] is not an int or float
  • ValueError – if self.sampleMask has not been initialised
  • ValueError – if self.sampleMask does not have the same number of samples as self._intensityData
  • ValueError – if self.featureMask has not been initialised
  • ValueError – if self.featureMask does not have the same number of features as self._intensityData
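
Because the four checks are strictly nested, the highest level an object reaches can be read by scanning the returned dictionary in order. A minimal sketch using an illustrative result dictionary (hypothetical values, not real toolbox output):

```python
# Illustrative result, in the shape returned by validateObject()
checks = {'Dataset': True, 'BasicMSDataset': True,
          'QC': True, 'sampleMetadata': False}

# Because each level includes the previous one, the highest passing
# level is the last consecutive True in this order
levels = ['Dataset', 'BasicMSDataset', 'QC', 'sampleMetadata']
highest = None
for level in levels:
    if not checks[level]:
        break
    highest = level
```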
class nPYc.objects.NMRDataset(datapath, fileType='Bruker', sop='GenericNMRurine', pulseprogram='noesygppr1d', **kwargs)

NMRDataset extends Dataset to represent both spectral and peak-picked NMR datasets.

Objects can be initialised from a variety of common data formats, including Bruker-format raw data, and BI-LISA targeted lipoprotein analysis.

  • Bruker
    When loading Bruker format raw spectra (1r files), all directories below datapath will be scanned for valid raw data, and those spectra matching pulseprogram will be loaded and aligned onto the common scale defined in sop.
  • BI-LISA
    BI-LISA data can be read from Excel workbooks; the name of the sheet containing the data to be loaded should be passed in the pulseProgram argument. Feature descriptors will be loaded from the ‘Analytes’ sheet, and file names converted back to the ExperimentName/expno format from ExperimentName_EXPNO_expno.
Parameters:
  • fileType (str) – Type of data to be loaded
  • sheetname (str) – Load data from the specified sheet of the Excel workbook
  • pulseprogram (str) – When loading raw data, only import spectra acquired with the specified pulseprogram
addSampleInfo(descriptionFormat=None, filePath=None, filenameSpec=None, **kwargs)

Load additional metadata and map it into the sampleMetadata table.

Possible options:

  • ‘NPC LIMS’ NPC LIMS files mapping file names of raw analytical data to sample IDs
  • ‘NPC Subject Info’ Map subject metadata from a NPC sample manifest file (format defined in ‘PCSOP.082’)
  • ‘Raw Data’ Extract analytical parameters from raw data files
  • ‘ISATAB’ ISATAB study designs
  • ‘Filenames’ Parses sample information out of the filenames, based on the named capture groups in the regex passed in filenameSpec
  • ‘Basic CSV’ Joins the sampleMetadata table with the data in the csv file at filePath, matching on the ‘Sample File Name’ column in both.
Parameters:
  • descriptionFormat (str) – Format of metadata to be added
  • filePath (str) – Path to the additional data to be added
  • filenameSpec (None or str) – Only used if descriptionFormat is ‘Filenames’. A regular expression that extracts sample-type information into the following named capture groups: ‘fileName’, ‘baseName’, ‘study’, ‘chromatography’, ‘ionisation’, ‘instrument’, ‘groupingKind’, ‘groupingNo’, ‘injectionKind’, ‘injectionNo’, ‘reference’, ‘exclusion’, ‘reruns’, ‘extraInjections’, ‘exclusion2’. If None is passed, the filenameSpec key in Attributes, loaded from the SOP json, is used
Raises:

NotImplementedError – if the descriptionFormat is not understood
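
The ‘Filenames’ option relies on Python named capture groups. A minimal illustration with a simplified, hypothetical pattern (the real SOP patterns define many more groups, such as ‘chromatography’ and ‘ionisation’):

```python
import re

# Hypothetical filenameSpec capturing just two of the documented groups
filenameSpec = r'(?P<study>\w+?)_(?P<injectionNo>\d+)'

match = re.match(filenameSpec, 'DEVSET_42')
study = match.group('study')              # 'DEVSET'
injectionNo = match.group('injectionNo')  # '42'
```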

updateMasks(filterSamples=True, filterFeatures=True, sampleTypes=[<SampleType.StudySample>, <SampleType.StudyPool>, <SampleType.ExternalReference>, <SampleType.MethodReference>, <SampleType.ProceduralBlank>], assayRoles=[<AssayRole.Assay>, <AssayRole.PrecisionReference>, <AssayRole.LinearityReference>, <AssayRole.Blank>], exclusionRegions=None, sampleQCChecks=[], **kwargs)

Update sampleMask and featureMask according to parameters.

updateMasks() sets sampleMask or featureMask to False for those items failing analytical criteria.

Note

To avoid reintroducing items manually excluded, this method only ever sets items to False, therefore if you wish to move from more stringent criteria to a less stringent set, you will need to reset the mask to all True using initialiseMasks().

Parameters:
  • filterSamples (bool) – If False don’t modify sampleMask
  • filterFeatures (bool) – If False don’t modify featureMask
  • sampleTypes (SampleType) – List of types of samples to retain
  • assayRoles (AssayRole) – List of assay roles to retain
  • exclusionRegions (list of tuple) – If None, exclude the regions defined in Attributes[‘exclusionRegions’]
  • sampleQCChecks (list) – Which quality control metrics to use.
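
The effect of exclusionRegions on featureMask can be sketched with numpy (illustrative logic only; the toolbox applies this internally). Here a hypothetical region of (4.7, 4.9) ppm is excluded:

```python
import numpy

# ppm positions of five hypothetical spectral points, mask initially all True
ppm = numpy.array([4.5, 4.7, 4.8, 4.9, 5.1])
featureMask = numpy.ones_like(ppm, dtype=bool)

# Masks are only ever set to False, so exclusions combine with any
# previous filtering instead of undoing it
for (low, high) in [(4.7, 4.9)]:
    featureMask[(ppm >= low) & (ppm <= high)] = False
```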
plot(spectra, labels, interactive=False)

Plots a set of NMR spectra. If interactive is False, returns a static matplotlib plot. If True, then plotly is used to generate an interactive plot.

Parameters:
  • spectra – The specific ‘labels’ of the spectra to plot. By default all spectra are plotted.
  • labels – Which labels to select
  • interactive – Use matplotlib (False) or plotly (True)
Returns:

Displays the NMR data and returns either a matplotlib axis object or a plotly figure dictionary

class nPYc.objects.TargetedDataset(dataPath, fileType='TargetLynx', sop='Generic', **kwargs)

TargetedDataset extends Dataset to represent quantitative datasets, where compounds are already identified, the exactitude of the quantification can be established, units are known, and calibration curves or internal standards are employed. The TargetedDataset class includes methods to apply limits of quantification (LLOQ and ULOQ), merge multiple analytical batches, and report the accuracy and precision of each measurement.

In addition to the structure of Dataset, TargetedDataset requires the following attributes:

  • expectedConcentration:

    A \(n\) × \(m\) pandas dataframe of expected concentrations (matching the intensityData dimension), with column names matching featureMetadata[‘Feature Name’]

  • calibration:

    A dictionary containing pandas dataframe describing calibration samples:

    • calibration['calibIntensityData']:
      A \(r\) × \(m\) numpy matrix of measurements. Features must match the features in intensityData
    • calibration['calibSampleMetadata']:
      A \(r\) × \(p\) pandas dataframe of calibration sample identifiers and metadata
    • calibration['calibFeatureMetadata']:
      A \(m\) × \(q\) pandas dataframe of feature identifiers and metadata
    • calibration['calibExpectedConcentration']:
      A \(r\) × \(m\) pandas dataframe of calibration samples expected concentrations
  • Attributes must contain the following (can be loaded from a method specific JSON on import):

    • methodName:
      A (str) name of the method
    • externalID:
      A list of external IDs; each external ID must also be present in Attributes as a list of identifiers (one per feature) for that external ID. For example, if externalID=['PubChem ID'], Attributes['PubChem ID']=['ID1','ID2','','ID75']
  • featureMetadata expects the following columns:
    • quantificationType:
      A QuantificationType enum specifying the exactitude of the quantification procedure employed.
    • calibrationMethod:
      A CalibrationMethod enum specifying the calibration method employed.
    • Unit:
      A (str) unit corresponding to the feature measurement value.
    • LLOQ:
      The lower limit of quantification, used to filter concentrations < LLOQ
    • ULOQ:
      The upper limit of quantification, used to filter concentrations > ULOQ
    • externalID:
      All externalIDs listed in Attributes['externalID'] must be present as their own column

Currently, targeted assay results processed using TargetLynx or Bruker quantification results can be imported. To create an import for any other form of semi-quantitative or quantitative results, the procedure is as follows:

  • Create a new fileType == 'myMethod' entry in __init__()
  • Define functions to populate all expected dataframes (using file readers, JSON,…)
  • Separate calibration samples from study samples (store in calibration). If none exist, initialise empty dataframes with the correct number of columns and column names.
  • Execute pre-processing steps if required (note: all feature values should be expressed in the unit listed in featureMetadata['Unit'])
  • Apply limits of quantification using _applyLimitsOfQuantification(). (This function does not apply limits of quantification to features marked as QuantificationType == QuantificationType.Monitored for compounds monitored for relative information.)

The resulting TargetedDataset must satisfy the criteria for BasicTargetedDataset, which can be checked with validateObject() (this lists the minimum requirements for all class methods).

  • fileType == 'TargetLynx' to import data processed using TargetLynx

    TargetLynx import operates on xml files exported via the ‘File -> Export -> XML’ TargetLynx menu option. Import requires a calibration_report.csv providing lower and upper limits of quantification (LLOQ, ULOQ) with the calibrationReportPath keyword argument.

    Targeted data measurements, as well as calibration report information, are read and mapped using pre-defined SOPs. All measurements are converted to pre-defined units, and measurements below the lower limits of quantification or above the upper limits of quantification are replaced. Once the import is finished, only analysed samples are returned (no calibration samples), and only features mapped onto the pre-defined SOP and sufficiently described.

    Instructions to create new TargetLynx SOPs can be found on the generation of targeted SOPs page.

    Example: TargetedDataset(datapath, fileType='TargetLynx', sop='OxylipinMS', calibrationReportPath=calibrationReportPath, sampleTypeToProcess=['Study Sample','QC'], noiseFilled=False, onlyLLOQ=False, responseReference=None)

    • sop

      Currently implemented are ‘OxylipinMS’ and ‘AminoAcidMS’

      AminoAcidMS: Gray N. et al. High-Speed Quantitative UPLC-MS Analysis of Multiple Amines in Human Plasma and Serum via Precolumn Derivatization with 6-Aminoquinolyl-N-hydroxysuccinimidyl Carbamate: Application to Acetaminophen-Induced Liver Failure. Analytical Chemistry, 2017, 89, 2478−87.

      OxylipinMS: Wolfer AM. et al. Development and Validation of a High-Throughput Ultrahigh-Performance Liquid Chromatography-Mass Spectrometry Approach for Screening of Oxylipins and Their Precursors. Analytical Chemistry, 2015, 87 (23),11721–31

    • calibrationReportPath

      Path to the calibration report csv following the provided report template.

      The following columns are required (leave an empty value to reject a compound):

      • Compound
        The compound name, identical to the one employed in the SOP json file.
      • TargetLynx ID
        The compound TargetLynx ID, identical to the one employed in the SOP json file.
      • LLOQ
        Lowest limit of quantification concentration, in the same unit as indicated in TargetLynx.
      • ULOQ
        Upper limit of quantification concentration, in the same unit as indicated in TargetLynx.

      The following columns are expected by _targetLynxApplyLimitsOfQuantificationNoiseFilled():

      • Noise (area)
        Area integrated in a blank sample at the same retention time as the compound of interest (if left empty noise concentration calculation cannot take place).
      • a
        \(a\) coefficient in the calibration equation (if left empty noise concentration calculation cannot take place).
      • b
        \(b\) coefficient in the calibration equation (if left empty noise concentration calculation cannot take place).

      The following columns are recommended but not expected:

      • Cpd Info
        Additional information relating to the compound (can be left empty).
      • r
        \(r\) goodness of fit measure for the calibration equation (can be left empty).
      • r2
        \(r^2\) goodness of fit measure for the calibration equation (can be left empty).
    • sampleTypeToProcess

      List of the sample types to process, as defined in MassLynx: any of ‘Study Sample’, ‘Blank’, ‘QC’ or ‘Other’. Only samples in sampleTypeToProcess are returned. Calibrants should not be processed and are not returned. Most uses should only require ‘Study Sample’, as quality controls are identified based on sample names by subsequent functions. Default value is [‘Study Sample’, ‘QC’].

    • noiseFilled

      If True, values <LLOQ will be replaced by a concentration equivalent to the noise level in a blank. If False, values <LLOQ are replaced by \(-\infty\). Default value is False.

    • onlyLLOQ

      If True, only correct values <LLOQ; if False, correct both <LLOQ and >ULOQ. Default value is False.

    • responseReference

      If noiseFilled=True, the noise concentration needs to be calculated. Provide the ‘Sample File Name’ of a reference sample used to establish the response, or a list of samples to use (one per feature). If None, the middle of the calibration will be employed. Default value is None.

    • keepPeakInfo

      If keepPeakInfo=True (default False) adds the peakInfo dictionary to the calibration. peakInfo contains the peakResponse, peakArea, peakConcentrationDeviation, peakIntegrationFlag and peakRT.

    • keepExcluded

      If keepExcluded=True (default False), import exclusions (excludedImportSampleMetadata, excludedImportFeatureMetadata, excludedImportIntensityData and excludedImportExpectedConcentration) are kept in the object.

    • keepIS

      If keepIS=True (default False), features marked as Internal Standards (IS) are retained.

  • fileType = 'Bruker Quantification' to import Bruker quantification results

    • nmrRawDataPath
      Path to the parent folder where all result files are stored. All subfolders will be parsed and the .xml results files matching the fileNamePattern imported.
    • fileNamePattern
      Regex to recognise the result data xml files
    • pdata
      To select the right pdata folders (default 1)

    Two forms of Bruker quantification results are supported, selected using the sop option: ‘BrukerQuant-UR’ and ‘BrukerBI-LISA’

    • sop = 'BrukerQuant-UR'

      Example: TargetedDataset(nmrRawDataPath, fileType='Bruker Quantification', sop='BrukerQuant-UR', fileNamePattern='.*?urine_quant_report_b\.xml$', unit='mmol/mol Crea')

      • unit
        If features are duplicated with different units, unit limits the import to features matching said unit. (In case of duplication and no unit, all available units will be listed)
    • sop = 'BrukerBI-LISA'

      Example: TargetedDataset(nmrRawDataPath, fileType='Bruker Quantification', sop='BrukerBI-LISA', fileNamePattern='.*?results\.xml$')

rsdSP

Returns percentage relative standard deviations for each feature in the dataset, calculated on samples with the Assay Role PrecisionReference and Sample Type StudyPool in sampleMetadata. Implemented as a back-up to accuracyPrecision() when no expected concentrations are known

Returns:Vector of feature RSDs
Return type:numpy.ndarray
rsdSS

Returns percentage relative standard deviations for each feature in the dataset, calculated on samples with the Assay Role Assay and Sample Type StudySample in sampleMetadata.

Returns:Vector of feature RSDs
Return type:numpy.ndarray
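
Percentage RSD is the per-feature standard deviation divided by the mean, multiplied by 100. A sketch of the calculation with numpy (the toolbox additionally restricts it to the samples with the relevant SampleType and AssayRole):

```python
import numpy

# Hypothetical intensities: three precision-reference samples x two features
intensityData = numpy.array([[10.0, 100.0],
                             [12.0, 100.0],
                             [11.0, 100.0]])

# Percent relative standard deviation per feature (ddof=1: sample std)
rsd = numpy.std(intensityData, axis=0, ddof=1) \
      / numpy.mean(intensityData, axis=0) * 100
```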
mergeLimitsOfQuantification(keepBatchLOQ=False, onlyLLOQ=False)

Update limits of quantification and apply LLOQ/ULOQ using the lowest common denominator across all batches (after a __add__()): keep the highest LLOQ and the lowest ULOQ.

Parameters:
  • keepBatchLOQ (bool) – If True do not remove each batch LOQ (featureMetadata['LLOQ_batchX'], featureMetadata['ULOQ_batchX'])
  • onlyLLOQ (bool) – if True only correct <LLOQ, if False correct <LLOQ and >ULOQ
Raises:
  • ValueError – if targetedData does not satisfy the BasicTargetedDataset definition on input
  • ValueError – if the number of batches, LLOQ_batchX and ULOQ_batchX do not match
  • ValueError – if targetedData does not satisfy the BasicTargetedDataset definition after LOQ merging
  • Warning – if featureMetadata['LLOQ'] or featureMetadata['ULOQ'] already exist and will be overwritten.
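
The merge rule (highest LLOQ, lowest ULOQ across batches) can be sketched with pandas, using hypothetical per-batch columns of the kind left in featureMetadata after __add__():

```python
import pandas

# Hypothetical per-batch limits of quantification for two features
featureMetadata = pandas.DataFrame({
    'Feature Name': ['compoundA', 'compoundB'],
    'LLOQ_batch1': [0.5, 1.0], 'LLOQ_batch2': [0.8, 0.9],
    'ULOQ_batch1': [100.0, 80.0], 'ULOQ_batch2': [90.0, 85.0]})

# Lowest common denominator: keep the highest LLOQ and the lowest ULOQ
featureMetadata['LLOQ'] = featureMetadata[['LLOQ_batch1', 'LLOQ_batch2']].max(axis=1)
featureMetadata['ULOQ'] = featureMetadata[['ULOQ_batch1', 'ULOQ_batch2']].min(axis=1)
```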
exportDataset(destinationPath='.', saveFormat='CSV', withExclusions=True, escapeDelimiters=False, filterMetadata=True)

Calls exportDataset() and raises a warning if normalisation is employed, as TargetedDataset intensityData can be left-censored.

validateObject(verbose=True, raiseError=False, raiseWarning=True)

Checks that all the attributes specified in the class definition are present and of the required class and/or values.

Returns four booleans, corresponding to increasingly strict checks: is the object a Dataset < a basic TargetedDataset < has the object parameters for QC < has the object sample metadata.

To employ all class methods, the most inclusive (has the object sample metadata) must be successful:

  • ‘Basic TargetedDataset’ checks TargetedDataset types and uniqueness as well as additional attributes.
  • ‘has parameters for QC’ is ‘Basic TargetedDataset’ + sampleMetadata[[‘SampleType’, ‘AssayRole’, ‘Dilution’, ‘Run Order’, ‘Batch’, ‘Correction Batch’, ‘Sample Base Name’]]
  • ‘has sample metadata’ is ‘has parameters for QC’ + sampleMetadata[[‘Sample ID’, ‘Subject ID’, ‘Matrix’]]

  • calibration['calibIntensityData'] must be initialised even if no calibration samples are present
  • calibration['calibSampleMetadata'] must be initialised even if no calibration samples are present; use: pandas.DataFrame(None, columns=self.sampleMetadata.columns.values.tolist())
  • calibration['calibFeatureMetadata'] must be initialised even if no calibration samples are present; use a copy of self.featureMetadata
  • calibration['calibExpectedConcentration'] must be initialised even if no calibration samples are present; use: pandas.DataFrame(None, columns=self.expectedConcentration.columns.values.tolist())
  • Calibration features must be identical to the usual features
  • The number of calibration samples and features must match across the 4 calibration tables
  • If ‘sampleMetadataExcluded’, ‘intensityDataExcluded’, ‘featureMetadataExcluded’, ‘expectedConcentrationExcluded’ or ‘excludedFlag’ exist, the existence and number of exclusions (based on ‘sampleMetadataExcluded’) is checked

  • Column types in the pandas.DataFrame are established on the first sample (for non int/float)
  • featureMetadata is searched for column names containing ‘LLOQ’ & ‘ULOQ’ to allow for ‘LLOQ_batch…’ after __add__(); the first matching column is then checked for dtype
  • If datasets are merged, calibration is a list of dict, and the number of features is only kept constant inside each dict
  • Does not check for uniqueness in sampleMetadata['Sample File Name']
  • Does not check columns inside calibration['calibSampleMetadata']
  • Does not check columns inside calibration['calibFeatureMetadata']
  • Does not currently check for Attributes['Feature Name']

Parameters:
  • verbose (bool) – if True the result of each check is printed (default True)
  • raiseError (bool) – if True an error is raised when a check fails and the validation is interrupted (default False)
  • raiseWarning (bool) – if True a warning is raised when a check fails
Returns:

A dictionary of four booleans, each True if the object passes the corresponding test: ‘Dataset’ conforms to Dataset; ‘BasicTargetedDataset’ conforms to Dataset plus the basic TargetedDataset checks; ‘QC’ additionally requires the QC parameters; ‘sampleMetadata’ additionally requires sample metadata information

Return type:

dict

Raises:
  • TypeError – if the Object class is wrong
  • AttributeError – if self.Attributes[‘methodName’] does not exist
  • TypeError – if self.Attributes[‘methodName’] is not a str
  • AttributeError – if self.Attributes[‘externalID’] does not exist
  • TypeError – if self.Attributes[‘externalID’] is not a list
  • TypeError – if self.VariableType is not an enum ‘VariableType’
  • AttributeError – if self.fileName does not exist
  • TypeError – if self.fileName is not a str or list
  • AttributeError – if self.filePath does not exist
  • TypeError – if self.filePath is not a str or list
  • ValueError – if self.sampleMetadata does not have the same number of samples as self._intensityData
  • TypeError – if self.sampleMetadata[‘Sample File Name’] is not str
  • TypeError – if self.sampleMetadata[‘AssayRole’] is not an enum ‘AssayRole’
  • TypeError – if self.sampleMetadata[‘SampleType’] is not an enum ‘SampleType’
  • TypeError – if self.sampleMetadata[‘Dilution’] is not an int or float
  • TypeError – if self.sampleMetadata[‘Batch’] is not an int or float
  • TypeError – if self.sampleMetadata[‘Correction Batch’] is not an int or float
  • TypeError – if self.sampleMetadata[‘Run Order’] is not an int
  • TypeError – if self.sampleMetadata[‘Acquired Time’] is not a datetime
  • TypeError – if self.sampleMetadata[‘Sample Base Name’] is not str
  • LookupError – if self.sampleMetadata does not have a Subject ID column
  • TypeError – if self.sampleMetadata[‘Subject ID’] is not a str
  • TypeError – if self.sampleMetadata[‘Sample ID’] is not a str
  • ValueError – if self.featureMetadata does not have the same number of features as self._intensityData
  • TypeError – if self.featureMetadata[‘Feature Name’] is not a str
  • ValueError – if self.featureMetadata[‘Feature Name’] is not unique
  • LookupError – if self.featureMetadata does not have a calibrationMethod column
  • TypeError – if self.featureMetadata[‘calibrationMethod’] is not an enum ‘CalibrationMethod’
  • LookupError – if self.featureMetadata does not have a quantificationType column
  • TypeError – if self.featureMetadata[‘quantificationType’] is not an enum ‘QuantificationType’
  • LookupError – if self.featureMetadata does not have a Unit column
  • TypeError – if self.featureMetadata[‘Unit’] is not a str
  • LookupError – if self.featureMetadata does not have a LLOQ or similar column
  • TypeError – if self.featureMetadata[‘LLOQ’] or similar is not an int or float
  • LookupError – if self.featureMetadata does not have a ULOQ or similar column
  • TypeError – if self.featureMetadata[‘ULOQ’] or similar is not an int or float
  • LookupError – if self.featureMetadata does not have the ‘externalID’ as columns
  • AttributeError – if self.expectedConcentration does not exist
  • TypeError – if self.expectedConcentration is not a pandas.DataFrame
  • ValueError – if self.expectedConcentration does not have the same number of samples as self._intensityData
  • ValueError – if self.expectedConcentration does not have the same number of features as self._intensityData
  • ValueError – if self.expectedConcentration column names do not match self.featureMetadata[‘Feature Name’]
  • ValueError – if self.sampleMask is not initialised
  • ValueError – if self.sampleMask does not have the same number of samples as self._intensityData
  • ValueError – if self.featureMask has not been initialised
  • ValueError – if self.featureMask does not have the same number of features as self._intensityData
  • AttributeError – if self.calibration does not exist
  • TypeError – if self.calibration is not a dict
  • AttributeError – if self.calibration[‘calibIntensityData’] does not exist
  • TypeError – if self.calibration[‘calibIntensityData’] is not a numpy.ndarray
  • ValueError – if self.calibration[‘calibIntensityData’] does not have the same number of features as self._intensityData
  • AttributeError – if self.calibration[‘calibSampleMetadata’] does not exist
  • TypeError – if self.calibration[‘calibSampleMetadata’] is not a pandas.DataFrame
  • ValueError – if self.calibration[‘calibSampleMetadata’] does not have the same number of samples as self.calibration[‘calibIntensityData’]
  • AttributeError – if self.calibration[‘calibFeatureMetadata’] does not exist
  • TypeError – if self.calibration[‘calibFeatureMetadata’] is not a pandas.DataFrame
  • LookupError – if self.calibration[‘calibFeatureMetadata’] does not have a [‘Feature Name’] column
  • ValueError – if self.calibration[‘calibFeatureMetadata’] does not have the same number of features as self._intensityData
  • AttributeError – if self.calibration[‘calibExpectedConcentration’] does not exist
  • TypeError – if self.calibration[‘calibExpectedConcentration’] is not a pandas.DataFrame
  • ValueError – if self.calibration[‘calibExpectedConcentration’] does not have the same number of samples as self.calibration[‘calibIntensityData’]
  • ValueError – if self.calibration[‘calibExpectedConcentration’] does not have the same number of features as self.calibration[‘calibIntensityData’]
  • ValueError – if self.calibration[‘calibExpectedConcentration’] column names do not match self.featureMetadata[‘Feature Name’]
applyMasks()

Permanently delete elements masked (those set to False) in sampleMask and featureMask, from featureMetadata, sampleMetadata, intensityData and expectedConcentration.

Features are excluded in each calibration based on the internal calibration['calibFeatureMetadata'] (iterating through the list of calibrations if two or more datasets have been joined with __add__()).

updateMasks(filterSamples=True, filterFeatures=True, sampleTypes=[<SampleType.StudySample>, <SampleType.StudyPool>], assayRoles=[<AssayRole.Assay>, <AssayRole.PrecisionReference>], quantificationTypes=[<QuantificationType.IS>, <QuantificationType.QuantOwnLabeledAnalogue>, <QuantificationType.QuantAltLabeledAnalogue>, <QuantificationType.QuantOther>, <QuantificationType.Monitored>], calibrationMethods=[<CalibrationMethod.backcalculatedIS>, <CalibrationMethod.noIS>, <CalibrationMethod.noCalibration>, <CalibrationMethod.otherCalibration>], rsdThreshold=None, **kwargs)

Update sampleMask and featureMask according to QC parameters.

updateMasks() sets sampleMask or featureMask to False for those items failing analytical criteria.

Similar to updateMasks(), without blankThreshold or artifactual filtering

Note

To avoid reintroducing items manually excluded, this method only ever sets items to False, therefore if you wish to move from more stringent criteria to a less stringent set, you will need to reset the mask to all True using initialiseMasks().

Parameters:
  • filterSamples (bool) – If False don’t modify sampleMask
  • filterFeatures (bool) – If False don’t modify featureMask
  • sampleTypes (SampleType) – List of types of samples to retain
  • assayRoles (AssayRole) – List of assays roles to retain
  • quantificationTypes (QuantificationType) – List of quantification types to retain
  • calibrationMethods (CalibrationMethod) – List of calibration methods to retain
Raises:
  • TypeError – if sampleTypes is not a list
  • TypeError – if sampleTypes are not a SampleType enum
  • TypeError – if assayRoles is not a list
  • TypeError – if assayRoles are not an AssayRole enum
  • TypeError – if quantificationTypes is not a list
  • TypeError – if quantificationTypes are not a QuantificationType enum
  • TypeError – if calibrationMethods is not a list
  • TypeError – if calibrationMethods are not a CalibrationMethod enum
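
The type checks above require each filter argument to be a list of members of the corresponding enum. A minimal sketch of that validation pattern using a stand-in enum (the real enums live in nPYc.enumerations):

```python
from enum import Enum

# Stand-in for nPYc.enumerations.SampleType
class SampleType(Enum):
    StudySample = 1
    StudyPool = 2

def checkSampleTypes(sampleTypes):
    # Mirrors the documented checks: a list, whose members are all enums
    if not isinstance(sampleTypes, list):
        raise TypeError('sampleTypes must be a list of SampleType enums')
    if not all(isinstance(item, SampleType) for item in sampleTypes):
        raise TypeError('sampleTypes must be SampleType enums')
    return True
```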
addSampleInfo(descriptionFormat=None, filePath=None, **kwargs)

Load additional metadata and map it into the sampleMetadata table.

Possible options:

  • ‘NPC Subject Info’ Map subject metadata from a NPC sample manifest file (format defined in ‘PCSOP.082’)
  • ‘Raw Data’ Extract analytical parameters from raw data files
  • ‘ISATAB’ ISATAB study designs
  • ‘Filenames’ Parses sample information out of the filenames, based on the named capture groups in the regex passed in filenameSpec
  • ‘Basic CSV’ Joins the sampleMetadata table with the data in the csv file at filePath, matching on the ‘Sample File Name’ column in both.
  • ‘Batches’ Interpolate batch numbers for samples between those with defined batch numbers, based on sample acquisition times
Parameters:
  • descriptionFormat (str) – Format of metadata to be added
  • filePath (str) – Path to the additional data to be added
  • filenameSpec (None or str) – Only used if descriptionFormat is ‘Filenames’. A regular expression that extracts sample-type information into the following named capture groups: ‘fileName’, ‘baseName’, ‘study’, ‘chromatography’, ‘ionisation’, ‘instrument’, ‘groupingKind’, ‘groupingNo’, ‘injectionKind’, ‘injectionNo’, ‘reference’, ‘exclusion’, ‘reruns’, ‘extraInjections’, ‘exclusion2’. If None is passed, the filenameSpec key in Attributes, loaded from the SOP json, is used
Raises:

NotImplementedError – if the descriptionFormat is not understood
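
The ‘Basic CSV’ option behaves like a join on the ‘Sample File Name’ column; the equivalent pandas logic, sketched with hypothetical file names and an added ‘Age’ column:

```python
import pandas

# Existing sampleMetadata table and additional CSV contents (hypothetical)
sampleMetadata = pandas.DataFrame({'Sample File Name': ['expt_01', 'expt_02']})
extraInfo = pandas.DataFrame({'Sample File Name': ['expt_02', 'expt_01'],
                              'Age': [57, 34]})

# Join the two tables, matching on 'Sample File Name' in both
sampleMetadata = sampleMetadata.merge(extraInfo, how='left',
                                      on='Sample File Name')
```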

accuracyPrecision(onlyPrecisionReferences=False)

Return precision (percent RSDs) and accuracy for each SampleType and each unique concentration. Statistics are grouped by SampleType, Feature and unique concentration.

Parameters:
  • dataset (TargetedDataset) – TargetedDataset object to generate the accuracy and precision for.
  • onlyPrecisionReferences (bool) – If True, only use samples with the AssayRole PrecisionReference.
Returns:

Dict of Accuracy and Precision dict for each group.

Return type:

dict(str:dict(str:pandas.DataFrame))

Raises:

TypeError – if dataset is not an instance of TargetedDataset
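The underlying statistics can be sketched with pandas: precision as the percent relative standard deviation (100 × standard deviation / mean) and accuracy as the measured mean expressed as a percentage of the expected concentration, both grouped by SampleType. The data values and the expected concentration below are illustrative, and this is a sketch of the statistic only, not the toolbox's implementation:

```python
import pandas as pd

# Toy measurements: one feature measured in triplicate in two sample types
measurements = pd.DataFrame({
    'SampleType': ['StudyPool'] * 3 + ['ExternalReference'] * 3,
    'Feature1': [100.0, 102.0, 98.0, 50.0, 55.0, 45.0],
})

grouped = measurements.groupby('SampleType')['Feature1']

# Precision: percent RSD per SampleType (sample standard deviation, ddof=1)
rsd = 100 * grouped.std() / grouped.mean()

# Accuracy: measured mean as a percentage of the expected concentration
expectedConcentration = 100.0  # illustrative nominal value for this feature
accuracy = 100 * grouped.mean() / expectedConcentration
```

In the real method these statistics are computed per feature and per unique expected concentration, and returned as nested dicts of DataFrames keyed by group.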
