Datasets¶
The nPYc-Toolbox is built around creating an object for each imported dataset. This object contains the metabolic profiling data itself, alongside all associated sample and feature metadata; various methods for generating, reporting and plotting important quality control parameters; and methods for pre-processing such as filtering poor quality features or correcting trends in batch and run-order.
The first step in creating an nPYc-Toolbox object is to import the acquired data, creating a Dataset specific to the data type:
- MSDataset: for LC-MS profiling data
- NMRDataset: for NMR profiling data
- TargetedDataset: for targeted datasets
For example, to import LC-MS data into an MSDataset object:
msData = nPYc.MSDataset('path to data')
Depending on the data type, the Dataset can be set up directly from the raw data, from common interchange formats, or from the outputs of popular data-processing tools. The supported data types are described in more detail in the data specific sections below.
When importing the data, default parameters are loaded from the Configuration Files; these range from the data-type specific (for example, the number of points to interpolate NMR data onto) to the general (for example, the format in which to save figures). These parameters are saved in the Attributes dictionary and used throughout the subsequent implementation of the pipeline.
For example, for NMR data, the nPYc-Toolbox contains two default configuration files, ‘GenericNMRurine’ and ‘GenericNMRblood’, for urine and blood datasets respectively. To import NMR spectra from urine samples, the sop parameter would therefore be:
nmrData = nPYc.NMRDataset('path to data', sop='GenericNMRurine')
A full list of the parameters for each dataset type is given in the Built-in Configuration SOPs. If different values are required, these can be modified directly in the appropriate SOP file, or set by the user by modifying the required ‘Attribute’, either at import or by direct modification later in the pipeline. For example, to set the line width threshold (LWFailThreshold) used to flag NMR spectra whose line widths do not meet this value:
# EITHER, set the required value (here 0.8) at import
nmrData = nPYc.NMRDataset(rawDataPath, pulseProgram='noesygppr1d', LWFailThreshold=0.8)
# OR, set the *Attribute* directly (after importing nmrData)
nmrData.Attributes['LWFailThreshold'] = 0.8
Dataset objects have several key attributes, including:
- sampleMetadata: A \(n\) × \(p\) pandas dataframe of sample identifiers and sample-associated metadata (each row corresponds to a row of intensityData)
- featureMetadata: A \(m\) × \(q\) pandas dataframe of feature identifiers and feature-associated metadata (each row corresponds to a column of intensityData)
- intensityData: A \(n\) × \(m\) numpy matrix of measurements, where each element is the intensity of a specific feature (column) measured in a specific sample (row)
- sampleMask: A \(n\) element numpy boolean vector, where True and False flag samples for inclusion or exclusion respectively
- featureMask: A \(m\) element numpy boolean vector, where True and False flag features for inclusion or exclusion respectively
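The relationships between these attributes can be made concrete with plain numpy and pandas (a minimal sketch of the expected shapes, not using the toolbox itself; all variable contents here are invented):

```python
import numpy as np
import pandas as pd

# Three samples (rows) by four features (columns), mirroring intensityData
intensityData = np.array([[1.0, 2.0, 3.0, 4.0],
                          [5.0, 6.0, 7.0, 8.0],
                          [9.0, 1.0, 2.0, 3.0]])

# One metadata row per sample, and one per feature
sampleMetadata = pd.DataFrame({'Sample File Name': ['s1', 's2', 's3']})
featureMetadata = pd.DataFrame({'Feature Name': ['f1', 'f2', 'f3', 'f4']})

# Boolean masks flag rows/columns for inclusion (True) or exclusion (False)
sampleMask = np.ones(intensityData.shape[0], dtype=bool)
featureMask = np.ones(intensityData.shape[1], dtype=bool)

# The dimensions are linked: n samples, m features
assert sampleMetadata.shape[0] == intensityData.shape[0]
assert featureMetadata.shape[0] == intensityData.shape[1]
```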
Once created, the number of features or samples a dataset contains can be queried by running:
dataset.noFeatures
dataset.noSamples
Or directly inspect the sample or feature metadata, and the raw measurements:
dataset.sampleMetadata
dataset.featureMetadata
dataset.intensityData
For more details on using the sample and feature masks see Sample and Feature Masks.
It is possible to add additional study design parameters or sample metadata into the Dataset using the addSampleInfo()
method (see Sample Metadata for details).
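The effect of the ‘Basic CSV’ option, for instance, can be sketched as a left join in plain pandas (a simplified illustration of the matching logic on ‘Sample File Name’, not the toolbox code itself; file names and columns are invented):

```python
import pandas as pd

# Existing sampleMetadata, as built at import
sampleMetadata = pd.DataFrame({'Sample File Name': ['run01', 'run02', 'run03']})

# Additional metadata supplied in a CSV, keyed on the same column
extraInfo = pd.DataFrame({'Sample File Name': ['run01', 'run03'],
                          'Subject ID': ['subj-A', 'subj-B']})

# Left join: every acquired sample is retained, matched rows gain metadata,
# unmatched rows receive NaN
sampleMetadata = sampleMetadata.merge(extraInfo, how='left',
                                      on='Sample File Name')
```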
For full method specific details see Installation and Tutorials.
LC-MS Datasets¶
The toolbox is designed to be agnostic to the source of peak-picked profiling datasets, currently supporting the outputs of XCMS (Tautenhahn et al [1]), Bruker Metaboscape, and Progenesis QI, but is simply expandable to data from other peak-pickers. Current best practices in quality control of profiling LC-MS data (Want et al [2], Dunn et al [3], Lewis et al [4]) are applied, including utilising repeated injections of Study Reference samples to calculate the analytical precision of each feature's measurement (Relative Standard Deviation), and a serial dilution of the reference sample to assess the linearity of response (Correlation to Dilution); for full details see Feature Summary Report: LC-MS Datasets.
Study Reference samples are also used (in conjunction with Long-Term Reference samples if available) to assess and correct trends in batch and run-order (Batch & Run-Order Correction). Additionally, both RSD and correlation to dilution are used to filter features to retain only those measured with a high precision and accuracy (Sample and Feature Masks).
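Both QC metrics are straightforward to sketch in numpy (an illustration of the underlying calculations under simplified assumptions; the toolbox's own implementation may differ in detail, and all values below are invented):

```python
import numpy as np

# Intensities of one feature across repeated Study Reference injections
referenceIntensities = np.array([102.0, 98.0, 101.0, 99.0, 100.0])

# Relative Standard Deviation, expressed as a percentage of the mean
rsd = 100 * referenceIntensities.std(ddof=1) / referenceIntensities.mean()

# Intensities of the same feature across a serial dilution series
dilutionFactors = np.array([1.0, 20.0, 40.0, 60.0, 80.0, 100.0])
dilutionIntensities = np.array([1.1, 20.5, 39.0, 61.0, 79.0, 101.0])

# Pearson correlation between expected dilution and observed intensity;
# a well-behaved feature responds linearly, giving a value close to 1
corrToDilution = np.corrcoef(dilutionFactors, dilutionIntensities)[0, 1]
```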
NMR Datasets¶
The nPYc-Toolbox supports input of processed Bruker GmbH format 1D experiments. Upon import, each spectrum’s chemical shift axis is calibrated to a reference peak (Pearce et al [5]), and all spectra are interpolated onto a common scale, with full parameters as per the NMRDataset Objects configuration SOPs. The toolbox supports automated calculation of the quality control metrics described previously (Dona et al [6]), including assessments of line-width, water suppression quality, and baseline stability; for full details see Feature Summary Report: NMR Datasets.
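The interpolation step can be sketched with numpy (a simplified illustration of resampling spectra onto a shared chemical shift axis; not the toolbox's internal code, and the axes and peak below are invented):

```python
import numpy as np

# Two spectra acquired with slightly different chemical shift (ppm) axes,
# each containing a narrow peak at 3.0 ppm
ppm1 = np.linspace(-1.0, 10.0, 1100)
ppm2 = np.linspace(-0.95, 10.05, 1024)
spectrum1 = np.exp(-((ppm1 - 3.0) ** 2) / 0.001)
spectrum2 = np.exp(-((ppm2 - 3.0) ** 2) / 0.001)

# Resample both onto a single common axis so every spectrum shares the
# same feature grid (np.interp expects ascending x-coordinates)
commonPpm = np.linspace(-1.0, 10.0, 2048)
aligned = np.vstack([np.interp(commonPpm, ppm1, spectrum1),
                     np.interp(commonPpm, ppm2, spectrum2)])
```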
Targeted Datasets¶
The TargetedDataset represents quantitative datasets where compounds are already identified, the exactitude of the quantification can be established, units are known, and calibration curves or internal standards are employed (Lee et al [7]). It implements a set of reports and data consistency checks to assist analysts in assessing the presence of batch effects, applying limits of quantification (LOQ), standardising the linearity range over multiple batches, and determining and visualising the accuracy and precision of each measurement; for more details see Feature Summary Report: NMR Targeted Datasets.
The nPYc-Toolbox supports input of both MS-derived targeted datasets (tutorial and further documentation in progress), and two Bruker proprietary human biofluid quantification platforms (IVDr algorithms) that generate targeted outputs from the NMR profiling data, BI-LISA for quantification of Lipoproteins (blood samples only) and BIQUANT-PS and BIQUANT-UR for small molecule metabolites (for blood and urine respectively).
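Applying limits of quantification can be sketched as masking measurements that fall outside the quantifiable range (an illustration of the idea only, not the toolbox's implementation; all concentrations and limits below are invented):

```python
import numpy as np

# Measured concentrations for four samples × three compounds
concentrations = np.array([[0.5, 12.0, 250.0],
                           [3.0, 45.0, 9.0],
                           [0.1, 80.0, 400.0],
                           [2.0, 5.0, 30.0]])

# Per-compound lower and upper limits of quantification
lloq = np.array([1.0, 10.0, 20.0])
uloq = np.array([100.0, 100.0, 300.0])

# Replace values outside [LLOQ, ULOQ] with NaN, flagging them as
# not reliably quantifiable
masked = np.where((concentrations >= lloq) & (concentrations <= uloq),
                  concentrations, np.nan)
```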
Dataset Specific Syntax and Parameters¶
The main function parameters (which may be of interest to advanced users) are as follows:
Note, the Dataset object serves as a common parent to MSDataset, TargetedDataset, and NMRDataset, and should not typically be instantiated independently.
- class nPYc.objects.Dataset(sop='Generic', sopPath=None, **kwargs)¶
Base class for nPYc dataset objects.
Parameters: - sop (str) – Load configuration parameters from specified SOP JSON file
- sopPath – By default SOPs are loaded from the nPYc/StudyDesigns/SOP/ directory; if not None, the directory specified in sopPath= will be searched before the builtin SOP directory.
- featureMetadata = None¶
\(m\) × \(q\) pandas dataframe of feature identifiers and metadata.
The featureMetadata table can include any datatype that can be placed in a pandas cell; however, the toolbox assumes certain prerequisites on the following columns in order to function:
- Feature Name (str or float): ID of the feature measured in this column. Each ‘Feature Name’ must be unique in the table. If ‘Feature Name’ is numeric, the columns should be sorted in ascending or descending order.
- sampleMetadata = None¶
\(n\) × \(p\) dataframe of sample identifiers and metadata.
The sampleMetadata table can include any datatype that can be placed in a pandas cell; however, the toolbox assumes certain prerequisites on the following columns in order to function:
- Sample ID (str): ID of the sampling event generating this sample
- AssayRole (AssayRole): Defines the role of this assay
- SampleType (SampleType): Defines the type of sample acquired
- Sample File Name (str): Unique file name for the analytical data
- Sample Base Name (str): Common identifier that links analytical data to the Sample ID
- Dilution (float): Where AssayRole is LinearityReference, the expected abundance is indicated here
- Batch (int): Acquisition batch
- Correction Batch (int): When detecting and correcting for batch and run-order effects, run-order effects are characterised within samples sharing the same Correction Batch, while batch effects are detected between distinct values
- Acquired Time (datetime.datetime): Date and time of acquisition of raw data
- Run order (int): Order of sample acquisition
- Exclusion Details (str): Details of reasoning if marked for exclusion
- Metadata Available (bool): Records which samples had metadata provided with the .addSampleInfo() method
- featureMask = None¶
\(m\) element vector, with True representing features to be included in analysis, and False those to be excluded
- sampleMask = None¶
\(n\) element vector, with True representing samples to be included in analysis, and False those to be excluded
- AnalyticalPlatform = None¶
VariableType enum specifying the type of data represented.
- Attributes = None¶
Dictionary of object configuration attributes, including those loaded from SOP files.
Defined attributes are as follows:
- ‘dpi’ (positive int): Raster resolution when plotting figures
- ‘figureSize’ (positive (float, float)): Size to plot figures
- ‘figureFormat’ (str): Format to save figures in
- ‘histBins’ (positive int): Number of bins to use when drawing histograms
- ‘Feature Names’ (column in featureMetadata): ID of the primary feature name
- intensityData¶
\(n\) × \(m\) numpy matrix of measurements
- noSamples¶
Returns: Number of samples in the dataset (n)
Return type: int
- noFeatures¶
Returns: Number of features in the dataset (m)
Return type: int
- log¶
Return log entries as a string.
- name¶
Returns or sets the name of the dataset; name must be a string
- Normalisation¶
Normaliser object that transforms the measurements in intensityData.
- validateObject(verbose=True, raiseError=False, raiseWarning=True)¶
Checks that all the attributes specified in the class definition are present and of the required class and/or values. Checks for attribute existence and type, and for the existence of mandatory columns, but does not check the column values (type or uniqueness). If ‘sampleMetadataExcluded’, ‘intensityDataExcluded’, ‘featureMetadataExcluded’ or ‘excludedFlag’ exist, their existence and the number of exclusions (based on ‘sampleMetadataExcluded’) are checked.
Parameters: - verbose (bool) – if True the result of each check is printed (default True)
- raiseError (bool) – if True an error is raised when a check fails and the validation is interrupted (default False)
- raiseWarning (bool) – if True a warning is raised when a check fails
Returns: True if the Object conforms to a basic Dataset
Return type: bool
Raises: - TypeError – if the Object class is wrong
- AttributeError – if self.Attributes does not exist
- TypeError – if self.Attributes is not a dict
- AttributeError – if self.Attributes[‘Log’] does not exist
- TypeError – if self.Attributes[‘Log’] is not a list
- AttributeError – if self.Attributes[‘dpi’] does not exist
- TypeError – if self.Attributes[‘dpi’] is not an int
- AttributeError – if self.Attributes[‘figureSize’] does not exist
- TypeError – if self.Attributes[‘figureSize’] is not a list
- ValueError – if self.Attributes[‘figureSize’] is not of length 2
- TypeError – if self.Attributes[‘figureSize’][0] is not an int or float
- TypeError – if self.Attributes[‘figureSize’][1] is not an int or float
- AttributeError – if self.Attributes[‘figureFormat’] does not exist
- TypeError – if self.Attributes[‘figureFormat’] is not a str
- AttributeError – if self.Attributes[‘histBins’] does not exist
- TypeError – if self.Attributes[‘histBins’] is not an int
- AttributeError – if self.Attributes[‘noFiles’] does not exist
- TypeError – if self.Attributes[‘noFiles’] is not an int
- AttributeError – if self.Attributes[‘quantiles’] does not exist
- TypeError – if self.Attributes[‘quantiles’] is not a list
- ValueError – if self.Attributes[‘quantiles’] is not of length 2
- TypeError – if self.Attributes[‘quantiles’][0] is not an int or float
- TypeError – if self.Attributes[‘quantiles’][1] is not an int or float
- AttributeError – if self.Attributes[‘sampleMetadataNotExported’] does not exist
- TypeError – if self.Attributes[‘sampleMetadataNotExported’] is not a list
- AttributeError – if self.Attributes[‘featureMetadataNotExported’] does not exist
- TypeError – if self.Attributes[‘featureMetadataNotExported’] is not a list
- AttributeError – if self.Attributes[‘analyticalMeasurements’] does not exist
- TypeError – if self.Attributes[‘analyticalMeasurements’] is not a dict
- AttributeError – if self.Attributes[‘excludeFromPlotting’] does not exist
- TypeError – if self.Attributes[‘excludeFromPlotting’] is not a list
- AttributeError – if self.VariableType does not exist
- AttributeError – if self._Normalisation does not exist
- TypeError – if self._Normalisation is not the Normaliser ABC
- AttributeError – if self._name does not exist
- TypeError – if self._name is not a str
- AttributeError – if self._intensityData does not exist
- TypeError – if self._intensityData is not a numpy.ndarray
- AttributeError – if self.sampleMetadata does not exist
- TypeError – if self.sampleMetadata is not a pandas.DataFrame
- LookupError – if self.sampleMetadata does not have a Sample File Name column
- LookupError – if self.sampleMetadata does not have an AssayRole column
- LookupError – if self.sampleMetadata does not have a SampleType column
- LookupError – if self.sampleMetadata does not have a Dilution column
- LookupError – if self.sampleMetadata does not have a Batch column
- LookupError – if self.sampleMetadata does not have a Correction Batch column
- LookupError – if self.sampleMetadata does not have a Run Order column
- LookupError – if self.sampleMetadata does not have a Sample ID column
- LookupError – if self.sampleMetadata does not have a Sample Base Name column
- LookupError – if self.sampleMetadata does not have an Acquired Time column
- LookupError – if self.sampleMetadata does not have an Exclusion Details column
- AttributeError – if self.featureMetadata does not exist
- TypeError – if self.featureMetadata is not a pandas.DataFrame
- LookupError – if self.featureMetadata does not have a Feature Name column
- AttributeError – if self.sampleMask does not exist
- TypeError – if self.sampleMask is not a numpy.ndarray
- ValueError – if the elements of self.sampleMask are not bool
- AttributeError – if self.featureMask does not exist
- TypeError – if self.featureMask is not a numpy.ndarray
- ValueError – if the elements of self.featureMask are not bool
- AttributeError – if self.sampleMetadataExcluded does not exist
- TypeError – if self.sampleMetadataExcluded is not a list
- AttributeError – if self.intensityDataExcluded does not exist
- TypeError – if self.intensityDataExcluded is not a list
- ValueError – if self.intensityDataExcluded does not have the same number of exclusions as self.sampleMetadataExcluded
- AttributeError – if self.featureMetadataExcluded does not exist
- TypeError – if self.featureMetadataExcluded is not a list
- ValueError – if self.featureMetadataExcluded does not have the same number of exclusions as self.sampleMetadataExcluded
- AttributeError – if self.excludedFlag does not exist
- TypeError – if self.excludedFlag is not a list
- ValueError – if self.excludedFlag does not have the same number of exclusions as self.sampleMetadataExcluded
- initialiseMasks()¶
Re-initialise featureMask and sampleMask to match the current dimensions of intensityData, and include all samples.
- updateMasks(filterSamples=True, filterFeatures=True, sampleTypes=[<SampleType.StudySample>, <SampleType.StudyPool>, <SampleType.ExternalReference>, <SampleType.MethodReference>, <SampleType.ProceduralBlank>], assayRoles=[<AssayRole.Assay>, <AssayRole.PrecisionReference>, <AssayRole.LinearityReference>, <AssayRole.Blank>], **kwargs)¶
Update sampleMask and featureMask according to parameters. updateMasks() sets sampleMask or featureMask to False for those items failing analytical criteria.
Note: to avoid reintroducing items manually excluded, this method only ever sets items to False; therefore, if you wish to move from more stringent criteria to a less stringent set, you will need to reset the mask to all True using initialiseMasks().
Parameters: - filterSamples (bool) – If False, don’t modify sampleMask
- filterFeatures (bool) – If False, don’t modify featureMask
- sampleTypes (SampleType) – List of types of samples to retain
- assayRoles (AssayRole) – List of assay roles to retain
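The "only ever sets items to False" behaviour amounts to combining the existing mask with the new criteria by logical AND (a sketch of the semantics in numpy; not the toolbox's internal code):

```python
import numpy as np

# Current sampleMask: sample 1 was manually excluded earlier
sampleMask = np.array([True, False, True, True])

# New analytical criteria: samples 0 and 1 pass, 2 and 3 fail
passesCriteria = np.array([True, True, False, False])

# AND-ing the masks means a previously excluded sample can never
# be reintroduced by a later, less stringent update
sampleMask &= passesCriteria
# sampleMask is now [True, False, False, False]
```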
- applyMasks()¶
Permanently delete elements masked (those set to False) in sampleMask and featureMask, from featureMetadata, sampleMetadata, and intensityData.
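In effect this is boolean indexing along both axes of the data matrix (a numpy sketch of the operation; not the toolbox's internal code):

```python
import numpy as np

intensityData = np.arange(12.0).reshape(3, 4)  # 3 samples × 4 features
sampleMask = np.array([True, False, True])     # drop sample 1
featureMask = np.array([True, True, False, True])  # drop feature 2

# Keep only unmasked rows (samples) and columns (features)
filtered = intensityData[sampleMask, :][:, featureMask]
# filtered.shape == (2, 3)
```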
- addSampleInfo(descriptionFormat=None, filePath=None, filetype=None, **kwargs)¶
Load additional metadata and map it in to the sampleMetadata table.
Possible options:
- ‘Basic CSV’: Joins the sampleMetadata table with the data in the csv file at filePath=, matching on the ‘Sample File Name’ column in both (see Sample Metadata).
- ‘Filenames’: Parses sample information out of the filenames, based on the named capture groups in the regex passed in filenamespec
- ‘Raw Data’: Extract analytical parameters from raw data files
- ‘ISATAB’: ISATAB study designs
Parameters: - descriptionFormat (str) – Format of metadata to be added
- filePath (str) – Path to the additional data to be added
Raises: NotImplementedError – if the descriptionFormat is not understood
- addFeatureInfo(filePath=None, descriptionFormat=None, featureId=None, **kwargs)¶
Load additional metadata and map it in to the featureMetadata table.
Possible options:
- ‘Reference Ranges’: JSON file specifying upper and lower reference ranges for a feature.
Parameters: - filePath (str) – Path to the additional data to be added
- descriptionFormat (str) –
- featureId (str) – Unique feature ID field in the metadata file provided, to match with the internal Feature Name
Raises: NotImplementedError – if the descriptionFormat is not understood
- excludeSamples(sampleList, on='Sample File Name', message='User Excluded')¶
Sets the sampleMask for the samples listed in sampleList to False to mask them from the dataset.
Parameters: - sampleList (list) – A list of sample IDs to be excluded
- on (str) – name of the column in sampleMetadata to match sampleList against, defaults to ‘Sample File Name’
- message (str) – append this message to the ‘Exclusion Details’ field for each sample excluded, defaults to ‘User Excluded’
Returns: a list of IDs passed in sampleList that could not be matched against the sample IDs present
Return type: list
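The matching behaviour, including the returned list of unmatched IDs, can be sketched in plain Python and pandas (an illustration of the semantics only; the sample names are invented):

```python
import numpy as np
import pandas as pd

sampleMetadata = pd.DataFrame({'Sample File Name': ['run01', 'run02', 'run03']})
sampleMask = np.array([True, True, True])

toExclude = ['run02', 'run99']  # 'run99' is not in the dataset

# Mask matched samples; collect the IDs that could not be matched
matched = sampleMetadata['Sample File Name'].isin(toExclude).values
sampleMask[matched] = False
unmatched = [s for s in toExclude
             if s not in sampleMetadata['Sample File Name'].values]
# unmatched == ['run99'], sampleMask == [True, False, True]
```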
- excludeFeatures(featureList, on='Feature Name', message='User Excluded')¶
Masks the features listed in featureList from the dataset.
Parameters: - featureList (list) – A list of feature IDs to be excluded
- on (str) – name of the column in featureMetadata to match featureList against, defaults to ‘Feature Name’
- message (str) – append this message to the ‘Exclusion Details’ field for each feature excluded, defaults to ‘User Excluded’
Returns: A list of IDs passed in featureList that could not be matched against the feature IDs present.
Return type: list
- exportDataset(destinationPath='.', saveFormat='CSV', isaDetailsDict={}, withExclusions=True, escapeDelimiters=False, filterMetadata=True)¶
Export the dataset object in a variety of formats for import into other software; the export is named according to the name attribute of the Dataset object.
Possible save formats are:
- CSV: Basic CSV output; featureMetadata, sampleMetadata and intensityData are written to three separate CSV files in destinationPath
- UnifiedCSV: Exports featureMetadata, sampleMetadata and intensityData concatenated into a single CSV file
- ISATAB: Exports the sampleMetadata in the ISATAB format
Parameters: - destinationPath (str) – Save data into the directory specified here
- saveFormat (str) – File format for saved data, defaults to CSV
- isaDetailsDict (dict) – Contains several key: value pairs required for exporting ISATAB. isaDetailsDict should have the format:
isaDetailsDict = {‘investigation_identifier’: “i1”, ‘investigation_title’: “Give it a title”, ‘investigation_description’: “Add a description”, ‘investigation_submission_date’: “2016-11-03”, ‘investigation_public_release_date’: “2016-11-03”, ‘first_name’: “Noureddin”, ‘last_name’: “Sadawi”, ‘affiliation’: “University”, ‘study_filename’: “my_ms_study”, ‘study_material_type’: “Serum”, ‘study_identifier’: “s1”, ‘study_title’: “Give the study a title”, ‘study_description’: “Add study description”, ‘study_submission_date’: “2016-11-03”, ‘study_public_release_date’: “2016-11-03”, ‘assay_filename’: “my_ms_assay”}
- withExclusions (bool) – If True, masked features and samples will be excluded
- escapeDelimiters (bool) – If True, remove characters commonly used as delimiters in csv files from metadata
- filterMetadata (bool) – If True, does not export the sampleMetadata and featureMetadata columns listed in self.Attributes[‘sampleMetadataNotExported’] and self.Attributes[‘featureMetadataNotExported’]
Raises: ValueError – if saveFormat is not understood
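The UnifiedCSV layout can be sketched as concatenating the three tables into one frame, one row per sample (an illustration of the shape only, not the exact exporter output; all values are invented):

```python
import numpy as np
import pandas as pd

sampleMetadata = pd.DataFrame({'Sample File Name': ['s1', 's2']})
featureMetadata = pd.DataFrame({'Feature Name': ['f1', 'f2', 'f3']})
intensityData = np.array([[1.0, 2.0, 3.0],
                          [4.0, 5.0, 6.0]])

# One row per sample: sample metadata columns, then one column per feature
intensities = pd.DataFrame(intensityData,
                           columns=featureMetadata['Feature Name'])
unified = pd.concat([sampleMetadata, intensities], axis=1)
# unified has 2 rows and 1 + 3 columns
```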
- getFeatures(featureIDs, by=None, useMasks=True)¶
Get a feature or list of features by name or ranges.
If VariableType is Discrete, getFeatures() expects either a single value or a list of values, and matching features are returned. If VariableType is Spectral, pass either a single (min, max) tuple or a list of them; the features returned will be a slice of the combined ranges. If the ranges passed overlap, the union will be returned.
Parameters: - featureIDs – A single feature ID or a list of feature IDs to return
- by (None or str) – Column in featureMetadata to search in; if None, use the column defined in Attributes[‘Feature Names’]
Returns: (featureMetadata, intensityData)
Return type: (pandas.Dataframe, numpy.ndarray)
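For spectral data, selecting by (min, max) ranges and taking the union of overlapping ranges can be sketched in numpy (the selection semantics only; the axis and ranges are invented):

```python
import numpy as np

ppm = np.arange(0, 101) / 10.0  # spectral axis, 0.1 ppm steps from 0 to 10

# Two overlapping ranges; their union covers 2.0 to 5.0 ppm
ranges = [(2.0, 4.0), (3.5, 5.0)]

# OR together the per-range masks to form the union of the slices
mask = np.zeros_like(ppm, dtype=bool)
for lo, hi in ranges:
    mask |= (ppm >= lo) & (ppm <= hi)

selected = ppm[mask]
# selected spans 2.0 to 5.0 inclusive
```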
- class nPYc.objects.MSDataset(datapath, fileType='QI', sop='GenericMS', **kwargs)¶
MSDataset extends Dataset to represent both peak-picked LC- or DI-MS datasets (discrete variables), and continuum mode (spectral) DI-MS datasets.
Objects can be initialised from a variety of common data formats, currently peak-picked data from Progenesis QI or XCMS, and targeted Biocrates datasets.
- Progenesis QI
- QI import operates on csv files exported via the ‘Export Compound Measurements’ menu option in QI. Import requires the presence of both normalised and raw datasets, but will only import the raw measurements.
- XCMS
- XCMS import operates on the csv files generated by XCMS with the peakTable() method. By default, the csv is expected to have 14 columns of feature parameters, with the intensity values for the first sample coming in the 15th column. However, the number of columns to skip is dataset dependent and can be set with the noFeatureParams= keyword argument. This method assumes that the retention time value in the XCMS exported peak list is specified in seconds.
- XCMSOnline
- XCMS Online download output supplies an unannotated and an annotated xlsx file, stored by default in the “XCMS results” folder. By default, the table is expected to have 10 columns of feature parameters, with the intensity values for the first sample coming in the 11th column. However, the number of columns to skip is dataset dependent and can be set with the noFeatureParams= keyword argument.
- MZmine
- MZmine2: import operates on csv files exported via the ‘Export to CSV file’ menu option. The field separator should be comma (“,”) and all export elements should be chosen for export. MZmine3: choose the ‘Export feature list’ -> ‘CSV (legacy MZmine 2)’ menu option; again, the field separator should be comma and all export elements should be chosen for export.
- MS-DIAL
- MS-DIAL import operates on the .txt (MSP) files exported via the ‘Export -> Alignment result’ menu option. Export options to choose are preferably ‘Raw data matrix (Area)’ or ‘Raw data matrix (Height)’. This method will also import the accompanying experimental metadata information such as File Type, Injection Order and Batch ID.
- Biocrates
- Operates on spreadsheets exported from Biocrates MetIDQ. By default loads data from the sheet named ‘Data Export’; this may be overridden with the sheetName= argument. If the number of sample metadata columns differs from the default, this can be overridden with the noSampleParams= argument.
- nPYc
- nPYc import operates on the csv file generated using the nPYc exportDataset function (the ‘combinedData’ file). This reimport function is intended for further filtering or normalisation without having to run the whole process again. Note that metadata does not need to be imported again.
- correlationToDilution¶
Returns the correlation of features to dilution as calculated on samples marked as ‘Dilution Series’ in sampleMetadata, with dilution expressed in ‘Dilution’.
Returns: Vector of feature correlations to dilution
Return type: numpy.ndarray
- artifactualLinkageMatrix¶
Gets overlapping artifactual features.
- rsdSP¶
Returns percentage relative standard deviations for each feature in the dataset, calculated on samples with the Assay Role PrecisionReference and Sample Type StudyPool in sampleMetadata.
Returns: Vector of feature RSDs
Return type: numpy.ndarray
- rsdSS¶
Returns percentage relative standard deviations for each feature in the dataset, calculated on samples with the Assay Role Assay and Sample Type StudySample in sampleMetadata.
Returns: Vector of feature RSDs
Return type: numpy.ndarray
- applyMasks()¶
Permanently delete elements masked (those set to False) in sampleMask and featureMask, from featureMetadata, sampleMetadata, and intensityData. Resets the feature linkage matrix and feature correlations.
- updateMasks(filterSamples=True, filterFeatures=True, sampleTypes=[<SampleType.StudySample>, <SampleType.StudyPool>, <SampleType.ExternalReference>, <SampleType.MethodReference>, <SampleType.ProceduralBlank>], assayRoles=[<AssayRole.Assay>, <AssayRole.PrecisionReference>, <AssayRole.LinearityReference>, <AssayRole.Blank>], featureFilters={'artifactualFilter': False, 'blankFilter': False, 'correlationToDilutionFilter': True, 'rsdFilter': True, 'varianceRatioFilter': True}, **kwargs)¶
Update sampleMask and featureMask according to QC parameters. updateMasks() sets sampleMask or featureMask to False for those items failing analytical criteria.
Note: to avoid reintroducing items manually excluded, this method only ever sets items to False; therefore, if you wish to move from more stringent criteria to a less stringent set, you will need to reset the mask to all True using initialiseMasks().
Parameters: - filterSamples (bool) – If False, don’t modify sampleMask
- filterFeatures (bool) – If False, don’t modify featureMask
- sampleTypes (SampleType) – List of types of samples to retain
- assayRoles (AssayRole) – List of assay roles to retain
- correlationThreshold (None or float) – Mask features with a correlation to dilution below this value. If None, use the value from Attributes[‘corrThreshold’]
- rsdThreshold (None or float) – Mask features with an RSD above this value. If None, use the value from Attributes[‘rsdThreshold’]
- varianceRatio (None or float) – Mask features where the RSD measured in study samples is below that measured in study reference samples multiplied by varianceRatio
- withArtifactualFiltering (None or bool) – If None, use the value from Attributes[‘artifactualFilter’]. If False, doesn’t apply artifactual filtering. If Attributes[‘artifactualFilter’] is set to False, artifactual filtering will not take place even if withArtifactualFiltering is set to True.
- deltaMzArtifactual (None or float) – Maximum allowed m/z distance between two grouped features. If None, use the value from Attributes[‘deltaMzArtifactual’]
- overlapThresholdArtifactual (None or float) – Minimum peak overlap between two grouped features. If None, use the value from Attributes[‘overlapThresholdArtifactual’]
- corrThresholdArtifactual (None or float) – Minimum correlation between two grouped features. If None, use the value from Attributes[‘corrThresholdArtifactual’]
- blankThreshold (None, False, or float) – Mask features whose median intensity falls below blankThreshold × the level in the blank. If False, do not filter; if None, use the cutoff from Attributes[‘blankThreshold’]; otherwise use the cutoff scaling factor provided
- saveFeatureMask()¶
Updates featureMask and saves it as ‘Passing Selection’ in self.featureMetadata
- addSampleInfo(descriptionFormat=None, filePath=None, filenameSpec=None, filetype='Waters .raw', **kwargs)¶
Load additional metadata and map it in to the sampleMetadata table.
Possible options:
- ‘NPC LIMS’: NPC LIMS files mapping file names of raw analytical data to sample IDs
- ‘NPC Subject Info’: Map subject metadata from a NPC sample manifest file (format defined in ‘PCSOP.082’)
- ‘Raw Data’: Extract analytical parameters from raw data files
- ‘ISATAB’: ISATAB study designs
- ‘Filenames’: Parses sample information out of the filenames, based on the named capture groups in the regex passed in filenamespec
- ‘Basic CSV’: Joins the sampleMetadata table with the data in the csv file at filePath=, matching on the ‘Sample File Name’ column in both.
Parameters: - descriptionFormat (str) – Format of metadata to be added
- filePath (str) – Path to the additional data to be added
- filenameSpec (None or str) – Only used if descriptionFormat is ‘Filenames’. A regular expression that extracts sample-type information into the following named capture groups: ‘fileName’, ‘baseName’, ‘study’, ‘chromatography’, ‘ionisation’, ‘instrument’, ‘groupingKind’, ‘groupingNo’, ‘injectionKind’, ‘injectionNo’, ‘reference’, ‘exclusion’, ‘reruns’, ‘extraInjections’, ‘exclusion2’. If None is passed, use the filenameSpec key in Attributes, loaded from the SOP json
Raises: NotImplementedError – if the descriptionFormat is not understood
- amendBatches(sampleRunOrder)¶
Creates a new batch starting at the sample index in sampleRunOrder, and amends subsequent batch numbers in sampleMetadata[‘Correction Batch’]
Parameters: sampleRunOrder (int) – Index of first sample in new batch
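The renumbering can be sketched in pandas (an illustration of the behaviour under simplified assumptions, not the toolbox's internal code; the table below is invented):

```python
import pandas as pd

sampleMetadata = pd.DataFrame({'Run Order': [0, 1, 2, 3, 4],
                               'Correction Batch': [1, 1, 1, 1, 1]})

# Start a new batch at run-order index 3: every sample acquired from
# that point onward moves into the next batch
newBatchStart = 3
later = sampleMetadata['Run Order'] >= newBatchStart
sampleMetadata.loc[later, 'Correction Batch'] += 1
# 'Correction Batch' is now [1, 1, 1, 2, 2]
```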
- artifactualFilter(featMask=None)¶
Filters artifactual features on top of the featureMask already present if none is given as input. Keeps the feature with the highest intensity on the mean spectrum.
Parameters: featMask (numpy.ndarray or None) – A featureMask (True for inclusion); if None, use featureMask
Returns: Amended featureMask
Return type: numpy.ndarray
- excludeFeatures(featureList, on='Feature Name', message='User Excluded')¶
Masks the features listed in featureList from the dataset.
Parameters: - featureList (list) – A list of feature IDs to be excluded
- on (str) – name of the column in featureMetadata to match featureList against, defaults to ‘Feature Name’
- message (str) – append this message to the ‘Exclusion Details’ field for each feature excluded, defaults to ‘User Excluded’
Returns: A list of IDs passed in featureList that could not be matched against the feature IDs present.
Return type: list
- initialiseMasks()¶
Re-initialise featureMask and sampleMask to match the current dimensions of intensityData, and include all samples.
-
validateObject
(verbose=True, raiseError=False, raiseWarning=True)¶ Checks that all the attributes specified in the class definition are present and of the required class and/or values.
Returns 4 boolean: is the object a Dataset < a basic MSDataset < has the object parameters for QC < has the object sample metadata.
To employ all class methods, the most inclusive (has the object sample metadata) must be successful:
- ‘Basic MSDataset’ checks Dataset types and uniqueness as well as additional attributes.
- ‘has parameters for QC’ is ‘Basic MSDataset’ + sampleMetadata[[‘SampleType’, ‘AssayRole’, ‘Dilution’, ‘Run Order’, ‘Batch’, ‘Correction Batch’, ‘Sample Base Name’]]
- ‘has sample metadata’ is ‘has parameters for QC’ + sampleMetadata[[‘Sample ID’, ‘Subject ID’, ‘Matrix’]]
Column types in pandas.DataFrame are established on the first sample when necessary. Does not check for uniqueness in
sampleMetadata['Sample File Name']
. Does not currently check the type of
Attributes['Raw Data Path']
or of
corrExclusions
.
Parameters: - verbose (bool) – if True the result of each check is printed (default True)
- raiseError (bool) – if True an error is raised when a check fails and the validation is interrupted (default False)
- raiseWarning (bool) – if True a warning is raised when a check fails
Returns: A dictionary of 4 booleans with True if the object conforms to the corresponding test. ‘Dataset’ conforms to
Dataset
, ‘BasicMSDataset’ conforms toDataset
+ basicMSDataset
, ‘QC’ BasicMSDataset + object has QC parameters, ‘sampleMetadata’ QC + object has sample metadata informationReturn type: dict
Raises: - TypeError – if the Object class is wrong
- AttributeError – if self.Attributes[‘rtWindow’] does not exist
- TypeError – if self.Attributes[‘rtWindow’] is not an int or float
- AttributeError – if self.Attributes[‘msPrecision’] does not exist
- TypeError – if self.Attributes[‘msPrecision’] is not an int or float
- AttributeError – if self.Attributes[‘varianceRatio’] does not exist
- TypeError – if self.Attributes[‘varianceRatio’] is not an int or float
- AttributeError – if self.Attributes[‘blankThreshold’] does not exist
- TypeError – if self.Attributes[‘blankThreshold’] is not an int or float
- AttributeError – if self.Attributes[‘corrMethod’] does not exist
- TypeError – if self.Attributes[‘corrMethod’] is not a str
- AttributeError – if self.Attributes[‘corrThreshold’] does not exist
- TypeError – if self.Attributes[‘corrThreshold’] is not an int or float
- AttributeError – if self.Attributes[‘rsdThreshold’] does not exist
- TypeError – if self.Attributes[‘rsdThreshold’] is not an int or float
- AttributeError – if self.Attributes[‘artifactualFilter’] does not exist
- TypeError – if self.Attributes[‘artifactualFilter’] is not a bool
- AttributeError – if self.Attributes[‘deltaMzArtifactual’] does not exist
- TypeError – if self.Attributes[‘deltaMzArtifactual’] is not an int or float
- AttributeError – if self.Attributes[‘overlapThresholdArtifactual’] does not exist
- TypeError – if self.Attributes[‘overlapThresholdArtifactual’] is not an int or float
- AttributeError – if self.Attributes[‘corrThresholdArtifactual’] does not exist
- TypeError – if self.Attributes[‘corrThresholdArtifactual’] is not an int or float
- AttributeError – if self.Attributes[‘FeatureExtractionSoftware’] does not exist
- TypeError – if self.Attributes[‘FeatureExtractionSoftware’] is not a str
- AttributeError – if self.Attributes[‘Raw Data Path’] does not exist
- TypeError – if self.Attributes[‘Raw Data Path’] is not a str
- AttributeError – if self.Attributes[‘Feature Names’] does not exist
- TypeError – if self.Attributes[‘Feature Names’] is not a str
- TypeError – if self.VariableType is not an enum ‘VariableType’
- AttributeError – if self.corrExclusions does not exist
- AttributeError – if self._correlationToDilution does not exist
- TypeError – if self._correlationToDilution is not a numpy.ndarray
- AttributeError – if self._artifactualLinkageMatrix does not exist
- TypeError – if self._artifactualLinkageMatrix is not a pandas.DataFrame
- AttributeError – if self._tempArtifactualLinkageMatrix does not exist
- TypeError – if self._tempArtifactualLinkageMatrix is not a pandas.DataFrame
- AttributeError – if self.fileName does not exist
- TypeError – if self.fileName is not a str
- AttributeError – if self.filePath does not exist
- TypeError – if self.filePath is not a str
- ValueError – if self.sampleMetadata does not have the same number of samples as self._intensityData
- TypeError – if self.sampleMetadata[‘Sample File Name’] is not str
- TypeError – if self.sampleMetadata[‘AssayRole’] is not an enum ‘AssayRole’
- TypeError – if self.sampleMetadata[‘SampleType’] is not an enum ‘SampleType’
- TypeError – if self.sampleMetadata[‘Dilution’] is not an int or float
- TypeError – if self.sampleMetadata[‘Batch’] is not an int or float
- TypeError – if self.sampleMetadata[‘Correction Batch’] is not an int or float
- TypeError – if self.sampleMetadata[‘Run Order’] is not an int
- TypeError – if self.sampleMetadata[‘Acquired Time’] is not a datetime
- TypeError – if self.sampleMetadata[‘Sample Base Name’] is not str
- LookupError – if self.sampleMetadata does not have a Matrix column
- TypeError – if self.sampleMetadata[‘Matrix’] is not a str
- LookupError – if self.sampleMetadata does not have a Subject ID column
- TypeError – if self.sampleMetadata[‘Subject ID’] is not a str
- TypeError – if self.sampleMetadata[‘Sample ID’] is not a str
- ValueError – if self.featureMetadata does not have the same number of features as self._intensityData
- TypeError – if self.featureMetadata[‘Feature Name’] is not a str
- ValueError – if self.featureMetadata[‘Feature Name’] is not unique
- LookupError – if self.featureMetadata does not have a m/z column
- TypeError – if self.featureMetadata[‘m/z’] is not an int or float
- LookupError – if self.featureMetadata does not have a Retention Time column
- TypeError – if self.featureMetadata[‘Retention Time’] is not an int or float
- ValueError – if self.sampleMask has not been initialised
- ValueError – if self.sampleMask does not have the same number of samples as self._intensityData
- ValueError – if self.featureMask has not been initialised
- ValueError – if self.featureMask does not have the same number of features as self._intensityData
-
class
nPYc.objects.
NMRDataset
(datapath, fileType='Bruker', sop='GenericNMRurine', pulseprogram='noesygpp1d', **kwargs)¶ NMRDataset
extendsDataset
to represent both spectral and peak-picked NMR datasets. Objects can be initialised from a variety of common data formats, including Bruker-format raw data and BI-LISA targeted lipoprotein analysis.
- Bruker
- When loading Bruker format raw spectra (
1r
files), all directories belowdatapath
will be scanned for valid raw data, and those matching pulseprogram are loaded and aligned onto a common scale as defined in sop.
- BI-LISA
- BI-LISA data can be read from Excel workbooks; the name of the sheet containing the data to be loaded should be passed in the pulseProgram argument. Feature descriptors will be loaded from the ‘Analytes’ sheet, and file names converted back to the ExperimentName/expno format from ExperimentName_EXPNO_expno.
Parameters: - fileType (str) – Type of data to be loaded
- sheetname (str) – Load data from the specified sheet of the Excel workbook
- pulseprogram (str) – When loading raw data, only import spectra acquired with this pulseprogram
-
addSampleInfo
(descriptionFormat=None, filePath=None, filenameSpec=None, **kwargs)¶ Load additional metadata and map it in to the
sampleMetadata
table.Possible options:
- ‘NPC LIMS’ NPC LIMS files mapping file names of raw analytical data to sample IDs
- ‘NPC Subject Info’ Map subject metadata from a NPC sample manifest file (format defined in ‘PCSOP.082’)
- ‘Raw Data’ Extract analytical parameters from raw data files
- ‘ISATAB’ ISATAB study designs
- ‘Filenames’ Parses sample information out of the filenames, based on the named capture groups in the regex passed in filenamespec
- ‘Basic CSV’ Joins the
sampleMetadata
table with the data in thecsv
file at filePath=, matching on the ‘Sample File Name’ column in both.
Parameters: - descriptionFormat (str) – Format of metadata to be added
- filePath (str) – Path to the additional data to be added
- filenameSpec (None or str) – Only used if descriptionFormat is ‘Filenames’. A regular expression that extracts sample-type information into the following named capture groups: ‘fileName’, ‘baseName’, ‘study’, ‘chromatography’, ‘ionisation’, ‘instrument’, ‘groupingKind’, ‘groupingNo’, ‘injectionKind’, ‘injectionNo’, ‘reference’, ‘exclusion’, ‘reruns’, ‘extraInjections’, ‘exclusion2’. If None is passed, the filenameSpec key in Attributes, loaded from the SOP JSON, is used
Raises: NotImplementedError – if the descriptionFormat is not understood
-
updateMasks
(filterSamples=True, filterFeatures=True, sampleTypes=[<SampleType.StudySample>, <SampleType.StudyPool>, <SampleType.ExternalReference>, <SampleType.MethodReference>, <SampleType.ProceduralBlank>], assayRoles=[<AssayRole.Assay>, <AssayRole.PrecisionReference>, <AssayRole.LinearityReference>, <AssayRole.Blank>], exclusionRegions=None, sampleQCChecks=[], **kwargs)¶ Update
sampleMask
andfeatureMask
according to parameters.updateMasks()
setssampleMask
orfeatureMask
toFalse
for those items failing analytical criteria. Note
To avoid reintroducing items manually excluded, this method only ever sets items to
False
, therefore if you wish to move from more stringent criteria to a less stringent set, you will need to reset the mask to allTrue
usinginitialiseMasks()
.Parameters: - filterSamples (bool) – If
False
don’t modify sampleMask - filterFeatures (bool) – If
False
don’t modify featureMask - sampleTypes (SampleType) – List of types of samples to retain
- sampleRoles (AssayRole) – List of assays roles to retain
- exclusionRegions (list of tuple) – If
None
Exclude ranges defined inAttributes
[‘exclusionRegions’] - sampleQCChecks (list) – Which quality control metrics to use.
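The effect of exclusionRegions can be sketched with numpy: each (start, stop) tuple in ppm sets the corresponding stretch of featureMask to False. The chemical-shift scale and regions below are hypothetical; the defaults are read from Attributes[‘exclusionRegions’]:

```python
import numpy

ppm = numpy.linspace(10, -1, 1000)            # hypothetical chemical-shift scale
featureMask = numpy.ones(ppm.size, dtype=bool)

exclusionRegions = [(4.7, 4.9), (-0.2, 0.2)]  # e.g. water and TSP regions

# Mask every feature whose ppm value falls inside an exclusion region
for start, stop in exclusionRegions:
    low, high = min(start, stop), max(start, stop)
    featureMask[(ppm >= low) & (ppm <= high)] = False
```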
-
plot
(spectra, labels, interactive=False)¶ Plots a set of NMR spectra. If interactive is False, a static matplotlib plot is returned; if True, plotly is used to generate an interactive plot.
Parameters: - spectra – The specific ‘labels’ of the spectra to plot. By default all spectra are plotted.
- labels – Which labels to select
- interactive – Use matplotlib (False) or plotly (True)
Returns: Displays the NMR data and returns either a matplotlib axis object or a plotly figure dictionary
-
class
nPYc.objects.
TargetedDataset
(dataPath, fileType='TargetLynx', sop='Generic', **kwargs)¶ TargetedDataset
extendsDataset
to represent quantitative datasets, where compounds are already identified, the exactitude of the quantification can be established, units are known, and calibration curves or internal standards are employed. The TargetedDataset
class includes methods to apply limits of quantification (LLOQ and ULOQ), merge multiple analytical batches, and report the accuracy and precision of each measurement. In addition to the structure of
Dataset
,TargetedDataset
requires the following attributes:expectedConcentration
:A \(n\) × \(m\) pandas dataframe of expected concentrations (matching the
intensityData
dimension), with column names matchingfeatureMetadata[‘Feature Name’]
calibration
:A dictionary containing pandas dataframe describing calibration samples:
calibration['calibIntensityData']
:- A \(r\) x \(m\) numpy matrix of measurements. Features must match features in
intensityData
calibration['calibSampleMetadata']
:- A \(r\) x \(m\) pandas dataframe of calibration sample identifiers and metadata
calibration['calibFeatureMetadata']
:- A \(m\) × \(q\) pandas dataframe of feature identifiers and metadata
calibration['calibExpectedConcentration']
- A \(r\) × \(m\) pandas dataframe of calibration samples’ expected concentrations
Attributes
must contain the following (can be loaded from a method specific JSON on import):methodName
:- A (str) name of the method
externalID
- A list of external IDs; each external ID must also be present in Attributes as a list of identifiers (for that external ID), one per feature. For example, if
externalID=['PubChem ID']
,Attributes['PubChem ID']=['ID1','ID2','','ID75']
featureMetadata
expects the following columns:quantificationType
:- A
QuantificationType
enum specifying the exactitude of the quantification procedure employed.
calibrationMethod
:- A
CalibrationMethod
enum specifying the calibration method employed.
Unit
- A (str) unit corresponding to the feature measurement value.
LLOQ
:- The lowest limit of quantification, used to filter concentrations < LLOQ
ULOQ
:- The upper limit of quantification, used to filter concentrations > ULOQ
- externalID:
- All externalIDs listed in
Attributes['externalID']
must be present as their own column
Currently, targeted assay results processed using TargetLynx or Bruker quantification can be imported. To create an import for any other form of semi-quantitative or quantitative results, the procedure is as follows:
- Create a new
fileType == 'myMethod'
entry in__init__()
- Define functions to populate all expected dataframes (using file readers, JSON,…)
- Separate calibration samples from study samples (store in
calibration
). If none exist, initialise empty dataframes with the correct number of columns and column names. - Execute pre-processing steps if required (note: all feature values should be expressed in the unit listed in
featureMetadata['Unit']
) - Apply limits of quantification using
_applyLimitsOfQuantification()
. (This function does not apply limits of quantification to features marked asQuantificationType
== QuantificationType.Monitored for compounds monitored for relative information.)
The resulting
TargetedDataset
created must satisfy the criteria for BasicTargetedDataset, which can be checked withvalidateObject()
(listing the minimum requirements for all class methods).fileType == 'TargetLynx'
to import data processed using TargetLynxTargetLynx import operates on
xml
files exported via the ‘File -> Export -> XML’ TargetLynx menu option. Import requires acalibration_report.csv
providing lower and upper limits of quantification (LLOQ, ULOQ) with thecalibrationReportPath
keyword argument.Targeted data measurements as well as calibration report information are read and mapped with pre-defined SOPs. All measurments are converted to pre-defined units and measurements inferior to the lowest limits of quantification or superior to the upper limits of quantification are replaced. Once the import is finished, only analysed samples are returned (no calibration samples) and only features mapped onto the pre-defined SOP and sufficiently described.
Instructions to create a new
TargetLynx
SOP can be found on the generation of targeted SOPs page.Example:
TargetedDataset(datapath, fileType='TargetLynx', sop='OxylipinMS', calibrationReportPath=calibrationReportPath, sampleTypeToProcess=['Study Sample','QC'], noiseFilled=False, onlyLLOQ=False, responseReference=None)
sop
Currently implemented are ‘OxylipinMS’ and ‘AminoAcidMS’
AminoAcidMS: Gray N. et al. High-Speed Quantitative UPLC-MS Analysis of Multiple Amines in Human Plasma and Serum via Precolumn Derivatization with 6-Aminoquinolyl-N-hydroxysuccinimidyl Carbamate: Application to Acetaminophen-Induced Liver Failure. Analytical Chemistry, 2017, 89, 2478–87.
OxylipinMS: Wolfer AM. et al. Development and Validation of a High-Throughput Ultrahigh-Performance Liquid Chromatography-Mass Spectrometry Approach for Screening of Oxylipins and Their Precursors. Analytical Chemistry, 2015, 87 (23),11721–31
calibrationReportPath
Path to the calibration report csv following the provided report template.
The following columns are required (leave an empty value to reject a compound):
- Compound
- The compound name, identical to the one employed in the SOP json file.
- TargetLynx ID
- The compound TargetLynx ID, identical to the one employed in the SOP json file.
- LLOQ
- Lowest limit of quantification concentration, in the same unit as indicated in TargetLynx.
- ULOQ
- Upper limit of quantification concentration, in the same unit as indicated in TargetLynx.
The following columns are expected by
_targetLynxApplyLimitsOfQuantificationNoiseFilled()
:- Noise (area)
- Area integrated in a blank sample at the same retention time as the compound of interest (if left empty noise concentration calculation cannot take place).
- a
- \(a\) coefficient in the calibration equation (if left empty noise concentration calculation cannot take place).
- b
- \(b\) coefficient in the calibration equation (if left empty noise concentration calculation cannot take place).
The following columns are recommended but not expected:
- Cpd Info
- Additional information relating to the compound (can be left empty).
- r
- \(r\) goodness of fit measure for the calibration equation (can be left empty).
- r2
- \(r^2\) goodness of fit measure for the calibration equation (can be left empty).
sampleTypeToProcess
List of [‘Study Sample’,’Blank’,’QC’,’Other’] sample types to process, as defined in MassLynx. Only samples in sampleTypeToProcess are returned. Calibrants should not be processed and are not returned. Most uses should only require ‘Study Sample’, as quality controls are identified based on sample names by subsequent functions. Defaults to [‘Study Sample’,’QC’].
noiseFilled
If True, values <LLOQ are replaced by a concentration equivalent to the noise level in a blank; if False, <LLOQ is replaced by \(-inf\). Defaults to False.
onlyLLOQ
If True, only correct <LLOQ; if False, correct both <LLOQ and >ULOQ. Defaults to False.
responseReference
If noiseFilled=True, the noise concentration needs to be calculated. Provide the ‘Sample File Name’ of a reference sample to establish the response, or a list of samples (one per feature). If None, the middle of the calibration is employed. Defaults to None.
keepPeakInfo
If keepPeakInfo=True (default False) adds the
peakInfo
dictionary to thecalibration
.peakInfo
contains the peakResponse, peakArea, peakConcentrationDeviation, peakIntegrationFlag and peakRT.
keepExcluded
If keepExcluded=True (default False), import exclusions (
excludedImportSampleMetadata
,excludedImportFeatureMetadata
,excludedImportIntensityData
andexcludedImportExpectedConcentration
) are kept in the object.
keepIS
If keepIS=True (default False), features marked as Internal Standards (IS) are retained.
fileType = 'Bruker Quantification'
to import Bruker quantification resultsnmrRawDataPath
- Path to the parent folder where all result files are stored. All subfolders will be parsed and the
.xml
results files matching thefileNamePattern
imported.
fileNamePattern
- Regex to recognise the result data xml files
pdata
- To select the right pdata folders (default 1)
Two forms of Bruker quantification results are supported, selected using the
sop
option: BrukerQuant-UR and Bruker BI-LISAsop = 'BrukerQuant-UR'
Example:
TargetedDataset(nmrRawDataPath, fileType='Bruker Quantification', sop='BrukerQuant-UR', fileNamePattern='.*?urine_quant_report_b\.xml$', unit='mmol/mol Crea')
unit
- If features are duplicated with different units,
unit
limits the import to features matching said unit. (In case of duplication and nounit
, all available units will be listed)
sop = 'BrukerBI-LISA'
Example:
TargetedDataset(nmrRawDataPath, fileType='Bruker Quantification', sop='BrukerBI-LISA', fileNamePattern='.*?results\.xml$')
-
rsdSP
¶ Returns percentage relative standard deviations for each feature in the dataset, calculated on samples with the Assay Role
PrecisionReference
and Sample TypeStudyPool
insampleMetadata
. Implemented as a back-up toaccuracyPrecision()
when no expected concentrations are known. Returns: Vector of feature RSDs Return type: numpy.ndarray
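Percentage RSD is the standard deviation over the mean, scaled to per cent. A sketch of the per-feature calculation on the pooled-QC subset, with hypothetical data (ddof=1 assumed for the sample standard deviation):

```python
import numpy

# Hypothetical intensities for study-pool samples (4 samples x 3 features)
poolIntensities = numpy.array([[100.0, 10.0, 1.0],
                               [102.0,  9.0, 1.1],
                               [ 98.0, 11.0, 0.9],
                               [100.0, 10.0, 1.0]])

# Per-feature %RSD: standard deviation / mean * 100
rsdSP = (numpy.std(poolIntensities, axis=0, ddof=1)
         / numpy.mean(poolIntensities, axis=0) * 100)
```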
-
rsdSS
¶ Returns percentage relative standard deviations for each feature in the dataset, calculated on samples with the Assay Role
Assay
and Sample TypeStudySample
insampleMetadata
.Returns: Vector of feature RSDs Return type: numpy.ndarray
-
mergeLimitsOfQuantification
(keepBatchLOQ=False, onlyLLOQ=False)¶ Update limits of quantification and apply LLOQ/ULOQ using the lowest common denominator across all batches (after a
__add__()
). Keep the highest LLOQ and lowest ULOQ.Parameters: - keepBatchLOQ (bool) – If
True
do not remove each batch LOQ (featureMetadata['LLOQ_batchX']
,featureMetadata['ULOQ_batchX']
) - onlyLLOQ (bool) – if True only correct <LLOQ, if False correct <LLOQ and >ULOQ
Raises: - ValueError – if targetedData does not satisfy the BasicTargetedDataset definition on input
- ValueError – if the number of batches, LLOQ_batchX and ULOQ_batchX columns do not match
- ValueError – if targetedData does not satisfy the BasicTargetedDataset definition after LOQ merging
- Warning – if
featureMetadata['LLOQ']
orfeatureMetadata['ULOQ']
already exist and will be overwritten.
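The lowest-common-denominator rule keeps the most conservative limits across batches: the highest LLOQ and the lowest ULOQ. A sketch with pandas and hypothetical per-batch columns:

```python
import pandas

# Hypothetical per-batch limits after two batches have been merged
featureMetadata = pandas.DataFrame({
    'Feature Name': ['alanine', 'glycine'],
    'LLOQ_batch1': [0.5, 1.0], 'ULOQ_batch1': [100.0, 200.0],
    'LLOQ_batch2': [0.8, 0.9], 'ULOQ_batch2': [ 90.0, 250.0]})

# Keep the highest LLOQ and the lowest ULOQ across batches
featureMetadata['LLOQ'] = featureMetadata[['LLOQ_batch1', 'LLOQ_batch2']].max(axis=1)
featureMetadata['ULOQ'] = featureMetadata[['ULOQ_batch1', 'ULOQ_batch2']].min(axis=1)
# alanine: LLOQ 0.8, ULOQ 90.0; glycine: LLOQ 1.0, ULOQ 200.0
```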
-
exportDataset
(destinationPath='.', saveFormat='CSV', withExclusions=True, escapeDelimiters=False, filterMetadata=True)¶ Calls
exportDataset()
and raises a warning if normalisation is employed asTargetedDataset
intensityData
can be left-censored.
-
validateObject
(verbose=True, raiseError=False, raiseWarning=True)¶ Checks that all the attributes specified in the class definition are present and of the required class and/or values.
Returns 4 booleans: is the object a Dataset < a basic TargetedDataset < has the object parameters for QC < has the object sample metadata.
To employ all class methods, the most inclusive (has the object sample metadata) must be successful:
- ‘Basic TargetedDataset’ checks
TargetedDataset
types and uniqueness as well as additional attributes. - ‘has parameters for QC’ is ‘Basic TargetedDataset’ + sampleMetadata[[‘SampleType, AssayRole, Dilution, Run Order, Batch, Correction Batch, Sample Base Name]]
- ‘has sample metadata’ is ‘has parameters for QC’ + sampleMetadata[[‘Sample ID’, ‘Subject ID’, ‘Matrix’]]
calibration['calibIntensityData']
must be initialised even if no samples are presentcalibration['calibSampleMetadata']
must be initialised even if no samples are present, use:pandas.DataFrame(None, columns=self.sampleMetadata.columns.values.tolist())
calibration['calibFeatureMetadata']
must be initialised even if no samples are present, use a copy ofself.featureMetadata
calibration['calibExpectedConcentration']
must be initialised even if no samples are present, use:pandas.DataFrame(None, columns=self.expectedConcentration.columns.values.tolist())
Calibration features must be identical to the usual features. The number of calibration samples and features must match across the 4 calibration tables. If ‘sampleMetadataExcluded’, ‘intensityDataExcluded’, ‘featureMetadataExcluded’, ‘expectedConcentrationExcluded’ or ‘excludedFlag’ exist, the existence and number of exclusions (based on ‘sampleMetadataExcluded’) is checked. Column types in pandas.DataFrame are established on the first sample (for non int/float). featureMetadata is searched for column names containing ‘LLOQ’ & ‘ULOQ’ to allow for ‘LLOQ_batch…’ after
__add__()
; the first matching column is then checked for dtype. If datasets are merged, calibration is a list of dicts, and the number of features is only kept constant inside each dict. Does not check for uniqueness in
sampleMetadata['Sample File Name']
. Does not check columns inside
calibration['calibSampleMetadata']
or
calibration['calibFeatureMetadata']
. Does not currently check for
Attributes['Feature Name']
Parameters: - verbose (bool) – if True the result of each check is printed (default True)
- raiseError (bool) – if True an error is raised when a check fails and the validation is interrupted (default False)
- raiseWarning (bool) – if True a warning is raised when a check fails
Returns: A dictionary of 4 booleans with True if the object conforms to the corresponding test. ‘Dataset’ conforms to
Dataset
, ‘BasicTargetedDataset’ conforms toDataset
+ basicTargetedDataset
, ‘QC’ BasicTargetedDataset + object has QC parameters, ‘sampleMetadata’ QC + object has sample metadata informationReturn type: dict
Raises: - TypeError – if the Object class is wrong
- AttributeError – if self.Attributes[‘methodName’] does not exist
- TypeError – if self.Attributes[‘methodName’] is not a str
- AttributeError – if self.Attributes[‘externalID’] does not exist
- TypeError – if self.Attributes[‘externalID’] is not a list
- TypeError – if self.VariableType is not an enum ‘VariableType’
- AttributeError – if self.fileName does not exist
- TypeError – if self.fileName is not a str or list
- AttributeError – if self.filePath does not exist
- TypeError – if self.filePath is not a str or list
- ValueError – if self.sampleMetadata does not have the same number of samples as self._intensityData
- TypeError – if self.sampleMetadata[‘Sample File Name’] is not str
- TypeError – if self.sampleMetadata[‘AssayRole’] is not an enum ‘AssayRole’
- TypeError – if self.sampleMetadata[‘SampleType’] is not an enum ‘SampleType’
- TypeError – if self.sampleMetadata[‘Dilution’] is not an int or float
- TypeError – if self.sampleMetadata[‘Batch’] is not an int or float
- TypeError – if self.sampleMetadata[‘Correction Batch’] is not an int or float
- TypeError – if self.sampleMetadata[‘Run Order’] is not an int
- TypeError – if self.sampleMetadata[‘Acquired Time’] is not a datetime
- TypeError – if self.sampleMetadata[‘Sample Base Name’] is not str
- LookupError – if self.sampleMetadata does not have a Subject ID column
- TypeError – if self.sampleMetadata[‘Subject ID’] is not a str
- TypeError – if self.sampleMetadata[‘Sample ID’] is not a str
- ValueError – if self.featureMetadata does not have the same number of features as self._intensityData
- TypeError – if self.featureMetadata[‘Feature Name’] is not a str
- ValueError – if self.featureMetadata[‘Feature Name’] is not unique
- LookupError – if self.featureMetadata does not have a calibrationMethod column
- TypeError – if self.featureMetadata[‘calibrationMethod’] is not an enum ‘CalibrationMethod’
- LookupError – if self.featureMetadata does not have a quantificationType column
- TypeError – if self.featureMetadata[‘quantificationType’] is not an enum ‘QuantificationType’
- LookupError – if self.featureMetadata does not have a Unit column
- TypeError – if self.featureMetadata[‘Unit’] is not a str
- LookupError – if self.featureMetadata does not have a LLOQ or similar column
- TypeError – if self.featureMetadata[‘LLOQ’] or similar is not an int or float
- LookupError – if self.featureMetadata does not have a ULOQ or similar column
- TypeError – if self.featureMetadata[‘ULOQ’] or similar is not an int or float
- LookupError – if self.featureMetadata does not have the ‘externalID’ as columns
- AttributeError – if self.expectedConcentration does not exist
- TypeError – if self.expectedConcentration is not a pandas.DataFrame
- ValueError – if self.expectedConcentration does not have the same number of samples as self._intensityData
- ValueError – if self.expectedConcentration does not have the same number of features as self._intensityData
- ValueError – if self.expectedConcentration column names do not match self.featureMetadata[‘Feature Name’]
- ValueError – if self.sampleMask is not initialised
- ValueError – if self.sampleMask does not have the same number of samples as self._intensityData
- ValueError – if self.featureMask has not been initialised
- ValueError – if self.featureMask does not have the same number of features as self._intensityData
- AttributeError – if self.calibration does not exist
- TypeError – if self.calibration is not a dict
- AttributeError – if self.calibration[‘calibIntensityData’] does not exist
- TypeError – if self.calibration[‘calibIntensityData’] is not a numpy.ndarray
- ValueError – if self.calibration[‘calibIntensityData’] does not have the same number of features as self._intensityData
- AttributeError – if self.calibration[‘calibSampleMetadata’] does not exist
- TypeError – if self.calibration[‘calibSampleMetadata’] is not a pandas.DataFrame
- ValueError – if self.calibration[‘calibSampleMetadata’] does not have the same number of samples as self.calibration[‘calibIntensityData’]
- AttributeError – if self.calibration[‘calibFeatureMetadata’] does not exist
- TypeError – if self.calibration[‘calibFeatureMetadata’] is not a pandas.DataFrame
- LookupError – if self.calibration[‘calibFeatureMetadata’] does not have a [‘Feature Name’] column
- ValueError – if self.calibration[‘calibFeatureMetadata’] does not have the same number of features as self._intensityData
- AttributeError – if self.calibration[‘calibExpectedConcentration’] does not exist
- TypeError – if self.calibration[‘calibExpectedConcentration’] is not a pandas.DataFrame
- ValueError – if self.calibration[‘calibExpectedConcentration’] does not have the same number of samples as self.calibration[‘calibIntensityData’]
- ValueError – if self.calibration[‘calibExpectedConcentration’] does not have the same number of features as self.calibration[‘calibIntensityData’]
- ValueError – if self.calibration[‘calibExpectedConcentration’] column names do not match self.featureMetadata[‘Feature Name’]
- ‘Basic TargetedDataset’ checks
-
applyMasks
()¶ Permanently delete elements masked (those set to
False
) insampleMask
andfeatureMask
, fromfeatureMetadata
,sampleMetadata
,intensityData
and expectedConcentration. Features are excluded in each
calibration
based on the internalcalibration['calibFeatureMetadata']
(iterate through the list of calibration if 2+ datasets have been joined with__add__()
).
-
updateMasks
(filterSamples=True, filterFeatures=True, sampleTypes=[<SampleType.StudySample>, <SampleType.StudyPool>], assayRoles=[<AssayRole.Assay>, <AssayRole.PrecisionReference>], quantificationTypes=[<QuantificationType.IS>, <QuantificationType.QuantOwnLabeledAnalogue>, <QuantificationType.QuantAltLabeledAnalogue>, <QuantificationType.QuantOther>, <QuantificationType.Monitored>], calibrationMethods=[<CalibrationMethod.backcalculatedIS>, <CalibrationMethod.noIS>, <CalibrationMethod.noCalibration>, <CalibrationMethod.otherCalibration>], rsdThreshold=None, **kwargs)¶ Update
sampleMask
andfeatureMask
according to QC parameters.updateMasks()
setssampleMask
orfeatureMask
toFalse
for those items failing analytical criteria.Similar to
updateMasks()
, without blankThreshold or artifactual filtering. Note
To avoid reintroducing items manually excluded, this method only ever sets items to
False
, therefore if you wish to move from more stringent criteria to a less stringent set, you will need to reset the mask to allTrue
usinginitialiseMasks()
.Parameters: - filterSamples (bool) – If
False
don’t modify sampleMask - filterFeatures (bool) – If
False
don’t modify featureMask - sampleTypes (SampleType) – List of types of samples to retain
- assayRoles (AssayRole) – List of assays roles to retain
- quantificationTypes (QuantificationType) – List of quantification types to retain
- calibrationMethods (CalibrationMethod) – List of calibration methods to retain
Raises: - TypeError – if sampleTypes is not a list
- TypeError – if sampleTypes are not a SampleType enum
- TypeError – if assayRoles is not a list
- TypeError – if assayRoles are not an AssayRole enum
- TypeError – if quantificationTypes is not a list
- TypeError – if quantificationTypes are not a QuantificationType enum
- TypeError – if calibrationMethods is not a list
- TypeError – if calibrationMethods are not a CalibrationMethod enum
-
addSampleInfo
(descriptionFormat=None, filePath=None, **kwargs)¶ Load additional metadata and map it into the
sampleMetadata
table. Possible options:
- ‘NPC Subject Info’ Map subject metadata from an NPC sample manifest file (format defined in ‘PCSOP.082’)
- ‘Raw Data’ Extract analytical parameters from raw data files
- ‘ISATAB’ ISATAB study designs
- ‘Filenames’ Parse sample information out of the filenames, based on the named capture groups in the regex passed in filenameSpec
- ‘Basic CSV’ Join the sampleMetadata table with the data in the csv file at filePath, matching on the ‘Sample File Name’ column in both
- ‘Batches’ Interpolate batch numbers for samples between those with defined batch numbers, based on sample acquisition times
Parameters: - descriptionFormat (str) – Format of metadata to be added
- filePath (str) – Path to the additional data to be added
- filenameSpec (None or str) – Only used if descriptionFormat is ‘Filenames’. A regular expression that extracts sample-type information into the following named capture groups: ‘fileName’, ‘baseName’, ‘study’, ‘chromatography’, ‘ionisation’, ‘instrument’, ‘groupingKind’, ‘groupingNo’, ‘injectionKind’, ‘injectionNo’, ‘reference’, ‘exclusion’, ‘reruns’, ‘extraInjections’, ‘exclusion2’. If None is passed, the filenameSpec key in Attributes, loaded from the SOP json, is used
Raises: NotImplementedError – if the descriptionFormat is not understood
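The ‘Basic CSV’ option amounts to a left join on the ‘Sample File Name’ column. A rough illustration with pandas, standing in for the internal implementation, with made-up file names and an assumed extra column:

```python
import pandas as pd

# Existing sample metadata, keyed by 'Sample File Name'
sampleMetadata = pd.DataFrame({'Sample File Name': ['run01', 'run02', 'run03'],
                               'Acquired Time': ['09:00', '09:15', '09:30']})

# Contents of the csv file passed via filePath (hypothetical 'Age' column)
basicCSV = pd.DataFrame({'Sample File Name': ['run01', 'run02'],
                         'Age': [34, 58]})

# Join the two tables, matching on the 'Sample File Name' column in both;
# samples absent from the csv keep their row, with missing values filled as NaN
merged = sampleMetadata.merge(basicCSV, how='left', on='Sample File Name')
print(merged.columns.tolist())   # ['Sample File Name', 'Acquired Time', 'Age']
```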
-
accuracyPrecision
(onlyPrecisionReferences=False)¶ Return precision (percent RSDs) and accuracy for each SampleType and each unique concentration. Statistics are grouped by SampleType, Feature and unique concentration.
Parameters: - dataset (TargetedDataset) – TargetedDataset object to generate the accuracy and precision for
- onlyPrecisionReferences (bool) – If
True
, only use samples with the AssayRole PrecisionReference
Returns: Dict of accuracy and precision dicts for each group.
Return type: dict(str:dict(str:pandas.DataFrame))
Raises: TypeError – if dataset is not an instance of TargetedDataset
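As a rough illustration of the two statistics (not the nPYc-Toolbox call itself), percent RSD and accuracy for one feature at one nominal concentration could be computed as follows; the definition of accuracy as the measured mean expressed as a percentage of the nominal value is an assumption here:

```python
import numpy as np

# Repeated measurements of one feature at a nominal concentration of 10 units
measured = np.array([9.8, 10.1, 10.3, 9.9])
expected = 10.0

# Precision: percent relative standard deviation (sample std / mean * 100)
rsd = 100 * measured.std(ddof=1) / measured.mean()

# Accuracy: measured mean as a percentage of the nominal concentration
accuracy = 100 * measured.mean() / expected

print(round(rsd, 2), round(accuracy, 2))
```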