Datasets¶
The nPYc-Toolbox is built around creating an object for each imported dataset. This object contains the metabolic profiling data itself, alongside all associated sample and feature metadata; various methods for generating, reporting and plotting important quality control parameters; and methods for pre-processing such as filtering poor quality features or correcting trends in batch and run-order.
The first step in creating an nPYc-Toolbox object is to import the acquired data, creating a Dataset specific to the data type:
- MSDataset: for LC-MS profiling data
- NMRDataset: for NMR profiling data
- TargetedDataset: for targeted datasets
For example, to import LC-MS data into an MSDataset object:
msData = nPYc.MSDataset('path to data')
Depending on the data type, the Dataset can be set up directly from the raw data, from common interchange formats, or from the outputs of popular data-processing tools. The supported data types are described in more detail in the data specific sections below.
When importing the data, default parameters are loaded from the Configuration Files; these range from the data-type specific (for example, the number of points to interpolate NMR data onto) to the general (for example, the format in which to save figures). These parameters are saved in the Attributes dictionary and used throughout the subsequent implementation of the pipeline.
For example, for NMR data, the nPYc-Toolbox contains two default configuration files, ‘GenericNMRurine’ and ‘GenericNMRblood’, for urine and blood datasets respectively. To import NMR spectra from urine samples, the sop parameter would therefore be:
nmrData = nPYc.NMRDataset('path to data', sop='GenericNMRurine')
A full list of the parameters for each dataset type is given in the Built-in Configuration SOPs. If different values are required, these can be modified directly in the appropriate SOP file, or set by the user by modifying the required ‘Attribute’, either at import or by direct modification later in the pipeline. For example, to set the line width threshold (LWFailThreshold) used to flag NMR spectra whose line widths do not meet this value:
# EITHER, set the required value (here 0.8) at import
nmrData = nPYc.NMRDataset(rawDataPath, pulseProgram='noesygppr1d', LWFailThreshold=0.8)
# OR, set the *Attribute* directly (after importing nmrData)
nmrData.Attributes['LWFailThreshold'] = 0.8
Dataset objects have several key attributes, including:
- sampleMetadata: A \(n\) × \(p\) pandas dataframe of sample identifiers and sample-associated metadata (each row corresponds to a row of intensityData)
- featureMetadata: A \(m\) × \(q\) pandas dataframe of feature identifiers and feature-associated metadata (each row corresponds to a column of intensityData)
- intensityData: A \(n\) × \(m\) numpy matrix of measurements, where each element is the intensity of a specific feature (column) measured in a specific sample (row)
- sampleMask: A \(n\) element numpy boolean vector, where True and False flag samples for inclusion or exclusion respectively
- featureMask: A \(m\) element numpy boolean vector, where True and False flag features for inclusion or exclusion respectively
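The relationships between these attributes can be made concrete with plain numpy and pandas (a minimal sketch of the expected shapes, not using the toolbox itself; all variable contents here are invented):

```python
import numpy as np
import pandas as pd

# Three samples (rows) by four features (columns), mirroring intensityData
intensityData = np.array([[1.0, 2.0, 3.0, 4.0],
                          [5.0, 6.0, 7.0, 8.0],
                          [9.0, 1.0, 2.0, 3.0]])

# One metadata row per sample, and one per feature
sampleMetadata = pd.DataFrame({'Sample File Name': ['s1', 's2', 's3']})
featureMetadata = pd.DataFrame({'Feature Name': ['f1', 'f2', 'f3', 'f4']})

# Boolean masks flag rows/columns for inclusion (True) or exclusion (False)
sampleMask = np.ones(intensityData.shape[0], dtype=bool)
featureMask = np.ones(intensityData.shape[1], dtype=bool)

# The dimensions are linked: n samples, m features
assert sampleMetadata.shape[0] == intensityData.shape[0]
assert featureMetadata.shape[0] == intensityData.shape[1]
```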
Once created, the number of features or samples a dataset contains can be queried by running:
dataset.noFeatures
dataset.noSamples
Or directly inspect the sample or feature metadata, and the raw measurements:
dataset.sampleMetadata
dataset.featureMetadata
dataset.intensityData
For more details on using the sample and feature masks see Sample and Feature Masks.
It is possible to add additional study design parameters or sample metadata into the Dataset using the addSampleInfo()
method (see Sample Metadata for details).
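The effect of the ‘Basic CSV’ option, for instance, can be sketched as a left join in plain pandas (a simplified illustration of the matching logic on ‘Sample File Name’, not the toolbox code itself; file names and columns are invented):

```python
import pandas as pd

# Existing sampleMetadata, as built at import
sampleMetadata = pd.DataFrame({'Sample File Name': ['run01', 'run02', 'run03']})

# Additional metadata supplied in a CSV, keyed on the same column
extraInfo = pd.DataFrame({'Sample File Name': ['run01', 'run03'],
                          'Subject ID': ['subj-A', 'subj-B']})

# Left join: every acquired sample is retained, matched rows gain metadata,
# unmatched rows receive NaN
sampleMetadata = sampleMetadata.merge(extraInfo, how='left',
                                      on='Sample File Name')
```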
For full method specific details see Installation and Tutorials.
LC-MS Datasets¶
The toolbox is designed to be agnostic to the source of peak-picked profiling datasets, currently supporting the outputs of XCMS (Tautenhahn et al [1]), Bruker Metaboscape, and Progenesis QI, but is simply expandable to data from other peak-pickers. Current best practices in quality control of profiling LC-MS data (Want et al [2], Dunn et al [3], Lewis et al [4]) are applied, including utilising repeated injections of Study Reference samples to calculate the analytical precision of each feature's measurement (Relative Standard Deviation), and a serial dilution of the reference sample to assess the linearity of response (Correlation to Dilution); for full details see Feature Summary Report: LC-MS Datasets.
Study Reference samples are also used (in conjunction with Long-Term Reference samples if available) to assess and correct trends in batch and run-order (Batch & Run-Order Correction). Additionally, both RSD and correlation to dilution are used to filter features to retain only those measured with a high precision and accuracy (Sample and Feature Masks).
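Both QC metrics are straightforward to sketch in numpy (an illustration of the underlying calculations under simplified assumptions; the toolbox's own implementation may differ in detail, and all values below are invented):

```python
import numpy as np

# Intensities of one feature across repeated Study Reference injections
referenceIntensities = np.array([102.0, 98.0, 101.0, 99.0, 100.0])

# Relative Standard Deviation, expressed as a percentage of the mean
rsd = 100 * referenceIntensities.std(ddof=1) / referenceIntensities.mean()

# Intensities of the same feature across a serial dilution series
dilutionFactors = np.array([1.0, 20.0, 40.0, 60.0, 80.0, 100.0])
dilutionIntensities = np.array([1.1, 20.5, 39.0, 61.0, 79.0, 101.0])

# Pearson correlation between expected dilution and observed intensity;
# a well-behaved feature responds linearly, giving a value close to 1
corrToDilution = np.corrcoef(dilutionFactors, dilutionIntensities)[0, 1]
```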
NMR Datasets¶
The nPYc-Toolbox supports input of processed Bruker GmbH format 1D experiments. Upon import, each spectrum’s chemical shift axis is calibrated to a reference peak (Pearce et al [5]), and all spectra are interpolated onto a common scale, with full parameters as per the NMRDataset Objects configuration SOPs. The toolbox supports automated calculation of the quality control metrics described previously (Dona et al [6]), including assessments of line-width, water suppression quality, and baseline stability; for full details see Feature Summary Report: NMR Datasets.
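The interpolation step can be sketched with numpy (a simplified illustration of resampling spectra onto a shared chemical shift axis; not the toolbox's internal code, and the axes and peak below are invented):

```python
import numpy as np

# Two spectra acquired with slightly different chemical shift (ppm) axes,
# each containing a narrow peak at 3.0 ppm
ppm1 = np.linspace(-1.0, 10.0, 1100)
ppm2 = np.linspace(-0.95, 10.05, 1024)
spectrum1 = np.exp(-((ppm1 - 3.0) ** 2) / 0.001)
spectrum2 = np.exp(-((ppm2 - 3.0) ** 2) / 0.001)

# Resample both onto a single common axis so every spectrum shares the
# same feature grid (np.interp expects ascending x-coordinates)
commonPpm = np.linspace(-1.0, 10.0, 2048)
aligned = np.vstack([np.interp(commonPpm, ppm1, spectrum1),
                     np.interp(commonPpm, ppm2, spectrum2)])
```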
Targeted Datasets¶
The TargetedDataset represents quantitative datasets where compounds are already identified, the exactitude of the quantification can be established, units are known, and calibration curves or internal standards are employed (Lee et al [7]). It implements a set of reports and data consistency checks to assist analysts in assessing the presence of batch effects, applying limits of quantification (LOQ), standardising the linearity range over multiple batches, and determining and visualising the accuracy and precision of each measurement; for more details see Feature Summary Report: NMR Targeted Datasets.
The nPYc-Toolbox supports input of both MS-derived targeted datasets (tutorial and further documentation in progress), and two Bruker proprietary human biofluid quantification platforms (IVDr algorithms) that generate targeted outputs from the NMR profiling data, BI-LISA for quantification of Lipoproteins (blood samples only) and BIQUANT-PS and BIQUANT-UR for small molecule metabolites (for blood and urine respectively).
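Applying limits of quantification can be sketched as masking measurements that fall outside the quantifiable range (an illustration of the idea only, not the toolbox's implementation; all concentrations and limits below are invented):

```python
import numpy as np

# Measured concentrations for four samples × three compounds
concentrations = np.array([[0.5, 12.0, 250.0],
                           [3.0, 45.0, 9.0],
                           [0.1, 80.0, 400.0],
                           [2.0, 5.0, 30.0]])

# Per-compound lower and upper limits of quantification
lloq = np.array([1.0, 10.0, 20.0])
uloq = np.array([100.0, 100.0, 300.0])

# Replace values outside [LLOQ, ULOQ] with NaN, flagging them as
# not reliably quantifiable
masked = np.where((concentrations >= lloq) & (concentrations <= uloq),
                  concentrations, np.nan)
```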
Dataset Specific Syntax and Parameters¶
The main function parameters (which may be of interest to advanced users) are as follows:
Note, the Dataset object serves as a common parent to MSDataset, TargetedDataset, and NMRDataset, and should not typically be instantiated independently.
- class nPYc.objects.Dataset(sop='Generic', sopPath=None, **kwargs)¶
Base class for nPYc dataset objects.
Parameters: - sop (str) – Load configuration parameters from specified SOP JSON file
- sopPath – By default SOPs are loaded from the nPYc/StudyDesigns/SOP/ directory; if not None, the directory specified in sopPath= will be searched before the builtin SOP directory.
- featureMetadata = None¶
\(m\) × \(q\) pandas dataframe of feature identifiers and metadata.
The featureMetadata table can include any datatype that can be placed in a pandas cell; however, the toolbox assumes certain prerequisites on the following columns in order to function:
- Feature Name (str or float): ID of the feature measured in this column. Each ‘Feature Name’ must be unique in the table. If ‘Feature Name’ is numeric, the columns should be sorted in ascending or descending order.
- sampleMetadata = None¶
\(n\) × \(p\) dataframe of sample identifiers and metadata.
The sampleMetadata table can include any datatype that can be placed in a pandas cell; however, the toolbox assumes certain prerequisites on the following columns in order to function:
- Sample ID (str): ID of the sampling event generating this sample
- AssayRole (AssayRole): Defines the role of this assay
- SampleType (SampleType): Defines the type of sample acquired
- Sample File Name (str): Unique file name for the analytical data
- Sample Base Name (str): Common identifier that links analytical data to the Sample ID
- Dilution (float): Where AssayRole is LinearityReference, the expected abundance is indicated here
- Batch (int): Acquisition batch
- Correction Batch (int): When detecting and correcting for batch and run-order effects, run-order effects are characterised within samples sharing the same Correction Batch, while batch effects are detected between distinct values
- Acquired Time (datetime.datetime): Date and time of acquisition of raw data
- Run order (int): Order of sample acquisition
- Exclusion Details (str): Details of reasoning if marked for exclusion
- Metadata Available (bool): Records which samples had metadata provided with the .addSampleInfo() method
- featureMask = None¶
\(m\) element vector, with True representing features to be included in analysis, and False those to be excluded
- sampleMask = None¶
\(n\) element vector, with True representing samples to be included in analysis, and False those to be excluded
- AnalyticalPlatform = None¶
VariableType enum specifying the type of data represented.
- Attributes = None¶
Dictionary of object configuration attributes, including those loaded from SOP files.
Defined attributes are as follows:
- ‘dpi’ (positive int): Raster resolution when plotting figures
- ‘figureSize’ (positive (float, float)): Size to plot figures
- ‘figureFormat’ (str): Format to save figures in
- ‘histBins’ (positive int): Number of bins to use when drawing histograms
- ‘Feature Names’ (column in featureMetadata): ID of the primary feature name
- intensityData¶
\(n\) × \(m\) numpy matrix of measurements
- noSamples¶
Returns: Number of samples in the dataset (n)
Return type: int
- noFeatures¶
Returns: Number of features in the dataset (m)
Return type: int
- log¶
Return log entries as a string.
- name¶
Returns or sets the name of the dataset; name must be a string
- Normalisation¶
Normaliser object that transforms the measurements in intensityData.
- validateObject(verbose=True, raiseError=False, raiseWarning=True)¶
Checks that all the attributes specified in the class definition are present and of the required class and/or values. Checks for attribute existence and type, and for the existence of mandatory columns, but does not check the column values (type or uniqueness). If ‘sampleMetadataExcluded’, ‘intensityDataExcluded’, ‘featureMetadataExcluded’ or ‘excludedFlag’ exist, their existence and the number of exclusions (based on ‘sampleMetadataExcluded’) are checked.
Parameters: - verbose (bool) – if True the result of each check is printed (default True)
- raiseError (bool) – if True an error is raised when a check fails and the validation is interrupted (default False)
- raiseWarning (bool) – if True a warning is raised when a check fails
Returns: True if the Object conforms to a basic Dataset
Return type: bool
Raises: - TypeError – if the Object class is wrong
- AttributeError – if self.Attributes does not exist
- TypeError – if self.Attributes is not a dict
- AttributeError – if self.Attributes[‘Log’] does not exist
- TypeError – if self.Attributes[‘Log’] is not a list
- AttributeError – if self.Attributes[‘dpi’] does not exist
- TypeError – if self.Attributes[‘dpi’] is not an int
- AttributeError – if self.Attributes[‘figureSize’] does not exist
- TypeError – if self.Attributes[‘figureSize’] is not a list
- ValueError – if self.Attributes[‘figureSize’] is not of length 2
- TypeError – if self.Attributes[‘figureSize’][0] is not an int or float
- TypeError – if self.Attributes[‘figureSize’][1] is not an int or float
- AttributeError – if self.Attributes[‘figureFormat’] does not exist
- TypeError – if self.Attributes[‘figureFormat’] is not a str
- AttributeError – if self.Attributes[‘histBins’] does not exist
- TypeError – if self.Attributes[‘histBins’] is not an int
- AttributeError – if self.Attributes[‘noFiles’] does not exist
- TypeError – if self.Attributes[‘noFiles’] is not an int
- AttributeError – if self.Attributes[‘quantiles’] does not exist
- TypeError – if self.Attributes[‘quantiles’] is not a list
- ValueError – if self.Attributes[‘quantiles’] is not of length 2
- TypeError – if self.Attributes[‘quantiles’][0] is not an int or float
- TypeError – if self.Attributes[‘quantiles’][1] is not an int or float
- AttributeError – if self.Attributes[‘sampleMetadataNotExported’] does not exist
- TypeError – if self.Attributes[‘sampleMetadataNotExported’] is not a list
- AttributeError – if self.Attributes[‘featureMetadataNotExported’] does not exist
- TypeError – if self.Attributes[‘featureMetadataNotExported’] is not a list
- AttributeError – if self.Attributes[‘analyticalMeasurements’] does not exist
- TypeError – if self.Attributes[‘analyticalMeasurements’] is not a dict
- AttributeError – if self.Attributes[‘excludeFromPlotting’] does not exist
- TypeError – if self.Attributes[‘excludeFromPlotting’] is not a list
- AttributeError – if self.VariableType does not exist
- AttributeError – if self._Normalisation does not exist
- TypeError – if self._Normalisation is not the Normaliser ABC
- AttributeError – if self._name does not exist
- TypeError – if self._name is not a str
- AttributeError – if self._intensityData does not exist
- TypeError – if self._intensityData is not a numpy.ndarray
- AttributeError – if self.sampleMetadata does not exist
- TypeError – if self.sampleMetadata is not a pandas.DataFrame
- LookupError – if self.sampleMetadata does not have a Sample File Name column
- LookupError – if self.sampleMetadata does not have an AssayRole column
- LookupError – if self.sampleMetadata does not have a SampleType column
- LookupError – if self.sampleMetadata does not have a Dilution column
- LookupError – if self.sampleMetadata does not have a Batch column
- LookupError – if self.sampleMetadata does not have a Correction Batch column
- LookupError – if self.sampleMetadata does not have a Run Order column
- LookupError – if self.sampleMetadata does not have a Sample ID column
- LookupError – if self.sampleMetadata does not have a Sample Base Name column
- LookupError – if self.sampleMetadata does not have an Acquired Time column
- LookupError – if self.sampleMetadata does not have an Exclusion Details column
- AttributeError – if self.featureMetadata does not exist
- TypeError – if self.featureMetadata is not a pandas.DataFrame
- LookupError – if self.featureMetadata does not have a Feature Name column
- AttributeError – if self.sampleMask does not exist
- TypeError – if self.sampleMask is not a numpy.ndarray
- ValueError – if the elements of self.sampleMask are not bool
- AttributeError – if self.featureMask does not exist
- TypeError – if self.featureMask is not a numpy.ndarray
- ValueError – if the elements of self.featureMask are not bool
- AttributeError – if self.sampleMetadataExcluded does not exist
- TypeError – if self.sampleMetadataExcluded is not a list
- AttributeError – if self.intensityDataExcluded does not exist
- TypeError – if self.intensityDataExcluded is not a list
- ValueError – if self.intensityDataExcluded does not have the same number of exclusions as self.sampleMetadataExcluded
- AttributeError – if self.featureMetadataExcluded does not exist
- TypeError – if self.featureMetadataExcluded is not a list
- ValueError – if self.featureMetadataExcluded does not have the same number of exclusions as self.sampleMetadataExcluded
- AttributeError – if self.excludedFlag does not exist
- TypeError – if self.excludedFlag is not a list
- ValueError – if self.excludedFlag does not have the same number of exclusions as self.sampleMetadataExcluded
- initialiseMasks()¶
Re-initialise featureMask and sampleMask to match the current dimensions of intensityData, and include all samples.
- updateMasks(filterSamples=True, filterFeatures=True, sampleTypes=[<SampleType.StudySample>, <SampleType.StudyPool>, <SampleType.ExternalReference>, <SampleType.MethodReference>, <SampleType.ProceduralBlank>], assayRoles=[<AssayRole.Assay>, <AssayRole.PrecisionReference>, <AssayRole.LinearityReference>, <AssayRole.Blank>], **kwargs)¶
Update sampleMask and featureMask according to parameters. updateMasks() sets sampleMask or featureMask to False for those items failing analytical criteria.
Note: to avoid reintroducing items manually excluded, this method only ever sets items to False; therefore, if you wish to move from more stringent criteria to a less stringent set, you will need to reset the mask to all True using initialiseMasks().
Parameters: - filterSamples (bool) – If False, don’t modify sampleMask
- filterFeatures (bool) – If False, don’t modify featureMask
- sampleTypes (SampleType) – List of types of samples to retain
- assayRoles (AssayRole) – List of assay roles to retain
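The "only ever sets items to False" behaviour amounts to combining the existing mask with the new criteria by logical AND (a sketch of the semantics in numpy; not the toolbox's internal code):

```python
import numpy as np

# Current sampleMask: sample 1 was manually excluded earlier
sampleMask = np.array([True, False, True, True])

# New analytical criteria: samples 0 and 1 pass, 2 and 3 fail
passesCriteria = np.array([True, True, False, False])

# AND-ing the masks means a previously excluded sample can never
# be reintroduced by a later, less stringent update
sampleMask &= passesCriteria
# sampleMask is now [True, False, False, False]
```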
- applyMasks()¶
Permanently delete elements masked (those set to False) in sampleMask and featureMask, from featureMetadata, sampleMetadata, and intensityData.
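In effect this is boolean indexing along both axes of the data matrix (a numpy sketch of the operation; not the toolbox's internal code):

```python
import numpy as np

intensityData = np.arange(12.0).reshape(3, 4)  # 3 samples × 4 features
sampleMask = np.array([True, False, True])     # drop sample 1
featureMask = np.array([True, True, False, True])  # drop feature 2

# Keep only unmasked rows (samples) and columns (features)
filtered = intensityData[sampleMask, :][:, featureMask]
# filtered.shape == (2, 3)
```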
- addSampleInfo(descriptionFormat=None, filePath=None, filetype=None, **kwargs)¶
Load additional metadata and map it in to the sampleMetadata table.
Possible options:
- ‘Basic CSV’: Joins the sampleMetadata table with the data in the csv file at filePath=, matching on the ‘Sample File Name’ column in both (see Sample Metadata).
- ‘Filenames’: Parses sample information out of the filenames, based on the named capture groups in the regex passed in filenamespec
- ‘Raw Data’: Extract analytical parameters from raw data files
- ‘ISATAB’: ISATAB study designs
Parameters: - descriptionFormat (str) – Format of metadata to be added
- filePath (str) – Path to the additional data to be added
Raises: NotImplementedError – if the descriptionFormat is not understood
- addFeatureInfo(filePath=None, descriptionFormat=None, featureId=None, **kwargs)¶
Load additional metadata and map it in to the featureMetadata table.
Possible options:
- ‘Reference Ranges’: JSON file specifying upper and lower reference ranges for a feature.
Parameters: - filePath (str) – Path to the additional data to be added
- descriptionFormat (str) –
- featureId (str) – Unique feature ID field in the metadata file provided, to match with the internal Feature Name
Raises: NotImplementedError – if the descriptionFormat is not understood
- excludeSamples(sampleList, on='Sample File Name', message='User Excluded')¶
Sets the sampleMask for the samples listed in sampleList to False to mask them from the dataset.
Parameters: - sampleList (list) – A list of sample IDs to be excluded
- on (str) – name of the column in sampleMetadata to match sampleList against, defaults to ‘Sample File Name’
- message (str) – append this message to the ‘Exclusion Details’ field for each sample excluded, defaults to ‘User Excluded’
Returns: a list of IDs passed in sampleList that could not be matched against the sample IDs present
Return type: list
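The matching behaviour, including the returned list of unmatched IDs, can be sketched in plain Python and pandas (an illustration of the semantics only; the sample names are invented):

```python
import numpy as np
import pandas as pd

sampleMetadata = pd.DataFrame({'Sample File Name': ['run01', 'run02', 'run03']})
sampleMask = np.array([True, True, True])

toExclude = ['run02', 'run99']  # 'run99' is not in the dataset

# Mask matched samples; collect the IDs that could not be matched
matched = sampleMetadata['Sample File Name'].isin(toExclude).values
sampleMask[matched] = False
unmatched = [s for s in toExclude
             if s not in sampleMetadata['Sample File Name'].values]
# unmatched == ['run99'], sampleMask == [True, False, True]
```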
- excludeFeatures(featureList, on='Feature Name', message='User Excluded')¶
Masks the features listed in featureList from the dataset.
Parameters: - featureList (list) – A list of feature IDs to be excluded
- on (str) – name of the column in featureMetadata to match featureList against, defaults to ‘Feature Name’
- message (str) – append this message to the ‘Exclusion Details’ field for each feature excluded, defaults to ‘User Excluded’
Returns: A list of IDs passed in featureList that could not be matched against the feature IDs present.
Return type: list
- exportDataset(destinationPath='.', saveFormat='CSV', isaDetailsDict={}, withExclusions=True, escapeDelimiters=False, filterMetadata=True)¶
Export the dataset object in a variety of formats for import into other software; the export is named according to the name attribute of the Dataset object.
Possible save formats are:
- CSV: Basic CSV output; featureMetadata, sampleMetadata and intensityData are written to three separate CSV files in destinationPath
- UnifiedCSV: Exports featureMetadata, sampleMetadata and intensityData concatenated into a single CSV file
- ISATAB: Exports the sampleMetadata in the ISATAB format
Parameters: - destinationPath (str) – Save data into the directory specified here
- saveFormat (str) – File format for saved data, defaults to CSV
- isaDetailsDict (dict) – Contains several key: value pairs required for exporting ISATAB. isaDetailsDict should have the format:
isaDetailsDict = {‘investigation_identifier’: “i1”, ‘investigation_title’: “Give it a title”, ‘investigation_description’: “Add a description”, ‘investigation_submission_date’: “2016-11-03”, ‘investigation_public_release_date’: “2016-11-03”, ‘first_name’: “Noureddin”, ‘last_name’: “Sadawi”, ‘affiliation’: “University”, ‘study_filename’: “my_ms_study”, ‘study_material_type’: “Serum”, ‘study_identifier’: “s1”, ‘study_title’: “Give the study a title”, ‘study_description’: “Add study description”, ‘study_submission_date’: “2016-11-03”, ‘study_public_release_date’: “2016-11-03”, ‘assay_filename’: “my_ms_assay”}
- withExclusions (bool) – If True, masked features and samples will be excluded
- escapeDelimiters (bool) – If True, remove characters commonly used as delimiters in csv files from metadata
- filterMetadata (bool) – If True, does not export the sampleMetadata and featureMetadata columns listed in self.Attributes[‘sampleMetadataNotExported’] and self.Attributes[‘featureMetadataNotExported’]
Raises: ValueError – if saveFormat is not understood
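The UnifiedCSV layout can be sketched as concatenating the three tables into one frame, one row per sample (an illustration of the shape only, not the exact exporter output; all values are invented):

```python
import numpy as np
import pandas as pd

sampleMetadata = pd.DataFrame({'Sample File Name': ['s1', 's2']})
featureMetadata = pd.DataFrame({'Feature Name': ['f1', 'f2', 'f3']})
intensityData = np.array([[1.0, 2.0, 3.0],
                          [4.0, 5.0, 6.0]])

# One row per sample: sample metadata columns, then one column per feature
intensities = pd.DataFrame(intensityData,
                           columns=featureMetadata['Feature Name'])
unified = pd.concat([sampleMetadata, intensities], axis=1)
# unified has 2 rows and 1 + 3 columns
```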
- getFeatures(featureIDs, by=None, useMasks=True)¶
Get a feature or list of features by name or ranges.
If VariableType is Discrete, getFeatures() expects either a single value or a list of values, and matching features are returned. If VariableType is Spectral, pass either a single (min, max) tuple or a list of them; the features returned will be a slice of the combined ranges. If the ranges passed overlap, the union will be returned.
Parameters: - featureIDs – A single feature ID or a list of feature IDs to return
- by (None or str) – Column in featureMetadata to search in; if None, use the column defined in Attributes[‘Feature Names’]
Returns: (featureMetadata, intensityData)
Return type: (pandas.Dataframe, numpy.ndarray)
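For spectral data, selecting by (min, max) ranges and taking the union of overlapping ranges can be sketched in numpy (the selection semantics only; the axis and ranges are invented):

```python
import numpy as np

ppm = np.arange(0, 101) / 10.0  # spectral axis, 0.1 ppm steps from 0 to 10

# Two overlapping ranges; their union covers 2.0 to 5.0 ppm
ranges = [(2.0, 4.0), (3.5, 5.0)]

# OR together the per-range masks to form the union of the slices
mask = np.zeros_like(ppm, dtype=bool)
for lo, hi in ranges:
    mask |= (ppm >= lo) & (ppm <= hi)

selected = ppm[mask]
# selected spans 2.0 to 5.0 inclusive
```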
- class nPYc.objects.MSDataset(datapath, fileType='QI', sop='GenericMS', **kwargs)¶
MSDataset extends Dataset to represent both peak-picked LC- or DI-MS datasets (discrete variables), and continuum mode (spectral) DI-MS datasets.
Objects can be initialised from a variety of common data formats, currently peak-picked data from Progenesis QI or XCMS, and targeted Biocrates datasets.
- Progenesis QI
- QI import operates on csv files exported via the ‘Export Compound Measurements’ menu option in QI. Import requires the presence of both normalised and raw datasets, but will only import the raw measurements.
- XCMS
- XCMS import operates on the csv files generated by XCMS with the peakTable() method. By default, the csv is expected to have 14 columns of feature parameters, with the intensity values for the first sample coming in the 15th column. However, the number of columns to skip is dataset dependent and can be set with the noFeatureParams= keyword argument. This method assumes that the retention time value in the XCMS exported peak list is specified in seconds.
- XCMSOnline
- XCMS Online download output supplies an unannotated and an annotated xlsx file, stored by default in the “XCMS results” folder. By default, the table is expected to have 10 columns of feature parameters, with the intensity values for the first sample coming in the 11th column. However, the number of columns to skip is dataset dependent and can be set with the noFeatureParams= keyword argument.
- MZmine
- MZmine2: import operates on csv files exported via the ‘Export to CSV file’ menu option. The field separator should be comma (“,”) and all export elements should be chosen for export. MZmine3: choose the ‘Export feature list’ -> ‘CSV (legacy MZmine 2)’ menu option; again, the field separator should be comma and all export elements should be chosen for export.
- MS-DIAL
- MS-DIAL import operates on the .txt (MSP) files exported via the ‘Export -> Alignment result’ menu option. Export options to choose are preferably ‘Raw data matrix (Area)’ or ‘Raw data matrix (Height)’. This method will also import the accompanying experimental metadata information such as File Type, Injection Order and Batch ID.
- Biocrates
- Operates on spreadsheets exported from Biocrates MetIDQ. By default loads data from the sheet named ‘Data Export’; this may be overridden with the sheetName= argument. If the number of sample metadata columns differs from the default, this can be overridden with the noSampleParams= argument.
- nPYc
- nPYc import operates on the csv file generated using the nPYc exportDataset function (the ‘combinedData’ file). This reimport function is intended for further filtering or normalisation without having to run the whole process again. Note that metadata does not need to be imported again.
- correlationToDilution¶
Returns the correlation of features to dilution as calculated on samples marked as ‘Dilution Series’ in sampleMetadata, with dilution expressed in ‘Dilution’.
Returns: Vector of feature correlations to dilution
Return type: numpy.ndarray
- artifactualLinkageMatrix¶
Gets overlapping artifactual features.
- rsdSP¶
Returns percentage relative standard deviations for each feature in the dataset, calculated on samples with the Assay Role PrecisionReference and Sample Type StudyPool in sampleMetadata.
Returns: Vector of feature RSDs
Return type: numpy.ndarray
- rsdSS¶
Returns percentage relative standard deviations for each feature in the dataset, calculated on samples with the Assay Role Assay and Sample Type StudySample in sampleMetadata.
Returns: Vector of feature RSDs
Return type: numpy.ndarray
- applyMasks()¶
Permanently delete elements masked (those set to False) in sampleMask and featureMask, from featureMetadata, sampleMetadata, and intensityData. Resets the feature linkage matrix and feature correlations.
- updateMasks(filterSamples=True, filterFeatures=True, sampleTypes=[<SampleType.StudySample>, <SampleType.StudyPool>, <SampleType.ExternalReference>, <SampleType.MethodReference>, <SampleType.ProceduralBlank>], assayRoles=[<AssayRole.Assay>, <AssayRole.PrecisionReference>, <AssayRole.LinearityReference>, <AssayRole.Blank>], featureFilters={'artifactualFilter': False, 'blankFilter': False, 'correlationToDilutionFilter': True, 'rsdFilter': True, 'varianceRatioFilter': True}, **kwargs)¶
Update sampleMask and featureMask according to QC parameters. updateMasks() sets sampleMask or featureMask to False for those items failing analytical criteria.
Note: to avoid reintroducing items manually excluded, this method only ever sets items to False; therefore, if you wish to move from more stringent criteria to a less stringent set, you will need to reset the mask to all True using initialiseMasks().
Parameters: - filterSamples (bool) – If False, don’t modify sampleMask
- filterFeatures (bool) – If False, don’t modify featureMask
- sampleTypes (SampleType) – List of types of samples to retain
- assayRoles (AssayRole) – List of assay roles to retain
- correlationThreshold (None or float) – Mask features with a correlation to dilution below this value. If None, use the value from Attributes[‘corrThreshold’]
- rsdThreshold (None or float) – Mask features with an RSD above this value. If None, use the value from Attributes[‘rsdThreshold’]
- varianceRatio (None or float) – Mask features where the RSD measured in study samples is below that measured in study reference samples multiplied by varianceRatio
- withArtifactualFiltering (None or bool) – If None, use the value from Attributes[‘artifactualFilter’]. If False, doesn’t apply artifactual filtering. If Attributes[‘artifactualFilter’] is set to False, artifactual filtering will not take place even if withArtifactualFiltering is set to True.
- deltaMzArtifactual (None or float) – Maximum allowed m/z distance between two grouped features. If None, use the value from Attributes[‘deltaMzArtifactual’]
- overlapThresholdArtifactual (None or float) – Minimum peak overlap between two grouped features. If None, use the value from Attributes[‘overlapThresholdArtifactual’]
- corrThresholdArtifactual (None or float) – Minimum correlation between two grouped features. If None, use the value from Attributes[‘corrThresholdArtifactual’]
- blankThreshold (None, False, or float) – Mask features whose median intensity falls below blankThreshold × the level in the blank. If False, do not filter; if None, use the cutoff from Attributes[‘blankThreshold’]; otherwise use the cutoff scaling factor provided
- saveFeatureMask()¶
Updates featureMask and saves it as ‘Passing Selection’ in self.featureMetadata
- addSampleInfo(descriptionFormat=None, filePath=None, filenameSpec=None, filetype='Waters .raw', **kwargs)¶
Load additional metadata and map it in to the sampleMetadata table.
Possible options:
- ‘NPC LIMS’: NPC LIMS files mapping file names of raw analytical data to sample IDs
- ‘NPC Subject Info’: Map subject metadata from a NPC sample manifest file (format defined in ‘PCSOP.082’)
- ‘Raw Data’: Extract analytical parameters from raw data files
- ‘ISATAB’: ISATAB study designs
- ‘Filenames’: Parses sample information out of the filenames, based on the named capture groups in the regex passed in filenamespec
- ‘Basic CSV’: Joins the sampleMetadata table with the data in the csv file at filePath=, matching on the ‘Sample File Name’ column in both.
Parameters: - descriptionFormat (str) – Format of metadata to be added
- filePath (str) – Path to the additional data to be added
- filenameSpec (None or str) – Only used if descriptionFormat is ‘Filenames’. A regular expression that extracts sample-type information into the following named capture groups: ‘fileName’, ‘baseName’, ‘study’, ‘chromatography’, ‘ionisation’, ‘instrument’, ‘groupingKind’, ‘groupingNo’, ‘injectionKind’, ‘injectionNo’, ‘reference’, ‘exclusion’, ‘reruns’, ‘extraInjections’, ‘exclusion2’. If None is passed, use the filenameSpec key in Attributes, loaded from the SOP json
Raises: NotImplementedError – if the descriptionFormat is not understood
- amendBatches(sampleRunOrder)¶
Creates a new batch starting at the sample index in sampleRunOrder, and amends subsequent batch numbers in sampleMetadata[‘Correction Batch’]
Parameters: sampleRunOrder (int) – Index of first sample in new batch
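The renumbering can be sketched in pandas (an illustration of the behaviour under simplified assumptions, not the toolbox's internal code; the table below is invented):

```python
import pandas as pd

sampleMetadata = pd.DataFrame({'Run Order': [0, 1, 2, 3, 4],
                               'Correction Batch': [1, 1, 1, 1, 1]})

# Start a new batch at run-order index 3: every sample acquired from
# that point onward moves into the next batch
newBatchStart = 3
later = sampleMetadata['Run Order'] >= newBatchStart
sampleMetadata.loc[later, 'Correction Batch'] += 1
# 'Correction Batch' is now [1, 1, 1, 2, 2]
```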
- artifactualFilter(featMask=None)¶
Filters artifactual features on top of the featureMask already present if none is given as input. Keeps the feature with the highest intensity on the mean spectrum.
Parameters: featMask (numpy.ndarray or None) – A featureMask (True for inclusion); if None, use featureMask
Returns: Amended featureMask
Return type: numpy.ndarray
- excludeFeatures(featureList, on='Feature Name', message='User Excluded')¶
Masks the features listed in featureList from the dataset.
Parameters: - featureList (list) – A list of feature IDs to be excluded
- on (str) – name of the column in featureMetadata to match featureList against, defaults to ‘Feature Name’
- message (str) – append this message to the ‘Exclusion Details’ field for each feature excluded, defaults to ‘User Excluded’
Returns: A list of IDs passed in featureList that could not be matched against the feature IDs present.
Return type: list
- initialiseMasks()¶
Re-initialise featureMask and sampleMask to match the current dimensions of intensityData, and include all samples.
-
validateObject
(verbose=True, raiseError=False, raiseWarning=True)¶ Checks that all the attributes specified in the class definition are present and of the required class and/or values.
Returns 4 boolean: is the object a Dataset < a basic MSDataset < has the object parameters for QC < has the object sample metadata.
To employ all class methods, the most inclusive (has the object sample metadata) must be successful:
- ‘Basic MSDataset’ checks Dataset types and uniqueness as well as additional attributes.
- ‘has parameters for QC’ is ‘Basic MSDataset’ + sampleMetadata[[‘SampleType’, ‘AssayRole’, ‘Dilution’, ‘Run Order’, ‘Batch’, ‘Correction Batch’, ‘Sample Base Name’]]
- ‘has sample metadata’ is ‘has parameters for QC’ + sampleMetadata[[‘Sample ID’, ‘Subject ID’, ‘Matrix’]]
Column types in pandas.DataFrame are established on the first sample when necessary. Does not check for uniqueness in
sampleMetadata['Sample File Name']
. Does not currently check the type of
Attributes['Raw Data Path']
or of
corrExclusions
.
Parameters: - verbose (bool) – if True the result of each check is printed (default True)
- raiseError (bool) – if True an error is raised when a check fails and the validation is interrupted (default False)
- raiseWarning (bool) – if True a warning is raised when a check fails
Returns: A dictionary of 4 booleans with True if the object conforms to the corresponding test. ‘Dataset’ conforms to
Dataset
, ‘BasicMSDataset’ conforms toDataset
+ basicMSDataset
, ‘QC’ BasicMSDataset + object has QC parameters, ‘sampleMetadata’ QC + object has sample metadata informationReturn type: dict
Raises: - TypeError – if the Object class is wrong
- AttributeError – if self.Attributes[‘rtWindow’] does not exist
- TypeError – if self.Attributes[‘rtWindow’] is not an int or float
- AttributeError – if self.Attributes[‘msPrecision’] does not exist
- TypeError – if self.Attributes[‘msPrecision’] is not an int or float
- AttributeError – if self.Attributes[‘varianceRatio’] does not exist
- TypeError – if self.Attributes[‘varianceRatio’] is not an int or float
- AttributeError – if self.Attributes[‘blankThreshold’] does not exist
- TypeError – if self.Attributes[‘blankThreshold’] is not an int or float
- AttributeError – if self.Attributes[‘corrMethod’] does not exist
- TypeError – if self.Attributes[‘corrMethod’] is not a str
- AttributeError – if self.Attributes[‘corrThreshold’] does not exist
- TypeError – if self.Attributes[‘corrThreshold’] is not an int or float
- AttributeError – if self.Attributes[‘rsdThreshold’] does not exist
- TypeError – if self.Attributes[‘rsdThreshold’] is not an int or float
- AttributeError – if self.Attributes[‘artifactualFilter’] does not exist
- TypeError – if self.Attributes[‘artifactualFilter’] is not a bool
- AttributeError – if self.Attributes[‘deltaMzArtifactual’] does not exist
- TypeError – if self.Attributes[‘deltaMzArtifactual’] is not an int or float
- AttributeError – if self.Attributes[‘overlapThresholdArtifactual’] does not exist
- TypeError – if self.Attributes[‘overlapThresholdArtifactual’] is not an int or float
- AttributeError – if self.Attributes[‘corrThresholdArtifactual’] does not exist
- TypeError – if self.Attributes[‘corrThresholdArtifactual’] is not an int or float
- AttributeError – if self.Attributes[‘FeatureExtractionSoftware’] does not exist
- TypeError – if self.Attributes[‘FeatureExtractionSoftware’] is not a str
- AttributeError – if self.Attributes[‘Raw Data Path’] does not exist
- TypeError – if self.Attributes[‘Raw Data Path’] is not a str
- AttributeError – if self.Attributes[‘Feature Names’] does not exist
- TypeError – if self.Attributes[‘Feature Names’] is not a str
- TypeError – if self.VariableType is not an enum ‘VariableType’
- AttributeError – if self.corrExclusions does not exist
- AttributeError – if self._correlationToDilution does not exist
- TypeError – if self._correlationToDilution is not a numpy.ndarray
- AttributeError – if self._artifactualLinkageMatrix does not exist
- TypeError – if self._artifactualLinkageMatrix is not a pandas.DataFrame
- AttributeError – if self._tempArtifactualLinkageMatrix does not exist
- TypeError – if self._tempArtifactualLinkageMatrix is not a pandas.DataFrame
- AttributeError – if self.fileName does not exist
- TypeError – if self.fileName is not a str
- AttributeError – if self.filePath does not exist
- TypeError – if self.filePath is not a str
- ValueError – if self.sampleMetadata does not have the same number of samples as self._intensityData
- TypeError – if self.sampleMetadata[‘Sample File Name’] is not str
- TypeError – if self.sampleMetadata[‘AssayRole’] is not an enum ‘AssayRole’
- TypeError – if self.sampleMetadata[‘SampleType’] is not an enum ‘SampleType’
- TypeError – if self.sampleMetadata[‘Dilution’] is not an int or float
- TypeError – if self.sampleMetadata[‘Batch’] is not an int or float
- TypeError – if self.sampleMetadata[‘Correction Batch’] is not an int or float
- TypeError – if self.sampleMetadata[‘Run Order’] is not an int
- TypeError – if self.sampleMetadata[‘Acquired Time’] is not a datetime
- TypeError – if self.sampleMetadata[‘Sample Base Name’] is not str
- LookupError – if self.sampleMetadata does not have a Matrix column
- TypeError – if self.sampleMetadata[‘Matrix’] is not a str
- LookupError – if self.sampleMetadata does not have a Subject ID column
- TypeError – if self.sampleMetadata[‘Subject ID’] is not a str
- TypeError – if self.sampleMetadata[‘Sample ID’] is not a str
- ValueError – if self.featureMetadata does not have the same number of features as self._intensityData
- TypeError – if self.featureMetadata[‘Feature Name’] is not a str
- ValueError – if self.featureMetadata[‘Feature Name’] is not unique
- LookupError – if self.featureMetadata does not have a m/z column
- TypeError – if self.featureMetadata[‘m/z’] is not an int or float
- LookupError – if self.featureMetadata does not have a Retention Time column
- TypeError – if self.featureMetadata[‘Retention Time’] is not an int or float
- ValueError – if self.sampleMask has not been initialised
- ValueError – if self.sampleMask does not have the same number of samples as self._intensityData
- ValueError – if self.featureMask has not been initialised
- ValueError – if self.featureMask does not have the same number of features as self._intensityData
-
class
nPYc.objects.
NMRDataset
(datapath, fileType='Bruker', sop='GenericNMRurine', pulseprogram='noesygpp1d', **kwargs)¶ NMRDataset
extendsDataset
to represent both spectral and peak-picked NMR datasets. Objects can be initialised from a variety of common data formats, including Bruker-format raw data and BI-LISA targeted lipoprotein analysis.
- Bruker
- When loading Bruker format raw spectra (
1r
files), all directories belowdatapath
will be scanned for valid raw data, and those matching pulseprogram are loaded and aligned onto a common scale as defined in sop.
- BI-LISA
- BI-LISA data can be read from Excel workbooks; the name of the sheet containing the data to be loaded should be passed in the pulseProgram argument. Feature descriptors will be loaded from the ‘Analytes’ sheet, and file names converted back to the ExperimentName/expno format from ExperimentName_EXPNO_expno.
Parameters: - fileType (str) – Type of data to be loaded
- sheetname (str) – Load data from the specified sheet of the Excel workbook
- pulseprogram (str) – When loading raw data, only import spectra acquired with this pulseprogram
-
addSampleInfo
(descriptionFormat=None, filePath=None, filenameSpec=None, **kwargs)¶ Load additional metadata and map it in to the
sampleMetadata
table.Possible options:
- ‘NPC LIMS’ NPC LIMS files mapping file names of raw analytical data to sample IDs
- ‘NPC Subject Info’ Map subject metadata from a NPC sample manifest file (format defined in ‘PCSOP.082’)
- ‘Raw Data’ Extract analytical parameters from raw data files
- ‘ISATAB’ ISATAB study designs
- ‘Filenames’ Parses sample information out of the filenames, based on the named capture groups in the regex passed in filenamespec
- ‘Basic CSV’ Joins the
sampleMetadata
table with the data in thecsv
file at filePath=, matching on the ‘Sample File Name’ column in both.
Parameters: - descriptionFormat (str) – Format of metadata to be added
- filePath (str) – Path to the additional data to be added
- filenameSpec (None or str) – Only used if descriptionFormat is ‘Filenames’. A regular expression that extracts sample-type information into the following named capture groups: ‘fileName’, ‘baseName’, ‘study’, ‘chromatography’, ‘ionisation’, ‘instrument’, ‘groupingKind’, ‘groupingNo’, ‘injectionKind’, ‘injectionNo’, ‘reference’, ‘exclusion’, ‘reruns’, ‘extraInjections’, ‘exclusion2’. If None is passed, the filenameSpec key in Attributes, loaded from the SOP JSON, is used
Raises: NotImplementedError – if the descriptionFormat is not understood
-
updateMasks
(filterSamples=True, filterFeatures=True, sampleTypes=[<SampleType.StudySample>, <SampleType.StudyPool>, <SampleType.ExternalReference>, <SampleType.MethodReference>, <SampleType.ProceduralBlank>], assayRoles=[<AssayRole.Assay>, <AssayRole.PrecisionReference>, <AssayRole.LinearityReference>, <AssayRole.Blank>], exclusionRegions=None, sampleQCChecks=[], **kwargs)¶ Update
sampleMask
andfeatureMask
according to parameters.updateMasks()
setssampleMask
orfeatureMask
toFalse
for those items failing analytical criteria. Note
To avoid reintroducing items manually excluded, this method only ever sets items to
False
, therefore if you wish to move from more stringent criteria to a less stringent set, you will need to reset the mask to allTrue
usinginitialiseMasks()
.Parameters: - filterSamples (bool) – If
False
don’t modify sampleMask - filterFeatures (bool) – If
False
don’t modify featureMask - sampleTypes (SampleType) – List of types of samples to retain
- sampleRoles (AssayRole) – List of assays roles to retain
- exclusionRegions (list of tuple) – If
None
Exclude ranges defined inAttributes
[‘exclusionRegions’] - sampleQCChecks (list) – Which quality control metrics to use.
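The effect of exclusionRegions can be sketched with numpy: each (start, stop) tuple in ppm sets the corresponding stretch of featureMask to False. The chemical-shift scale and regions below are hypothetical; the defaults are read from Attributes[‘exclusionRegions’]:

```python
import numpy

ppm = numpy.linspace(10, -1, 1000)            # hypothetical chemical-shift scale
featureMask = numpy.ones(ppm.size, dtype=bool)

exclusionRegions = [(4.7, 4.9), (-0.2, 0.2)]  # e.g. water and TSP regions

# Mask every feature whose ppm value falls inside an exclusion region
for start, stop in exclusionRegions:
    low, high = min(start, stop), max(start, stop)
    featureMask[(ppm >= low) & (ppm <= high)] = False
```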
-
plot
(spectra, labels, interactive=False)¶ Plots a set of NMR spectra. If interactive is False, a static matplotlib plot is returned; if True, plotly is used to generate an interactive plot.
Parameters: - spectra – The specific ‘labels’ of the spectra to plot. By default all spectra are plotted.
- labels – Which labels to select
- interactive – Use matplotlib (False) or plotly (True)
Returns: Displays the NMR data and returns either a matplotlib axis object or a plotly figure dictionary
-
class
nPYc.objects.
TargetedDataset
(dataPath, fileType='TargetLynx', sop='Generic', **kwargs)¶ TargetedDataset
extendsDataset
to represent quantitative datasets, where compounds are already identified, the exactitude of the quantification can be established, units are known, and calibration curves or internal standards are employed. The TargetedDataset
class includes methods to apply limits of quantification (LLOQ and ULOQ), merge multiple analytical batches, and report the accuracy and precision of each measurement. In addition to the structure of
Dataset
,TargetedDataset
requires the following attributes:expectedConcentration
:A \(n\) × \(m\) pandas dataframe of expected concentrations (matching the
intensityData
dimension), with column names matchingfeatureMetadata[‘Feature Name’]
calibration
:A dictionary containing pandas dataframe describing calibration samples:
calibration['calibIntensityData']
:- A \(r\) x \(m\) numpy matrix of measurements. Features must match features in
intensityData
calibration['calibSampleMetadata']
:- A \(r\) x \(m\) pandas dataframe of calibration sample identifiers and metadata
calibration['calibFeatureMetadata']
:- A \(m\) × \(q\) pandas dataframe of feature identifiers and metadata
calibration['calibExpectedConcentration']
- A \(r\) × \(m\) pandas dataframe of calibration samples’ expected concentrations
Attributes
must contain the following (can be loaded from a method specific JSON on import):methodName
:- A (str) name of the method
externalID
- A list of external IDs; each external ID must also be present in Attributes as a list of identifiers (for that external ID), one per feature. For example, if
externalID=['PubChem ID']
,Attributes['PubChem ID']=['ID1','ID2','','ID75']
featureMetadata
expects the following columns:quantificationType
:- A
QuantificationType
enum specifying the exactitude of the quantification procedure employed.
calibrationMethod
:- A
CalibrationMethod
enum specifying the calibration method employed.
Unit
- A (str) unit corresponding to the feature measurement value.
LLOQ
:- The lowest limit of quantification, used to filter concentrations < LLOQ
ULOQ
:- The upper limit of quantification, used to filter concentrations > ULOQ
- externalID:
- All externalIDs listed in
Attributes['externalID']
must be present as their own column
Currently, targeted assay results processed using TargetLynx or Bruker quantification can be imported. To create an import for any other form of semi-quantitative or quantitative results, the procedure is as follows:
- Create a new
fileType == 'myMethod'
entry in__init__()
- Define functions to populate all expected dataframes (using file readers, JSON,…)
- Separate calibration samples from study samples (store in
calibration
). If none exist, initialise empty dataframes with the correct number of columns and column names. - Execute pre-processing steps if required (note: all feature values should be expressed in the unit listed in
featureMetadata['Unit']
) - Apply limits of quantification using
_applyLimitsOfQuantification()
. (This function does not apply limits of quantification to features marked asQuantificationType
== QuantificationType.Monitored for compounds monitored for relative information.)
The resulting
TargetedDataset
created must satisfy the criteria for BasicTargetedDataset, which can be checked withvalidateObject()
(listing the minimum requirements for all class methods).fileType == 'TargetLynx'
to import data processed using TargetLynxTargetLynx import operates on
xml
files exported via the ‘File -> Export -> XML’ TargetLynx menu option. Import requires acalibration_report.csv
providing lower and upper limits of quantification (LLOQ, ULOQ) with thecalibrationReportPath
keyword argument.Targeted data measurements as well as calibration report information are read and mapped with pre-defined SOPs. All measurments are converted to pre-defined units and measurements inferior to the lowest limits of quantification or superior to the upper limits of quantification are replaced. Once the import is finished, only analysed samples are returned (no calibration samples) and only features mapped onto the pre-defined SOP and sufficiently described.
Instructions to create a new
TargetLynx
SOP can be found on the generation of targeted SOPs page.Example:
TargetedDataset(datapath, fileType='TargetLynx', sop='OxylipinMS', calibrationReportPath=calibrationReportPath, sampleTypeToProcess=['Study Sample','QC'], noiseFilled=False, onlyLLOQ=False, responseReference=None)
sop
Currently implemented are ‘OxylipinMS’ and ‘AminoAcidMS’
AminoAcidMS: Gray N. et al. High-Speed Quantitative UPLC-MS Analysis of Multiple Amines in Human Plasma and Serum via Precolumn Derivatization with 6-Aminoquinolyl-N-hydroxysuccinimidyl Carbamate: Application to Acetaminophen-Induced Liver Failure. Analytical Chemistry, 2017, 89, 2478–87.
OxylipinMS: Wolfer AM. et al. Development and Validation of a High-Throughput Ultrahigh-Performance Liquid Chromatography-Mass Spectrometry Approach for Screening of Oxylipins and Their Precursors. Analytical Chemistry, 2015, 87 (23),11721–31
calibrationReportPath
Path to the calibration report csv following the provided report template.
The following columns are required (leave an empty value to reject a compound):
- Compound
- The compound name, identical to the one employed in the SOP json file.
- TargetLynx ID
- The compound TargetLynx ID, identical to the one employed in the SOP json file.
- LLOQ
- Lowest limit of quantification concentration, in the same unit as indicated in TargetLynx.
- ULOQ
- Upper limit of quantification concentration, in the same unit as indicated in TargetLynx.
The following columns are expected by
_targetLynxApplyLimitsOfQuantificationNoiseFilled()
:- Noise (area)
- Area integrated in a blank sample at the same retention time as the compound of interest (if left empty noise concentration calculation cannot take place).
- a
- \(a\) coefficient in the calibration equation (if left empty noise concentration calculation cannot take place).
- b
- \(b\) coefficient in the calibration equation (if left empty noise concentration calculation cannot take place).
The following columns are recommended but not expected:
- Cpd Info
- Additional information relating to the compound (can be left empty).
- r
- \(r\) goodness of fit measure for the calibration equation (can be left empty).
- r2
- \(r^2\) goodness of fit measure for the calibration equation (can be left empty).
sampleTypeToProcess
List of [‘Study Sample’,’Blank’,’QC’,’Other’] sample types to process, as defined in MassLynx. Only samples in sampleTypeToProcess are returned. Calibrants should not be processed and are not returned. Most uses should only require ‘Study Sample’, as quality controls are identified based on sample names by subsequent functions. Defaults to [‘Study Sample’,’QC’].
noiseFilled
If True, values <LLOQ are replaced by a concentration equivalent to the noise level in a blank; if False, <LLOQ is replaced by \(-inf\). Defaults to False.
onlyLLOQ
If True, only correct <LLOQ; if False, correct both <LLOQ and >ULOQ. Defaults to False.
responseReference
If noiseFilled=True, the noise concentration needs to be calculated. Provide the ‘Sample File Name’ of a reference sample to establish the response, or a list of samples (one per feature). If None, the middle of the calibration is employed. Defaults to None.
keepPeakInfo
If keepPeakInfo=True (default False) adds the
peakInfo
dictionary to thecalibration
.peakInfo
contains the peakResponse, peakArea, peakConcentrationDeviation, peakIntegrationFlag and peakRT.
keepExcluded
If keepExcluded=True (default False), import exclusions (
excludedImportSampleMetadata
,excludedImportFeatureMetadata
,excludedImportIntensityData
andexcludedImportExpectedConcentration
) are kept in the object.
keepIS
If keepIS=True (default False), features marked as Internal Standards (IS) are retained.
fileType = 'Bruker Quantification'
to import Bruker quantification resultsnmrRawDataPath
- Path to the parent folder where all result files are stored. All subfolders will be parsed and the
.xml
results files matching thefileNamePattern
imported.
fileNamePattern
- Regex to recognise the result data xml files
pdata
- To select the right pdata folders (default 1)
Two forms of Bruker quantification results are supported, selected using the
sop
option: BrukerQuant-UR and Bruker BI-LISAsop = 'BrukerQuant-UR'
Example:
TargetedDataset(nmrRawDataPath, fileType='Bruker Quantification', sop='BrukerQuant-UR', fileNamePattern='.*?urine_quant_report_b\.xml$', unit='mmol/mol Crea')
unit
- If features are duplicated with different units,
unit
limits the import to features matching said unit. (In case of duplication and nounit
, all available units will be listed)
sop = 'BrukerBI-LISA'
Example:
TargetedDataset(nmrRawDataPath, fileType='Bruker Quantification', sop='BrukerBI-LISA', fileNamePattern='.*?results\.xml$')
-
rsdSP
¶ Returns percentage relative standard deviations for each feature in the dataset, calculated on samples with the Assay Role
PrecisionReference
and Sample TypeStudyPool
insampleMetadata
. Implemented as a back-up toaccuracyPrecision()
when no expected concentrations are known. Returns: Vector of feature RSDs Return type: numpy.ndarray
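Percentage RSD is the standard deviation over the mean, scaled to per cent. A sketch of the per-feature calculation on the pooled-QC subset, with hypothetical data (ddof=1 assumed for the sample standard deviation):

```python
import numpy

# Hypothetical intensities for study-pool samples (4 samples x 3 features)
poolIntensities = numpy.array([[100.0, 10.0, 1.0],
                               [102.0,  9.0, 1.1],
                               [ 98.0, 11.0, 0.9],
                               [100.0, 10.0, 1.0]])

# Per-feature %RSD: standard deviation / mean * 100
rsdSP = (numpy.std(poolIntensities, axis=0, ddof=1)
         / numpy.mean(poolIntensities, axis=0) * 100)
```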
-
rsdSS
¶ Returns percentage relative standard deviations for each feature in the dataset, calculated on samples with the Assay Role
Assay
and Sample TypeStudySample
insampleMetadata
.Returns: Vector of feature RSDs Return type: numpy.ndarray
-
mergeLimitsOfQuantification
(keepBatchLOQ=False, onlyLLOQ=False)¶ Update limits of quantification and apply LLOQ/ULOQ using the lowest common denominator across all batches (after a
__add__()
). Keep the highest LLOQ and lowest ULOQ.Parameters: - keepBatchLOQ (bool) – If
True
do not remove each batch LOQ (featureMetadata['LLOQ_batchX']
,featureMetadata['ULOQ_batchX']
) - onlyLLOQ (bool) – if True only correct <LLOQ, if False correct <LLOQ and >ULOQ
Raises: - ValueError – if targetedData does not satisfy the BasicTargetedDataset definition on input
- ValueError – if the number of batches, LLOQ_batchX and ULOQ_batchX columns do not match
- ValueError – if targetedData does not satisfy the BasicTargetedDataset definition after LOQ merging
- Warning – if
featureMetadata['LLOQ']
orfeatureMetadata['ULOQ']
already exist and will be overwritten.
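The lowest-common-denominator rule keeps the most conservative limits across batches: the highest LLOQ and the lowest ULOQ. A sketch with pandas and hypothetical per-batch columns:

```python
import pandas

# Hypothetical per-batch limits after two batches have been merged
featureMetadata = pandas.DataFrame({
    'Feature Name': ['alanine', 'glycine'],
    'LLOQ_batch1': [0.5, 1.0], 'ULOQ_batch1': [100.0, 200.0],
    'LLOQ_batch2': [0.8, 0.9], 'ULOQ_batch2': [ 90.0, 250.0]})

# Keep the highest LLOQ and the lowest ULOQ across batches
featureMetadata['LLOQ'] = featureMetadata[['LLOQ_batch1', 'LLOQ_batch2']].max(axis=1)
featureMetadata['ULOQ'] = featureMetadata[['ULOQ_batch1', 'ULOQ_batch2']].min(axis=1)
# alanine: LLOQ 0.8, ULOQ 90.0; glycine: LLOQ 1.0, ULOQ 200.0
```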
-
exportDataset
(destinationPath='.', saveFormat='CSV', withExclusions=True, escapeDelimiters=False, filterMetadata=True)¶ Calls
exportDataset()
and raises a warning if normalisation is employed asTargetedDataset
intensityData
can be left-censored.
-
validateObject
(verbose=True, raiseError=False, raiseWarning=True)¶ Checks that all the attributes specified in the class definition are present and of the required class and/or values.
Returns 4 booleans: is the object a Dataset < a basic TargetedDataset < has the object parameters for QC < has the object sample metadata.
To employ all class methods, the most inclusive (has the object sample metadata) must be successful:
- ‘Basic TargetedDataset’ checks
TargetedDataset
types and uniqueness as well as additional attributes. - ‘has parameters for QC’ is ‘Basic TargetedDataset’ + sampleMetadata[[‘SampleType, AssayRole, Dilution, Run Order, Batch, Correction Batch, Sample Base Name]]
- ‘has sample metadata’ is ‘has parameters for QC’ + sampleMetadata[[‘Sample ID’, ‘Subject ID’, ‘Matrix’]]
calibration['calibIntensityData']
must be initialised even if no samples are presentcalibration['calibSampleMetadata']
must be initialised even if no samples are present, use:pandas.DataFrame(None, columns=self.sampleMetadata.columns.values.tolist())
calibration['calibFeatureMetadata']
must be initialised even if no samples are present, use a copy ofself.featureMetadata
calibration['calibExpectedConcentration']
must be initialised even if no samples are present, use:pandas.DataFrame(None, columns=self.expectedConcentration.columns.values.tolist())
Calibration features must be identical to the usual features. The number of calibration samples and features must match across the 4 calibration tables. If ‘sampleMetadataExcluded’, ‘intensityDataExcluded’, ‘featureMetadataExcluded’, ‘expectedConcentrationExcluded’ or ‘excludedFlag’ exist, the existence and number of exclusions (based on ‘sampleMetadataExcluded’) is checked. Column types in pandas.DataFrame are established on the first sample (for non int/float). featureMetadata is searched for column names containing ‘LLOQ’ & ‘ULOQ’ to allow for ‘LLOQ_batch…’ after
__add__()
; the first matching column is then checked for dtype. If datasets are merged, calibration is a list of dicts, and the number of features is only kept constant inside each dict. Does not check for uniqueness in
sampleMetadata['Sample File Name']
. Does not check columns inside
calibration['calibSampleMetadata']
or
calibration['calibFeatureMetadata']
. Does not currently check for
Attributes['Feature Name']
Parameters: - verbose (bool) – if True the result of each check is printed (default True)
- raiseError (bool) – if True an error is raised when a check fails and the validation is interrupted (default False)
- raiseWarning (bool) – if True a warning is raised when a check fails
Returns: A dictionary of 4 booleans with True if the object conforms to the corresponding test. ‘Dataset’ conforms to
Dataset
, ‘BasicTargetedDataset’ conforms toDataset
+ basicTargetedDataset
, ‘QC’ BasicTargetedDataset + object has QC parameters, ‘sampleMetadata’ QC + object has sample metadata informationReturn type: dict
Raises: - TypeError – if the Object class is wrong
- AttributeError – if self.Attributes[‘methodName’] does not exist
- TypeError – if self.Attributes[‘methodName’] is not a str
- AttributeError – if self.Attributes[‘externalID’] does not exist
- TypeError – if self.Attributes[‘externalID’] is not a list
- TypeError – if self.VariableType is not an enum ‘VariableType’
- AttributeError – if self.fileName does not exist
- TypeError – if self.fileName is not a str or list
- AttributeError – if self.filePath does not exist
- TypeError – if self.filePath is not a str or list
- ValueError – if self.sampleMetadata does not have the same number of samples as self._intensityData
- TypeError – if self.sampleMetadata[‘Sample File Name’] is not str
- TypeError – if self.sampleMetadata[‘AssayRole’] is not an enum ‘AssayRole’
- TypeError – if self.sampleMetadata[‘SampleType’] is not an enum ‘SampleType’
- TypeError – if self.sampleMetadata[‘Dilution’] is not an int or float
- TypeError – if self.sampleMetadata[‘Batch’] is not an int or float
- TypeError – if self.sampleMetadata[‘Correction Batch’] is not an int or float
- TypeError – if self.sampleMetadata[‘Run Order’] is not an int
- TypeError – if self.sampleMetadata[‘Acquired Time’] is not a datetime
- TypeError – if self.sampleMetadata[‘Sample Base Name’] is not str
- LookupError – if self.sampleMetadata does not have a Subject ID column
- TypeError – if self.sampleMetadata[‘Subject ID’] is not a str
- TypeError – if self.sampleMetadata[‘Sample ID’] is not a str
- ValueError – if self.featureMetadata does not have the same number of features as self._intensityData
- TypeError – if self.featureMetadata[‘Feature Name’] is not a str
- ValueError – if self.featureMetadata[‘Feature Name’] is not unique
- LookupError – if self.featureMetadata does not have a calibrationMethod column
- TypeError – if self.featureMetadata[‘calibrationMethod’] is not an enum ‘CalibrationMethod’
- LookupError – if self.featureMetadata does not have a quantificationType column
- TypeError – if self.featureMetadata[‘quantificationType’] is not an enum ‘QuantificationType’
- LookupError – if self.featureMetadata does not have a Unit column
- TypeError – if self.featureMetadata[‘Unit’] is not a str
- LookupError – if self.featureMetadata does not have a LLOQ or similar column
- TypeError – if self.featureMetadata[‘LLOQ’] or similar is not an int or float
- LookupError – if self.featureMetadata does not have a ULOQ or similar column
- TypeError – if self.featureMetadata[‘ULOQ’] or similar is not an int or float
- LookupError – if self.featureMetadata does not have the ‘externalID’ as columns
- AttributeError – if self.expectedConcentration does not exist
- TypeError – if self.expectedConcentration is not a pandas.DataFrame
- ValueError – if self.expectedConcentration does not have the same number of samples as self._intensityData
- ValueError – if self.expectedConcentration does not have the same number of features as self._intensityData
- ValueError – if self.expectedConcentration column names do not match self.featureMetadata[‘Feature Name’]
- ValueError – if self.sampleMask is not initialised
- ValueError – if self.sampleMask does not have the same number of samples as self._intensityData
- ValueError – if self.featureMask has not been initialised
- ValueError – if self.featureMask does not have the same number of features as self._intensityData
- AttributeError – if self.calibration does not exist
- TypeError – if self.calibration is not a dict
- AttributeError – if self.calibration[‘calibIntensityData’] does not exist
- TypeError – if self.calibration[‘calibIntensityData’] is not a numpy.ndarray
- ValueError – if self.calibration[‘calibIntensityData’] does not have the same number of features as self._intensityData
- AttributeError – if self.calibration[‘calibSampleMetadata’] does not exist
- TypeError – if self.calibration[‘calibSampleMetadata’] is not a pandas.DataFrame
- ValueError – if self.calibration[‘calibSampleMetadata’] does not have the same number of samples as self.calibration[‘calibIntensityData’]
- AttributeError – if self.calibration[‘calibFeatureMetadata’] does not exist
- TypeError – if self.calibration[‘calibFeatureMetadata’] is not a pandas.DataFrame
- LookupError – if self.calibration[‘calibFeatureMetadata’] does not have a [‘Feature Name’] column
- ValueError – if self.calibration[‘calibFeatureMetadata’] does not have the same number of features as self._intensityData
- AttributeError – if self.calibration[‘calibExpectedConcentration’] does not exist
- TypeError – if self.calibration[‘calibExpectedConcentration’] is not a pandas.DataFrame
- ValueError – if self.calibration[‘calibExpectedConcentration’] does not have the same number of samples as self.calibration[‘calibIntensityData’]
- ValueError – if self.calibration[‘calibExpectedConcentration’] does not have the same number of features as self.calibration[‘calibIntensityData’]
- ValueError – if self.calibration[‘calibExpectedConcentration’] column names do not match self.featureMetadata[‘Feature Name’]
- ‘Basic TargetedDataset’ checks
-
applyMasks
()¶ Permanently delete elements masked (those set to
False
) insampleMask
andfeatureMask
, fromfeatureMetadata
,sampleMetadata
,intensityData
and expectedConcentration. Features are excluded in each
calibration
based on the internalcalibration['calibFeatureMetadata']
(iterate through the list of calibration if 2+ datasets have been joined with__add__()
).
-
updateMasks
(filterSamples=True, filterFeatures=True, sampleTypes=[<SampleType.StudySample>, <SampleType.StudyPool>], assayRoles=[<AssayRole.Assay>, <AssayRole.PrecisionReference>], quantificationTypes=[<QuantificationType.IS>, <QuantificationType.QuantOwnLabeledAnalogue>, <QuantificationType.QuantAltLabeledAnalogue>, <QuantificationType.QuantOther>, <QuantificationType.Monitored>], calibrationMethods=[<CalibrationMethod.backcalculatedIS>, <CalibrationMethod.noIS>, <CalibrationMethod.noCalibration>, <CalibrationMethod.otherCalibration>], rsdThreshold=None, **kwargs)¶ Update
sampleMask
andfeatureMask
according to QC parameters.updateMasks()
setssampleMask
orfeatureMask
toFalse
for those items failing analytical criteria.Similar to
updateMasks()
, without blankThreshold or artifactual filtering. Note
To avoid reintroducing items manually excluded, this method only ever sets items to
False
, therefore if you wish to move from more stringent criteria to a less stringent set, you will need to reset the mask to allTrue
usinginitialiseMasks()
.Parameters: - filterSamples (bool) – If
False
don’t modify sampleMask - filterFeatures (bool) – If
False
don’t modify featureMask - sampleTypes (SampleType) – List of types of samples to retain
- assayRoles (AssayRole) – List of assays roles to retain
- quantificationTypes (QuantificationType) – List of quantification types to retain
- calibrationMethods (CalibrationMethod) – List of calibration methods to retain
Raises: - TypeError – if sampleTypes is not a list
- TypeError – if sampleTypes are not a SampleType enum
- TypeError – if assayRoles is not a list
- TypeError – if assayRoles are not an AssayRole enum
- TypeError – if quantificationTypes is not a list
- TypeError – if quantificationTypes are not a QuantificationType enum
- TypeError – if calibrationMethods is not a list
- TypeError – if calibrationMethods are not a CalibrationMethod enum
-
addSampleInfo
(descriptionFormat=None, filePath=None, **kwargs)¶ Load additional metadata and map it into the
sampleMetadata
table. Possible options:
- ‘NPC Subject Info’ Map subject metadata from an NPC sample manifest file (format defined in ‘PCSOP.082’)
- ‘Raw Data’ Extract analytical parameters from raw data files
- ‘ISATAB’ ISATAB study designs
- ‘Filenames’ Parse sample information out of the filenames, based on the named capture groups in the regex passed in filenameSpec
- ‘Basic CSV’ Join the sampleMetadata table with the data in the csv file at filePath, matching on the ‘Sample File Name’ column in both
- ‘Batches’ Interpolate batch numbers for samples between those with defined batch numbers, based on sample acquisition times
Parameters: - descriptionFormat (str) – Format of metadata to be added
- filePath (str) – Path to the additional data to be added
- filenameSpec (None or str) – Only used if descriptionFormat is ‘Filenames’. A regular expression that extracts sample-type information into the following named capture groups: ‘fileName’, ‘baseName’, ‘study’, ‘chromatography’, ‘ionisation’, ‘instrument’, ‘groupingKind’, ‘groupingNo’, ‘injectionKind’, ‘injectionNo’, ‘reference’, ‘exclusion’, ‘reruns’, ‘extraInjections’, ‘exclusion2’. If None is passed, the filenameSpec key in Attributes, loaded from the SOP json, is used
Raises: NotImplementedError – if the descriptionFormat is not understood
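The ‘Basic CSV’ option amounts to a left join on the ‘Sample File Name’ column. A rough illustration with pandas, standing in for the internal implementation, with made-up file names and an assumed extra column:

```python
import pandas as pd

# Existing sample metadata, keyed by 'Sample File Name'
sampleMetadata = pd.DataFrame({'Sample File Name': ['run01', 'run02', 'run03'],
                               'Acquired Time': ['09:00', '09:15', '09:30']})

# Contents of the csv file passed via filePath (hypothetical 'Age' column)
basicCSV = pd.DataFrame({'Sample File Name': ['run01', 'run02'],
                         'Age': [34, 58]})

# Join the two tables, matching on the 'Sample File Name' column in both;
# samples absent from the csv keep their row, with missing values filled as NaN
merged = sampleMetadata.merge(basicCSV, how='left', on='Sample File Name')
print(merged.columns.tolist())   # ['Sample File Name', 'Acquired Time', 'Age']
```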
-
accuracyPrecision
(onlyPrecisionReferences=False)¶ Return precision (percent RSDs) and accuracy for each SampleType and each unique concentration. Statistics are grouped by SampleType, Feature and unique concentration.
Parameters: - dataset (TargetedDataset) – TargetedDataset object to generate the accuracy and precision for
- onlyPrecisionReferences (bool) – If
True
, only use samples with the AssayRole PrecisionReference
Returns: Dict of accuracy and precision dicts for each group.
Return type: dict(str:dict(str:pandas.DataFrame))
Raises: TypeError – if dataset is not an instance of TargetedDataset
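As a rough illustration of the two statistics (not the nPYc-Toolbox call itself), percent RSD and accuracy for one feature at one nominal concentration could be computed as follows; the definition of accuracy as the measured mean expressed as a percentage of the nominal value is an assumption here:

```python
import numpy as np

# Repeated measurements of one feature at a nominal concentration of 10 units
measured = np.array([9.8, 10.1, 10.3, 9.9])
expected = 10.0

# Precision: percent relative standard deviation (sample std / mean * 100)
rsd = 100 * measured.std(ddof=1) / measured.mean()

# Accuracy: measured mean as a percentage of the nominal concentration
accuracy = 100 * measured.mean() / expected

print(round(rsd, 2), round(accuracy, 2))
```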