DistanceMetricClassifier module

A module containing the distance metric classifier.

This module contains the DistanceMetricClassifier introduced by Chaini et al. (2024) in “Light Curve Classification with DistClassiPy: a new distance-based classifier”.

class distclassipy.classifier.DistanceMetricClassifier(metric: str | Callable = None, scale: bool = True, central_stat: str = 'median', dispersion_stat: str = 'std')

A distance-based classifier that supports different distance metrics.

The distance metric classifier determines the similarity between features in a dataset using a choice of distance metrics. A specified distance metric is used to compute the distance between a given object and the centroid of every training class in the feature space. The classifier supports different statistical measures for constructing the centroid and for scaling the computed distance. Optionally, it also provides an estimate of the confidence of its predictions.

Parameters:

scale (bool, default=True) – Whether to scale the distance between the test object and the centroid for a class in the feature space. If True, the data will be scaled based on the specified dispersion statistic.
central_stat ({"mean", "median"}, default="median") – The statistic used to calculate the central tendency of the data to construct the feature-space centroid. Supported statistics are “mean” and “median”.
dispersion_stat ({"std", "iqr"}, default="std") – The statistic used to calculate the dispersion of the data for scaling the distance. Supported statistics are “std” for standard deviation and “iqr” for inter-quartile range.

Added in version 0.1.0.

Examples

>>> import distclassipy as dcpy
>>> from sklearn.datasets import make_classification
>>> X, y = make_classification(n_samples=1000, n_features=4,
...                            n_informative=2, n_redundant=0,
...                            random_state=0, shuffle=False)
>>> clf = dcpy.DistanceMetricClassifier()
>>> clf.fit(X, y)
DistanceMetricClassifier(...)
>>> print(clf.predict([[0, 0, 0, 0]], metric="canberra"))
[0]

calculate_confidence()

Calculate the confidence for each prediction.

The confidence is calculated as the inverse of the distance of each data point to the centroids of the training data.
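
As a rough illustration of the inverse-distance idea, the sketch below turns one sample's centroid distances into confidences. The function name `inverse_distance_confidence` and the `eps` guard against division by zero are assumptions made for this example; they are not taken from DistClassiPy.

```python
def inverse_distance_confidence(distances, eps=1e-12):
    """Map {class_label: distance} to {class_label: 1 / distance}.

    `eps` is a hypothetical guard so that a zero distance does not
    cause a division by zero.
    """
    return {label: 1.0 / max(dist, eps) for label, dist in distances.items()}


# Distances from one sample to the centroids of three classes.
distances = {"A": 0.5, "B": 2.0, "C": 4.0}
confidence = inverse_distance_confidence(distances)

# The closest centroid receives the largest confidence.
best_class = max(confidence, key=confidence.get)
```

Under this scheme a smaller distance always yields a larger confidence, so the most confident class coincides with the nearest centroid.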

fit(X: array, y: array, feat_labels: list[str] = None) → DistanceMetricClassifier

Calculate the feature space centroid for all classes.

This function calculates the feature space centroid in the training set (X, y) for all classes using the central statistic. If scaling is enabled, it also calculates the appropriate dispersion statistic. This involves computing the centroid for every class in the feature space and optionally calculating the kernel density estimate and 1-dimensional distance.

Parameters:

X (array-like of shape (n_samples, n_features)) – The training input samples.
y (array-like of shape (n_samples,)) – The target values (class labels).
feat_labels (list of str, optional, default=None) – The feature labels. If not provided, default labels representing feature number will be used.

Returns:

self – Fitted estimator.

Return type:

object
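
The per-class statistics described for fit can be sketched in plain Python as follows. This is a minimal illustration, not the library's implementation: the helper name `class_statistics`, the use of the population standard deviation, and the "inclusive" quantile method for the inter-quartile range are all assumptions.

```python
import statistics


def class_statistics(X, y, central_stat="median", dispersion_stat="std"):
    """Return {label: (centroid, dispersion)}, computed feature by feature."""
    central = {"mean": statistics.fmean, "median": statistics.median}[central_stat]

    def iqr(values):
        # Inter-quartile range: third quartile minus first quartile.
        q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
        return q3 - q1

    # Assumption: population standard deviation (no Bessel correction).
    dispersion = {"std": statistics.pstdev, "iqr": iqr}[dispersion_stat]

    stats = {}
    for label in sorted(set(y)):
        rows = [row for row, target in zip(X, y) if target == label]
        columns = list(zip(*rows))  # one tuple of values per feature
        centroid = [central(col) for col in columns]
        spread = [dispersion(col) for col in columns]
        stats[label] = (centroid, spread)
    return stats


X = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0],
     [10.0, 1.0], [20.0, 2.0], [30.0, 3.0]]
y = [0, 0, 0, 1, 1, 1]
stats = class_statistics(X, y)
```

Each class ends up with one centroid vector and one dispersion vector, both with one entry per feature.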

get_metadata_routing()

Get metadata routing of this object.

Please check the User Guide on how the routing mechanism works.

Returns: routing – A MetadataRequest encapsulating routing information.
Return type: MetadataRequest

get_params(deep=True)

Get parameters for this estimator.

Parameters: deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns: params – Parameter names mapped to their values.
Return type: dict

predict(X: array, metric: str | Callable = None) → ndarray

Predict the class labels for the provided X.

The prediction is based on the distance of each data point in the input sample to the centroid for each class in the feature space. The predicted class is the one whose centroid is the closest to the input sample.

Parameters:

X (array-like of shape (n_samples, n_features)) – The input samples.
metric (str or callable, default="euclidean") – The distance metric to use for calculating the distance between features.

Changed in version 0.2.0: The metric is now specified at prediction time rather than during initialization, providing greater flexibility.

Returns:

y – The predicted classes.

Return type:

ndarray of shape (n_samples,)

See also

scipy.spatial.distance: Other distance metrics provided in SciPy
distclassipy.distances: Distance metrics included with DistClassiPy

Notes

If using distance metrics supported by SciPy, it is desirable to pass a string, which allows SciPy to use an optimized C version of the code instead of the slower Python version.
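
The nearest-centroid rule used by predict can be illustrated with a small pure-Python sketch. The function `predict_nearest_centroid`, the hand-written `euclidean` and `canberra` helpers, and the scaling scheme (dividing both the sample and the centroid by the class dispersion) are assumptions for illustration only; the library itself delegates to SciPy or to distclassipy.distances.

```python
import math


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def canberra(u, v):
    # Terms where both coordinates are zero contribute nothing.
    return sum(abs(a - b) / (abs(a) + abs(b)) for a, b in zip(u, v) if a or b)


def predict_nearest_centroid(X, centroids, metric=euclidean, dispersions=None):
    """Assign each row of X to the label of its closest centroid.

    If `dispersions` is given, the sample and the centroid are both divided
    by that class's per-feature dispersion before the distance is taken
    (an assumed scaling scheme for this sketch).
    """
    predictions = []
    for row in X:
        best_label, best_dist = None, math.inf
        for label, centroid in centroids.items():
            if dispersions is not None:
                scale = dispersions[label]
                u = [a / s for a, s in zip(row, scale)]
                v = [c / s for c, s in zip(centroid, scale)]
            else:
                u, v = row, centroid
            dist = metric(u, v)
            if dist < best_dist:
                best_label, best_dist = label, dist
        predictions.append(best_label)
    return predictions


centroids = {0: [0.0, 0.0], 1: [4.0, 4.0]}
samples = [[1.0, 1.0], [3.0, 5.0]]
labels_euclidean = predict_nearest_centroid(samples, centroids)
labels_canberra = predict_nearest_centroid(samples, centroids, metric=canberra)
```

In this toy setup the first sample is assigned to class 0 under the Euclidean metric but to class 1 under the Canberra metric, which shows why the choice of metric at prediction time matters.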

predict_and_analyse(X: array, metric: str | Callable = None) → ndarray

Predict the class labels for the provided X and perform analysis.

The analysis involves saving all calculated distances and confidences as an attribute for inspection and analysis later.

Parameters:

X (array-like of shape (n_samples, n_features)) – The input samples.
metric (str or callable, default="euclidean") – The distance metric to use for calculating the distance between features.

Returns:

y – The predicted classes.

Return type:

ndarray of shape (n_samples,)

See also

scipy.spatial.distance: Other distance metrics provided in SciPy
distclassipy.distances: Distance metrics included with DistClassiPy

Notes

If using distance metrics supported by SciPy, it is desirable to pass a string, which allows SciPy to use an optimized C version of the code instead of the slower Python version.

score(X, y, metric: str | Callable = None) → float

Return the mean accuracy on the given test data and labels.

Parameters:

X (array-like of shape (n_samples, n_features)) – Test samples.
y (array-like of shape (n_samples,)) – True labels for X.
metric (str or callable, default="euclidean") – The distance metric to use for calculating the distance between features.

Returns:

score – Mean accuracy of self.predict(X) w.r.t. y.

Return type:

float
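
Mean accuracy is simply the fraction of samples whose predicted label equals the true label. A minimal sketch, with the helper name `mean_accuracy` chosen for this example:

```python
def mean_accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Three of the four predictions are correct.
accuracy = mean_accuracy([0, 1, 1, 0], [0, 1, 0, 0])
```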

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters: **params (dict) – Estimator parameters.
Returns: self – Estimator instance.
Return type: estimator instance

set_score_request(*, metric: bool | None | str = '$UNCHANGED$') → DistanceMetricClassifier

Configure whether metadata should be requested to be passed to the score method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.

False: metadata is not requested and the meta-estimator will not pass it to score.

None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:

metric (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for metric parameter in score.

Returns:

self – The updated object.

Return type:

object

class distclassipy.classifier.EnsembleDistanceClassifier(feat_idx: int, scale: bool = True, central_stat: str = 'median', dispersion_stat: str = 'std', metrics_to_consider: list[str] = None, random_state: int = None)

An ensemble classifier that uses different metrics for each quantile.

This classifier splits the data into quantiles based on a specified feature and uses the best-performing distance metric within each quantile, which generally leads to better performance than a single metric. Note, however, that this involves fitting the training set for each metric to evaluate performance, making it more computationally expensive.

Added in version 0.2.0.
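
The quantile split that the ensemble relies on can be sketched with the standard library. The helper names `quantile_bins` and `assign_quantile`, and the "inclusive" quantile method, are assumptions for this illustration rather than the library's own binning code.

```python
import bisect
import statistics


def quantile_bins(values, n_quantiles=4):
    """Return the interior cut points that split `values` into quantiles."""
    return statistics.quantiles(values, n=n_quantiles, method="inclusive")


def assign_quantile(value, cut_points):
    """Return the 0-based index of the quantile that `value` falls into."""
    return bisect.bisect_left(cut_points, value)


# Values of the feature selected for splitting (eight samples).
feature = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
cuts = quantile_bins(feature, n_quantiles=4)
groups = [assign_quantile(v, cuts) for v in feature]
```

Each sample is then handled by the classifier and metric associated with its quantile index.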

evaluate_metrics(X: ndarray, y: ndarray, n_quantiles: int = 4) → Tuple[DataFrame, Series, ndarray]

Evaluate and find the best distance metrics for the specified feature.

This method uses the standalone find_best_metrics function to evaluate different distance metrics and determine the best-performing ones for each quantile.

Parameters:

X (np.ndarray) – The input feature matrix.
y (np.ndarray) – The target labels.
n_quantiles (int, default=4) – The number of quantiles to split the data into.

Returns:

quantile_scores_df (pd.DataFrame) – A DataFrame containing the accuracy scores for each metric across different quantiles.
best_metrics_per_quantile (pd.Series) – A Series indicating the best-performing metric for each quantile.
group_bins (np.ndarray) – The bins used for quantile splitting.

fit(X: ndarray, y: ndarray, n_quantiles: int = 4) → EnsembleDistanceClassifier

Fit the ensemble classifier using the best metrics for each quantile.

Parameters:

X (np.ndarray) – The input feature matrix.
y (np.ndarray) – The target labels.
n_quantiles (int, default=4) – The number of quantiles to split the data into.

Returns:

self – Fitted estimator.

Return type:

object

get_metadata_routing()

Get metadata routing of this object.

Please check the User Guide on how the routing mechanism works.

Returns: routing – A MetadataRequest encapsulating routing information.
Return type: MetadataRequest

get_params(deep=True)

Get parameters for this estimator.

Parameters: deep (bool, default=True) – If True, will return the parameters for this estimator and contained subobjects that are estimators.
Returns: params – Parameter names mapped to their values.
Return type: dict

predict(X: ndarray) → ndarray

Predict class labels using the best metric for each quantile.

Parameters: X (np.ndarray) – The input samples.
Returns: predictions – The predicted class labels.
Return type: np.ndarray

score(X, y, sample_weight=None)

Return accuracy on provided data and labels.

In multi-label classification, this is the subset accuracy which is a harsh metric since you require for each sample that each label set be correctly predicted.

Parameters:

X (array-like of shape (n_samples, n_features)) – Test samples.
y (array-like of shape (n_samples,) or (n_samples, n_outputs)) – True labels for X.
sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.

Returns:

score – Mean accuracy of self.predict(X) w.r.t. y.

Return type:

float

set_params(**params)

Set the parameters of this estimator.

Parameters: **params (dict) – Estimator parameters.
Returns: self – Estimator instance.
Return type: estimator instance

set_score_request(*, sample_weight: bool | None | str = '$UNCHANGED$') → EnsembleDistanceClassifier

Configure whether metadata should be requested to be passed to the score method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.

False: metadata is not requested and the meta-estimator will not pass it to score.

None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:

sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in score.

Returns:

self – The updated object.

Return type:

object

Copyright (C) 2024 Siddharth Chaini

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see <https://www.gnu.org/licenses/>.

class distclassipy.classifier.DistanceMetricClassifier(metric: str | Callable = None, scale: bool = True, central_stat: str = 'median', dispersion_stat: str = 'std')

Bases: ClassifierMixin, BaseEstimator

set_fit_request(*, feat_labels: bool | None | str = '$UNCHANGED$') → DistanceMetricClassifier

Configure whether metadata should be requested to be passed to the fit method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.

False: metadata is not requested and the meta-estimator will not pass it to fit.

None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:

feat_labels (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for feat_labels parameter in fit.

Returns:

self – The updated object.

Return type:

object

set_predict_request(*, metric: bool | None | str = '$UNCHANGED$') → DistanceMetricClassifier

Configure whether metadata should be requested to be passed to the predict method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

True: metadata is requested, and passed to predict if provided. The request is ignored if metadata is not provided.

False: metadata is not requested and the meta-estimator will not pass it to predict.

None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:

metric (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for metric parameter in predict.

Returns:

self – The updated object.

Return type:

object

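class distclassipy.classifier.EnsembleDistanceClassifier(feat_idx: int, scale: bool = True, central_stat: str = 'median', dispersion_stat: str = 'std', metrics_to_consider: list[str] = None, random_state: int = None)

Bases: ClassifierMixin, BaseEstimator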

set_fit_request(*, n_quantiles: bool | None | str = '$UNCHANGED$') → EnsembleDistanceClassifier

Configure whether metadata should be requested to be passed to the fit method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.

False: metadata is not requested and the meta-estimator will not pass it to fit.

None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:

n_quantiles (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for n_quantiles parameter in fit.

Returns:

self – The updated object.

Return type:

object

distclassipy.classifier.find_best_metrics(clf: DistanceMetricClassifier, X: ndarray, y: ndarray, feat_idx: int, n_quantiles: int = 4, metrics_to_consider: list[str] = None, random_state: int = None) → Tuple[DataFrame, Series, ndarray]

Evaluate and find the best distance metrics for a given feature.

This function evaluates different distance metrics to determine which performs best for a specific feature in the dataset. It splits the data into quantiles based on the specified feature and calculates the accuracy of the classifier for each metric within these quantiles.

Added in version 0.2.0.

Parameters:

clf (DistanceMetricClassifier) – The classifier instance to be used for evaluation.
X (np.ndarray) – The input feature matrix.
y (np.ndarray) – The target labels.
feat_idx (int) – The index of the feature to be used for quantile splitting.
n_quantiles (int, default=4) – The number of quantiles to split the data into.
metrics_to_consider (list of str, optional) – A list of distance metrics to evaluate. If None, all available metrics within DistClassiPy will be considered.
random_state (int, RandomState instance or None, optional (default=None)) – Controls the randomness of the estimator. Pass an int for reproducible output across multiple function calls.

Added in version 0.2.1.

Returns:

quantile_scores_df (pd.DataFrame) – A DataFrame containing the accuracy scores for each metric across different quantiles.
best_metrics_per_quantile (pd.Series) – A Series indicating the best-performing metric for each quantile.
group_bins (np.ndarray) – The bins used for quantile splitting.
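
The selection step, picking the best-scoring metric within each quantile, can be sketched without pandas as below. The helper `best_metric_per_quantile` and the accuracy values are purely illustrative assumptions; they are not measured results and not the library's code.

```python
def best_metric_per_quantile(quantile_scores):
    """Pick the highest-scoring metric for each quantile.

    `quantile_scores` maps a metric name to a list of accuracies,
    one entry per quantile.
    """
    n_quantiles = len(next(iter(quantile_scores.values())))
    best = []
    for q in range(n_quantiles):
        best.append(max(quantile_scores, key=lambda name: quantile_scores[name][q]))
    return best


# Illustrative accuracies only; these are not measured results.
scores = {
    "euclidean": [0.80, 0.70, 0.65, 0.90],
    "canberra": [0.75, 0.85, 0.60, 0.88],
    "cityblock": [0.78, 0.72, 0.70, 0.91],
}
best = best_metric_per_quantile(scores)
```

Here `scores` plays the role of quantile_scores_df and `best` the role of best_metrics_per_quantile.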

distclassipy.classifier.initialize_metric_function(metric)

Set the metric function based on the provided metric.

If the metric is a string, the function will look for a corresponding function in scipy.spatial.distance or distclassipy.distances. If the metric is a function, it will be used directly.
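
A string-or-callable dispatch of this kind might look like the following sketch. The `METRICS` registry and the `resolve_metric` function are hypothetical stand-ins for the lookup in scipy.spatial.distance and distclassipy.distances; the error handling shown is an assumption.

```python
import math


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cityblock(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))


# Hypothetical registry standing in for the library's metric lookup.
METRICS = {"euclidean": euclidean, "cityblock": cityblock}


def resolve_metric(metric):
    """Return a distance function for a metric name or a callable."""
    if callable(metric):
        return metric  # user-supplied functions are used directly
    if isinstance(metric, str):
        try:
            return METRICS[metric.lower()]
        except KeyError:
            raise ValueError(f"Unknown metric: {metric!r}") from None
    raise TypeError("metric must be a string or a callable")


distance = resolve_metric("euclidean")([0.0, 0.0], [3.0, 4.0])
```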
