A module containing the distance metric classifier.
This module contains the DistanceMetricClassifier introduced by Chaini et al. (2024)
in “Light Curve Classification with DistClassiPy: a new distance-based classifier”.
A distance-based classifier that supports different distance metrics.
The distance metric classifier determines the similarity between features in a
dataset by leveraging different distance metrics. A specified
distance metric is used to compute the distance between a given object and a
centroid for every training class in the feature space. The classifier supports
the use of different statistical measures for constructing the centroid and scaling
the computed distance. The classifier can also provide an estimate of the
confidence of its predictions.
Parameters:
scale (bool, default=True) – Whether to scale the distance between the test object and the centroid for a
class in the feature space. If True, the data will be scaled based on the
specified dispersion statistic.
central_stat ({"mean", "median"}, default="median") – The statistic used to calculate the central tendency of the data to construct
the feature-space centroid. Supported statistics are “mean” and “median”.
dispersion_stat ({"std", "iqr"}, default="std") –
The statistic used to calculate the dispersion of the data for scaling the
distance. Supported statistics are “std” for standard deviation and “iqr”
for inter-quartile range.
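As a rough illustration of how central_stat, dispersion_stat and scale fit together, the sketch below computes a per-feature centroid and dispersion for one class and then a dispersion-scaled Euclidean distance. The helper names (centroid, dispersion, scaled_distance), the use of the population standard deviation, and the "inclusive" quartile method are assumptions made for this example; they are not taken from DistClassiPy's implementation.

```python
import math
import statistics


def centroid(rows, central_stat="median"):
    """Per-feature central tendency of a list of feature vectors."""
    stat = {"mean": statistics.fmean, "median": statistics.median}[central_stat]
    return [stat(col) for col in zip(*rows)]


def dispersion(rows, dispersion_stat="std"):
    """Per-feature spread: standard deviation or inter-quartile range."""

    def iqr(col):
        q1, _, q3 = statistics.quantiles(col, n=4, method="inclusive")
        return q3 - q1

    stat = {"std": statistics.pstdev, "iqr": iqr}[dispersion_stat]
    return [stat(col) for col in zip(*rows)]


def scaled_distance(x, center, spread):
    """Euclidean distance after dividing each feature by its dispersion.

    Assumes every entry of `spread` is non-zero.
    """
    return math.sqrt(sum(((a - c) / s) ** 2 for a, c, s in zip(x, center, spread)))


# Three training samples of a single class, two features each.
rows = [[1.0, 10.0], [2.0, 20.0], [6.0, 30.0]]
center = centroid(rows, "median")   # [2.0, 20.0]
spread = dispersion(rows, "iqr")    # [2.5, 10.0]
print(scaled_distance([4.5, 30.0], center, spread))
```

With scaling, a feature measured in large units (the second column here) no longer dominates the distance simply because of its numeric range.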
Calculate the feature space centroid for all classes.
This function calculates the feature space centroid in the training
set (X, y) for all classes using the central statistic. If scaling
is enabled, it also calculates the appropriate dispersion statistic.
This involves computing the centroid for every class in the feature space and
optionally calculating the kernel density estimate and 1-dimensional distance.
Parameters:
X (array-like of shape (n_samples, n_features)) – The training input samples.
y (array-like of shape (n_samples,)) – The target values (class labels).
feat_labels (list of str, optional, default=None) – The feature labels. If not provided, default labels representing feature
number will be used.
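The fitting step described above can be sketched in plain Python as grouping the training samples by class and taking a per-feature statistic. This is only a minimal sketch: the function name fit_centroids and the default label pattern "feature_0", "feature_1", ... are hypothetical, and the library's actual default labels may differ.

```python
import statistics
from collections import defaultdict


def fit_centroids(X, y, feat_labels=None):
    """Group samples by class label and take the per-feature median."""
    n_features = len(X[0])
    if feat_labels is None:
        # Hypothetical default naming based on the feature number.
        feat_labels = [f"feature_{i}" for i in range(n_features)]

    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)

    centroids = {}
    for label, rows in by_class.items():
        medians = [statistics.median(col) for col in zip(*rows)]
        centroids[label] = dict(zip(feat_labels, medians))
    return centroids


X = [[1.0, 2.0], [3.0, 4.0], [10.0, 20.0], [12.0, 24.0], [5.0, 6.0]]
y = ["a", "a", "b", "b", "a"]
centroids = fit_centroids(X, y)
print(centroids["a"])  # {'feature_0': 3.0, 'feature_1': 4.0}
print(centroids["b"])  # {'feature_0': 11.0, 'feature_1': 22.0}
```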
Predict the class labels for the provided X.
The prediction is based on the distance of each data point in the input sample
to the centroid for each class in the feature space. The predicted class is the
one whose centroid is the closest to the input sample.
Parameters:
X (array-like of shape (n_samples, n_features)) – The input samples.
metric (str or callable, default="euclidean") –
The distance metric to use for calculating the distance between features.
Changed in version 0.2.0: The metric is now specified at prediction time rather
than during initialization, providing greater flexibility.
Returns:
y – The predicted classes.
Return type:
ndarray of shape (n_samples,)
See also
scipy.spatial.distance
Other distance metrics provided in SciPy
distclassipy.Distance
Distance metrics included with DistClassiPy
Notes
If using distance metrics supported by SciPy, it is desirable to pass a string,
which allows SciPy to use an optimized C version of the code instead of the
slower Python version.
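To make the "str or callable" behaviour of metric concrete, here is a minimal nearest-centroid sketch. The small NAMED_METRICS table is a stand-in invented for this example; the real classifier resolves metric names through SciPy and distclassipy.Distance rather than a local dictionary.

```python
import math

# Stand-in registry for named metrics (an assumption for this sketch).
NAMED_METRICS = {
    "euclidean": math.dist,
    "cityblock": lambda u, v: sum(abs(a - b) for a, b in zip(u, v)),
}


def predict(X, centroids, metric="euclidean"):
    """Return, for each sample, the label of the closest class centroid."""
    dist = NAMED_METRICS[metric] if isinstance(metric, str) else metric
    return [min(centroids, key=lambda label: dist(x, centroids[label])) for x in X]


def chebyshev(u, v):
    """A user-supplied callable metric: largest per-feature difference."""
    return max(abs(a - b) for a, b in zip(u, v))


centroids = {"a": [0.0, 0.0], "b": [4.0, 4.0]}
X = [[1.0, 1.0], [3.0, 5.0]]
print(predict(X, centroids))                      # ['a', 'b']
print(predict(X, centroids, metric="cityblock"))  # ['a', 'b']
print(predict(X, centroids, metric=chebyshev))    # ['a', 'b']
```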
Predict the class labels for the provided X and perform analysis.
The prediction is based on the distance of each data point in the input sample
to the centroid for each class in the feature space. The predicted class is the
one whose centroid is the closest to the input sample.
The analysis involves saving all calculated distances and confidences as an
attribute for inspection and analysis later.
Parameters:
X (array-like of shape (n_samples, n_features)) – The input samples.
metric (str or callable, default="euclidean") – The distance metric to use for calculating the distance between features.
Returns:
y – The predicted classes.
Return type:
ndarray of shape (n_samples,)
See also
scipy.spatial.distance
Other distance metrics provided in SciPy
distclassipy.Distance
Distance metrics included with DistClassiPy
Notes
If using distance metrics supported by SciPy, it is desirable to pass a string,
which allows SciPy to use an optimized C version of the code instead
of the slower Python version.
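The idea of keeping distances and confidences around for later inspection can be sketched as follows. The class name, the attribute names distances_ and confidences_, and the normalised inverse-distance confidence are all assumptions chosen for illustration; the documentation above does not specify how the library stores or computes them.

```python
import math


class CentroidAnalysisSketch:
    """Toy classifier that keeps per-class distances after predicting."""

    def __init__(self, centroids):
        self.centroids = centroids  # {class label: centroid vector}

    def predict_and_analyse(self, X):
        labels = list(self.centroids)
        # One dict per sample, holding the distance to every class centroid.
        self.distances_ = [
            {c: math.dist(x, self.centroids[c]) for c in labels} for x in X
        ]
        # Assumed confidence: inverse distance, normalised to sum to 1.
        self.confidences_ = []
        for row in self.distances_:
            inv = {c: 1.0 / max(d, 1e-12) for c, d in row.items()}
            total = sum(inv.values())
            self.confidences_.append({c: v / total for c, v in inv.items()})
        # The predicted class is the one with the smallest distance.
        return [min(row, key=row.get) for row in self.distances_]


clf = CentroidAnalysisSketch({"a": [0.0, 0.0], "b": [4.0, 0.0]})
preds = clf.predict_and_analyse([[1.0, 0.0]])
print(preds)                # ['a']
print(clf.distances_[0])    # {'a': 1.0, 'b': 3.0}
print(clf.confidences_[0])  # roughly {'a': 0.75, 'b': 0.25}
```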
score(X, y, metric: str | Callable = 'euclidean') → float
Return the mean accuracy on the given test data and labels.
Parameters:
X (array-like of shape (n_samples, n_features)) – Test samples.
y (array-like of shape (n_samples,)) – True labels for X.
metric (str or callable, default="euclidean") – The distance metric to use for calculating the distance between features.
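Mean accuracy is simply the fraction of samples whose predicted label equals the true label. The helper below is a minimal sketch of that calculation; the name mean_accuracy is hypothetical, and in practice the predictions would come from predict called with the chosen metric.

```python
def mean_accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


y_true = ["a", "b", "b", "a"]
y_pred = ["a", "b", "a", "a"]
print(mean_accuracy(y_true, y_pred))  # 0.75
```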
The method works on simple estimators as well as on nested objects
(such as Pipeline). The latter have
parameters of the form <component>__<parameter> so that it’s
possible to update each component of a nested object.
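The `<component>__<parameter>` naming convention can be illustrated with a toy parser that separates an estimator's own parameters from those addressed to a nested component. This is not scikit-learn's code; split_nested_params and the component name "clf" are hypothetical.

```python
def split_nested_params(params):
    """Split {'clf__scale': False, 'verbose': True} into own and nested parts."""
    own, nested = {}, {}
    for key, value in params.items():
        name, sep, sub_key = key.partition("__")
        if sep:
            # 'clf__scale' targets parameter 'scale' of component 'clf'.
            nested.setdefault(name, {})[sub_key] = value
        else:
            own[key] = value
    return own, nested


own, nested = split_nested_params(
    {"verbose": True, "clf__scale": False, "clf__central_stat": "mean"}
)
print(own)     # {'verbose': True}
print(nested)  # {'clf': {'scale': False, 'central_stat': 'mean'}}
```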
Note that this method is only relevant if
enable_metadata_routing=True (see sklearn.set_config()).
See the scikit-learn User Guide for how the routing
mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to score.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the
existing request. This allows you to change the request for some
parameters and not others.
Added in version 1.3.
Note
This method is only relevant if this estimator is used as a
sub-estimator of a meta-estimator, e.g. used inside a
Pipeline. Otherwise it has no effect.
Parameters:
metric (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for metric parameter in score.
An ensemble classifier that uses different metrics for each quantile.
This classifier splits the data into quantiles based on a specified
feature and uses a different distance metric for each quantile. The
resulting per-quantile classifiers form an ensemble, which generally
leads to better performance.
Note, however, that the training set must be fitted for each metric
to evaluate performance, which makes this approach more computationally expensive.
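The routing idea behind the ensemble can be sketched as: find the quantile bin of each sample from one feature, then classify it with the metric assigned to that bin. Everything named here (ensemble_predict, METRICS, the bin edges, and the per-bin metric list) is an assumption made for illustration and does not reflect the library's internals.

```python
import bisect
import math

# Stand-in metric table (an assumption for this sketch).
METRICS = {
    "euclidean": math.dist,
    "cityblock": lambda u, v: sum(abs(a - b) for a, b in zip(u, v)),
}


def ensemble_predict(X, centroids, feat_idx, edges, metric_per_bin):
    """Classify each sample with the metric chosen for its quantile bin."""
    preds = []
    for x in X:
        q = bisect.bisect_left(edges, x[feat_idx])  # index of the sample's bin
        dist = METRICS[metric_per_bin[q]]
        preds.append(min(centroids, key=lambda c: dist(x, centroids[c])))
    return preds


centroids = {"a": [0.0, 0.0], "b": [3.0, 1.0]}
X = [[0.5, 4.0], [2.5, 1.0]]
edges = [2.0]  # one interior edge on feature 1 gives two bins

# The first sample falls in the upper bin, where the two metrics disagree.
print(ensemble_predict(X, centroids, 1, edges, ["euclidean", "cityblock"]))  # ['a', 'b']
print(ensemble_predict(X, centroids, 1, edges, ["cityblock", "euclidean"]))  # ['b', 'b']
```

The first sample shows why the choice matters: it is closer to "b" under the Euclidean metric but closer to "a" under the city-block metric.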
Evaluate and find the best distance metrics for the specified feature.
This method uses the standalone find_best_metrics function to evaluate
different distance metrics and determine the best-performing ones for
each quantile.
Parameters:
X (np.ndarray) – The input feature matrix.
y (np.ndarray) – The target labels.
n_quantiles (int, default=4) – The number of quantiles to split the data into.
Returns:
quantile_scores_df (pd.DataFrame) – A DataFrame containing the accuracy scores for each metric across
different quantiles.
best_metrics_per_quantile (pd.Series) – A Series indicating the best-performing metric for each quantile.
group_bins (np.ndarray) – The bins used for quantile splitting.
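A simplified picture of the evaluation is sketched below: bin the samples by quantiles of one feature, score each metric's predictions within every bin, and pick the best metric per bin. This sketch uses plain dictionaries and lists instead of the pandas objects returned by the real method, takes precomputed predictions as input, and uses the hypothetical name best_metric_per_quantile.

```python
import bisect
import statistics


def best_metric_per_quantile(feature, y_true, preds_by_metric, n_quantiles=4):
    """Score each metric's predictions inside quantile bins of one feature."""
    # n_quantiles bins are separated by n_quantiles - 1 interior edges.
    edges = statistics.quantiles(feature, n=n_quantiles, method="inclusive")
    bins = [bisect.bisect_left(edges, v) for v in feature]

    scores = {}
    for metric, preds in preds_by_metric.items():
        hits = [0] * n_quantiles
        counts = [0] * n_quantiles
        for b, t, p in zip(bins, y_true, preds):
            counts[b] += 1
            hits[b] += t == p
        # Empty bins get NaN rather than a misleading accuracy of zero.
        scores[metric] = [h / c if c else float("nan") for h, c in zip(hits, counts)]

    best = [max(scores, key=lambda m: scores[m][q]) for q in range(n_quantiles)]
    return scores, best, edges


feature = [1.0, 2.0, 3.0, 4.0]
y_true = ["a", "a", "b", "b"]
preds_by_metric = {
    "euclidean": ["a", "b", "b", "b"],
    "cityblock": ["a", "a", "b", "a"],
}
scores, best, edges = best_metric_per_quantile(feature, y_true, preds_by_metric, n_quantiles=2)
print(scores)  # {'euclidean': [0.5, 1.0], 'cityblock': [1.0, 0.5]}
print(best)    # ['cityblock', 'euclidean']
print(edges)   # [2.5]
```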
Return the mean accuracy on the given test data and labels.
In multi-label classification, this is the subset accuracy
which is a harsh metric since you require for each sample that
each label set be correctly predicted.
Parameters:
X (array-like of shape (n_samples, n_features)) – Test samples.
y (array-like of shape (n_samples,) or (n_samples, n_outputs)) – True labels for X.
sample_weight (array-like of shape (n_samples,), default=None) – Sample weights.
Returns:
score – Mean accuracy of self.predict(X) w.r.t. y.
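To show why subset accuracy is a harsh metric, the sketch below compares it with a per-label accuracy on a small multi-label example encoded as indicator rows. Both helper names are hypothetical and the code is independent of scikit-learn.

```python
def subset_accuracy(y_true, y_pred):
    """Fraction of samples whose whole label row is predicted exactly."""
    return sum(list(t) == list(p) for t, p in zip(y_true, y_pred)) / len(y_true)


def per_label_accuracy(y_true, y_pred):
    """Fraction of individual label entries that are predicted correctly."""
    pairs = [(a, b) for t, p in zip(y_true, y_pred) for a, b in zip(t, p)]
    return sum(a == b for a, b in pairs) / len(pairs)


y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0]]
y_pred = [[1, 0, 1], [0, 1, 1], [1, 1, 0]]  # one wrong entry in the second row

print(subset_accuracy(y_true, y_pred))     # 2 of 3 rows match exactly
print(per_label_accuracy(y_true, y_pred))  # 8 of 9 entries match
```

A single wrong label makes the whole sample count as an error under subset accuracy, even though almost all individual labels are correct.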
The method works on simple estimators as well as on nested objects
(such as Pipeline). The latter have
parameters of the form <component>__<parameter> so that it’s
possible to update each component of a nested object.
Note that this method is only relevant if
enable_metadata_routing=True (see sklearn.set_config()).
See the scikit-learn User Guide for how the routing
mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to score.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the
existing request. This allows you to change the request for some
parameters and not others.
Added in version 1.3.
Note
This method is only relevant if this estimator is used as a
sub-estimator of a meta-estimator, e.g. used inside a
Pipeline. Otherwise it has no effect.
Parameters:
sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in score.
This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program. If not, see <https://www.gnu.org/licenses/>.
Note that this method is only relevant if
enable_metadata_routing=True (see sklearn.set_config()).
See the scikit-learn User Guide for how the routing
mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to fit.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the
existing request. This allows you to change the request for some
parameters and not others.
Added in version 1.3.
Note
This method is only relevant if this estimator is used as a
sub-estimator of a meta-estimator, e.g. used inside a
Pipeline. Otherwise it has no effect.
Parameters:
feat_labels (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for feat_labels parameter in fit.
Note that this method is only relevant if
enable_metadata_routing=True (see sklearn.set_config()).
See the scikit-learn User Guide for how the routing
mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to predict if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to predict.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the
existing request. This allows you to change the request for some
parameters and not others.
Added in version 1.3.
Note
This method is only relevant if this estimator is used as a
sub-estimator of a meta-estimator, e.g. used inside a
Pipeline. Otherwise it has no effect.
Parameters:
metric (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for metric parameter in predict.
Note that this method is only relevant if
enable_metadata_routing=True (see sklearn.set_config()).
See the scikit-learn User Guide for how the routing
mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to fit.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the
existing request. This allows you to change the request for some
parameters and not others.
Added in version 1.3.
Note
This method is only relevant if this estimator is used as a
sub-estimator of a meta-estimator, e.g. used inside a
Pipeline. Otherwise it has no effect.
Parameters:
n_quantiles (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for n_quantiles parameter in fit.
Evaluate and find the best distance metrics for a given feature.
This function evaluates different distance metrics to determine which
performs best for a specific feature in the dataset. It splits the data
into quantiles based on the specified feature and calculates the accuracy
of the classifier for each metric within these quantiles.
Parameters:
feat_idx (int) – The index of the feature to be used for quantile splitting.
n_quantiles (int, default=4) – The number of quantiles to split the data into.
metrics_to_consider (list of str, optional) – A list of distance metrics to evaluate. If None, all available
metrics within DistClassiPy will be considered.
random_state (int, RandomState instance or None, optional (default=None)) –
Controls the randomness of the estimator. Pass an int for reproducible
output across multiple function calls.
Added in version 0.2.1.
Returns:
quantile_scores_df (pd.DataFrame) – A DataFrame containing the accuracy scores for each metric across
different quantiles.
best_metrics_per_quantile (pd.Series) – A Series indicating the best-performing metric for each quantile.
group_bins (np.ndarray) – The bins used for quantile splitting.
Set the metric function based on the provided metric.
If the metric is a string, the function will look for a corresponding
function in scipy.spatial.distance or distances.Distance. If the metric
is a function, it will be used directly.
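The lookup order described above (callables used directly, strings searched in one namespace and then another) can be sketched with stand-in namespaces. The objects primary and fallback below are placeholders invented for this example; they merely play the roles of scipy.spatial.distance and distances.Distance, and hypothetical_metric is not a real DistClassiPy metric name.

```python
import math
from types import SimpleNamespace

# Placeholder namespaces standing in for the two real lookup locations.
primary = SimpleNamespace(euclidean=math.dist)
fallback = SimpleNamespace(
    hypothetical_metric=lambda u, v: sum(abs(a - b) for a, b in zip(u, v))
)


def resolve_metric(metric, namespaces):
    """Return a callable for `metric`, given as a name or as a function."""
    if callable(metric):
        return metric  # functions are used directly
    if isinstance(metric, str):
        for ns in namespaces:
            func = getattr(ns, metric, None)
            if callable(func):
                return func
        raise ValueError(f"Unknown metric name: {metric!r}")
    raise TypeError("metric must be a string or a callable")


spaces = [primary, fallback]
print(resolve_metric("euclidean", spaces)([0.0, 0.0], [3.0, 4.0]))            # 5.0
print(resolve_metric("hypothetical_metric", spaces)([0.0, 0.0], [3.0, 4.0]))  # 7.0
print(resolve_metric(max, spaces) is max)                                     # True
```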