Python Outlier Detection Algorithms -- KNN

2024-10-16

Source: 数据STUDIO

K-nearest neighbors (KNN) is one of the most popular algorithms in machine learning, used in both supervised and unsupervised settings. In supervised learning, KNN classifies a new data point by majority vote among its K nearest neighbors. In unsupervised learning, it uses the distances to a point's K nearest neighbors to define outliers. PyOD primarily uses KNN for unsupervised learning. This article discusses KNN in both settings and how anomaly scores are defined.

KNN in Unsupervised Learning

The unsupervised KNN method calculates the distance between observations, typically the Euclidean distance. Because there are no labels to fit, there are no parameters to tune for performance beyond the choice of K. The steps are as follows:

1. Calculate the distance from each data point to every other point.
2. Sort the data points by distance in ascending order.
3. Select the first K entries, which are the K nearest neighbors.

Euclidean distance is the most commonly used distance calculation method in this approach.
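
The three steps above can be sketched in plain Python. This is only an illustration under simple assumptions, not PyOD's implementation; the function name `k_nearest` and the sample points are hypothetical.

```python
import math

def k_nearest(points, index, k):
    """Return (distance, neighbor_index) pairs for the k points nearest to points[index]."""
    target = points[index]
    # Step 1: Euclidean distance from the target to every other point.
    distances = [
        (math.dist(target, other), j)
        for j, other in enumerate(points)
        if j != index
    ]
    # Step 2: sort by distance in ascending order.
    distances.sort()
    # Step 3: keep the first k entries.
    return distances[:k]

# Hypothetical sample data: four clustered points and one distant point.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
neighbors = k_nearest(points, index=0, k=2)
print(neighbors)
```

A brute-force scan like this costs one distance computation per pair of points, which is fine for small data; libraries typically use tree-based indexes to speed up the neighbor search.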

KNN in Supervised Learning

KNN is a widely used supervised classification algorithm. It predicts the category of a new data point on the assumption that similar points tend to lie close to each other. It calculates the distance between the new point and every labeled point, picks the K nearest neighbors (for example, K = 5), counts the categories among them, and assigns the category by majority vote. If four of a new point's five nearest neighbors are red and one is blue, the point is classified as red.

In addition to steps 1 through 3 of the unsupervised method, supervised KNN adds two steps:

4. Count the number of points in each category among the K neighbors.
5. Assign the new data point to the majority class.
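
As a minimal sketch of the voting steps, the following plain-Python example reproduces the "4 red, 1 blue" case described above. The function name `knn_classify` and the training points are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter

def knn_classify(train_points, train_labels, new_point, k=5):
    """Classify new_point by majority vote among its k nearest labeled points."""
    # Steps 1-3: compute distances, sort ascending, keep the first k entries.
    nearest = sorted(
        (math.dist(new_point, p), label)
        for p, label in zip(train_points, train_labels)
    )[:k]
    # Step 4: count the categories among the k neighbors.
    votes = Counter(label for _, label in nearest)
    # Step 5: assign the majority class.
    return votes.most_common(1)[0][0]

# Hypothetical labeled data: the five points nearest (1.0, 1.0) are 4 red and 1 blue.
train_points = [(0.9, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3),
                (1.6, 1.5), (4.0, 4.0), (4.2, 3.9)]
train_labels = ["red", "red", "red", "red", "blue", "blue", "blue"]

predicted = knn_classify(train_points, train_labels, new_point=(1.0, 1.0), k=5)
print(predicted)
```

An odd K is a common choice for two-class problems because it avoids tied votes.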

Defining Anomaly Scores

Outliers are points that lie far from their neighbors, so a point's anomaly score is based on the distances to its K nearest neighbors. In PyOD, KNN offers three methods for turning those distances into an anomaly score: largest, mean, or median. These are ways of aggregating the K distances, not distance metrics. The largest method uses the greatest of the K distances, which is the distance to the Kth nearest neighbor; mean and median use the average and the median of the K distances, respectively.
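
The three scoring methods can be illustrated with a short plain-Python sketch. The function `knn_anomaly_score` and the sample points are hypothetical; the method names mirror the options described above, but the code is not PyOD's implementation.

```python
import math
import statistics

def knn_anomaly_score(points, index, k, method="largest"):
    """Anomaly score of points[index] from the distances to its k nearest neighbors."""
    target = points[index]
    # Distances to the k nearest neighbors, in ascending order.
    k_distances = sorted(
        math.dist(target, other)
        for j, other in enumerate(points)
        if j != index
    )[:k]
    if method == "largest":
        return k_distances[-1]  # distance to the kth nearest neighbor
    if method == "mean":
        return statistics.mean(k_distances)
    if method == "median":
        return statistics.median(k_distances)
    raise ValueError(f"unknown method: {method}")

# Hypothetical sample data: four clustered points and one distant point.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
scores = [knn_anomaly_score(points, i, k=2) for i in range(len(points))]
print(scores)
```

The distant point receives a much larger score than the clustered points under all three methods; the methods differ mainly in how sensitive they are to a single unusually far neighbor.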

Modeling Steps

The modeling process has three steps: build a model that scores observations, choose a threshold that separates outliers from normal observations, and compare descriptive statistics of the two groups to check that the model is sound. If the mean of any feature in the outlier group is inconsistent with expectations, investigate, modify, or drop the feature, and repeat until the model meets expectations.

The original article provides example code that generates data with outliers using PyOD's generate_data() function and builds a KNN model. The code calculates anomaly scores, predicts outliers, and evaluates the model on training and test data. PyOD's KNN model assumes a default contamination rate of 10%. This rate only determines the threshold used to label outliers; it does not affect the anomaly scores themselves.

Finally, the article suggests using histograms of anomaly scores to determine a reasonable threshold for identifying outliers. It also emphasizes analyzing the normal and anomaly groups to ensure the model's reasonableness, suggesting iterative refinement of the model and features based on domain knowledge and the data provided.
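
As a rough illustration of this last step, the sketch below prints a text histogram of anomaly scores, applies a threshold chosen by eye, and compares a feature's mean across the two groups. All values, including the threshold of 3.0, are hypothetical and are not taken from the original article.

```python
import statistics

# Hypothetical anomaly scores and one feature value per observation.
scores = [0.8, 1.1, 0.9, 1.0, 1.2, 0.7, 6.4, 5.9]
feature = [10.2, 9.8, 10.1, 10.0, 9.9, 10.3, 25.0, 24.1]

# Crude text histogram of the scores in unit-width bins.
bins = {}
for s in scores:
    bins[int(s)] = bins.get(int(s), 0) + 1
for left in sorted(bins):
    print(f"[{left}, {left + 1}): {'#' * bins[left]}")

# Threshold picked from the visible gap in the histogram (an assumption).
threshold = 3.0

# Split the feature values into normal and outlier groups.
groups = {"normal": [], "outlier": []}
for s, x in zip(scores, feature):
    groups["outlier" if s >= threshold else "normal"].append(x)

# Descriptive statistics per group: count and feature mean.
summary = {name: (len(vals), statistics.mean(vals)) for name, vals in groups.items()}
print(summary)
```

If the outlier group's feature mean differed from the normal group's in a direction that domain knowledge cannot explain, that would be a cue to revisit the feature or the threshold.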
