Python Outlier Detection Algorithms: Isolation Forest

Published 2024-10-16

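Copyright Notice

We take original articles seriously. To respect intellectual property and avoid potential copyright issues, we provide only a summary of the article here for an initial overview. For the full content, please visit the author's official account page.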

Source: 数据STUDIO

Isolation Forest Summary

Isolation Forest Overview

Isolation Forest (IForest) is an outlier detection method proposed by Liu, Ting, and Zhou in 2008. Rather than first modeling normal data and then flagging deviations from it, it isolates anomalies directly. It builds an ensemble of Isolation Trees (iTrees) that recursively partition the observations with random splits; anomalies are few and different, so they are separated after fewer splits and have shorter path lengths in the trees. The method is unsupervised, and the iTrees do not need to be grown fully, because the points isolated near the root are the ones most likely to be outliers.
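The path-length idea can be sketched in plain Python. This is a simplified illustration, not the authors' reference implementation: the function names, the toy data, and the parameter defaults are assumptions, and the leaf-size adjustment used in the published algorithm is omitted. Instead of storing whole trees, each "tree" is followed only along the branch that contains the query point.

```python
import math
import random


def path_length(x, sample, rng, max_depth, depth=0):
    """Number of random splits needed before x shares its region with at most one sample point."""
    if len(sample) <= 1 or depth >= max_depth:
        return depth
    feature = rng.randrange(len(x))              # pick a random feature
    values = [row[feature] for row in sample]
    lo, hi = min(values), max(values)
    if lo == hi:                                 # no split possible on this feature
        return depth
    split = rng.uniform(lo, hi)                  # pick a random split value
    # Keep only the sample points that fall on the same side of the split as x.
    same_side = [row for row in sample
                 if (row[feature] < split) == (x[feature] < split)]
    return path_length(x, same_side, rng, max_depth, depth + 1)


def average_path_length(x, data, n_trees=100, sample_size=64, seed=0):
    """Average path length of x over n_trees random trees, each built on a subsample."""
    rng = random.Random(seed)
    size = min(sample_size, len(data))
    max_depth = math.ceil(math.log2(size))       # trees are not grown fully
    total = 0
    for _ in range(n_trees):
        sample = rng.sample(data, size)
        total += path_length(x, sample, rng, max_depth)
    return total / n_trees


# Toy data (assumed for illustration): 200 clustered points plus one distant point.
data_rng = random.Random(1)
inliers = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(200)]
outlier = (8.0, 8.0)
data = inliers + [outlier]

outlier_depth = average_path_length(outlier, data)
inlier_depth = average_path_length((0.0, 0.0), data)
print(f"average path length: outlier={outlier_depth:.2f}, inlier={inlier_depth:.2f}")
```

The distant point should come out with a clearly shorter average path than the point at the center of the cluster, which is the signal an Isolation Forest turns into an outlier score.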

Why "Forest"?

In machine learning, a "forest" is an ensemble of trees; combining many trees reduces the overfitting that a single decision tree is prone to. Isolation Forest uses this ensemble approach, and because it does not rely on distance measures to detect outliers, it is efficient on large datasets and high-dimensional problems.

Modeling Process

To illustrate Isolation Forest, a simulated dataset with six variables and 500 observations was created, with 5% of the observations designated as outliers. An Isolation Forest model was then trained with max_samples set to 40, a small subsample size intended to produce better iTrees. The model was assessed through its predictions and its decision function, with the contamination rate set to 5% to determine the threshold for classifying outliers.
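The setup described above might look roughly like the following. The summary does not name the library or show how the data was simulated, so this sketch assumes scikit-learn's IsolationForest and an invented data-generating process (standard normal inliers, uniformly shifted outliers); only the sizes, max_samples=40, and the 5% contamination come from the text.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
n_total, n_features, contamination = 500, 6, 0.05
n_outliers = int(n_total * contamination)        # 25 of the 500 observations

# Assumed simulation: inliers around 0, outliers shifted far away in every variable.
inliers = rng.normal(0.0, 1.0, size=(n_total - n_outliers, n_features))
outliers = rng.uniform(6.0, 8.0, size=(n_outliers, n_features))
X = np.vstack([inliers, outliers])
y_true = np.r_[np.zeros(n_total - n_outliers, dtype=int),
               np.ones(n_outliers, dtype=int)]

model = IsolationForest(max_samples=40, contamination=contamination, random_state=0)
model.fit(X)

labels = model.predict(X)               # scikit-learn convention: +1 inlier, -1 outlier
scores = model.decision_function(X)     # lower is more abnormal; negative means outlier

n_flagged = int((labels == -1).sum())
recall = float((labels[y_true == 1] == -1).mean())
print(f"flagged {n_flagged} of {n_total}; share of simulated outliers flagged: {recall:.2f}")
```

Other libraries use different conventions (for example 0/1 labels, or scores where higher means more abnormal), so the sign handling here applies to scikit-learn only.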

Contamination Rate

The contamination rate is the assumed proportion of outliers in the data. In practice the true percentage is usually unknown, so the rate matters mainly as a way to set a reasonable threshold for outlier detection. In this example it was set to 5% to match the share of outliers in the training sample, and it determines the threshold that the fitted model exposes as threshold_.
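One common way a contamination rate becomes a threshold is as a percentile of the training scores. The sketch below shows that relationship with NumPy only; the scores are synthetic, the helper name is hypothetical, and it is an assumption that the threshold_ mentioned above is derived in the same way.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical outlier scores, using the convention that higher means more abnormal.
scores = np.concatenate([rng.normal(0.0, 1.0, 475), rng.normal(6.0, 0.5, 25)])


def threshold_from_contamination(scores, contamination):
    """Score value above which the top `contamination` share of observations falls."""
    return np.percentile(scores, 100 * (1 - contamination))


flagged_counts = {}
for rate in (0.01, 0.05, 0.10):
    threshold = threshold_from_contamination(scores, rate)
    flagged_counts[rate] = int((scores > threshold).sum())
    print(f"contamination={rate:.2f}  threshold={threshold:.3f}  flagged={flagged_counts[rate]}")
```

Changing the rate does not change the scores themselves, only where the cut is placed, which is why a poorly chosen rate flags too many or too few observations.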

Hyperparameters

Key hyperparameters of the Isolation Forest model include "max_samples" for the number of samples per base estimator, "n_estimators" for the number of trees in the ensemble, "max_features" for the number of features per base estimator, and "n_jobs" for the number of parallel jobs to run.
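As a hedged illustration, these four names match constructor parameters of scikit-learn's IsolationForest, so the sketch below assumes that library; the values other than max_samples=40 and the random training data are arbitrary choices for the example.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

X = np.random.default_rng(0).normal(size=(200, 6))   # arbitrary data for the example

model = IsolationForest(
    n_estimators=100,   # number of trees in the ensemble
    max_samples=40,     # observations drawn to build each tree
    max_features=1.0,   # float = fraction of features drawn per tree (1.0 means all)
    n_jobs=1,           # parallel jobs; -1 would use all available cores
    random_state=0,
)
model.fit(X)

params = model.get_params()
print({k: params[k] for k in ("n_estimators", "max_samples", "max_features", "n_jobs")})
print("trees built:", len(model.estimators_), "| samples per tree:", model.max_samples_)
```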

Feature Importance

The Isolation Forest can measure the relative importance of features in detecting anomalies, which is quantified using the Gini impurity index. A feature importance plot reveals the relative strengths of features in determining anomalies.

Threshold Determination and Descriptive Statistics

A reasonable threshold should be chosen from the histogram of outlier scores. Descriptive statistics for the normal and anomaly groups, split at that threshold, can then be compared to check whether the model's results are reasonable.
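The two steps above can be sketched with NumPy alone. To keep the example self-contained, the scores here are a stand-in (distance from the feature-wise median), not Isolation Forest scores, and the data, the threshold value of 5.0, and the variable names are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, (475, 2)), rng.normal(6.0, 1.0, (25, 2))])

# Stand-in outlier score (NOT Isolation Forest): distance from the feature-wise median.
scores = np.linalg.norm(X - np.median(X, axis=0), axis=1)

# Step 1: inspect the histogram of scores and look for a gap between the two groups.
counts, edges = np.histogram(scores, bins=20)
for count, left, right in zip(counts, edges[:-1], edges[1:]):
    print(f"{left:6.2f} to {right:6.2f} | {'#' * (count // 5)} {count}")

threshold = 5.0  # hypothetical cut, chosen by eye from a histogram like the one printed

# Step 2: compare descriptive statistics of the two groups split at the threshold.
is_outlier = scores >= threshold
summary = {}
for name, mask in (("normal", ~is_outlier), ("outlier", is_outlier)):
    summary[name] = {
        "count": int(mask.sum()),
        "share": float(mask.mean()),
        "feature_means": X[mask].mean(axis=0),
        "mean_score": float(scores[mask].mean()),
    }
    print(name, summary[name]["count"], f"{summary[name]['share']:.1%}",
          np.round(summary[name]["feature_means"], 2),
          f"{summary[name]['mean_score']:.2f}")
```

If the outlier group is a small share of the data and its feature means and average score differ clearly from the normal group, the threshold is behaving sensibly; if the two groups look alike, the cut or the model should be revisited.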