Python Outlier Detection Algorithms -- Isolation Forest
Isolation Forest Overview
Isolation Forest (iForest) is an outlier detection method proposed by Liu, Ting, and Zhou in 2008. Unlike most detectors, which first model normal data patterns and then flag deviations, it isolates anomalies directly. It builds an ensemble of Isolation Trees (iTrees) that recursively partition the observations; anomalies are easier to separate and therefore end up with shorter path lengths in the trees. The method is unsupervised, and the iTrees do not need to be grown to full depth, since points isolated near the root are the likely outliers.
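For reference, the anomaly score defined in the original 2008 paper (a standard formula, quoted here rather than taken from the article's text) is

$$
s(x, n) = 2^{-\frac{E[h(x)]}{c(n)}}, \qquad c(n) = 2H(n-1) - \frac{2(n-1)}{n},
$$

where $h(x)$ is the path length of observation $x$ in an iTree, $E[h(x)]$ is its average over the ensemble, $c(n)$ is the expected path length of an unsuccessful search in a binary search tree on $n$ points, and $H(i) \approx \ln(i) + 0.5772$ is the harmonic number. Scores close to 1 indicate anomalies, while scores well below 0.5 indicate normal points.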
Why "Forest"?
The concept of a "forest" in machine learning refers to an ensemble of trees, which mitigates the overfitting of any single decision tree. Isolation Forest uses this ensemble approach, and because it does not rely on distance measures to detect outliers, it scales well to large datasets and high-dimensional problems.
Modeling Process
To illustrate the Isolation Forest, a simulated dataset with six variables and 500 observations was created, with 5% of the observations designated as outliers. An Isolation Forest model was trained with max_samples set to 40, so that each iTree is built on a small subsample, which helps the trees isolate anomalies more cleanly. The model was then assessed through its predictions and decision scores, with the contamination rate set to 5% to determine the threshold for outlier classification.
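The article's exact code is not reproduced here, but a minimal sketch of this modeling process might look like the following. It assumes the PyOD library's IForest wrapper (suggested by the later mention of threshold_); the data generation and variable names are illustrative.

```python
import numpy as np
from pyod.models.iforest import IForest

rng = np.random.RandomState(42)

# Simulated data: 500 observations, 6 variables, 5% outliers (illustrative,
# not the article's original dataset).
n_obs, n_features, contamination = 500, 6, 0.05
n_outliers = int(n_obs * contamination)
X_inliers = rng.normal(loc=0.0, scale=1.0, size=(n_obs - n_outliers, n_features))
X_outliers = rng.normal(loc=6.0, scale=1.0, size=(n_outliers, n_features))
X_train = np.vstack([X_inliers, X_outliers])

# max_samples=40 limits the subsample used to build each iTree;
# contamination=0.05 sets the score threshold so ~5% of points are flagged.
iforest = IForest(max_samples=40, contamination=contamination, random_state=42)
iforest.fit(X_train)

scores = iforest.decision_function(X_train)   # outlier scores (higher = more anomalous)
labels = iforest.predict(X_train)             # 0 = normal, 1 = outlier
print("Flagged outliers:", labels.sum())
```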
Contamination Rate
The contamination rate matters in practice because the true percentage of outliers is usually unknown, and it is how a reasonable classification threshold gets set. In this example it was set to 5% to match the simulated training sample, and it determines the cutoff exposed by the fitted model's threshold_ attribute.
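Continuing the sketch above (and still assuming PyOD's IForest), the threshold implied by the 5% contamination rate can be inspected directly:

```python
# threshold_ is the score cutoff implied by contamination=0.05: training
# points whose score exceeds it are labeled as outliers.
print("Score threshold:", iforest.threshold_)
print("Share flagged as outliers:",
      (iforest.decision_scores_ > iforest.threshold_).mean())
```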
Hyperparameters
Key hyperparameters of the Isolation Forest model include "max_samples" for the number of samples per base estimator, "n_estimators" for the number of trees in the ensemble, "max_features" for the number of features per base estimator, and "n_jobs" for the number of parallel jobs to run.
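As an illustration only, these hyperparameters map directly onto constructor arguments (shown for PyOD's IForest; scikit-learn's IsolationForest uses the same parameter names). The specific values below are either from this example or assumed defaults.

```python
iforest = IForest(
    n_estimators=100,    # number of iTrees in the ensemble
    max_samples=40,      # observations drawn to build each iTree
    max_features=1.0,    # fraction (or count) of features per iTree
    n_jobs=-1,           # number of parallel jobs (-1 = all cores)
    contamination=0.05,  # expected outlier share, as in the example above
    random_state=42,
)
```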
Feature Importance
The Isolation Forest can also rank features by how much they contribute to isolating anomalies; this relative importance is quantified with the Gini impurity index, and a feature importance plot shows which features drive the anomaly decisions.
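A hedged sketch of such a plot, reusing iforest and n_features from the example above: it averages the impurity-based importances of the individual iTrees via the underlying scikit-learn estimator (stored as detector_ in PyOD); recent PyOD versions also expose this directly as a feature_importances_ property.

```python
import numpy as np
import matplotlib.pyplot as plt

# Average the impurity-based importances of the individual iTrees.
tree_importances = [tree.feature_importances_
                    for tree in iforest.detector_.estimators_]
importances = np.mean(tree_importances, axis=0)

plt.bar(range(n_features), importances)
plt.xlabel("Feature index")
plt.ylabel("Mean impurity-based importance")
plt.title("Isolation Forest feature importance (sketch)")
plt.show()
```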
Threshold Determination and Descriptive Statistics
A reasonable threshold should be chosen from the histogram of the outlier scores. Descriptive statistics, computed separately for the normal and anomaly groups that this threshold defines, then serve as a sanity check on the model: the two groups should differ in the expected directions.
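A sketch of both steps, continuing the example above (reusing iforest, X_train, and n_features); the threshold used here is illustrative, not the article's chosen value.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Histogram of training outlier scores: a natural threshold sits where the
# bulk of the scores ends and the long right tail begins.
plt.hist(iforest.decision_scores_, bins=50)
plt.xlabel("Outlier score")
plt.ylabel("Count")
plt.show()

# Pick a threshold (here the model's own cutoff, or a value read off the
# histogram), then compare descriptive statistics of the two groups.
threshold = iforest.threshold_
df = pd.DataFrame(X_train, columns=[f"x{i}" for i in range(n_features)])
df["score"] = iforest.decision_scores_
df["group"] = np.where(df["score"] > threshold, "anomaly", "normal")
print(df.groupby("group").mean())   # anomalies should show clearly shifted means
```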