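Today, while doing exploratory data analysis (EDA), I was working with a dataset in which quite a few fields contain outliers. When exploring such fields with a histogram, handling the outliers mostly comes down to adjusting the histogram parameters, filtering out the extreme values, and so on, until a reasonable-looking histogram comes out. While tuning, one question kept coming up: how should a histogram's bins and binwidth actually be chosen?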

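In fact, there is no single best number of bins for a dataset. You can only keep adjusting according to the characteristics of the data until you reach a relatively good choice, one where the histogram clearly reflects how the data is distributed.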

Number of bins and width

There is no “best” number of bins, and different bin sizes can reveal different features of the data. Grouping data is at least as old as Graunt’s work in the 17th century, but no systematic guidelines were given until Sturges’s work in 1926.

Using wider bins where the density is low reduces noise due to sampling randomness; using narrower bins where the density is high (so the signal drowns the noise) gives greater precision to the density estimation. Thus varying the bin-width within a histogram can be beneficial. Nonetheless, equal-width bins are widely used.

Some theoreticians have attempted to determine an optimal number of bins, but these methods generally make strong assumptions about the shape of the distribution. Depending on the actual data distribution and the goals of the analysis, different bin widths may be appropriate, so experimentation is usually needed to determine an appropriate width. There are, however, various useful guidelines and rules of thumb.


The passage above comes from Wikipedia: https://en.wikipedia.org/wiki/Histogram
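The quoted passage mentions that bin widths can vary within one histogram. A minimal sketch of that idea in base R, assuming simulated log-normal data and decile-based break points (both are my own illustrative choices, not something the Wikipedia article prescribes):

```r
# Variable-width bins: use sample deciles as break points, so bins are
# narrow where the data is dense and wide where it is sparse.
set.seed(3)                      # assumption: arbitrary seed for reproducibility
x <- rlnorm(1000)                # hypothetical right-skewed sample

breaks <- unname(quantile(x, probs = seq(0, 1, by = 0.1)))
h_obj  <- hist(x, breaks = breaks, plot = FALSE)

print(round(diff(h_obj$breaks), 3))  # bin widths grow towards the sparse tail
print(h_obj$counts)                  # roughly equal counts per bin
```

With unequal breaks, `hist()` plots densities rather than raw counts by default, so bar areas stay comparable across bins.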

There is also a rather good answer on Stack Overflow, which in fact uses the Freedman-Diaconis rule:

The Freedman-Diaconis rule is very robust and works well in practice. The bin-width is set to

$h = 2 \cdot \mathrm{IQR}(x) \cdot n^{-1/3}$

So the number of bins is (max-min)/h.
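To make the arithmetic concrete, here is a small worked sketch in base R. The data is simulated and the three outliers are hypothetical values I added for illustration; rounding the bin count up with `ceiling()` is my choice, since the quoted answer does not say how to round.

```r
# Worked example of the Freedman-Diaconis rule on simulated data.
set.seed(42)                      # assumption: arbitrary seed for reproducibility
x <- c(rnorm(1000), 25, 30, 40)   # hypothetical sample with three outliers

n <- length(x)
h <- 2 * IQR(x) / n^(1/3)         # bin width: 2 * IQR * n^(-1/3)
k <- ceiling(diff(range(x)) / h)  # number of bins: (max - min) / h, rounded up

cat("bin width:", h, "\n")
cat("number of bins:", k, "\n")
```

Note that the width h barely reacts to the outliers, but the bin count k still depends on the full range, so extreme values stretch the axis and increase k.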

In base R, you can use

hist(x,breaks="FD")

For other plotting libraries without this option (e.g. ggplot2), you can calculate binwidth as:

bw <- 2 * IQR(x) / length(x)^(1/3)
# for example
ggplot() + geom_histogram(aes(x), binwidth = bw)

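The Freedman-Diaconis rule is based on the interquartile range (IQR). It replaces the 3.5σ of Scott's rule with 2 IQR, which is less sensitive than the standard deviation to outliers in the data. Overall, this method works well for histograms and keeps outliers from distorting the chosen bin width.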
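The robustness claim can be checked with a quick sketch: compute both bin widths on a clean sample and again after appending a few extreme values. The data, the outlier values, and the helper names `fd_width` and `scott_width` are all hypothetical and only for illustration.

```r
# Compare how Freedman-Diaconis and Scott bin widths react to outliers.
set.seed(1)                                  # assumption: arbitrary seed
clean <- rnorm(500)                          # hypothetical well-behaved sample
dirty <- c(clean, 50, 80, 120)               # same sample plus three outliers

fd_width    <- function(x) 2 * IQR(x) / length(x)^(1/3)    # based on the IQR
scott_width <- function(x) 3.5 * sd(x) / length(x)^(1/3)   # based on the standard deviation

widths <- rbind(
  clean = c(FD = fd_width(clean), Scott = scott_width(clean)),
  dirty = c(FD = fd_width(dirty), Scott = scott_width(dirty))
)
print(widths)

# Ratio of "with outliers" to "without": close to 1 means the rule is stable.
ratio <- widths["dirty", ] / widths["clean", ]
print(ratio)
```

The IQR ignores the tails, so the FD width should stay almost unchanged, while the standard deviation (and therefore Scott's width) is inflated by the outliers.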

Summary

Freedman-Diaconis rule

The Freedman–Diaconis rule is: $h = 2 \cdot \mathrm{IQR}(x) \cdot n^{-1/3}$

The number of bins k can be assigned directly or can be calculated from a suggested bin width h as: $k = \lceil (\max x - \min x) / h \rceil$

Base R has a convenient built-in option for this; with ggplot2 it takes slightly more work.

# hist(): base R has the rule built in
hist(x, breaks = "FD")
# ggplot2: pass the number of bins
library(ggplot2)
bins <- ceiling(diff(range(x)) / (2 * IQR(x) / length(x)^(1/3)))
ggplot() + geom_histogram(aes(x), bins = bins)
# ggplot2: pass the bin width directly
bw <- 2 * IQR(x) / length(x)^(1/3)
ggplot() + geom_histogram(aes(x), binwidth = bw)
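As a sanity check, the hand-computed bin count can be compared with `nclass.FD()` from grDevices, which is the function `hist()` calls for `breaks = "FD"`. This is only a sketch on simulated data; the helper name `fd_bins` is hypothetical.

```r
# Compare a manual Freedman-Diaconis bin count with R's built-in version.
set.seed(7)                            # assumption: arbitrary seed
x <- rexp(2000)                        # hypothetical right-skewed sample

fd_bins <- function(x) {
  h <- 2 * IQR(x) / length(x)^(1/3)    # Freedman-Diaconis bin width
  ceiling(diff(range(x)) / h)          # number of bins, rounded up
}

k_manual  <- fd_bins(x)
k_builtin <- grDevices::nclass.FD(x)   # built-in bin count for the same rule

# hist() treats the count as a suggestion and rounds to "pretty" break points.
h_obj  <- hist(x, breaks = "FD", plot = FALSE)
k_hist <- length(h_obj$counts)

cat("manual:", k_manual, " nclass.FD:", k_builtin, " hist():", k_hist, "\n")
```

The manual and built-in counts should agree, or differ by at most one because of internal rounding. The number of bars `hist()` actually draws can differ more, since it snaps the breaks to round values.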