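Mathematical Foundations of Machine Learning and Deep Learning: Why Probability and Statistics Matter in AI
This is the second article in the probability and statistics basics series, and it builds on the first. A quick word on how the two fit together: the first article covered most of probability theory, apart from theoretical results such as the law of large numbers and the central limit theorem, which I will add later when I get the chance. Today's article takes one step beyond probability theory into mathematical statistics.
In mathematical statistics, we study random variables whose distribution is unknown or only partly known. What we do is draw a number of samples from the unknown distribution and analyze that data statistically in order to learn about the distribution of the random variable.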
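Fundamentals of Mathematical Statistics
Mathematical statistics draws multiple samples from an unknown distribution and analyzes them statistically to uncover the regularities and characteristics of a random variable. Before going further, a few basic concepts are needed.
Basic concepts
The basic concepts here are population, individual, population size, sample, and simple random sample. If you already know them, feel free to skip ahead.
In mathematical statistics, the population is the entire set of objects under study, and it is usually represented by a random variable $X$. Each basic unit that makes up the population is called an individual, and the total number of individuals in the population is the population size.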
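What we study is the statistical behavior of this population with unknown distribution, so we have to select some individuals at random, record them, and use probability theory to analyze and infer. Drawing $n$ individuals at random from the population $X$ gives what is called a sample of size $n$ from $X$. For example, to study the lifetime of a batch of light bulbs, we might draw 10 bulbs and measure each one; those 10 measurements form a sample of size 10.
A random sample $X_1, X_2, \dots, X_n$ is called a simple random sample of size $n$ when it satisfies two conditions: first, every individual has the same probability of being drawn; second, each draw is independent and unaffected by the others.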
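A sample has a dual nature. After one specific round of sampling, it is a set of fixed numbers; in general discussion, it is a set of random variables, because sampling is random.
We usually write $X_1, X_2, \dots, X_n$ for the random sample and $x_1, x_2, \dots, x_n$ for the values it takes, which are called the sample observations. In general, two rounds of observation give different sample values.
Because a sample is a set of random variables, it has a probability distribution, called the sample distribution. Clearly, the sample distribution depends on the properties of both the population and the sample.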
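As a rough illustration of simple random sampling and of the dual nature of a sample, the sketch below draws two samples of size 10 from a made-up population. The population (simulated bulb lifetimes), the seed, and the helper name `draw_sample` are assumptions for illustration only. Drawing with replacement is used so that the draws are independent.

```python
import random

rng = random.Random(42)  # fixed seed so the sketch is reproducible

# Hypothetical finite population: lifetimes (in hours) of 10,000 bulbs.
population = [rng.expovariate(1 / 1000) for _ in range(10_000)]


def draw_sample(pop, n, rng):
    """Draw n individuals independently, each equally likely (with replacement)."""
    return rng.choices(pop, k=n)


# Before drawing, a sample is a set of random variables;
# after drawing, it is a fixed list of observed numbers.
sample_1 = draw_sample(population, 10, rng)
sample_2 = draw_sample(population, 10, rng)

print(sample_1)
print(sample_2)
print("same observations twice?", sample_1 == sample_2)
```

Each call gives one set of observed values, and a second call will in general give different ones, which is the sense in which the sample itself is random.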
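Statistics and sampling distributions
The task of mathematical statistics is to collect and process data subject to random influences, that is, to gather samples, process them, and from that draw conclusions about the problem under study. This process is called statistical inference. Extracting useful information from a sample in order to study the population's distribution and its numerical characteristics is the process of constructing statistics. A statistic is therefore a function of the sample (one that involves no unknown parameters).
For example, the average lifetime of 10 light bulbs is a statistic.
Common statistics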
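1. Sample mean
Let $X_1, X_2, \dots, X_n$ be a simple random sample from the population $X$. Then
$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$$
is called the sample mean. It is typically used to estimate the mean of the population distribution and to test hypotheses about that mean. In numpy, the function for the mean is np.mean().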
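2. Sample variance
Let $X_1, X_2, \dots, X_n$ be a simple random sample from the population $X$, and let $\bar{X}$ be the sample mean. Then
$$S^2 = \frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X})^2$$
is called the sample variance. It is typically used to estimate the variance of the population distribution and to test hypotheses about the population mean or variance. In numpy, the corresponding function is np.var(). Note that np.var() divides by $n$ by default, so pass ddof=1 to get the $n-1$ version defined here.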
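The difference between the two divisors is easy to miss, so here is a minimal sketch, assuming numpy is installed. The data values are made up for illustration. It compares np.var() with its default `ddof=0` against `ddof=1`, and checks the latter against the $n-1$ definition of the sample variance computed by hand.

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])  # illustrative data

mean = np.mean(x)                # sample mean
var_biased = np.var(x)           # divides by n (ddof=0 is the numpy default)
var_sample = np.var(x, ddof=1)   # divides by n - 1 (the sample variance)

# The same quantity written out from the n - 1 definition
manual = np.sum((x - mean) ** 2) / (len(x) - 1)

print(mean, var_biased, var_sample, manual)
```

For small samples the two variance values differ noticeably, so it is worth being explicit about `ddof` whenever the result is meant as an estimate of the population variance.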
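3. k-th sample raw moment
Let $X_1, X_2, \dots, X_n$ be a simple random sample from the population $X$. Then
$$A_k = \frac{1}{n}\sum_{i=1}^{n} X_i^k$$
is called the k-th raw moment (moment about the origin) of the sample; for $k = 1$ it is exactly the sample mean. The k-th sample raw moment is typically used to estimate the k-th raw moment of the population distribution.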
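4. k-th sample central moment
Let $X_1, X_2, \dots, X_n$ be a simple random sample from the population $X$, and let $\bar{X}$ be the sample mean. Then
$$B_k = \frac{1}{n}\sum_{i=1}^{n} (X_i - \bar{X})^k$$
is called the k-th central moment of the sample. It is typically used to estimate the k-th central moment of the population distribution.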
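Both kinds of sample moments are one-liners in numpy. The sketch below is an illustration only: the helper names `raw_moment` and `central_moment` and the data are hypothetical, not part of any library.

```python
import numpy as np


def raw_moment(x, k):
    """k-th sample raw moment: the mean of x_i ** k."""
    x = np.asarray(x, dtype=float)
    return np.mean(x ** k)


def central_moment(x, k):
    """k-th sample central moment: the mean of (x_i - sample mean) ** k."""
    x = np.asarray(x, dtype=float)
    return np.mean((x - x.mean()) ** k)


x = [1.0, 2.0, 3.0, 4.0]  # illustrative data

print(raw_moment(x, 1))      # first raw moment, equal to the sample mean
print(raw_moment(x, 2))      # second raw moment
print(central_moment(x, 2))  # second central moment, equal to np.var(x) with ddof=0
```

Note that the second central moment divides by $n$, so it matches np.var() with its default settings rather than the $n-1$ sample variance.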
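5. Order statistics
Sorting the sample in ascending order as $X_{(1)} \le X_{(2)} \le \dots \le X_{(n)}$ gives the order statistics; $X_{(1)}$ is the sample minimum and $X_{(n)}$ is the sample maximum. In numpy, these two are np.min() and np.max().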
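A short sketch of order statistics in numpy, with made-up observations: sorting gives every order statistic at once, and the first and last entries agree with np.min() and np.max().

```python
import numpy as np

x = np.array([7, 3, 9, 1, 5])    # illustrative observations

order_stats = np.sort(x)         # x_(1) <= x_(2) <= ... <= x_(n)
smallest = order_stats[0]        # x_(1), the same value as np.min(x)
largest = order_stats[-1]        # x_(n), the same value as np.max(x)
sample_range = largest - smallest

print(order_stats, smallest, largest, sample_range)
```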
Three important sampling distributions
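Statistical inference often requires knowing the distribution of a statistic. The distribution of a statistic is called a sampling distribution. Three of them are especially important, because parameter estimation and hypothesis testing either use them directly or depend on them. They are the $\chi^2$ distribution, the $t$ distribution, and the $F$ distribution.
1. $\chi^2$ distribution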
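Let $X_1, X_2, \dots, X_n$ be a sample from the population $N(0, 1)$. Then the statistic
$$\chi^2 = X_1^2 + X_2^2 + \dots + X_n^2$$
follows the $\chi^2$ distribution with $n$ degrees of freedom, written $\chi^2 \sim \chi^2(n)$. The degrees of freedom here is the number of independent variables in the sum. Its probability density function has the form:
$$f(x) = \begin{cases} \dfrac{1}{2^{n/2}\,\Gamma(n/2)}\, x^{n/2 - 1} e^{-x/2}, & x > 0 \\ 0, & x \le 0 \end{cases}$$
where $\Gamma(\alpha) = \int_0^{\infty} t^{\alpha - 1} e^{-t}\, dt$ is the gamma function.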
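The construction of a $\chi^2$ variable as a sum of squared standard normals can be checked numerically. The following is a minimal sketch, assuming numpy is available; the seed, the choice of 5 degrees of freedom, and the number of trials are arbitrary. It compares the simulated mean and variance with the known values $n$ and $2n$.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed for reproducibility
n = 5                            # degrees of freedom
trials = 200_000                 # number of simulated chi-square values

# Each row holds n independent N(0, 1) draws; summing the squares
# across a row gives one chi-square(n) value.
z = rng.standard_normal(size=(trials, n))
chi2_samples = np.sum(z ** 2, axis=1)

print("simulated mean:", chi2_samples.mean(), "theory:", n)
print("simulated variance:", chi2_samples.var(), "theory:", 2 * n)
```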
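2. $t$ distribution
Let $X \sim N(0, 1)$ and $Y \sim \chi^2(n)$, with $X$ and $Y$ independent. Then the random variable
$$T = \frac{X}{\sqrt{Y/n}}$$
follows the $t$ distribution with $n$ degrees of freedom, written $T \sim t(n)$. Its probability density function is:
$$h(t) = \frac{\Gamma\left(\frac{n+1}{2}\right)}{\sqrt{\pi n}\,\Gamma\left(\frac{n}{2}\right)} \left(1 + \frac{t^2}{n}\right)^{-\frac{n+1}{2}}, \quad -\infty < t < \infty$$
[Figure: plot of the probability density function of the $t$ distribution]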
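The $t$ distribution can be simulated in the same spirit, as a standard normal divided by the square root of an independent $\chi^2$ variable over its degrees of freedom. This is a sketch under arbitrary assumptions (seed, 10 degrees of freedom, number of trials). It uses the known facts that $t(n)$ has mean 0 for $n > 1$ and variance $n/(n-2)$ for $n > 2$.

```python
import numpy as np

rng = np.random.default_rng(1)   # fixed seed for reproducibility
n = 10                           # degrees of freedom
trials = 200_000

x = rng.standard_normal(trials)          # X ~ N(0, 1)
y = rng.chisquare(df=n, size=trials)     # Y ~ chi-square(n), drawn independently of X
t_samples = x / np.sqrt(y / n)           # T = X / sqrt(Y / n)

print("simulated mean:", t_samples.mean(), "theory:", 0.0)
print("simulated variance:", t_samples.var(), "theory:", n / (n - 2))
```

The variance above 1 reflects the heavier tails of the $t$ distribution compared with the standard normal.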
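3. $F$ distribution
Let $U \sim \chi^2(n_1)$ and $V \sim \chi^2(n_2)$, with $U$ and $V$ independent. Then the random variable
$$F = \frac{U/n_1}{V/n_2}$$
follows the $F$ distribution with $(n_1, n_2)$ degrees of freedom, written $F \sim F(n_1, n_2)$.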
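An $F$ variable is the ratio of two independent $\chi^2$ variables, each divided by its own degrees of freedom. The sketch below simulates that ratio; the seed, the degrees of freedom (5 and 20), and the number of trials are arbitrary choices. It compares the simulated mean with the known value $n_2/(n_2 - 2)$, valid for $n_2 > 2$.

```python
import numpy as np

rng = np.random.default_rng(2)   # fixed seed for reproducibility
n1, n2 = 5, 20                   # numerator and denominator degrees of freedom
trials = 200_000

u = rng.chisquare(df=n1, size=trials)   # U ~ chi-square(n1)
v = rng.chisquare(df=n2, size=trials)   # V ~ chi-square(n2), independent of U
f_samples = (u / n1) / (v / n2)         # F = (U / n1) / (V / n2)

print("simulated mean:", f_samples.mean(), "theory:", n2 / (n2 - 2))
```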
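These distributions will come up in parameter estimation. The distributions themselves can be fairly complicated, especially their probability density functions, but tables are available for looking up values when needed.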
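Descriptive statistics
Measures of central tendency
1. Mean
The mean is a measure of the central tendency of a data set: add up all the values, then divide the total by the number of values.
2. Median
The median is the value in the middle position of a data set sorted in order, so it describes the center of the data. For symmetric data, the mean and the median are close; for skewed data, they differ. The median is also robust: it is largely unaffected by outliers.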
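A small sketch of that robustness, using Python's standard `statistics` module. The numbers are made up: one extreme value is appended to an otherwise tightly clustered data set.

```python
import statistics

data = [10, 12, 11, 13, 12]     # illustrative, tightly clustered values
with_outlier = data + [1000]    # the same data plus one extreme value

mean_before = statistics.mean(data)
mean_after = statistics.mean(with_outlier)
median_before = statistics.median(data)
median_after = statistics.median(with_outlier)

print("mean:  ", mean_before, "->", mean_after)
print("median:", median_before, "->", median_after)
```

The single extreme value drags the mean far away from the bulk of the data, while the median stays where it was.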
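3. Frequency
The number of times a given observed value appears in a data set. For example, if a die is rolled 20 times, the number of rolls that show 5 is the frequency of 5.
4. Mode
The value (or values) that appears most often in a data set.
[Figure: mean vs. median vs. mode]
5. Percentiles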
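Percentiles generalize the median. After sorting the data in ascending order as $x_{(1)} \le x_{(2)} \le \dots \le x_{(n)}$, for $0 < p < 1$
the $p$ quantile is defined as
$$m_p = \begin{cases} x_{([np]+1)}, & np \text{ is not an integer} \\ \dfrac{1}{2}\left(x_{(np)} + x_{(np+1)}\right), & np \text{ is an integer} \end{cases}$$
where $[np]$ denotes the integer part of $np$. So the 0.5 quantile, that is, the 50th percentile, is the median. The 0.25 quantile is called the first quartile, written $Q_1$. The 0.75 quantile is called the third quartile, written $Q_3$. These three quantiles are very useful in statistics.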
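The textbook rule for the $p$ quantile (take $x_{([np]+1)}$ when $np$ is not an integer, otherwise average $x_{(np)}$ and $x_{(np+1)}$) translates directly into code. The function name `sample_quantile` is hypothetical, and this is only a sketch of that one rule; numpy and pandas use linear interpolation by default, so their quartiles can differ slightly from the ones computed here.

```python
import math


def sample_quantile(values, p):
    """p quantile: x_([np]+1) if np is not an integer,
    else the average of x_(np) and x_(np+1) (1-based order statistics)."""
    if not 0 < p < 1:
        raise ValueError("p must be strictly between 0 and 1")
    if len(values) == 0:
        raise ValueError("values must not be empty")
    xs = sorted(values)
    position = len(xs) * p           # this is np in the definition
    k = math.floor(position)         # [np], the integer part of np
    if position != k:                # np is not an integer
        return xs[k]                 # x_([np]+1); index k because lists are 0-based
    return (xs[k - 1] + xs[k]) / 2   # average of x_(np) and x_(np+1)


a = [1, 2, 4, 5, 3, 12, 12, 23, 43, 52, 11, 22, 22, 22, 1, 1]
q1 = sample_quantile(a, 0.25)
median = sample_quantile(a, 0.5)
q3 = sample_quantile(a, 0.75)
print(q1, median, q3)
```

The integer test on `n * p` relies on floating-point arithmetic, which is exact for the quartiles used here but may need a tolerance for other values of `p`.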
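The most familiar use of percentiles is the box plot.
A box plot reveals the following properties of the data: the center (the position of the median), the spread (the range from minimum to maximum and the width of the box between $Q_1$ and $Q_3$), and the symmetry (whether the median sits near the middle of the box).
Box plots are well suited to comparing two or more data sets. They also help detect outliers, that is, unusually large or small values. The distance between the first and third quartiles is written $IQR = Q_3 - Q_1$, the interquartile range. A value below $Q_1 - 1.5\,IQR$ or above $Q_3 + 1.5\,IQR$ is a suspected outlier.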
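The 1.5 × IQR rule can be sketched in a few lines of numpy. The helper name `iqr_outliers` and the data are illustrative assumptions. np.percentile() is used with its default linear interpolation, so the quartiles may differ slightly from other quantile conventions.

```python
import numpy as np


def iqr_outliers(values, scale=1.5):
    """Return the values outside [Q1 - scale*IQR, Q3 + scale*IQR] and the two bounds."""
    x = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - scale * iqr, q3 + scale * iqr
    mask = (x < lower) | (x > upper)   # True where a value is a suspected outlier
    return x[mask], (lower, upper)


a = [1, 2, 4, 5, 3, 12, 12, 23, 43, 52, 11, 22, 22, 22, 1, 1]
outliers, (lower, upper) = iqr_outliers(a)
print(outliers, lower, upper)
```

This version only flags suspected outliers; whether to drop them, clip them, or keep them is a separate decision that depends on the data.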
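With the concepts covered, here is how to compute them in code.
The code first computes the mean of a list, then the median, then the mode, and finally the frequencies. numpy has no direct function for the mode, so you can either call stats from the scipy package or implement it yourself.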
<p><pre class="code-snippet__js" data-lang="python"><code>import numpy as np
import pandas as pd
from scipy import stats


# A hand-written mode. It returns only one value; if several values tie,
# use the frequency-based mode_duo() below to get all of them.
def mode(lst):
    if not lst:
        return None
    return max(lst, key=lambda v: lst.count(v))


a = [1, 2, 4, 5, 3, 12, 12, 23, 43, 52, 11, 22, 22, 22]
a_mean = np.mean(a)    # mean
a_med = np.median(a)   # median
# scipy's mode also returns a single value; atleast_1d handles both the
# array result of older SciPy versions and the scalar result of newer ones
a_mode = np.atleast_1d(stats.mode(a).mode)[0]
a_mode1 = mode(a)
print("mean of a:", a_mean)
print("median of a:", a_med)
print("mode of a:", a_mode, a_mode1)

# Frequency of each value
b = {k: a.count(k) for k in set(a)}
print(b)  # 12 appears twice, 22 three times, every other value once


# Built on the frequency table: a mode function that can return several values
def mode_duo(d):
    if len(d) == 0:
        return None
    max_count = max(d.values())  # the count attained by the mode(s)
    return sorted(k for k, v in d.items() if v == max_count)


a = [1, 2, 4, 5, 3, 12, 12, 23, 43, 52, 11, 22, 22, 22, 1, 1]
b = {k: a.count(k) for k in set(a)}
print(mode_duo(b))  # [1, 22]

# Simplest route: convert to a pandas Series and call mode(),
# which returns every mode when there is a tie
print(pd.Series(a).mode())
</code></pre></p>
Next, the quantiles. Convert a to a pandas Series and call describe() to see them.
<p><pre class="code-snippet__js" data-lang="css"><code>pd.Series(a).describe()

## Result:
count    16.000000
mean     14.750000
std      15.316658
min       1.000000
25%       2.750000
50%      11.500000
75%      22.000000
max      52.000000
dtype: float64

## A box plot can also be drawn with plt
import matplotlib.pyplot as plt
plt.boxplot(pd.Series(a))
plt.show()
</code></pre></p>
Next, how to handle outliers based on the IQR. Outliers can be clipped to the boundary values, or they can be removed outright.
<p><pre class="code-snippet__js" data-lang="python"><code>"""A reusable helper for handling outliers."""
def outliers_proc(data, col_name, scale=1.5):
    """
    Clip outliers in one column using the box-plot rule
    (scale=1.5 is the usual choice).
    param:
        data: a pandas DataFrame
        col_name: name of the column to process
        scale: multiple of the IQR used to set the bounds
    """
    data_col = data[col_name]
    Q1 = data_col.quantile(0.25)  # 0.25 quantile
    Q3 = data_col.quantile(0.75)  # 0.75 quantile
    IQR = Q3 - Q1

    # Values below Q1 - scale * IQR are raised to that bound, and values
    # above Q3 + scale * IQR are lowered to that bound
    return data_col.clip(lower=Q1 - scale * IQR, upper=Q3 + scale * IQR)

# num_data is the author's DataFrame with a 'power' column (not defined here)
num_data['power'] = outliers_proc(num_data, 'power')
</code></pre></p>
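The code above truncates (winsorizes) outliers and takes a single column as input, because when there are many outliers, deleting them outright may not be a good idea. The code below, by contrast, deletes outliers directly. It takes a whole DataFrame, and a sample is deleted only when several of its columns are flagged as outliers.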
<p><pre class="code-snippet__js" data-lang="python"><code>from collections import Counter

# Detect outliers and return the index labels of the rows to drop
def detect_and_remove_outliers(df):
    """Check every column for outliers and record the rows that contain them.
    A row is returned for removal when it is flagged in more than two columns."""
    outliers = []
    col = list(df)
    # Check each column against its IQR fences
    for c in col:
        Q1 = df[c].quantile(0.25)  # 0.25 quantile
        Q3 = df[c].quantile(0.75)  # 0.75 quantile
        IQR = Q3 - Q1
        outliers.extend(df[(df[c] < Q1 - (1.5 * IQR)) | (df[c] > Q3 + (1.5 * IQR))].index)
    # Keep the index labels that were flagged in more than two columns
    return list(k for k, v in Counter(outliers).items() if v > 2)

remove_list = detect_and_remove_outliers(data)
data = data.drop(remove_list, axis=0)</code></pre></p>
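To see how the two strategies differ, here is a minimal sketch on a made-up DataFrame. The helper `iqr_bounds`, the toy data, and the threshold `min_flags = 2` are all assumptions chosen for illustration (the function above requires more than two flagged columns, which a two-column toy table could never reach).

```python
import pandas as pd

def iqr_bounds(col, scale=1.5):
    """Return the (lower, upper) box-plot fences for one column (hypothetical helper)."""
    q1, q3 = col.quantile(0.25), col.quantile(0.75)
    iqr = q3 - q1
    return q1 - scale * iqr, q3 + scale * iqr

# Made-up data: row 5 is extreme in both columns, row 4 only in column "b".
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0, 100.0],
    "b": [10.0, 11.0, 12.0, 13.0, -50.0, 90.0],
})

# Strategy 1: truncate one column to its fences; no rows are lost.
low, high = iqr_bounds(toy["a"])
clipped_a = toy["a"].clip(lower=low, upper=high)

# Strategy 2: count, for each row, how many columns fall outside their fences.
flags = pd.DataFrame(index=toy.index)
for c in toy.columns:
    lo_c, hi_c = iqr_bounds(toy[c])
    flags[c] = (toy[c] < lo_c) | (toy[c] > hi_c)
n_flagged = flags.sum(axis=1)

min_flags = 2  # assumed threshold for this two-column example
dropped = toy[n_flagged < min_flags]

print(clipped_a.tolist())
print(n_flagged.tolist())
print(dropped.index.tolist())
```

Truncation keeps every row and only pulls the extreme value back to the fence, while the row-wise rule removes only the sample that looks abnormal in several columns at once.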
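Measures of Dispersion
The quantities that describe how spread out the data are include the variance, the standard deviation, the range, and the coefficient of variation.
1. Variance
The variance measures how far each observation lies from the population mean. In practice the population mean is usually unavailable, so the sample statistic is used in place of the population parameter. After correcting for this substitution, the sample variance is computed as:
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2$$
The square root of the sample variance is called the sample standard deviation, $s = \sqrt{s^2}$.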
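The difference between dividing by $n$ and by $n-1$ is easy to overlook in numpy, so here is a small sketch with made-up numbers. `np.var` and `np.std` divide by $n$ unless `ddof=1` is passed; the array `x` is purely illustrative.

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])  # made-up observations
n = x.size
mean = x.mean()

# Sum of squared deviations from the sample mean.
ss = np.sum((x - mean) ** 2)

pop_var = ss / n            # what np.var(x) returns by default (ddof=0)
sample_var = ss / (n - 1)   # corrected sample variance, np.var(x, ddof=1)
sample_std = np.sqrt(sample_var)

print(pop_var, np.var(x))
print(sample_var, np.var(x, ddof=1))
print(sample_std, np.std(x, ddof=1))
```

For large samples the two versions are nearly identical; the correction matters mostly when $n$ is small.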
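2. Range
The range is the difference between the largest and the smallest observation, $R = x_{\max} - x_{\min}$. The more spread out the data, the larger the range.
3. Coefficient of Variation
The coefficient of variation measures the relative dispersion of the data, $CV = s / \bar{x}$. It is defined only when the mean is nonzero and is normally used when the mean is positive. It is also known as the relative standard deviation or unit risk. When comparing the dispersion of two data sets whose measurement scales differ greatly, or whose units differ, the coefficient of variation removes the influence of scale and units.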
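The claim that the coefficient of variation removes the effect of units can be checked with a short sketch. The function name `coefficient_of_variation` and the height and weight values are assumptions made up for this example, and the sample standard deviation (`ddof=1`) is one possible choice.

```python
import numpy as np

def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean (hypothetical helper)."""
    values = np.asarray(values, dtype=float)
    mean = values.mean()
    if mean == 0:
        raise ValueError("coefficient of variation is undefined for a zero mean")
    return values.std(ddof=1) / mean

heights_cm = np.array([160.0, 165.0, 170.0, 175.0, 180.0])  # made-up heights
heights_m = heights_cm / 100.0                              # same data in meters
weights_kg = np.array([50.0, 60.0, 70.0, 80.0, 90.0])       # made-up weights

cv_cm = coefficient_of_variation(heights_cm)
cv_m = coefficient_of_variation(heights_m)
cv_kg = coefficient_of_variation(weights_kg)

print(cv_cm, cv_m)  # identical up to rounding: the unit does not matter
print(cv_kg)        # comparable with the height CV despite different units
```

The standard deviations of the two height arrays differ by a factor of 100, but their coefficients of variation agree, which is what makes the height and weight spreads comparable.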
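4. Interquartile Range
As covered earlier, the difference between the upper and lower sample quartiles is called the interquartile range (IQR), also known as the quartile difference: $IQR = Q_3 - Q_1$.
It is an important numerical measure of sample dispersion, and as a measure of spread it is robust when the data contain outliers.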
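The robustness of the interquartile range can be illustrated by corrupting a single value and comparing it with the range. This is only a sketch: the helper `iqr` and both arrays are made up, and `np.percentile` is used with its default linear interpolation.

```python
import numpy as np

def iqr(values):
    """Upper quartile minus lower quartile (hypothetical helper)."""
    q1, q3 = np.percentile(values, [25, 75])
    return q3 - q1

clean = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0])
dirty = clean.copy()
dirty[-1] = 900.0  # replace the largest value with an extreme outlier

range_clean = clean.max() - clean.min()
range_dirty = dirty.max() - dirty.min()

print(range_clean, range_dirty)  # the range explodes
print(iqr(clean), iqr(dirty))    # the IQR is unchanged
```

Because the quartiles depend only on the middle of the ordered sample, a single extreme value in the tail leaves the IQR untouched while the range changes by two orders of magnitude.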
Below is a numpy implementation of the variance, standard deviation, and coefficient of variation.
<p><pre class="code-snippet__js" data-lang="makefile"><code>import numpy as np

a = [1, 2, 4, 5, 3, 12, 12, 23, 43, 52, 11, 22, 22, 22]
# Note: np.var and np.std divide by n by default; pass ddof=1 for the n-1 version.
a_var = np.var(a)        # variance
a_std1 = np.sqrt(a_var)  # standard deviation, from the variance
a_std2 = np.std(a)       # standard deviation, computed directly
a_mean = np.mean(a)      # mean
a_cv = a_std2 / a_mean   # coefficient of variation
print("Variance of a:", a_var)
print("Standard deviation of a (sqrt of variance):", a_std1)
print("Standard deviation of a (np.std):", a_std2)
print("Coefficient of variation of a:", a_cv)</code></pre></p>
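5. Skewness and Kurtosis
Skewness measures the direction and degree of asymmetry of a data distribution. Intuitively, it reflects the relative length of the tails of the density curve. If the data are symmetric about the mean, the skewness coefficient is 0; if the right side is more spread out, it is positive; if the left side is more spread out, it is negative. One common form of the sample skewness coefficient, with $s$ the sample standard deviation, is:
$$g_1 = \frac{n}{(n-1)(n-2)\,s^3}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^3$$
Kurtosis describes how concentrated or spread out a distribution is.
One common form of the sample kurtosis coefficient is:
$$g_2 = \frac{n(n+1)}{(n-1)(n-2)(n-3)\,s^4}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^4 - \frac{3(n-1)^2}{(n-2)(n-3)}$$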
Here is an implementation:
<p><pre class="code-snippet__js" data-lang="http"><code>import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

data = list(np.random.randn(10000))  # 10000 samples from the standard normal distribution

# Histogram with 1000 bins, green fill, 50% transparency
plt.hist(data, 1000, facecolor='g', alpha=0.5)
plt.show()

s = pd.Series(data)  # convert the array to a Series
print('Skewness coefficient', s.skew())  # 0.0024936359680932723 in the author's run
print('Kurtosis coefficient', s.kurt())  # -0.05970174780792892 in the author's run</code></pre></p>
The result is as follows: [figure: histogram of the 10,000 sampled values]
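To make the two coefficients less of a black box, the sketch below computes them by hand and compares them with `Series.skew()` and `Series.kurt()`. It assumes the bias-adjusted forms $g_1 = \frac{n}{(n-1)(n-2)s^3}\sum (x_i-\bar{x})^3$ and $g_2 = \frac{n(n+1)}{(n-1)(n-2)(n-3)s^4}\sum (x_i-\bar{x})^4 - \frac{3(n-1)^2}{(n-2)(n-3)}$, which appear to be what pandas returns. The function names, the seed, and the sample sizes are illustrative choices.

```python
import numpy as np
import pandas as pd

def sample_skewness(x):
    """Bias-adjusted sample skewness (assumed form, see lead-in)."""
    x = np.asarray(x, dtype=float)
    n = x.size
    d = x - x.mean()
    s = x.std(ddof=1)
    return n / ((n - 1) * (n - 2)) * np.sum(d ** 3) / s ** 3

def sample_kurtosis(x):
    """Bias-adjusted sample excess kurtosis (assumed form, see lead-in)."""
    x = np.asarray(x, dtype=float)
    n = x.size
    d = x - x.mean()
    s = x.std(ddof=1)
    lead = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * np.sum(d ** 4) / s ** 4
    return lead - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))

rng = np.random.default_rng(0)
right_skewed = rng.exponential(size=5000)   # long right tail
symmetric = rng.standard_normal(size=5000)  # symmetric about 0

for name, sample in [("exponential", right_skewed), ("normal", symmetric)]:
    series = pd.Series(sample)
    print(name, sample_skewness(sample), series.skew())
    print(name, sample_kurtosis(sample), series.kurt())
```

The exponential sample gives a clearly positive skewness, matching the rule that a more spread-out right side yields a positive coefficient, while the normal sample stays close to zero on both measures.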
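Closing Remarks
This article first went over the basic concepts of mathematical statistics and the commonly used statistics, and then introduced three very important sampling distributions: the chi-square, t, and F distributions. Next came descriptive statistics, starting with measures of central tendency (mean, median, mode, frequency, percentiles, and so on) together with their numpy implementations, followed by measures of dispersion (variance, standard deviation, range, and quartiles). It ended with an introduction to skewness and kurtosis.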