Week 11: Discriminant Analysis (Distance Discriminant Method, Bayes Classifier, Fisher Discriminant Method)

1 Dong Dajun (董大均) textbook, p. 358, Exercise 1


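data test1;
   /* the path is reproduced as in the original; point it at your copy of the data file */
   infile 'E:sasclass11iris1.txt';
   input x1-x4 group @@;
run;

/* DISTANCE prints the squared distances between groups; MANOVA prints the multivariate tests */
proc discrim data=test1 distance manova;
   class group;
   var x1 x2 x3 x4;
run;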

 

 

The output produced by SAS, including its statistical tests, is as follows:

(1)
                                                     Squared Distance to group
                                     From group             1             2             3
                                              1             0      89.86419     179.38471
                                              2      89.86419             0      17.20107
                                              3     179.38471      17.20107             0
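The table gives the squared distances D² between the group means: 89.86419 between groups 1 and 2, 179.38471 between groups 1 and 3, and 17.20107 between groups 2 and 3.
All three are fairly large, which is what we want, although groups 2 and 3 are much closer to each other than either is to group 1.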

(2)
                                          Multivariate Statistics and F Approximations
                                                      S=2    M=0.5    N=71
                         Statistic                        Value    F Value    Num DF    Den DF    Pr > F
                         Wilks' Lambda               0.02343863     199.15         8       288    <.0001
                         Pillai's Trace              1.19189883      53.47         8       290    <.0001
                         Hotelling-Lawley Trace     32.47732024     582.20         8     203.4    <.0001
                         Roy's Greatest Root        32.19192920    1166.96         4       145    <.0001
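All four statistics have p-values below 0.0001, so the mean vectors of the three groups differ significantly on x1-x4 and discrimination based on these four variables is meaningful.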
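As a rough cross-check of the Wilks' Lambda row, the F value can be recomputed by hand. The sketch below assumes the standard result that Rao's F transformation of Wilks' Lambda is exact when there are three groups, and takes N = 150 observations and p = 4 variables from the output; the function name is illustrative only.

```python
import math

def wilks_f_three_groups(wilks_lambda, n_obs, n_vars):
    """F statistic for Wilks' Lambda when there are exactly three groups.

    With g = 3 the hypothesis has 2 degrees of freedom, and
    F = (1 - sqrt(L)) / sqrt(L) * (N - p - 2) / p
    on (2p, 2(N - p - 2)) degrees of freedom.
    """
    root = math.sqrt(wilks_lambda)
    num_df = 2 * n_vars
    den_df = 2 * (n_obs - n_vars - 2)
    f_value = (1.0 - root) / root * (n_obs - n_vars - 2) / n_vars
    return f_value, num_df, den_df

# Wilks' Lambda copied from the MANOVA table above.
f_value, num_df, den_df = wilks_f_three_groups(0.02343863, n_obs=150, n_vars=4)
print(round(f_value, 2), num_df, den_df)
```

The result should land very close to the F value of 199.15 on 8 and 288 degrees of freedom that SAS prints.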
(3)
                                              Linear Discriminant Function for group
                                        Variable             1             2             3
                                        Constant     -85.20986     -71.75400    -103.26971
                                        x1             2.35442       1.56982       1.24458
                                        x2             2.35879       0.70725       0.36853
                                        x3            -1.64306       0.52115       1.27665
                                        x4            -1.73984       0.64342       2.10791
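From the constants and coefficient vectors in the output, the discriminant functions are:
z1 = -85.20986 + 2.35442*x1 + 2.35879*x2 - 1.64306*x3 - 1.73984*x4
z2 = -71.75400 + 1.56982*x1 + 0.70725*x2 + 0.52115*x3 + 0.64342*x4
z3 = -103.26971 + 1.24458*x1 + 0.36853*x2 + 1.27665*x3 + 2.10791*x4
An observation is assigned to the group whose function gives the largest value, so these functions can be used to classify new data for prediction.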
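To make the classification rule concrete, here is a minimal Python sketch that evaluates the three linear discriminant functions and picks the largest. The coefficients are copied from the "Linear Discriminant Function for group" table; the sample observation and the helper names are hypothetical. The `posteriors` helper assumes equal priors and a pooled covariance matrix, in which case the posterior probabilities reduce to a softmax of the linear scores.

```python
import math

# Constants and coefficients copied from the "Linear Discriminant Function" table.
LDF = {
    1: (-85.20986, (2.35442, 2.35879, -1.64306, -1.73984)),
    2: (-71.75400, (1.56982, 0.70725, 0.52115, 0.64342)),
    3: (-103.26971, (1.24458, 0.36853, 1.27665, 2.10791)),
}

def scores(x):
    """Return the linear discriminant score z_j of observation x for every group j."""
    return {g: c + sum(b * v for b, v in zip(coefs, x)) for g, (c, coefs) in LDF.items()}

def classify(x):
    """Assign x to the group with the largest score."""
    z = scores(x)
    return max(z, key=z.get)

def posteriors(x):
    """Posterior probabilities under equal priors: a softmax of the scores."""
    z = scores(x)
    top = max(z.values())  # subtract the maximum for numerical stability
    weights = {g: math.exp(v - top) for g, v in z.items()}
    total = sum(weights.values())
    return {g: w / total for g, w in weights.items()}

# Hypothetical observation (x1, x2, x3, x4); not taken from the exercise text.
sample = (50, 33, 14, 2)
print(classify(sample), posteriors(sample))
```

For this hypothetical observation the score of group 1 is far larger than the other two, so it is assigned to group 1 with a posterior probability close to 1.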
(4) Classification results for the samples
$$\Pr(j \mid X) = \frac{\exp\left(-0.5\, D_j^2(X)\right)}{\sum_k \exp\left(-0.5\, D_k^2(X)\right)}$$
                                     Number of Observations and Percent Classified into group
                                From group            1            2            3        Total
                                         1           50            0            0           50
                                                 100.00         0.00         0.00       100.00
                                         2            0           48            2           50
                                                   0.00        96.00         4.00       100.00
                                         3            0            1           49           50
                                                   0.00         2.00        98.00       100.00
                                     Total           50           49           51          150
                                                  33.33        32.67        34.00       100.00
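Group 1 (Setosa): all 50 samples are classified correctly, an accuracy of 100%.
Group 2 (Versicolor): 48 of the 50 samples are classified correctly and 2 are assigned to group 3 (Virginica), an accuracy of 96%.
Group 3 (Virginica): 49 of the 50 samples are classified correctly and 1 is assigned to group 2, an accuracy of 98%.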
                Error Count Estimates for group
                                                          1           2           3       Total
                                   Rate              0.0000      0.0400      0.0200      0.0200
                                   Priors            0.3333      0.3333      0.3333
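The misclassification rate is 0% for group 1, 4% for group 2 and 2% for group 3, giving an overall error rate of 2%.
The discriminant functions therefore classify these samples very accurately. Because the CROSSVALIDATE option was not used, PROC DISCRIM computes these rates by resubstitution on the same data used to fit the functions, so they are likely to be somewhat optimistic.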
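The error rates can be recomputed directly from the classification table. The following sketch copies the counts from the table above and uses the equal priors of 1/3 shown in the Priors row; the variable and function names are illustrative.

```python
# Rows: true group; columns: group assigned by the discriminant functions.
confusion = [
    [50, 0, 0],
    [0, 48, 2],
    [0, 1, 49],
]
priors = [1 / 3, 1 / 3, 1 / 3]

def error_rates(matrix):
    """Per-group misclassification rate: 1 - (correct in row) / (row total)."""
    return [1 - row[i] / sum(row) for i, row in enumerate(matrix)]

rates = error_rates(confusion)
# The total is the prior-weighted average of the per-group rates.
total = sum(p * r for p, r in zip(priors, rates))
print([round(r, 4) for r in rates], round(total, 4))
```

This reproduces the Rate row of the "Error Count Estimates" table: the total is the prior-weighted mean of the per-group rates, not the raw fraction of misclassified samples (the two coincide here only because the groups are the same size).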

 

 

 

2 Xue Yi (薛毅) textbook, Example 8.1, to be solved with SAS


data test2;
input x1 x2 x3 x4 x5 x6 x7;
/* the first 11 observations are liquefied (group 1); the remaining 23 are not (group 2) */
if _N_ < 12 then group = 1;
else group = 2;
cards;
6.6 39 1.0 6.0 6 0.12 20
6.6 39 1.0 6.0 12 0.12 20
6.1 47 1.0 6.0 6 0.08 12
6.1 47 1.0 6.0 12 0.08 12
8.4 32 2.0 7.5 19 0.35 75
7.2 6 1.0 7.0 28 0.30 30
8.4 113 3.5 6.0 18 0.15 75
7.5 52 1.0 6.0 12 0.16 40
7.5 52 3.5 7.5 6 0.16 40
8.3 113 0.0 7.5 35 0.12 180
7.8 172 1.5 3.0 15 0.21 45
8.4 32 1.0 5.0 4 0.35 75
8.4 32 2.0 9.0 10 0.35 75
8.4 32 2.5 4.0 10 0.35 75
6.3 11 4.5 7.5 3 0.20 15
7.0 8 4.5 4.5 9 0.25 30
7.0 8 6.0 7.5 4 0.25 30
7.0 8 1.5 6.0 1 0.25 30
8.3 161 1.5 4.0 4 0.08 70
8.3 161 0.5 2.5 1 0.08 70
7.2 6 3.5 4.0 12 0.30 30
7.2 6 1.0 3.0 3 0.30 30
7.2 6 1.0 6.0 5 0.30 30
5.5 6 2.5 3.0 7 0.18 18
8.4 113 3.5 4.5 6 0.15 75
8.4 113 3.5 4.5 8 0.15 75
7.5 52 1.0 6.0 6 0.16 40
7.5 52 1.0 7.5 8 0.16 40
8.3 97 0.0 6.0 5 0.15 180
8.3 97 2.5 6.0 5 0.15 180
8.3 89 0.0 6.0 10 0.16 180
8.3 56 1.5 6.0 13 0.25 180
7.8 172 1.0 3.5 6 0.21 45
7.8 283 1.0 4.5 6 0.18 45
;
run;
* First screen the variables with PROC STEPDISC (stepwise selection);
proc stepdisc data=test2 method=sw;
   class group;
run;

 

 

(1) The first part of the output shows:
              The Method for Selecting Variables is STEPWISE
                             Total Sample Size      34          Variable(s) in the Analysis        7
                             Class Levels            2          Variable(s) Will Be Included       0
                                                                Significance Level to Enter     0.15
                                                                Significance Level to Stay      0.15
                                            Number of Observations Read             34
                                            Number of Observations Used             34
                                                      Class Level Information
                                              Variable
                                     group    Name        Frequency       Weight    Proportion
                                         1    _1                 11      11.0000      0.323529
                                         2    _2                 23      23.0000      0.676471
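There are 34 samples in two classes, liquefied and not liquefied: 11 liquefied samples (group 1) and 23 non-liquefied samples (group 2).
The significance level for both entering a variable into the model and removing one from it is 0.15.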
(2)
                        Stepwise Selection: Step 1
                                                 Statistics for Entry, DF = 1, 32
                                      Variable    R-Square    F Value    Pr > F    Tolerance
                                      x1            0.0453       1.52    0.2270       1.0000
                                      x2            0.0013       0.04    0.8402       1.0000
                                      x3            0.0322       1.07    0.3095       1.0000
                                      x4            0.0861       3.01    0.0921       1.0000
                                      x5            0.3561      17.70    0.0002       1.0000
                                      x6            0.0704       2.42    0.1295       1.0000
                                      x7            0.0333       1.10    0.3020       1.0000
                                                   Variable x5 will be entered.
                                             Multivariate Statistics
                  Statistic                                       Value    F Value    Num DF    Den DF    Pr > F
                  Wilks' Lambda                                0.643914      17.70         1        32    0.0002
                  Pillai's Trace                               0.356086      17.70         1        32    0.0002
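x5 has the largest F value (17.70) and its p-value, 0.0002, is below 0.15, so it contributes most to the discrimination and is entered first.
The lower table shows the multivariate statistics after x5 is entered.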
(3) Variable selection, step 2:
                                                   Partial
                                      Variable    R-Square    F Value    Pr > F    Tolerance
                                      x1            0.1551       5.69    0.0234       0.9725
                                      x2            0.0016       0.05    0.8240       1.0000
                                      x3            0.0096       0.30    0.5882       0.9706
                                      x4            0.0207       0.66    0.4244       0.9054
                                      x6            0.1552       5.69    0.0233       0.9930
                                      x7            0.1875       7.16    0.0118       0.9339
                                                   Variable x7 will be entered.
                                                Variable(s) That Have Been Entered
                                                              x5 x7
                                                     Multivariate Statistics
                  Statistic                                       Value    F Value    Num DF    Den DF    Pr > F
                  Wilks' Lambda                                0.523154      14.13         2        31    <.0001
                  Pillai's Trace                               0.476846      14.13         2        31    <.0001
                  Average Squared Canonical Correlation        0.476846
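With x5 in the model, x7 has the largest F value among the remaining variables (7.16, p = 0.0118), so it contributes most to the discrimination and is entered next.
The lower table shows the multivariate statistics after x5 and x7 are entered; p < 0.0001, so the model separates the groups well.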
(4) Variable selection, step 3:
                                                Statistics for Removal, DF = 1, 31
                                                         Partial
                                            Variable    R-Square    F Value    Pr > F
                                            x5            0.4588      26.29    <.0001
                                            x7            0.1875       7.16    0.0118
                                                   No variables can be removed.
                                                 Statistics for Entry, DF = 1, 30
                                                   Partial
                                      Variable    R-Square    F Value    Pr > F    Tolerance
                                      x1            0.0231       0.71    0.4067       0.5062
                                      x2            0.0106       0.32    0.5747       0.8511
                                      x3            0.0532       1.68    0.2042       0.8800
                                      x4            0.0360       1.12    0.2983       0.8581
                                      x6            0.2357       9.25    0.0048       0.9235
                                                   Variable x6 will be entered.
                                                Variable(s) That Have Been Entered
                                                             x5 x6 x7
                                                     Multivariate Statistics
                  Statistic                                       Value    F Value    Num DF    Den DF    Pr > F
                  Wilks' Lambda                                0.399827      15.01         3        30    <.0001
                  Pillai's Trace                               0.600173      15.01         3        30    <.0001
                  Average Squared Canonical Correlation        0.600173
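The first table shows that x5 and x7, already in the model, have p-values of <0.0001 and 0.0118, both below 0.15, so both are retained.
The second table shows that x6 has the largest F value (9.25, p = 0.0048) among the remaining variables, so x6 is entered.
After it is entered the multivariate statistics give p < 0.0001, which is very satisfactory.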
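The numbers in the three entry steps above are tied together by two simple relations, which the sketch below checks. It assumes the usual identities for stepwise discriminant analysis: the entry F statistic is R²/(1 − R²) × (denominator DF / numerator DF), with numerator DF = 1 because there are two groups, and each entered variable multiplies Wilks' Lambda by (1 − partial R²). The partial R-square values are the rounded ones printed by SAS, so the results agree only to about two decimals.

```python
def partial_f(partial_r2, den_df, num_df=1):
    """F statistic for entering one variable, from its partial R-square."""
    return partial_r2 / (1.0 - partial_r2) * den_df / num_df

# (variable entered, partial R-square, denominator DF) read from steps 1 to 3.
ENTERED = [("x5", 0.3561, 32), ("x7", 0.1875, 31), ("x6", 0.2357, 30)]

wilks = 1.0
history = []
for name, r2, den_df in ENTERED:
    wilks *= 1.0 - r2  # each entry multiplies Lambda by (1 - partial R-square)
    history.append((name, partial_f(r2, den_df), wilks))

for name, f_value, lam in history:
    print(f"{name}: F = {f_value:.2f}, Wilks' Lambda = {lam:.4f}")
```

The recomputed F values should be close to 17.70, 7.16 and 9.25, and the running Wilks' Lambda close to 0.6439, 0.5232 and 0.3998, matching the SAS listing up to rounding.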
(5) Variable selection, step 4:
                                              Partial
                                            Variable    R-Square    F Value    Pr > F
                                            x5            0.5502      36.70    <.0001
                                            x6            0.2357       9.25    0.0048
                                            x7            0.2650      10.82    0.0026
                                                   No variables can be removed.
                                                 Statistics for Entry, DF = 1, 29
                                                   Partial
                                      Variable    R-Square    F Value    Pr > F    Tolerance
                                      x1            0.0000       0.00    0.9959       0.4654
                                      x2            0.0204       0.60    0.4436       0.7328
                                      x3            0.0227       0.67    0.4189       0.8744
                                      x4            0.0693       2.16    0.1525       0.8528
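Among the remaining variables x1, x2, x3 and x4, every p-value exceeds the default threshold of 0.15 (the smallest is 0.1525 for x4), so their contribution to the discrimination is considered small and none of them is entered.
The variables that contribute most, x5, x6 and x7, are then passed to PROC DISCRIM for the discriminant analysis: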
proc discrim data=test2 distance manova;
class group;
var x5 x6 x7;
run;
(1) From the output:
                                               Generalized Squared Distance to group
                                            From group             1             2
                                                     1             0       6.45524
                                                     2       6.45524             0
The generalized squared distance between the liquefied and non-liquefied groups is 6.45524.
(2)
                              Multivariate Statistics and Exact F Statistics
                                                      S=1    M=0.5    N=14
                         Statistic                        Value    F Value    Num DF    Den DF    Pr > F
                         Wilks' Lambda               0.39982725      15.01         3        30    <.0001
                         Pillai's Trace              0.60017275      15.01         3        30    <.0001
                         Hotelling-Lawley Trace      1.50108015      15.01         3        30    <.0001
                         Roy's Greatest Root         1.50108015      15.01         3        30    <.0001
All of the multivariate tests computed by SAS give p < 0.0001, so the mean vectors of the two groups differ significantly.
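The distance table and the multivariate tests above are two views of the same quantity. As a sketch, assuming equal priors and a pooled covariance matrix (so that the generalized squared distance is the Mahalanobis D²), the standard two-group relations give Hotelling's T² = n1·n2/(n1+n2)·D², from which the Hotelling-Lawley trace, the exact F statistic and Wilks' Lambda follow. The function name is illustrative.

```python
def two_group_statistics(d2, n1, n2, p):
    """Convert a squared Mahalanobis distance into the two-group test statistics."""
    t2 = n1 * n2 / (n1 + n2) * d2                        # Hotelling's T-squared
    trace = t2 / (n1 + n2 - 2)                           # Hotelling-Lawley trace
    f_value = (n1 + n2 - p - 1) / (p * (n1 + n2 - 2)) * t2
    wilks = 1.0 / (1.0 + trace)                          # one root, so Lambda = 1 / (1 + trace)
    return trace, f_value, wilks

# D-squared from the distance table; 11 liquefied and 23 non-liquefied samples; 3 variables.
trace, f_value, wilks = two_group_statistics(6.45524, n1=11, n2=23, p=3)
print(round(trace, 4), round(f_value, 2), round(wilks, 4))
```

The values should agree with the listing: a Hotelling-Lawley trace near 1.5011, F near 15.01 on 3 and 30 degrees of freedom, and Wilks' Lambda near 0.3998.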
                                              Linear Discriminant Function for group
                                               Variable             1             2
                                               Constant      -4.25606      -4.99350
                                               x5             0.36747      -0.15061
                                               x6            16.40301      37.68245
                                               x7             0.00216       0.04004
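From the constants and coefficients in the output, the linear discriminant functions are:
z1 = -4.25606 + 0.36747*x5 + 16.40301*x6 + 0.00216*x7
z2 = -4.99350 - 0.15061*x5 + 37.68245*x6 + 0.04004*x7
A sample is assigned to the group with the larger score.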
(3) Classification of the samples using the discriminant functions:
                                      From group            1            2        Total
                                               1           10            1           11
                                                        90.91         9.09       100.00
                                               2            0           23           23
                                                         0.00       100.00       100.00
                                           Total           10           24           34
                                                        29.41        70.59       100.00
                                          Priors          0.5          0.5
                                                 Error Count Estimates for group
                                                                1           2       Total
                                         Rate              0.0909      0.0000      0.0455
                                         Priors            0.5000      0.5000
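Of the 11 liquefied samples, 10 are classified correctly and 1 is misclassified as non-liquefied, a correct classification rate of 90.91%.
All 23 non-liquefied samples are classified correctly, a rate of 100%. With equal priors of 0.5, the overall error rate is 4.55%.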
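The classification table can be reproduced outside SAS by applying the two linear discriminant functions to every sample. The sketch below copies the x5, x6 and x7 columns from the CARDS block and the coefficients from the "Linear Discriminant Function for group" table; the helper names are illustrative, and because the coefficients are rounded to five decimals, a sample lying almost exactly on the boundary could in principle be classified differently from SAS.

```python
# (x5, x6, x7) for the 34 samples, in the order of the CARDS block.
SAMPLES = [
    (6, 0.12, 20), (12, 0.12, 20), (6, 0.08, 12), (12, 0.08, 12),
    (19, 0.35, 75), (28, 0.30, 30), (18, 0.15, 75), (12, 0.16, 40),
    (6, 0.16, 40), (35, 0.12, 180), (15, 0.21, 45),
    (4, 0.35, 75), (10, 0.35, 75), (10, 0.35, 75), (3, 0.20, 15),
    (9, 0.25, 30), (4, 0.25, 30), (1, 0.25, 30), (4, 0.08, 70),
    (1, 0.08, 70), (12, 0.30, 30), (3, 0.30, 30), (5, 0.30, 30),
    (7, 0.18, 18), (6, 0.15, 75), (8, 0.15, 75), (6, 0.16, 40),
    (8, 0.16, 40), (5, 0.15, 180), (5, 0.15, 180), (10, 0.16, 180),
    (13, 0.25, 180), (6, 0.21, 45), (6, 0.18, 45),
]
TRUE_GROUP = [1] * 11 + [2] * 23  # first 11 liquefied, remaining 23 not

# Constants and coefficients for x5, x6, x7 from the SAS output.
LDF = {
    1: (-4.25606, (0.36747, 16.40301, 0.00216)),
    2: (-4.99350, (-0.15061, 37.68245, 0.04004)),
}

def classify(x):
    """Assign x to the group whose linear discriminant score is larger."""
    z = {g: c + sum(b * v for b, v in zip(coefs, x)) for g, (c, coefs) in LDF.items()}
    return max(z, key=z.get)

# confusion[(true group, assigned group)] = number of samples
confusion = {(t, a): 0 for t in (1, 2) for a in (1, 2)}
for x, truth in zip(SAMPLES, TRUE_GROUP):
    confusion[(truth, classify(x))] += 1

rate_1 = confusion[(1, 2)] / 11
rate_2 = confusion[(2, 1)] / 23
total = 0.5 * rate_1 + 0.5 * rate_2  # weighted by the equal priors of 0.5
print(confusion, round(rate_1, 4), round(rate_2, 4), round(total, 4))
```

With these inputs the only misclassified sample is the ninth one (x5 = 6, x6 = 0.16, x7 = 40), a liquefied sample that shares its x5, x6 and x7 values with one of the non-liquefied samples, so no rule based on these three variables alone can classify both correctly.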
