Association Rule Mining with Apriori

2020-08-07 01:26:13 苏内容
  Tags: Apriori / mining

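Association rules

  • In the United States, some young fathers often stop at the supermarket after work to buy baby diapers. Supermarkets noticed a pattern: 30% to 40% of the young fathers who bought diapers also bought beer. They then rearranged the shelves so that diapers and beer sat next to each other, which noticeably increased sales.

  • When the values of two or more variables show some regularity, this is called an association.

  • Association rule mining looks for correlations between different items that appear in the same event, for example between the different products bought in a single purchase.

  • "Of the customers who buy a computer, 30% also buy a printer."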

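  • A sample is called a "transaction"
  • Each transaction is described by several attributes, which are called "items"
  • A set made up of several items is called an "itemset"

A k-itemset is a set made up of k items:

  • {milk} and {beer} are 1-itemsets;
  • {milk, jelly} is a 2-itemset;
  • {beer, bread, milk} is a 3-itemset

Meaning of X ==> Y:

  • X and Y are itemsets
  • X is called the antecedent of the rule
  • Y is called the consequent of the rule

A transaction contains only the items involved, not the details of those items.

  • In supermarket association rule mining, a transaction is the set of products a customer buys in one shopping trip. The transaction does not include details about these products, such as quantity or price.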

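Support: the frequency with which an itemset or rule appears across all transactions. σ(X) denotes the support count of itemset X, and N is the total number of transactions.

  • Support of itemset X: s(X) = σ(X)/N
  • The support of a rule X ==> Y is the probability that itemsets X and Y appear together: s(X ==> Y) = σ(X∪Y)/N
  • If 100 customers shop at a store on a given day and 30 of them buy both beer and diapers, the support of that association rule is 30%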
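
As a minimal sketch of the support definition above, the snippet below counts σ(X) over a list of transactions and divides by N. The helper names `support_count` and `support` are illustrative, and the 70 baskets that do not contain both items are made-up filler chosen only to reproduce the 30-out-of-100 example.

```python
# 100 transactions: 30 contain both beer and diapers (as in the example above).
# The remaining 70 baskets are hypothetical filler.
transactions = [{"beer", "diaper"}] * 30 + [{"milk"}] * 40 + [{"beer"}] * 30

def support_count(itemset, transactions):
    """sigma(X): number of transactions that contain every item of X."""
    return sum(1 for t in transactions if set(itemset) <= t)

def support(itemset, transactions):
    """s(X) = sigma(X) / N."""
    return support_count(itemset, transactions) / len(transactions)

print(support({"beer", "diaper"}, transactions))  # support of the rule beer ==> diaper
print(support({"beer"}, transactions))            # support of the 1-itemset {beer}
```

Note that the support of a rule only depends on the union of its two sides, so beer ==> diaper and diaper ==> beer have the same support.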

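Confidence: how frequently Y appears in transactions that contain X. c(X → Y) = σ(X∪Y)/σ(X)

  • p(Y|X) = p(XY)/p(X).
  • Confidence reflects how reliable an association rule is: how likely a customer who bought the products in itemset X is to also buy the products in Y
  • If 50% of the customers who buy chips also buy cola, the confidence is 50%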
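
The confidence formula can be sketched in the same way. The baskets below are hypothetical and were chosen so that half of the chips buyers also buy cola; `confidence` is an illustrative helper name, not a library function.

```python
# Hypothetical baskets: 4 contain chips, and 2 of those also contain cola.
transactions = [
    {"chips", "cola"},
    {"chips", "cola", "bread"},
    {"chips"},
    {"chips", "bread"},
    {"cola"},
    {"bread", "milk"},
]

def support_count(itemset, transactions):
    """sigma(X): number of transactions that contain every item of X."""
    return sum(1 for t in transactions if set(itemset) <= t)

def confidence(antecedent, consequent, transactions):
    """c(X -> Y) = sigma(X u Y) / sigma(X)."""
    both = support_count(set(antecedent) | set(consequent), transactions)
    return both / support_count(antecedent, transactions)

print(confidence({"chips"}, {"cola"}, transactions))  # chips ==> cola
print(confidence({"cola"}, {"chips"}, transactions))  # cola ==> chips, a different value
```

Unlike support, confidence is not symmetric: swapping antecedent and consequent changes the denominator.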

[Figure: example transaction database]

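With a minimum support of 50% and a minimum confidence of 50%, we obtain (support, confidence):

  • A ==> C (50%, 66.6%)
  • C ==> A (50%, 100%)

If the support and confidence of an association rule X ==> Y are both greater than or equal to the user-specified minimum support (minsupport) and minimum confidence (minconfidence), X ==> Y is called a strong association rule; otherwise it is called a weak association rule.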
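
The transaction table behind the A/C example is not reproduced in the text. The sketch below therefore uses a hypothetical four-transaction database, chosen as an assumption so that the two rules quoted above come out with the same support and confidence, and then applies the strong-rule test. All names (`rule_metrics`, `is_strong`) are illustrative.

```python
# Hypothetical transaction database (an assumption, picked to match the quoted numbers).
transactions = [
    {"A", "B", "C"},
    {"A", "C"},
    {"A", "D"},
    {"B", "E", "F"},
]
MIN_SUPPORT = 0.5
MIN_CONFIDENCE = 0.5

def count(itemset):
    """Number of transactions containing every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)

def rule_metrics(x, y):
    """Return (support, confidence) of the rule x ==> y."""
    both = count(x | y)
    return both / len(transactions), both / count(x)

def is_strong(x, y):
    """A rule is strong when it meets both minimum thresholds."""
    s, c = rule_metrics(x, y)
    return s >= MIN_SUPPORT and c >= MIN_CONFIDENCE

for x, y in [({"A"}, {"C"}), ({"C"}, {"A"}), ({"B"}, {"C"})]:
    s, c = rule_metrics(x, y)
    label = "strong" if is_strong(x, y) else "weak"
    print(f"{sorted(x)} ==> {sorted(y)}: support={s:.3f}, confidence={c:.3f}, {label}")
```

In this made-up database B ==> C reaches the confidence threshold but not the support threshold, so it is a weak rule.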

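Lift: how much the appearance of itemset A changes the probability that itemset B appears

  • lift(A ==> B) = confidence(A ==> B)/support(B) = p(B|A)/p(B)
  • Suppose there are 1000 consumers and 500 of them bought tea. Of those 500, 450 also bought coffee and the other 50 did not. Since confidence(tea ==> coffee) = 450/500 = 90%, one might conclude that people who like tea tend to like coffee. But if 450 of the other 500 people, who did not buy tea, also bought coffee, that is an equally high confidence of 90%, which would mean that people who do not drink tea also like coffee. So whether someone buys coffee has nothing to do with whether they buy tea: the two are independent, and the lift is 90%/[(450+450)/1000] = 1.

Lift therefore makes up for this weakness of confidence. If lift = 1, A and B are independent and A does nothing to raise the likelihood of B. The larger the value (lift > 1), the more A raises the likelihood of B, and the stronger the association.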
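
The tea and coffee arithmetic can be checked with a few lines of code. This is a small sketch working directly from counts; the second call uses a hypothetical variation (only 150 of the non-tea buyers purchase coffee) to show what a lift above 1 looks like.

```python
def lift(n_total, n_a, n_b, n_ab):
    """lift(A ==> B) = confidence(A ==> B) / support(B) = P(B|A) / P(B)."""
    confidence = n_ab / n_a
    support_b = n_b / n_total
    return confidence / support_b

# Counts from the example: 1000 customers, 500 bought tea, 450 of them also
# bought coffee, and 450 of the 500 non-tea buyers bought coffee as well.
tea_coffee = lift(n_total=1000, n_a=500, n_b=450 + 450, n_ab=450)
print(tea_coffee)

# Hypothetical contrast: only 150 of the non-tea buyers bought coffee.
contrast = lift(n_total=1000, n_a=500, n_b=450 + 150, n_ab=450)
print(contrast)
```

In the contrast case the confidence is still 90%, but coffee is less common overall, so knowing that a customer bought tea is now informative.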

Leverage and conviction play a role similar to lift: for both, a larger value indicates a stronger association

  • Leverage: P(A,B) - P(A)P(B)
  • Conviction: P(A)P(!B)/P(A,!B)
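
A short sketch of these two formulas, written as plain functions of the probabilities P(A), P(B) and P(A,B). The function names are illustrative. The first call reuses the tea/coffee numbers; the other two use the Onion/Potato supports (4/6, 5/6, joint 4/6) from the shopping example later in this article.

```python
import math

def leverage(p_a, p_b, p_ab):
    """P(A,B) - P(A)P(B); equals 0 when A and B are independent."""
    return p_ab - p_a * p_b

def conviction(p_a, p_b, p_ab):
    """P(A)P(!B) / P(A,!B); 1 when independent, inf when A never occurs without B."""
    p_a_not_b = p_a - p_ab  # P(A, !B) = P(A) - P(A, B)
    if p_a_not_b <= 0:
        return math.inf
    return p_a * (1 - p_b) / p_a_not_b

# Tea/coffee: P(tea) = 0.5, P(coffee) = 0.9, P(tea, coffee) = 0.45.
print(leverage(0.5, 0.9, 0.45), conviction(0.5, 0.9, 0.45))

# Onion ==> Potato: every basket with Onion also contains Potato.
print(leverage(4 / 6, 5 / 6, 4 / 6), conviction(4 / 6, 5 / 6, 4 / 6))

# Potato ==> Onion: same leverage, finite conviction.
print(leverage(5 / 6, 4 / 6, 4 / 6), conviction(5 / 6, 4 / 6, 4 / 6))
```

Leverage is symmetric in A and B, while conviction depends on the direction of the rule.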

Using the mlxtend package to obtain frequent itemsets and rules

  • pip install mlxtend
import pandas as pd
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules

Define a small shopping dataset

data = {'ID': [1, 2, 3, 4, 5, 6],
        'Onion': [1, 0, 0, 1, 1, 1],
        'Potato': [1, 1, 0, 1, 1, 1],
        'Burger': [1, 1, 0, 0, 1, 1],
        'Milk': [0, 1, 1, 1, 0, 1],
        'Beer': [0, 0, 1, 0, 1, 0]}
df = pd.DataFrame(data)
df = df[['ID', 'Onion', 'Potato', 'Burger', 'Milk', 'Beer']]
df

   ID  Onion  Potato  Burger  Milk  Beer
0   1      1       1       1     0     0
1   2      0       1       1     1     0
2   3      0       0       0     1     1
3   4      1       1       0     1     0
4   5      1       1       1     0     1
5   6      1       1       1     1     0

Set a minimum support to select the frequent itemsets.

  • Choose a minimum support of 50%

  • apriori(df, min_support=0.5, use_colnames=True)

# The ID column is not an item, so only the item columns are passed in.
frequent_itemsets = apriori(df[['Onion', 'Potato', 'Burger', 'Milk', 'Beer']], min_support=0.50, use_colnames=True)
frequent_itemsets

    support   itemsets
0   0.666667  (Onion)
1   0.833333  (Potato)
2   0.666667  (Burger)
3   0.666667  (Milk)
4   0.666667  (Potato, Onion)
5   0.500000  (Burger, Onion)
6   0.666667  (Burger, Potato)
7   0.500000  (Milk, Potato)
8   0.500000  (Burger, Potato, Onion)

All returned itemsets, of sizes 1, 2 and 3, have support >= 50%.
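
The article calls mlxtend's `apriori` without showing what happens inside. The following is a simplified, unoptimized sketch of the level-wise Apriori idea (count candidates, keep the frequent ones, join them into larger candidates, prune candidates that have an infrequent subset). It is not mlxtend's implementation, and `apriori_sketch` is a name made up for this illustration. The six baskets are the rows of the table above written as item sets.

```python
from itertools import combinations

# The six baskets from the shopping table, one set of items per transaction.
baskets = [
    {"Onion", "Potato", "Burger"},
    {"Potato", "Burger", "Milk"},
    {"Milk", "Beer"},
    {"Onion", "Potato", "Milk"},
    {"Onion", "Potato", "Burger", "Beer"},
    {"Onion", "Potato", "Burger", "Milk"},
]

def apriori_sketch(baskets, min_support):
    """Return {frozenset: support} for all itemsets with support >= min_support."""
    n = len(baskets)
    items = sorted({item for b in baskets for item in b})
    current = [frozenset([i]) for i in items]  # candidate 1-itemsets
    frequent = {}
    k = 1
    while current:
        # Count each candidate and keep the frequent ones for this level.
        level = {}
        for cand in current:
            s = sum(1 for b in baskets if cand <= b) / n
            if s >= min_support:
                level[cand] = s
        frequent.update(level)
        # Join step: merge frequent k-itemsets into (k+1)-item candidates.
        k += 1
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-item subset of a candidate must be frequent.
        current = [c for c in candidates
                   if all(frozenset(s) in level for s in combinations(c, k - 1))]
    return frequent

frequent_sets = apriori_sketch(baskets, min_support=0.5)
for itemset, s in sorted(frequent_sets.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), round(s, 6))
```

The prune step is what gives Apriori its name: an itemset can only be frequent if all of its subsets are frequent, so {Milk, Onion, Potato} is never counted because {Milk, Onion} already failed the threshold.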

Computing the rules

  • association_rules(df, metric='lift', min_threshold=1), where df is the DataFrame of frequent itemsets
  • Different metrics and minimum thresholds can be specified
rules = association_rules(frequent_itemsets, metric='lift', min_threshold=1)
rules

    antecedents       consequents       antecedent support  consequent support  support   confidence  lift   leverage  conviction
0   (Potato)          (Onion)           0.833333            0.666667            0.666667  0.80        1.200  0.111111  1.666667
1   (Onion)           (Potato)          0.666667            0.833333            0.666667  1.00        1.200  0.111111  inf
2   (Burger)          (Onion)           0.666667            0.666667            0.500000  0.75        1.125  0.055556  1.333333
3   (Onion)           (Burger)          0.666667            0.666667            0.500000  0.75        1.125  0.055556  1.333333
4   (Burger)          (Potato)          0.666667            0.833333            0.666667  1.00        1.200  0.111111  inf
5   (Potato)          (Burger)          0.833333            0.666667            0.666667  0.80        1.200  0.111111  1.666667
6   (Burger, Potato)  (Onion)           0.666667            0.666667            0.500000  0.75        1.125  0.055556  1.333333
7   (Burger, Onion)   (Potato)          0.500000            0.833333            0.500000  1.00        1.200  0.083333  inf
8   (Potato, Onion)   (Burger)          0.666667            0.666667            0.500000  0.75        1.125  0.055556  1.333333
9   (Burger)          (Potato, Onion)   0.666667            0.666667            0.500000  0.75        1.125  0.055556  1.333333
10  (Potato)          (Burger, Onion)   0.833333            0.500000            0.500000  0.60        1.200  0.083333  1.250000
11  (Onion)           (Burger, Potato)  0.666667            0.666667            0.500000  0.75        1.125  0.055556  1.333333

The result lists the value of every metric for each rule. The rules can be sorted by whichever metric is of interest, but interpreting them still depends on what the data actually means.

rules[(rules['lift'] > 1.125) & (rules['confidence'] > 0.8)]

    antecedents       consequents       antecedent support  consequent support  support   confidence  lift   leverage  conviction
1   (Onion)           (Potato)          0.666667            0.833333            0.666667  1.0         1.2    0.111111  inf
4   (Burger)          (Potato)          0.666667            0.833333            0.666667  1.0         1.2    0.111111  inf
7   (Burger, Onion)   (Potato)          0.500000            0.833333            0.500000  1.0         1.2    0.083333  inf

These results are more valuable:

  • (Onion and Potato) and (Burger and Potato) can be sold together
  • If Onion and Burger are both in the basket, the customer is also quite likely to buy Potato. If Potato is not in the basket yet, it is worth recommending.
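
To see where these three rules come from, the filter can be reproduced by hand. The sketch below enumerates every antecedent/consequent split of the frequent itemsets with two or more items and applies the same two thresholds. It uses `fractions.Fraction` so that the comparisons at exactly 0.8 and 1.125 are exact; that is a choice made for this illustration, not something mlxtend does.

```python
from fractions import Fraction
from itertools import combinations

# The six baskets from the shopping table.
baskets = [
    {"Onion", "Potato", "Burger"},
    {"Potato", "Burger", "Milk"},
    {"Milk", "Beer"},
    {"Onion", "Potato", "Milk"},
    {"Onion", "Potato", "Burger", "Beer"},
    {"Onion", "Potato", "Burger", "Milk"},
]

def supp(itemset):
    """Exact support of an itemset as a fraction."""
    return Fraction(sum(1 for b in baskets if itemset <= b), len(baskets))

# Frequent itemsets with at least two items, copied from the apriori output.
frequent = [frozenset(s) for s in (
    ("Potato", "Onion"), ("Burger", "Onion"), ("Burger", "Potato"),
    ("Milk", "Potato"), ("Burger", "Potato", "Onion"),
)]

kept = []
for itemset in frequent:
    for r in range(1, len(itemset)):
        for antecedent in map(frozenset, combinations(sorted(itemset), r)):
            consequent = itemset - antecedent
            confidence = supp(itemset) / supp(antecedent)
            lift = confidence / supp(consequent)
            # Same filter as above: lift > 1.125 and confidence > 0.8.
            if lift > Fraction(9, 8) and confidence > Fraction(4, 5):
                kept.append((sorted(antecedent), sorted(consequent),
                             float(confidence), float(lift)))

for rule in kept:
    print(rule)
```

Rules such as Potato ==> Onion have the same lift of 1.2 but a confidence of exactly 0.8, so the strict inequality drops them.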

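Formulas for all the metrics, for a rule A ==> B:

  • support(A ==> B) = support(A ∪ B)
  • confidence(A ==> B) = support(A ==> B) / support(A)
  • lift(A ==> B) = confidence(A ==> B) / support(B)
  • leverage(A ==> B) = support(A ==> B) - support(A) × support(B)
  • conviction(A ==> B) = (1 - support(B)) / (1 - confidence(A ==> B))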

The data must be converted to one-hot encoding

retail_shopping_basket = {'ID': [1, 2, 3, 4, 5, 6],
                          'Basket': [['Beer', 'Diaper', 'Pretzels', 'Chips', 'Aspirin'],
                                     ['Diaper', 'Beer', 'Chips', 'Lotion', 'Juice', 'BabyFood', 'Milk'],
                                     ['Soda', 'Chips', 'Milk'],
                                     ['Soup', 'Beer', 'Diaper', 'Milk', 'IceCream'],
                                     ['Soda', 'Coffee', 'Milk', 'Bread'],
                                     ['Beer', 'Chips']
                                    ]
                          }
retail = pd.DataFrame(retail_shopping_basket)
retail = retail[['ID', 'Basket']]
pd.options.display.max_colwidth = 100
retail

   ID  Basket
0   1  [Beer, Diaper, Pretzels, Chips, Aspirin]
1   2  [Diaper, Beer, Chips, Lotion, Juice, BabyFood, Milk]
2   3  [Soda, Chips, Milk]
3   4  [Soup, Beer, Diaper, Milk, IceCream]
4   5  [Soda, Coffee, Milk, Bread]
5   6  [Beer, Chips]

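The baskets are made up of strings and have to be converted into a numeric encoding

# Keep the ID column separately (axis=1 drops a column).
retail_id = retail.drop('Basket', axis=1)
retail_id

   ID
0   1
1   2
2   3
3   4
4   5
5   6

# Join each basket list into one comma-separated string.
retail_Basket = retail.Basket.str.join(',')
retail_Basket

0              Beer,Diaper,Pretzels,Chips,Aspirin
1    Diaper,Beer,Chips,Lotion,Juice,BabyFood,Milk
2                                 Soda,Chips,Milk
3                  Soup,Beer,Diaper,Milk,IceCream
4                          Soda,Coffee,Milk,Bread
5                                      Beer,Chips
Name: Basket, dtype: object

# Split on ',' and create one 0/1 column per item.
retail_Basket = retail_Basket.str.get_dummies(',')
retail_Basket

   Aspirin  BabyFood  Beer  Bread  Chips  Coffee  Diaper  IceCream  Juice  Lotion  Milk  Pretzels  Soda  Soup
0        1         0     1      0      1       0       1         0      0       0     0         1     0     0
1        0         1     1      0      1       0       1         0      1       1     1         0     0     0
2        0         0     0      0      1       0       0         0      0       0     1         0     1     0
3        0         0     1      0      0       0       1         1      0       0     1         0     0     1
4        0         0     0      1      0       1       0         0      0       0     1         0     1     0
5        0         0     1      0      1       0       0         0      0       0     0         0     0     0

retail = retail_id.join(retail_Basket)
retail

   ID  Aspirin  BabyFood  Beer  Bread  Chips  Coffee  Diaper  IceCream  Juice  Lotion  Milk  Pretzels  Soda  Soup
0   1        1         0     1      0      1       0       1         0      0       0     0         1     0     0
1   2        0         1     1      0      1       0       1         0      1       1     1         0     0     0
2   3        0         0     0      0      1       0       0         0      0       0     1         0     1     0
3   4        0         0     1      0      0       0       1         1      0       0     1         0     0     1
4   5        0         0     0      1      0       1       0         0      0       0     1         0     1     0
5   6        0         0     1      0      1       0       0         0      0       0     0         0     0     0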
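
What the join / `get_dummies` steps do can also be written out without pandas. This is only a plain-Python sketch of the same one-hot idea, useful for seeing the logic; the variable names are illustrative.

```python
# The six baskets from the retail example above.
baskets = [
    ["Beer", "Diaper", "Pretzels", "Chips", "Aspirin"],
    ["Diaper", "Beer", "Chips", "Lotion", "Juice", "BabyFood", "Milk"],
    ["Soda", "Chips", "Milk"],
    ["Soup", "Beer", "Diaper", "Milk", "IceCream"],
    ["Soda", "Coffee", "Milk", "Bread"],
    ["Beer", "Chips"],
]

# One column per distinct item, in sorted order.
columns = sorted({item for basket in baskets for item in basket})

# One row per basket: 1 if the basket contains the item, else 0.
one_hot = [[1 if col in basket else 0 for col in columns] for basket in baskets]

print(columns)
for row in one_hot:
    print(row)
```

Each row sums to the number of distinct items in the corresponding basket, which is a quick sanity check on the encoding.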
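# min_support is not given here, so the library default of 0.5 applies.
frequent_itemsets_2 = apriori(retail.drop('ID', axis=1), use_colnames=True)
frequent_itemsets_2

    support   itemsets
0   0.666667  (Beer)
1   0.666667  (Chips)
2   0.500000  (Diaper)
3   0.666667  (Milk)
4   0.500000  (Chips, Beer)
5   0.500000  (Diaper, Beer)

Judging by support alone, [Beer, Chips] and [Beer, Diaper] are equally frequent. Which of the two combinations is more strongly related?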

association_rules(frequent_itemsets_2, metric='lift')

    antecedents  consequents  antecedent support  consequent support  support  confidence  lift   leverage  conviction
0   (Chips)      (Beer)       0.666667            0.666667            0.5      0.75        1.125  0.055556  1.333333
1   (Beer)       (Chips)      0.666667            0.666667            0.5      0.75        1.125  0.055556  1.333333
2   (Diaper)     (Beer)       0.500000            0.666667            0.5      1.00        1.500  0.166667  inf
3   (Beer)       (Diaper)     0.666667            0.500000            0.5      0.75        1.500  0.166667  2.000000

Clearly {Diaper, Beer} is the more strongly related pair.
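
The difference between the two pairs can be verified directly from the baskets. The sketch below computes support and lift with exact fractions; `supp` and `lift` are illustrative helper names.

```python
from fractions import Fraction

# The six baskets from the retail example, as sets.
baskets = [
    {"Beer", "Diaper", "Pretzels", "Chips", "Aspirin"},
    {"Diaper", "Beer", "Chips", "Lotion", "Juice", "BabyFood", "Milk"},
    {"Soda", "Chips", "Milk"},
    {"Soup", "Beer", "Diaper", "Milk", "IceCream"},
    {"Soda", "Coffee", "Milk", "Bread"},
    {"Beer", "Chips"},
]

def supp(*items):
    """Exact support of the itemset formed by the given items."""
    return Fraction(sum(1 for b in baskets if set(items) <= b), len(baskets))

def lift(a, b):
    """lift = support(a, b) / (support(a) * support(b)); symmetric in a and b."""
    return supp(a, b) / (supp(a) * supp(b))

print(float(supp("Beer", "Chips")), float(supp("Beer", "Diaper")))  # equal supports
print(float(lift("Chips", "Beer")), float(lift("Diaper", "Beer")))  # different lifts
```

Both pairs occur in three of the six baskets, but Diaper on its own is rarer than Chips, so the co-occurrence with Beer is further above what independence would predict.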

Movie genre associations

Dataset: MovieLens (small)

movies = pd.read_csv('ml-latest-small/movies.csv')
movies.head(10)

   movieId  title                               genres
0  1        Toy Story (1995)                    Adventure|Animation|Children|Comedy|Fantasy
1  2        Jumanji (1995)                      Adventure|Children|Fantasy
2  3        Grumpier Old Men (1995)             Comedy|Romance
3  4        Waiting to Exhale (1995)            Comedy|Drama|Romance
4  5        Father of the Bride Part II (1995)  Comedy
5  6        Heat (1995)                         Action|Crime|Thriller
6  7        Sabrina (1995)                      Comedy|Romance
7  8        Tom and Huck (1995)                 Adventure|Children
8  9        Sudden Death (1995)                 Action
9  10       GoldenEye (1995)                    Action|Adventure|Thriller

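The data contains movie titles and genre labels. The first step is again to convert it to one-hot format.

# get_dummies() splits on '|' by default, which is the separator used in genres.
movies_ohe = movies.drop('genres', axis=1).join(movies.genres.str.get_dummies())
pd.options.display.max_columns = 100
movies_ohe.head()

   movieId  title                               (no genres listed)  Action  Adventure  Animation  Children  Comedy  Crime  Documentary  Drama  Fantasy  Film-Noir  Horror  IMAX  Musical  Mystery  Romance  Sci-Fi  Thriller  War  Western
0  1        Toy Story (1995)                                     0       0          1          1         1       1      0            0      0        1          0       0     0        0        0        0       0         0    0        0
1  2        Jumanji (1995)                                       0       0          1          0         1       0      0            0      0        1          0       0     0        0        0        0       0         0    0        0
2  3        Grumpier Old Men (1995)                              0       0          0          0         0       1      0            0      0        0          0       0     0        0        0        1       0         0    0        0
3  4        Waiting to Exhale (1995)                             0       0          0          0         0       1      0            0      1        0          0       0     0        0        0        1       0         0    0        0
4  5        Father of the Bride Part II (1995)                   0       0          0          0         0       1      0            0      0        0          0       0     0        0        0        0       0         0    0        0

movies_ohe.shape
(9125, 22)

The dataset contains 9125 movies and 20 genre columns in total (one of them is the "(no genres listed)" placeholder).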

# Move movieId and title into the index so that only genre columns remain.
movies_ohe.set_index(['movieId', 'title'], inplace=True)
movies_ohe.head()

movieId  title                               (no genres listed)  Action  Adventure  Animation  Children  Comedy  Crime  Documentary  Drama  Fantasy  Film-Noir  Horror  IMAX  Musical  Mystery  Romance  Sci-Fi  Thriller  War  Western
1        Toy Story (1995)                                     0       0          1          1         1       1      0            0      0        1          0       0     0        0        0        0       0         0    0        0
2        Jumanji (1995)                                       0       0          1          0         1       0      0            0      0        1          0       0     0        0        0        0       0         0    0        0
3        Grumpier Old Men (1995)                              0       0          0          0         0       1      0            0      0        0          0       0     0        0        0        1       0         0    0        0
4        Waiting to Exhale (1995)                             0       0          0          0         0       1      0            0      1        0          0       0     0        0        0        1       0         0    0        0
5        Father of the Bride Part II (1995)                   0       0          0          0         0       1      0            0      0        0          0       0     0        0        0        0       0         0    0        0
frequent_itemsets_movies = apriori(movies_ohe, use_colnames=True, min_support=0.025)
frequent_itemsets_movies

    support   itemsets
0   0.169315  (Action)
1   0.122411  (Adventure)
2   0.048986  (Animation)
3   0.063890  (Children)
4   0.363288  (Comedy)
5   0.120548  (Crime)
6   0.054247  (Documentary)
7   0.478356  (Drama)
8   0.071671  (Fantasy)
9   0.096110  (Horror)
10  0.043178  (Musical)
11  0.059507  (Mystery)
12  0.169315  (Romance)
13  0.086795  (Sci-Fi)
14  0.189479  (Thriller)
15  0.040219  (War)
16  0.058301  (Action, Adventure)
17  0.037589  (Action, Comedy)
18  0.038247  (Action, Crime)
19  0.051178  (Action, Drama)
20  0.040986  (Sci-Fi, Action)
21  0.062904  (Action, Thriller)
22  0.029260  (Adventure, Children)
23  0.036712  (Adventure, Comedy)
24  0.032438  (Adventure, Drama)
25  0.030685  (Adventure, Fantasy)
26  0.027726  (Sci-Fi, Adventure)
27  0.027068  (Children, Animation)
28  0.032877  (Children, Comedy)
29  0.032438  (Crime, Comedy)
30  0.104000  (Drama, Comedy)
31  0.026959  (Fantasy, Comedy)
32  0.090082  (Romance, Comedy)
33  0.067616  (Crime, Drama)
34  0.057863  (Crime, Thriller)
35  0.031671  (Mystery, Drama)
36  0.101260  (Romance, Drama)
37  0.087123  (Drama, Thriller)
38  0.031014  (War, Drama)
39  0.043397  (Horror, Thriller)
40  0.036055  (Mystery, Thriller)
41  0.028932  (Sci-Fi, Thriller)
42  0.035068  (Romance, Drama, Comedy)
43  0.032000  (Crime, Drama, Thriller)
rules_movies = association_rules(frequent_itemsets_movies, metric='lift', min_threshold=1.25)
rules_movies

    antecedents        consequents        antecedent support  consequent support  support   confidence  lift      leverage  conviction
0   (Action)           (Adventure)        0.169315            0.122411            0.058301  0.344337    2.812955  0.037575  1.338475
1   (Adventure)        (Action)           0.122411            0.169315            0.058301  0.476276    2.812955  0.037575  1.586111
2   (Action)           (Crime)            0.169315            0.120548            0.038247  0.225890    1.873860  0.017836  1.136081
3   (Crime)            (Action)           0.120548            0.169315            0.038247  0.317273    1.873860  0.017836  1.216716
4   (Sci-Fi)           (Action)           0.086795            0.169315            0.040986  0.472222    2.789015  0.026291  1.573929
5   (Action)           (Sci-Fi)           0.169315            0.086795            0.040986  0.242071    2.789015  0.026291  1.204870
6   (Action)           (Thriller)         0.169315            0.189479            0.062904  0.371521    1.960746  0.030822  1.289654
7   (Thriller)         (Action)           0.189479            0.169315            0.062904  0.331984    1.960746  0.030822  1.243510
8   (Adventure)        (Children)         0.122411            0.063890            0.029260  0.239033    3.741299  0.021439  1.230158
9   (Children)         (Adventure)        0.063890            0.122411            0.029260  0.457976    3.741299  0.021439  1.619096
10  (Adventure)        (Fantasy)          0.122411            0.071671            0.030685  0.250671    3.497518  0.021912  1.238881
11  (Fantasy)          (Adventure)        0.071671            0.122411            0.030685  0.428135    3.497518  0.021912  1.534608
12  (Sci-Fi)           (Adventure)        0.086795            0.122411            0.027726  0.319444    2.609607  0.017101  1.289519
13  (Adventure)        (Sci-Fi)           0.122411            0.086795            0.027726  0.226500    2.609607  0.017101  1.180614
14  (Children)         (Animation)        0.063890            0.048986            0.027068  0.423671    8.648758  0.023939  1.650122
15  (Animation)        (Children)         0.048986            0.063890            0.027068  0.552573    8.648758  0.023939  2.092205
16  (Children)         (Comedy)           0.063890            0.363288            0.032877  0.514580    1.416453  0.009666  1.311672
17  (Comedy)           (Children)         0.363288            0.063890            0.032877  0.090498    1.416453  0.009666  1.029255
18  (Romance)          (Comedy)           0.169315            0.363288            0.090082  0.532039    1.464511  0.028572  1.360609
19  (Comedy)           (Romance)          0.363288            0.169315            0.090082  0.247964    1.464511  0.028572  1.104581
20  (Crime)            (Thriller)         0.120548            0.189479            0.057863  0.480000    2.533256  0.035022  1.558693
21  (Thriller)         (Crime)            0.189479            0.120548            0.057863  0.305379    2.533256  0.035022  1.266089
22  (Romance)          (Drama)            0.169315            0.478356            0.101260  0.598058    1.250236  0.020267  1.297810
23  (Drama)            (Romance)          0.478356            0.169315            0.101260  0.211684    1.250236  0.020267  1.053746
24  (War)              (Drama)            0.040219            0.478356            0.031014  0.771117    1.612015  0.011775  2.279087
25  (Drama)            (War)              0.478356            0.040219            0.031014  0.064834    1.612015  0.011775  1.026321
26  (Horror)           (Thriller)         0.096110            0.189479            0.043397  0.451539    2.383052  0.025186  1.477810
27  (Thriller)         (Horror)           0.189479            0.096110            0.043397  0.229034    2.383052  0.025186  1.172413
28  (Mystery)          (Thriller)         0.059507            0.189479            0.036055  0.605893    3.197672  0.024779  2.056601
29  (Thriller)         (Mystery)          0.189479            0.059507            0.036055  0.190283    3.197672  0.024779  1.161509
30  (Sci-Fi)           (Thriller)         0.086795            0.189479            0.028932  0.333333    1.759206  0.012486  1.215781
31  (Thriller)         (Sci-Fi)           0.189479            0.086795            0.028932  0.152689    1.759206  0.012486  1.077769
32  (Drama, Comedy)    (Romance)          0.104000            0.169315            0.035068  0.337197    1.991536  0.017460  1.253291
33  (Romance)          (Drama, Comedy)    0.169315            0.104000            0.035068  0.207120    1.991536  0.017460  1.130057
34  (Crime, Drama)     (Thriller)         0.067616            0.189479            0.032000  0.473258    2.497673  0.019188  1.538742
35  (Drama, Thriller)  (Crime)            0.087123            0.120548            0.032000  0.367296    3.046884  0.021497  1.389989
36  (Crime)            (Drama, Thriller)  0.120548            0.087123            0.032000  0.265455    3.046884  0.021497  1.242778
37  (Thriller)         (Crime, Drama)     0.189479            0.067616            0.032000  0.168884    2.497673  0.019188  1.121845
rules_movies[(rules_movies.lift > 4)].sort_values(by=['lift'], ascending=False)

    antecedents        consequents        antecedent support  consequent support  support   confidence  lift      leverage  conviction
14  (Children)         (Animation)        0.063890            0.048986            0.027068  0.423671    8.648758  0.023939  1.650122
15  (Animation)        (Children)         0.048986            0.063890            0.027068  0.552573    8.648758  0.023939  2.092205

Children and Animation are the two most strongly associated genres, which also matches common sense.
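
The lift and confidence in these two rows can be recomputed from the supports listed in the frequent-itemset table. The sketch below does this arithmetic; because the table values are rounded to six decimals, the results only agree with the reported numbers to about four decimal places.

```python
# Supports copied from the frequent-itemset table (rounded to six decimals).
s_children = 0.063890
s_animation = 0.048986
s_both = 0.027068          # support of (Children, Animation)

# lift is symmetric, confidence depends on the direction of the rule.
lift = s_both / (s_children * s_animation)
conf_children_to_animation = s_both / s_children
conf_animation_to_children = s_both / s_animation

print(round(lift, 3))
print(round(conf_children_to_animation, 3), round(conf_animation_to_children, 3))
```

The two rules share one lift value, while the confidence is higher in the Animation ==> Children direction because Animation is the rarer genre.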

# Movies tagged Children but not Animation.
movies[(movies.genres.str.contains('Children')) & (~movies.genres.str.contains('Animation'))]

      movieId  title                                               genres
1     2        Jumanji (1995)                                      Adventure|Children|Fantasy
7     8        Tom and Huck (1995)                                 Adventure|Children
26    27       Now and Then (1995)                                 Children|Drama
32    34       Babe (1995)                                         Children|Drama
36    38       It Takes Two (1995)                                 Children|Comedy
51    54       Big Green, The (1995)                               Children|Comedy
56    60       Indian in the Cupboard, The (1995)                  Adventure|Children|Fantasy
74    80       White Balloon, The (Badkonake sefid) (1995)         Children|Drama
81    87       Dunston Checks In (1996)                            Children|Comedy
98    107      Muppet Treasure Island (1996)                       Adventure|Children|Comedy|Musical
114   126      NeverEnding Story III, The (1994)                   Adventure|Children|Fantasy
125   146      Amazing Panda Adventure, The (1995)                 Adventure|Children
137   158      Casper (1995)                                       Adventure|Children
148   169      Free Willy 2: The Adventure Home (1995)             Adventure|Children|Drama
160   181      Mighty Morphin Power Rangers: The Movie (1995)      Action|Children
210   238      Far From Home: The Adventures of Yellow Dog (1995)  Adventure|Children
213   241      Fluke (1995)                                        Children|Drama
215   243      Gordy (1995)                                        Children|Comedy|Fantasy
222   250      Heavyweights (Heavy Weights) (1995)                 Children|Comedy
230   258      Kid in King Arthur's Court, A (1995)                Adventure|Children|Comedy|Fantasy|Romance
234   262      Little Princess, A (1995)                           Children|Drama
280   314      Secret of Roan Inish, The (1994)                    Children|Drama|Fantasy|Mystery
308   343      Baby-Sitters Club, The (1995)                       Children
320   355      Flintstones, The (1994)                             Children|Comedy|Fantasy
326   362      Jungle Book, The (1994)                             Adventure|Children|Romance
338   374      Richie Rich (1994)                                  Children|Comedy
361   410      Addams Family Values (1993)                         Children|Comedy|Fantasy
371   421      Black Beauty (1994)                                 Adventure|Children|Drama
404   455      Free Willy (1993)                                   Adventure|Children|Drama
431   484      Lassie (1994)                                       Adventure|Children
...   ...      ...                                                 ...
7707  83177    Yogi Bear (2010)                                    Children|Comedy
7735  84312    Home Alone 4 (2002)                                 Children|Comedy|Crime
7823  87383    Curly Top (1935)                                    Children|Musical|Romance
7900  89881    Superman and the Mole-Men (1951)                    Children|Mystery|Sci-Fi
7929  90866    Hugo (2011)                                         Children|Drama|Mystery
7935  91094    Muppets, The (2011)                                 Children|Comedy|Musical
7942  91286    Little Colonel, The (1935)                          Children|Comedy|Crime|Drama
7971  91886    Dolphin Tale (2011)                                 Children|Drama
8096  95740    Adventures of Mary-Kate and Ashley, The: The Case of the United States Navy Adventure (1997)  Children|Musical|Mystery
8199  98441    Rebecca of Sunnybrook Farm (1938)                   Children|Comedy|Drama|Musical
8200  98458    Baby Take a Bow (1934)                              Children|Comedy|Drama
8377  104074   Percy Jackson: Sea of Monsters (2013)               Adventure|Children|Fantasy
8450  106441   Book Thief, The (2013)                              Children|Drama|War
8558  110461   We Are the Best! (Vi är bäst!) (2013)               Children|Comedy|Drama
8592  111659   Maleficent (2014)                                   Action|Adventure|Children|IMAX
8689  115139   Challenge to Lassie (1949)                          Children|Drama
8761  118997   Into the Woods (2014)                               Children|Comedy|Fantasy|Musical
8765  119155   Night at the Museum: Secret of the Tomb (2014)      Adventure|Children|Comedy|Fantasy
8766  119655   Seventh Son (2014)                                  Adventure|Children|Fantasy
8792  122932   Elsa & Fred (2014)                                  Children|Comedy|Romance
8845  130073   Cinderella (2015)                                   Children|Drama|Fantasy|Romance
8850  130450   Pan (2015)                                          Adventure|Children|Fantasy
8871  132046   Tomorrowland (2015)                                 Action|Adventure|Children|Mystery|Sci-Fi
8916  135264   Zenon: Girl of the 21st Century (1999)              Adventure|Children|Comedy
8917  135266   Zenon: The Zequel (2001)                            Adventure|Children|Comedy|Sci-Fi
8918  135268   Zenon: Z3 (2004)                                    Adventure|Children|Comedy
8960  139620   Everything's Gonna Be Great (1998)                  Adventure|Children|Comedy|Drama
8967  140152   Dreamcatcher (2015)                                 Children|Crime|Documentary
8981  140747   16 Wishes (2010)                                    Children|Drama|Fantasy
9052  149354   Sisters (2015)                                      Children|Comedy

336 rows × 3 columns
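
The row count of this filter can be cross-checked against the supports reported earlier. This is a small arithmetic sketch: it converts the rounded supports back into movie counts, using the 9125 rows reported by `movies_ohe.shape`, and subtracts.

```python
N_MOVIES = 9125            # number of rows reported by movies_ohe.shape
s_children = 0.063890      # support of (Children)
s_both = 0.027068          # support of (Children, Animation)

# Supports are counts divided by N, so multiplying back and rounding recovers the counts.
n_children = round(s_children * N_MOVIES)
n_both = round(s_both * N_MOVIES)

# Children movies that are not also Animation.
print(n_children, n_both, n_children - n_both)
```

The difference matches the 336 rows returned by the filter, so the support table and the boolean mask are consistent with each other.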

Any concrete analysis still has to come back to the data itself, which requires a thorough understanding of the data.

