Approximate Frequent Pattern Mining
Philip S. Yu¹, Xifeng Yan¹, Jiawei Han²,
Hong Cheng², Feida Zhu²
¹IBM T.J. Watson Research Center ²University of Illinois at Urbana-Champaign
Frequent pattern mining has been studied for over a decade
Apriori (SIGMOD'93, VLDB'94, …), FPgrowth (SIGMOD'00), EClat, LCM, …
Extended to sequential pattern mining, graph mining, …
GSP, PrefixSpan, CloSpan, gSpan, …
Applications: Dozens of interesting applications explored
Association and correlation analysis
Classification (CBA, CMAR, …, discrim. feature analysis)
Clustering (e.g., micro-array analysis)
Indexing (e.g., g-Index)
First proposed by Agrawal et al. in 1993 [AIS93].
Itemset X = {x1, …, xk}. Given a minimum support s, …
sup(X) is the percentage of transactions that contain X.
If s = 40%, X = {A, B} is a frequent itemset.
[Table: Transaction-id vs. Items bought; seven transactions (ids 10 to 70) over items A, B, C, D; the itemset {A, B} is highlighted in three transactions.]
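The support definition can be sketched in a few lines of Python. The database `DB` below is a hypothetical seven-transaction example chosen so that {A, B} occurs in three transactions; it is an assumption for illustration, not a verified copy of the slide's table. The helper names `support` and `is_frequent` are also illustrative.

```python
# Hypothetical database (assumed for illustration): transaction id -> items bought.
DB = {
    10: {"A", "B"},
    20: {"A", "B", "C"},
    30: {"A"},
    40: {"B", "C", "D"},
    50: {"A", "C", "D"},
    60: {"C", "D"},
    70: {"A", "B", "C", "D"},
}


def support(itemset, db):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for items in db.values() if itemset <= items)
    return hits / len(db)


def is_frequent(itemset, db, min_sup):
    """True if the exact support of `itemset` reaches the threshold `min_sup`."""
    return support(itemset, db) >= min_sup


# {A, B} is contained in 3 of the 7 transactions (about 43%), so it passes s = 40%.
print(support({"A", "B"}, DB))
print(is_frequent({"A", "B"}, DB, 0.40))
```

In this assumed data, {A, B, C, D} is contained in only one transaction, so it is not frequent at s = 40% under exact matching.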
We can also use a binary matrix:
Row: Transactions
Column: Items
Entry: Presence/absence
[Matrix: rows are transactions 10 to 70, columns are items A, B, C, D.]
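As a minimal sketch of the matrix view, the snippet below turns a transaction database into a 0/1 matrix (one row per transaction, one column per item) and computes support from it. The data in `db` is hypothetical and the function name `support_from_matrix` is an illustrative choice.

```python
# Hypothetical data (assumed for illustration): transaction id -> items bought.
db = {
    10: {"A", "B"},
    20: {"A", "B", "C"},
    30: {"A"},
    40: {"B", "C", "D"},
    50: {"A", "C", "D"},
    60: {"C", "D"},
    70: {"A", "B", "C", "D"},
}
items = ["A", "B", "C", "D"]
tids = sorted(db)

# Entry is 1 if the item is present in the transaction, 0 otherwise.
matrix = [[1 if item in db[t] else 0 for item in items] for t in tids]


def support_from_matrix(matrix, items, itemset):
    """Fraction of rows that have a 1 in every column of `itemset`."""
    cols = [items.index(i) for i in itemset]
    rows = sum(1 for row in matrix if all(row[c] == 1 for c in cols))
    return rows / len(matrix)


for tid, row in zip(tids, matrix):
    print(tid, row)
```

Under exact matching, an itemset is supported only by rows that contain no 0 in its columns, which is what makes the exact model sensitive to noise.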
A noise-free data model
A noisy data model
… error. For example:
Promotions
Special events
Out-of-stock items or overstocked items
Measurement imprecision
… fragmented itemsets, but miss the true ones.
[Figure: two Items × Transactions matrices, each marking Itemset A and Itemset B.]
I.e., even under noise, the original pattern can still …
Summary pattern may not even appear in the …
Core Pattern Definition
An itemset x is a core pattern if its exact support in the …
If an approximate itemset is interesting, it is with …
Besides the core pattern constraint, we use the …
Let … and … For <ABCD>, its exact …
By allowing a fraction of …
For each item in <ABCD>, …
[Matrix: rows are transactions 10 to 70, columns are items A, B, C, D.]
Intuition
Discover approximate itemsets by allowing “holes” in the …
Constraints
Minimum support s: the percentage of transactions …
Row error rate: the percentage of 0s (item) allowed in …
Column error rate: the percentage of 0s allowed in …
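The three constraints can be illustrated with a small checker. This is a sketch under stated assumptions, not the authors' algorithm: it assumes the row error rate bounds the fraction of 0s in each supporting transaction (restricted to the itemset), and the column error rate bounds the fraction of 0s for each item (restricted to the supporting transactions). The names `satisfies_constraints`, `eps_r`, `eps_c` and the matrix `M` are hypothetical.

```python
def satisfies_constraints(matrix, rows, cols, min_sup, eps_r, eps_c):
    """Check a candidate (transaction set, itemset) pair against the constraints.

    matrix: list of 0/1 rows; rows: supporting row indices; cols: item indices.
    """
    if not rows or not cols:
        return False
    # Minimum support: share of all transactions covered by `rows`.
    if len(rows) / len(matrix) < min_sup:
        return False
    # Row error rate (assumed reading): 0s per supporting transaction.
    for r in rows:
        zeros = sum(1 for c in cols if matrix[r][c] == 0)
        if zeros / len(cols) > eps_r:
            return False
    # Column error rate (assumed reading): 0s per item over the supporting rows.
    for c in cols:
        zeros = sum(1 for r in rows if matrix[r][c] == 0)
        if zeros / len(rows) > eps_c:
            return False
    return True


# Hypothetical 7 x 4 matrix: only row 4 contains all four items exactly.
M = [
    [1, 1, 1, 0],
    [1, 1, 0, 1],
    [1, 0, 1, 1],
    [0, 1, 1, 1],
    [1, 1, 1, 1],
    [0, 0, 1, 0],
    [1, 0, 0, 0],
]
ABCD = [0, 1, 2, 3]

# Rows 0-4 each miss at most one of four items; each item is missed by one of five rows.
print(satisfies_constraints(M, [0, 1, 2, 3, 4], ABCD, 0.5, 0.25, 0.25))
# With no holes allowed, the same rows do not qualify.
print(satisfies_constraints(M, [0, 1, 2, 3, 4], ABCD, 0.5, 0.0, 0.0))
```

The checker only validates a given candidate. Finding the largest supporting transaction set for an itemset is the harder search problem.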
Mine core patterns using …
Build a lattice of the core patterns
Traverse the lattice to compute the approximate …
[Matrix: rows are transactions 10 to 70, columns are items A, B, C, D.]
[Figure: Microarray (genes × conditions) → Coexpression Network → Module; example genes: MCM3, MCM7, NASP, FEN1, SNRPG, CDC2, CCNB1, UNG; Transcriptional Annotation.]
~9000 genes; 105 × ~(9000 × 9000) = 8 billion edges
graph mining
Patterns discovered in multiple graphs are more reliable and significant
[Figure: Scale Down: many graphs are summarized into ONE graph; clustering finds dense subgraphs in the summary graph; these lead to frequent dense vertex sets.]
… around 50 to 100? Unfortunately, not!
The downward closure property
Any sub-pattern of a frequent pattern is frequent.
Example. If (a1, a2, …, a100) is frequent, then a1, a2, …, a100, (a1, a2), (a1, a3), …, (a1, a100), (a1, a2, a3), … are all frequent! There are about 2^100 such frequent itemsets!
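The size of this blow-up is easy to check: a pattern with n items has 2^n − 1 non-empty sub-patterns. The sketch below verifies the formula by enumeration on a small pattern and then evaluates it for n = 100. The function names are illustrative.

```python
from itertools import combinations


def count_nonempty_subsets(n):
    """Number of non-empty sub-patterns of an n-item pattern."""
    return 2 ** n - 1


def enumerate_subsets(items):
    """Explicitly list every non-empty sub-pattern (feasible only for small n)."""
    return [
        frozenset(combo)
        for k in range(1, len(items) + 1)
        for combo in combinations(items, k)
    ]


small = ["a1", "a2", "a3", "a4"]
# Enumeration and the closed form agree on a small case.
assert len(enumerate_subsets(small)) == count_nonempty_subsets(len(small))

# For 100 items the count is on the order of 10^30, far beyond enumeration.
print(count_nonempty_subsets(100))
```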
Whether we use breadth-first search (e.g., Apriori) or depth-first search (FPgrowth), we have to examine that many patterns
A frequent pattern is closed if and only if there exists no super-pattern that is both frequent and has the same support
A frequent pattern is maximal if and only if there exists no frequent super-pattern
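The two definitions can be compared on a toy example. The sketch below mines all frequent itemsets by brute force and then filters the closed and maximal ones. The four transactions and the threshold are hypothetical, and the brute-force miner is used only for illustration, not as a practical algorithm.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_count):
    """Brute force: map every itemset with count >= min_count to its count."""
    items = sorted(set().union(*transactions))
    freq = {}
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            s = frozenset(combo)
            count = sum(1 for t in transactions if s <= t)
            if count >= min_count:
                freq[s] = count
    return freq


def closed_and_maximal(freq):
    """Split frequent itemsets into closed and maximal ones."""
    closed, maximal = set(), set()
    for x, sup_x in freq.items():
        supers = [y for y in freq if x < y]  # frequent proper super-patterns
        if not any(freq[y] == sup_x for y in supers):
            closed.add(x)  # no frequent super-pattern with the same support
        if not supers:
            maximal.add(x)  # no frequent super-pattern at all
    return closed, maximal


# Hypothetical transactions, minimum support count 2.
T = [{"a", "b", "c"}, {"a", "b", "c"}, {"a", "b"}, {"a"}]
freq = frequent_itemsets(T, 2)
closed, maximal = closed_and_maximal(freq)
print(sorted("".join(sorted(x)) for x in closed))   # closed patterns
print(sorted("".join(sorted(x)) for x in maximal))  # maximal patterns
```

Here all seven non-empty subsets of {a, b, c} are frequent, but only {a}, {a, b} and {a, b, c} are closed, and only {a, b, c} is maximal. Every maximal pattern is also closed.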
… really solve it: We often need to mine scattered large patterns!
Micro-array analysis in bioinformatics (when support is low)
Biological sequence patterns
Biological/sociological/information graph pattern mining
If the mining of mid-sized patterns is explosive in size, …
What we may develop is a philosophy that may jump …
The key is to develop a mechanism that may quickly …
Most previous work focused on finding exact …
There exists a discrepancy between the exact model …
Noise, perturbation, etc.
Very long pattern mining can be another prohibiting …
Need to develop new methodologies to find …