Approximate Frequent Pattern Mining

Approximate Frequent Pattern Mining. Philip S. Yu (1), Xifeng Yan (1), Jiawei Han (2), Hong Cheng (2), Feida Zhu (2). (1) IBM T. J. Watson Research Center; (2) University of Illinois at Urbana-Champaign


slide-1
SLIDE 1

Approximate Frequent Pattern Mining

Philip S. Yu (1), Xifeng Yan (1), Jiawei Han (2), Hong Cheng (2), Feida Zhu (2)

(1) IBM T. J. Watson Research Center
(2) University of Illinois at Urbana-Champaign

slide-2
SLIDE 2

Frequent Pattern Mining

Frequent pattern mining has been studied for over a decade, with tons of algorithms developed

  • Apriori (SIGMOD'93, VLDB'94, ...), FPgrowth (SIGMOD'00), Eclat, LCM, ...

Extended to sequential pattern mining, graph mining, ...

  • GSP, PrefixSpan, CloSpan, gSpan, ...

Applications: dozens of interesting applications explored

  • Association and correlation analysis
  • Classification (CBA, CMAR, ..., discrim. feature analysis)
  • Clustering (e.g., micro-array analysis)
  • Indexing (e.g., g-Index)

slide-3
SLIDE 3

The Problem of Frequent Itemset Mining

First proposed by Agrawal et al. in 1993 [AIS93].

  • Itemset X = {x1, …, xk}
  • Given a minimum support s, discover all itemsets X such that sup(X) >= s
  • sup(X) is the percentage of transactions containing X
  • If s = 40%, X = {A, B} is a frequent itemset, since sup(X) = 3/7 > 40%

Table 1. A sample transaction database D

Transaction-id   Items bought
10               A, B, C
20               A
30               A, B, C, D
40               C, D
50               A, B
60               A, C, D
70               B, C, D
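The support definition above can be checked with a few lines of Python. This is only a minimal sketch: it assumes the sample database D of Table 1 reads as listed in the code, and the function names `support` and `is_frequent` are illustrative, not taken from the slides.

```python
from fractions import Fraction

# Sample database D, as read from Table 1 (transaction id -> items bought).
D = {
    10: {"A", "B", "C"},
    20: {"A"},
    30: {"A", "B", "C", "D"},
    40: {"C", "D"},
    50: {"A", "B"},
    60: {"A", "C", "D"},
    70: {"B", "C", "D"},
}

def support(itemset, db):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for items in db.values() if itemset <= items)
    return Fraction(hits, len(db))

def is_frequent(itemset, db, min_sup):
    """True if sup(itemset) >= min_sup."""
    return support(itemset, db) >= min_sup

print(support({"A", "B"}, D))                      # 3/7
print(is_frequent({"A", "B"}, D, Fraction(2, 5)))  # True, since 3/7 > 40%
```

Exact fractions are used instead of floats so that the comparison against the threshold is not affected by rounding.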

slide-4
SLIDE 4

A Binary Matrix Representation

We can also use a binary matrix to represent a transaction database.

  • Row: transactions
  • Column: items
  • Entry: presence/absence of an item in a transaction

Table 2. Binary representation of D

      A  B  C  D
10    1  1  1  0
20    1  0  0  0
30    1  1  1  1
40    0  0  1  1
50    1  1  0  0
60    1  0  1  1
70    0  1  1  1
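As a rough illustration of the row/column/entry convention, the sketch below turns a transaction database into a 0/1 matrix. The database contents are an assumption (the reading of the sample database D used in this transcript), and `to_binary_matrix` is a hypothetical helper name.

```python
ITEMS = ["A", "B", "C", "D"]

# Sample database D (transaction id -> items bought); assumed reading of Table 1.
D = {
    10: {"A", "B", "C"},
    20: {"A"},
    30: {"A", "B", "C", "D"},
    40: {"C", "D"},
    50: {"A", "B"},
    60: {"A", "C", "D"},
    70: {"B", "C", "D"},
}

def to_binary_matrix(db, items):
    """Return (sorted transaction ids, matrix) with one row per transaction
    and one column per item; an entry is 1 if the item is present."""
    tids = sorted(db)
    matrix = [[1 if item in db[tid] else 0 for item in items] for tid in tids]
    return tids, matrix

tids, M = to_binary_matrix(D, ITEMS)
print("   ", " ".join(ITEMS))
for tid, row in zip(tids, M):
    print(tid, " ", " ".join(str(v) for v in row))
```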

slide-5
SLIDE 5

A Noisy Data Model

A noise-free data model

  • Assumption made by all the above algorithms

A noisy data model

  • Real-world data is subject to random noise and measurement error. For example:
      • Promotions
      • Special events
      • Out-of-stock items or overstocked items
      • Measurement imprecision
  • The true frequent itemsets could be distorted by such noise.
  • The exact itemset mining algorithms will discover multiple fragmented itemsets, but miss the true ones.

slide-6
SLIDE 6

Itemsets With and Without Noise

[Figure 1(a). Itemset without noise. Figure 1(b). Itemset with noise. Both panels plot transactions against items and mark Itemset A and Itemset B.]

Exact mining algorithms get fragmented itemsets!

slide-7
SLIDE 7

Alternative Models

Existence of core patterns

  • I.e., even under noise, the original pattern can still appear with high probability

Only summary patterns can be derived

  • A summary pattern may not even appear in the database

slide-8
SLIDE 8

The Core Pattern Approach

Core pattern definition

  • An itemset x is a core pattern if its exact support in the noisy database satisfies sup_e(x) ≥ α · s, where α ≤ 1
  • If an approximate itemset is interesting, then with high probability it is a core pattern in the noisy database. Therefore, we could discover the approximate itemsets from only the core patterns.
  • Besides the core pattern constraint, we use the constraints of minimum support s, row error rate ε_r, and column error rate ε_c, as in [LPS+06].
slide-9
SLIDE 9

Approximate Itemset Example

  • Let ε_r = [value missing] and ε_c = [value missing]
  • For <ABCD>, its exact support = 1;
  • By allowing a fraction ε_r of noise in a row, transactions 10, 30, 60, 70 all approximately support <ABCD>;
  • For each item in <ABCD>, in the transaction set {10, 30, 60, 70}, a fraction ε_c of 0s is allowed.

[Figure: binary matrix of D, with transactions 10 to 70 as rows and items A to D as columns, as in Table 2]
slide-10
SLIDE 10

The Approximate Frequent Itemset Mining Approach

Intuition

  • Discover approximate itemsets by allowing "holes" in the matrix representation.

Constraints

  • Minimum support s: the percentage of transactions containing an itemset
  • Row error rate ε_r: the percentage of 0s (items) allowed in each transaction
  • Column error rate ε_c: the percentage of 0s allowed in the transaction set for each item
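
The row and column constraints can be made concrete with a small check. The sketch below is only an illustration under stated assumptions: the database is the assumed reading of the sample database D, the error rates 0.25 are chosen for the example (this transcript does not preserve the values used on the slides), and `satisfies_error_rates` is a hypothetical name.

```python
# Sample database D (transaction id -> items); assumed reading of Table 1.
D = {
    10: {"A", "B", "C"},
    20: {"A"},
    30: {"A", "B", "C", "D"},
    40: {"C", "D"},
    50: {"A", "B"},
    60: {"A", "C", "D"},
    70: {"B", "C", "D"},
}

def satisfies_error_rates(db, tids, itemset, eps_r, eps_c):
    """Check whether the transactions `tids` approximately support `itemset`.

    Row constraint: each transaction may miss at most eps_r * |itemset| items.
    Column constraint: each item may be missing from at most eps_c * |tids|
    of the transactions.
    """
    items = sorted(itemset)
    tids = list(tids)
    for tid in tids:
        missing_in_row = sum(1 for item in items if item not in db[tid])
        if missing_in_row > eps_r * len(items):
            return False
    for item in items:
        missing_in_col = sum(1 for tid in tids if item not in db[tid])
        if missing_in_col > eps_c * len(tids):
            return False
    return True

# eps_r = eps_c = 0.25 are assumed values for illustration only.
print(satisfies_error_rates(D, [10, 30, 60, 70], "ABCD", 0.25, 0.25))      # True
print(satisfies_error_rates(D, [10, 30, 40, 60, 70], "ABCD", 0.25, 0.25))  # False
```

With these assumed rates, transactions 10, 30, 60 and 70 each miss at most one of the four items, and each item is missing from at most one of the four transactions, so the set approximately supports <ABCD>. Transaction 40 misses two items and breaks the row constraint.
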
slide-11
SLIDE 11

Algorithm Outline

  • Mine core patterns using the minimum support threshold α · s, where α ≤ 1
  • Build a lattice of the core patterns
  • Traverse the lattice to compute the approximate itemsets
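
The first two steps of the outline can be sketched as follows. This is a brute-force illustration, not the authors' algorithm: it enumerates every itemset instead of using an efficient miner, treats s as a transaction count, and uses an assumed α = 1/3 (the slides' value is not preserved in this transcript). The names `mine_core_patterns` and `build_lattice` are hypothetical.

```python
from fractions import Fraction
from itertools import combinations

# Sample database D (transaction id -> items); assumed reading of Table 1.
D = {
    10: {"A", "B", "C"},
    20: {"A"},
    30: {"A", "B", "C", "D"},
    40: {"C", "D"},
    50: {"A", "B"},
    60: {"A", "C", "D"},
    70: {"B", "C", "D"},
}

def exact_support(itemset, db):
    """Number of transactions containing every item of `itemset`."""
    return sum(1 for items in db.values() if set(itemset) <= items)

def mine_core_patterns(db, s, alpha):
    """Brute force: every itemset whose exact support count is >= alpha * s."""
    universe = sorted(set().union(*db.values()))
    core = {}
    for k in range(1, len(universe) + 1):
        for combo in combinations(universe, k):
            sup = exact_support(combo, db)
            if sup >= alpha * s:
                core[frozenset(combo)] = sup
    return core

def build_lattice(core):
    """Link each core pattern to the core patterns that add exactly one item."""
    return {p: [q for q in core if len(q) == len(p) + 1 and p < q] for p in core}

core = mine_core_patterns(D, s=3, alpha=Fraction(1, 3))  # alpha is assumed
lattice = build_lattice(core)
print(len(core))                      # 15
print(core[frozenset("ABCD")])        # 1
print(len(lattice[frozenset("A")]))   # 3 (AB, AC, AD)
```

Traversing the lattice to compute approximate itemsets (the third step) is not shown; the slides do not give enough detail to sketch it faithfully.
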

slide-12
SLIDE 12

A Running Example

Let the database be D, ε_r = [value missing], ε_c = [value missing], s = 3, and α = [value missing].

[Figure: database D as a binary matrix (transactions 10 to 70 as rows, items A to D as columns), shown next to the lattice of core patterns]

slide-13
SLIDE 13

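Microarray → Co-Expression Network

[Figure: a microarray (genes by conditions) is turned into a co-expression network over genes such as MCM3, MCM7, NASP, FEN1, SNRPG, CDC2, CCNB1, UNG]

Two issues:

  • noise edges
  • large scale

Microarray → Coexpression Network → Module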

slide-14
SLIDE 14

Mining Poor Quality Data

  • ~9000 genes
  • 105 × ~(9000 × 9000) ≈ 8 billion edges
  • Transform into graph mining
  • Patterns discovered in multiple graphs are more reliable and significant
  • Dense vertex set
  • Transcriptional annotation

slide-15
SLIDE 15

Summary Graph: Concept

  • M networks → ONE graph (overlap)
  • Clustering
  • Scale down
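
One plausible reading of "M networks → ONE graph" is to overlay the networks and keep the edges that recur often enough. The sketch below illustrates that idea only; the threshold `min_count`, the function name `summary_graph`, and the toy edges between genes are all assumptions made for the example, not details given in the slides.

```python
from collections import Counter

def summary_graph(graphs, min_count):
    """Overlay several undirected graphs (each a collection of (u, v) edges)
    and keep the edges that occur in at least `min_count` of them."""
    counts = Counter()
    for edges in graphs:
        # A set per graph, so an edge is counted once per graph
        # regardless of direction or repetition.
        counts.update({frozenset(edge) for edge in edges})
    return {edge for edge, c in counts.items() if c >= min_count}

# Toy co-expression networks; the edges are made up for illustration.
g1 = [("MCM3", "MCM7"), ("MCM7", "FEN1"), ("CDC2", "CCNB1")]
g2 = [("MCM7", "MCM3"), ("CDC2", "CCNB1"), ("NASP", "UNG")]
g3 = [("MCM3", "MCM7"), ("FEN1", "MCM7"), ("SNRPG", "UNG")]

summary = summary_graph([g1, g2, g3], min_count=2)
for edge in sorted(tuple(sorted(e)) for e in summary):
    print(edge)
```

Edges that appear in a single network only, such as the made-up NASP/UNG edge, are dropped, which is how the overlay scales the M networks down to one smaller graph.
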

slide-16
SLIDE 16

Summary Graph: Noise Edges

  • Dense subgraphs are accidentally formed by noise edges
  • They are false frequent dense vertex sets
  • Noise edges will also interfere with true modules

Are dense subgraphs in the summary graph frequent dense vertex sets?

slide-17
SLIDE 17

Unsupervised Partition: Find a Subset

  • clustering

(1) (2) identify (3) group mining together seed

slide-18
SLIDE 18

Frequent Approximate Substring

ATCCGCACAGGTCAGTAGCA

slide-19
SLIDE 19

Limitation on Mining Frequent Patterns: Mine Very Small Patterns!

  • Can we mine large (i.e., colossal) patterns, such as just size around 50 to 100? Unfortunately, not!
  • Why not? The curse of downward closure of frequent patterns
      • The downward closure property: any sub-pattern of a frequent pattern is frequent.
      • Example. If (a1, a2, ..., a100) is frequent, then a1, a2, ..., a100, (a1, a2), (a1, a3), ..., (a1, a100), (a1, a2, a3), ... are all frequent! There are about 2^100 such frequent itemsets!
      • Whether we use breadth-first search (e.g., Apriori) or depth-first search (FPgrowth), we have to examine that many patterns
  • Thus the downward closure property leads to explosion!
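
The size of the explosion follows from counting subsets: a pattern with n distinct items has 2^n - 1 non-empty sub-patterns. The short sketch below, with illustrative function names, enumerates them for a small pattern and evaluates the count for n = 100.

```python
from itertools import combinations

def count_subpatterns(n):
    """Number of non-empty sub-patterns of a pattern with n distinct items."""
    return 2 ** n - 1

def subpatterns(items):
    """Yield every non-empty sub-pattern of `items`, smallest first."""
    items = list(items)
    for k in range(1, len(items) + 1):
        yield from combinations(items, k)

small = ["a1", "a2", "a3", "a4"]
print(len(list(subpatterns(small))))      # 15, which equals 2**4 - 1
print(f"{count_subpatterns(100):.3e}")    # 1.268e+30
```

Enumeration is feasible for four items but hopeless for one hundred, which is the point of the slide: any miner that must visit every frequent sub-pattern cannot reach a pattern of that size.
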
slide-20
SLIDE 20

Do We Need Mining Colossal Patterns?

  • From frequent patterns to closed patterns and maximal patterns
      • A frequent pattern is closed if and only if there exists no super-pattern that is both frequent and has the same support
      • A frequent pattern is maximal if and only if there exists no frequent super-pattern
  • Closed/maximal patterns may partially alleviate the problem but do not really solve it: we often need to mine scattered large patterns!
  • Many real-world mining tasks need to mine colossal patterns
      • Micro-array analysis in bioinformatics (when support is low)
      • Biological sequence patterns
      • Biological/sociological/information graph pattern mining
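
The closed and maximal definitions can be tried out on a small database. The sketch below is a brute-force illustration: it assumes the sample database D reads as listed in the code, uses an arbitrary minimum support count of 2, and the function names are hypothetical.

```python
from itertools import combinations

# Sample database D (transaction id -> items); assumed reading of Table 1.
D = {
    10: {"A", "B", "C"},
    20: {"A"},
    30: {"A", "B", "C", "D"},
    40: {"C", "D"},
    50: {"A", "B"},
    60: {"A", "C", "D"},
    70: {"B", "C", "D"},
}

def frequent_itemsets(db, min_count):
    """Brute force: map each itemset with support count >= min_count to its count."""
    universe = sorted(set().union(*db.values()))
    result = {}
    for k in range(1, len(universe) + 1):
        for combo in combinations(universe, k):
            count = sum(1 for items in db.values() if set(combo) <= items)
            if count >= min_count:
                result[frozenset(combo)] = count
    return result

def closed_patterns(freq):
    """Frequent patterns with no frequent super-pattern of the same support."""
    return {p for p, c in freq.items()
            if not any(p < q and freq[q] == c for q in freq)}

def maximal_patterns(freq):
    """Frequent patterns with no frequent super-pattern at all."""
    return {p for p in freq if not any(p < q for q in freq)}

freq = frequent_itemsets(D, min_count=2)
print(len(freq), len(closed_patterns(freq)), len(maximal_patterns(freq)))  # 13 10 3
print(sorted("".join(sorted(p)) for p in maximal_patterns(freq)))  # ['ABC', 'ACD', 'BCD']
```

On this toy database the 13 frequent itemsets compress to 10 closed and 3 maximal patterns. The compression helps, but every maximal pattern still has to be reached through its sub-patterns by a complete miner.
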

slide-21
SLIDE 21

Colossal Pattern Mining Philosophy

  • If the mining of mid-sized patterns is explosive in size, there is no hope of finding colossal patterns efficiently by insisting on the "complete set" mining philosophy
  • What we may develop is a philosophy that jumps out of the swamp of mid-sized results that are explosive in size and jumps to reach colossal patterns
  • The key is to develop a mechanism that may quickly reach colossal patterns and discover most of them

slide-22
SLIDE 22

Conclusions

  • Most previous work focused on finding exact frequent patterns
  • There exists a discrepancy between the exact model and some real-world phenomena due to
      • Noise, perturbation, etc.
  • Very long pattern mining can be another prohibiting problem
  • Need to develop new methodologies to find approximate frequent patterns