Information Filtering, Information Systems M, Prof. Paolo Ciaccia



SLIDE 1

Information Filtering

Information Systems M

  • Prof. Paolo Ciaccia

http://www-db.deis.unibo.it/courses/SI-M/

  • The Information Filtering (IF) problem:
  • Deliver to users only the information that is relevant to them, filtering out all irrelevant new data items (news, papers, advertisements, …)
  • Although IF and IR share the common goal of providing users with relevant information, there are important differences:

                                      IR                                               IF
Goal                                  Selecting relevant items (docs) for each query   Filtering out the many irrelevant data items
Type of use                           Ad-hoc use                                       Repetitive use
Type of users                         One-time users                                   Long-term users
Representation of information needs   Queries                                          User profiles
Index                                 Items                                            User profiles

SLIDE 2

  • IF techniques find applications in a variety of scenarios, including:
  • Automatic delivery of news/alerts
  • Online display advertising
  • Publish/subscribe systems
  • Recommender systems are a specific type of IF system that will be discussed later on
  • Due to its similarity with IR, it is no surprise that the most common approaches to IF are based on the Boolean and the Vector Space models
  • However, a more detailed and structured description of the user profile is now needed, in order to improve the effectiveness of matching
  • In the sequel we will sketch the details of a recent approach based on the Boolean model; examples of use of the VSM will be given in the context of recommender systems

SLIDE 3

  • Reference: [WBS+09]

Scenario:

  • A (profiled) user visiting a web site (also called an “assignment”)
  • Many advertisement campaigns managed by the site
  • Both specified using Boolean expressions (BE’s) over a multi-attribute space

Alternatively (pub/sub system):

  • An incoming item
  • Many stored user profiles

One “assignment” to be efficiently matched against many stored BE’s

[Figure: an assignment is submitted to the BE index, which returns the matched BE’s]

  • Two types of Boolean predicates: ∈ and ∉
  • E.g.: state ∈ {CA,NY}, state ∉ {NY}
  • Ranges of values are converted into ∈ and ∉ predicates
  • age < 30 converted into age ∈ {0,1,2} (0 = [0,9], 1 = [10,19], …)
  • A BE is either in DNF or in CNF, e.g.:

(state ∈ {CA,NY} & age ∈ {1,2}) | (state ∉ {NY} & gender ∈ {F})

  • & = AND; | = OR
  • In the following we only discuss the DNF case
  • An assignment S is a set (conjunction) of attribute and value pairs
  • E.g.: S: state = CA & gender = F
  • An attribute-value pair is also called a key
  • E.g. (state,CA) is a key
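
The conversion of range predicates into ∈ predicates can be sketched as follows. This is a minimal illustration, assuming fixed-width buckets of 10 as in the age example above; the names `BUCKET_WIDTH`, `bucket`, and `less_than_to_in` are hypothetical and do not come from the slides or from [WBS+09].

```python
# Assumption: fixed-width buckets, 0 = [0,9], 1 = [10,19], ... as in the age example.
BUCKET_WIDTH = 10


def bucket(value):
    """Map a raw attribute value to its bucket id."""
    return value // BUCKET_WIDTH


def less_than_to_in(attr, bound):
    """Rewrite `attr < bound` as an ∈ predicate over bucket ids.

    Only bounds that fall on a bucket boundary can be represented exactly,
    so anything else is rejected in this sketch.
    """
    if bound % BUCKET_WIDTH != 0:
        raise ValueError("bound must be a multiple of the bucket width")
    return (attr, "in", frozenset(range(bound // BUCKET_WIDTH)))


# age < 30 becomes age ∈ {0,1,2}
predicate = less_than_to_in("age", 30)

# A user aged 25 contributes the key (age, 2) to the assignment.
key = ("age", bucket(25))
```

With this encoding both BE’s and assignments speak only of discrete values, so every predicate can be indexed by its keys.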
SLIDE 4

  • A BE E is satisfied by an assignment S if S makes E true
  • S: state = CA & gender = F
  • E1: state ∈ {CA,NY} is satisfied
  • E2: state ∈ {CA,NY} & gender ∈ {M} is not satisfied

  • Since an assignment need not specify a value for every attribute, the semantics of matching needs to be refined
  • Is (state ∈ {NY} & gender ∈ {F}) satisfied by gender = F? NO
  • Is (state ∉ {NY} & gender ∈ {F}) satisfied by gender = F? MAYBE…
  • Two alternative interpretations for ∉ predicates:
  • Strong-∉ predicate: violated if no value is specified for the attribute
  • Weak-∉ predicate: satisfied if no value is specified for the attribute
  • The default is weak-∉ predicates
  • The strong-∉ semantics can be enforced by writing, e.g., state ∉ {NY,NULL}, which requires a value for state to be present in the assignment
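
The weak and strong interpretations can be checked with a direct (index-free) evaluator. The sketch below is only illustrative: it assumes a conjunction is a list of `(attribute, op, values)` triples, an assignment is a dict, and NULL is modelled by Python's `None`; none of these representations come from the slides.

```python
NULL = None  # assumption: NULL stands for "no value specified in the assignment"


def satisfies(conjunction, assignment):
    """Return True if the assignment makes every predicate of the conjunction true.

    conjunction: list of (attr, op, values), with op either "in" or "not_in".
    assignment:  dict attr -> value; attributes that are absent are unspecified.
    """
    for attr, op, values in conjunction:
        value = assignment.get(attr, NULL)
        if op == "in":
            # An ∈ predicate needs a specified value that belongs to the set.
            if value is NULL or value not in values:
                return False
        else:
            # Weak-∉: an unspecified attribute passes, unless NULL itself is
            # listed in the set, which yields the strong-∉ behaviour.
            if value in values:
                return False
    return True


S = {"gender": "F"}
in_case = satisfies([("state", "in", {"NY"}), ("gender", "in", {"F"})], S)
weak_case = satisfies([("state", "not_in", {"NY"}), ("gender", "in", {"F"})], S)
strong_case = satisfies([("state", "not_in", {"NY", NULL}), ("gender", "in", {"F"})], S)
```

Here `in_case` is the "NO" case above, while `weak_case` and `strong_case` are the two readings of the "MAYBE" case.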


  • The basic idea is to build an inverted index on BE’s that, for each key, stores the BE’s containing it
  • The basic case is when BE’s are simple conjunctions of ∈ predicates

E1: A ∈ {1}
E2: A ∈ {1} & B ∈ {2} & C ∈ {3,4}

Inverted Index

Key     Posting list
(A,1)   E1, E2
(B,2)   E2
(C,3)   E2
(C,4)   E2

S: A = 1 & B = 2

The problem is that neither the intersection nor the union of the posting lists gives the correct answer, which is E1 only (E2 requires a value for C, which S does not specify):

  • Intersection: E2
  • Union: E1 and E2
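
The failure of both set operations can be reproduced with a few lines. This is a sketch under the assumption that each BE is stored as a dict from attribute to its set of admissible values; the variable names are illustrative only.

```python
from collections import defaultdict

# The two BE's of the example, as attribute -> admissible values (∈ predicates only).
bes = {
    "E1": {"A": {1}},
    "E2": {"A": {1}, "B": {2}, "C": {3, 4}},
}

# Plain inverted index: key (attribute, value) -> set of BE ids containing it.
index = defaultdict(set)
for be_id, predicates in bes.items():
    for attr, values in predicates.items():
        for v in values:
            index[(attr, v)].add(be_id)

S = {"A": 1, "B": 2}
posting_lists = [index[(attr, v)] for attr, v in S.items()]

intersection = set.intersection(*posting_lists)
union = set.union(*posting_lists)

# Ground truth by direct evaluation: every ∈ predicate needs a matching value in S.
truth = {
    be_id
    for be_id, predicates in bes.items()
    if all(S.get(attr) in values for attr, values in predicates.items())
}
```

The intersection misses E1 and wrongly returns E2, while the union returns E2 as a false positive; what is missing is knowledge of how many predicates each BE contains.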
SLIDE 5

  • Entries are partitioned based on the number of conjuncts K in each BE
  • The partition of the inverted index storing information on BE’s with K conjuncts is called the “K-index”
  • The “Z key” is used to handle the case K = 0 (notice that ∉ predicates do not contribute to the value of K)

Inverted Index

K   Key          Posting list
0   (state,CA)   (C6,∉)
    (state,NY)   (C6,∉)
    Z            (C6,∈)
1   (age,3)      (C5,∈)
    (age,4)      (C5,∈)
2   (state,NY)   (C1,∈)
    (age,3)      (C1,∈), (C2,∈), (C3,∈)
    (gender,F)   (C2,∈)
    (state,CA)   (C3,∉), (C4,∈)
    (gender,M)   (C3,∈), (C4,∈)

BE’s (conjunctions)

ID   BE                                        K
C1   age ∈ {3} & state ∈ {NY}                  2
C2   age ∈ {3} & gender ∈ {F}                  2
C3   age ∈ {3} & gender ∈ {M} & state ∉ {CA}   2
C4   state ∈ {CA} & gender ∈ {M}               2
C5   age ∈ {3,4}                               1
C6   state ∉ {CA,NY}                           0
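
Building the partitioned index from the six conjunctions above can be sketched as follows. The triple representation of predicates and the name `build_k_index` are assumptions made for illustration; the sketch only mirrors the table, not the storage layout of any actual implementation.

```python
from collections import defaultdict

Z = "Z"  # special key for conjunctions with no ∈ predicate (K = 0)

# Conjunction id -> list of (attribute, op, values), op in {"in", "not_in"}.
conjunctions = {
    "C1": [("age", "in", {3}), ("state", "in", {"NY"})],
    "C2": [("age", "in", {3}), ("gender", "in", {"F"})],
    "C3": [("age", "in", {3}), ("gender", "in", {"M"}), ("state", "not_in", {"CA"})],
    "C4": [("state", "in", {"CA"}), ("gender", "in", {"M"})],
    "C5": [("age", "in", {3, 4})],
    "C6": [("state", "not_in", {"CA", "NY"})],
}


def build_k_index(conjunctions):
    """Return {K: {key: [(conjunction id, op), ...]}}."""
    k_index = defaultdict(lambda: defaultdict(list))
    for cid, predicates in conjunctions.items():
        # K counts only the ∈ predicates of the conjunction.
        k = sum(1 for _, op, _ in predicates if op == "in")
        for attr, op, values in predicates:
            for v in sorted(values):
                k_index[k][(attr, v)].append((cid, op))
        if k == 0:
            # The Z key lets a K = 0 conjunction still appear with an ∈ entry.
            k_index[0][Z].append((cid, "in"))
    return k_index


k_index = build_k_index(conjunctions)
```

Each posting entry records whether the conjunction reaches the key through an ∈ or a ∉ predicate, which is what the matching conditions rely on.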


  • Given an assignment S with t keys, two basic conditions are used to check if a conjunction C matches S:
  • 1. For a K-index with K ≤ t, a conjunction C matches S only if there are K posting lists such that each list refers to a key (A,v) in S and (C,∈) is in the posting list
  • 2. There is no key (A,v) in S whose posting list contains (C,∉)
  • Example (C1 and C2 here are standalone conjunctions, not those of the table above):
  • C1: (age ∈ {3} & gender ∈ {M}) matches S: age ∈ {3} & gender ∈ {M} & state ∈ {CA}
  • C2: (age ∈ {3} & gender ∈ {M} & state ∉ {CA}) does not match S, since the posting list of the key (state,CA) includes the entry (C2,∉)
  • The Conjunction algorithm iterates through the K-indexes, checking that the above conditions are satisfied
  • Further, it does not consider at all the K-indexes with K > t
SLIDE 6

S: age = 3 & state = CA & gender = M

  • First, all the relevant posting lists are obtained (one K-index at a time)
  • For K=2 it is recognized that neither C1 nor C2 can be satisfied by S
  • Although C3 satisfies condition 1, it violates condition 2
  • C4 satisfies both conditions
  • The same holds for C5 (K=1)
  • C6 violates condition 2

Result: {C4,C5}

Inverted Index (posting lists relevant to S)

K   Key          Posting list
0   (state,CA)   (C6,∉)
    Z            (C6,∈)
1   (age,3)      (C5,∈)
2   (age,3)      (C1,∈), (C2,∈), (C3,∈)
    (state,CA)   (C3,∉), (C4,∈)
    (gender,M)   (C3,∈), (C4,∈)
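
The two matching conditions can be tried out on this example with a simplified, counting-based matcher. This is not the Conjunction algorithm of [WBS+09] itself, which works on sorted posting lists; it is only a sketch that applies the same two conditions, with a hypothetical predicate representation and hypothetical function names.

```python
from collections import Counter, defaultdict

Z = "Z"  # special key for conjunctions with K = 0

conjunctions = {
    "C1": [("age", "in", {3}), ("state", "in", {"NY"})],
    "C2": [("age", "in", {3}), ("gender", "in", {"F"})],
    "C3": [("age", "in", {3}), ("gender", "in", {"M"}), ("state", "not_in", {"CA"})],
    "C4": [("state", "in", {"CA"}), ("gender", "in", {"M"})],
    "C5": [("age", "in", {3, 4})],
    "C6": [("state", "not_in", {"CA", "NY"})],
}


def build_k_index(conjunctions):
    """Return {K: {key: [(conjunction id, op), ...]}}; K counts ∈ predicates only."""
    k_index = defaultdict(lambda: defaultdict(list))
    for cid, predicates in conjunctions.items():
        k = sum(1 for _, op, _ in predicates if op == "in")
        for attr, op, values in predicates:
            for v in sorted(values):
                k_index[k][(attr, v)].append((cid, op))
        if k == 0:
            k_index[0][Z].append((cid, "in"))
    return k_index


def match(k_index, assignment):
    """Return the ids of the conjunctions satisfied by the assignment."""
    keys = list(assignment.items())
    t = len(keys)
    matched = set()
    for k, partition in k_index.items():
        if k > t:
            continue  # a conjunction with more ∈ predicates than keys cannot match
        in_hits = Counter()  # condition 1: number of keys of S hitting (C, ∈)
        excluded = set()     # condition 2: conjunctions hit through (C, ∉)
        lookup = keys + [Z] if k == 0 else keys
        for key in lookup:
            for cid, op in partition.get(key, []):
                if op == "in":
                    in_hits[cid] += 1
                else:
                    excluded.add(cid)
        needed = max(k, 1)  # for K = 0 the Z entry supplies the single ∈ hit
        matched |= {c for c, n in in_hits.items() if n >= needed and c not in excluded}
    return matched


S = {"age": 3, "state": "CA", "gender": "M"}
result = match(build_k_index(conjunctions), S)
```

On the assignment of this slide the sketch returns C4 and C5: C3 reaches two ∈ hits but is excluded through (state,CA), and C6 is excluded for the same reason.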


  • To process BE’s in DNF it is sufficient to observe that a BE E is satisfied by an assignment S iff at least one of its conjunctions of predicates is satisfied by S
  • Example: (state ∈ {CA} & gender ∈ {M}) | (state ∈ {NY} & gender ∈ {F}) is satisfied by S: age = 3 & state = CA & gender = M
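
The DNF rule reduces to an "any" over the conjunctions of the BE. The sketch below evaluates the example directly, without an index, assuming a DNF expression is a list of conjunctions and each conjunction a list of `(attribute, op, values)` triples; these representations and function names are illustrative only.

```python
def satisfies_conjunction(conjunction, assignment):
    """True if every predicate holds; ∉ predicates use the weak semantics."""
    for attr, op, values in conjunction:
        value = assignment.get(attr)  # None when the attribute is unspecified
        if op == "in" and (value is None or value not in values):
            return False
        if op == "not_in" and value in values:
            return False
    return True


def satisfies_dnf(dnf, assignment):
    """A DNF expression is satisfied iff at least one of its conjunctions is."""
    return any(satisfies_conjunction(c, assignment) for c in dnf)


# (state ∈ {CA} & gender ∈ {M}) | (state ∈ {NY} & gender ∈ {F})
E = [
    [("state", "in", {"CA"}), ("gender", "in", {"M"})],
    [("state", "in", {"NY"}), ("gender", "in", {"F"})],
]
S = {"age": 3, "state": "CA", "gender": "M"}
is_match = satisfies_dnf(E, S)
```

In an indexed setting the same rule means that a BE is reported as soon as any one of its conjunctions is returned by the conjunction matcher.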
