Web User Profiling using Data Redundancy




SLIDE 1

Web User Profiling using Data Redundancy

http://aminer.org/profiling

Xiaotao Gu, Hong Yang, Jie Tang, Jing Zhang Tsinghua University

SLIDE 2

Web User Profiling using Data Redundancy

  • Introduction
  • Traditional Way
  • Basic Idea
  • MagicFG
  • Experiments
  • Conclusion

SLIDE 3

Profile attributes: Address, Phone & Fax, Email, Homepage, Affiliation, Position

  • Expert Finding
  • Recommendation
  • Getting in Touch

SLIDE 4

Web User Profiling using Data Redundancy

  • Introduction
  • Traditional Way
  • Basic Idea
  • MagicFG
  • Experiments
  • Conclusion

SLIDE 5

Traditional Way: Two-Step

  • Source Finding
  • Extraction

SLIDE 6

Traditional Way: Two-Step

  • Source Finding
  • Extraction

SVM, CRF, LR

SLIDE 7

Traditional Way: Two-Step

  • Low Recall – single data source
  • Low Precision – error propagation

SLIDE 8

Traditional Way: Two-Step

  • Low Recall – single data source
  • Low Precision – error propagation

Homepage Finding (90%) × Profile Extraction (90%) = Result (81%)
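
The error-propagation arithmetic can be checked with a short sketch. It assumes, as a simplification, that stage errors are independent and that a mistake at any stage makes the final result wrong; the function name `pipeline_accuracy` is illustrative.

```python
from math import prod

def pipeline_accuracy(stage_accuracies):
    """End-to-end accuracy of a pipeline where every stage must succeed.

    Assumption (illustrative): stage errors are independent, so the
    per-stage accuracies simply multiply.
    """
    return prod(stage_accuracies)

# Two-step method: homepage finding, then profile extraction.
two_step = pipeline_accuracy([0.90, 0.90])
print(f"two-step accuracy: {two_step:.0%}")
```

Under the same assumption, every additional stage lowers the ceiling further, which is the motivation for doing everything in one step.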

SLIDE 9

Web User Profiling using Data Redundancy

  • Introduction
  • Traditional Way
  • Basic Idea
  • MagicFG
  • Experiments
  • Conclusion

SLIDE 10

Basic Idea

  • A Uniform Framework

  ✓ All in one step, avoiding error propagation
  ✓ Incorporate information from different data sources: Homepage, Google Scholar, Twitter, LinkedIn, Facebook, etc.

SLIDE 11

Basic Idea

  • A Uniform Framework
  • Search Engine as the data source

SLIDE 12

Basic Idea

  • Search Engine as Data Source

SLIDE 13

Basic Idea

  • Search Engine as Data Source

Why snippets?

  ✓ Efficient: Unlike traditional methods that crawl each relevant page, using snippets is much faster and more stable, because the servers hosting the relevant pages may have very different network speeds.
  ✓ Effective: We found that, with the constructed “smart” queries, more than 90% of the profile attributes are already contained in the snippets returned by the search engine.
  ✓ Economical: There is no need to maintain a large database recording all the relevant pages for every query person. This matters because AMiner, for example, has more than 130,000,000 researchers, and maintaining such a database for all of them is itself a challenging task.
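
As a minimal sketch of how candidate values might be pulled from snippets, the code below collects e-mail-like strings with a simplified regular expression. The snippets are hypothetical text written for this example, and `EMAIL_RE` and `candidate_emails` are illustrative names, not part of the presented system.

```python
import re

# Simplified pattern for e-mail-like strings; real addresses can be more varied.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def candidate_emails(snippets):
    """Collect e-mail candidates from snippets.

    Duplicates are kept on purpose: repeated occurrences across snippets
    are the redundancy that later steps can exploit.
    """
    found = []
    for text in snippets:
        found.extend(match.lower() for match in EMAIL_RE.findall(text))
    return found

# Hypothetical search-result snippets (not real search output).
snippets = [
    "... Email: psyu@cs.uic.edu ...",
    "Contact: psyu@uic.edu, office hours ...",
    "... (psyu@cs.uic.edu) ...",
]
emails = candidate_emails(snippets)
print(emails)
```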

SLIDE 14

Basic Idea

  • A Uniform Framework
  • Search Engine as the Data Source
  • Smart Query Construction

Profile Attributes

  • Categorical: Gender, Position, Country…
  • Non-Categorical: Email, Affiliation, Address…

SLIDE 15

Query Construction

Non-Categorical: Person_Name + Attribute_Name

Query = “Phillip S. Yu email”

SLIDE 16

Query Construction

Categorical: Person_Name + Representative Words

Query = “Phillip S. Yu his OR her”

SLIDE 17

Representative Words

  • Male: “his”, “he”, …
  • Female: “her”, “she”, …

Query = “Phillip S. Yu his OR her”
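
The two query patterns can be sketched as a single helper. The `REPRESENTATIVE_WORDS` table and the `build_query` function are illustrative names; only the "email" attribute and the gender words "his" and "her" come from the slides.

```python
# Representative words per categorical attribute (only gender is shown in the slides).
REPRESENTATIVE_WORDS = {"gender": ["his", "her"]}

def build_query(person_name, attribute):
    """Return a search-engine query string for one profile attribute."""
    words = REPRESENTATIVE_WORDS.get(attribute)
    if words:
        # Categorical attribute: person name + representative words joined by OR.
        return f"{person_name} {' OR '.join(words)}"
    # Non-categorical attribute: person name + attribute name.
    return f"{person_name} {attribute}"

print(build_query("Phillip S. Yu", "email"))   # Phillip S. Yu email
print(build_query("Phillip S. Yu", "gender"))  # Phillip S. Yu his OR her
```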

SLIDE 18

Basic Idea

  • A Uniform Framework
  • Search Engine as the Data Source
  • Smart Query Construction
  • Basic Classification

SLIDE 19

Feature Definition

Email

  • First name in prefix
  • Last name in prefix
  • Initials in prefix

Gender

  • How many “his”
  • How many “her”
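
A minimal sketch of these features follows. The function names and the exact matching rules (case-insensitive substring tests on the part before "@", whole-word counts for "his" and "her") are assumptions made for illustration.

```python
import re

def email_features(email, full_name):
    """Binary features on the e-mail prefix (the part before '@')."""
    prefix = email.split("@", 1)[0].lower()
    tokens = re.findall(r"[a-z]+", full_name.lower())
    first, last = tokens[0], tokens[-1]
    initials = "".join(token[0] for token in tokens)
    return {
        "first_name_in_prefix": first in prefix,
        "last_name_in_prefix": last in prefix,
        "initials_in_prefix": initials in prefix,
    }

def gender_features(snippets):
    """Count whole-word occurrences of 'his' and 'her' across all snippets."""
    words = re.findall(r"[a-z]+", " ".join(snippets).lower())
    return {"his_count": words.count("his"), "her_count": words.count("her")}

print(email_features("psyu@cs.uic.edu", "Phillip S. Yu"))
# Hypothetical snippet text, written for this example only.
print(gender_features(["He presented his work.", "His students thanked her."]))
```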

SLIDE 20

Basic Classification

Email and Gender: uniformly outperforms the baselines (TCRF, FGNL)

SLIDE 21

Web User Profiling using Data Redundancy

  • Introduction
  • Traditional Way
  • Basic Idea
  • MagicFG
  • Experiments
  • Conclusion

SLIDE 22

MagicFG

  • Markov Logic Factor Graph

Data Redundancy → Logic Factors → More Accurate Classification

SLIDE 23

Why logic factors?

[Figure: factor graph with labels y1 to y5 over candidates (e1, v), …, (e5, v); attribute factors f(y1, x1), …, f(y5, x5); logic factors g(y1, y2), g(y2, y4), g(y4, y5) of three types: Prior Knowledge, Complete Consistency, Partial Consistency]

  ✓ Depict and utilize correlations between possible candidates from redundant data.
  ✓ Incorporate human knowledge to guide and amend the classification model.

SLIDE 24

Logic Factors

  • Complete Consistency

Two identical vertices must share the same label.

  psyu@cs.uic.edu (True), psyu@cs.uic.edu (True)
  OR
  psyu@cs.uic.edu (False), psyu@cs.uic.edu (False)

SLIDE 25

Logic Factors

  • Partial Consistency

Two similar vertices probably share the same (preferred) label.

e.g. Two emails sharing the same prefix are probably both credible for the target user.

  psyu@cs.uic.edu (True) → psyu@uic.edu (True), probably

SLIDE 26

Logic Factors

  • Prior Knowledge

Some prior knowledge can be converted to logic factors.

e.g. Some email addresses are modified (blocked) for some reason, but their domains are still visible and credible. Emails with the same domain as a blocked one are probably valid.

  email@cs.uic.edu (Blocked) → psyu@uic.edu (True), probably
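
The three kinds of logic factor can be sketched as predicates that decide whether a factor should link two e-mail candidates. All function names are illustrative. Treating a parent domain (uic.edu) as matching a blocked address on a subdomain (cs.uic.edu) is an assumption made so that the slide's example qualifies; the slides only say "the same domain".

```python
def split_email(email):
    """Return (prefix, domain) of an address, lower-cased."""
    prefix, domain = email.lower().split("@", 1)
    return prefix, domain

def complete_consistency(a, b):
    """Identical candidates: their labels must agree."""
    return a.lower() == b.lower()

def partial_consistency(a, b):
    """Distinct addresses with the same prefix: labels probably agree."""
    return a.lower() != b.lower() and split_email(a)[0] == split_email(b)[0]

def prior_knowledge(blocked, candidate):
    """Candidate shares the blocked address's domain (or its parent domain)."""
    blocked_domain = split_email(blocked)[1]
    candidate_domain = split_email(candidate)[1]
    return (blocked_domain == candidate_domain
            or blocked_domain.endswith("." + candidate_domain))

print(complete_consistency("psyu@cs.uic.edu", "psyu@cs.uic.edu"))  # True
print(partial_consistency("psyu@cs.uic.edu", "psyu@uic.edu"))      # True
print(prior_knowledge("email@cs.uic.edu", "psyu@uic.edu"))         # True
```

In a factor graph, each predicate that holds for a pair of candidates would add a logic factor g(yi, yj) between their labels.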

SLIDE 27

Markov Logic Factor Graph

  • Attribute factor function
  • Logic factor function
  • Log-likelihood function
  • Target parameter
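
One common way to write these four pieces is shown below, assuming a log-linear parameterization. This is a generic factor-graph form and not necessarily the exact one used in MagicFG; the feature functions \phi and \psi, the weights \alpha and \beta, and the normalizer Z are illustrative symbols, while f(y_i, x_i) and g(y_i, y_j) follow the notation of the factor-graph figure.

```latex
% Assumed log-linear form; \phi, \psi, \alpha, \beta and Z are illustrative.
\begin{align}
  f(y_i, x_i) &= \exp\{\alpha^\top \phi(y_i, x_i)\}
      && \text{(attribute factor)} \\
  g(y_i, y_j) &= \exp\{\beta^\top \psi(y_i, y_j)\}
      && \text{(logic factor)} \\
  \mathcal{O}(\theta) &= \sum_i \alpha^\top \phi(y_i, x_i)
      + \sum_{(i,j)} \beta^\top \psi(y_i, y_j) - \log Z
      && \text{(log-likelihood)} \\
  \theta^\ast &= \arg\max_\theta \mathcal{O}(\theta),
      \quad \theta = (\alpha, \beta)
      && \text{(target parameter)}
\end{align}
```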

SLIDE 28

Markov Logic Factor Graph

  • Training: Gradient Ascent
  • Gradient:
  • Learning:
  • Classification:
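
The training loop can be illustrated on a toy graph. This is a sketch under stated assumptions, not the MagicFG implementation: the model is a generic log-linear factor graph over binary labels, inference is exact enumeration of all labelings (feasible only for tiny graphs), and the features, data, learning rate, and step count are invented for the example.

```python
import itertools
import math

def feature_vector(y, x, edges):
    """Global features: sums of x_i over vertices labeled 1, plus the
    number of edges whose two endpoints carry the same label."""
    dim = len(x[0])
    unary = [sum(x[i][d] for i in range(len(y)) if y[i] == 1) for d in range(dim)]
    agree = sum(1 for i, j in edges if y[i] == y[j])
    return unary + [float(agree)]

def distribution(theta, x, edges):
    """Exact P(y | x) by enumerating all 2^n labelings."""
    labelings = list(itertools.product([0, 1], repeat=len(x)))
    scores = [sum(t * f for t, f in zip(theta, feature_vector(y, x, edges)))
              for y in labelings]
    log_z = math.log(sum(math.exp(s) for s in scores))
    return labelings, [math.exp(s - log_z) for s in scores]

def gradient(theta, x, edges, y_true):
    """Observed feature values minus their expectation under the model."""
    labelings, probs = distribution(theta, x, edges)
    expected = [0.0] * len(theta)
    for y, p in zip(labelings, probs):
        for k, f in enumerate(feature_vector(y, x, edges)):
            expected[k] += p * f
    observed = feature_vector(y_true, x, edges)
    return [o - e for o, e in zip(observed, expected)]

def train(x, edges, y_true, learning_rate=0.1, steps=200):
    """Gradient ascent on the log-likelihood, starting from zero weights."""
    theta = [0.0] * (len(x[0]) + 1)
    for _ in range(steps):
        grad = gradient(theta, x, edges, y_true)
        theta = [t + learning_rate * g for t, g in zip(theta, grad)]
    return theta

def classify(theta, x, edges):
    """Return the most probable joint labeling."""
    labelings, probs = distribution(theta, x, edges)
    return labelings[probs.index(max(probs))]

# Toy data: each row is [bias, name_matches_prefix]; one edge links the
# two candidates that a logic factor says should agree.
x = [[1.0, 1.0], [1.0, 1.0], [1.0, 0.0]]
edges = [(0, 1)]
y_true = (1, 1, 0)

theta = train(x, edges, y_true)
print(theta, classify(theta, x, edges))
```

Exact enumeration is used here only to keep the sketch self-contained; on graphs of realistic size the expectation would need approximate inference.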

SLIDE 29

Web User Profiling using Data Redundancy

  • Introduction
  • Traditional Way
  • Basic Idea
  • MagicFG
  • Experiments
  • Conclusion

SLIDE 30

Accuracy Performance

[Figure: Precision, Recall, and F1-score. Email: TCRF vs. MagicFG. Gender: FGNL vs. MagicFG.]

  • Comparison between MagicFG and state-of-the-art methods for Email and Gender extraction

SLIDE 31

Accuracy Performance

[Figure: Precision, Recall, and F1-score. Email: Basic, Basic+CC, Basic+CC+PC, Basic+CC+PC+PK. Gender: Basic, Basic+CC. (CC = Complete Consistency, PC = Partial Consistency, PK = Prior Knowledge)]

Logic factors do help!

SLIDE 32

Web User Profiling using Data Redundancy

  • Introduction
  • Traditional Way
  • Basic Idea
  • MagicFG
  • Experiments
  • Conclusion

SLIDE 33

Conclusion

  • Motivation
      • To solve the problem of low recall and error propagation in traditional two-step methods.
  • Basic Idea
      • Search engine as the data source.
  • MagicFG
      • Utilize correlations in redundant data.
      • Incorporate human knowledge.

SLIDE 34

Thank you!

Code & Data http://aminer.org/profiling