slide-1
SLIDE 1

Can We Predict Scientific Impact with Social Media?

A Comparison with Traditional Metrics of Scientific Impact

Denis Helic

Knowledge Technologies Institute, Graz University of Technology

December 04, 2014

Helic (KTI) Scientific Impact December 04, 2014 1 / 52

slide-2
SLIDE 2

Question 1

(How) Will Social Media change scientific processes and/or influence scientific impact?

slide-3
SLIDE 3

It is already happening...

Qualitatively, we observe the following change:

  • Growing numbers of scholars discuss and share the research literature on Twitter, Facebook, etc.
  • They organize articles in social reference managers like Mendeley
  • Review it in blogs, on reddit, etc.
  • The daily research work is moving online and is being put into the spotlight

slide-4
SLIDE 4

Spotlight

  • Traditionally, the spotlight was almost exclusively on citations
  • It is easy to quantify scientific impact from citations, citation networks, etc.
  • The citation count and derivatives such as h-index, PageRank, etc.
  • Often criticized because it cannot measure the invisible
  • Discussions with colleagues, hallway talk, conference talks, and similar

slide-5
SLIDE 5

Question 2

Can we quantify the influence of Social Media on scientific processes?

slide-6
SLIDE 6

Example 1: Information Retrieval

  • Can Social Media improve information retrieval?
  • Allow scientists to access relevant articles more efficiently
  • Traditionally, digital libraries have subject catalogs, faceted navigation, or keyword search
  • In a study with the Mendeley tagging system, we analyzed (hierarchical) navigational structures extracted from author keywords and readership tags

slide-7
SLIDE 7

Example 1: Information Retrieval

[Figure: two panels plotting Success Rate (s) and Stretch (τ) of the Greedy Navigator against the Shortest Path (1000000 Runs).]

(a) OK, m = 20: l=4.020685, h=10.423890, sg=0.998249, τg=2.592566

(b) OT, m = 20: l=4.062013, h=8.340154, sg=0.998127, τg=2.053207

Figure : Although the success rates remain excellent over all datasets, stretch increases slightly in keyword datasets. This results in path lengths that are on average longer by 1 or 2 in keyword networks.

slide-8
SLIDE 8

Example 1: Information Retrieval

[Figure: Keywords and Tags structures over the labels F, Tg, T, N, B, E and the nodes r1 to r6 (Bipartite).]

Figure : Keywords (left) and tags (right) with metadata “folksonomy” (F), “tagging” (Tg), “tags” (T), “navigation” (N), “browsing” (B), and “entropy” (E). Tag hierarchies are richer in structure than keyword hierarchies. Structurally richer hierarchies are more stable and robust to the negative effects of the user interface constraints.

slide-9
SLIDE 9

Example 1: Information Retrieval

  • Folksonomies and keyword hierarchies exhibit comparable quantitative properties
  • We find interesting qualitative differences with regard to navigation
  • Folksonomies create more efficient navigational structures
  • They enable users to find target resources with fewer hops
  • Reason: greater overlap between tags provides better options for users to switch between different parts of the network
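The success rate (s) and stretch (τ) reported for the greedy navigator in Example 1 can be illustrated with a small sketch. It assumes that the navigator always moves to the neighbor closest to the target in a hierarchy, that it gives up when it would revisit a node, and that stretch is the ratio of greedy to shortest path length averaged over successful runs. The toy network, the hierarchy, and all function names are hypothetical and are not taken from the study.

```python
from collections import deque
from itertools import permutations

def ancestors(node, parent):
    """Return the path from node up to the root of the hierarchy."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path

def tree_distance(a, b, parent):
    """Number of hierarchy edges between a and b via their lowest common ancestor."""
    up_a = ancestors(a, parent)
    depth_in_b = {n: i for i, n in enumerate(ancestors(b, parent))}
    for i, n in enumerate(up_a):
        if n in depth_in_b:
            return i + depth_in_b[n]
    raise ValueError("nodes are not in the same hierarchy")

def shortest_path_length(graph, source, target):
    """Breadth-first search; returns None if the target is unreachable."""
    seen, queue = {source}, deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == target:
            return dist
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None

def greedy_path_length(graph, parent, source, target):
    """Hops used by the greedy navigator, or None if it runs into a loop."""
    current, visited, hops = source, {source}, 0
    while current != target:
        # Assumed strategy: pick the neighbor closest to the target in the hierarchy.
        nxt = min(graph[current], key=lambda n: tree_distance(n, target, parent))
        if nxt in visited:
            return None
        visited.add(nxt)
        current, hops = nxt, hops + 1
    return hops

def success_rate_and_stretch(graph, parent):
    """Evaluate all ordered (source, target) pairs of the network."""
    runs, stretches = 0, []
    for source, target in permutations(graph, 2):
        shortest = shortest_path_length(graph, source, target)
        if shortest is None:
            continue
        runs += 1
        greedy = greedy_path_length(graph, parent, source, target)
        if greedy is not None:
            stretches.append(greedy / shortest)
    return len(stretches) / runs, sum(stretches) / len(stretches)

# Hypothetical toy data: a small network and a hierarchy over the same nodes.
graph = {
    "root": ["a", "b"],
    "a": ["root", "a1", "a2"],
    "b": ["root", "b1"],
    "a1": ["a", "b1"],
    "a2": ["a"],
    "b1": ["b", "a1"],
}
parent = {"a": "root", "b": "root", "a1": "a", "a2": "a", "b1": "b"}

s, tau = success_rate_and_stretch(graph, parent)
print(f"success rate s = {s:.3f}, stretch tau = {tau:.3f}")
```

In this toy setup every hierarchy edge is also a network edge, so the navigator always succeeds, but it misses the shortcut between a1 and b1 on some routes, which pushes the stretch above 1.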

slide-10
SLIDE 10

Example 2: Citation Latency

  • How does early availability of an article influence the citation latency?
  • Citation latency: the time that it takes from the moment an article is accepted for publication until it is cited in other (published) articles
  • Depending on the community, the process, and the accessibility of the journal, this may range anywhere from 3 months to 1-2 years
  • Is the latency reduced by e.g. pre-print platforms such as http://arxiv.org/ at Cornell?
  • Paper: Tim Brody, Stevan Harnad, and Leslie Carr. 2006. Earlier Web usage statistics as predictors of later citation impact: Research Articles.

slide-11
SLIDE 11

Example 2: Citation Latency

Figure : Changing distribution of latencies, e.g. for older articles the latency was approx. 12 months or more. Recently, latency decreased to seemingly nothing.

slide-12
SLIDE 12

Example 2: Citation Latency

  • The latency between an article being uploaded and later cited has decreased
  • From a peak at 12 months to little or no delay before the peak rate of citations
  • This can be biased because of the possibility to revise the paper
  • However, it indicates that authors are increasingly citing very recent work that has yet to be published
  • Even new questions for the peer-review process?
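A latency distribution of the kind described in Example 2 could be computed roughly as follows. This is only a sketch: it assumes that for each article we know the date it became available and the dates of the citing articles, and it measures latency in whole calendar months. The record layout, the function names, and the sample dates are invented for illustration and do not come from the cited paper.

```python
from collections import Counter, defaultdict
from datetime import date

def months_between(start, end):
    """Whole calendar months from start to end (the day of the month is ignored)."""
    return (end.year - start.year) * 12 + (end.month - start.month)

def latency_by_deposit_year(deposits, citations):
    """Histogram of citation latencies (in months), grouped by deposit year.

    deposits:  {article_id: date the article became available}
    citations: iterable of (cited_article_id, date of the citing article)
    """
    by_year = defaultdict(Counter)
    for article_id, cited_on in citations:
        deposited_on = deposits[article_id]
        by_year[deposited_on.year][months_between(deposited_on, cited_on)] += 1
    return by_year

# Hypothetical records, invented for illustration only.
deposits = {"old": date(1995, 3, 10), "new": date(2004, 6, 1)}
citations = [
    ("old", date(1996, 3, 5)),
    ("old", date(1996, 3, 28)),
    ("old", date(1996, 4, 2)),
    ("new", date(2004, 6, 20)),
    ("new", date(2004, 6, 25)),
    ("new", date(2004, 7, 2)),
]

histograms = latency_by_deposit_year(deposits, citations)
# The most frequent latency per deposit year is the "peak" of each distribution.
peaks = {year: max(hist, key=hist.get) for year, hist in histograms.items()}
print(peaks)
```

Comparing the peak latency across deposit years is one simple way to see the shift the slide describes, although it does not address the bias from revised papers.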

slide-13
SLIDE 13

Example 3: Download vs. citation vs. readership

  • How downloads of an article compare to the number of citations that article obtains
  • How readership data compares to the number of citations
  • Readership data is e.g. the number of mentions in Mendeley user libraries
  • How downloads and readership compare
  • A study with Mendeley and Know-Center
  • Paper: Schlögl et al., Download vs. citation vs. readership data: the case of an information systems journal

slide-14
SLIDE 14

Example 3: Download vs. citation vs. readership

[Figure: scattergram of downloads vs. cites.]

Figure : Spearman correlation r=0.77

slide-15
SLIDE 15

Example 3: Download vs. citation vs. readership

[Figure: scattergram of readership vs. cites (publication year: 2002-2011).]

Figure : Spearman correlation r=0.51

slide-16
SLIDE 16

Example 3: Download vs. citation vs. readership

[Figure: scattergram of downloads vs. readers.]

Figure : Spearman correlation r=0.73

slide-17
SLIDE 17

Example 3: Download vs. citation vs. readership

  • The results are in line with several other similar studies
  • Correlations do however change depending on the source of citations
  • Also depending on the journal or conference, scientific field, etc.
  • Strongly time dependent
  • Somewhat smaller correlation between readership and citations
  • Is Mendeley still a young system? Is it the Mendeley user population?
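The Spearman correlations quoted in Example 3 are rank correlations: each count is replaced by its rank, and the Pearson correlation is computed on the ranks. The sketch below implements this in plain Python, with tied values sharing the mean of their ranks. The per-article counts are hypothetical and are not the journal data from the study.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical per-article counts, invented for illustration only.
downloads = [120, 300, 45, 800, 60, 210]
cites = [3, 10, 1, 12, 0, 4]

rho = spearman(downloads, cites)
print(f"Spearman correlation between downloads and cites: {rho:.2f}")
```

Because only ranks matter, a few heavily downloaded or heavily cited articles do not dominate the result, which is one reason a non-parametric measure suits skewed count data.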

slide-18
SLIDE 18

Question 3

How should we quantify the influence of Social Media on scientific processes?

slide-19
SLIDE 19

Methodology

  • In all examples we measured a different thing and applied a different methodology
  • Example 1: algorithmic approach to information retrieval
  • Example 2: distribution of citation latency
  • Example 3: non-parametric statistics with rank correlations
  • Should we also apply other methods?

slide-20
SLIDE 20

Time dependence

  • Traditional as well as new metrics are strongly time dependent
  • E.g. citation delay, time of the peak, etc.
  • Downloads are also strongly time dependent, but follow a different function of time
  • Social Media is even more sensitive to time and operates on shorter time spans

slide-21
SLIDE 21

Example 4: Response dynamics

  • Now, we can include Social Media in the loop
  • Ask questions such as: what is the download latency for pre-prints?
  • How does Twitter influence the download latency?
  • How does Twitter influence the citation count?
  • A study with http://arxiv.org/
  • Paper: Shuai et al., How the Scientific Community Reacts to Newly Submitted Preprints: Article Downloads, Twitter Mentions, and Citations

slide-22
SLIDE 22

Example 4: Response dynamics

Figure : Twitter mentions spike shortly after submission and wane quickly, whereas downloads peak shortly afterwards but continue to exhibit significant activity many weeks later.

slide-23
SLIDE 23

Example 4: Response dynamics

  • Thus, we need an even more sophisticated methodology than simple correlation measurements
  • Counting Twitter mentions, downloads, and citations at different times can lead to varying correlations
  • Time series analysis
  • Multivariate regression methods, etc.
  • Methodologically, a very interesting field!
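The point that counts taken at different times lead to varying correlations can be made concrete with a small sketch. It assumes daily counts per article and correlates the cumulative counts after several cutoffs. The daily series are contrived for illustration and the function names are hypothetical; nothing here reproduces the data or method of the cited study.

```python
def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def correlation_at_cutoffs(series_a, series_b, cutoffs):
    """Pearson correlation of cumulative counts after each cutoff (in days)."""
    articles = sorted(series_a)
    result = {}
    for days in cutoffs:
        total_a = [sum(series_a[k][:days]) for k in articles]
        total_b = [sum(series_b[k][:days]) for k in articles]
        result[days] = pearson(total_a, total_b)
    return result

# Contrived daily counts per article (day 1 to day 4), for illustration only.
tweets = {
    "p1": [5, 1, 0, 0],
    "p2": [2, 0, 0, 0],
    "p3": [8, 2, 0, 0],
    "p4": [1, 0, 0, 0],
}
downloads = {
    "p1": [10, 20, 15, 10],
    "p2": [12, 5, 4, 30],
    "p3": [9, 30, 20, 15],
    "p4": [11, 3, 2, 40],
}

by_cutoff = correlation_at_cutoffs(tweets, downloads, cutoffs=(1, 2, 4))
for days, r in by_cutoff.items():
    print(f"after {days} day(s): r = {r:.2f}")
```

With these made-up numbers the sign of the correlation even flips between the first and second day, which is why a single snapshot can be misleading and why time series methods are attractive here.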

slide-24
SLIDE 24

Example 4: Response dynamics

Figure : Pearson correlation R for 70 most mentioned articles

slide-25
SLIDE 25

Example 4: Response dynamics

Figure : Pearson correlation R for 70 most mentioned articles

slide-26
SLIDE 26

Example 4: Response dynamics

Figure : Pearson correlation R for 70 most mentioned articles

slide-27
SLIDE 27

Example 4: Response dynamics

  • The results are highly suggestive of a strong tie between social media interest, article downloads and even early citations
  • There are two different temporal patterns of activity
  • The volume of Twitter mentions is statistically correlated with that of both downloads and early citations
  • Two possible explanations:
  • First: exposure on Twitter affects download and citation behavior
  • Second: intrinsic quality

slide-28
SLIDE 28

Interpretation

Causality vs. correlation

A correlation between x and y may occur because:

  1. x influences y
  2. y influences x
  3. the influence is in both directions
  4. a third variable influences both x and y

We need more analysis and interpretation
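The fourth case, a third variable influencing both x and y, can be illustrated with a short simulation. In this sketch a hidden variable z (think of intrinsic quality) drives both x and y (think of Twitter mentions and citations), while x and y have no direct effect on each other. The variable names, noise levels, and sample size are assumptions chosen for illustration only.

```python
import random

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

rng = random.Random(42)  # fixed seed so the sketch is reproducible
n = 2000

# Hidden third variable z influences both x and y; x and y never interact.
z = [rng.gauss(0, 1) for _ in range(n)]
x = [zi + rng.gauss(0, 0.5) for zi in z]
y = [zi + rng.gauss(0, 0.5) for zi in z]

r_xy = pearson(x, y)

# Once the contribution of z is removed, only independent noise remains.
r_residual = pearson(
    [xi - zi for xi, zi in zip(x, z)],
    [yi - zi for yi, zi in zip(y, z)],
)

print(f"correlation of x and y:            {r_xy:.2f}")
print(f"correlation after accounting for z: {r_residual:.2f}")
```

In real data z is usually unobserved, so it cannot simply be subtracted; that is where the additional analysis and interpretation come in.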

slide-29
SLIDE 29

Interpretation and interdisciplinary approach

  • Computer scientists are good at calculating things
  • But we lack knowledge in user behavior
  • We lack knowledge in community practices
  • Only interdisciplinary teams can interpret the results in a satisfactory manner

slide-30
SLIDE 30

Question 4

Can we move beyond quantification to modeling, predicting and understanding?

slide-31
SLIDE 31

Hypotheses

After observing, measuring and quantifying:

  • Can we formulate hypotheses which can be tested?
  • Can such hypotheses explain the phenomena that we observe?
  • Can we use these models to predict new phenomena, e.g. the scientific impact of an article?

slide-32
SLIDE 32

Example 5: Long-term predictability

  • Is there a long-term predictability in citation patterns?
  • Are there universal laws governing the citation process across fields, authors, and journals?
  • What are the parameters of such a universal model?
  • How does the model capture phenomena such as Social Media?
  • Paper: Wang et al., Quantifying Long-Term Scientific Impact

slide-33
SLIDE 33

Example 5: Long-term predictability

  • Extremely difficult because of impact heterogeneity
  • E.g. power laws in the impact, citation, download, or Twitter mentions distributions

Figure : Yearly citations of randomly selected articles from the Physical Review

slide-34
SLIDE 34

Example 5: Long-term predictability

Three basic mechanisms that drive the citation history of individual papers:

  1. Preferential attachment (rich-get-richer phenomenon)
  2. New ideas are integrated in subsequent work: immediacy governs the time to citation peak and longevity captures the decay rate
  3. Fitness captures the intrinsic quality of a paper

Novelty and fitness depend on the community response to the work, i.e. they also capture any influence coming from e.g. Social Media

slide-35
SLIDE 35

Backtracking 1

But let us first take a step back...

slide-36
SLIDE 36

Quality vs popularity

  • What would happen if Justin Bieber published a scientific paper?
  • We have to be more critical about what we measure
  • Is it the popularity?
  • Is it the immediacy?
  • Is it recency?
  • Is it quality?

slide-37
SLIDE 37

Backtracking 2

But let us first take another step back...

slide-38
SLIDE 38

Quality

  • What is quality in science? What is a good scientific paper?
  • My argument: only papers that have been recognized by your peers!
  • Regardless of the time
  • E.g. the paper on the Web by Tim Berners-Lee was rejected at HT1993
  • But eventually he got recognized (e.g. keynote next year)

slide-39
SLIDE 39

Quality

How do great scientists think about quality?

slide-40
SLIDE 40

Donald Knuth

I have been a happy man ever since January 1, 1990, when I no longer had an email address. I’d used email since about 1975, and it seems to me that 15 years of email is plenty for one lifetime. Email is a wonderful thing for people whose role in life is to be on top of things. But not for me; my role is to be on the bottom of things. What I do takes long hours of studying and uninterruptible concentration. I try to learn certain areas of computer science exhaustively; then I try to digest that knowledge into a form that is accessible to people who don’t have time for such study.

slide-41
SLIDE 41

Donald Knuth

On the other hand, I need to communicate with thousands of people all over the world as I write my books. I also want to be responsive to the people who read those books and have questions or comments. My goal is to do this communication efficiently, in batch mode – like, one day every three months. So if you want to write to me about any topic, please use good ol’ snail mail and send a letter to the following address: . . .

http://www-cs-faculty.stanford.edu/~uno/email.html

slide-42
SLIDE 42

Michael Jordan

I like to use the analogy of building bridges. If I have no principles, and I build thousands of bridges without any actual science, lots of them will fall down, and great disasters will occur. Similarly here, if people use data and inferences they can make with the data without any concern about error bars, about heterogeneity, about noisy data, about the sampling pattern, about all the kinds of things that you have to be serious about if you’re an engineer and a statistician—then you will make lots of predictions, and there’s a good chance that you will

occasionally solve some real interesting problems. But you will occasionally have some disastrously bad decisions. And you won’t know the difference a priori. You will just produce these outputs and hope for the best.

slide-43
SLIDE 43

Michael Jordan

And so that’s where we are currently. A lot of people are building things hoping that they work, and sometimes they will. And in some sense, there’s nothing wrong with that; it’s exploratory. But society as a whole can’t tolerate that; we can’t just hope that these things work. Eventually, we have to give real guarantees. Civil engineers eventually learned to build bridges that were guaranteed to stand up. So with big data, it will take decades, I suspect, to get a real engineering approach, so that you can say with some assurance that you are giving out reasonable answers and are quantifying the likelihood of errors.

slide-44
SLIDE 44

Michael Jordan

  • http://www-cs-faculty.stanford.edu/~uno/email.html
  • Machine-Learning Maestro Michael Jordan on the Delusions of Big Data and Other Huge Engineering Efforts, IEEE Spectrum
  • Beware: media manipulation
  • Answer from the Maestro: Big Data, Hype, the Media and Other Provocative Words to Put in a Title
  • https://amplab.cs.berkeley.edu/2014/10/22/big-data-hype-the-media-and-other-provocative-words-to-put-

slide-45
SLIDE 45

Michael Jordan

But most of all, from my point of view, it’s a major engineering and mathematical challenge, one that will not be solved by just gluing together a few existing ideas from statistics, optimization, databases and computer systems.

slide-46
SLIDE 46

Quality

The quality requires a lot of effort...

slide-47
SLIDE 47

Richard Feynman

https://www.youtube.com/watch?v=EYPapE-3FRw

slide-48
SLIDE 48

Quality

Achieving quality is difficult...

slide-49
SLIDE 49

Example 5: Long-term predictability

  • Analytic solution of the model shows that the shape of the citation distribution depends on immediacy, longevity and fitness
  • But the long-term asymptotic behavior depends only on fitness
  • After a long time, citation distributions of all papers with the same fitness converge regardless of their immediacy and longevity
  • In other words, only the intrinsic paper quality as perceived in a particular community matters in the long run
  • Empirical analysis also confirms these theoretical results
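The convergence claim can be checked numerically. The sketch below assumes the closed form reported in Wang et al., in which the cumulative citations of a paper t years after publication are c(t) = m (exp(λ Φ((ln t − μ) / σ)) − 1), where λ is the fitness, μ the immediacy, σ the longevity, Φ the standard normal cumulative distribution function, and m a global constant (set to 30 here, the value used in that paper as far as I recall). The two parameter sets are invented for illustration.

```python
import math

M = 30  # global constant m of the model; value assumed for this sketch

def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def cumulative_citations(t, fitness, immediacy, longevity, m=M):
    """Cumulative citations t years after publication under the assumed closed form."""
    phi = normal_cdf((math.log(t) - immediacy) / longevity)
    return m * (math.exp(fitness * phi) - 1.0)

def ultimate_impact(fitness, m=M):
    """Limit of cumulative_citations for t going to infinity (phi tends to 1)."""
    return m * (math.exp(fitness) - 1.0)

# Two hypothetical papers with the same fitness but different temporal profiles.
fast = dict(fitness=2.0, immediacy=1.0, longevity=0.5)
slow = dict(fitness=2.0, immediacy=2.5, longevity=1.0)

for years in (2, 4, 10, 20, 1000):
    c_fast = cumulative_citations(years, **fast)
    c_slow = cumulative_citations(years, **slow)
    print(f"year {years:>4}: fast = {c_fast:7.1f}, slow = {c_slow:7.1f}")
print(f"limit for fitness 2.0: {ultimate_impact(2.0):.1f}")
```

The early counts differ widely because immediacy and longevity shape the curve, but both papers approach the same limit m (exp(λ) − 1), which depends on fitness alone.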

slide-50
SLIDE 50

Example 5: Long-term predictability

[Figure: panels B, E, C, F; Year 2, Year 4, Year 10, Year 20.]

Figure : Convergence of citation distributions

slide-51
SLIDE 51

Summary

  • Social Media influences the scientific process
  • We can quantify that impact in various ways
  • We need sound methodologies for the quantification
  • We need interdisciplinary research for interpretation of the results
  • Recent results indicate that the intrinsic quality of a paper is the only indicator of its long-term impact
  • Short-term impact can be influenced by Social Media

slide-52
SLIDE 52

Thank You!

Questions?
