Scaling NewSum: Big data text Clustering and Summarization using N-Gram graphs

slide-1
SLIDE 1

WWW.FED4FIRE.EU

Scaling NewSum

Big data text Clustering and Summarization using N-Gram graphs

https://www.scify.org

Alexandros Tzoumas | a.tzoumas@scify.org

slide-2
SLIDE 2

What’s our product about?

Scaling NewSum | SciFY.org

slide-6
SLIDE 6

Business Goals

slide-7
SLIDE 7

Goals

Business goals

  • improve the quality of the solutions our product offers
  • allow NewSum technology to expand to new domains/markets

From a technical perspective

Measure and evaluate:

  • the accuracy of candidate clustering components
  • the effectiveness (summary quality) of alternative summarization components
  • the overall scalability of the system
slide-8
SLIDE 8

Challenges

Business challenges / From a technical perspective

  • Expansion to new markets should take domain-specific characteristics into account as system parameters
  • A product manager is not able to configure the product-related settings appropriate for each domain, so a semi-supervised process would be invaluable
  • Define a process for evaluating different clustering and summarization components
  • Scale the algorithms to process thousands of sources/articles

slide-9
SLIDE 9

The Experiments

slide-10
SLIDE 10

Setup

Tengu testbed with the support of IMEC

Cassandra - Hadoop - Spark

slide-11
SLIDE 11

Experiments

Experiment set 1

Goal: Measure the effectiveness of NewSum’s candidate clustering implementations

Related datasets:

  • MultiLing (articles with clustering information)
  • 6GB database of news articles

Methodology: Run 2 different clustering implementations and measure recall and precision.

  • Automatic evaluation for the MultiLing dataset
  • Manual process for the news articles dataset

Results

Selected the algorithm with higher precision & recall
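
The slides do not say how precision and recall were computed for the clusterings. One common option is pairwise scoring against gold clusters, sketched below in Python. The function names and the toy article IDs and labels are illustrative assumptions, not NewSum code or data.

```python
from itertools import combinations


def same_cluster_pairs(labels):
    """Return the set of article pairs that share a cluster label."""
    return {
        (a, b)
        for a, b in combinations(sorted(labels), 2)
        if labels[a] == labels[b]
    }


def pairwise_precision_recall(predicted, gold):
    """Pairwise precision/recall of a predicted clustering against gold clusters.

    Both arguments map article id -> cluster label and must cover the same ids.
    """
    pred_pairs = same_cluster_pairs(predicted)
    gold_pairs = same_cluster_pairs(gold)
    true_positives = len(pred_pairs & gold_pairs)
    precision = true_positives / len(pred_pairs) if pred_pairs else 0.0
    recall = true_positives / len(gold_pairs) if gold_pairs else 0.0
    return precision, recall


# Toy data (assumed for illustration only)
gold = {"a1": "election", "a2": "election", "a3": "football", "a4": "football"}
predicted = {"a1": 0, "a2": 0, "a3": 0, "a4": 1}

precision, recall = pairwise_precision_recall(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Running the same function over the output of each candidate implementation gives directly comparable scores on a labelled dataset such as MultiLing.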

slide-12
SLIDE 12

Experiments

Experiment set 2

Goal: Measure scalability

Related dataset: 6GB database of news articles

Methodology: Run the clustering pipeline using as input a) the algorithm selected in experiment set 1 and b) a variable number of articles. Measure speed.

Results

  • Increased the speed of the clustering pipeline 5 times!
  • Identified areas for improvement
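
A minimal sketch of the kind of measurement described above: time one pipeline over a growing number of articles. The actual experiments ran on Cassandra, Hadoop and Spark; `toy_pipeline` below is a hypothetical stand-in for the clustering step, and the input sizes are arbitrary.

```python
import time


def time_pipeline(pipeline, articles, sizes):
    """Run `pipeline` on growing prefixes of `articles`; return (size, seconds) pairs."""
    timings = []
    for size in sizes:
        batch = articles[:size]
        start = time.perf_counter()
        pipeline(batch)
        timings.append((size, time.perf_counter() - start))
    return timings


def toy_pipeline(batch):
    # Hypothetical stand-in for the real clustering pipeline:
    # visits every pair of articles, so the work grows quadratically.
    return sum(1 for i in range(len(batch)) for j in range(i + 1, len(batch)))


articles = [f"article {i}" for i in range(400)]
timings = time_pipeline(toy_pipeline, articles, sizes=[100, 200, 400])
for size, seconds in timings:
    print(f"{size} articles: {seconds:.4f}s")
```

Plotting time against input size shows whether the pipeline scales linearly or worse, which is one way to locate bottlenecks.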

slide-13
SLIDE 13

Experiments

Experiment set 3

Goal: Measure the effectiveness of NewSum’s candidate summarization implementations

Related dataset: 6GB database of news articles

Methodology: Run the summarization pipeline using as input a) a configuration/parameter setting and b) a number of clusters to be summarized. Recall and precision were measured through a manual process.

Results

Implemented/Identified the process for selecting the algorithm appropriate for each scenario

slide-14
SLIDE 14

Conclusions

slide-15
SLIDE 15

What we achieved

  • Defined a process for evaluating clustering algorithms
  • Defined a process for evaluating summarization components
  • Increased the speed of the clustering pipeline 5 times!
  • Measured scalability and identified bottlenecks
slide-16
SLIDE 16

How Fed4Fire+ helped us

Patron’s support was crucial to the success of the experiments

slide-17
SLIDE 17

How Fed4Fire+ helped us

Provided a quick way to start experimenting with big data without having to worry about the underlying technologies

slide-18
SLIDE 18

How Fed4Fire+ helped us

Funding allowed us to allocate time to implement the algorithms and analyze next steps

slide-19
SLIDE 19

Next Steps

slide-20
SLIDE 20

Next steps

  • Continue working on algorithm implementations
  • Distributed N-gram graphs
  • Improve clustering speed using blocking methodology
  • Automate the setup of a pipeline in a cloud environment to be used in production
  • Release a domain-specific product related to Blockchain news
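
The slides give no detail on the planned blocking methodology. In general, blocking groups items by a cheap key and runs the expensive pairwise comparison only inside each group. The Python sketch below assumes a hypothetical key (publication date) and a made-up article structure; neither comes from the NewSum code base.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(article):
    # Assumption for illustration: articles published on the same day share a block.
    return article["date"]


def candidate_pairs(articles, key=blocking_key):
    """Return the pairs of article ids that fall into the same block."""
    blocks = defaultdict(list)
    for article in articles:
        blocks[key(article)].append(article["id"])
    pairs = []
    for ids in blocks.values():
        pairs.extend(combinations(ids, 2))
    return pairs


# Toy articles (assumed for illustration only)
articles = [
    {"id": 1, "date": "2018-05-01"},
    {"id": 2, "date": "2018-05-01"},
    {"id": 3, "date": "2018-05-02"},
    {"id": 4, "date": "2018-05-02"},
]

pairs = candidate_pairs(articles)
all_pairs = len(list(combinations(articles, 2)))
print(f"{len(pairs)} candidate pairs instead of {all_pairs}")
```

The trade-off is that two related articles placed in different blocks are never compared, so the choice of key affects recall as well as speed.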

slide-22
SLIDE 22

This project has received funding from the European Union’s Horizon 2020 research and innovation programme, which is co-funded by the European Commission and the Swiss State Secretariat for Education, Research and Innovation, under grant agreement No 732638.

WWW.FED4FIRE.EU

www.scify.org