Network PDF file unstructured information extraction - PowerPoint PPT Presentation



SLIDE 1

Network PDF file unstructured information extraction

Pengchanghuan, Sunwei, Fuxiaohan
Shanghai Jiaotong University
598095762@qq.com

SLIDE 2

What is the most important thing in our modern society?

DATA

Faster data = Faster cognition = Faster results

SLIDE 3

THE BUSINESS PLAN

Smart PDF Financial Report Data Search Engine

What did we do? What can it do?

SLIDE 4

Catalog

1. Introduction
2. Our process
   2.1 PDF2CSV
   2.2 Preliminary extraction
   2.3 Autodetect module
   2.4 Visualization
   2.5 Convert to CSV file
3. Evaluation
   3.1 Table search
   3.2 Table extraction
4. Discussion
5. Conclusion

SLIDE 5

1. Introduction

Significance: For most companies, once economic data has been acquired successfully, the company can quickly access market information, make correct decisions, and occupy a favorable position in the market.

Problems: PDF (Portable Document Format) is a form of documentation designed to prevent modifications (in fact, it is difficult to modify PDF files).

Results: In our test of 32 PDF reports, the program has a 93% recall rate.

SLIDE 6

2. Our process

2.1 PDF2CSV

Tabula can help you extract data tables from PDF files and save them in CSV format so that you can easily access the data and reuse it. It is a free, open-source program.

The Python library xlwt lets our Python program work with Excel spreadsheets.

After you enter the correct parameters for these tools, you can transform the entire PDF file into a CSV file. In the CSV file, text and data are separated by commas and line breaks.
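As a minimal sketch of this step, the conversion could look roughly like this in Python. It assumes the tabula-py wrapper around Tabula (which needs a Java runtime). The helper names `pdf_to_csv` and `read_csv_rows` and the sample text are hypothetical, and the slides do not say which parameters were actually used.

```python
import csv
import io


def pdf_to_csv(pdf_path, csv_path):
    """Write every table Tabula finds in a PDF into one CSV file.

    Assumption: the tabula-py package and a Java runtime are installed.
    """
    import tabula  # imported lazily so the rest of the sketch runs without it

    tabula.convert_into(pdf_path, csv_path, output_format="csv", pages="all")


def read_csv_rows(csv_text):
    """Split CSV text into rows of cells.

    Commas separate cells and line breaks separate rows; quoted cells
    may themselves contain commas (for example "1,200").
    """
    return [row for row in csv.reader(io.StringIO(csv_text))]


# Made-up sample of what a converted report fragment might look like.
sample = 'Item,2016,2015\nRevenue,"1,200","1,050"\n'
rows = read_csv_rows(sample)
print(rows)
```

Reading the result back with the standard `csv` module, rather than splitting on commas by hand, keeps quoted figures such as "1,200" in a single cell.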

SLIDE 7

2. Our process

2.1 PDF2CSV

SLIDE 8

2. Our process

2.2 Preliminary extraction

We use known table headers as templates, extract the corresponding part of the data, and store it in our original database. This collects the training data we need for fuzzy matching.

Features: text content, numeric type, numeric size, distribution of literal and numeric content, text density, and so on.

We take 27 known headers from 5 different reports as our initial samples.
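The following is an illustrative sketch only, not the authors' implementation. It shows how a few of the listed features could be computed for one CSV row, and how a candidate line could be fuzzily matched against known headers. The function names, the number pattern, and the example headers are all assumptions, and `difflib` stands in for whatever fuzzy matcher was actually used.

```python
import re
from difflib import SequenceMatcher

# Assumed pattern for numeric cells: 1,200  -3.5  45%  (1,050)
NUMBER = re.compile(r"^\(?-?[\d,]+(\.\d+)?%?\)?$")


def row_features(cells):
    """Describe one CSV row by a few of the features named on the slide."""
    filled = [c.strip() for c in cells if c.strip()]
    numeric = [c for c in filled if NUMBER.match(c)]
    return {
        # share of cells in the row that are not empty
        "text_density": len(filled) / (len(cells) or 1),
        # share of non-empty cells that look like numbers
        "numeric_ratio": len(numeric) / (len(filled) or 1),
        # average length of the non-empty cells
        "mean_length": sum(len(c) for c in filled) / (len(filled) or 1),
    }


def header_similarity(candidate, known_header):
    """Fuzzy similarity in [0, 1] between a candidate line and a known header."""
    return SequenceMatcher(None, candidate.lower(), known_header.lower()).ratio()


def best_header(candidate, known_headers):
    """Return (header, score) for the known header closest to the candidate."""
    scored = ((h, header_similarity(candidate, h)) for h in known_headers)
    return max(scored, key=lambda pair: pair[1])


# Hypothetical headers; the real 27 samples are not listed in the slides.
known = ["Consolidated Balance Sheet", "Consolidated Income Statement"]
match, score = best_header("Consolidated Balance Sheets", known)
features = row_features(["Revenue", "1,200", "1,050", ""])
```

A tolerant matcher of this kind lets a header with small spelling or plural differences still map to a stored template.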

SLIDE 9

2. Our process

2.2 Preliminary extraction

SLIDE 10

2. Our process

2.3 Autodetect module

We find that the appearance of tables is related to the features mentioned in the previous part. We match the model trained on the existing data in the database against the contents of the CSV file and compare their feature sets. We then select the better-matching part as a table.

In addition, we use the distribution of lines drawn on the PDF page as a further cue for where a table appears.
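One way to picture the feature-set comparison is the sketch below. It is an assumption-laden illustration: cosine similarity is swapped in as the matching score because the slides do not name the trained model, the feature values are made up, and the line-distribution cue is not modeled.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two feature dicts over their shared keys."""
    keys = sorted(set(a) & set(b))
    dot = sum(a[k] * b[k] for k in keys)
    norm_a = math.sqrt(sum(a[k] ** 2 for k in keys))
    norm_b = math.sqrt(sum(b[k] ** 2 for k in keys))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def pick_table_block(blocks, table_profile):
    """Return (index, score) of the candidate block that best matches the profile."""
    scores = [cosine_similarity(block, table_profile) for block in blocks]
    best = max(range(len(blocks)), key=lambda i: scores[i])
    return best, scores[best]


# Hypothetical profile built from known tables, and two candidate CSV blocks.
table_profile = {"text_density": 0.9, "numeric_ratio": 0.7}
blocks = [
    {"text_density": 0.3, "numeric_ratio": 0.0},  # resembles running prose
    {"text_density": 0.8, "numeric_ratio": 0.6},  # resembles a table
]
index, score = pick_table_block(blocks, table_profile)
print(index, round(score, 3))
```

In practice a threshold on the score would probably be needed as well, so that a page without any table does not still yield a "best" block.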

SLIDE 11

2. Our process

2.4 Visualization

2.5 Convert to CSV file

SLIDE 12

3. Evaluation

Result

In the test, we used 32 PDF files and reached a recall rate of about 93%. Such a high recall rate indicates that our program is of practical value.

Table search
Table extraction
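For reference, recall is the share of the tables that should have been found that the program actually found: TP / (TP + FN). The sketch below computes it on toy identifiers that are purely illustrative. The slides do not state the unit over which the 93% was measured, and none of the values below come from the real test.

```python
def recall(true_positives, false_negatives):
    """Recall = TP / (TP + FN): the share of expected tables that were found."""
    total = true_positives + false_negatives
    if total == 0:
        raise ValueError("recall is undefined when nothing was expected")
    return true_positives / total


def recall_from_sets(expected, found):
    """Compute recall from sets of expected and detected table identifiers."""
    expected, found = set(expected), set(found)
    return recall(len(expected & found), len(expected - found))


# Toy identifiers (hypothetical): 4 expected tables, 3 of them detected.
expected_tables = {"report_a:p3", "report_a:p7", "report_b:p2", "report_b:p5"}
found_tables = {"report_a:p3", "report_b:p2", "report_b:p5", "report_b:p9"}
toy_recall = recall_from_sets(expected_tables, found_tables)
print(toy_recall)
```

The extra detection `report_b:p9` is a false positive. It would lower precision but does not change recall, which is why a high recall alone does not show how many spurious tables were reported.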

SLIDE 13

3. Evaluation

3.1 Table search

SLIDE 14

3. Evaluation

3.2 Table extraction

SLIDE 15

3. Evaluation

3.2 Table extraction

SLIDE 16

4 & 5. Discussion & Conclusion

4. Discussion

In our project we cannot yet extract the title of a table, because the title can appear on any side of the table. The next step is therefore to locate table titles intelligently.

5. Conclusion

With our own intelligent analysis and extraction program, we achieve a very high recognition and extraction rate (a 93% recall rate in our test of 32 reports).

SLIDE 17

Thank you!

Network PDF file unstructured information extraction

-Piracle present