Text File Layout Inference Reid Phillips, Wingning Li, Craig Thompson University of Arkansas {rxp01, wingning, cwt}@uark.edu

Outline • Review – Layout inference problem and properties – Prototype • Assumptions • Flow of control • Oracles • Record length analysis • Content identification • Making a decision • Current work – Recent prototype enhancements – Results and analysis • Future work

Problem description [Figure: undefined data files are passed to the layout engine, which infers the file layout and yields well-defined data files.]

File layout definition • Character encoding – EBCDIC – ASCII • Delimiters – Field – Record – Text quote (string literals) • File type – Delimited – Fixed – Hybrid • Record length • Record structure – Field type – Field location – Field length*

Input file types and properties

Additional aspects   Record delimiter   Field delimiter    Character encoding   Text quote
Current scope        CR, LF, CR-LF      Comma, Pipe, Tab   ASCII, EBCDIC        Single quote, double quote

File type   Record length   Record delimiter   Field length   Field delimiter
Delimited   Variable        Yes                Variable       Yes
Fixed       Fixed           No                 Fixed          No
Hybrid      Fixed           Yes                Fixed          No

File layout definition • Character encoding – EBCDIC – ASCII • Delimiters – Field { Comma, Pipe, Tab } – Record { CR, LF, CR-LF } – Text quote { “, ‘ } • File type – Delimited – Fixed – Hybrid • Record length • Record structure – Field type – Field location – {Field length}

Prototype Assumptions • Structured file • U.S. customer data • Name, address, phone, and email content types. • Allowing errors and blank fields for a statistically small number of records • Known field delimiters and record delimiters as shown earlier • ASCII and EBCDIC encoding

Prototype process overview Key elements: 1. Sampling 2. Encoding analysis 3. File type analysis 4. File length analysis 5. Delimited field content id 6. Fixed field content id 7. Oracles and their applications 8. XML layout [Figure: flow of control. The file is sampled, encoding analysis converts the sample to an internal ASCII encoding, and file type analysis classifies it as delimited, hybrid, or fixed. Fixed files go through length analysis and are then handled as hybrid. Delimited field content id and fixed field content id consult the oracles, which are set up beforehand, and the result is written as the data layout (XML).]

Encoding and file type analysis • File encoding analysis is based on statistical differences between character encodings. To distinguish ASCII from EBCDIC, the most significant bit of each byte is used. • File type analysis is based on the assumed field and record delimiters and on statistical measurement of how often they occur. • The prototype has performed very well for both analyses in tests on real and synthetic data. • Future work in this area could include supporting additional character encodings, such as Latin-1 and Unicode, and additional delimiters.
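The two analyses above can be sketched roughly as follows. This is a minimal illustration, not the prototype's code: the function names, the 0.2 and 0.9 thresholds, and the "same delimiter count in most records" test are all assumptions. It relies on the fact that plain ASCII bytes never have the most significant bit set, while EBCDIC letters and digits do. Text quotes are ignored here.

```python
from collections import Counter

def guess_encoding(sample: bytes, threshold: float = 0.2) -> str:
    """Guess "EBCDIC" or "ASCII" from the share of bytes whose top bit is set.

    The threshold is a hypothetical value, not one taken from the slides.
    """
    if not sample:
        raise ValueError("empty sample")
    high = sum(1 for b in sample if b & 0x80)
    return "EBCDIC" if high / len(sample) > threshold else "ASCII"

def guess_field_delimiter(records, candidates=(",", "|", "\t"), min_share=0.9):
    """Return the candidate that occurs the same non-zero number of times
    in at least min_share of the records, or None if no candidate does."""
    if not records:
        return None
    best_delim, best_share = None, 0.0
    for delim in candidates:
        # Histogram of "how many times does delim occur in each record".
        counts = Counter(rec.count(delim) for rec in records)
        count, freq = counts.most_common(1)[0]
        share = freq / len(records)
        if count > 0 and share >= min_share and share > best_share:
            best_delim, best_share = delim, share
    return best_delim
```

If no candidate delimiter lines up, a caller could treat the file as fixed or hybrid and move on to record length analysis.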

Content types (oracles) • Name prefix • First name • Last name • Full name • Name suffix • Directional • Street number • Street name • Street suffix • Post Office Box • Unit Designator • Unit number • Address one • Address two • Address Line • City • State • Zip code • Email • Phone number • Boolean [Figure: a string is passed to an oracle, which answers Yes or No.]
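An oracle, as described here, takes a string and answers yes or no. The sketch below shows what three such oracles might look like. The names, the regular expressions, and the registry dictionary are illustrative assumptions; the slides do not say how the prototype's oracles are implemented.

```python
import re
from typing import Callable, Dict

# An oracle maps a candidate string to a yes/no answer.
Oracle = Callable[[str], bool]

# Two-letter postal codes for the 50 states plus DC.
US_STATES = frozenset(
    "AL AK AZ AR CA CO CT DE FL GA HI ID IL IN IA KS KY LA ME MD "
    "MA MI MN MS MO MT NE NV NH NJ NM NY NC ND OH OK OR PA RI SC "
    "SD TN TX UT VT VA WA WV WI WY DC".split()
)

def zip_code_oracle(s: str) -> bool:
    """Yes if the string is exactly five digits (padding ignored)."""
    return re.fullmatch(r"\d{5}", s.strip()) is not None

def state_oracle(s: str) -> bool:
    """Yes if the string is a known two-letter state code."""
    return s.strip().upper() in US_STATES

def email_oracle(s: str) -> bool:
    """Yes for a loose something@domain.tld shape; deliberately permissive."""
    return re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", s.strip()) is not None

# Hypothetical registry: content type name -> oracle.
ORACLES: Dict[str, Oracle] = {
    "zip code": zip_code_oracle,
    "state": state_oracle,
    "email": email_oracle,
}
```

Keeping every oracle behind the same string-to-boolean signature is what lets the later steps treat content types uniformly.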

Record length analysis • For a fully fixed file, the record length must be determined first. Once the length is known, the file is treated as a hybrid file during field content identification. • Start with an initial length and try each possible length value until a well-known content type lines up consistently across records, as judged by that content type's oracle. • The prototype has been tested successfully on synthetic and real data.* • Future work in this area could include examining multiple content types to determine which provides the best evidence. *Two real files were used; initially only one returned a record length. After the relevant threshold was adjusted, both returned correct results.
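The search over candidate lengths might look like the sketch below. It assumes a single five-digit zip code oracle as the well-known content type, a fixed window width, and a 0.9 line-up threshold. All of these, along with the function names, are hypothetical choices rather than the prototype's actual parameters.

```python
import re

def zip_oracle(s: str) -> bool:
    """Yes if the window is exactly five digits."""
    return re.fullmatch(r"\d{5}", s) is not None

def infer_record_length(sample, oracle, width, min_len, max_len, threshold=0.9):
    """Try each candidate record length in ascending order.

    Returns (length, offset) for the first length at which the oracle says
    yes at the same 0-based offset in at least `threshold` of the records,
    or None if no candidate length lines up.
    """
    for length in range(max(min_len, width), max_len + 1):
        n_records = len(sample) // length
        if n_records < 2:
            break  # too few records left to call anything "lined up"
        records = [sample[i * length:(i + 1) * length] for i in range(n_records)]
        for offset in range(length - width + 1):
            hits = sum(oracle(rec[offset:offset + width]) for rec in records)
            if hits / n_records >= threshold:
                return length, offset
    return None

# Synthetic fixed file: 10-character name followed by a 5-digit zip code.
rows = [("ANN LEE", "72701"), ("BOB RAY", "10001"),
        ("CY YOUNG", "94105"), ("DEE DEE", "60601")]
sample = "".join(f"{name:<10}{zip_code}" for name, zip_code in rows)
print(infer_record_length(sample, zip_oracle, width=5, min_len=5, max_len=30))
# prints (15, 10)
```

Trying lengths in ascending order matters: any multiple of the true record length also lines up, so the first hit is the one to keep.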

Record length analysis • First step • Step N • Final step

Delimited field content identification • For delimited files, fields lie between delimiters. Once the file analysis is done, all field locations are known. • For a given field in the sample, each content oracle is consulted: every string in the field is sent to the oracle, which returns a “yes” or “no” answer. The percentage of yes answers from each oracle is computed and used to identify the content type of the field.
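A rough sketch of this per-field voting follows. The two toy oracles, the partial state set, the acceptance threshold, and the function name are assumptions made for illustration. The UNSPECIFIED label is borrowed from the results slides.

```python
import re

def zip_code_oracle(s: str) -> bool:
    return re.fullmatch(r"\d{5}", s.strip()) is not None

def state_oracle(s: str) -> bool:
    # Deliberately partial set, just enough for the example.
    return s.strip().upper() in {"AR", "NY", "IL", "CA", "TX"}

def identify_fields(records, delimiter, oracles, threshold=0.8):
    """Label each delimited field with the oracle that says yes most often.

    A field whose best oracle stays below `threshold` is left UNSPECIFIED.
    """
    if not records:
        return []
    rows = [rec.split(delimiter) for rec in records]
    labels = []
    for i in range(max(len(row) for row in rows)):
        column = [row[i] for row in rows if i < len(row)]
        # Share of "yes" answers per oracle for this field.
        scores = {name: sum(map(oracle, column)) / len(column)
                  for name, oracle in oracles.items()}
        best = max(scores, key=scores.get)
        labels.append(best if scores[best] >= threshold else "UNSPECIFIED")
    return labels

records = [
    "Ann Lee|Fayetteville|AR|72701",
    "Bob Ray|New York|NY|10001",
    "Cy Young|Chicago|IL|6060",   # malformed zip: tolerated below the threshold
]
oracles = {"state": state_oracle, "zip code": zip_code_oracle}
print(identify_fields(records, "|", oracles, threshold=0.6))
# prints ['UNSPECIFIED', 'UNSPECIFIED', 'state', 'zip code']
```

Scoring by percentage rather than requiring every record to pass is what lets a small number of bad or blank values through, in line with the prototype assumptions.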

Fixed field content identification • For hybrid files, records lie between record delimiters and have a fixed length. Once the file analysis is done, the record length is known. • To determine the length and starting position of each field, oracles are used in conjunction with combinatorial and statistical approaches. • At this point the fields may still be ambiguous, so guesses, called potential field positions (PFP), are generated.
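One way to picture PFP generation is a brute-force scan over every (start, width) window, keeping those that an oracle accepts in most records. The sketch below takes that approach. The exhaustive scan, the threshold, and all names are assumptions, and the slides do not describe the prototype's actual combinatorial method. Start positions here are 0-based.

```python
import re

def zip_code_oracle(s: str) -> bool:
    return re.fullmatch(r"\d{5}", s.strip()) is not None

def state_oracle(s: str) -> bool:
    # Deliberately partial set, just enough for the example.
    return s.strip().upper() in {"AR", "NY", "IL"}

def potential_field_positions(records, oracles, max_width, threshold=0.9):
    """Return (content type, start, width, share) for every window that an
    oracle accepts in at least `threshold` of the fixed-length records."""
    length = len(records[0])
    pfps = []
    for name, oracle in oracles.items():
        for start in range(length):
            for width in range(1, min(max_width, length - start) + 1):
                hits = sum(oracle(rec[start:start + width]) for rec in records)
                share = hits / len(records)
                if share >= threshold:
                    pfps.append((name, start, width, share))
    return pfps

# Layout: name (10 chars), state (3 chars, space padded), zip code (5 chars).
rows = [("ANN LEE", "AR", "72701"), ("BOB RAY", "NY", "10001"),
        ("CY YOUNG", "IL", "60601")]
records = [f"{name:<10}{state:<3}{zip_code}" for name, state, zip_code in rows]
for pfp in potential_field_positions(
        records, {"state": state_oracle, "zip code": zip_code_oracle}, max_width=6):
    print(pfp)
```

Because the oracles ignore padding, several overlapping windows qualify for the same field (for example, the zip code at start 13 with width 5 and at start 12 with width 6). That is the kind of ambiguity the decision step then has to resolve.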

Fixed field content identification (PFP analysis) [Figure: several candidate field positions in a record, each labeled “First name?”.]

Making a decision

Recent prototype enhancements • Web service – All functionality provided by the prototype can be accessed via a single function call. • buildLayout( fileSample: byte[] ): String – A web service endpoint was defined that invokes the preceding method • buildLayout( parameterData: byte[], fileSample: byte[] ): String – This definition assumes that the server running the engine might not have access to the data and thus the engine parameters and the file data sample must be passed to the endpoint as method parameters.

Recent prototype enhancements • Configuration file – User can set each threshold • Example: Setting the value indicating when fields line up during record length analysis – User can set each heuristic • Example: The size of the data sample. – User can set what delimiters to test for – User can specify what content types to identify by setting what oracles are loaded into the prototype

Recent prototype enhancements • Header record – The prototype can currently test for the existence of a header record – Future work would include using the information from the header record as extra evidence about the file • Assign a label to UNSPECIFIED fields • Correlate header information with the prototype’s results • Assumption: Only fully delimited files have header records. This can easily be extended to include fixed file types.

Recent prototype enhancements • Comparator program – Compares the XML output of the engine with an XML file representing the correct output – Lists correct and incorrect results separately – Provides simple statistics – Text dump to the console
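A comparator of this kind could be sketched as below. The XML schema shown (a `layout` root with `field` elements carrying `index`, `type`, `start`, and `end` attributes) is purely hypothetical, since the slides do not show the engine's output format. The sample values are taken from the synthetic fixed-file results later in the deck.

```python
import xml.etree.ElementTree as ET

def load_fields(xml_text):
    """Map field index -> (type, start, end) from a hypothetical layout XML."""
    root = ET.fromstring(xml_text)
    return {int(f.get("index")): (f.get("type"), f.get("start"), f.get("end"))
            for f in root.iter("field")}

def compare_layouts(expected_xml, actual_xml):
    """Return (correct indices, incorrect indices, accuracy)."""
    expected = load_fields(expected_xml)
    actual = load_fields(actual_xml)
    correct = sorted(i for i in expected if actual.get(i) == expected[i])
    incorrect = sorted(i for i in expected if actual.get(i) != expected[i])
    accuracy = len(correct) / len(expected) if expected else 0.0
    return correct, incorrect, accuracy

EXPECTED = """<layout>
  <field index="4" type="city" start="82" end="99"/>
  <field index="5" type="state" start="100" end="101"/>
  <field index="6" type="zip code" start="103" end="107"/>
</layout>"""

ACTUAL = """<layout>
  <field index="4" type="city" start="82" end="99"/>
  <field index="5" type="state" start="100" end="102"/>
  <field index="6" type="zip code" start="102" end="107"/>
</layout>"""

correct, incorrect, accuracy = compare_layouts(EXPECTED, ACTUAL)
print("correct:", correct)      # prints correct: [4]
print("incorrect:", incorrect)  # prints incorrect: [5, 6]
print(f"accuracy: {accuracy:.2f}")
```

A field counts as correct only if its type and both boundaries match, so an off-by-one boundary, as in the state and zip code fields above, is reported as incorrect.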

Recent prototype enhancements • Cross reference analysis – Correlate related fields such as zip code, city, and state • Vertical analysis – Examine the sample by columns rather than by record (horizontal) – Ex: Boolean vs. Directional • Contextual analysis – Ex: Middle name and street number fields • All are performed after the decision methodologies in an attempt to improve the results with extra evidence.
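The Boolean vs. Directional example can be illustrated with a small sketch. A single value such as "N" fits both content types, but the set of distinct values in the whole column usually fits only one. The value sets and the function name below are assumptions, not the prototype's definitions.

```python
# Hypothetical value sets for the two content types.
BOOLEAN_VALUES = frozenset({"Y", "N"})
DIRECTIONAL_VALUES = frozenset({"N", "S", "E", "W", "NE", "NW", "SE", "SW"})

def classify_column(values):
    """Classify a column by the set of distinct non-blank values it contains."""
    seen = {v.strip().upper() for v in values if v.strip()}
    if not seen:
        return "UNSPECIFIED"
    is_boolean = seen <= BOOLEAN_VALUES
    is_directional = seen <= DIRECTIONAL_VALUES
    if is_boolean and not is_directional:
        return "boolean"
    if is_directional and not is_boolean:
        return "directional"
    # Either both sets fit (e.g. only "N" was seen) or neither does.
    return "UNSPECIFIED"

print(classify_column(["N", "Y", "N", ""]))   # prints boolean
print(classify_column(["N", "S", "W", "N"]))  # prints directional
print(classify_column(["N", "N"]))            # prints UNSPECIFIED
```

A record-by-record oracle would answer yes for "N" under both content types; looking down the column is what separates them.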

Recent prototype enhancements • Decision functionality – Currently a work in progress – Previous decisions considered, in order, the content types, the PFP, and the corresponding statistical counts – The new logic reorders these steps to consider the PFP first, then the content types, and finally the corresponding statistical counts – The new logic appears to be a more generic solution to the decision-making process

Results for a synthetic fixed and hybrid data file

Actual record structure
Field   Content type   Start   End   Length
1       full name      1       27    27
2       address one    28      54    27
3       address two    55      81    27
4       city           82      99    18
5       state          100     101   2
6       zip code       103     107   5

Prototype results
Field   Content type   Start   End   Length
1       full name      1       27    27
2       address one    28      54    27
3       address two    55      81    27
4       city           82      99    18
5       state          100     102   3
6       zip code       102     107   6

Results for three real data files: File 3 (delimited)

Field   Actual record structure   Prototype results
1       Seq #                     * UNSPECIFIED *
2       filler                    * UNSPECIFIED *
3       name                      name
4-5     filler                    * UNSPECIFIED *
6       address line one          address line one
7       filler                    * UNSPECIFIED *
8       city                      city
9       state                     state
10      zip code                  zip code
11      zip plus four             zip plus four
12-14   filler                    * UNSPECIFIED *
15      name                      name
16-28   filler                    * UNSPECIFIED *
29      first name                first name
30      middle name               middle name
31      last name                 last name
32-56   filler                    * UNSPECIFIED *
57      address line one          address line one
58-67   filler                    * UNSPECIFIED *
68      directional               directional
69      street number             street number
70-71   filler                    * UNSPECIFIED *
72      street name               street name
73      filler                    * UNSPECIFIED *
74      street suffix             street suffix
75-81   filler                    * UNSPECIFIED *
82      city                      city
83      state                     state
84      zip code                  zip code
85-87   filler                    * UNSPECIFIED *
88      boolean                   directional
89      Seq #                     street number
