Text File Layout Inference

  1. Text File Layout Inference
     Reid Phillips, Wingning Li, Craig Thompson
     University of Arkansas
     {rxp01, wingning, cwt}@uark.edu

  2. Outline
     • Review
       – Layout inference problem and properties
       – Prototype
         • Assumptions
         • Flow of control
         • Oracles
         • Record length analysis
         • Content identification
         • Making a decision
     • Current work
       – Recent prototype enhancements
       – Results and analysis
     • Future work

  3. Problem description
     [Figure: undefined data files → layout engine (infer file layout) → well-defined data files.]

  4. File layout definition
     • Character encoding
       – EBCDIC
       – ASCII
     • Delimiters
       – Field
       – Record
       – Text quote (string literals)
     • File type
       – Delimited
       – Fixed
       – Hybrid
     • Record length
     • Record structure
       – Field type
       – Field location
       – Field length*

  5. Input file types and properties

     Additional aspects   Record Delimiter   Field Delimiter    Character Encoding   Text Quote
     Current scope        CR, LF, CR-LF      Comma, Pipe, Tab   ASCII, EBCDIC        Single quote, double quote

     File Types   Record Length   Record Delimiter   Field Length   Field Delimiter
     Delimited    Variable        Yes                Variable       Yes
     Fixed        Fixed           No                 Fixed          No
     Hybrid       Fixed           Yes                Fixed          No
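The file-type properties on this slide can be read as a decision rule: no record delimiter suggests a fixed file, a record delimiter plus a consistent field delimiter suggests a delimited file, and a record delimiter alone suggests a hybrid file. The following is a minimal sketch of that reading, not the prototype's code. The function names, the `min_share` threshold, and the "same delimiter count in most records" test are assumptions made for illustration, and text quotes are ignored.

```python
from collections import Counter
from typing import List, Optional

# Delimiters named on the slides; CR-LF is checked before its single-byte parts.
RECORD_DELIMITERS = ["\r\n", "\n", "\r"]
FIELD_DELIMITERS = [",", "|", "\t"]


def find_record_delimiter(sample: str) -> Optional[str]:
    """Return the first known record delimiter present in the sample, if any."""
    for delimiter in RECORD_DELIMITERS:
        if delimiter in sample:
            return delimiter
    return None


def find_field_delimiter(records: List[str], min_share: float = 0.9) -> Optional[str]:
    """Return a delimiter whose nonzero per-record count is the same in
    at least min_share of the records (hypothetical threshold)."""
    best = None
    for delimiter in FIELD_DELIMITERS:
        counts = Counter(record.count(delimiter) for record in records)
        count, frequency = counts.most_common(1)[0]
        if count > 0 and frequency / len(records) >= min_share:
            if best is None or count > best[1]:
                best = (delimiter, count)
    return best[0] if best else None


def infer_file_type(sample: str) -> str:
    """Classify a text sample as 'delimited', 'hybrid', or 'fixed'."""
    record_delimiter = find_record_delimiter(sample)
    if record_delimiter is None:
        return "fixed"
    records = [r for r in sample.split(record_delimiter) if r]
    if records and find_field_delimiter(records) is not None:
        return "delimited"
    return "hybrid"
```

A real file with commas inside address fields would need quote handling and a looser consistency test than this sketch uses.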

  6. File layout definition
     • Character encoding
       – EBCDIC
       – ASCII
     • Delimiters
       – Field { Comma, Pipe, Tab }
       – Record { CR, LF, CR-LF }
       – Text quote { “, ‘ }
     • File type
       – Delimited
       – Fixed
       – Hybrid
     • Record length
     • Record structure
       – Field type
       – Field location
       – {Field length}

  7. Prototype assumptions
     • The file is structured.
     • The file contains U.S. customer data.
     • Content types are name, address, phone, and email.
     • Errors and blank fields are allowed in a statistically small number of records.
     • Field delimiters and record delimiters are the known ones shown earlier.
     • Encoding is ASCII or EBCDIC.

  8. Prototype process overview
     Key elements:
     1. Sampling
     2. Encoding analysis
     3. File type analysis
     4. File length analysis
     5. Delimited field content id
     6. Fixed field content id
     7. Oracles and their applications
     8. XML layout
     [Figure: flow-of-control diagram with boxes for file sampling, encoding analysis (ASCII internal encoding), file type analysis (delimited, fixed length, hybrid), oracle setup, delimited and fixed field content id analysis, and the XML data layout.]

  9. Encoding and file type analysis
     • File encoding analysis is based on statistical differences between character encodings. To tell ASCII from EBCDIC, the most significant bit of each byte is used.
     • File type analysis is based on the assumed field delimiters and record delimiters and on statistical measurements of their occurrences.
     • The prototype has performed very well for both analyses in tests on real and synthetic data.
     • Future work in this area could include support for additional character encodings, such as Latin-1 and Unicode, and for additional delimiters.
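The most-significant-bit test can be sketched as follows. ASCII is a 7-bit code, so the high bit is never set, while EBCDIC places letters and digits at 0x81 and above. The function name and the 0.5 threshold are assumptions for illustration, not values taken from the prototype. The demonstration uses Python's `cp037` codec as one example of an EBCDIC code page.

```python
def guess_encoding(sample: bytes, threshold: float = 0.5) -> str:
    """Guess 'ASCII' or 'EBCDIC' from the share of bytes with the high bit set.

    threshold is a hypothetical cut-off: EBCDIC text is mostly letters and
    digits (high bit set), with spaces at 0x40 (high bit clear).
    """
    if not sample:
        raise ValueError("empty sample")
    high_bit_bytes = sum(1 for byte in sample if byte & 0x80)
    return "EBCDIC" if high_bit_bytes / len(sample) >= threshold else "ASCII"


text = "JOHN SMITH 72701"
print(guess_encoding(text.encode("ascii")))  # ASCII
print(guess_encoding(text.encode("cp037")))  # EBCDIC
```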

  10. Content types (oracles)
      • Content types: name prefix, first name, last name, name suffix, full name, directional, street number, street name, street suffix, unit designator, unit number, Post Office Box, address one, address two, address line, city, state, zip code, email, phone number, boolean
      • An oracle takes a string and answers yes or no.
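An oracle, as described here, is a yes/no test on a single string. The sketch below shows what a few such oracles might look like. The regular expressions, the value sets, and the `ORACLES` registry are illustrative assumptions, not the prototype's rules.

```python
import re
from typing import Callable, Dict

Oracle = Callable[[str], bool]

# Two-letter postal abbreviations for the 50 states plus DC.
US_STATES = set(
    "AL AK AZ AR CA CO CT DE FL GA HI ID IL IN IA KS KY LA ME MD MA MI MN MS "
    "MO MT NE NV NH NJ NM NY NC ND OH OK OR PA RI SC SD TN TX UT VT VA WA WV "
    "WI WY DC".split()
)


def is_zip_code(text: str) -> bool:
    """Five digits, optionally followed by a hyphen and four more digits."""
    return re.fullmatch(r"\d{5}(-\d{4})?", text.strip()) is not None


def is_state(text: str) -> bool:
    """A two-letter U.S. state abbreviation, ignoring case and padding."""
    return text.strip().upper() in US_STATES


def is_email(text: str) -> bool:
    """A deliberately loose shape check: something@something.something."""
    return re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", text.strip()) is not None


# Hypothetical registry mapping content type names to oracles.
ORACLES: Dict[str, Oracle] = {
    "zip code": is_zip_code,
    "state": is_state,
    "email": is_email,
}
```

Keeping every oracle behind the same string-to-boolean signature lets later analysis steps treat content types uniformly.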

  11. Record length analysis
      • For a fully fixed file, the record length must be determined. Once the length is known, the file is treated as a hybrid file for field content identification.
      • Start with an initial length and try each possible length value until a well-known content type lines up across records according to that content type's oracle.
      • The prototype has been tested successfully on synthetic and real data.*
      • Future work in this area could include examining multiple content types and determining which provide the best evidence.
      * Two real files were used, and initially only one returned a record length. After a threshold was adjusted, both returned correct results.
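The search over candidate lengths can be sketched as below, assuming a single oracle with a known field width (a five-digit zip code here). The function name, the parameters, and the 0.9 threshold are hypothetical. The sketch accepts the first length at which some fixed offset passes the oracle in enough records, which is one possible reading of "lined up".

```python
import re
from typing import Callable, Optional


def is_zip(text: str) -> bool:
    """Illustrative oracle: exactly five digits."""
    return re.fullmatch(r"\d{5}", text) is not None


def infer_record_length(data: str, oracle: Callable[[str], bool], width: int,
                        min_len: int, max_len: int,
                        threshold: float = 0.9) -> Optional[int]:
    """Return the smallest candidate length at which one offset passes the
    oracle in at least `threshold` of the records, or None if none does."""
    for length in range(min_len, max_len + 1):
        n_records = len(data) // length
        if n_records < 2:
            break  # too few records to judge alignment
        records = [data[i * length:(i + 1) * length] for i in range(n_records)]
        for offset in range(length - width + 1):
            hits = sum(oracle(r[offset:offset + width]) for r in records)
            if hits / n_records >= threshold:
                return length
    return None


# Synthetic fixed file: 15-character name field followed by a 5-digit zip code.
rows = [("SMITH", "72701"), ("JONES", "75001"), ("BAKER", "73019"),
        ("LEE", "72703"), ("CLARK", "64101"), ("DAVIS", "38103")]
data = "".join(f"{name:<15}{zip_code}" for name, zip_code in rows)
print(infer_record_length(data, is_zip, width=5, min_len=5, max_len=40))  # 20
```

At wrong lengths the zip codes drift across offsets from record to record, so no single offset reaches the threshold.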

  12. Record length analysis
      [Figure: the analysis shown at the first step, at step N, and at the final step.]

  13. Delimited field content identification
      • For delimited files, fields lie between delimiters. Once the file analysis is done, all field locations are known.
      • For a given field in the sample, each content oracle is consulted: every string in the field is sent to the oracle, which answers yes or no. The percentage of yes answers from each oracle is computed and used to identify the content type of the field.
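The yes-percentage computation can be sketched as follows. The two oracles, the 0.8 threshold, and the function name are assumptions for illustration. The UNSPECIFIED label is borrowed from the prototype's results. Text quotes are ignored in this sketch.

```python
import re
from typing import Callable, Dict, List


def is_zip_code(text: str) -> bool:
    return re.fullmatch(r"\d{5}", text.strip()) is not None


def is_state(text: str) -> bool:
    # Small illustrative subset of state abbreviations.
    return text.strip().upper() in {"AR", "TX", "OK"}


def identify_fields(records: List[str], delimiter: str,
                    oracles: Dict[str, Callable[[str], bool]],
                    threshold: float = 0.8) -> List[str]:
    """Label each delimited field with the oracle that says yes most often,
    or UNSPECIFIED when no oracle reaches the threshold."""
    rows = [record.split(delimiter) for record in records]
    n_fields = max(len(row) for row in rows)
    labels = []
    for i in range(n_fields):
        column = [row[i] for row in rows if i < len(row)]
        scores = {name: sum(oracle(value) for value in column) / len(column)
                  for name, oracle in oracles.items()}
        best = max(scores, key=scores.get)
        labels.append(best if scores[best] >= threshold else "UNSPECIFIED")
    return labels


sample = ["Smith,AR,72701", "Jones,TX,75001", "Baker,OK,73019"]
oracles = {"zip code": is_zip_code, "state": is_state}
print(identify_fields(sample, ",", oracles))
# ['UNSPECIFIED', 'state', 'zip code']
```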

  14. Fixed field content identification
      • For hybrid files, records lie between record delimiters and have a fixed length. Once the file analysis is done, the record length is known.
      • To determine the length and starting position of each field, oracles are used in conjunction with combinatorial and statistical approaches.
      • At this point the fields may be ambiguous, so guesses, called potential field positions (PFP), are generated.

  15. Fixed field content identification (PFP analysis)
      [Figure: a data sample with several candidate field positions, each labeled “First name?”.]

  16. Making a decision

  17. Recent prototype enhancements
      • Web service
        – All functionality provided by the prototype can be accessed via a single function call.
          • buildLayout( fileSample: byte[] ): String
        – A web service endpoint was defined that invokes the preceding method.
          • buildLayout( parameterData: byte[], fileSample: byte[] ): String
        – This definition assumes that the server running the engine might not have access to the data, so the engine parameters and the file data sample must be passed to the endpoint as method parameters.

  18. Recent prototype enhancements
      • Configuration file
        – The user can set each threshold.
          • Example: the value indicating when fields line up during record length analysis.
        – The user can set each heuristic.
          • Example: the size of the data sample.
        – The user can set which delimiters to test for.
        – The user can specify which content types to identify by setting which oracles are loaded into the prototype.

  19. Recent prototype enhancements
      • Header record
        – The prototype can currently test for the existence of a header record.
        – Future work would include using the information from the header record as extra evidence about the file.
          • Assign a label to UNSPECIFIED fields.
          • Match header information against the prototype’s results.
        – Assumption: only fully delimited files have header records. This can easily be extended to include fixed file types.

  20. Recent prototype enhancements
      • Comparator program
        – Compares the XML output of the engine with an XML file representing the correct output
        – Lists correct and incorrect results separately
        – Provides simple statistics
        – Dumps text to the console
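A comparator of this kind might look like the sketch below. The slides do not show the engine's XML schema, so the `<layout>` and `<field index="..." type="..."/>` structure is a hypothetical stand-in, as are the function names. Only the Python standard library's `xml.etree.ElementTree` is used.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List, Tuple


def load_fields(xml_text: str) -> Dict[str, str]:
    """Map each field index to its content type (hypothetical schema)."""
    root = ET.fromstring(xml_text)
    return {field.get("index"): field.get("type") for field in root.iter("field")}


def compare_layouts(expected_xml: str, actual_xml: str) -> Tuple[List[str], List[str], float]:
    """Return (correct indexes, incorrect indexes, share of correct fields)."""
    expected = load_fields(expected_xml)
    actual = load_fields(actual_xml)
    correct = sorted(k for k in expected if actual.get(k) == expected[k])
    incorrect = sorted(k for k in expected if actual.get(k) != expected[k])
    accuracy = len(correct) / len(expected) if expected else 0.0
    return correct, incorrect, accuracy


expected_xml = '<layout><field index="82" type="city"/><field index="88" type="boolean"/></layout>'
actual_xml = '<layout><field index="82" type="city"/><field index="88" type="directional"/></layout>'

correct, incorrect, accuracy = compare_layouts(expected_xml, actual_xml)
print("correct:", correct)      # correct: ['82']
print("incorrect:", incorrect)  # incorrect: ['88']
print("accuracy:", accuracy)    # accuracy: 0.5
```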

  21. Recent prototype enhancements
      • Cross reference analysis
        – Correlate related fields such as zip code, city, and state
      • Vertical analysis
        – Examine the sample by columns rather than by record (horizontal)
        – Example: Boolean vs. Directional
      • Contextual analysis
        – Example: middle name and street number fields
      • All are performed after the decision methodologies in an attempt to improve the results with extra evidence.
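The Boolean vs. Directional example suggests why a column-wise view helps: a single value such as "N" passes both oracles, but the set of distinct values in a whole column can separate them. The sketch below is one guess at such a check, not the prototype's method. The value sets, the function name, and the choice to prefer "boolean" when a column fits both are assumptions.

```python
from typing import Iterable

# Illustrative value sets; "N" belongs to both, which is the source of ambiguity.
BOOLEAN_VALUES = {"Y", "N"}
DIRECTIONAL_VALUES = {"N", "S", "E", "W", "NE", "NW", "SE", "SW"}


def classify_column(values: Iterable[str]) -> str:
    """Classify a whole column by its distinct non-blank values."""
    distinct = {value.strip().upper() for value in values if value.strip()}
    if not distinct:
        return "UNSPECIFIED"
    if distinct <= BOOLEAN_VALUES:
        return "boolean"  # also chosen when every value is "N" (ambiguous)
    if distinct <= DIRECTIONAL_VALUES:
        return "directional"
    return "UNSPECIFIED"


print(classify_column(["N", "Y", "N"]))      # boolean
print(classify_column(["N", "S", "W", ""]))  # directional
```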

  22. Recent prototype enhancements
      • Decision functionality
        – Currently a work in progress
        – Previous decisions were based on, in order, the content types, the PFP, and the corresponding statistical counts.
        – The new logic reorders these steps to consider the PFP first, then the content types, and finally the corresponding statistical counts.
        – The new logic appears to be a more generic solution to the decision-making process.

  23. Results for a synthetic fixed and hybrid data file

      Actual record structure
      #   Content type   Start   End   Length
      1   full name          1    27       27
      2   address one       28    54       27
      3   address two       55    81       27
      4   city              82    99       18
      5   state            100   101        2
      6   zip code         103   107        5

      Prototype results
      #   Content type   Start   End   Length
      1   full name          1    27       27
      2   address one       28    54       27
      3   address two       55    81       27
      4   city              82    99       18
      5   state            100   102        3
      6   zip code         102   107        6

  24. Results for three real data files: File 3 (delimited)

      Field   Actual record structure   Prototype results
      1       Seq #                     * UNSPECIFIED *
      2       filler                    * UNSPECIFIED *
      3       name                      name
      4-5     filler                    * UNSPECIFIED *
      6       address line one          address line one
      7       filler                    * UNSPECIFIED *
      8       city                      city
      9       state                     state
      10      zip code                  zip code
      11      zip plus four             zip plus four
      12-14   filler                    * UNSPECIFIED *
      15      name                      name
      16-28   filler                    * UNSPECIFIED *
      29      first name                first name
      30      middle name               middle name
      31      last name                 last name
      32-56   filler                    * UNSPECIFIED *
      57      address line one          address line one
      58-67   filler                    * UNSPECIFIED *
      68      directional               directional
      69      street number             street number
      70-71   filler                    * UNSPECIFIED *
      72      street name               street name
      73      filler                    * UNSPECIFIED *
      74      street suffix             street suffix
      75-81   filler                    * UNSPECIFIED *
      82      city                      city
      83      state                     state
      84      zip code                  zip code
      85-87   filler                    * UNSPECIFIED *
      88      boolean                   directional
      89      Seq #                     street number
