

  3. This demonstration is aimed at anyone with large amounts of text, or unstructured or multi-format data, to analyse. The data could be spread across many Word, PDF and text files, or held in various databases or spreadsheets, perhaps loaded with acronyms and abbreviations, and with all the challenges that language presents. It may be Big Data or simply lots of data: text, categorical or numerical. This was the challenge the Air Lessons Cell faced: finding and analysing the lessons in reports, emails and databases to identify their combined effects, such as themes, trends over time or space, unusual effects, correlations, root causes or hypothesis tests. The Air Lessons Cell solution was to build a capability using software, science and subject matter expertise to collate multi-format information and achieve data reduction, in order to deliver Information Retrieval or evidence-based Collective Analysis.

  4. The software conducts most of the tedious processing of text, called Text Mining, to identify the most relevant statements. This may involve filtering, searching, producing frequency counts of terms or dictionary categories, and ultimately coding (or tagging) all text of possible interest. Text Mining is usually the start of the process of Data Reduction, which gets lots of data down to as little as possible. The software used is a suite of three tools (QDA Miner, WordStat and SimStat) from Provalis Research. It was chosen in autumn 2009 following a review of text mining software to identify which package and vendor was most suitable for the initial analysis of Air lessons data. The suite was chosen for three main reasons: (a) Good reviews. There were good reviews of the ability of the Provalis Research software to handle many formats of unstructured data, including some from the Operational Research community and some where it had been applied to Air safety data. (b) Better analysis of individual terms. One of the Provalis Research tools offered better analysis of individual terms than the other short-listed software. (c) Highly competitive price. The Provalis Research tools were available to Government users at more competitive prices than other software. http://provalisresearch.com/

  5. The science provides an independent, evidence-based and testable analysis of the text and its metadata. Three main variables may be examined for analysis: the metadata, categories and coded text. The science is used to describe, compare and draw inferences from the text. The main issue is how representative the lessons are, and therefore whether wider conclusions and inferences can be drawn from them. Science is usually the middle stage of Data Reduction.

  6. Subject Matter Expertise makes up for any shortcomings of the software and science by providing context and relevance to their outputs. Subject matter experts can also add their knowledge to compensate for missing data, and can read the most relevant data, or samples of it, as required. Subject Matter Expertise is usually the last stage of Data Reduction.

  7. A process was developed to use the software with lessons data to conduct Text Mining and so provide Information Retrieval and Collective Analysis. The process starts with the gathering and preparation of data, which are then read into a project file for Data Reduction. Text Mining is then applied, using the software to process the text and the science to provide an independent view. If necessary, dictionaries may be used and information retrieved at this stage. If Collective Analysis is required, further Text Mining and Subject Matter Expertise are applied. Text Mining and Subject Matter Expertise may need to be applied a number of times before suitable Collective Analysis is achieved. Project files may be combined to provide Big Data.

  8. The fourth working area in a project file is the documents, which are shown in the middle. Documents contain any data that may be subject to text mining, typically the main text of cases. For databases and spreadsheets, a choice may need to be made as to which fields are to be treated as documents. http://www.jallc.nato.int/newsmedia/docs/JALLC_Explorer_Volume1_Issue1.pdf

  9. The Content Analysis may start by using bespoke dictionaries that contain categories of similar terms grouped together. Categories can be hierarchical and can be selected individually or en masse as required. Dictionaries are useful for quickly categorising sets of text en masse and for answering open questions that are not looking for anything in particular, like “what key lessons are there from our experiences in Afghanistan?” On screen, a dictionary comprising four sub-dictionaries is shown. Dictionary 1 is the Analysis dictionary, which contains four lower-level categories, including one to find lessons in text. Further down, at 2C, is the largest and most important dictionary, a dictionary of Target terms, in this case Defence. This dictionary is based on the UK Defence Taxonomy, which is a 4-level hierarchy of around 2,500 categories. Some of the next-level categories can be seen on screen. The UK Defence Taxonomy has been used because it provides a ready-made, approved hierarchy for categorising Defence terminology, rather than inventing a new one.
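The idea of a hierarchical dictionary with wildcard terms can be sketched in a few lines of Python. This is only an illustration, not how WordStat is implemented: the category paths, the terms in them and the function names `categorise` and `roll_up` are all assumptions made for the example.

```python
from collections import Counter
from fnmatch import fnmatchcase

# Hypothetical dictionary: category paths and terms are illustrative only,
# not taken from the real Analysis dictionary or the UK Defence Taxonomy.
DICTIONARY = {
    "Analysis/Lessons Finder": ["lesson*", "recommend*"],
    "Target/Air/Support Helicopters": ["chinook*", "merlin*"],
}

def categorise(text, dictionary):
    """Count how many tokens in `text` match each category's wildcard terms."""
    tokens = [t.strip('.,;:!?"()').lower() for t in text.split()]
    counts = Counter()
    for token in tokens:
        for path, patterns in dictionary.items():
            if any(fnmatchcase(token, p) for p in patterns):
                counts[path] += 1
    return counts

def roll_up(counts, level):
    """Aggregate counts to the first `level` components of each category path."""
    rolled = Counter()
    for path, n in counts.items():
        rolled["/".join(path.split("/")[:level])] += n
    return rolled

sample = "Lessons identified: recommend more Chinook crews. The Merlin lesson was noted."
by_category = categorise(sample, DICTIONARY)
by_top_level = roll_up(by_category, 1)
```

Rolling counts up the path is one simple way to model selecting categories individually or en masse at a higher level of the hierarchy.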

  10. Looking into the Analysis dictionary and its Lessons Finder category returns the terms in the Lessons Best Indicator category. The terms may be words, acronyms, abbreviations or phrases, and may include wildcards. Rules are used for terms with multiple meanings; on screen these can be seen prefixed with the @ sign. Here the rule states that the terms “lesson*”, “recommend*” and “should*” are only included in the category if they are not within 5 terms of any term relating to DLIMS lessons management and role titles. This rule was required to automatically exclude the many irrelevant references in DLIMS lessons management, like “... Recommend that this lesson should be closed...”
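A proximity rule of the kind described on this slide might be sketched as follows. This is a rough sketch under stated assumptions: the `EXCLUDERS` list is a hypothetical stand-in for the real DLIMS lessons management and role-title terms (which the slide does not list), "within 5 terms" is taken to mean five tokens on either side, and the function names are invented for the example.

```python
from fnmatch import fnmatchcase

INDICATORS = ["lesson*", "recommend*", "should*"]  # terms named on the slide
# Hypothetical stand-ins for the DLIMS management and role-title terms.
EXCLUDERS = {"dlims", "closed", "so2"}
WINDOW = 5  # assumption: five tokens before or after the indicator

def tokenize(text):
    """Lower-case whitespace tokens with surrounding punctuation stripped."""
    stripped = (t.strip('.,;:!?"()').lower() for t in text.split())
    return [t for t in stripped if t]

def lesson_hits(text):
    """Return (position, token) for indicator terms not near an excluder."""
    tokens = tokenize(text)
    hits = []
    for i, tok in enumerate(tokens):
        if not any(fnmatchcase(tok, p) for p in INDICATORS):
            continue
        nearby = tokens[max(0, i - WINDOW):i] + tokens[i + 1:i + 1 + WINDOW]
        if any(word in EXCLUDERS for word in nearby):
            continue  # looks like lessons-management housekeeping, skip it
        hits.append((i, tok))
    return hits

admin_note = "DLIMS update: recommend this lesson be closed"
real_lesson = "Crews should carry spare radios"
```

With these stand-in terms, `lesson_hits(admin_note)` returns an empty list, while `lesson_hits(real_lesson)` keeps the indicator "should".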

  11. Another example of terms within a dictionary category is shown now. Here, the category is 4 levels down in the Target Dictionary based on the UK Defence Taxonomy. The category is for Support Helicopters and lists the terms and one rule, which only includes the term “SH” when it is not after the term “SO2”, thereby ignoring references to the role rather than the helicopter.
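The "SH" rule can be approximated with a regular expression that uses a negative lookbehind. This is a sketch only: it assumes "not after" means "not immediately preceded by SO2 and a single whitespace character", and the names `SH_RULE` and `find_support_helicopter_mentions` are made up for the illustration.

```python
import re

# Assumption: "SH" is excluded only when directly preceded by "SO2" plus one
# whitespace character. The word boundaries stop matches inside longer words.
SH_RULE = re.compile(r"(?<!\bSO2\s)\bSH\b")

def find_support_helicopter_mentions(text):
    """Return the start offsets of "SH" occurrences that pass the rule."""
    return [m.start() for m in SH_RULE.finditer(text)]

sentence = "SO2 SH briefed the crews; two SH were tasked"
offsets = find_support_helicopter_mentions(sentence)
```

Here only the second "SH" is kept, because the first follows the role title "SO2".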

  12. When Dictionaries and Options have been selected and set as required, the next step is to conduct a Frequency Count. Frequency Counts are broken into three parts: (a) Included categories or terms, those explicitly meeting the inclusion criteria (if any) stated in Options; (b) Leftover Words, those terms not meeting the inclusion criteria; and (c) Unknown Terms, those that are not recognised from a formal language dictionary such as the Oxford English Dictionary. Frequency Counts output the frequencies with which categories (if using a Dictionary) or terms (if not) appear within the selected text. The screenshot is of the Inclusion tab of Categories from a Dictionary. For all the Frequency Counts it is possible to quickly view how they appear in the text using the Keyword-In-Context (KWIC) function, and to see and code the sentences or paragraphs containing terms. Frequency outputs can be sorted alphabetically, by frequency, by the number of cases in which terms occur, and in various other ways. A suite of graphical and statistical techniques is available to view and further interrogate the frequency count. Summary statistics are also available; these are shown in the box on the bottom right and reveal the number of words processed and the speed of processing, amongst other things.

  13. When dictionaries are not selected and/or limits are placed on the number of categories or terms to include using the Options tab, the output will be a list of terms and their frequencies within the text. Frequency outputs can be sorted alphabetically, by frequency, by the number of cases in which terms occur, and in various other ways. The screenshot is of the Leftover Words tab of terms. Terms of interest can be added to dictionary categories by dragging and dropping them onto the panel on the left. A suite of graphical and statistical techniques is available to view and further interrogate the frequency count. For all the Frequency Counts it is possible to quickly view how they appear in the text using the Keyword-In-Context (KWIC) function, and to see and code the sentences or paragraphs containing terms.

  14. The third frequency count, of Unknown Words, is particularly useful for identifying acronyms, abbreviations and spelling mistakes. Terms of interest can be added to dictionary categories by dragging and dropping them onto the panel on the left. For all the Frequency Counts it is possible to quickly view how they appear in the text using the Keyword-In-Context (KWIC) function, and to see and code the sentences or paragraphs containing terms.

  15. The Keyword-In-Context function puts the selected category or term in the middle of the screen and lists the words in the sentences around it. This enables the context to be seen quickly. Results can be sorted by the terms before and after the keywords, which is useful for quickly identifying keywords that are frequently paired with others.
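A Keyword-In-Context listing is straightforward to sketch. The version below is an assumption-laden simplification: it uses a fixed window of tokens rather than whole sentences, matches a single exact keyword rather than a dictionary category, and the function names are invented for the example.

```python
def kwic(tokens, keyword, width=4):
    """Return (left context, keyword, right context) for each occurrence."""
    rows = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            rows.append((left, token, right))
    return rows

def sort_by_next_word(rows):
    """Order rows by the word immediately after the keyword."""
    return sorted(rows, key=lambda row: row[2][0] if row[2] else "")

def format_row(row, column=30):
    """Render one row with the keyword aligned in a central column."""
    left, keyword, right = row
    return f"{' '.join(left):>{column}}  [{keyword}]  {' '.join(right)}"

tokens = "the sh unit flew while another sh crew waited".split()
rows = sort_by_next_word(kwic(tokens, "sh", width=2))
lines = [format_row(row) for row in rows]
```

Sorting on the following word groups occurrences such as "sh crew" and "sh unit" together, which is how frequent pairings become visible.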

  16. Another technique for examining text is the Phrase Finder. This shows the frequencies of phrases between 2 and 7 terms in length. Phrases of interest can be added to dictionary categories by dragging and dropping them onto the panel on the left. As with the Frequency Counts, it is possible to quickly view how phrases appear in the text using the Keyword-In-Context (KWIC) function, and to see and code the sentences or paragraphs containing them.
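Counting phrases of 2 to 7 terms amounts to counting n-grams. The sketch below is one plain way to do it and is not a description of how the Phrase Finder works internally; the `min_count` threshold and the function name are assumptions, and a real tool would probably avoid phrases that span sentence boundaries.

```python
from collections import Counter

def phrase_frequencies(tokens, min_len=2, max_len=7, min_count=2):
    """Count every run of min_len..max_len tokens; keep the repeated ones."""
    counts = Counter()
    for n in range(min_len, max_len + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    # min_count is an assumed filter so that one-off phrases are dropped.
    return {phrase: c for phrase, c in counts.items() if c >= min_count}

tokens = "support helicopter crews and support helicopter engineers".split()
repeated = phrase_frequencies(tokens)
```

For this input the only phrase that occurs more than once is "support helicopter".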
