SLIDE 1
SLIDE 2
SLIDE 3
This demonstration is aimed at anyone with lots of text, unstructured or multi-format data to analyse. This could be lots of Word, PDF and text file formats, or data in various databases or spreadsheets, maybe loaded with acronyms and abbreviations, and with all the challenges presented by language. Big Data or just simply lots of data: text, categorical, numerical. This was the challenge the Air Lessons Cell faced: finding and analysing the lessons in reports, emails and databases to identify their combined effects. So: themes, trends over time or space, unusual effects, correlations, root causes or hypothesis tests.
The Air Lessons Cell solution was to build a capability using software, science and subject matter expertise to collate multi-format information and achieve Data Reduction to deliver Information Retrieval or evidence-based Collective Analysis.
SLIDE 4
The software conducts most of the tedious processing of text – called Text Mining – to identify the most relevant statements.
This may involve filtering, searching, producing frequency counts of terms or dictionary categories, and ultimately coding (or tagging) all text of possible interest. Text mining is usually the start of the process of Data Reduction, which gets lots of data down to as little as possible.
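As a rough illustration of the dictionary-based counting described above, the sketch below tallies how often each category's terms appear in a set of texts. The dictionary and category names are invented for illustration, not the Air Lessons Cell's:

```python
from collections import Counter

# Hypothetical mini-dictionary: category -> list of terms. Real dictionaries
# are far larger and hierarchical.
DICTIONARY = {
    "LESSONS": ["lesson", "recommend", "should"],
    "TRAINING": ["training", "exercise"],
}

def category_frequencies(texts, dictionary):
    """Count how often each dictionary category's terms appear in the texts."""
    counts = Counter()
    for text in texts:
        words = text.lower().split()
        for category, terms in dictionary.items():
            counts[category] += sum(words.count(t) for t in terms)
    return counts

reports = [
    "the lesson is that training should improve",
    "recommend more training and another exercise",
]
freqs = category_frequencies(reports, DICTIONARY)
```

The same counts, kept per document rather than totalled, are what later feed coding and cross-tabulation.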
The software used is a suite of three tools (QDA Miner, WordStat, SimStat) from Provalis Research. This was chosen in autumn 2009 following a review of text mining and identifying which software package/vendor was most suitable for the initial analysis of Air lessons data. The suite of tools from Provalis Research was chosen for three main reasons:
a. Good reviews. There were good reviews of the ability of the Provalis Research software to handle many formats of unstructured data, including some from the Operational Research community and applied to Air safety data.
b. Better analysis of individual terms. One of the Provalis Research tools offered better analysis of individual terms than the other software short-listed.
c. Highly competitive price. The Provalis Research tools were available to Government users at more competitive prices than other software.
http://provalisresearch.com/
SLIDE 5
The science provides an independent, evidence-based and testable analysis of the text and its metadata. Three main variables may be examined for analysis: the metadata, categories and coded text. The science is used to describe, compare and draw inferences from the text. The main issue is how representative the lessons are when drawing wider conclusions and inferences. Science is usually the middle stage of Data Reduction.
SLIDE 6
Subject Matter Expertise makes up for any shortcomings of the software and science by providing context and relevance to their outputs. Subject Matter Experts can also add their knowledge to compensate for missing data and read the most relevant data samples as required. Subject Matter Expertise is usually the last stage of Data Reduction.
SLIDE 7
A process was developed to use the software with lessons data to conduct Text Mining to provide Information Retrieval and Collective Analysis. The process starts with the gathering and preparation of data, which are then read into a project file for Data Reduction. Text Mining is then applied, using the software to process the text and the science to provide an independent view. If necessary, dictionaries may be used and information retrieved at this stage. If Collective Analysis is required then further Text Mining and Subject Matter Expertise are applied. Text Mining and Subject Matter Expertise may need to be conducted a number of times before suitable Collective Analysis is achieved. Project files may be combined to provide Big Data.
SLIDE 8
The fourth working area in a project file is the documents, which are shown in the middle. Documents contain any data that may be subject to text mining, so the main text of cases. For databases and spreadsheets a choice may need to be made as to which fields are to be documents.
http://www.jallc.nato.int/newsmedia/docs/JALLC_Explorer_Volume1_Issue1.pdf
SLIDE 9
The Content Analysis may start by using bespoke dictionaries that contain categories of similar terms grouped together. Categories can be hierarchical and selected individually or en masse as required. Dictionaries are useful to quickly categorise sets of text en masse and to answer open questions that are not looking for anything in particular, like "what key lessons are there from our experiences in Afghanistan?" On screen a dictionary comprising four sub-dictionaries is shown. 1 is the Analysis dictionary, which contains four lower-level categories including one to find lessons in text. Further down at 2C is the largest and most important dictionary, a dictionary of Target terms, in this case Defence. This dictionary is based on the UK Defence Taxonomy, which is a 4-level hierarchy of around 2,500 categories. Some of the next-level categories can be seen on screen. The UK Defence Taxonomy has been used as it provides a ready-made, approved hierarchy to categorise Defence terminology rather than inventing a new one.
SLIDE 10
Looking into the Analysis dictionary and its Lessons Finder category returns the terms in the Lessons Best Indicator category. The terms may be words, acronyms, abbreviations or phrases, and may include wildcards. Rules are used for terms with multiple meanings; on screen these can be seen prefixed with the @ sign. Here the rule states to include the terms "lesson*", "recommend*" and "should*" in the category only if they are not within 5 terms of those relating to DLIMS lessons management and role titles. This rule was required to automatically exclude the many irrelevant references in DLIMS lessons management like "... Recommend that this lesson should be closed..."
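The proximity rule described above can be sketched as a simple window check. The exclusion terms below are invented stand-ins for the DLIMS management and role-title terms, and the matching is simplified to word prefixes:

```python
import re

# Hypothetical exclusion list standing in for DLIMS management / role-title terms.
MANAGEMENT_TERMS = {"closed", "staffed", "dlims"}

def matches_lessons_rule(words, window=5):
    """Return indices of lesson*/recommend*/should* hits that are NOT within
    `window` terms of a lessons-management term (a sketch of the @ rule)."""
    hits = []
    for i, w in enumerate(words):
        if re.match(r"(lesson|recommend|should)", w):
            nearby = words[max(0, i - window): i + window + 1]
            if not any(n in MANAGEMENT_TERMS for n in nearby):
                hits.append(i)
    return hits

# Management chatter: "lesson" and "should" sit near "closed", so both are excluded.
managed = "that this lesson should be closed".split()
# A genuine lesson statement: "lesson" has no management term nearby, so it is kept.
genuine = "the key lesson was to brief crews earlier".split()
```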
SLIDE 11
Another example of terms within a dictionary category is shown now. Here, the category is 4 levels down in the Target Dictionary based on the UK Defence Taxonomy. The category is for Support Helicopters and lists the terms and one rule, which only includes the term "SH" when it is not after the term "SO2", thereby ignoring references to the role rather than the helicopter.
SLIDE 12
When Dictionaries and Options have been selected and set as required, the next step is to conduct a Frequency Count. Frequency Counts are broken into three parts: (a) Included categories or terms, those explicitly meeting any inclusion criteria stated in Options; (b) Leftover Words, those terms not meeting inclusion criteria; and (c) Unknown Terms, those that are not recognised from a formal language dictionary such as the Oxford English Dictionary. Frequency Counts output the frequencies with which categories (if using a Dictionary) or terms (if not) appear within the selected text. The screen shot is of the Inclusion tab of Categories from a Dictionary. For all the Frequency Counts it is possible to quickly view how they appear in the text using the Keyword-In-Context (KWIC) function and to see and code the sentences or paragraphs containing terms. Frequency outputs can be sorted alphabetically, by frequency, by the number of cases in which terms occur, and in various other ways. A suite of graphical and statistical techniques is available to view and further interrogate the frequency count. Summary statistics are available; these are shown in the box on the bottom right and reveal the number of words processed and the speed of processing amongst other things.
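The three-way split described above can be sketched as follows. The word lists here are tiny invented stand-ins for the inclusion dictionary and the formal language dictionary:

```python
from collections import Counter

# Tiny stand-ins (assumptions): an inclusion dictionary and a language word list.
INCLUDED = {"training", "aircraft"}
KNOWN_WORDS = {"the", "aircraft", "was", "ready", "for", "training"}

def three_part_count(texts):
    """Split term frequencies into Included / Leftover / Unknown, echoing the
    three tabs of the Frequency Count."""
    included, leftover, unknown = Counter(), Counter(), Counter()
    for text in texts:
        for word in text.lower().split():
            if word in INCLUDED:
                included[word] += 1
            elif word in KNOWN_WORDS:
                leftover[word] += 1
            else:
                unknown[word] += 1  # acronyms, abbreviations, spelling mistakes
    return included, leftover, unknown

inc, left, unk = three_part_count(["the aircraft was ready for trng"])
```

Here "trng" lands in Unknown, exactly the kind of abbreviation the Unknown Words tab is used to catch.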
SLIDE 13
When dictionaries are not selected and/or limits are placed on the number of categories or terms to include using the Options tab, the output will be a list of terms and their frequencies within the text. Frequency outputs can be sorted alphabetically, by frequency, by the number of cases in which terms occur, and in various other ways. The screen shot is of the Leftover Words tab of terms. Terms of interest can be added to dictionary categories by dragging and dropping them to the screen on the left. A suite of graphical and statistical techniques is available to view and further interrogate the frequency count. For all the Frequency Counts it is possible to quickly view how they appear in the text using the Keyword-In-Context (KWIC) function and to see and code the sentences or paragraphs containing terms.
SLIDE 14
The third frequency count, of Unknown Words, is particularly useful to identify acronyms, abbreviations and spelling mistakes. Terms of interest can be added to dictionary categories by dragging and dropping them to the screen on the left. For all the Frequency Counts it is possible to quickly view how they appear in the text using the Keyword-In-Context (KWIC) function and to see and code the sentences or paragraphs containing terms.
SLIDE 15
The Keyword-In-Context function puts the selected category or term in the middle of the screen and lists the words in the sentences around them. This enables the context to be quickly seen. Terms can be sorted by the terms before and after the keywords; this is useful to quickly identify keywords that are frequently paired with others.
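The KWIC idea is simple enough to sketch in a few lines, under the assumption of whitespace tokenisation and a fixed context window:

```python
def kwic(texts, keyword, window=4):
    """List (left context, keyword, right context) triples for every occurrence
    of `keyword` - a minimal Keyword-In-Context sketch."""
    rows = []
    for text in texts:
        words = text.lower().split()
        for i, w in enumerate(words):
            if w == keyword:
                left = " ".join(words[max(0, i - window): i])
                right = " ".join(words[i + 1: i + 1 + window])
                rows.append((left, w, right))
    # Sorting by the terms after the keyword surfaces frequent pairings.
    return sorted(rows, key=lambda r: r[2])

rows = kwic(["the training was cut short", "more training is needed"], "training")
```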
SLIDE 16
Another technique to examine text is the Phrase Finder. This shows the frequencies of phrases between 2 and 7 terms in length. Phrases of interest can be added to dictionary categories by dragging and dropping them to the screen on the left. For all the Frequency Counts it is possible to quickly view how they appear in the text using the Keyword-In-Context (KWIC) function and to see and code the sentences or paragraphs containing phrases.
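A Phrase Finder pass boils down to counting n-grams of 2 to 7 words, which can be sketched like this:

```python
from collections import Counter

def phrase_frequencies(texts, min_len=2, max_len=7):
    """Count all phrases (n-grams) of min_len..max_len words - a sketch of a
    Phrase Finder pass."""
    counts = Counter()
    for text in texts:
        words = text.lower().split()
        for n in range(min_len, max_len + 1):
            for i in range(len(words) - n + 1):
                counts[" ".join(words[i:i + n])] += 1
    return counts

phrases = phrase_frequencies([
    "close air support was requested",
    "close air support arrived late",
])
```

Recurring multi-word phrases like "close air support" rise to the top even when their individual words are common.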
SLIDE 17
The 2D Map shows the Cluster Analysis for all the categories previously selected. The purpose of this Cluster Analysis is to show which Target Dictionary categories are most linked with terms from the Important category; this linkage is shown in the large red cluster on the left. The location, size and number of major clusters will depend on the data (text) for the analysis. Broadly, clusters that are more centrally located have more links with other clusters. As there are around 350 categories in the Target Dictionary, the screen and clustering are very busy. One aim now is to continue Data Reduction by removing those categories that occur very little and/or are of no relevance for the analysis.
- Circle size is related to the amount of information in each category; the larger the circle, the more information.
- Terms (from categories) that occur together more often will be shown as circles that are closer together; these are called clusters.
- For most data, categories that occur more often will naturally have a higher chance of occurring with other high-occurring categories.
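The co-occurrence counts that drive circles closer together on the map can be sketched as follows; each "case" here is just a set of category labels, an invented stand-in for real coded cases:

```python
from collections import Counter
from itertools import combinations

def cooccurrence(cases):
    """Count how often pairs of categories occur in the same case; higher joint
    counts are what pull category circles closer together into clusters."""
    pair_counts = Counter()
    for categories in cases:
        for a, b in combinations(sorted(set(categories)), 2):
            pair_counts[(a, b)] += 1
    return pair_counts

cases = [
    {"Important", "Training", "Information"},
    {"Important", "Information"},
    {"Training", "Logistics"},
]
pairs = cooccurrence(cases)
```

Here Important and Information co-occur most, so their circles would sit closest together.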
SLIDE 18
Following Data Reduction - in this case removing those categories that occur very little and/or are of no relevance for the analysis - the Cluster Analysis is less busy and easier to read. Note that removing larger or many categories may lead to major changes in the relationships between those that remain and result in significantly different clusters.
SLIDE 19
Here the focus is on the red cluster on the left to show the Analysis Important category and the Target categories that are most linked with it; these are the categories the analysis should now concentrate on.
SLIDE 20
The Dendrogram shows the same cluster in an ordinal view. Using this technique suggests the categories most linked with importance are Information, Training, Reconnaissance and Maritime Patrol Aircraft, Communications and Information Systems, Interoperability, Network Systems, Intelligence, Artillery Ammunition, Attack Helicopters, UAV, bombs and Civilian Road Vehicles amongst others.
SLIDE 21
This returns 492 sentences, each containing a term coded as Important and a term from at least one of the selected Target categories. Other metadata can be output alongside each sentence. These sentences can be coded individually or en masse, for example they could be coded with a label as Important Selected Target.
SLIDE 22
Sequences between categories are assigned a positive or negative number which shows how their actual sequences differ from their expected sequences. Red indicates sequences that occur less often than expected; the more red, the less often than expected. Green indicates sequences that occur more often than expected; the more green, the more often than expected. On the screen, Context Awareness occurs less often than expected for many Resource categories, but for Information it occurs much more. That awareness and information appear together often makes sense and is unsurprising. Other reds and greens may be less obvious or more surprising, and therefore prompt investigation as to why.
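The observed-versus-expected comparison behind the reds and greens can be sketched as follows, assuming expected counts under independence of consecutive categories (the category names are invented):

```python
from collections import Counter

def sequence_deviation(category_stream):
    """Compare observed counts of consecutive category pairs with the counts
    expected if categories followed each other at random (independence)."""
    n = len(category_stream)
    total_pairs = n - 1
    singles = Counter(category_stream)
    pairs = Counter(zip(category_stream, category_stream[1:]))
    deviations = {}
    for (a, b), observed in pairs.items():
        expected = (singles[a] / n) * (singles[b] / n) * total_pairs
        deviations[(a, b)] = observed - expected  # > 0 ~ green, < 0 ~ red
    return deviations

stream = ["Awareness", "Information", "Awareness", "Information", "Resource"]
dev = sequence_deviation(stream)
```

Awareness followed by Information occurs more often than chance would suggest, so that cell would show green.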
SLIDE 23
Another technique, the Heatmap, uses colour to show which cross-tabbed categories (on the left) and independent variables (along the top) have higher percentage occurrences. This means that equally bright cells along any category could relate to different amounts of text, because the totals differ for the independent variables. So a cross tab with 3 out of 10 would be the same brightness cell as a cross tab with 30 out of 100, even though one cell contains 10 times as much text as the other; proportionally they are the same. Clustering can then be added to show which categories and independent variables have similar distributions. The screen shot shows that Information and Military Training Individual are very common categories across the independent variables of years. Other categories appear more sporadically; for example, Interoperability, and Command and Control, are proportionally much higher in 2003 than in any of the other years.
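The per-column normalisation that makes those cells comparable is just a percentage against each column total, as this small sketch shows (category and year values are illustrative):

```python
def heatmap_percentages(counts, column_totals):
    """Convert raw cross-tab counts to per-column percentages so cells are
    comparable even when the column totals differ."""
    return {
        (cat, col): 100.0 * n / column_totals[col]
        for (cat, col), n in counts.items()
    }

# 3 of 10 cases in 2003 vs 30 of 100 in 2004: same brightness (30%),
# even though one cell covers 10 times as much text.
cells = heatmap_percentages(
    {("Interoperability", 2003): 3, ("Interoperability", 2004): 30},
    {2003: 10, 2004: 100},
)
```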
SLIDE 24
Another technique, Correspondence Analysis, can be used to show commonalities and differences in the characteristics of data. Here, factors that are common or normal appear around the cross axis, whereas differences appear further away. So factors relating to themes would most likely appear around the cross axis. The text behind any of the codes can be seen by right-clicking and selecting Coding Retrieval. Whilst the independent variable on screen is Time, it could be Environment (to compare lessons between Air, Land and Fleet for example) or Nation (to compare commonalities and differences for lessons between different nations and organisations).
SLIDE 25
Regardless of the aims of an analysis and what techniques are used, often the output will be the text – sentences – relating to factors of interest. The text, its codes and various metadata can be output to Excel for further manipulation if required. As the number of sentences of interest can often be huge, the Air Lessons Cell created a macro to produce various cross tabs of text codes and metadata to enable a user to quickly see what is there and be taken to just the relevant data.
SLIDE 26
The macro produces a Summary Table from any two of the codes or metadata. The screen shot shows a cross tab of the Target Dictionary categories selected earlier that were most linked with the Analysis Important category. The user then selects a cross tab of interest, in this case Network Systems and Analysis Important, presses the Filter Pair button and is taken to a worksheet containing just those sentences.
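The Summary Table and Filter Pair steps can be sketched outside Excel as a pair-count plus a filter; the row structure (sentence, set of codes) and the sentences are invented for illustration:

```python
from collections import Counter

def summary_table(rows):
    """Cross-tab counts of code pairs, mimicking the macro's Summary Table;
    each row is (sentence, set_of_codes) - a hypothetical structure."""
    table = Counter()
    for _, codes in rows:
        for a in codes:
            for b in codes:
                if a < b:  # count each unordered pair once
                    table[(a, b)] += 1
    return table

def filter_pair(rows, a, b):
    """Return only the sentences coded with both codes (the Filter Pair step)."""
    return [s for s, codes in rows if a in codes and b in codes]

rows = [
    ("network latency was important", {"Important", "Network Systems"}),
    ("training complete", {"Training"}),
]
table = summary_table(rows)
selected = filter_pair(rows, "Important", "Network Systems")
```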
SLIDE 27
The screen shot shows the filter reducing the 493 sentences to 32. These all relate to the Analysis Important category and the Target categories selected earlier.
SLIDE 28
The graph shows that sampling dramatically reduces the amount of reading required. Along the bottom, even as the number of sentences increases from 1,000 on the left to 1,000,000 on the right, the sample size hardly changes. The dark blue line along the bottom, the lowest confidence of the four lines, remains around the 500 level regardless of population size. Even the turquoise line along the top only rises by a factor of 4 when the population rises 1,000-fold.
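The flatness of those curves follows from the standard sample-size formula with finite population correction. As a sketch (the confidence and margin values here are the common 95% / 5% defaults, not necessarily those used for the graph):

```python
import math

def sample_size(population, z=1.96, p=0.5, e=0.05):
    """Sample size with finite population correction:
    n0 = z^2 * p * (1 - p) / e^2, then n = n0 / (1 + (n0 - 1) / N).
    As N grows, n approaches n0, so the curve flattens."""
    n0 = (z ** 2) * p * (1 - p) / (e ** 2)
    return math.ceil(n0 / (1 + (n0 - 1) / population))

small = sample_size(1_000)      # population of 1,000 sentences
large = sample_size(1_000_000)  # population of 1,000,000 sentences
```

A thousand-fold larger population needs only a modestly larger sample, which is why sampling makes the reading tractable.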
SLIDE 29
It is possible to use the software to forecast some types of metadata. These would be useful for non database, spreadsheet and less structured sources like reports and emails, where metadata is not usually readily available. Metadata might include operations and exercises, nationalities, classification, author and locations. The forecast works by taking the key characteristics of existing data, associating them with metadata, and using them to forecast future data.
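The learn-then-forecast idea can be sketched as a very small word-overlap classifier; the labels, texts and scoring are invented stand-ins, far simpler than the software's actual method:

```python
from collections import Counter, defaultdict

def train(examples):
    """Learn word counts per metadata label from labelled text - a toy stand-in
    for associating key characteristics of existing data with metadata."""
    model = defaultdict(Counter)
    for text, label in examples:
        model[label].update(text.lower().split())
    return model

def forecast(model, text):
    """Predict the label whose training words overlap most with the new text."""
    words = set(text.lower().split())
    scores = {label: sum(c[w] for w in words) for label, c in model.items()}
    return max(scores, key=scores.get)

model = train([
    ("carrier deck operations at sea", "Maritime"),
    ("runway taxi and air refuelling", "Air"),
])
label = forecast(model, "sortie delayed by air refuelling slot")
```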
SLIDE 30