


SLIDE 1

Grey Relational Analysis and Natural Language Processing

Arjab Singh Khuman (1), Yingjie Yang (1), Sifeng Liu (2)

(1) Centre for Computational Intelligence, De Montfort University, Leicester, United Kingdom

(2) College of Economics and Management, Nanjing University of Aeronautics and Astronautics, Nanjing, China

September 2015

SLIDE 2

Outline for the Presentation

1. Introduction
2. Natural Language Processing
3. Grey Relational Analysis
4. Proposal
5. Observations
6. Conclusion

  • A. S. Khuman

(C.C.I.) The Leverhulme Trust September 2015 2 / 30

SLIDE 3

Introduction

  • We investigate the validity of using Grey Relational Analysis for Natural Language Processing

  • We provide a theoretical overview from which further research can be undertaken

  • We describe what Grey Relational Analysis and Natural Language Processing entail

  • We look towards the use of Grey Incidence Analysis for inspection and quantification

  • Understanding the traditional use of Grey Incidence allows one to better understand our intended use for Natural Language Processing

  • We describe the varying components of our framework, highlighting problem areas and possible solutions

  • We conclude and put forward suggestions for possible enhancements

SLIDE 5

Natural Language Preliminaries

  • Natural Language Processing is primarily concerned with the interaction between machines and human language

  • It has been an active topic within Computer Science and Artificial Intelligence since the 1950s

  • It is an umbrella term that encompasses many sub-domains, including Natural Language Understanding, which is associated with deriving meaning and sentiment

  • There are many examples of experiments and programs associated with Natural Language Processing

  • The Georgetown experiment in 1954, in which over 60 Russian sentences were automatically translated into interpretable English equivalents

  • The creation of ELIZA, a system which simulated a person-centred (Rogerian) psychotherapist

SLIDE 6

Natural Language Preliminaries

  • The 1970s saw the introduction of conceptual ontologies, which structured real-world information into data that was machine understandable

  • The likes of MARGIE, SAM, PAM and POLITICS are all examples of conceptual ontology programs

  • The introduction of chatterbots, programs that could interact with users and engage in simple conversation, at least to some extent

  • The likes of PARRY, a program written to simulate a paranoid schizophrenic

  • Racter, which was supposedly able to generate English-language prose: short pieces of grammatically structured work with a rudimentary natural flow

  • Jabberwacky, a chatterbot created to simulate natural human chatter in an interesting, entertaining manner

SLIDE 7

Natural Language Preliminaries

  • Modern Natural Language Processing algorithms are based on machine learning, in particular statistical machine learning

  • Prior implementations of language-processing tasks typically involved hard-coding a large number of deterministic rules

  • Modern machine learning algorithms are still firmly rooted in statistical inference

  • There are several different classes of machine learning algorithm, which operate in similar ways: they take as input large sets of features obtained from the input data

  • The current trend is still very much to make use of statistical models, which allow for soft, probabilistic decisions based on attaching a weight to each identified input feature

  • Certain characteristics of Natural Language Processing make it well suited to Grey Theory

SLIDE 9

Grey Relational Analysis Preliminaries

  • Grey Relational Analysis falls under the remit of Grey Incidence Analysis, whose main aim is to understand which factors of a system are more important than others

  • It establishes which factors can be identified as favourable and, equally, which factors are detrimental

  • A characteristic sequence, one that represents an ideal of the system, is compared against behavioural factor sequences to ascertain how alike the sequences are, or how much the behavioural factors impact upon the characteristic sequence itself

  • This information can then be used to identify whether more emphasis should be applied to a particular behaviour

  • Incidence analysis is mainly used for the inspection of a system, and there is little to no literature on its use for Natural Language Processing

SLIDE 10

Grey Relational Analysis Preliminaries

  • The characteristic sequences of a system, Y1, Y2, ..., Yn, are compared against its behavioural factor sequences, X1, X2, ..., Xm, all of which must be of the same magnitude (length)

  • This gives the matrix Γ = [γij], where each entry in the ith row is the degree of grey incidence between the characteristic sequence Yi and the behavioural factors X1, X2, ..., Xm

  • Each entry in the jth column is the degree of grey incidence between the characteristic sequences Y1, Y2, ..., Yn and the behavioural factor Xj

  • For the inspection and analysis of the sequences, there are several variations of the degree of incidence one could employ

  • However, we are concerned only with the absolute degree of grey incidence
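Written out in full, the matrix described above has one row per characteristic sequence and one column per behavioural factor. The layout below is only a restatement of the bullets in matrix form; the n × m shape follows from the index ranges given on the slide.

```latex
\Gamma = [\gamma_{ij}] =
\begin{bmatrix}
\gamma_{11} & \gamma_{12} & \cdots & \gamma_{1m} \\
\gamma_{21} & \gamma_{22} & \cdots & \gamma_{2m} \\
\vdots      & \vdots      & \ddots & \vdots      \\
\gamma_{n1} & \gamma_{n2} & \cdots & \gamma_{nm}
\end{bmatrix},
\qquad
\gamma_{ij} = \text{degree of grey incidence between } Y_i \text{ and } X_j
```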

SLIDE 11

Degrees of Grey Incidence

Absolute degree of grey incidence

Assume that Xi and Xj ∈ U are two sequences of data with the same magnitude (length), where length is defined as the sum of the distances between consecutive time points, and whose zero-starting-point images $X_i^0 = X_i - x_i(1)$ and $X_j^0 = X_j - x_j(1)$ have already been computed. Then:

$$ s_i = \int_1^n \bigl(X_i - x_i(1)\bigr)\,dt, \qquad s_j = \int_1^n \bigl(X_j - x_j(1)\bigr)\,dt \tag{1} $$

$$ s_i - s_j = \int_1^n \bigl(X_i^0 - X_j^0\bigr)\,dt \tag{2} $$

  • The absolute degree is associated with the absolute relationships that exist between characteristic sequences and their behavioural sequences
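For reference, the quantities in (1) and (2) are usually turned into a single coefficient as sketched below. The slide does not show this step, so the symbol $\varepsilon_{ij}$ and the discrete sums are assumptions taken from the commonly cited form of the absolute degree in the grey systems literature, not from the presentation itself.

```latex
% Discrete evaluation of the integrals for a sequence of n equally spaced points,
% with x_i^0(k) = x_i(k) - x_i(1):
|s_i| = \left| \sum_{k=2}^{n-1} x_i^0(k) + \tfrac{1}{2}\, x_i^0(n) \right|,
\qquad
|s_i - s_j| = \left| \sum_{k=2}^{n-1} \bigl(x_i^0(k) - x_j^0(k)\bigr)
              + \tfrac{1}{2}\bigl(x_i^0(n) - x_j^0(n)\bigr) \right|

% Absolute degree of grey incidence:
\varepsilon_{ij} = \frac{1 + |s_i| + |s_j|}{1 + |s_i| + |s_j| + |s_i - s_j|}
```

Under this form the coefficient equals 1 when $s_i = s_j$ and decreases towards 0 as $|s_i - s_j|$ grows.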

SLIDE 13

The Concept

  • We are interested only in the analysis of the sequences

  • Assume that a hard-wired linguistic sequence in the system executes an associated command; this sequence can be treated as a characteristic sequence

  • Assume also that a user input stream is presented to the system; this is a behavioural sequence, and incidence analysis can be carried out to establish how similar or dissimilar the two sequences are

  • If the returned coefficient surpasses a threshold value, the associated output command is executed

  • This harks back to the fact that the more recent Natural Language Processing algorithms make use of statistical models

  • These allow soft, probabilistic decisions to be made, with the advantage of expressing relative certainty for any number of possible answers rather than just one

SLIDE 14

The Concept

  • Multiple input streams could be compared with multiple target streams in a pairwise manner to establish which input is best suited to which output

  • This is achieved by measuring the space enclosed between the geometric curves of the sequences being compared

  • As the sequences themselves are made up of discretised data points, pointwise comparisons can be made to gauge the relative similarity between sequences

  • The absolute degree of grey incidence provides the means of computation, returning a single coefficient

  • The coefficient falls within the range [0, 1]: the more similar the sequences are, the closer to 1 the coefficient will be, and vice versa
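The threshold idea can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the command table `TARGETS`, the cut-off `THRESHOLD = 0.9` and the function names are all hypothetical, and `absolute_degree` uses the commonly cited discrete form of the absolute degree of grey incidence, which the slides do not spell out.

```python
def absolute_degree(x, y):
    """Absolute degree of grey incidence for two equal-length sequences
    (assumed discrete form; not given explicitly in the slides)."""
    if len(x) != len(y):
        raise ValueError("sequences must have the same length")
    # Zero-starting-point images: subtract the first value from every entry.
    x0 = [v - x[0] for v in x]
    y0 = [v - y[0] for v in y]
    # Discrete stand-in for the integrals s_i and s_j.
    s_x = sum(x0[1:-1]) + 0.5 * x0[-1]
    s_y = sum(y0[1:-1]) + 0.5 * y0[-1]
    num = 1 + abs(s_x) + abs(s_y)
    return num / (num + abs(s_x - s_y))


# Hypothetical command table: target (characteristic) sequence per command,
# using the a = 1, ..., z = 26 coding.
TARGETS = {
    "open": [15, 16, 5, 14],       # o, p, e, n
    "close": [3, 12, 15, 19, 5],   # c, l, o, s, e
}
THRESHOLD = 0.9  # assumed cut-off, chosen only for illustration


def dispatch(input_seq):
    """Return the best-matching command if its coefficient reaches the threshold."""
    best_cmd, best_score = None, 0.0
    for cmd, target in TARGETS.items():
        if len(target) != len(input_seq):
            continue  # only sequences of the same length can be compared
        score = absolute_degree(target, input_seq)
        if score > best_score:
            best_cmd, best_score = cmd, score
    return best_cmd if best_score >= THRESHOLD else None


# Several input streams compared against every target in turn.
for stream in ([15, 16, 5, 14], [15, 16, 5, 13], [1, 26, 1, 26]):
    print(stream, "->", dispatch(stream))
```

The second input differs from the "open" target in its last value only and still clears the assumed threshold, while the third does not resemble any target and is rejected.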

SLIDE 16

Observations

  • We present some of the core individual aspects that contribute to the framework

  • Small examples are given to aid understanding of such an approach

  • The weak points of the concept, and the assumptions placed upon it, are also identified

  • Possible solutions to circumvent these weak areas and unrealistic assumptions are discussed

  • Some key application areas are described where real-world applicability is feasible

  • The overall evaluation of the framework is also discussed, remarking on its individual aspects
SLIDE 17

Observations

  • Envision that a linguistic term is coded into a sequence using simple symbolic association: a = 1, b = 2, c = 3, ..., z = 26

  • Assume the word ‘would’ is the characteristic sequence, with the associated valued sequence s0 = [23, 15, 21, 12, 4]

  • It is noteworthy that if the word is spelled correctly, the sequence it generates will be unique

  • No other word will produce exactly that sequence
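The symbolic association above can be sketched in a few lines. The names `LETTER_VALUE` and `encode_word` are hypothetical, and the sketch assumes input restricted to the 26 unaccented English letters.

```python
import string

# Symbolic association from the slide: a = 1, b = 2, ..., z = 26.
LETTER_VALUE = {ch: i for i, ch in enumerate(string.ascii_lowercase, start=1)}


def encode_word(word):
    """Map each letter of a word to its position in the alphabet."""
    return [LETTER_VALUE[ch] for ch in word.lower()]


print(encode_word("would"))  # [23, 15, 21, 12, 4]
print(encode_word("could"))  # [3, 15, 21, 12, 4]
```

Any character outside a to z raises a `KeyError` in this sketch, which is one simple way of surfacing input the coding does not cover.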

SLIDE 18

Observations

  • With s0 as the characteristic sequence, assume that the input stream presented to the system is ‘could’, with the valued sequence s1 = [3, 15, 21, 12, 4]

  • The returned absolute degree of grey incidence for these two sequences is 0.888

  • This high coefficient indicates that the similarity of the two sequences is high

  • If the input sequence and the target sequence matched exactly, the degree of incidence would be exactly 1

  • The relative or synthetic degree of incidence is not needed for this analysis
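A step-by-step sketch of one way to compute the coefficient for this pair is given below. It assumes the commonly cited discrete form of the absolute degree (zero-starting-point image, then a trapezoid-style sum); the slides do not state which variant or preprocessing produced the quoted figure, so the value printed for ‘would’ against ‘could’ is not claimed to reproduce it. The helper names are hypothetical.

```python
def zero_start(seq):
    """Zero-starting-point image: subtract the first value from every entry."""
    return [v - seq[0] for v in seq]


def area(seq0):
    """Discrete stand-in for the integral of a zero-starting-point image."""
    return sum(seq0[1:-1]) + 0.5 * seq0[-1]


def absolute_degree(x, y):
    """Absolute degree of grey incidence (assumed discrete form)."""
    if len(x) != len(y):
        raise ValueError("sequences must have the same length")
    s_x, s_y = area(zero_start(x)), area(zero_start(y))
    num = 1 + abs(s_x) + abs(s_y)
    return num / (num + abs(s_x - s_y))


s0 = [23, 15, 21, 12, 4]  # 'would', the characteristic sequence
s1 = [3, 15, 21, 12, 4]   # 'could', the input stream

print(zero_start(s0), area(zero_start(s0)))
print(zero_start(s1), area(zero_start(s1)))
print(absolute_degree(s0, s0))  # 1.0 for an exact match
print(absolute_degree(s0, s1))
```

Because the zero-starting-point image subtracts the first value, two sequences that differ only by a constant offset would also score 1 under this form, so a coefficient of 1 does not by itself guarantee identical spelling.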

SLIDE 19

Observations

  • In English, as in many other languages, there are several ways to describe the same observation

  • One could use the Queen’s English and produce a grammatically perfect sentence, or one could use broken English and still maintain the underlying sentiment

  • Sentence 1 below is a grammatically correct statement which describes the colour of a door

  • Sentence 2 is a broken sentence, but it contains the underlying sentiment of sentence 1 using only two words

  • 1. THE(1) DOOR(2) WAS(3) A(4) VIVID(5) GREEN(6)
  • 2. DOOR(2) GREEN(6)

  • There will always be a statement of the smallest possible length which contains all the relevant sentiment and key features of a more grammatically correct statement

SLIDE 20

Observations

  • Sentence 1 would have an associated sequence of:

|20, 8, 5|1 27 |4, 15, 15, 18|2 27 |23, 1, 19|3 27 |1|4 27 |22, 9, 22, 9, 4|5 27 |7, 18, 5, 5, 14|6

  • The value 27 denotes a white space and marks the start of a new word; the number after each block is the token number. Given the values it contains, the sequence can only ever refer to that sentence

  • Sentence 2 is the target sequence, and so contains the following information: |4, 15, 15, 18|2 27 |7, 18, 5, 5, 14|6

  • Tokens 2 and 6 are identical to the target, and it can therefore be concluded that sentence 1 is indeed a possible match for sentence 2
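The sentence coding and the token check can be sketched as below. The function names are hypothetical and the exact-match test is only the simplest possible comparison; it stands in for the incidence analysis discussed elsewhere in the slides.

```python
import string

LETTER_VALUE = {ch: i for i, ch in enumerate(string.ascii_uppercase, start=1)}
SPACE = 27  # the slide's marker for a white space / start of a new word


def encode_sentence(sentence):
    """Encode letters as 1..26 and every space as 27."""
    return [SPACE if ch == " " else LETTER_VALUE[ch] for ch in sentence.upper()]


def split_tokens(sequence):
    """Split an encoded sentence into per-word sequences at each 27."""
    tokens, current = [], []
    for value in sequence:
        if value == SPACE:
            tokens.append(current)
            current = []
        else:
            current.append(value)
    tokens.append(current)
    return tokens


source = split_tokens(encode_sentence("THE DOOR WAS A VIVID GREEN"))
target = split_tokens(encode_sentence("DOOR GREEN"))

# 1-based positions of the tokens in sentence 1 that also occur in the target.
matches = [i for i, tok in enumerate(source, start=1) if tok in target]
print(matches)  # [2, 6]
```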

SLIDE 21

Observations

  • A requirement of Grey Relational Analysis is that the sequences being compared must have the same length

  • An input stream may have an unknown length, whereas the length of the target stream is known

  • There is a high likelihood that some words may not be spelt correctly

  • Identifying key features and comparing those against the target sequence would lessen the burden of exactness

  • As the sequences themselves can be tokenised and parsed, these individual elements can be inspected using incidence analysis

  • If the key features of an inspected input stream return high coefficient values, there is a high likelihood they are a positive match

SLIDE 22

Observations

[Figure: plot of the sentence THE DOOR WAS A VIVID GREEN as a sequence; axis ticks labelled A to Z]

SLIDE 23

Observations

  • It is this concatenated statement that would be the target sequence

  • As it stands, there is no abstraction or understanding of what the word or words mean; the statement is merely a collection of letters

  • Therefore, Natural Language Understanding applications would not be a key domain at this stage

  • However, morphological segmentation most definitely would be

  • This is the separation of words into individual grammatical units; the smallest meaningful units of a word, such as syllables, would still provide a unique sequence when concatenated

  • If the individual syllables have their sequences mapped and stored in a system, those syllables collected and presented in a certain order would only ever refer to the word that was intended

SLIDE 24

Observations

  • By tokenising a sentence we have effectively created word blocks, which will have their own geometric patterns for their associated sequences

  • If the input stream can be isolated and tokenised, those individual tokens could then be compared with individual tokens from the target sequence

  • Assuming that an input stream has been tokenised and parsed into the system, the collective geometric curves of the statement could be permuted to see whether they fit a possible target sequence

  • The degree of incidence could then be computed on a word-by-word basis, with every high-scoring coefficient being collected and stored

  • These stored coefficient values could then be sequenced to see how similar the overall comparison is

  • One would hope to see a geometric curve that is as straight as possible and as close to 1 throughout its duration
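The word-by-word comparison can be sketched as follows. This is an illustration under stated assumptions: `absolute_degree` uses the commonly cited discrete form of the absolute degree, `coefficient_curve` is a hypothetical helper, and the misspelt token GREEM is an invented test input.

```python
def absolute_degree(x, y):
    """Absolute degree of grey incidence (assumed discrete form)."""
    if len(x) != len(y):
        raise ValueError("sequences must have the same length")
    x0 = [v - x[0] for v in x]  # zero-starting-point images
    y0 = [v - y[0] for v in y]
    s_x = sum(x0[1:-1]) + 0.5 * x0[-1]
    s_y = sum(y0[1:-1]) + 0.5 * y0[-1]
    num = 1 + abs(s_x) + abs(s_y)
    return num / (num + abs(s_x - s_y))


def coefficient_curve(input_tokens, target_tokens):
    """For each target token, keep the best coefficient among the input
    tokens of the same length (0.0 if no input token has that length)."""
    curve = []
    for target in target_tokens:
        scores = [absolute_degree(target, tok)
                  for tok in input_tokens if len(tok) == len(target)]
        curve.append(max(scores, default=0.0))
    return curve


target_tokens = [[4, 15, 15, 18], [7, 18, 5, 5, 14]]             # DOOR, GREEN
input_tokens = [[20, 8, 5], [4, 15, 15, 18], [7, 18, 5, 5, 13]]  # THE, DOOR, GREEM

curve = coefficient_curve(input_tokens, target_tokens)
print(curve)       # one coefficient per target token
print(min(curve))  # how far the curve dips below 1 at its lowest point
```

Comparing only tokens of equal length sidesteps the same-length requirement at the sentence level, at the cost of ignoring misspellings that add or drop a letter.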

SLIDE 25

Observations

  • Named Entity Recognition would be a possible avenue for further research

  • Parsing is another area in which grey analysis could be deployed with some degree of success

  • This concatenated, tokenised and parsed sequence would be the target, against which input streams would be compared

  • The individual tokens of the input stream could be compared against segments of the target stream

  • This would circumvent the requirement that the sequences themselves have exactly the same magnitude, as there is a higher likelihood of comparing a token of the input with a token of the target of the same magnitude

SLIDE 26

Observations

  • This again has associated problems, the main one being the assumption that the input stream contains correctly spelt words

  • The use of sequencing to represent syllables would somewhat alleviate this problem

  • The target sequence could in theory be a collection of target sequences for a specific output, all of which contain possible variations of how a word may be pronounced, using permutations of syllable ordering

  • This would be applicable to the area of Word Sense Disambiguation

  • If the target sequences are those of a word with associated disambiguation, then several permutations of that word could be given meaning

  • Using a grey approach for Natural Language Processing can be evaluated from both the intrinsic and extrinsic perspectives

SLIDE 28

Final Remarks

  • We touched upon the validity of using Grey Relational Analysis techniques in certain Natural Language Processing domains

  • The main approach adopts Grey Incidence Analysis for the inspection of sequences

  • Because its sequence is unique, a word or sentence will only ever refer to what was intended

  • As such, that word or statement will always have the exact same geometric pattern for its sequence

  • It would be far-fetched to assume that every input stream would contain the correct spelling

  • In that case, inspecting the segmentation of the word, such as the syllables that make it up, may offer an alternative
SLIDE 29

Final Remarks

  • Having multiple target sequences which are slight permutations of the intended word would help overcome the problem of failing to distinguish between homonyms and homophones

  • The returned coefficient for any inspected pair of sequences provides a measure of similarity

  • The closer the value is to 1, the greater the likeness of the two sequences, and vice versa

  • The approach could be further enhanced by the possible inclusion of Radial Analysis

  • Another enhancement could be the inclusion of grey bounds: upper and lower bounds which would contain the input sequence itself, providing a realm of containment

  • A comparison of not only the sequences themselves but also of the realms could be undertaken to gauge similarity

