


E-Learning Materials Development Based on Abstract Analysis Using Web Tools

Tomofumi NAKANO and Yukie KOYAMA

Nagoya Institute of Technology Gokiso-cho, Showa-ku, Nagoya 466-8555 Japan {nakano, koyama}@center.nitech.ac.jp

Abstract. This study is part of a series of E-Learning and English for Specific Purposes (ESP) research that includes an original corpus of engineering journals. In this paper the results of a corpus study will be presented, and a sample of the ESP e-learning materials being developed for graduate students in engineering will be shown. Abstracts were chosen for the corpus this time because students are likely to read many for their research, and eventually to have to produce their own. We prepared a 40,000-word corpus that consists of 263 abstracts from mechanical and electrical engineering journals. The corpus is analyzed using Wmatrix, which assigns part-of-speech tags and semantic tags and compares the results with those of the BNC written corpus sampler. Some special features found in the analysis are the frequencies of semantic tags and part-of-speech tags, differences in the use of verbal forms, and multi-words. As an application of the important features, we are developing web-based materials which include the original abstracts with target items hyper-linked to various pages containing exercises, concordances, grammar explanations, a bilingual dictionary, etc.

1 Introduction

In the field of English teaching, since the publication of Swales's epoch-making book, Genre Analysis [11], it has been widely accepted that ESP is one of the most efficient approaches in terms of content appropriateness and student motivation. Since we teach at a university of technology, we first conducted a needs analysis and found that reading, especially reading academic papers, is the most important skill for engineering students. After that, we started to compile an original corpus of engineering journal papers. This corpus is still growing in both its discipline coverage and its size.

In this study, the results of a corpus study and a sample of the ESP e-learning material will be shown. This material is developed for graduate students in engineering. Abstracts were chosen this time because students are likely to read many articles for their research, and eventually to have to produce articles of their own. Needless to say, abstracts play an enormously important role in the academic world because, in many cases, readers decide by reading the abstract whether or not to continue to read the full paper [4]. Another reason is that, in English as a Foreign Language (EFL) situations, researchers often write their abstracts in English while the rest of the paper is written in their first language. This also raises the need for abstract analysis in EFL situations such as in Japan.

While reading is the most important language skill for engineering students in Japan [6], they are hindered in reading academic papers by a lack of vocabulary (usually sub-technical or academic) [1] and by difficulty with the grammar of long, often complex sentences [5]. Therefore, this study focuses not only on word lists but also on parts of speech and semantic areas. An application introduced in this study also makes it possible to adjust the level of frequency and the degree of specificity of a word compared with a general corpus. As Morton points out, the problem for a student is not technical vocabulary but the difficult words of more general English [7].

Thanks to developments in ICT, e-learning has become an ideal medium for language learning because of its flexibility and the autonomous learning opportunities it provides inside and outside the classroom. As a new application of the results of the relevant analysis, e-learning material for engineering graduate students will be introduced in the rest of this paper.

2 Corpus Analysis

2.1 The method of analysis

The 40,000-word corpus used in this study consists of 263 abstracts from mechanical and electrical engineering journals. This corpus is taken from an originally compiled 1,120,000-word corpus of full papers from these journals. We use Wmatrix [9] for the abstract analysis, not only because this software is very easy to handle but also because it has a special function which can determine the characteristics of the corpus. Using Wmatrix, the corpus was automatically tagged, both with part-of-speech tags by CLAWS7 [3] and with semantic tags by USAS (UCREL Semantic Analysis System [8]). Moreover, Wmatrix provides frequency tables and log-likelihood tables for words and for these two kinds of tags. Log-likelihood is a measure of how strongly the frequency of an item differs between two corpora [2]. The information given by log-likelihood is therefore very important for ESP material development, because it captures the characteristics of the ESP corpus. The two corpora used in this study are the abstract corpus and the BNC written corpus sampler, which is a built-in corpus in Wmatrix.

In Table 1, the left word list is ordered by frequency and the right word list is ordered by log-likelihood. While almost all words in the left list are general, the words in the right list are specific to the abstracts or to engineering papers.

Table 1. Left: a word list ordered by the frequency. Right: a word list ordered by the log-likelihood.

Left list (ordered by frequency)

rank  word       freq.
   1  the         2459
   2  of          1205
   3  and          945
   4  a            683
   5  in           525
   6  is           522
   7  to           521
   8  for          396
   9  are          262
  10  with         260
  11  this         213
  12  by           190
  13  that         183
  14  an           172
  15  [missing]    156
  16  be           149
  17  at           128
  18  from         127
  19  as           120
  20  flow         119

Right list (ordered by log-likelihood)

                       Abst.           BNC
rank  word          freq.  rate    freq.  rate   log-likelihood
   1  the            2459  8.38    37283  3.79          1158.11
   2  of             1205  4.10    12817  1.30          1068.71
   3  flow            119  0.41       10  0.00           772.82
   4  model           103  0.35       20  0.00           621.26
   5  results          88  0.30       31  0.00           488.40
   6  energy           84  0.29       33  0.00           457.50
   7  presented        63  0.21        9  0.00           392.35
   8  method           59  0.20       16  0.00           340.94
   9  fuel             59  0.20       22  0.00           324.30
  10  paper            93  0.32      174  0.02           323.55
  11  power            71  0.24       67  0.01           315.47
  12  using            91  0.31      189  0.02           302.33
  13  analysis         50  0.17       10  0.00           300.55
  14  combustion       42  0.14        1  0.00           287.94
  15  by              190  0.65     1293  0.13           286.05
  16  performance      58  0.20       39  0.00           282.24
  17  based on         55  0.19       31  0.00           278.82
  18  experimental     43  0.15        4  0.00           277.34
  19  conditions       61  0.21       61  0.01           266.37
  20  gas              70  0.24      106  0.01           265.30

2.2 Results of the analysis

The lists of part-of-speech tags and semantic tags are shown in Tables 2 and 3, respectively. Both lists show the tag name, its frequency and frequency rate in the abstract corpus, its frequency and frequency rate in the BNC written corpus sampler, and the log-likelihood. The rows are sorted by log-likelihood.

Table 2. A part-of-speech tag list

           Abst.             BNC
POS tag  freq.   rate     freq.   rate   log-likelihood
NN1       7297  24.86    147395  15.22          1447.40
JJ        3481  11.86     74927   7.74           533.93
FO         222   0.76      2050   0.21           233.75
VVN       1205   4.10     24675   2.55           226.88
AT        2483   8.46     67521   6.97            84.13
VBZ        522   1.78     11171   1.15            82.10
IO        1204   4.10     30286   3.13            78.32
NN2       2064   7.03     55665   5.75            75.84
IF         398   1.36      8765   0.91            55.09
VVZ        350   1.19      7602   0.79            51.59
VBR        262   0.89      5435   0.56            46.88
. . .

Table 3. A semantic tag list

semantic   Abst.          BNC
tag      freq.  rate   freq.  rate   log-likelihood  meaning
X4.2       444  1.51    3108  0.32           640.06  Mental object: Means, method
O1.3       167  0.57     300  0.03           586.57  Substances and materials generally: Gas
O2         610  2.08    6100  0.63           577.74  Objects generally
O3         204  0.69     651  0.07           537.88  Electricity and electrical equipment
A1.5.1     308  1.05    1965  0.20           485.85  Using
N3.1       130  0.44     413  0.04           343.66  Measurement: General
O1         151  0.51     689  0.07           314.64  Substances and materials generally
M4         161  0.55     843  0.09           301.64  Shipping, swimming etc.
O4.6        78  0.27     110  0.01           301.46  Temperature
X2.4       252  0.86    2176  0.22           288.38  Investigate, examine, test, search
N2         143  0.49     760  0.08           264.68  Mathematics
. . .

Examining the results shown in the tables, the findings are as follows:

1. Semantic areas such as objects, mental objects (method and means), substances & materials (gas, solid and general), measurement (length & height, distance, size and volume), comparison, and evaluation occur much more frequently than in the BNC written sampler.

2. Parts of speech appearing more often are common nouns, the past participle, general adjectives, the definite article, ‘of’, ‘for’, ‘is’, and ‘are’.

3. In the use of verbal forms, past participles are markedly frequent, as was also found in the journal corpus [5], while past tense forms of lexical verbs and infinitive forms occur much less often than in the BNC written sampler.

4. Multi-words appearing more frequently are ‘based on’, ‘due to’, ‘used to’, ‘such as’, ‘carried out’, ‘as well’, ‘in order to’, ‘in terms of’, ‘in addition’ and ‘according to’.
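The log-likelihood values in Tables 1 to 3 can be approximated with the two-corpus log-likelihood statistic widely used in corpus linguistics, following Dunning [2]. The following is a minimal sketch under that assumption; the exact computation inside Wmatrix is not given in this paper. The function name and the two corpus totals are hypothetical: the totals are back-computed from the rate columns of Table 1 (for example 2459 / 0.0838) and are not figures stated by the authors.

```python
import math


def log_likelihood(a: int, b: int, c: int, d: int) -> float:
    """Two-corpus log-likelihood.

    a: frequency of the item in the study corpus (c tokens in total)
    b: frequency of the item in the reference corpus (d tokens in total)
    """
    if c <= 0 or d <= 0:
        raise ValueError("corpus sizes must be positive")
    total = a + b
    # Expected frequencies if the item were equally common in both corpora.
    expected_study = c * total / (c + d)
    expected_reference = d * total / (c + d)
    ll = 0.0
    for observed, expected in ((a, expected_study), (b, expected_reference)):
        if observed > 0:  # a zero count contributes nothing to the sum
            ll += observed * math.log(observed / expected)
    return 2.0 * ll


# Assumed token totals, inferred from the rate columns of Table 1.
ABSTRACT_TOKENS = 29_350
BNC_SAMPLER_TOKENS = 984_000

# 'flow': 119 occurrences in the abstracts, 10 in the BNC written sampler.
flow_ll = log_likelihood(119, 10, ABSTRACT_TOKENS, BNC_SAMPLER_TOKENS)
print(round(flow_ll, 2))
```

With these assumed totals the value for ‘flow’ comes out near the 772.82 reported in Table 1; the small difference is expected because the true corpus sizes are not known exactly. An item with identical relative frequency in both corpora scores 0, which is why general words sink in a list sorted by log-likelihood.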

3 Material development

Through such analysis of corpus data, features of special importance to our students can be selected. Using automatic item generation allows learners to work with different authentic texts each time. Materials under development include the original abstracts with target items hyper-linked to various pages containing concordances, grammar explanations, a bilingual dictionary, etc. The outline of the material is as follows:

– An abstract is used as the base of this material, whose objective is to enhance abstract reading comprehension.
– In the course of reading, characteristic words and multi-words are emphasized.
– The materials are designed as a course, which consists of a series of abstracts.

3.1 Design for the material

Highlighting: Each word in an abstract has a few attributes, such as a POS tag and a semantic tag, although the attributes are not displayed directly. When the mouse pointer moves over a word, all words that carry the same tag are highlighted. One example is shown in Figure 1. It is important for a reader to understand parts of speech in order to understand the structure of the sentence and eventually to comprehend its overall meaning. The highlighting function in this material helps learners to understand the meaning of each POS tag without any previous knowledge. Semantic tags are highlighted in the same way as POS tags. The total number of POS tags is 137 and the number of semantic tags is 232, which is too many for a learner to understand all their meanings. Therefore, we decided to remove the less important tags according to the results of the analysis and so reduce the number of tags. In addition, Wmatrix reports two parameters, frequency and log-likelihood. For the same reason as in the case of tags, we set a lower bound for each parameter, and the two bounds are given before the materials are generated, as described in Section 3.2.

Fig. 1. Highlighting the same tag words

A word list: The material shows a list of the words which learners should learn. To select the words in the list, lower bounds are applied to the frequency rate and the log-likelihood, in the same way as for the tags. Moreover, in order to remove overly general words such as ‘the’ or ‘it’, an upper bound on the frequency rate in the BNC is used when selecting appropriate words for learners.

Fig. 2. An example of search by the concordancer
Fig. 3. Material tuner

The concordancer: Each word in the list is linked to a concordancer developed by us, which shows the part of each sentence that includes the keyword in several corpora. With this function learners are able to look over usages of the word.
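A keyword-in-context display of the kind a concordancer produces can be sketched as follows. This is only an illustration and not the authors' implementation: the function name `kwic`, the `width` parameter, and the two sample sentences are all hypothetical.

```python
import re


def kwic(sentences, keyword, width=30):
    """Return one keyword-in-context line per whole-word match of keyword."""
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    lines = []
    for sentence in sentences:
        for match in pattern.finditer(sentence):
            # Up to `width` characters of context on each side of the keyword.
            left = sentence[max(0, match.start() - width):match.start()]
            right = sentence[match.end():match.end() + width]
            # Right-align the left context so the keywords line up in a column.
            lines.append(f"{left:>{width}} [{match.group(0)}] {right}")
    return lines


# Made-up sentences standing in for corpus text.
sample = [
    "The flow field is analyzed using a numerical model.",
    "Results show that the gas flow depends on temperature.",
]
for line in kwic(sample, "flow"):
    print(line)
```

Aligning the keyword in one column lets a learner scan the words that typically precede and follow it.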

An example of the concordancer is shown in Figure 2.

Pop-up bilingual dictionary: When learners keep the mouse pointer on a word for a while, the content of an English-Japanese dictionary appears. We use e_lemma.txt (Ver. 1) [10] to look up dictionary entries in this material. This helps learners to concentrate on understanding the structure of each sentence.

3.2 Material tuner

As the analysis above indicates, upper and lower bounds on frequency and log-likelihood must be set, and it is also necessary to avoid redundancy among the word lists within a course of materials. These bounds, or parameters, should be tuned through trial and error in consideration of the learners' ability and of the amount of material. Therefore, we developed a material parameter tuner which can adjust these conditions. It is shown in Figure 3. On this page, we can set the following parameters: word log-likelihood lower bound, word frequency lower bound, BNC word frequency upper bound, POS log-likelihood lower bound, POS frequency lower bound, semantic-tag log-likelihood lower bound, and semantic-tag frequency lower bound.
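The word-related bounds listed above amount to a simple filter over the word statistics. The sketch below shows one way such a filter might look. It is not the authors' tuner: the class names, field names, and threshold values are assumptions, and only the five sample rows are taken from Table 1.

```python
from dataclasses import dataclass


@dataclass
class WordStat:
    word: str
    rate: float      # frequency rate (%) in the abstract corpus
    bnc_rate: float  # frequency rate (%) in the BNC written sampler
    ll: float        # log-likelihood


@dataclass
class TunerSettings:
    word_ll_min: float    # word log-likelihood lower bound
    word_rate_min: float  # word frequency lower bound
    bnc_rate_max: float   # BNC word frequency upper bound


def select_words(stats, settings, already_taught=frozenset()):
    """Keep words that pass every bound and were not listed earlier in the course."""
    selected = []
    for s in stats:
        if s.word in already_taught:
            continue  # avoid repeating a word across word lists
        if s.ll < settings.word_ll_min:
            continue  # not characteristic enough of the abstracts
        if s.rate < settings.word_rate_min:
            continue  # too rare in the abstracts
        if s.bnc_rate > settings.bnc_rate_max:
            continue  # too general, such as 'the'
        selected.append(s.word)
    return selected


# Rows copied from Table 1; the thresholds are illustrative only.
stats = [
    WordStat("the", 8.38, 3.79, 1158.11),
    WordStat("flow", 0.41, 0.00, 772.82),
    WordStat("model", 0.35, 0.00, 621.26),
    WordStat("by", 0.65, 0.13, 286.05),
    WordStat("gas", 0.24, 0.01, 265.30),
]
settings = TunerSettings(word_ll_min=300.0, word_rate_min=0.2, bnc_rate_max=1.0)
words = select_words(stats, settings, already_taught={"model"})
print(words)  # ['flow']
```

Raising or lowering the three thresholds changes the length and difficulty of the list, which is the trial-and-error adjustment the tuner is meant to support. The same pattern would apply to the POS and semantic-tag bounds.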


4 Conclusion and future work

In this paper, we presented a method for developing e-learning materials based on the analysis of an abstract corpus. First, we analyzed the abstract corpus using Wmatrix. The results of the analysis confirm that abstracts have specific features compared with a general corpus. Therefore, the development of the materials in this study can be seen as meaningful from the ESP perspective.

Next, we designed several functions for the material. Since this is reading material, these functions are provided for the purpose of better reading comprehension. The functions provided for each material are highlighting of POS and semantic tags, a list of words to learn, links to a concordancer for seeing the keyword in context in other examples, and a pop-up bilingual dictionary.

Finally, we developed a material tuner to reflect the analysis in the materials. Using this tuner, we can adjust the frequency level or log-likelihood level of the words and tags, which allows students to learn more comfortably according to their ability.

Future steps in the development of this project include expanding the size of the corpus and the number of disciplines represented; comparing the data for abstracts with those for full papers; and developing materials to guide students in writing abstracts.

References

1. L. A. Olsen and T. N. Huckin. Technical Writing and Professional Communication. New York: McGraw-Hill, 1991.
2. Ted Dunning. Accurate methods for the statistics of surprise and coincidence. Computational Linguistics, 19(1):61–74, 1994.
3. R. Garside and N. Smith. A hybrid grammatical tagger: CLAWS4. In R. Garside, G. Leech, and A. McEnery, editors, Corpus Annotation: Linguistic Information from Computer Text Corpora, pages 102–121. Longman, 1997.
4. K. Hyland. Disciplinary Discourses: Social Interactions in Academic Writing. Pearson Education Ltd., 2000.
5. Yukie Koyama. English for science and technology using corpus based approach. Technical report, Nagoya Institute of Technology, 2003.
6. Yukie Koyama and Robin Nagano. Text analysis based on EST corpus and its application to English teaching. Technical report, Nagaoka University of Technology, 2001.
7. R. Morton. Abstracts as authentic material for EAP classes. ELT Journal, 53(3):177–182, 1999.
8. Scott S. L. Piao, Paul Rayson, Dawn Archer, Andrew Wilson, and Tony McEnery. Extracting multiword expressions with a semantic tagger. In Proceedings of the Workshop on Multiword Expressions: Analysis, Acquisition and Treatment (ACL 2003), pages 49–56, 2003.
9. P. Rayson. Wmatrix: a web-based corpus processing environment. Technical report, Lancaster University, 2001.
10. Yasumasa Someya. e_lemma.txt (ver. 1), 1998.
11. J. M. Swales. Genre Analysis. Cambridge University Press, 1990.