Challenges of linking statistical data and phonetic pronunciation



slide-1
SLIDE 1

Challenges of linking statistical data and phonetic pronunciation software

Case study: Problem Of Regular Statistics Establishments' Frames In Egypt

Nehall Ahmed Farouk

nehall_ahmed@capmas.gov.eg

Research, sampling, and computer specialist

Central Agency for Public Mobilization and Statistics (CAPMAS), Egypt

slide-2
SLIDE 2

Points of discussion

  • Introduction
  • Problem and methods
  • Expected results
  • Conclusion

slide-3
SLIDE 3

Introduction

Different types of statistical data are processed for various reasons, to improve statistical work and to provide new indicators. Some of these data types are measurable, comparable, and linkable, but others are not. Statistical work can face many challenges in mixing, comparing, and linking data; these challenges result from the nature of the data type.

slide-4
SLIDE 4

Introduction

Case study: Problem of regular statistics establishments' frames in Egypt

  • CAPMAS, Egypt, conducts many regular statistical surveys of establishments. Each survey has its own frame (an establishments' frame), and together the frames cover about 108515 establishments.

  • The regular statistics comprise 89 frames distributed over 9 departments.

  • Some frames contain only the establishments' main centers, others contain the main centers and some branches, and others contain only some main centers and some branches.
slide-5
SLIDE 5

Problem and methods

  • Problem core
  • Current situation
  • Aggregation process purposes
  • Aggregation process structure
  • Aggregation process implementation

slide-6
SLIDE 6

Problem and methods

Problem core

CAPMAS seeks to generate one main aggregated frame from all of the regular statistics establishments' frames. There are 67 related, overlapping frames.

The problem arises when implementing the aggregation process, because there is currently no way to compare and link the same establishment across different frames.

slide-7
SLIDE 7

Problem and methods

Current situation

  • The establishments' frames overlap: the same establishment exists in different frames.
  • No establishment has a unique ID number that could be used for data linking.
  • The same establishment cannot be matched across related frames, because its name is only partially identical from frame to frame owing to the nature of Arabic writing.
  • The same establishment cannot be matched across frames when it appears under different names (about 20% of the frames).

slide-8
SLIDE 8

Problem and methods

Current situation

Department                             Number of frames   Total number of establishments
Labor statistics department                           7                            10866
Finance and price department                         14                             6091
Industrial statistics department                      8                             7859
Agriculture statistics department                    13                              748
Service statistics department                        24                            18270
Education statistics department                       5                            53158
Trade statistics department                           4                            10028
Transportation statistics department                 11                             1054
Infrastructure statistics department                  3                              441
Total                                                89                           108515
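The per-department figures can be cross-checked against the totals quoted elsewhere in the deck (89 frames, about 108515 establishments, and 67 related frames once the 22 independent frames are excluded). The following is a minimal sketch of that arithmetic; the variable names are illustrative and not part of the original presentation.

```python
# (department, number of frames, number of establishments), copied from the table
departments = [
    ("Labor statistics", 7, 10866),
    ("Finance and price", 14, 6091),
    ("Industrial statistics", 8, 7859),
    ("Agriculture statistics", 13, 748),
    ("Service statistics", 24, 18270),
    ("Education statistics", 5, 53158),
    ("Trade statistics", 4, 10028),
    ("Transportation statistics", 11, 1054),
    ("Infrastructure statistics", 3, 441),
]

total_frames = sum(frames for _, frames, _ in departments)
total_establishments = sum(count for _, _, count in departments)

# The deck excludes 22 independent frames before aggregation
related_frames = total_frames - 22

print(total_frames, total_establishments, related_frames)  # 89 108515 67
```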

slide-9
SLIDE 9

Problem and methods

Aggregation process purposes

  • Building a key component for generating administrative data on establishments.
  • Solving the problem of conflicting frames and repeated establishments.
  • Making each establishment unique, with its own ID, in the master frame.
  • Selecting all of the establishments' surveys from the generated master frame.

slide-10
SLIDE 10

Problem and methods

Aggregation process structure

  • Determining and collecting metadata about all of the overlapped related frames.
  • Determining the relationships and inter-relationships between the frames.
  • Classifying the frames by relationship (master frames, related frames, independent frames) and by sectoral activity (public/business sector, governmental sector, private/investment sector).
  • In parallel: creating a unique ID number, and comparing names through the phonetic pronunciation system.
  • Final aggregation process (matching through TTS software).
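The classification step can be pictured as a small data structure. The sketch below is only an illustration under assumed names: the `Frame` class, the two enums, the function `group_for_aggregation`, and the frame names `frame_A` to `frame_D` are all hypothetical and do not come from CAPMAS.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum


class Relationship(Enum):
    MASTER = "master"
    RELATED = "related"
    INDEPENDENT = "independent"


class Sector(Enum):
    PUBLIC_BUSINESS = "public/business"
    GOVERNMENTAL = "governmental"
    PRIVATE_INVESTMENT = "private/investment"


@dataclass(frozen=True)
class Frame:
    name: str
    relationship: Relationship
    sector: Sector


def group_for_aggregation(frames):
    """Drop independent frames, then group the remaining frame names by sector."""
    groups = defaultdict(list)
    for frame in frames:
        if frame.relationship is Relationship.INDEPENDENT:
            continue  # independent frames take no part in the aggregation
        groups[frame.sector].append(frame.name)
    return dict(groups)


# Hypothetical frames, for illustration only
frames = [
    Frame("frame_A", Relationship.MASTER, Sector.GOVERNMENTAL),
    Frame("frame_B", Relationship.RELATED, Sector.GOVERNMENTAL),
    Frame("frame_C", Relationship.INDEPENDENT, Sector.PRIVATE_INVESTMENT),
    Frame("frame_D", Relationship.RELATED, Sector.PUBLIC_BUSINESS),
]
grouped = group_for_aggregation(frames)
```

Grouping by sector first keeps each later name comparison inside one sector, which limits the number of candidate pairs.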

slide-11
SLIDE 11

Problem and methods

Aggregation process implementation

  • 1. Parts already achieved

  • 2. Problems

  • 3. Overcoming the problems by using phonetic pronunciation software

slide-12
SLIDE 12

1. Parts already achieved

  • Classifying the frames according to the type of relation with each other, then excluding the 22 independent frames.
  • Classifying the frames according to sectoral activity (public/business sector, governmental sector, private/investment sector).
  • Dividing each sectoral activity into 2 relation types: comprehensive relations frames and partial relations frames.

Comprehensive relations frames: usually the large master frames, which may include establishments from other frames and may be related to each other.

Partial relations frames: frames that are related to each other and to the comprehensive relations frames.

slide-13
SLIDE 13

1. Parts already achieved

Relations between the overlapped frames
slide-14
SLIDE 14
2. Problems

  • The establishments have no unique ID number that could be used for data linking.
  • The same establishment cannot be matched across related frames, because its name is almost but not exactly the same from frame to frame owing to the nature of Arabic writing.
  • The same establishment cannot be matched across frames when it appears under different names (about 20% of the frames).
  • The aggregation and unification of the frames has not been accomplished, for lack of matching techniques.

slide-15
SLIDE 15

2. Problems

SWOT

Strength

  • Collecting the metadata of all 67 establishments' frames.
  • Excluding the independent frames and determining the relations between the remaining frames.
  • Classifying the related frames in 2 stages (sectoral activity, relation type).
  • Having consultants who monitor the project implementation process.

Weakness

  • No unique ID number for any establishment.
  • Inability to aggregate the overlapped establishments in the different frames.
  • Redundancy of establishments across frames, with partially matching names or with different names.
  • Lack of a software technique that handles the nature of Arabic writing.

Opportunity

  • Finding suitable phonetic pronunciation software that matches establishments' partially matching names.
  • Generating a unique ID number for each establishment during the implementation process.
  • Ability to create the master aggregated establishments' frame.
  • Achieving the core of building administrative data for establishments in CAPMAS.

Threat

  • 20% of the establishments might have different names in different frames.
  • Finding software that both generates the phonetic pronunciation and matches it.
slide-16
SLIDE 16
3. Overcoming the problems

Data linking here depends on phonetic pronunciation software as a main part of the aggregation process: the data are compared first and then linked.

The nature of Arabic writing poses challenges for TTS software:

  • Arabic writing and pronunciation are very difficult.
  • Arabic raises some problems when data are compared through TTS software.

slide-17
SLIDE 17

Using phonetic pronunciation software

The phonetic pronunciation comparison process has two levels:

  • Generating speech for the establishments' names by using a TTS software program.
  • Comparing the pronunciation of the establishments' names by using phonetic pronunciation software.

slide-18
SLIDE 18

Using phonetic pronunciation software

(Text) establishment 1 → (speech) establishment 1

(Text) establishment 2 → (speech) establishment 2

Both speech outputs → phonetic comparing process

  • Y: aggregation + same ID
  • N: different IDs
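The decision flow above can be sketched in a few lines. This is an illustration under stated assumptions, not the software described in the deck: `assign_ids` and `toy_key` are hypothetical names, the establishment names are invented, and `toy_key` is only a placeholder for a real TTS-based phonetic comparison, which is not reproduced here.

```python
from typing import Callable, Dict, List


def assign_ids(names: List[str], phonetic_key: Callable[[str], str]) -> Dict[str, int]:
    """Give names with the same phonetic key the same ID, and other names a new ID."""
    ids_by_key: Dict[str, int] = {}
    result: Dict[str, int] = {}
    for name in names:
        key = phonetic_key(name)
        if key not in ids_by_key:
            # "N" branch: no earlier name sounds the same, so issue a new ID
            ids_by_key[key] = len(ids_by_key) + 1
        # "Y" branch: reuse the ID of the earlier matching name
        result[name] = ids_by_key[key]
    return result


def toy_key(name: str) -> str:
    """Placeholder key: lowercase letters and digits only (NOT a real phonetic encoding)."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


# Invented names, for illustration only
ids = assign_ids(["Nile Trading Co.", "nile trading co", "Delta Mills"], toy_key)
print(ids)
```

Passing the key function as a parameter keeps the ID assignment independent of whichever phonetic comparison is eventually chosen.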

slide-19
SLIDE 19

Using phonetic pronunciation software

Sample 1: الشركة التعاونية للاتصالات
Sample 2: الشركه التعاونية للاتصالات

Sample 1: الإسكندرية للعقارات
Sample 2: الاسكندرية للعقارات

Sample 1: الشركة المصرية
Sample 2: الشركة المصرية للاقمشة

TTS program → phonetic pronunciation comparison software

Most of the writing mistakes that arise from the nature of Arabic writing vanish when names are compared by pronunciation, as in these two samples.
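Some of these spelling variants can also be collapsed with plain orthographic normalization, which is a simpler, rule-based stand-in and not the TTS-based comparison the deck proposes. The sketch below assumes a small set of interchangeable letters (alef forms, ta marbuta and ha, alef maqsura and ya); the function name `normalize` and the example names are illustrative.

```python
import re

# Letters that are often written interchangeably (an assumed, non-exhaustive list)
_VARIANTS = str.maketrans({
    "أ": "ا",  # alef with hamza above -> bare alef
    "إ": "ا",  # alef with hamza below -> bare alef
    "آ": "ا",  # alef with madda -> bare alef
    "ة": "ه",  # ta marbuta -> ha
    "ى": "ي",  # alef maqsura -> ya
})

# Short-vowel marks (U+064B to U+0652) and tatweel (U+0640)
_DIACRITICS = re.compile("[\u064B-\u0652\u0640]")


def normalize(name: str) -> str:
    """Return a spelling-insensitive form of an Arabic name."""
    name = _DIACRITICS.sub("", name)
    name = name.translate(_VARIANTS)
    return " ".join(name.split())  # collapse repeated whitespace


# Illustrative name pairs that differ only in spelling
same_1 = normalize("الشركة التعاونية للاتصالات") == normalize("الشركه التعاونية للاتصالات")
same_2 = normalize("الإسكندرية للعقارات") == normalize("الاسكندرية للعقارات")
different = normalize("الشركة المصرية") == normalize("الشركة التعاونية")
print(same_1, same_2, different)
```

Normalization of this kind only handles known letter substitutions; names that differ by whole words would still need a separate matching step.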
slide-20
SLIDE 20

Using phonetic pronunciation software

In parallel, a new primary ID is generated for each establishment. Its code depends on several factors:

  • The department whose frames include the establishment.
  • The establishment's sector.
  • The legal structure of the establishment.
  • Whether the establishment is a main center or a branch.
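One way to picture such a code is as a concatenation of fixed-width fields, one per factor. The layout below is purely an assumption for illustration: the field widths, the numeric codes, the trailing `serial` number, and the names `Establishment` and `primary_id` are hypothetical and are not CAPMAS's actual coding scheme.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Establishment:
    department: int   # assumed code 1-9, one per department
    sector: int       # assumed single-digit sector code
    structure: int    # assumed two-digit code for the establishment's structure
    is_branch: bool   # False for a main center, True for a branch
    serial: int       # assumed running number within the group


def primary_id(e: Establishment) -> str:
    """Concatenate the factor codes into one fixed-width identifier."""
    return (
        f"{e.department:01d}"
        f"{e.sector:01d}"
        f"{e.structure:02d}"
        f"{int(e.is_branch)}"
        f"{e.serial:06d}"
    )


main_center = Establishment(department=3, sector=1, structure=4, is_branch=False, serial=125)
branch = Establishment(department=3, sector=1, structure=4, is_branch=True, serial=125)
print(primary_id(main_center), primary_id(branch))
```

Fixed-width fields keep every ID the same length, so each factor can be read back from its position in the code.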

slide-21
SLIDE 21

Expected results

Generating the master aggregated frame is expected to have many effects on our statistical work and on our economic and technical systems:

  • 1. Data about each establishment will be collected only once.

  • 2. Fieldwork costs will be reduced.

  • 3. Some surveys can be excluded, which reduces the total cost.

  • 4. It will help in generating administrative data for establishments.

slide-22
SLIDE 22

Conclusions

  • Phonetic software is useful for comparing and linking data, provided that suitable software is developed.
  • Statisticians must study the nature of their data and then consider how to use the most advanced technological systems or methods to link it.
  • Linking incomparable data can be achieved by analyzing the data.
  • Finding the relations between different data files, and how to compare them, is the most important step in linking data.

slide-23
SLIDE 23