Challenges of linking statistical data and phonetic pronunciation software
Case study: Problem of Regular Statistics Establishments' Frames in Egypt
Nehall Ahmed Farouk
nehall_ahmed@capmas.gov.eg
Research, sampling, and computer specialist
Points of discussion
- Introduction
- Problem and methods
- Expected results
- Conclusion
Introduction
Different types of statistical data are processed for various reasons: to improve statistical work and to provide new indicators. Some types of data are measurable, comparable, and linkable, but others are not. Statistical work can therefore face many challenges in mixing, comparing, and linking data; these challenges result from the nature of the data type.
Case study: Problem of Regular Statistics Establishments' Frames in Egypt
- CAPMAS, Egypt conducts many different regular statistical surveys of establishments. Each survey has its own frame (an establishments' frame), and together the frames cover about 108515 establishments.
- The regular statistics comprise a total of 89 frames distributed over 9 different departments.
- Some of these frames contain only the main centers of the establishments; others contain main centers and some branches, or some main centers and some branches.
Problem and methods
- Problem core
- Current situation
- Aggregation process purposes
- Aggregation process structure
- Aggregation process implementation
Problem core
CAPMAS seeks to generate one main aggregated frame from all of the regular statistics establishments' frames. The total number of related, overlapping frames is 67.
The problem appears when implementing the aggregation process, because there is currently no way to compare and link the same establishment across different frames.
Current situation
- Different frames of the establishments overlap, and the same establishment exists in different frames.
- No establishment has a unique ID number that can be used in data linking.
- The same establishment cannot be matched across the related frames when its name is only partially compatible rather than identical, because of the nature of writing in Arabic.
- The same establishment cannot be matched across different frames when it exists under different names (about 20% of the frames).
Department                            Number of frames   Total number of establishments
Labor statistics department           7                  10866
Finance and price department          14                 6091
Industrial statistics department      8                  7859
Agriculture statistics department     13                 748
Service statistics department         24                 18270
Education statistics department       5                  53158
Trade statistics department           4                  10028
Transportation statistics department  11                 1054
Infrastructure statistics department  3                  441
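As a quick consistency check, the per-department figures in the table can be summed and compared with the totals quoted elsewhere in this presentation: 89 frames, about 108515 establishments, and 67 overlapping frames once the 22 independent frames are excluded. The sketch below only transcribes the table; the variable names are illustrative and not part of the CAPMAS system.

```python
# (number of frames, number of establishments) per department,
# transcribed from the table above.
departments = {
    "Labor statistics": (7, 10866),
    "Finance and price": (14, 6091),
    "Industrial statistics": (8, 7859),
    "Agriculture statistics": (13, 748),
    "Service statistics": (24, 18270),
    "Education statistics": (5, 53158),
    "Trade statistics": (4, 10028),
    "Transportation statistics": (11, 1054),
    "Infrastructure statistics": (3, 441),
}

total_frames = sum(frames for frames, _ in departments.values())
total_establishments = sum(count for _, count in departments.values())

# 22 independent frames are excluded before aggregation (figure from the slides).
independent_frames = 22
overlapped_frames = total_frames - independent_frames

print(total_frames, total_establishments, overlapped_frames)  # 89 108515 67
```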
Aggregation process purposes
- Building a key component for generating administrative data on establishments.
- Resolving conflicts between frames and the repetition of establishments.
- Making each establishment unique, with its own ID, in the created master frame.
- Drawing all of the establishments' surveys from the generated master frame.
Aggregation process structure
- Determining and collecting metadata about all of the overlapping related frames.
- Determining relationships and inter-relationships between the frames.
- Classifying the frames by relationship (master frames, related frames, independent frames) and by sectoral activity (public/business sector, governmental sector, private/investment sector).
- In parallel: creating a unique ID number and comparing names through the phonetic pronunciation system.
- Final aggregation process (matching through TTS software).
Aggregation process implementation
- 1. Parts already achieved
- 2. Problems
- 3. Overcoming the problems by using phonetic pronunciation software
1. Parts already achieved
- Classifying the frames according to the type of relation with each other, then excluding 22 independent frames.
- Classifying the frames according to sectoral activity (public/business sector, governmental sector, private/investment sector).
- Dividing each sectoral activity into 2 relation types: comprehensive relations frames and partial relations frames.
- Comprehensive relations frames: usually the huge master frame, which may include establishments from other frames; these frames might also have relations with each other.
- Partial relations frames: have relations with each other and with the comprehensive relations frames.
[Figure: Relations between the overlapped frames]
2. Problems
- There is no unique ID number for the establishments that can be used in data linking.
- The same establishment cannot be matched across the related frames when its name is almost the same but only partially compatible, because of the nature of writing in Arabic.
- The same establishment cannot be matched across different frames when it exists under different names (about 20% of the frames).
- The frames aggregation and unification process has not been accomplished, due to the lack of matching techniques.
2. Problems: SWOT analysis
Strengths
- Collecting the metadata of all 67 different establishments' frames.
- Excluding the independent frames and determining the relations between the remaining frames.
- Classifying the related frames in 2 stages (sectoral activity, relation type).
- Having consultants who monitor the project implementation process.
Weaknesses
- No unique ID number for all establishments.
- Inability to aggregate the overlapping establishments in the different frames.
- Redundancy of establishments across frames, with partially compatible names or different names.
- Lack of a software technique to handle the nature of Arabic writing.
Opportunities
- Finding suitable phonetic pronunciation software that matches partially compatible establishment names.
- Generating a unique ID number for each establishment during the implementation process.
- Ability to create the master aggregated establishments' frame.
- Achieving the core of producing administrative data for establishments in CAPMAS.
Threats
- 20% of the establishments might have different names in different frames.
- Finding software that both generates the phonetic pronunciation and matches it.
3. Overcoming the problems
Linking the data here depends on a phonetic pronunciation software technique as a main part of the aggregation process: the data are compared first and then linked. The nature of written Arabic poses challenges for TTS software:
- Arabic writing and pronunciation are very difficult.
- Arabic has some problems that complicate comparing data through TTS software.
Using phonetic pronunciation software
The phonetic pronunciation comparison process has two levels:
- Generating speech for the establishments' names by using a TTS software program.
- Comparing the pronunciation of the establishments' names by using phonetic pronunciation software.
Phonetic comparing process:
- (Text) establishment 1 -> (Speech) establishment 1
- (Text) establishment 2 -> (Speech) establishment 2
- Both speech outputs -> phonetic comparison
- Match (Y): aggregation, same ID
- No match (N): different IDs
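The decision flow above (compare two names, then give them the same ID on a match or different IDs otherwise) can be sketched as follows. This is only an illustration under stated assumptions: `phonetic_key` is a hypothetical placeholder for the TTS and phonetic step, the comparison uses Python's `difflib.SequenceMatcher` as a stand-in for real phonetic matching, the 0.85 threshold is arbitrary, and the example names are invented.

```python
from difflib import SequenceMatcher


def phonetic_key(name: str) -> str:
    """Hypothetical stand-in for the TTS/phonetic step.

    Here it only lowercases the name and collapses whitespace.
    """
    return " ".join(name.lower().split())


def same_establishment(a: str, b: str, threshold: float = 0.85) -> bool:
    """Return True when the two keys are similar enough (assumed threshold)."""
    ratio = SequenceMatcher(None, phonetic_key(a), phonetic_key(b)).ratio()
    return ratio >= threshold


def assign_ids(names, threshold: float = 0.85):
    """Give matching names the same ID and non-matching names new IDs."""
    representatives = []  # first name seen for each ID
    ids = []
    for name in names:
        for index, rep in enumerate(representatives):
            if same_establishment(name, rep, threshold):
                ids.append(index + 1)  # match: reuse the existing ID
                break
        else:
            representatives.append(name)  # no match: issue a new ID
            ids.append(len(representatives))
    return ids


# Invented example names, not taken from any CAPMAS frame.
names = ["Nile Textile Company", "Nile  textile company", "Delta Cement"]
print(assign_ids(names))  # [1, 1, 2]
```

Comparing each name against one representative per ID keeps the sketch short; a real aggregation over about 108515 establishments would likely need blocking by sector or department to limit the number of comparisons.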
Sample names (TTS program -> phonetic pronunciation compare software)
Sample 1:
- الشركة التعاونية للأتصالات
- الإسكندرية للعقارات
- الشركة المصرية
Sample 2:
- الشركه التعاونية للأتصالات
- الاسكندرية للعقارات
- الشركة المصرية للأقمشة
Most of the writing mistakes that arise from the nature of Arabic writing vanish, as in these two samples.
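To illustrate why such spelling variants disappear once names are compared by sound rather than by spelling, the sketch below folds a few common Arabic orthographic variants before comparing. It is not the TTS-based method described in these slides; it is a simpler, text-only normalization shown for comparison. The folding table and the function name `normalize_arabic` are assumptions for illustration.

```python
import re
import unicodedata

# Assumed folding table for a few common spelling variants.
FOLD = str.maketrans({
    "\u0623": "\u0627",  # alef with hamza above -> bare alef
    "\u0625": "\u0627",  # alef with hamza below -> bare alef
    "\u0622": "\u0627",  # alef with madda -> bare alef
    "\u0629": "\u0647",  # taa marbuta -> haa
    "\u0649": "\u064A",  # alef maqsura -> yaa
})

# Short vowel marks (U+064B to U+0652) and tatweel (U+0640).
DIACRITICS = re.compile(r"[\u064B-\u0652\u0640]")


def normalize_arabic(name: str) -> str:
    """Fold spelling variants so that equivalent names compare equal."""
    # NFKC maps presentation-form glyphs (e.g. lam-alef ligatures) to base letters.
    name = unicodedata.normalize("NFKC", name)
    name = DIACRITICS.sub("", name)
    name = name.translate(FOLD)
    return " ".join(name.split())


a = "الشركة التعاونية"
b = "الشركه التعاونية"
print(a == b, normalize_arabic(a) == normalize_arabic(b))  # False True
```

Names that differ in actual wording, such as the roughly 20% of establishments recorded under different names, would still not match after this kind of folding.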
In parallel, a new primary ID is generated for each establishment. Its code depends on several factors:
- The department whose frames include the establishment.
- The establishment's sector.
- The legal structure of the establishment.
- Whether the establishment is a main center or a branch.
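One way to picture an ID built from these factors is a fixed-width code with one field per factor plus a running serial number. The slides do not specify the actual code layout, so everything below is hypothetical: the field order, the field widths, the numeric codes, and the names `Establishment` and `primary_id`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Establishment:
    department: int  # hypothetical code 1..9, one per department
    sector: int      # hypothetical: 1 public/business, 2 governmental, 3 private/investment
    structure: int   # hypothetical one-digit code for the establishment's structure
    is_branch: bool  # False for a main center, True for a branch
    serial: int      # running number within the group


def primary_id(e: Establishment) -> str:
    """Concatenate the factor codes and a 6-digit serial (assumed layout)."""
    return f"{e.department}{e.sector}{e.structure}{int(e.is_branch)}{e.serial:06d}"


example = Establishment(department=5, sector=3, structure=2, is_branch=False, serial=42)
print(primary_id(example))  # 5320000042
```

A code of this kind stays readable, since each factor can be recovered from its position, but it would need to be reissued if an establishment changed sector or department.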
Expected results
Generating the master aggregated frame is expected to have many effects on our statistical work and on economic and technical systems:
- 1. Data about one establishment will be collected once.
- 2. Fieldwork cost will be reduced.
- 3. Some surveys can be excluded, which affects the total cost.
- 4. It helps in generating the administrative data for establishments.
Conclusions
- Phonetic software is also useful in comparing and linking data, if suitable software is developed.
- Statisticians must study the nature of the data and then think about how to use technological systems or methods to link it.
- Linking incomparable data can be achieved by analyzing the data.
- The step of finding out relations