Semantics-based reverse engineering of data models from programs

Semantics-based reverse engineering of data models from programs Komondoor V Raghavan IBM India Research Lab (with G. Ramalingam, J. Field, et al) 1 / 51

Understanding legacy software ● Common scenario – huge existing legacy code base – building on top of existing code – transforming existing code – integrating legacy systems ● Legacy code can be surprisingly hard to work with – lack of documentation and understanding of existing code ● Need tools to help understand legacy code 2 / 51

Reverse engineering data models ● Goal: Reverse engineer a logical data model of a given (legacy) program – or Type Inference – focused on weakly-typed languages like Cobol ● Understanding logical structure of data is key to program understanding ● A logical data model can assist in common legacy transformation and maintenance tasks 3 / 51

An example Cobol program – Data declarations
01 CARD-TRANSACTION-REC.
   05 LOCATION-TYPE    PIC X.
   05 LOCATION-DETAILS PIC X(20).
   05 CARD-INFO        PIC X(19).
   05 AMT              PIC X(4).
01 ATM-DETAILS.
   05 ATM-ID           PIC X(5).
   05 ATM-ADDRESS      PIC X(12).
   05 ATM-OWNER-ID     PIC X(3).
01 MERC-DETAILS.
   05 MERCHANT-ID      PIC X(8).
   05 MERCHANT-ADDRESS PIC X(12).
01 CARD-NUM            PIC X(16).
01 CASHBACK-RATE       PIC X(2).
01 CASHBACK            PIC X(3).
Callouts: 01-level items are the outermost variables; 05-level items are inner variables (fields); the PIC entries are picture clauses. 4 / 51

Example program -- code
/1/  READ CARD-TRANSACTION-REC.
/2/  IF LOCATION-TYPE = 'M'
/3/    MOVE LOCATION-DETAILS TO MERC-DETAILS
/4/  ELSE
/5/    MOVE LOCATION-DETAILS TO ATM-DETAILS
/6/  ENDIF
/7/  IF CARD-INFO[1:1] = 'C'
/8/    MOVE CARD-INFO[2:3] TO CASHBACK-RATE
/9/    MOVE AMT*CASHBACK-RATE/100 TO CASHBACK
/10/   MOVE CARD-INFO[4:19] TO CARD-NUM
/11/   WRITE CARD-NUM, CASHBACK TO CASHBACK-FILE
/12/ ELSE
/13/   MOVE CARD-INFO[2:17] TO CARD-NUM
/14/ ENDIF
/15/ IF LOCATION-TYPE = 'M'
/16/   WRITE MERCHANT-ID, AMT, CARD-NUM TO M-FILE
/17/ ELSE
/18/   WRITE ATM-ID, ATM-OWNER-ID, AMT,CARD-NUM TO A-FILE.
/19/ ENDIF
Callouts: the types are not obvious (/3/ to /5/); CARD-NUM holds a CreditCdNum after /10/, a DebitCdNum after /13/, and CreditCdNum | DebitCdNum after /14/; this disjoint union is not obvious either. 5 / 51

An example Cobol program – Data declarations
01 CARD-TRANSACTION-REC.
   05 LOCATION-TYPE    PIC X.
   05 LOCATION-DETAILS PIC X(20).
   05 CARD-INFO        PIC X(19).
   05 AMT              PIC X(4).
01 ATM-DETAILS.
   05 ATM-ID           PIC X(5).
   05 ATM-ADDRESS      PIC X(12).
   05 ATM-OWNER-ID     PIC X(3).
01 MERC-DETAILS.
   05 MERCHANT-ID      PIC X(8).
   05 MERCHANT-ADDRESS PIC X(12).
01 CARD-NUM            PIC X(16).
01 CASHBACK-RATE       PIC X(2).
01 CASHBACK            PIC X(3).
Callouts (implicit aggregate structure!):
   LOCATION-DETAILS holds either AtmID ; OwnerID or MerchantID
   CARD-INFO holds either 'C':CreditTag ; CashBkRate ; CreditCdNum or !{'C'}:DebitTag ; DebitCdNum ; Unused
6 / 51
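The implicit structure of CARD-INFO can be made explicit in code. The following is a minimal Python sketch, not part of the deck: the class and function names are hypothetical, and the byte ranges are taken from the substring references in the example program (CARD-INFO[2:3], CARD-INFO[4:19], CARD-INFO[2:17]), with the last two bytes of the debit case assumed to be the "Unused" part.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class CreditCard:
    cashback_rate: str  # CARD-INFO[2:3]
    card_num: str       # CARD-INFO[4:19]


@dataclass
class DebitCard:
    card_num: str       # CARD-INFO[2:17]
    unused: str         # CARD-INFO[18:19] (assumed to be the "Unused" part)


def decode_card_info(card_info: str) -> Union[CreditCard, DebitCard]:
    """Decode the 19-character CARD-INFO field using its one-character tag."""
    if len(card_info) != 19:
        raise ValueError("CARD-INFO must be exactly 19 characters")
    if card_info[0] == "C":  # guard 'C': credit card
        return CreditCard(card_info[1:3], card_info[3:19])
    # guard !{'C'}: any other tag is treated as a debit card
    return DebitCard(card_info[1:17], card_info[17:19])
```

The two dataclasses play the role of the two branches of the disjoint union, which is the kind of class structure the inferred data model is meant to suggest.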

Algorithm 1 [TACAS '05] ● A “guarded” (dependent) type system, involving guarded type variables, records (concatenation), and unions – Example: ('E':α_1 ; β_7 ; γ_4 ; δ_2) | (!{'E'}:ε_1 ; φ_9 ; η_4), where each subscript is the length of the type variable 7 / 51

Algorithm 1 [TACAS '05] ● A “guarded” (dependent) type system, involving guarded type variables, records (concatenation), and unions – Example, with meaningful names for clarity: ('E':Emp_1 ; EId_7 ; Salary_4 ; Unused_2) | (!{'E'}:Vis_1 ; SSN_9 ; Stipend_4) ● Formal characterization of a correct typing solution for a program ● Path-sensitive type inference algorithm – Improved accuracy; program-point specific types – Computed solution helps in constructing class diagram 8 / 51
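To make the guarded type notation concrete, here is a small Python sketch of one possible representation. It is an illustration under stated assumptions, not the representation used in the TACAS '05 work: the names `Atom`, `record_matches`, and `union_member` are hypothetical, and a guard is simplified to a single literal that a value must equal ('E') or must differ from (!{'E'}).

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class Atom:
    name: str
    length: int
    equals: Optional[str] = None      # guard 'E': value must equal this literal
    not_equals: Optional[str] = None  # guard !{'E'}: value must differ from it

    def matches(self, value: str) -> bool:
        if len(value) != self.length:
            return False
        if self.equals is not None and value != self.equals:
            return False
        if self.not_equals is not None and value == self.not_equals:
            return False
        return True


Record = Tuple[Atom, ...]  # a record is a concatenation of atoms


def record_length(record: Record) -> int:
    return sum(atom.length for atom in record)


def record_matches(record: Record, data: str) -> bool:
    """Check that data has the record's length and satisfies every guard."""
    if len(data) != record_length(record):
        return False
    pos = 0
    for atom in record:
        if not atom.matches(data[pos:pos + atom.length]):
            return False
        pos += atom.length
    return True


def union_member(union: Sequence[Record], data: str) -> Optional[Record]:
    """Return the first record of the union that describes data, if any."""
    for record in union:
        if record_matches(record, data):
            return record
    return None


# The slide's example: ('E':Emp_1 ; EId_7 ; Salary_4 ; Unused_2)
#                    | (!{'E'}:Vis_1 ; SSN_9 ; Stipend_4)
EMPLOYEE: Record = (Atom("Emp", 1, equals="E"), Atom("EId", 7),
                    Atom("Salary", 4), Atom("Unused", 2))
VISITOR: Record = (Atom("Vis", 1, not_equals="E"), Atom("SSN", 9),
                   Atom("Stipend", 4))
PERSON = (EMPLOYEE, VISITOR)
```

Both branches have total length 14, so the guard on the first byte is the only thing that distinguishes them for a given 14-character value.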

Applications of guarded type system ● Program understanding ● Understanding impact of changes ● Program transformation – Field expansion (e.g., Y2K expansion) – Porting from weakly-typed languages to object-oriented languages – Refactoring data declarations to make them better reflect logical structure 9 / 51

Key features of algorithm ● Based on dataflow analysis – Dataflow fact at each point is a type for the entire memory – Each origin statement (READ, MOVE literal TO var) gets a unique type variable ● Interprets predicates of the form var == literal, var != literal ● Two key operations: – Split: Replace α_i by the concatenation β_j ; γ_k, where i = j + k – Specialize: Replace α_i by the union β_i | γ_i 10 / 51
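The two operations can be sketched in Python as follows. This is a simplified illustration, not the algorithm's actual implementation: `TypeVar`, `fresh`, `split`, and `specialize` are hypothetical names, guards are stored as display strings, and the sketch only produces the replacement variables without updating a dataflow fact.

```python
from dataclasses import dataclass
from itertools import count
from typing import Optional, Tuple

_ids = count()  # source of fresh variable names


@dataclass(frozen=True)
class TypeVar:
    name: str
    length: int
    guard: Optional[str] = None  # e.g. "'M'" or "!{'M'}"; None if unguarded


def fresh(length: int, guard: Optional[str] = None) -> TypeVar:
    return TypeVar(f"t{next(_ids)}", length, guard)


def split(var: TypeVar, j: int) -> Tuple[TypeVar, TypeVar]:
    """Split: replace var (length i) by a concatenation of lengths j and i - j."""
    if not 0 < j < var.length:
        raise ValueError("split point must fall strictly inside the variable")
    return fresh(j), fresh(var.length - j)


def specialize(var: TypeVar, literal: str) -> Tuple[TypeVar, TypeVar]:
    """Specialize: replace var by a union of two same-length guarded variables."""
    return (fresh(var.length, f"'{literal}'"),
            fresh(var.length, f"!{{'{literal}'}}"))


# Mirrors the first step of the worked example: a_44 is split into b_1 ; c_43,
# then b_1 is specialized on the literal 'M'.
a = TypeVar("a", 44)
b, c = split(a, 1)
d, e = specialize(b, "M")
```

Split preserves total length (1 + 43 = 44), and Specialize preserves the length of the variable it replaces, which matches the constraints i = j + k and the shared subscript i on the slide.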

Memory layout: CARD-TRANSACTION-REC, ATM-DETAILS, MERC-DETAILS, CARD-NUM, CASHBACK-RATE, CASHBACK
/1/ READ CARD-TRANSACTION-REC.
    CARD-TRANSACTION-REC: a_44
/2/ IF LOCATION-TYPE = 'M'
    Split a_44 → b_1 ; c_43
    Specialize b_1 → 'M':d_1 | !{'M'}:e_1
11 / 51

Memory layout: CARD-TRANSACTION-REC, ATM-DETAILS, MERC-DETAILS, CARD-NUM, CASHBACK-RATE, CASHBACK
/1/ READ CARD-TRANSACTION-REC.
    CARD-TRANSACTION-REC: ('M':d_1 ; c_43) | (!{'M'}:e_1 ; f_43)
/2/ IF LOCATION-TYPE = 'M'
12 / 51

Memory layout: CARD-TRANSACTION-REC, ATM-DETAILS, MERC-DETAILS, CARD-NUM, CASHBACK-RATE, CASHBACK
/1/ READ CARD-TRANSACTION-REC.
    CARD-TRANSACTION-REC: ('M':d_1 ; c_43) | (!{'M'}:e_1 ; f_43)
/2/ IF LOCATION-TYPE = 'M'
    true branch: 'M':d_1 ; c_43
/3/ MOVE LOCATION-DETAILS TO MERC-DETAILS
/4/ ELSE
    else branch: !{'M'}:e_1 ; f_43
/5/ MOVE LOCATION-DETAILS TO ATM-DETAILS
13 / 51

Memory layout: CARD-TRANSACTION-REC, ATM-DETAILS, MERC-DETAILS, CARD-NUM, CASHBACK-RATE, CASHBACK
/1/ READ CARD-TRANSACTION-REC.
    CARD-TRANSACTION-REC: ('M':d_1 ; h_20 ; i_23) | (!{'M'}:e_1 ; f_43)
/2/ IF LOCATION-TYPE = 'M'
    true branch: 'M':d_1 ; h_20 ; i_23
/3/ MOVE LOCATION-DETAILS TO MERC-DETAILS
    CARD-TRANSACTION-REC: 'M':d_1 ; h_20 ; i_23   MERC-DETAILS: h_20
/4/ ELSE
    else branch: !{'M'}:e_1 ; f_43
/5/ MOVE LOCATION-DETAILS TO ATM-DETAILS
14 / 51

Memory layout: CARD-TRANSACTION-REC, ATM-DETAILS, MERC-DETAILS, CARD-NUM, CASHBACK-RATE, CASHBACK
/1/ READ CARD-TRANSACTION-REC.
    CARD-TRANSACTION-REC: ('M':d_1 ; h_20 ; i_23) | (!{'M'}:e_1 ; j_20 ; k_23)
/2/ IF LOCATION-TYPE = 'M'
    true branch: 'M':d_1 ; h_20 ; i_23
/3/ MOVE LOCATION-DETAILS TO MERC-DETAILS
    CARD-TRANSACTION-REC: 'M':d_1 ; h_20 ; i_23   MERC-DETAILS: h_20
/4/ ELSE
    else branch: !{'M'}:e_1 ; j_20 ; k_23
/5/ MOVE LOCATION-DETAILS TO ATM-DETAILS
    CARD-TRANSACTION-REC: !{'M'}:e_1 ; j_20 ; k_23   ATM-DETAILS: j_20
15 / 51

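Memory layout: CARD-TRANSACTION-REC, ATM-DETAILS, MERC-DETAILS, CARD-NUM, CASHBACK-RATE, CASHBACK
/1/ READ CARD-TRANSACTION-REC.
    CARD-TRANSACTION-REC: ('M':d_1 ; h_20 ; i_23) | (!{'M'}:e_1 ; j_20 ; k_23)
/7/ IF CARD-INFO[1:1] = 'C'
    incoming fact, a union of two memory types:
    CARD-TRANSACTION-REC: 'M':d_1 ; h_20 ; i_23      MERC-DETAILS: h_20
    CARD-TRANSACTION-REC: !{'M'}:e_1 ; j_20 ; k_23   ATM-DETAILS: j_20
16 / 51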

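Memory layout: CARD-TRANSACTION-REC, ATM-DETAILS, MERC-DETAILS, CARD-NUM, CASHBACK-RATE, CASHBACK
/1/ READ CARD-TRANSACTION-REC.
    CARD-TRANSACTION-REC: ('M':d_1 ; h_20 ; l_1 ; m_22) | (!{'M'}:e_1 ; j_20 ; n_1 ; o_22)
/7/ IF CARD-INFO[1:1] = 'C'
    incoming fact, a union of two memory types:
    CARD-TRANSACTION-REC: 'M':d_1 ; h_20 ; l_1 ; m_22      MERC-DETAILS: h_20
    CARD-TRANSACTION-REC: !{'M'}:e_1 ; j_20 ; n_1 ; o_22   ATM-DETAILS: j_20
    Specialize l_1 → 'C':p_1 | !{'C'}:q_1
    Specialize n_1 → 'C':r_1 | !{'C'}:s_1
17 / 51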

Memory layout: CARD-TRANSACTION-REC, ATM-DETAILS, MERC-DETAILS, CARD-NUM, CASHBACK-RATE, CASHBACK
/1/ READ CARD-TRANSACTION-REC.
    CARD-TRANSACTION-REC is now a union of four record types:
    'M':d_1 ; h_20 ; 'C':p_1 ; m_22
    !{'M'}:e_1 ; j_20 ; 'C':r_1 ; o_22
    'M':t_1 ; u_20 ; !{'C'}:q_1 ; v_22
    !{'M'}:w_1 ; x_20 ; !{'C'}:s_1 ; y_22
/2/ IF LOCATION-TYPE = 'M'
18 / 51

Memory layout: CARD-TRANSACTION-REC, ATM-DETAILS, MERC-DETAILS, CASHBACK-RATE, CASHBACK, CARD-NUM
Solution after analyzing the whole program (statements /2/ to /18/ of the example program). Starting from the four-way union
    'M':d_1 ; h_20 ; 'C':p_1 ; m_22
    !{'M'}:e_1 ; j_20 ; 'C':r_1 ; o_22
    'M':t_1 ; u_20 ; !{'C'}:q_1 ; v_22
    !{'M'}:w_1 ; x_20 ; !{'C'}:s_1 ; y_22
the remaining MOVE and WRITE statements split the variables further, giving for CARD-TRANSACTION-REC:
    'M':d_1 ; h1_8 ; h2_12 ; 'C':p_1 ; m1_2 ; m2_16 ; m3_4
    !{'M'}:e_1 ; j1_5 ; j2_12 ; j3_3 ; 'C':r_1 ; o1_2 ; o2_16 ; o3_4
    'M':t_1 ; u1_8 ; u2_12 ; !{'C'}:q_1 ; v1_16 ; v2_2 ; v3_4
    !{'M'}:w_1 ; x1_5 ; x2_12 ; x3_3 ; !{'C'}:s_1 ; y1_16 ; y2_2 ; y3_4
19 / 51

Correctness characterization ● A typing solution is correct because there exists an atomization of the input (a b c d e f ...) and a typing of each atom (α, β, γ, η, ...) such that the types completely describe the runtime values ● Illustration: the runtime values reaching a statement such as MOVE X TO ... (a b c, b c d) are each described by a member of the typing solution at that statement (α ; β | β ; γ) 20 / 51

Characteristics of the solution ● Flow and path sensitive: – Each occurrence of a variable is assigned a type – Uses guards to ignore certain infeasible paths ● Determines variables of the same type, reveals record structure within variables, as well as disjoint unions ● Shortcomings: – Dataflow facts are “unfactored”, potentially of exponential size 21 / 51
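The exponential-size shortcoming can be illustrated with a short Python sketch. It is only a counting argument under a simplifying assumption, namely that every guard is independent of the others and each one doubles the number of union members; the function name and the guard strings are hypothetical.

```python
from itertools import product
from typing import List, Tuple


def unfactored_members(guards: List[Tuple[str, str]]) -> List[Tuple[str, ...]]:
    """Enumerate every combination of guard outcomes.

    Each guard is a (position, literal) pair and contributes two alternatives,
    "position='lit'" and "position=!{'lit'}", so k independent guards yield
    2**k union members when the fact is kept unfactored.
    """
    alternatives = [
        (f"{pos}='{lit}'", f"{pos}=!{{'{lit}'}}") for pos, lit in guards
    ]
    return list(product(*alternatives))


# The example program tests two bytes: [1:1] against 'M' and [22:22] against 'C'.
members = unfactored_members([("[1:1]", "M"), ("[22:22]", "C")])
```

For the example program this gives the four record types seen in the worked trace; ten such independent tests would already give 1024 members.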

/1/ READ CARD-TRANSACTION-REC.
    Final type of CARD-TRANSACTION-REC, one record type per combination of guards on bytes [1:1] and [22:22]:
    [1:1]='M',    [22:22]='C':     'M':d_1 ; h1_8 ; h2_12 ; 'C':p_1 ; m1_2 ; m2_16 ; m3_4
    [1:1]=!{'M'}, [22:22]='C':     !{'M'}:e_1 ; j1_5 ; j2_12 ; j3_3 ; 'C':r_1 ; o1_2 ; o2_16 ; o3_4
    [1:1]='M',    [22:22]=!{'C'}:  'M':t_1 ; u1_8 ; u2_12 ; !{'C'}:q_1 ; v1_16 ; v2_2 ; v3_4
    [1:1]=!{'M'}, [22:22]=!{'C'}:  !{'M'}:w_1 ; x1_5 ; x2_12 ; x3_3 ; !{'C'}:s_1 ; y1_16 ; y2_2 ; y3_4
22 / 51

Algorithm 2 [ICSE '06, WCRE '07] 1. Compute guarded dependences 2. Compute cuts at each data-source statement (i.e., READ statement) 3. Organize the cuts as a cut-structure tree ● It is possible, but not desirable, to translate the cut-structure tree directly into a class hierarchy 4. Factor the cut-structure tree to better capture the grouping/structure of sibling cuts 5. Translate the cut-structure tree into a class hierarchy 23 / 51
