Architecture without explicit locks for logic simulation on SIMD machines



SLIDE 1

SIMD simulation

  • M. Chimeh,
  • P. Cockshott

Importance Of Simulation Simulation Algorithms Circuit Representation SIMD Simulation Machines Results

Setup Parallelism Comparisons Compilers

Summary

Architecture without explicit locks for logic simulation on SIMD machines

  • M. Chimeh
  • P. Cockshott

Department of Computer Science University of Glasgow

UKMAC, 2016

SLIDE 2

Contents

1 Importance Of Simulation
2 Simulation Algorithms
3 Circuit Representation
4 SIMD Simulation
5 Machines
6 Results
  • Setup
  • Parallelism
  • Comparisons
  • Compilers

SLIDE 3

The Importance Of Simulation

Using models to replicate the behaviour of an actual system is called simulation. A model is a simpler and abstract version of a desired system. In general, simulation refers to time evolution of a computerized version of a model.

Due to the growth of design size and complexity, design verification is an important aspect of the Integrated Circuit (IC) development process. The purpose of verification is to validate that the design meets the system requirements and specification. This is done by either functional or formal verification. The most popular approach to functional verification is the use of simulation based techniques.

SLIDE 4

Cycle based vs Event Based simulation

Cycle based
  • Evaluates all logic gates during every simulation cycle
  • Handles synchronous designs
  • Suitable for circuits with a high activity rate
  • Performs unnecessary simulations (extra computation)

Event based
  • Evaluates only logic gates with a change on their inputs
  • Handles both synchronous and asynchronous designs
  • Suitable for circuits with a low activity rate
  • Requires a centralized scheduler that may cause a large amount of overhead
  • Maintaining a queue for the list of events is challenging
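
To make the event based column concrete, here is a minimal sketch of an event-driven step, assuming a zero-delay model and an acyclic circuit. The function name `event_step`, the `gates` and `fanout` dictionaries and the two-gate circuit are hypothetical and are not taken from the slides; the point is only to show the central event queue that the cycle based approach avoids.

```python
from collections import deque
import operator

def event_step(gates, fanout, state, changes):
    """Apply input changes and re-evaluate only the affected gates.

    gates:   dict mapping a gate's output signal to (fn, in_a, in_b)
    fanout:  dict mapping a signal to the gate outputs it feeds
    state:   dict of current signal values, updated in place
    changes: dict of new primary-input values
    Returns the number of gate evaluations performed.
    """
    queue = deque()                      # the centralized event list
    for sig, val in changes.items():
        if state[sig] != val:            # an event is a change of value
            state[sig] = val
            queue.extend(fanout.get(sig, []))
    evaluations = 0
    while queue:
        g = queue.popleft()
        fn, a, b = gates[g]
        new = fn(state[a], state[b])
        evaluations += 1
        if new != state[g]:              # propagate only real changes
            state[g] = new
            queue.extend(fanout.get(g, []))
    return evaluations

# Hypothetical two-gate circuit: x = a AND b, y = x OR c
gates = {"x": (operator.and_, "a", "b"), "y": (operator.or_, "x", "c")}
fanout = {"a": ["x"], "b": ["x"], "c": ["y"], "x": ["y"]}
state = {"a": 0, "b": 1, "c": 1, "x": 0, "y": 1}
```

Calling `event_step(gates, fanout, state, {"a": 1})` evaluates only the gates reached by the change, whereas a cycle based simulator would evaluate every gate on every cycle.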

SLIDE 5

The cycle based simulation algorithm can be used to accelerate the simulation of a synchronous design that is composed of combinational blocks and latches.

Cycle Based Algorithm

initialize each flip flop to zero
while there is more input
    read inputs
    for pd = 0 to critical path depth
        simulate each logic function at depth = pd
    update flip flops
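
The cycle based loop above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: the data layout (`levels` as a list of gate tuples per depth, `flops` as a mapping from each flip flop output to the signal it samples) and the toggle circuit are hypothetical.

```python
import operator

def simulate(levels, flops, input_stream):
    """Cycle based simulation of a levelised synchronous circuit.

    levels:       list indexed by depth; each entry is a list of
                  (out, fn, in_a, in_b) gate tuples at that depth
    flops:        dict mapping a flip flop output signal to the signal
                  it samples at the end of the cycle
    input_stream: iterable of dicts with the primary inputs per cycle
    Returns a list with a snapshot of all signals after each cycle.
    """
    state = {q: 0 for q in flops}               # initialize each flip flop to zero
    trace = []
    for inputs in input_stream:                 # while there is more input
        state.update(inputs)                    # read inputs
        for gates in levels:                    # for pd = 0 to critical path depth
            for out, fn, a, b in gates:         # simulate each function at depth pd
                state[out] = fn(state[a], state[b])
        sampled = {q: state[d] for q, d in flops.items()}
        state.update(sampled)                   # update flip flops together
        trace.append(dict(state))
    return trace

# Hypothetical toggle circuit: d = q XOR en, and q samples d every cycle
levels = [[("d", operator.xor, "q", "en")]]
flops = {"q": "d"}
trace = simulate(levels, flops, [{"en": 1}, {"en": 1}, {"en": 0}])
```

All flip flop inputs are sampled before any flip flop is written, so the update order of the flip flops does not affect the result.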

SLIDE 6

Levelisation

Step 1. Form the set of all signals feeding the latches or outputs.
Step 2. Push the gates whose outputs generate this set onto a stack.
Step 3. Form the set of all signals feeding the set of gates on the top of the stack.
Step 4. If this set is empty go to step 5, otherwise go to step 2.
Step 5. Set n=0.
Step 6. Pop the stack and label all gates with level n.
Step 7. If the stack is empty terminate, otherwise set n=n+1 and go to step 6.

Figure: Levelisation example in a circuit, running from the inputs through Level 1, Level 2, ..., Level d-1, Level d to the outputs. Each of the coloured blocks can be simulated in parallel.
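
The levelisation steps can be sketched as follows, assuming an acyclic combinational network in which each gate is named after its output signal. Two details are assumptions of this sketch rather than statements from the slide: the loop stops as soon as no gate drives the current signal set (so no empty frame is pushed), and a gate that appears in several stack frames keeps the first level it is given, which places it before every gate it feeds. The function name `levelise` and the example circuit are hypothetical.

```python
def levelise(gates, sinks):
    """Assign a level to every gate reachable from the sinks.

    gates: dict mapping a gate's output signal to its input signals
    sinks: signals feeding the latches or primary outputs
    Signals absent from `gates` are primary inputs or latch outputs.
    """
    stack = []
    frontier = set(sinks)                                   # step 1
    while True:
        drivers = {s for s in frontier if s in gates}       # step 2
        if not drivers:                                     # step 4
            break
        stack.append(drivers)
        frontier = {i for g in drivers for i in gates[g]}   # step 3
    level = {}
    n = 0                                                   # step 5
    while stack:                                            # steps 6 and 7
        for g in stack.pop():
            level.setdefault(g, n)   # assumption: keep the first label
        n += 1
    return level

# Hypothetical circuit: D feeds both G and O, G feeds H, H feeds O
example = {"O": ["D", "H"], "H": ["G"], "G": ["D"], "D": ["a", "b"]}
levels = levelise(example, ["O"])
```

Gates that share a level have no dependencies on each other, which is what allows each coloured block to be simulated in parallel.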

SLIDE 7

Circuit Representation

Figure: Vectors to hold the circuit specification

The comp array holds the type of each logic gate. The inp0 and inp1 arrays point to the locations in the state array where the gate's input signal values are stored.

Figure: Signal state vector

The state array contains all the signal values. Output signals of logic gates at the same level are stored adjacent to each other.
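
A minimal sketch of this representation follows. The names comp, inp0, inp1 and state come from the slide; everything else is an assumption made for illustration: the gate-type codes, the rule that gate i writes its output to state[n_inputs + i], and the `level_bounds` list that marks which gate indices belong to each level.

```python
AND, OR, NOT, XOR = range(4)      # hypothetical gate-type codes

def eval_gate(kind, a, b):
    """Evaluate one two-input gate; NOT ignores its second operand."""
    if kind == AND:
        return a & b
    if kind == OR:
        return a | b
    if kind == NOT:
        return a ^ 1
    if kind == XOR:
        return a ^ b
    raise ValueError(kind)

def simulate_levels(comp, inp0, inp1, state, n_inputs, level_bounds):
    """Evaluate all gates level by level, updating `state` in place."""
    for lo, hi in level_bounds:
        # Gather the operands of the whole level from the state array.
        a = [state[inp0[i]] for i in range(lo, hi)]
        b = [state[inp1[i]] for i in range(lo, hi)]
        # Outputs of one level land in adjacent state slots.
        for k, i in enumerate(range(lo, hi)):
            state[n_inputs + i] = eval_gate(comp[i], a[k], b[k])
    return state

# Three inputs (state[0..2]) and three gates (state[3..5]):
# g0 = in0 AND in1, g1 = in1 OR in2 (level 0); g2 = g0 XOR g1 (level 1)
comp = [AND, OR, XOR]
inp0 = [0, 1, 3]
inp1 = [1, 2, 4]
result = simulate_levels(comp, inp0, inp1, [1, 1, 0, 0, 0, 0], 3, [(0, 2), (2, 3)])
```

The two list comprehensions play the role of a gather: the operand positions are arbitrary, while the results are written to a contiguous range.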

SLIDE 8

Figure: An example of a circuit with labels (two DFFs, a clk input, gates numbered 1 to 7, and an output). Logic gates of the same level are shown in the same color.

Figure: Illustration of input value retrieval from the state array. The diagram shows comp [0..n], inp0 [0..n] and inp1 [0..n] indexing into state [0..m], with the gates grouped into levels L0, L1 and L2.

SLIDE 9

SIMD Simulation Requirement

Figure: Example of performing a SIMD operation on 512 bits of data in the integer array

Figure: An example of the workload among the threads per level of simulation (Level 0, Level 1, Level 2, ..., Level d). The curved lines in the figure symbolize the synchronization between threads.
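
One way to picture the per-level synchronization is a barrier between levels, sketched below. This assumes that a barrier is what the curved lines stand for; the slides do not say which primitive is used. The function name and the gate tuples are hypothetical. Because each gate writes only its own state slot and reads only slots produced by earlier levels, no per-signal lock is taken.

```python
import operator
import threading

def simulate_parallel(levels, state, n_threads):
    """Evaluate levelised gates with n_threads workers.

    levels: list of levels; each level is a list of
            (out_idx, fn, a_idx, b_idx) tuples indexing into `state`
    """
    barrier = threading.Barrier(n_threads)

    def worker(tid):
        for gates in levels:
            # Each thread takes every n_threads-th gate of the level.
            for out, fn, a, b in gates[tid::n_threads]:
                state[out] = fn(state[a], state[b])
            # Nobody starts the next level until this one is complete.
            barrier.wait()

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state

# Level 0: s3 = s0 AND s1, s4 = s1 OR s2; level 1: s5 = s3 XOR s4
levels = [
    [(3, operator.and_, 0, 1), (4, operator.or_, 1, 2)],
    [(5, operator.xor, 3, 4)],
]
result = simulate_parallel(levels, [1, 1, 0, 0, 0, 0], 2)
```

In CPython this shows the structure rather than a speedup, since the interpreter serializes the bytecode of the worker threads.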

SLIDE 10

  • Lookup Table vs Direct Logic
  • Bit Packing vs Word Packing

SLIDE 11

Bit Packing vs Word Packing

Figure: Signal representation using a) word packing, b) bit packing

The state vector can either store each signal as 1 bit or use a whole word for each signal. The inp0, inp1 vectors are unaffected by this choice, but the comp vector can be discarded when using bit packing.
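
The difference between the two layouts can be sketched as below. With word packing the state vector is simply one list element per signal; with bit packing, 32 signals share one word. The 32-bit word size and the least-significant-bit-first order are assumptions of this sketch, and the helper names are hypothetical.

```python
WORD_BITS = 32   # assumed word size

def pack_bits(signals):
    """Pack a list of 0/1 signal values into 32-bit words, LSB first."""
    words = [0] * ((len(signals) + WORD_BITS - 1) // WORD_BITS)
    for i, value in enumerate(signals):
        if value:
            words[i // WORD_BITS] |= 1 << (i % WORD_BITS)
    return words

def get_bit(words, i):
    """Read signal i back out of the packed state."""
    return (words[i // WORD_BITS] >> (i % WORD_BITS)) & 1

# Word packing: 33 signals occupy 33 list elements.
word_packed = [1, 0, 1, 1] + [0] * 28 + [1]
# Bit packing: the same 33 signals fit in two words.
bit_packed = pack_bits(word_packed)
```

Bit packing cuts the memory needed for the state vector by a factor of the word size, at the cost of shifts and masks whenever a single signal is read or written.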

SLIDE 12

Figure: Re-arrangement of logic gates in a circuit in the bit packing technique

This illustrates the re-arranged logic gates in the comp array. Logic gates of the same type are stored next to each other. The rest of the arrays are organized accordingly. The top array is the re-arranged one, and the bottom array is the normal one. This allows the CPU AND, OR, NOT instructions to be used 32 bits at a time.
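
Once gates of the same type sit next to each other, one bitwise operation evaluates a whole word of them. The sketch below assumes the operand bits have already been gathered into packed words, and replaces the per-gate comp entries with a hypothetical `runs` list of (operation, number of words) pairs. Neither `runs` nor `eval_packed` appears in the slides.

```python
MASK = 0xFFFFFFFF   # keep results to 32 bits

def eval_packed(runs, a_words, b_words):
    """Evaluate 32 same-type gates per word with one bitwise operation.

    runs:    list of (op, word_count) in the order the gates are stored
    a_words: packed first operands, one bit per gate
    b_words: packed second operands (ignored for NOT)
    """
    out = []
    w = 0
    for op, count in runs:
        for _ in range(count):
            a, b = a_words[w], b_words[w]
            if op == "AND":
                out.append(a & b)
            elif op == "OR":
                out.append(a | b)
            elif op == "NOT":
                out.append(~a & MASK)
            else:
                raise ValueError(op)
            w += 1
    return out

# One word of AND gates, one of OR gates, one of NOT gates
packed = eval_packed(
    [("AND", 1), ("OR", 1), ("NOT", 1)],
    [0b1100, 0b1100, 0],
    [0b1010, 0b1010, 0],
)
```

The gate type is now implied by the position of the word within a run, which is consistent with the earlier remark that the comp vector can be discarded when using bit packing.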

SLIDE 13

Xeon Phi

Parameter               Intel Xeon Phi Coprocessor 5110P   Intel Xeon Processor E5-2620
Cores, Threads          60, 240                            6, 12
Clock Speed             1.053 GHz                          2 GHz
Memory Capacity         8 GB                               16 GB per socket
Memory Speed            2.75 GHz (5.5 GT/s)                667 MHz (1333 MT/s)
Memory Channels         16                                 4 per socket
Memory Data Width       32 bits                            64 bits
Peak Memory Bandwidth   320 GB/s                           42.6 GB/s per socket
Vector Length           512 bits (Intel IMCI)              256 bits (Intel AVX)
Data Caches             32 KB L1, 512 KB L2 per core       32 KB L1, 256 KB per core, 15 MB L3 per socket

SLIDE 14

Results

SLIDE 15

Experimental Setup

Note that our SIMD algorithm was implemented in both Pascal and C++. ZSIM was compiled with three different compilers (Intel C, GCC, Vector Pascal).

SLIDE 16

Vectorization and Multicore Performance

Figure: Performance comparison of single and multicore SIMD with single core sequential code on Intel Xeon Phi and Xeon. The left plot (Vectorization Performance, plotted against Number of Logic Gates for Xeon and Intel Xeon Phi, single core) shows the speed on both machines using a single core. Acceleration gain falls off for larger circuits that do not fit in one core's cache. The right plot (Parallelization Performance, plotted against Number of Logic Gates for Intel Xeon Phi, multicore SIMD) shows the speedup when 240 SIMD threads were used on Intel Xeon Phi.

SLIDE 17

Performance Comparison to Xilinx Commercial Simulator

Figure: Log/Log plot of gate transitions per second for the Xilinx Simulator ISIM (on Intel i7), and the SIMD ZSIM running on both Intel i7 and Xeon Phi, for circuits from the IWLS suite. Axes: Number of Logic Gates, Number of Gate Transitions Per Second. Series: ZSIM (Xeon Phi: 125 threads), ZSIM (i7: 8 threads), Commercial Simulator.

SLIDE 18

Performance Comparison to Xilinx Commercial Simulator

Figure: Number of gate transitions per second for the Commercial Simulator and the SIMD ZSIM, both running on Intel i7, for synthetic circuits (with inputs from any level). Axes: Number of Logic Gates, Number of Gate Transitions per Second. Series: ICPC (8 threads), Commercial Simulator.

Plot annotations: ZSIM handles 3 orders of magnitude larger circuits, and is 3 orders of magnitude faster than the commercial simulator. A second annotation marks the point at which the commercial simulator fails.

SLIDE 19

Performance Comparison to Blue Gene/L Supercomputer

Table: Characteristic comparison of Intel Xeon Phi and IBM Blue Gene/L

Parameter     IBM Blue Gene/L        Intel Xeon Phi
Cores         1024                   60
Clock Speed   700 MHz/core           1.053 GHz/core
Price         $0.8m - $1.3m          $1600.00 - $2649.00
Size          2m height x 1m width   24.61cm x 11.12cm x 3.86cm

Table: Comparison of number of events per second (IBM Blue Gene/L vs. Intel Xeon Phi)

Machine       Number of gates   Cores/Threads   Event rate (millions/sec)
Blue Gene/L   ≃ 216 million     512             60
                                1024            116
Xeon Phi      ≃ 160 million     125             76.8
                                240             142

1 Xeon Phi thread is as powerful as 4 Blue Gene/L cores

SLIDE 20

Performance Comparison Across Compilers

Figure: Comparison of number of transitions per second of the parallel simulator across different compilers on both AMD Opteron and Xeon Phi machines. Left panel: AMD64, with GCC (64 threads) and VPC (64 threads). Right panel: Xeon Phi, with ICPC (240 threads) and VPC (236 threads). Axes: Number of Logic Gates, Number of Gate Transitions per Second.

SLIDE 21

Performance Comparison Across Compilers

Figure: Comparison of number of transitions per second of the parallel simulator on both Intel Xeon Phi and AMD Opteron, compiled by both Vector Pascal and Intel compiler, for a circuit size of 170M. Axes: Threads, Number of Gate Transitions per Second. Series: AMD64 (GCC compiler), AMD64 (VP compiler), Xeon Phi (VP compiler), Xeon Phi (ICPC compiler).

SLIDE 22

Summary

  • Verified that the data structures used allow SIMD acceleration, particularly on machines with gather instructions.
  • Verified that, on sufficiently large circuits, substantial gains could be made from multi-core parallelism.
  • Showed that a simulator using this approach outperformed an existing commercial simulator on a standard workstation.
  • Showed that the performance on a cheap Xeon Phi card is competitive with results reported elsewhere on much more expensive supercomputers.

SLIDE 23

Thank You