High-Performance Quantum Simulation: A Challenge to the Schrödinger Equation on 256^4 Grids


SLIDE 1

High-Performance Quantum Simulation: A Challenge to the Schrödinger Equation on 256^4 Grids

Toshiyuki Imamura (今村俊幸) 1,3
Thanks to Susumu Yamada 2,3, Takuma Kano 2, and Masahiko Machida 2,3

  • 1. UEC (University of Electro-Communications, 電気通信大学)
  • 2. CCSE, JAEA (Japan Atomic Energy Agency)
  • 3. CREST, JST (Japan Science and Technology Agency)

SLIDE 2

Jan. 4-8, 2008, RANMEP2008, NCTS, Taiwan (清華大学 新竹 台湾)

Outline

I.   Physics, Review of Quantum Simulation
II.  Mathematics, Numerical Algorithm
III. Grand Challenge, Parallel Computing on ES
IV.  Numerical Results
V.   Conclusion

SLIDE 3

I. Physics, Review of Quantum Simulation, etc.

SLIDE 4

1.1 Quantum Simulation (1/2)

Down-sizing: crossover from classical to quantum???
Classical Equation of Motion vs. Schrödinger Equation

[Figure: junction down-sizing schematic (S, I layers; widths W, W')]

SLIDE 5

1.2 Quantum Simulation (2/2)

Numerical Simulation for the Coupled Schrödinger Equation
  • α: coupling
  • β: 1/Mass ∝ 1/W
  • H: spectral expansion by the eigenvectors {u_n}
  • Ψ: a possible state; not a value but a vector!

Requirement of exact diagonalization of the Hamiltonian.

Numerical method to solve the above equation.
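The spectral-expansion idea can be sketched concretely: diagonalize H once, expand Ψ in the eigenvectors {u_n}, and evolve each coefficient by a phase. A minimal NumPy illustration (the 3x3 Hermitian H here is a small stand-in, not the actual 256^4-dimensional coupled-junction Hamiltonian):

```python
import numpy as np

# Stand-in Hermitian Hamiltonian (the real problem is 256^4-dimensional)
H = np.array([[2.0, 1.0, 0.0],
              [1.0, 2.0, 1.0],
              [0.0, 1.0, 2.0]])

# Exact diagonalization: H u_n = E_n u_n
E, U = np.linalg.eigh(H)

# Psi is a possible state: not a value but a vector
psi0 = np.array([1.0, 0.0, 0.0], dtype=complex)

def evolve(psi, t):
    """Psi(t) = sum_n <u_n|Psi> exp(-i E_n t) u_n  (hbar = 1)."""
    c = U.conj().T @ psi                 # expansion coefficients <u_n|Psi>
    return U @ (np.exp(-1j * E * t) * c)

psi_t = evolve(psi0, 1.0)                # norm is conserved (unitary evolution)
```

This makes the requirement on the slide explicit: the exact diagonalization is done once, and every later time step is just phases times precomputed eigenvectors.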

SLIDE 6

II. Mathematics, Numerical Algorithm, etc.

SLIDE 7

2.1 Krylov Subspace Iteration

  • Lanczos (traditional method)
  • Krylov + GS: simple, but a shift-and-invert version is needed
  • LOBPCG (Locally Optimal Block PCG)
      • {Krylov base, Ritz vector, prior vector}: a CG approach
      • **Restarts at every iteration**
      • **INVERSE-free** -> less communication

[Figure: LOBPCG vs. Lanczos]
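For reference, SciPy ships an implementation of the same algorithm; a small sketch on a 1D Laplacian (this is not the ES code, merely an illustration of the inverse-free block iteration over {Krylov base, Ritz vectors, prior vectors}):

```python
import numpy as np
from scipy.sparse import diags
from scipy.sparse.linalg import lobpcg

n = 100
# 1D Laplacian: a simple symmetric positive definite test matrix
A = diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n)).tocsr()

rng = np.random.default_rng(0)
X = rng.random((n, 4))                 # block of 4 initial guess vectors

# Smallest 4 eigenpairs; inverse-free: only products with A are needed
vals, vecs = lobpcg(A, X, largest=False, tol=1e-8, maxiter=2000)

# Analytic eigenvalues of the 1D Laplacian: 2 - 2 cos(pi k / (n+1))
exact = 2.0 - 2.0 * np.cos(np.pi * np.arange(1, 5) / (n + 1))
```

Note that no linear solve with A appears anywhere, which is exactly the "inverse-free, less communication" property the slide emphasizes.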

SLIDE 8

2.2 LOBPCG

Costly! Since the block is updated at every iteration, the MV (matrix-vector) operation is also required at every step: 1 MV per iteration (Lanczos) vs. 3 MVs per iteration (LOBPCG).

Other difficulties in implementation:
  • Breakdown of linear independency -> we build our own DSYGV using LDL and deflation (not Cholesky)
  • Growth of numerical error in {W, X, P} -> detect numerical error and recalculate automatically
  • Choice of the shift
  • Portability
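The "breakdown of linear independency" fix can be sketched as follows. This is a simplified stand-in for the custom LDL-with-deflation DSYGV: we use an SVD of the block for clarity, but the purpose, detecting (nearly) dependent directions and deflating them, is the same; `deflate_block` is our name, not from the actual code:

```python
import numpy as np

def deflate_block(V, tol=1e-10):
    """Drop (nearly) linearly dependent columns of V.

    Simplified stand-in for LDL with deflation: small pivots of the
    Gram matrix reveal dependent directions; here the SVD plays that
    role for clarity.
    """
    U, s, _ = np.linalg.svd(V, full_matrices=False)
    keep = s > tol * s[0]
    return U[:, keep] * s[keep]      # well-conditioned replacement basis

V = np.array([[1.0, 2.0, 3.0],
              [0.0, 1.0, 1.0],
              [1.0, 0.0, 1.0]])      # third column = first + second
W = deflate_block(V)                 # two independent columns survive
```

The deflated block spans the same subspace as V, so the Rayleigh-Ritz step can proceed without a singular (or indefinite) Gram matrix.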

SLIDE 9

2.3 Preconditioning

T ~ H^-1, where
H = A + B1 + B2 + B3 + B4 + C12 + C23 + C34

Preconditioner choices:
  • H ~ A
  • H ~ (A + B1)
  • H ~ (A + B1) A^-1 (A + B2)

Here A is diagonal, A + Bx is block tridiagonal, and shift + LDL^t is used.

[Figure: residual error vs. iteration count (up to ~500) for: no preconditioner, H1 (Point Jacobi), H2 (LDL), H3 (LDL)]
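The cheapest choice above, H ~ A with A diagonal, is exactly a Point Jacobi preconditioner (the "H1" curve). A small sketch of its effect in a preconditioned residual-correction loop; the matrix here is an illustrative diagonally dominant stand-in, and the H2/H3 variants would replace the diagonal solve by a shifted block-tridiagonal LDL^t solve:

```python
import numpy as np

# Illustrative SPD, diagonally dominant stand-in for H
rng = np.random.default_rng(1)
n = 50
B = 0.1 * rng.random((n, n))
H = B @ B.T + np.diag(10.0 + np.arange(n))

x_true = rng.random(n)
b = H @ x_true

# Point Jacobi: approximate T ~ H^{-1} by diag(H)^{-1}
d_inv = 1.0 / np.diag(H)

x = np.zeros(n)
for _ in range(200):
    r = b - H @ x          # residual
    x = x + d_inv * r      # preconditioned correction step
```

Even this trivial T already contracts the error at every step; the LDL^t variants buy a better approximation of H^-1 at the cost of a block-tridiagonal solve per iteration.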

SLIDE 10

III. Grand Challenge, Parallel Computing on ES, etc.

SLIDE 11

3.2 Technical Issues on the Earth Simulator

Programming model: a hybrid of distributed parallelism and thread parallelism.

  • Inter-node: MPI (Message Passing Interface); low latency (6.63 us), very fast (11.63 GB/s)
  • Intra-node: auto-parallelization / OpenMP (thread-level parallelism)
  • Vector processor (innermost loops): auto-/manual vectorization

3-level parallelism.

[Figure: a node with Processors 0-7; vector processing within each processor, intra-node parallelism within a node, inter-node parallelism across nodes]

SLIDE 12

3.3 Quantum Simulation Parallel Code

Application flow chart:
  • Eigenmode calculation: parallel LOBPCG solver developed on ES
  • Time integrator: parallel code on ES
  • Quantum state analyzer: parallel code on ES
  • Visualization: visualized by AVS

SLIDE 13

3.4 Handling of Huge Data

Data distribution in the case of a 4D array (i, j, k, l), each loop of length 256:
  • (k, l) / NP: 2-dimensional loop decomposition across MPI processes (inter-node)
  • j / MP: 1-dimensional loop decomposition for intra-node parallelization
  • i: vector processing (loop length = 256)

NP: number of MPI processes; MP: number of microtasking processes (= 8).
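The decomposition can be made concrete with a little index arithmetic: each MPI rank owns a contiguous chunk of the flattened (k, l) plane, and each thread a chunk of j. An illustrative sketch (names like `owner_rank` and the NP value are ours, not from the ES code):

```python
# Distribute the 256^4 grid: (k, l) over NP MPI processes, j over MP threads
N = 256
NP = 64    # number of MPI processes (example value)
MP = 8     # number of microtasking processes, fixed at 8 per ES node

def owner_rank(k, l):
    """MPI rank owning grid column (k, l): flatten, then block-distribute."""
    flat = l * N + k                  # flatten the (k, l) plane
    chunk = (N * N) // NP             # columns per rank (assumes NP divides N^2)
    return flat // chunk

def thread_range(tid):
    """Half-open j-range handled by thread tid inside one node."""
    chunk = N // MP                   # 256 / 8 = 32 j-values per thread
    return tid * chunk, (tid + 1) * chunk
```

With this layout the innermost i-loop always runs over the full, contiguous length 256, which is what feeds the vector pipelines.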

SLIDE 14

3.5 Parallel LOBPCG

  • The core of the implementation is the matrix-vector multiplication.
  • 3-level parallelism is carefully applied in our implementation.
  • In the inter-node parallelization, communication pipelining is used.
  • In the Rayleigh-Ritz part, ScaLAPACK is used.

Kernel (Acg.f):

do l=1,256              ! inter-node parallelism
  do k=1,256            ! inter-node parallelism
    do j=1,256          ! intra-node (thread) parallelism
      do i=1,256        ! vectorization
        w(i,j,k,l) = a(i,j,k,l)*v(i,j,k,l)    &
                   + b*(v(i+1,j,k,l) + ...)   &
                   + c*(v(i+1,j+1,k,l) + ...)
      enddo
    enddo
  enddo
enddo
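For readers more comfortable with array notation, a NumPy sketch of the same update on a toy grid. Only the neighbor terms visible on the slide are kept; the elided terms of the real stencil are omitted, and `b`, `c` are illustrative constants:

```python
import numpy as np

n = 8                      # toy stand-in for the 256^4 grid
rng = np.random.default_rng(0)
a = rng.random((n, n, n, n))
# One extra layer in i and j covers the v(i+1, ...) and v(i+1, j+1, ...) accesses
v = rng.random((n + 1, n + 1, n, n))
b, c = 0.5, 0.25

# w(i,j,k,l) = a*v(i,j,k,l) + b*v(i+1,j,k,l) + c*v(i+1,j+1,k,l)  (shown terms only)
w = (a * v[:n, :n]
     + b * v[1:, :n]
     + c * v[1:, 1:])
```

The slicing mirrors the loop nest: whole-axis operations over i and j correspond to the vectorized and threaded loops, while k and l would be split across MPI ranks.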

SLIDE 15

IV. Numerical Results

SLIDE 16

4.1 Numerical Result

Preliminary test of our eigensolver on a 4-junction system: a 256^4-dimensional problem.

Performance (5 eigenmodes):

  CPUs   time [s]   TFLOPS
  2048   3118       3.65
  3072   2535       4.49
  4096   1621       7.02

[Figure: convergence history (10 eigenmodes): residual error vs. iteration count for the ground state through the 10th lowest state]
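A quick strong-scaling check on the numbers in the table (our arithmetic, not from the slide): doubling from 2048 to 4096 CPUs cuts the time by a factor of about 1.92, i.e. roughly 96% parallel efficiency.

```python
cpus = [2048, 3072, 4096]
times = [3118, 2535, 1621]         # seconds, from the table above

base_p, base_t = cpus[0], times[0]
for p, t in zip(cpus, times):
    speedup = base_t / t
    efficiency = speedup / (p / base_p)   # relative to the 2048-CPU run
    print(f"{p} CPUs: speedup {speedup:.2f}, efficiency {efficiency:.0%}")
```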

SLIDE 17

4.2 Numerical Result (Scenario)

The simplest case (two junctions):
  • Initial state: potential change in only a single junction
  • Capacitive coupling
  • Question: synchronization or independence (localization)?

SLIDE 18

4.3 Numerical Result

Two-stacked intrinsic Josephson junction (phases θ1, θ2):
  • Classical regime: independent dynamics
  • Quantum regime: ?

SLIDE 19

α = 0.4, β = 0.2

[Figure: snapshots of (q1, q2) at t = 0.0, 2.9, 9.2, and 10.0 (a.u.)]

SLIDE 20

α = 0.4, β = 1.0

[Figure: snapshots of (q1, q2) at t = 0.0, 2.5, 4.2, and 10.0 (a.u.)]

SLIDE 21

Two Junctions
  • Weakly quantum (classical): independence
  • Strongly quantum: synchronization

SLIDE 22

Three Junctions

SLIDE 23

α = 0.4, β = 0.2

[Figure: three-junction dynamics]

SLIDE 24

α = 0.4, β = 1.0

[Figure: three-junction dynamics]

SLIDE 25

4 Junctions

[Figure: time evolution of <q1>, <q2>, <q3>, <q4> vs. t (a.u.); (a) α = 0.4, β = 0.2; (b) α = 0.4, β = 1.0]

Quantum Assisted Synchronization

SLIDE 26

V. Conclusion
SLIDE 27

5. Conclusion

  • Collective MQT in intrinsic Josephson junctions via parallel computing on ES
      • Direct quantum simulation (4 junctions)
      • Quantum (synchronous) vs. classical (localized)
      • Quantum Assisted Synchronization
  • High-performance computing
      • Novel eigenvalue algorithm: LOBPCG
      • Communication-free (or communication-reduced) implementation
      • Sustained 7 TFLOPS (21.4% of peak)
  • Toward peta-scale computing?

SLIDE 28

Thank you! 謝謝

Further information:
  • Physics: machida.masahiko@jaea.go.jp
  • HPC: imamura@im.uec.ac.jp