SLIDE 1
R. Carmona‐Galán, J. Fernández‐Berni, S. Vargas‐Sierra, G. Liñán‐Cembrano, Á. Rodríguez‐Vázquez, V. Brea‐Sánchez(*), M. Suárez‐Cambre(*), D. Cabello‐Ferrer(*)

Institute of Microelectronics of Seville (IMSE‐CNM), CSIC‐Universidad de Sevilla (Spain)

(*)Information Technology Research Center (CITIUS) Univ. de Santiago de Compostela (Spain)

Design of a smart camera SoC in a 3D‐IC technology

Workshop on Architecture of Smart Camera

Clermont‐Ferrand, France April 5‐6, 2012

SLIDE 2

WASC 2012, Clermont‐Ferrand, France 2

Main lines

  • Conventional digital signal processing architectures introduce data bottlenecks and are inefficient when dealing with multidimensional sensory signals
  • Architectures adapted to the nature of the stimulus are more efficient in terms of power consumption per operation, but…
  • Concurrent sensing, processing and memory in planar technologies introduce serious limitations to image resolution and image size via the penalties in fill factor and pixel pitch
  • 3D integrated circuit technologies with a dense TSV distribution permit eliminating data bottlenecks without degrading image resolution and size

SLIDE 3

Computational demand in artificial vision

Axes: abstraction level (data structure complexity) vs. data dimensionality (no. of objects)

Very regular flow, high computational demand:

  • Image capture
  • Spatial filtering
  • Edge/Motion detection

Less regular flow, lower demand:

  • Image segmentation
  • Object labeling
  • Feature extraction

Irregular flow, moderate demand:

  • Decision making
  • Algorithm control
  • Conditional jumps

SLIDE 4

Power and time‐critical applications

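INPUT (image) → $N_{op}$ operations → OUTPUT (image, signal, flag, etc.)

Energy:

$$E_{tot} = N_{op}\, e_0$$

Time:

$$T_{tot} = \frac{N_{op}\, t_0}{N_{proc}}$$

Power‐aware applications:

$$\text{Power} = \frac{E_{tot}}{T_{tot}} = \frac{N_{proc}\, e_0}{t_0}$$

Time‐critical applications:

$$\text{Speed} = \frac{1}{T_{tot}} = \frac{N_{proc}}{N_{op}\, t_0}$$

Power‐speed trade‐off:

$$\text{FOM} = \frac{\text{Speed}}{\text{Power}} = \frac{1}{E_{tot}} = \frac{1}{N_{op}\, e_0}$$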
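The relations on this slide (total energy as the number of operations times the energy per operation, total time as the number of operations times the time per operation divided by the number of processors) can be checked numerically. The sketch below is only an illustration: the class name `Workload`, its field names and all numeric values are assumptions, not figures from the talk.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """Hypothetical container for the quantities used on this slide."""
    n_op: float   # number of operations needed to turn the input into the output
    e_0: float    # energy per operation, in joules (assumed value)
    t_0: float    # time per operation, in seconds (assumed value)
    n_proc: int   # number of processors working in parallel

    def energy(self) -> float:
        # E_tot = N_op * e_0
        return self.n_op * self.e_0

    def time(self) -> float:
        # T_tot = N_op * t_0 / N_proc
        return self.n_op * self.t_0 / self.n_proc

    def power(self) -> float:
        # Power = E_tot / T_tot
        return self.energy() / self.time()

    def speed(self) -> float:
        # Speed = 1 / T_tot
        return 1.0 / self.time()

    def fom(self) -> float:
        # FOM = Speed / Power, which reduces to 1 / E_tot
        return self.speed() / self.power()


# Same task on one processor and on a 1024-processor array (illustrative numbers).
single = Workload(n_op=1e6, e_0=1e-9, t_0=1e-8, n_proc=1)
array = Workload(n_op=1e6, e_0=1e-9, t_0=1e-8, n_proc=1024)

for name, w in (("single", single), ("array", array)):
    print(f"{name}: E={w.energy():.3e} J  T={w.time():.3e} s  "
          f"P={w.power():.3e} W  FOM={w.fom():.3e} 1/J")
```

Under these assumptions, parallelism shortens the total time and raises the power by the same factor, while the total energy and the figure of merit stay the same. Lowering the energy therefore requires fewer operations or less energy per operation.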

 

SLIDE 5

Strategies for Etot minimization

Minimization of Nop (number of operations)

  • Simplified image
  • Hierarchical processing
  • Sparse representation
  • Compressed sensing

Minimization of e0 (energy per operation)

SLIDE 6

Hierarchical processing and data reduction

[Anafocus 2010]

SLIDE 7

Feature based processing

  • Edge extraction
  • Multiresolution and foveation
  • Energy‐based representation and saliency
  • Gaussian pyramid and scale‐space

[Fernández‐Berni et al. 2011]

SLIDE 8

Strategies for Etot minimization

Minimization of Nop (number of operations)

  • Simplified image
  • Hierarchical processing
  • Sparse representation
  • Compressed sensing

Minimization of e0 (energy per operation)

@ system level:

  • Distributed processing
  • Distributed memory
  • Distributed ADC

SLIDE 9

Processor/memory performance gap

[Hennessy & Patterson 2006]

Performance is measured as the number of instructions per second (IPS) relative to the 1980 IPS for processors, and as the inverse of the access time relative to the 1980 access time for memories.

SLIDE 10

Multicore architectures

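[Figure: bar chart comparing normalized power consumption and normalized computing power (axis from 0.0 to 2.0) for clock frequencies Fclk, 2*Fclk and Fclk/2; bar values in extraction order: 1.0, 1.0, 2.0, 1.5, 0.5, 1.0, 0.67, 1.34]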

SLIDE 11

Strategies for Etot minimization

Minimization of Nop (number of operations)

  • Simplified image
  • Hierarchical processing
  • Sparse representation
  • Compressed sensing

Minimization of e0 (energy per operation)

@ system level:

  • Distributed processing
  • Distributed memory
  • Distributed ADC

@ PE level:

  • Power efficient circuits
  • High signal/bias ratio
  • Complex dynamics

SLIDE 12

CNN model for retinal signal processing

OPL and IPL: 4 types of interaction

  • Photosensors gain control
  • Bipolar cells gain control
  • Inhibition
  • Propagation of activity patterns

[Roska & Werblin 2001]

[Figure: CNN block diagram with Layer 1, Layer 2 and Layer 3, showing inputs (u1, u2), biases (z1, z2), outputs (y1, y2) and the feedback connections within and between layers]

CNN models for the OPL and IPL

[Rekeczky, Balya et al. 2000]

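$$\tau_k \frac{dx_{k,ij}}{dt} = \underbrace{-g[x_{k,ij}(t)]}_{\text{losses}} + \sum_{l=1}^{3} \sum_{m=-1}^{1} \sum_{n=-1}^{1} \Big[ \underbrace{a_{kl,mn}\, y_{l,(i+m)(j+n)}}_{\text{feedback}} + \underbrace{b_{kl,mn}\, u_{l,(i+m)(j+n)}}_{\text{feedforward}} \Big] + \underbrace{z_{k,ij}}_{\text{bias}}$$

The sums run over the layers $l$ and over the neighborhood $N_1(i,j)$ of cell $C(i,j)$.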

  • Non‐linear dynamic processors
  • Local interactions by means of continuous signals (in amplitude and time)
  • Interconnection pattern (cloning template) = analog program

[Chua & Yang 1988]
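The idea that the cloning template acts as an analog program can be illustrated with a small numerical experiment. The sketch below is a simplification of the model on this slide, not the authors' implementation: it uses a single layer, assumes a linear loss term g(x) = x, uses the standard piecewise‐linear CNN output function, integrates with forward Euler, and assumes zero boundary conditions. The function names and the template values are hypothetical and chosen only for illustration.

```python
import numpy as np


def neighborhood_sum(template: np.ndarray, field: np.ndarray) -> np.ndarray:
    """Weighted sum over the 3x3 neighborhood of every cell (zero boundary)."""
    padded = np.pad(field, 1)
    rows, cols = field.shape
    total = np.zeros(field.shape, dtype=float)
    for m in (-1, 0, 1):
        for n in (-1, 0, 1):
            shifted = padded[1 + m:1 + m + rows, 1 + n:1 + n + cols]
            total += template[m + 1, n + 1] * shifted
    return total


def output(x: np.ndarray) -> np.ndarray:
    """Standard piecewise-linear CNN output, saturating at -1 and +1."""
    return 0.5 * (np.abs(x + 1.0) - np.abs(x - 1.0))


def cnn_run(u, a, b, z, tau=1.0, dt=0.05, steps=400):
    """Forward-Euler integration of tau * dx/dt = -x + A*y + B*u + z."""
    x = np.zeros(u.shape, dtype=float)
    feedforward = neighborhood_sum(b, u) + z   # constant while the input is held
    for _ in range(steps):
        y = output(x)
        x = x + dt * (-x + neighborhood_sum(a, y) + feedforward) / tau
    return output(x)


# Illustrative templates (assumed values): self-feedback plus a Laplacian-like
# feedforward template, which marks the border of a bright object.
A = np.array([[0.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 0.0]])
B = np.array([[-1.0, -1.0, -1.0], [-1.0, 8.0, -1.0], [-1.0, -1.0, -1.0]])
Z = -1.0

u = -np.ones((8, 8))       # background at -1
u[2:6, 2:6] = 1.0          # 4x4 object at +1
edges = cnn_run(u, A, B, Z)
print(edges.astype(int))
```

Changing only the values in `A`, `B` and `Z` changes the operation performed by the same array of cells, which is the sense in which the template is the program.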

SLIDE 13

2‐CNN‐layer (in‐plane) chip

[Figure: chip floorplan, 9.27mm x 8.45mm: PE array with boundary conditions, I/O mux‐demux, digital buffers, weight buffers, I/O control, Prg. Ct., program memory, weights and references memory, test]

Analog Parallel Array Processor with 1024 PE’s

  • 0.5um standard CMOS
  • 2 CNN layers of 32 x 32 nodes
  • Programmable time constant ratio
  • Local logic unit and local memories
  • 24 programmable weights

[Carmona et al. 2002]

Realizes a set of coupled reaction‐diffusion equations

  • Wave phenomena in active media
  • Pattern generation
  • Retinal dynamics emulation

[Petrás et al. 2003]

[Equation: coupled reaction‐diffusion system giving the time derivative of each variable at (x, y, t) as the sum of a reaction term and a diffusion term, with coupling between variables i and j]

SLIDE 14

Performance chart

Chip | Tec. | Description | Res. | Clk (MHz) | PE’s/mm2 | OPS/mm2 | OPS/mW

CPUs + GPUs
A [Intel 2008] | 45nm | Atom Single‐core | 64b | 1730 | 0.038 | 0.125G | 1.32M
B [Intel 2010] | 45nm | Atom dual‐core | 64b | 1300 | 0.023 | 0.160G | 1.64M
C [Nvidia 2010] | 40nm | Tegra (2ARM9+8CPU) | 32b | 1000 | 0.204 | 0.047G | 4.60M

Digital SIMD
D [Raab 2003] | 350nm | Parallel array 16 PEs | 32b | 100 | 0.080 | 0.104G | 6.60M
E [Komuro 2004] | 500nm | SIMD 64 x 64 PEs | 1b | 10 | 140 | 1.40G | 365M
F [Abbo 2007] | 180nm | Xetal‐II Het. Multicore 320 PEs | 16b | 84 | 4.32 | 1.45G | 178.3M
G [Miao 2008] | 180nm | SIMD 16 x 16 PEs | 4b | 300 | 833.3 | 0.094G | 24.4M
H [Zhang 2011] | 180nm | Multi‐level SIMD 32+32 x 128 | 8b | 100 | 317.5 | 3.4G | 97.8M

Focal‐plane processors
J [Carmona 2003] | 500nm | RD CNN 2 x 32 x 32 cells | 8b | 10 | 58.4 | 0.963G | 250M
K [Liñán 2004] | 350nm | Parallel array 128 x 128 cells | 8b | 100 | 180 | 3.20G | 82.5M
L [Dudek 2004] | 350nm | Current mode SIMD 39 x 48 PEs | 6b | 2.5 | 410 | 0.513G | 104M
M [Gottardi 2009] | 350nm | Parallel array 128 x 64 cells | 8b | 80 | 409.6 | 2.8G | 4G
N [Lopich 2010] | 350nm | Cellular Proc. 19 x 22 cells | 8b | 75 | 85.5 | 0.25G | 38M
P [Lee 2011] | 130nm | Digital CNN 80 x 60 + 120 PEs | 8b | 200 | 1093 | 5.33G | 285.7M

[Figure: two scatter plots of chips A to P versus year (2003 to 2011), one for OPS/mm2 and one for OPS/mW, both with decade tick marks]
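The OPS/mW column can also be read as an energy per operation, which ties the chart back to the e0 term of the energy budget. The conversion below is a small sketch. The OPS/mW values are copied from the chart for four of the chips, while the dictionary and function names are hypothetical.

```python
# OPS/mW figures copied from the performance chart for four of the chips.
CHART_OPS_PER_MW = {
    "A [Intel 2008]": 1.32e6,
    "E [Komuro 2004]": 365e6,
    "J [Carmona 2003]": 250e6,
    "M [Gottardi 2009]": 4e9,
}


def energy_per_op_pj(ops_per_mw: float) -> float:
    """Energy per operation in picojoules for a given OPS/mW figure."""
    ops_per_joule = ops_per_mw * 1e3      # (operations/s) per W equals operations per J
    return 1e12 / ops_per_joule           # J per operation, expressed in pJ


for chip, value in CHART_OPS_PER_MW.items():
    print(f"{chip}: {energy_per_op_pj(value):.2f} pJ per operation")
```

For example, 250 MOPS/mW corresponds to 4 pJ per operation.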

SLIDE 15

Smart CIS based on FPP @IMSE

Major achievements

  • Fully programmable features
  • Large variety of functional targets
  • Image‐to‐Decision at >1,000fps using 60nW per pixel
  • Spatio‐temporal filtering @22nJ/cycle
  • Content‐aware HDR acquisition with >145dB intra‐frame DR

Major drawbacks

  • Reduced fill factor
  • Large pixel pitch

→ Small image size → Limited resolution → Sensitivity vs. resolution trade‐off

SLIDE 16

Multilayer hierarchical vision architecture

SLIDE 17

3D integration for CMOS image sensors

[OmniVision 2010] [Sony 2012]

SLIDE 18

1st attempt: bump‐bonded sensor layer

[Figure: cross‐section of the bump‐bonded stack, labeled: light, InGaAs or Si sensor layer, indium bumps, passivation opening, top metal, routing layers, bond wires, Si substrate]

Xenon‐NC V1

  • Technology: 0.18um UMC
  • Die size: 5x5 mm2
  • # Pixels: 8x8
  • Pixel pitch: 125um
  • # PEs: 8
  • Int. word length: 24b
  • Clock frequency: 80 MHz
  • Local memory: 64 words

[Rekeczky et al. 2007]

  • CMOS compatible
  • High fill factor
  • Custom spectral responsivity

CTIA

  • Zero‐bias detection
  • Sense capacitor matching

Full‐custom ROIC

  • Reconfigurable gain
  • Adaptive sensing

SLIDE 19

2nd attempt: VISCUBE 3D IC stack

[Logos of the project partners and the funding agency]

SLIDE 20

3D‐IC fabrication process

  • Dedicated sensor layer
  • Distributed analog & digital circuitry at pixel and/or sub‐frame level

MIT Lincoln Labs 0.18um FDSOI CMOS process (funded by DARPA)

SLIDE 21

Tier 3: sensor interface and feature extraction

Analog front‐end

  • Capacitive transimpedance amplifier
  • Multiplexed sensor interface
  • Full frame refresh at 1kfps

Focal‐plane processing

  • A/D conversion of 320x240 raw image
  • Binning of the 4 sensors
  • Filtering at 2 user‐selected scales
  • Subtraction of the two scales
  • Local maxima and minima detection
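The focal‐plane processing chain listed above (binning, filtering at two scales, subtraction, extrema detection) can be sketched in software. This is only an illustration under stated assumptions, not the circuit implemented in the tier: binning is taken as the average of each 2x2 group of sensors, the filters are assumed to be Gaussian, and the scale values, function names and test image are hypothetical.

```python
import numpy as np


def bin2x2(image: np.ndarray) -> np.ndarray:
    """Average each 2x2 group of sensors into one value (320x240 -> 160x120)."""
    rows, cols = image.shape
    return image.reshape(rows // 2, 2, cols // 2, 2).mean(axis=(1, 3))


def gaussian_blur(image: np.ndarray, sigma: float) -> np.ndarray:
    """Separable Gaussian filter with edge replication at the borders."""
    radius = max(1, int(3 * sigma))
    taps = np.arange(-radius, radius + 1)
    kernel = np.exp(-(taps ** 2) / (2.0 * sigma ** 2))
    kernel /= kernel.sum()
    padded = np.pad(image, radius, mode="edge")
    tmp = np.apply_along_axis(lambda r: np.convolve(r, kernel, mode="valid"), 1, padded)
    return np.apply_along_axis(lambda c: np.convolve(c, kernel, mode="valid"), 0, tmp)


def local_extrema(image: np.ndarray):
    """Boolean masks of strict 3x3 local maxima and minima (border excluded)."""
    rows, cols = image.shape
    center = image[1:-1, 1:-1]
    is_max = np.ones(center.shape, dtype=bool)
    is_min = np.ones(center.shape, dtype=bool)
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            if dr == 0 and dc == 0:
                continue
            neighbor = image[1 + dr:rows - 1 + dr, 1 + dc:cols - 1 + dc]
            is_max &= center > neighbor
            is_min &= center < neighbor
    maxima = np.zeros(image.shape, dtype=bool)
    minima = np.zeros(image.shape, dtype=bool)
    maxima[1:-1, 1:-1] = is_max
    minima[1:-1, 1:-1] = is_min
    return maxima, minima


def extract_features(raw: np.ndarray, sigma_fine: float = 1.0, sigma_coarse: float = 2.0):
    """Binning, filtering at two scales, subtraction and extrema detection."""
    binned = bin2x2(raw.astype(float))
    difference = gaussian_blur(binned, sigma_fine) - gaussian_blur(binned, sigma_coarse)
    maxima, minima = local_extrema(difference)
    return difference, maxima, minima


# Synthetic 320x240 frame (240 rows, 320 columns) with one bright 2x2 spot.
raw = np.zeros((240, 320))
raw[100:102, 200:202] = 255.0
difference, maxima, minima = extract_features(raw)
print(difference.shape, np.argwhere(maxima))
```

The two `sigma` arguments stand in for the two user‐selected scales. The subtraction of the two filtered images is a difference of Gaussians, and its extrema are candidate feature points.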

[Layout dimensions: 8.5mm x 7.0mm]

SLIDE 22

Tier 2: distributed image memory

Image frame buffer

  • Pitch‐aligned with MS‐layer (160*120 TSVs)
  • Dual‐port 160*120*(6x8b + 2x1b) SRAM
  • 8b parallel I/O
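The capacity implied by the frame buffer organization follows from simple arithmetic. The sketch below reads the expression 160*120*(6x8b + 2x1b) as six 8‐bit words plus two 1‐bit flags per memory position, which is an interpretation rather than something the slide states. The variable names are hypothetical.

```python
POSITIONS = 160 * 120              # one memory position per TSV of the layer above
BITS_PER_POSITION = 6 * 8 + 2 * 1  # read here as six 8-bit words plus two 1-bit flags

total_bits = POSITIONS * BITS_PER_POSITION
total_bytes = total_bits // 8

print(f"{POSITIONS} positions x {BITS_PER_POSITION} b = {total_bits} b "
      f"= {total_bytes} bytes ({total_bytes / 1024:.1f} KiB)")
```

Under this reading the tier holds 960,000 bits, or 120,000 bytes, of SRAM.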

[Layout dimensions: 8.0mm x 6.5mm]

SLIDE 23

3rd attempt: Tezzaron’s 3D IC stack

  • 2 LP 0.13um CMOS chips + 3 DRAM chips
  • TSVs for connecting both sides of the wafer
  • CMOS‐to‐CMOS connected by thermocompression; minimum pitch around 2.50um (1.20um width + 1.30um spacing)
  • CMOS‐to‐DRAM connected by microbumps (pitch 25um)

[Figure: stack cross‐section with DRAM and controller layer; dimension labels: 5mm, 5 or 2.5mm, 12.3mm, 21.8mm, 750 or 12µm, 12µm, 765 or 780µm (2 or 3 memory layers); note: “Can be further back lapped”]

SLIDE 24

3rd attempt: Tezzaron’s 3D IC stack

Tier2 (WBOTTOM)

  • 800 x 640 px.
  • BSI sensors
  • Global shutter
  • In‐pixel CDS
  • Raw image ADC to DRAM

Tier1 (WTOP)

  • Smaller image size
  • Bidirectional access to memory (foveation)
  • Anisotropic diffusion
  • Gaussian and Laplacian pyramids
  • Min/Max detection
  • Operation control?
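Of the Tier1 functions, the Gaussian and Laplacian pyramids are easy to illustrate in software. The sketch below is a simplified stand‐in, not the planned hardware: it assumes a 5‐tap binomial kernel as the Gaussian approximation and nearest‐neighbor expansion followed by smoothing for upsampling. The function names, the number of levels and the reduced‐size random test image are hypothetical.

```python
import numpy as np

# 5-tap binomial kernel, a common approximation of a Gaussian (assumed here).
KERNEL = np.array([1.0, 4.0, 6.0, 4.0, 1.0]) / 16.0


def smooth(image: np.ndarray) -> np.ndarray:
    """Separable low-pass filter with edge replication."""
    padded = np.pad(image, 2, mode="edge")
    tmp = np.apply_along_axis(lambda r: np.convolve(r, KERNEL, mode="valid"), 1, padded)
    return np.apply_along_axis(lambda c: np.convolve(c, KERNEL, mode="valid"), 0, tmp)


def downsample(image: np.ndarray) -> np.ndarray:
    """Smooth, then keep every second row and column."""
    return smooth(image)[::2, ::2]


def upsample(image: np.ndarray, shape) -> np.ndarray:
    """Nearest-neighbor expansion cropped to `shape`, followed by smoothing."""
    expanded = np.repeat(np.repeat(image, 2, axis=0), 2, axis=1)
    return smooth(expanded[:shape[0], :shape[1]])


def gaussian_pyramid(image: np.ndarray, levels: int):
    pyramid = [image.astype(float)]
    for _ in range(levels - 1):
        pyramid.append(downsample(pyramid[-1]))
    return pyramid


def laplacian_pyramid(image: np.ndarray, levels: int):
    """Band-pass levels plus the coarsest Gaussian level as the last entry."""
    gauss = gaussian_pyramid(image, levels)
    bands = [fine - upsample(coarse, fine.shape) for fine, coarse in zip(gauss, gauss[1:])]
    bands.append(gauss[-1])
    return gauss, bands


def reconstruct(bands):
    """Invert the Laplacian pyramid by adding the bands back, coarse to fine."""
    image = bands[-1]
    for band in reversed(bands[:-1]):
        image = band + upsample(image, band.shape)
    return image


# Reduced-size random test image standing in for a sensor frame.
frame = np.random.default_rng(0).random((64, 80))
gauss, bands = laplacian_pyramid(frame, levels=4)
print([level.shape for level in gauss])
```

Each Gaussian level is a smaller image, in line with the smaller image size handled in this tier, and the Laplacian levels keep the detail lost between consecutive scales, so the original frame can be recovered with `reconstruct(bands)`.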

SLIDE 25

Conclusions

  • Conventional data processing architectures introduce data bottlenecks and are inefficient when dealing with multidimensional sensory signals
  • Architectures adapted to the nature of the stimulus are more efficient in terms of power consumption per operation
  • Concurrent sensing, processing and memory in planar technologies introduce serious limitations to image resolution and image size via the penalties in fill factor and pixel pitch
  • 3D integrated circuit technologies with a dense TSV distribution permit eliminating data bottlenecks without degrading image resolution and size

SLIDE 26

Acknowledgments

This work is financially supported by the Andalusian Regional Government, through project 2006‐TIC‐2352; by the Spanish Ministry of Economy and Competitiveness, through projects TEC 2009‐11812 and IPT‐2011‐1625‐430000, both co‐funded by the EU‐ERDF; and by the Office of Naval Research (USA), through grant N000141110312.