Exploiting On-Chip Memories In Linux Applications, Will Newton



SLIDE 1

Exploiting On-Chip Memories In Linux Applications

Will Newton, Imagination Technologies

SLIDE 2

What's wrong with SDRAM?

[Bar chart: access latency in cycles, L1 cache hit vs. L1 cache miss (around 64 cycles)]

SLIDE 3

64 cycles is optimistic

- RAM clock often slower than core
- SoC fabric and arbiter delays
- SDRAM controller bursting delays
- TLB miss stalls

SLIDE 4

It's not just latency

- Memory bus bandwidth
- Memory bus power consumption
- Bus contention can affect other cores
- Non-deterministic if you're doing RT

SLIDE 5

What solutions are available?

[Block diagram: META core with I cache, D cache, and MMU; core code and core data RAMs plus ROM; memory arbiter and write combiner connecting to SDRAM, internal memory, and peripherals on the system bus]

SLIDE 6

Example META SoC

- Hardware multi-threaded DSP core
- L1 cache: 16k code, 16k data
- Core memory: 64k code, 64k data
- Internal memory: 384k general purpose

SLIDE 7

Example META SoC

[Bar chart: access latency in cycles for L1 cache, core memory, internal memory, and SDRAM]

SLIDE 8

Using core memories

- Ideally we would like usage to be transparent
- Fixed addresses make this difficult

SLIDE 9

Core memory: Executables

- Linker script allows placement of sections
- elf_map overridden in the kernel

#define __section(S) __attribute__((__section__(#S)))
#define __core_text __section(.core_text)
#define __core_data __section(.core_data)

static int __core_data mydata;
int __core_text myfunction(int a);
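The linker script side of this is not shown on the slide; a minimal sketch in GNU ld syntax of how a `.core_text` output section might be placed, where the region name, origin, and length are placeholders rather than the real META addresses:

```ld
MEMORY
{
    /* Hypothetical on-chip code RAM; origin and length are placeholders. */
    CORE_TEXT (rx) : ORIGIN = 0x80000000, LENGTH = 64K
}

SECTIONS
{
    /* Collect everything tagged __core_text into the on-chip region. */
    .core_text : { *(.core_text) } > CORE_TEXT
}
```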

SLIDE 10

Core memory: Shared libraries

- Whole shared object can be placed in core
- Cannot mix core and MMU in one object
- Only useful for small objects

SLIDE 11

Core memory: Dynamic allocation

- System call API to allocate and free
- Can replace specific malloc/free calls
- Allows kernel to reserve areas

SLIDE 12

Core memory: In practice

- Not easy to get big speedups
- Cache manages small, frequently accessed items well
- Beware long branches
- Improved Tremor decode speed by 11%

SLIDE 13

Using internal memory

- Linux supports CPU-less NUMA nodes
- numactl
- set_mempolicy(2)
- mbind(2)

SLIDE 14

Internal memory: numactl

Tool to set NUMA policy of an application

numactl --preferred=1 ls

- Does not build easily with uClibc
- Too coarse-grained for many situations

SLIDE 15

Internal memory: set_mempolicy(2)

Sets the memory policy of the current process

int set_mempolicy(int mode, unsigned long *nodemask, unsigned long maxnode)

- Does not move existing pages
- Memory policy can be set multiple times

SLIDE 16

Internal memory: mbind(2)

Sets the memory policy for an address range

int mbind(void *addr, unsigned long len, int mode, unsigned long *nodemask, unsigned long maxnode, unsigned flags)

- Overrides policy set by set_mempolicy(2)
- Capable of moving pages between nodes

SLIDE 17

Internal memory: In practice

- No nice way to implement malloc_from_node(2)
- Moving pages can be costly; mbind(2) should be used with precision
- Improved Tremor decode speed by 8%

SLIDE 18

Finding hotspots

- Code profiling (gprof, oprofile, perf)
- Emulator
- Cache profiling (oprofile, perf)
- Simulator

SLIDE 19

Where's the code?

Source code for released products: http://www.pure.com/gpl

SLIDE 20

Questions?

will.newton@imgtec.com