
T-106.5800 Seminar on Software Techniques
Seminar on Multicore Programming

Multicore Technology in Mobile Devices

Antti P Miettinen
antti.p.miettinen@nokia.com
February 12, 2009

Abstract

Multicore design is ubiquitous among mobile hand-held devices. A general-purpose processor coupled with a digital signal processor has been the configuration for even the most basic mobile phones. Energy efficiency concerns have steered the designs towards increasing integration and heterogeneous setups. Typical components in a mobile application processor are ARM and DSP cores, various hardware acceleration blocks and a set of memory and peripheral interfaces.

Limited energy and power have been critical constraints for mobile device design, and trends point towards these challenges becoming continually more demanding. Increasing parallelism inside the different subsystems is one way to achieve better energy efficiency. Even though heterogeneity is likely to persist in the overall structure of mobile devices, symmetrically parallel subsystems are also likely to be employed in the future.

1 Introduction

Energy efficiency is a central theme in the design of mobile hand-held devices. The increasing use of always-online applications, multimedia, high-speed wireless networking, large displays, etc. is making the challenge continually more demanding. Software trends towards increasing use of, e.g., web applications and dynamic programming languages are also making the optimization of energy efficiency ever more important. Additionally, even if the available energy were not a limiting factor, the power budget of roughly three watts for a hand-held device remains a valid rule because of thermal concerns [1].

Figure 1: Clock speed per power for a collection of ARM processor cores.

Even though parallel hardware is often viewed as a challenge, it is also an opportunity for mobile devices, because parallel processing is more energy efficient than sequential designs of comparable performance. Figure 1 shows an overview of data collected from ARM's public web pages about various ARM processor cores, with polynomial extrapolation curves for cores with three data points. As can be seen, the clock speed achievable within a given power budget is at least an order of magnitude higher for small low-performance cores than for bigger high-performance cores. Even though larger cores have the potential to perform more work within one cycle, this advantage is often diluted by the fact that the performance of modern software tends to be limited by memory effects.
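The energy argument for parallel hardware can be illustrated with the textbook approximation for CMOS dynamic power, P = C * V^2 * f. The following Python sketch is not derived from the data behind Figure 1. It assumes, as a deliberate simplification, that work splits perfectly across cores and that supply voltage can be lowered linearly with clock frequency; the function names and the unit constants are hypothetical.

```python
def dynamic_power(capacitance, voltage, frequency):
    """Textbook CMOS dynamic power approximation: P = C * V^2 * f."""
    return capacitance * voltage ** 2 * frequency


def relative_power(num_cores, total_throughput=1.0):
    """Power of num_cores cores sharing total_throughput, in relative units.

    Assumptions (illustrative only): work splits perfectly across cores,
    throughput per core is proportional to clock frequency, and supply
    voltage scales down linearly with frequency.
    """
    frequency = total_throughput / num_cores  # each core can run slower
    voltage = frequency                       # assumed linear V-f scaling
    per_core = dynamic_power(1.0, voltage, frequency)
    return num_cores * per_core


if __name__ == "__main__":
    for n in (1, 2, 4):
        print(f"{n} core(s): relative power {relative_power(n):.4f}")
```

Under these idealized assumptions the power needed for a fixed throughput falls with the square of the core count. Real gains are smaller, since static leakage, a minimum operating voltage and imperfect parallelization all work against the model.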

2 Anatomy of a mobile device

Open discussion about the design and implementation of mobile devices is challenging because of the traditionally closed nature of, e.g., mobile phones. Fortunately, the public Internet contains quite interesting information about many mobile devices, and the increasing use of open source operating systems allows deducing hardware features from kernel device driver configuration.

As an example of a contemporary mobile hand-held device, we can take a look at the components of the Nokia N95 as described in [2]. The overall structure of the device is a dual-chip design where the two main processing engines are the application processor (Texas Instruments OMAP2420) and the cellular modem. Connected to these main components are NOR and NAND flash memories, DRAM memories, energy and power management chips and different peripherals, e.g., camera modules, Bluetooth, accelerometer, WiFi, audio, infrared, display, USB and memory card interfaces.

This kind of design is typical for many versatile mobile handsets. The two major subsystems requiring high processing performance are the application subsystem and the cellular modem, and this is often reflected by a dual-chip design. Increasing integration has also enabled single-chip designs, where the cellular modem and the application subsystem are deployed within the same physical hardware. Single-chip designs are common especially for the highest-volume devices, where cost optimization is the overriding design concern.
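Deducing hardware features from a kernel configuration, as mentioned at the start of this section, can be sketched in a few lines of Python. The sketch assumes a Linux-style .config file, in which enabled options appear as NAME=y or NAME=m and disabled ones as comments. The sample excerpt and the option-to-hardware mapping are hypothetical illustrations and do not describe the N95 or any other real device.

```python
def enabled_options(config_text):
    """Return the set of options set to 'y' or 'm' in a kernel .config."""
    enabled = set()
    for line in config_text.splitlines():
        line = line.strip()
        # Disabled options appear as comments: "# CONFIG_FOO is not set".
        if not line or line.startswith("#"):
            continue
        name, _, value = line.partition("=")
        if value in ("y", "m"):
            enabled.add(name)
    return enabled


# Hypothetical excerpt, not copied from any real device's kernel.
SAMPLE_CONFIG = """\
CONFIG_ARCH_OMAP2420=y
CONFIG_MTD_NAND=y
# CONFIG_USB_GADGET is not set
CONFIG_BT=m
CONFIG_CMDLINE="console=ttyS0"
"""

# Hypothetical mapping from option names to the hardware they suggest.
HARDWARE_HINTS = {
    "CONFIG_ARCH_OMAP2420": "TI OMAP2420 application processor",
    "CONFIG_MTD_NAND": "NAND flash",
    "CONFIG_BT": "Bluetooth",
    "CONFIG_USB_GADGET": "USB device controller",
}


def deduce_hardware(config_text):
    """List hardware hints for every enabled option that has a known hint."""
    options = enabled_options(config_text)
    return sorted(HARDWARE_HINTS[o] for o in options if o in HARDWARE_HINTS)


if __name__ == "__main__":
    for hint in deduce_hardware(SAMPLE_CONFIG):
        print(hint)
```

An enabled driver only suggests that the corresponding hardware is present, so results of this kind are best treated as hints to be cross-checked against other sources.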

3 Design of mobile application processors

As is the case with device design, discussion about mobile application processor details is hampered by the closed nature of the industry. However, Texas Instruments, for example, occasionally provides two variants of its OMAP processors. The OMAP3530 [3] is a catalog part available to anyone and has public documentation, while appearing functionally quite similar to the OMAP3430, which is targeted at mobile hand-held devices and is available only to high-volume customers.

The structure of the OMAP3530 processor is quite representative of the overall design of a modern mobile application processor. The main application core in the OMAP3530 is an ARM Cortex-A8 processor with 16 KB first-level instruction and data caches and a 256 KB unified second-level cache. The ARM core is connected to the level 3 (L3) interconnect together with a quite extensive set of other subsystems. For imaging, video and audio processing, the OMAP3530 contains a TMS320C64x+ digital signal processor. For 2D and 3D graphics, a PowerVR SGX mobile graphics processing unit is provided. Dedicated interface blocks are provided for cameras, displays and USB. For connecting memories, an SDRAM controller and a general-purpose memory controller are included. For peripherals with more modest throughput and latency requirements, there is a level 4 (L4) interconnect with, e.g., UARTs, general-purpose I/O interfaces, timers and memory card interfaces.

While the main application core of a mobile application processor is more or less always based on the ARM architecture, there is considerable variation in the DSPs employed by different vendors. TI has its own TMS320 family of DSPs, whereas Freescale uses its StarCore DSPs, STMicroelectronics has its MMDSP family, etc. (see, e.g., [4] for an overview of DSP vendors). DSP cores also often appear inside imaging, video and audio subsystems, coupled with hardware accelerator blocks.

Many subsystems within a mobile application processor use commercial intellectual property (IP) blocks. The PowerVR MBX and SGX are good examples of popular mobile GPUs. The use of commercial IP blocks also applies to, e.g., the interconnects employed. For example, the L3 and L4 interconnects inside the OMAP3530 are instantiations of interconnects from Sonics, Inc. For differentiation purposes, vendors also include their own IP blocks in their designs.
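The two-level structure described in this section can be summarized as a small data model. The Python sketch below only restates the prose. Attaching every high-throughput block to the L3 interconnect is this sketch's reading of the text rather than a statement taken from the TI documentation, and the block names and the interconnect_of helper are hypothetical.

```python
# Simplified model of the OMAP3530 structure described in the text.
# Attaching every high-throughput block to L3 is this sketch's reading
# of the prose, not a statement taken from the TI documentation.
OMAP3530_BLOCKS = {
    "L3": [
        "ARM Cortex-A8",
        "DSP",
        "PowerVR SGX GPU",
        "camera interface",
        "display interface",
        "USB interface",
        "SDRAM controller",
        "general purpose memory controller",
    ],
    "L4": [
        "UART",
        "general purpose I/O",
        "timer",
        "memory card interface",
    ],
}


def interconnect_of(block, soc=OMAP3530_BLOCKS):
    """Return the interconnect level a block hangs off, or None if unknown."""
    for level, blocks in soc.items():
        if block in blocks:
            return level
    return None


if __name__ == "__main__":
    for name in ("PowerVR SGX GPU", "UART"):
        print(name, "->", interconnect_of(name))
```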

4 The ARM architecture

ARM Ltd is a fabless semiconductor company, i.e., it is not a chip manufacturer. Instead, it provides intellectual property products for vendors designing and manufacturing the actual integrated circuits. Many IC vendors license, e.g., the ARM cores with an implementation license, where the complete processor implementation information is provided to the licensee. Some vendors have chosen to develop their own implementation based on an architecture license (e.g., the DEC/Intel/Marvell StrongARM and XScale processors).

The ARM processors are characterized by various versioning and categorization schemes. The basic processor architecture is defined by the ARM architecture version. In addition to the architecture version, there are various instruction set extensions and also a numbering scheme for processor families. In the latest processor versions, ARM abandoned the numbering scheme and introduced the Cortex processor family, with different series indicating the target segment for the series.

The oldest architecture version still supported is ARMv4, which is implemented by members of the ARM7 processor family (e.g., ARM7TDMI). The ARMv5 architecture is probably the most widely deployed version at present, being employed by many members of the ARM9 processor family (e.g., ARM926EJ-S). The ARMv6 architecture is employed by the ARM11 processor family, which contains the first ARM multicore design, the ARM11 MPCore. The latest announced architecture version is currently ARMv7, which is employed by the Cortex family of processors. The Cortex-A series is the ARM high-end category targeted at application processors and can be considered the evolution of the ARM11 processor family. The Cortex-R series targets embedded real-time systems, where members of the ARM9 processor family have traditionally been used. The Cortex-M series is the low end of ARM processors, targeting deeply embedded cost-sensitive devices, i.e., traditional ARM7 targets.

The ARM architecture defines the basic 32-bit ARM instruction set and a 16-bit Thumb instruction set. Thumb-2, first introduced in ARMv6T2 and a standard part of ARMv7, is a variable-length instruction set allowing both 16-bit and 32-bit instructions. Defined ISA extensions include Jazelle for Java acceleration, several versions of a vector floating point co-processor option, the TrustZone security extension and the NEON SIMD extension.

The different optimization targets of the different ARM processor categories are reflected in the overall design of the processor cores. The lowest-end ARM7 and Cortex-M cores employ a simple three-stage pipeline, single-issue, in-order design, whereas the Cortex-A9 is a dual-issue, out-of-order, speculative superscalar design (see Table 1 for other examples).

Core        Pipeline stages   Issue rate   Instruction scheduling
ARM7TDMI    3                 1            in-order
ARM926      5                 1            in-order
ARM1136     8                 1            in-order
Cortex-A8   13                2            in-order
Cortex-A9   8                 2            out-of-order

Table 1: Design characteristics of different ARM cores.

The SMP-capable cores in the current ARM architecture are the ARM11 MPCore and the Cortex-A9 MPCore. They both feature private, split level-one caches and a snoop control unit for maintaining coherence towards the level-two memory system. The Cortex-A9 also has an accelerator coherence port, which allows connecting peripherals directly to the processor cache hierarchy.
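The variable-length nature of Thumb-2 can be made concrete with a small decoder sketch in Python. It relies on the encoding rule from the ARM architecture documentation: a halfword whose five most significant bits are 0b11101, 0b11110 or 0b11111 is the first half of a 32-bit instruction, and any other halfword is a complete 16-bit instruction. The function names are hypothetical, and the sketch only determines instruction boundaries; it does not decode what the instructions do.

```python
def thumb_instruction_length(halfword):
    """Return the instruction length in bytes, given its first halfword.

    A halfword whose top five bits are 0b11101, 0b11110 or 0b11111
    starts a 32-bit instruction; anything else is a 16-bit instruction.
    """
    top5 = (halfword >> 11) & 0x1F
    return 4 if top5 in (0b11101, 0b11110, 0b11111) else 2


def split_instructions(halfwords):
    """Group a list of 16-bit halfwords into per-instruction tuples."""
    instructions = []
    i = 0
    while i < len(halfwords):
        if thumb_instruction_length(halfwords[i]) == 4:
            if i + 1 >= len(halfwords):
                raise ValueError("truncated 32-bit instruction")
            instructions.append((halfwords[i], halfwords[i + 1]))
            i += 2
        else:
            instructions.append((halfwords[i],))
            i += 1
    return instructions


if __name__ == "__main__":
    # 0x4770 (BX LR) and 0xBF00 (NOP) are 16-bit encodings;
    # 0xF000 0xF800 is a 32-bit BL encoding with a zero offset.
    stream = [0x4770, 0xF000, 0xF800, 0xBF00]
    for instruction in split_instructions(stream):
        print(" ".join(f"{h:04x}" for h in instruction))
```

Because the length is decided by the first halfword alone, a decoder can walk a mixed stream without lookahead, which is what lets 16-bit and 32-bit instructions be interleaved freely.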

5 Programmer's view

Even though mobile devices employ a large number of programmable processor cores, there is no unified programmer's view of these cores. The different subsystems are encapsulated behind various interfaces, and the only subsystem available for software development is typically the main ARM processor. In fact, some subsystems are protected by hardware and strong cryptography to prevent end users from tampering with, e.g., cellular functionality and DRM.

The software for the special-purpose subsystems is typically provided by the chip vendor in the form of binary firmware, and the functionality is made available through an application programming interface inside the application subsystem. The interfaces for accessing, e.g., DSP functionality are often vendor specific. Fortunately, there are also various well-established APIs, e.g., the Khronos standards for accessing multimedia functionality. However, a standard API alone does not facilitate, e.g., reasoning about the performance characteristics of software that uses functionality provided by the black-box subsystems.

A distinctive characteristic of mobile software development environments is their cross-development setup. As, e.g., the input and output capabilities of physically small devices are not optimal for software development, native development is a rare practice. The software development kits employ simulators for testing and debugging, while the software is developed on a separate host system.

The use of simulators, coupled with the severely closed nature of many subsystems, causes a major observability challenge for mobile software development. This makes managing the performance of mobile software very challenging, and the issue is likely to be aggravated in the future by increasing parallelism. Hardware and software instrumentation, i.e., trace ports, performance counters, and tracing and profiling tools, is vital for optimizing the performance and energy consumption of mobile software. As an example, the Embedded Trace Macrocell [5], which is often included in ARM processor cores, allows non-intrusive tracing of the complete execution flow of the processor core. However, few tools exist for addressing nonfunctional requirements inside simulators.

6 Discussion

Mobile hand-held devices are high-volume products, so it is no wonder that cost is one of the most important criteria in the design of the devices. At the same time, however, the designs are constrained by energy and power limits. The third powerful force in the design equation is the never-ending quest for higher performance. These goals are often in conflict with each other. The high cost of chip design pushes towards general-purpose solutions, while performance and power concerns favor tailored special-purpose solutions. Parallel architectures are especially attractive in this environment, as they have the potential to provide high efficiency coupled with programmability.

Increasing parallelism is definitely a trend for mobile devices, just as it is a trend for desktop and server environments. However, parallelism is a major challenge for programmability and a cause of quality concerns, as well as of increasing effort for performance optimization.

In the future, mobile devices are likely to employ higher levels of integration and tighter coupling between the increasingly parallel subsystems. Interconnects are likely to employ network-on-chip structures where bus-type interconnects have traditionally been used. High-speed serial interfaces between chips and wide interfaces within packages through 3D integration are emerging trends for fighting the pin-count, power wall and memory wall problems.

During the megahertz race it was possible to hide hardware development reasonably well from software. However, the multicore trend has broken this tradition. The transition towards tightly coupled, highly parallel, heterogeneous multicore designs and new interconnect and memory architectures most probably continues the evolution in which software cannot remain completely agnostic about hardware features. On the one hand, abstractions will definitely be required in the future as well for managing the increasing complexity of the systems. On the other hand, the complex performance behavior of the highly evolved hardware will require novel approaches in which development tools and a high level of observability are likely to play a key role.

References

[1] Y. Neuvo, Cellular phones as embedded systems, Digest of Technical Papers, IEEE Int. Solid-State Circuits Conf. (2004) pp. 32–37.

[2] Nokia N95, phoneWreck wiki, http://www.phonewreck.com/wiki/index.php?title=Nokia_N95

[3] OMAP3530 Application Processor, http://focus.ti.com/docs/prod/folders/print/omap3530.html

[4] Digital signal processor/Modern DSPs, Wikipedia, the free encyclopedia, http://en.wikipedia.org/wiki/Digital_signal_processor#Modern_DSPs

[5] Embedded Trace Macrocell, http://www.arm.com/products/solutions/ETM.html