Graphics acceleration on Replicant



slide-1
SLIDE 1

Graphics acceleration on Replicant

David Ludovino (@dllud) Ricardo Cabrita (@GrimKriegor)∗

NLnet - NGI0 PET Fund

Saturday 27th July, 2019

∗with great support from Joonas Kylmälä (@Putti)

1 / 37

slide-2
SLIDE 2

Motivation

- All supported devices lack a free software GPU driver.
- Replicant 6 relies on libAGL, which uses the libpixelflinger software renderer (both deprecated since 2013).

2 / 37

slide-3
SLIDE 3

Motivation

- Lack of GLES 2.0 leads some critical applications to crash (e.g. Firefox).
- Rendering performance has degraded throughout Android versions.
- Replicant relies on patches to the Android framework to make things like the camera application work.

3 / 37

slide-4
SLIDE 4

Objectives

Put together a graphics stack that:
- Is compatible with Android 9’s HALs.
- Provides at least GLES 2.0.
- Is flexible enough to do rendering with both Mesa and SwiftShader.
- Uses hardware rendering on devices with a free GPU driver.

4 / 37

slide-5
SLIDE 5

Graphics hardware architecture

5 / 37

slide-6
SLIDE 6

Graphics hardware architecture — Exynos 4412 SoC components

∗Source: Hardkernel Co., Ltd.

6 / 37

slide-7
SLIDE 7

Graphics software architecture

7 / 37

slide-8
SLIDE 8

Graphics software architecture — Android 9

∗Source: Android Open Source Project under CC BY 4.0

8 / 37

slide-9
SLIDE 9

Graphics software architecture — Replicant 9 HWC HAL

Hardware Composer HAL: drm_hwcomposer
- Supports HWC2 HAL.
- Works on top of DRM (can use hardware composing acceleration).
- Under active maintenance (hosted by freedesktop.org).
- Also used by Android-x86.

9 / 37

slide-10
SLIDE 10

Graphics software architecture — Replicant 9 Gralloc HAL

Gralloc HAL: gbm_gralloc
- Implements Android Gralloc HAL API versions 0 and 1.
- Compatible with drm_hwcomposer.
- Compatible with Mesa.
- Uses Mesa’s GBM (Generic Buffer Management) for buffer allocation through libgbm. GBM then calls DRM.
- Supports PRIME fd.
- Originally by Rob Herring, now maintained by Android-x86.

10 / 37

slide-11
SLIDE 11

Graphics software architecture — Replicant 9 GLES

OpenGL ES renderer: Mesa
- Support for both software and hardware rendering.
- Big and active community (maintained for years to come).

Mesa driver: kms_swrast
- Uses any Gallium software renderer as backend (softpipe or llvmpipe).
- Does mode setting through the kernel (KMS).

Alternative GLES renderer: SwiftShader
- Optimized for ARM CPUs.
- Has Vulkan software rendering.

11 / 37

slide-12
SLIDE 12

Implementation

12 / 37

slide-13
SLIDE 13

Implementation — drm_hwcomposer + gbm_gralloc

Initially both required the use of the drm/exynos master node.

1. DRM Auth hack (both on /dev/dri/card0)
2. DRM vGEM inclusion (gbm_gralloc on /dev/dri/card1)
3. DRM allow dumb buffers (gbm_gralloc on /dev/dri/renderD128)

At the time we had some graphical glitches that we thought were due to inter-driver memory sync.
- Running on the same driver does not require memory synchronization.
- It also allows drm/exynos to allocate memory where adequate according to the type of plane (primary, overlay or cursor).

13 / 37

slide-14
SLIDE 14

Implementation — Allow kms_swrast to use drm/exynos

- Small tweak: add exynos to the kms_swrast list on external_mesa3d.
- How to upstream this?

14 / 37

slide-15
SLIDE 15

Implementation — HW planes + devfreq

- We were then using kms_swrast with the softpipe backend.
- Enabling DRM hardware planes was another attempt at squeezing some extra performance out of the hardware.
- However, this led to some interesting shenanigans.

15 / 37

slide-16
SLIDE 16

Implementation — HW planes + devfreq

16 / 37

slide-17
SLIDE 17

Implementation — HW planes + devfreq

Tentative explanation by ahajda:

1. devfreq lowers display clock frequencies too aggressively.
2. DMA transfers of overlays are too slow and result in screen corruption.

Temporary fix: disable devfreq.

17 / 37

slide-18
SLIDE 18

Implementation — llvmpipe

kms_swrast with softpipe was unbearably slow, even with DRM HW planes enabled.

Required:
- Finding out what Android-x86 had previously done.
- Porting it to Android 9.

18 / 37

slide-19
SLIDE 19

Implementation — llvmpipe

android: Enable llvmpipe when using the swrast driver
https://gitlab.freedesktop.org/mesa/mesa/merge_requests/1403

android: Fix build with LLVM for Android 9
https://gitlab.freedesktop.org/mesa/mesa/merge_requests/1402

19 / 37

slide-20
SLIDE 20

Implementation — SwiftShader

Required:
- UDIV and SDIV instruction emulation (in the kernel).
- Android emulator composer: ranchu.
- Default Android gralloc.

Proved to be 1.5 to 2 times faster than llvmpipe.

20 / 37

slide-21
SLIDE 21

Performance

SwiftShader > llvmpipe > softpipe

21 / 37

slide-22
SLIDE 22

Performance — SwiftShader with LLVM

We managed to find a SwiftShader revision that uses LLVM as a backend instead of SubZero and is still compatible with our frameworks_native.

Lineage 16 / Android 9 / Replicant 9
    SurfaceFlinger: OpenGL ES 2.0 SwiftShader 4.0.0.4

Android Q
    fde88d96a58b92beab76035393b3acd849445160 Default to LLVM 7.0 JIT in Android build
    SurfaceFlinger: OpenGL ES 3.0 SwiftShader 4.1.0.5

No noticeable performance difference.

22 / 37

slide-23
SLIDE 23

Performance — Why is Replicant 6 much faster?

- Emulator switches? NO
    ro.kernel.qemu=1
- High-end graphics options? NO
    ro.config.avoid_gfx_accel=1
- Pixel format (RGB565)? Paul says YES (very hardware dependent)

23 / 37

slide-24
SLIDE 24

Future

24 / 37

slide-25
SLIDE 25

Future — RGB565 across entire stack

- gbm_gralloc
- drm_hwcomposer
- drm/exynos

All using RGB565. Potential performance breakthrough. If so, how to futureproof this?
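To make the RGB565 idea concrete, here is a minimal C sketch of what the format means at the pixel level and why it could matter for a software renderer: each pixel takes 2 bytes instead of the 4 bytes of a 32-bit format, so every frame moves half as much memory. The function names and the 720x1280 resolution in the note below are illustrative assumptions, not taken from gbm_gralloc, drm_hwcomposer or drm/exynos.

```c
#include <stddef.h>
#include <stdint.h>

/* Pack 8-bit-per-channel RGB into one 16-bit RGB565 pixel:
 * 5 bits of red (highest), 6 bits of green, 5 bits of blue (lowest).
 * The low bits of each channel are simply dropped. */
uint16_t pack_rgb565(uint8_t r, uint8_t g, uint8_t b)
{
    return (uint16_t)(((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3));
}

/* Bytes needed for one tightly packed frame. Hypothetical helper:
 * a real driver may pad each row to an alignment boundary. */
size_t frame_bytes(size_t width, size_t height, size_t bytes_per_pixel)
{
    return width * height * bytes_per_pixel;
}
```

For an assumed 720x1280 panel, `frame_bytes(720, 1280, 2)` is exactly half of `frame_bytes(720, 1280, 4)`. Whether that halving translates into a real speed-up on a given device is the open question on this slide.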

25 / 37

slide-26
SLIDE 26

Future — devfreq: which device needs clock boost?

1. Test each device independently through sysfs.
2. Identify which one is causing the corruption (tip: FIMD/LCD path).
3. Boost clock/voltage on userspace or kernel config.
4. Re-enable devfreq.
5. Work out a patch to fix upstream.

26 / 37
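Step 1 could be scripted with a small helper like the one below. This is a sketch that assumes the standard devfreq sysfs layout, where each device directory under /sys/class/devfreq exposes attributes such as governor, min_freq and max_freq. The function names are made up for illustration, and the base directory is a parameter only so the helper can be tried on a scratch directory first.

```c
#include <stdio.h>
#include <string.h>

/* Build "<base>/<device>/<attr>" into path; returns 0 on success, -1 if it does not fit. */
static int devfreq_path(char *path, size_t len, const char *base,
                        const char *device, const char *attr)
{
    int n = snprintf(path, len, "%s/%s/%s", base, device, attr);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Write value to a devfreq attribute. Returns 0 on success, -1 on error. */
int devfreq_write_attr(const char *base, const char *device,
                       const char *attr, const char *value)
{
    char path[256];
    if (devfreq_path(path, sizeof(path), base, device, attr) != 0)
        return -1;
    FILE *f = fopen(path, "w");
    if (f == NULL)
        return -1;
    int ok = fputs(value, f) >= 0;
    if (fclose(f) != 0)
        ok = 0;
    return ok ? 0 : -1;
}

/* Read the first line of a devfreq attribute into buf, without the newline. */
int devfreq_read_attr(const char *base, const char *device,
                      const char *attr, char *buf, size_t len)
{
    char path[256];
    if (len == 0 || devfreq_path(path, sizeof(path), base, device, attr) != 0)
        return -1;
    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;
    char *got = fgets(buf, (int)len, f);
    fclose(f);
    if (got == NULL)
        return -1;
    buf[strcspn(buf, "\n")] = '\0';
    return 0;
}
```

On a device, one would call `devfreq_write_attr("/sys/class/devfreq", "<device>", "governor", "performance")` for one device at a time and watch whether the corruption disappears. The device name is left as a placeholder because finding it is exactly what steps 1 and 2 are about.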

slide-27
SLIDE 27

Future — SwiftShader + drm_hwcomposer

Advantages (vs ranchu):
- hardware planes
- DRM node instead of direct framebuffer

27 / 37

slide-28
SLIDE 28

Future — Profiling, benchmarks and conformance

- Profiling: turn on profiling switch on Mesa + simpleperf?
- Benchmarks: ask Android-x86 (proprietary?)
- Conformance: dEQP (drawElements Quality Program) and piglit

28 / 37

slide-29
SLIDE 29

Future — 2D acceleration on drm_hwcomposer

- Software-based: Pixman (has ARM NEON fast path)
- Hardware-based: Exynos FIMG2D (Fully Integrated Mobile Graphics 2D)

29 / 37

slide-30
SLIDE 30

Future — SDIV/UDIV on compiler-rt

Patch with kernel emulation of SDIV/UDIV is not optimized. Try compiler-rt’s builtins instead.
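For context, a CPU without SDIV/UDIV needs division done in software, and a runtime library such as compiler-rt ships routines for that (for example `__udivsi3` and `__divsi3`, or `__aeabi_uidiv` and `__aeabi_idiv` on ARM EABI). The sketch below shows the classic shift-and-subtract approach in plain C. It is an illustration of the idea only, not compiler-rt's actual code, and the function names are hypothetical.

```c
#include <stdint.h>

/* Unsigned 32-bit division by shift-and-subtract (binary long division).
 * No '/' operator is used, so no UDIV instruction is needed.
 * Division by zero returns 0 here, an arbitrary choice for this sketch. */
uint32_t soft_udiv32(uint32_t n, uint32_t d)
{
    uint32_t q = 0, r = 0;
    if (d == 0)
        return 0;
    for (int i = 31; i >= 0; i--) {
        r = (r << 1) | ((n >> i) & 1u);   /* bring down the next bit of n */
        if (r >= d) {                     /* divisor fits: subtract, set quotient bit */
            r -= d;
            q |= 1u << i;
        }
    }
    return q;
}

/* Signed division built on the unsigned one, truncating toward zero.
 * The final cast assumes the usual two's complement conversion. */
int32_t soft_sdiv32(int32_t n, int32_t d)
{
    uint32_t un = n < 0 ? 0u - (uint32_t)n : (uint32_t)n;
    uint32_t ud = d < 0 ? 0u - (uint32_t)d : (uint32_t)d;
    uint32_t q = soft_udiv32(un, ud);
    if ((n < 0) != (d < 0))
        q = 0u - q;
    return (int32_t)q;
}
```

A routine like this runs entirely in user space, whereas kernel emulation has to take an undefined-instruction trap for every division. That is presumably where the builtins could win.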

30 / 37

slide-31
SLIDE 31

Future — ARM NEON on llvmpipe

ARM NEON: SIMD instructions

How to use:
- Tune auto-vectorization on LLVM: easy to try; possible to upstream.

Borrow ideas from Pixman, Skia and libyuv (all these have NEON fast paths).

31 / 37

slide-32
SLIDE 32

Future — ARM NEON on llvmpipe

ARM NEON: SIMD instructions

How to use:
- Tune auto-vectorization on LLVM: easy to try; possible to upstream.
- Neon assembly: too cumbersome (e.g. manual register allocation).

Borrow ideas from Pixman, Skia and libyuv (all these have NEON fast paths).

32 / 37

slide-33
SLIDE 33

Future — ARM NEON on llvmpipe

ARM NEON: SIMD instructions

How to use:
- Tune auto-vectorization on LLVM: easy to try; possible to upstream.
- Ne10 library: easy to use; difficult to upstream (requires new deps).
- Neon assembly: too cumbersome (e.g. manual register allocation).

Borrow ideas from Pixman, Skia and libyuv (all these have NEON fast paths).

33 / 37

slide-34
SLIDE 34

Future — ARM NEON on llvmpipe

ARM NEON: SIMD instructions

How to use:
- Tune auto-vectorization on LLVM: easy to try; possible to upstream.
- Ne10 library: easy to use; difficult to upstream (requires new deps).
- Neon intrinsics: nice compromise between performance and code complexity; possible to upstream.

    #include <arm_neon.h>
    uint8x8_t va, vb, vr;
    vr = vadd_u8(va, vb);

- Neon assembly: too cumbersome (e.g. manual register allocation).

Borrow ideas from Pixman, Skia and libyuv (all these have NEON fast paths).
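The intrinsics snippet above can be expanded into a complete function. The sketch below adds two byte arrays, eight lanes at a time with `vld1_u8`, `vadd_u8` and `vst1_u8` when the compiler defines `__ARM_NEON`, and falls back to a scalar loop otherwise so it also builds on other targets. The function name `add_u8` is made up for illustration and does not come from llvmpipe.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* dst[i] = a[i] + b[i] for n bytes, wrapping modulo 256 like vadd_u8. */
void add_u8(uint8_t *dst, const uint8_t *a, const uint8_t *b, size_t n)
{
    size_t i = 0;
#if defined(__ARM_NEON)
    /* NEON path: process 8 bytes per iteration. */
    for (; i + 8 <= n; i += 8) {
        uint8x8_t va = vld1_u8(a + i);
        uint8x8_t vb = vld1_u8(b + i);
        vst1_u8(dst + i, vadd_u8(va, vb));
    }
#endif
    /* Scalar tail, or the whole array when NEON is unavailable. */
    for (; i < n; i++)
        dst[i] = (uint8_t)(a[i] + b[i]);
}
```

The compiler handles register allocation and scheduling, which is what makes intrinsics less cumbersome than hand-written assembly while still pinning down the SIMD operation that gets emitted.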

34 / 37

slide-35
SLIDE 35

Future — ARM NEON on llvmpipe

How to use intrinsics when llvmpipe must output LLVM IR? Can LLVM IR contain ARM NEON assembly code?

∗Source: ScotXW on Wikimedia under CC0

35 / 37

slide-36
SLIDE 36

Future — Lima

The holy grail.
- Quite active now. New commits every week.
- No idea of current compliance (asked devs to update features.txt).
- Planned approach: offload implemented GL operations to Lima.
- Where in the stack should we intercept GL operations? GLSL IR? TGSI?
- Won’t the overhead of interception, introspection and dispatch kill any performance gains?

36 / 37

slide-37
SLIDE 37

Questions?∗

∗Ask Putti the hard ones. xD

37 / 37