Blogging the Monkey

Sunday, February 11, 2018

Infrequent freedreno update

As is usually the case, I'm long overdue for an update. So this covers the last six(ish) months or so. The first part might be old news if you follow phoronix.

Older News

In the last update, I mentioned basic a5xx compute shader support. Late last year (and landing in the mesa 18.0 branch) I had a chance to revisit compute support for a5xx, and finished:

image support
shared variable support
barriers, which involved some improvements to the ir3 instruction scheduler so barriers could be scheduled in the correct order (ie. for various types of barriers, certain instructions can't be move before/after the related barrier

There were also some semi-related SSBO fixes, and additional r/e of instruction encodings, in particular for barriers (new cat7 group of instructions) and image vs SSBO (where different variation of the cat6 instruction encoding are used for images vs SSBOs).

Also I r/e'd and added support for indirect compute, indirect draw, texture-gather, stencil textures, and ARB_framebuffer_no_attachments on a5xx. Which brings us pretty close to gles31 support. And over the holiday break I r/e'd and implemented tiled texture support, because moar fps ;-)

Ilia Mirkin also implemented indirect draw, stencil texture, and ARB_framebuffer_no_attachments for a4xx. Ilia and Wladimir J. van der Laan also landed a handful of a2xx and a20x fixes. (But there are more a20x fixes hanging out on a branch which we still need to rebase and merge.) It is definitely nice seeing older hw, which blob driver has long since dropped support for, getting some attention.

Other News

Not exactly freedreno related, but probably of some interest to freedreno users.. in the 4.14 kernel, my qcom_iommu driver finally landed! This was the last piece to having the gpu working on a vanilla upstream kernel on the dragonboard 410c. In addition, the camera driver also landed in 4.14, and venus, the v4l2 mem-to-mem driver for hw video decode/encode landed in 4.13. (The venus driver also already has support for db820c.)

Fwiw, the v4l2 mem-to-mem driver interface is becoming the defacto standard for hw video decode/encode on SoC's. GStreamer has had support for a long time now. And more recently ffmpeg (v3.4) and kodi have gained support:

When I first started on freedreno, qcom support for upstream kernel was pretty dire (ie. I think serial console support might have worked on some ancient SoC). When I started, the only kernel that I could use to get the gpu running was old downstream msm android kernels (initially 2.6.35, and on later boards 3.4 and 3.10). The ifc6410 was the first board that I (eventually) could run an upstream kernel (after starting out with an msm-3.4 kernel), and the db410c was the first board I got where I never even used an downstream android kernel. Initially db410c was upstream kernel with a pile of patches, although the size of the patchset dropped over time. With db820c, that pattern is repeating again (ie. the patchset is already small enough that I managed to easily rebase it myself for after 4.14). Linaro and qcom have been working quietly in the background to upstream all the various drivers that something like drm/msm depend on to work (clk, genpd, gpio, i2c, and other lower level platform support). This is awesome to see, and the linaro/qcom developers behind this progress deserve all the thanks. Without much fanfare, snapdragon has gone from a hopeless case (from upstream perspective) to one of the better supported platforms!

Thanks to the upstream kernel support, and u-boot/UEFI support which I've mentioned before, Fedora 27 supports db410c out of the box (and the situation should be similar with other distro's that have new enough kernel (and gst/ffmpeg/kodi if you care about hw video decode). Note that the firmware for db410c (and db820c) has been merged in linux-firmware since that blog post.

More Recent News

More recently, I have been working on a batch of (mostly) compiler related enhancements to improve performance with things that have more complex shaders. In particular:

Switch over to NIR's support for lowering phi-web's to registers, instead of dealing with phi instructions in ir3. NIR has a much more sophisticated pass for coming out of SSA, which does a better job at avoiding the need to insert extra MOV instructions, although a bunch of RA (register allocation) related fixes were required. The end result is fewer instructions in resulting shader, and more importantly a reduction in register usage.
Using NIR's peephole_select pass to lower if/else, instead of our own pass. This was a pretty small change (although it took some work to arrive at a decent threshold). Previously the ir3_nir_lower_if_else pass would try to lower all if/else to select instructions, but in extreme cases this is counter-productive as it increases register pressure. (Background: in simple cases for a GPU, executing both sides of an if/else and using a select instruction to choose the results makes sense, since GPUs tend to be a SIMT arch, and if you aren't executing both sides, you are stalling threads in a warp that took the opposite direction in the if/else.. but in extreme cases this increases register usage which reduces the # of warps in flight.) End result was 4x speedup in alu2 benchmark, although in the real world it tends to matter less (ie. most shaders aren't that complex).
Better handling of sync flags across basic blocks
Better instruction scheduling across basic blocks
Better instruction scheduling for SFU instructions (ie. sqrt, rsqrt, sin, cos, etc) to avoid stalls on SFU.
R/e and add support for (sat)urate flag flag (to avoid extra sequence of min.f + max.f instructions to clamp a result)
And a few other tweaks.

The end results tend to depend on how complex the shaders that a game/benchmark uses. At the extreme high end, 4x improvement for alu2. On the other hand, probably doesn't make much difference for older games like xonotic. Supertuxkart and most of the other gfxbench benchmarks show something along the lines of 10-20% improvement. Supertuxkart, in particular, with advanced pipeline, the combination of compiler improvements with previous lrz and tiled texture (ie. FD_MESA_DEBUG=lrz,ttile) is a 30% improvement! Some of the more complex shaders I've been looking at, like shadertoy piano, show 25% improvement on the compiler changes alone. (Shadertoy isn't likely to benefit from lrz/ttile since it is basically just drawing a quad with all the rendering logic in the fragment shader.)

In other news, things are starting to get interesting for snapdragon 845 (sdm845). Initial patches for a6xx GPU support have been posted (although I still need to my hands on a6xx hw to start r/e for userspace, so those probably won't be merged soon). And qcom has drm/msm display support buried away in their msm-4.9 tree (expect to see first round of patches for upstream soon.. it's a lot of code, so expect some refactoring before it is merged, but good to get this process started now).

Sunday, August 27, 2017

About shader compilers, IR's, and where the time is spent

Occasionally the question comes up about why we convert between various IR's (intermediate representations), like glsl to NIR, in the process of compiling a shader. Wouldn't it be faster if we just skipped a step and went straight from glsl to "the final thing", which would be ir3 (freedreno), codegen (nouveau), or LLVM (radeonsi/radv). It is a reasonable question, since most people haven't worked on compilers and we probably haven't done a good job at explaining all the various passes involved in compiling a shader or presenting a breakdown of where the time is spent.

So I spent a bit of time this morning with perf to profile a shader-db run (or rather a subset of a full run to keep the perf.data size manageable, see notes at end).

A flamegraph from the shader-db run, since every blog post needs a catchy picture.

Breakdown:

parser, into glsl: 9.98%
glsl to nir: 1.3%
nir opt/lowering passes: 21.4%

CSE: 6.9%
opt algebraic: 3.5%
conversion to SSA: 2.1%
DCE: 2.0%
copy propagation: 1.3%
other lowering passes: 5.6%

nir to ir3: 1.5%
ir3 passes: 21.5%

register allocation: 5.1%
sched: 14.3%
other: 2.1%

assembly (ir3->binary): 0.66%

This is ignoring some of the fixed overheads of shader-db runner, and also doesn't capture individually a bunch of NIR lowering passes. NIR has ~40 lowering passes, some that are gl related like nir_lower_draw_pixels and nir_lower_wpos_ytransform (because for hysterical reasons textures and therefore FBO's are upside down in gl). For gallium drivers using NIR, these gl specific passes are called from mesa state-tracker.

The other lowering passes are not gl specific but tend to be specific to general GPU shader features (ie. things that you wouldn't find in a C compiler for a cpu) and things that are needed by multiple different drivers. Such as, nir_lower_tex which handles sampling from YUV textures, ie. inserting the instructions to do YUV->RGB conversion (since GLES and android strongly assume this is a thing that hardware can always do), lowering RECT textures, or clamping texture coords. These lowering passes are called from the driver backend so the driver is in control of what lowering pass are needed, including configuration about individual features in passes which handle multiple things, based on what the hardware does not support directly.

These lowering passes are mostly O(n), and lost in the noise.

Also note that freedreno, along with the other drivers that can consume NIR directly, disable a bunch of opt passes that were originally done in glsl, but that NIR (or LLVM) can do more efficiently. For freedreno, disabling the glsl opt passes shaved ~30% runtime off of a shader-db run, so spending 1.3% to convert into NIR is way more than offset.

For other drivers, the breakdown may be different. I expect radeonsi/radv skips some of the general opt passes in NIR which have a counterpart in LLVM, but re-uses other lowering passes which do not have a counterpart in LLVM.

Is it still a gallium driver?

This is a related question that comes up sometimes, is it a gallium driver if it doesn't use TGSI? Yes.

The drivers that can consume NIR and implement the gallium pipe driver interface, freedreno a3xx+, vc4, vc5, and radeonsi (optionally), are gallium drivers. They still have to accept TGSI for state trackers which do not support NIR, and various built-in shaders (blits, mipmap generation, etc). Most use the shared tgsi_to_nir pass for TGSI shaders. Note that currently tgsi_to_nir does not support all the TGSI features, but just features needed by internal shaders, and what is needed for gl3/gles3 (ie. basically what freedreno and vc4 needed before mesa state-tracker grew support for glsl_to_nir).

Notes:

Collected from shader-db run (glamor + supertuxkart + 0ad shaders) with a debug mesa build (to have debug syms and prevent inlining) but with NIR_VALIDATE=0 (otherwise results with debug builds are highly skewed). A subset of all shader-db shaders was used to keep the perf.data size manageable.

Sunday, June 25, 2017

long overdue update

Since it has been a while since the last update, I guess it is a good time to post an update on some of the progress that has been happening with freedreno and upstream support for snapdragon boards.

freedreno / mesa

While the 17.1 release included enabling reorder support by default, there have been many other interesting features landed since the 17.1 branch point (so they will be included in the future 17.2 release). Many, but not all, are related to a5xx. (Something that I just realized I forgot to blog about, but have demoed here and there.)

GL/GLES Compute Shaders:

So far this is only a5xx (although a4xx seems to work similarly, and would probably be not too hard to get working if someone had the right hardware and a bit of time). SSBOs and atomics are supported, but image support (an important part of compute shaders) is still TODO (and some r/e required, although it seems to share a lot in common with SSBOs). Adreno 3xx support for compute shaders appears to be more work (ie. less in common with a4xx/a5xx, and probably part of the reason that qualcomm never bothered adding support in android blob driver). Patches welcome, but for now a3xx compute support is far enough down my TODO list that it might not otherwise happen.

I know there is a lot of interest in open source OpenCL support for freedreno, and hopefully that is something that will come in the future. But there is the big challenge of how to get opencl shaders (kernels) into a form that can be consumed by freedreno's ir3 shader compiler backend. While there is some potential to re-use spirv_to_nir at some point, there are some complicated details. For compute kernels (ie. OpenCL) there are some restrictions lifted on SPIRV that spirv_to_nir relies on. (Little details like lack of requirement for structured flow control.)

A5xx HW Binning Support:

Traditionally hw binning support, while a pretty big perf boost, has been kinda difficult (translation: lot of things can be done wrong to lead to difficult to debug GPU lockups), this time around it wasn't so hard. I guess experience on a3xx/a4xx has helped. And everyone loves ~30% fps boost in your favorite game!

This has brought performance roughly up to the levels as ifc6540/a420. Which sounds bad, but remember we are comparing apples and oranges. On ifc6540 (snapdragon 805), we don't yet have upstream kernel support so this was using a 3.10 android kernel (with bus-scaling and all the downstream tricks to optimize memory bandwidth and overall SoC performance). But on a530 (dragonboard820c), I never had a working downstream kernel (or had to bother backporting the upstream drm/msm driver to some ancient android kernel.. hurray!). The upshot is that any perf #'s for a5xx don't include bus-scaling, cpufreq, etc. I expect a pretty big performance boost on a530 once we have a way to clock up memory/interconnects. (Ie. on micro-benchmarks a530 is >2x faster than a420 on alu limited workloads, but still a bit slower than a420 on bandwidth limited workloads, despite having a higher theoretical bandwidth.)

Side note, linaro is working on an upstream solution for bus-scaling. This is a very important improvement needed upstream for ARM SoC's, especially ones that optimize so strongly for battery life. (Keep in mind that interconnects, which span across the SoC, and memory, are a big power consumer in a modern SoC.. so a lot of qualcomm's good performance + battery life in phones comes down to these systemwide optimizations.) It is equivalent to slow memory clockings on some generations of nouveau, except in this case it is outside the gpu driver (ie. we aren't talking about vram on a discrete gpu), and the reason is to enable a high end phone SoC to last a couple days on battery, rather than keeping your video card from melting.

A5xx gles3.0/gl3.1 support:

Probably it would have made sense to spend time on this before compute shaders (since they are otherwise only exposed with $MESA_GL_VERSION_OVERRIDE tricks.. but hey, I was curious about how compute shaders worked). After an assortment of small things to r/e and implement, we where just a few (~50) texture/vbo/fb formats away from gl3.1. Nothing really exciting. Mostly just a few weekends probing unknown format #'s and seeing which piglit format tests started passing. The sort have thing that would have taken approximately 10 minutes with docs.. but hey, it needed to be done.

Switching to NIR by default:

This is one thing that benefits a3xx and a4xx as well as a5xx. While freedreno has had NIR support for a while, it hasn't been enabled by default until more recently. The issue was handling of complex dereferences (multi-dimensional arrays, arrays of structs, etc). The problem was that freedreno's ir3 backend preferred to keep things in SSA form (since that gives the instruction scheduler more flexibilty, which is pretty imprortant in the a3xx+ instruction set architecture (ir3)). Adding support to lower arrays to regs allowed moving the deref offset calculation to NIR so that we wouldn't regress by turning NIR on by default. This is useful since it cuts shader compilation time, but also because tgsi_to_nir doesn't support SSBOs, atomics, and other new shiny glsl features. (Now we only rely on tgsi_to_nir for various legacy paths and built-in blit shaders which don't need new shiny glsl features.)

A5xx HW Query Support:

Adreno 5xx changed how hw queries (ie. occlusion query and time-elapsed query, etc) work. For the better, since now we can accumulate per-tile results on the GPU. But it required some new support in freedreno for a different sort of query, and some r/e about how this actually worked. And while we had previously lied about occlusions query support (mostly to expose more than gl1.4 support), that isn't a very good long term solution. In addition, time-elapsed query is useful for performance/profiling work, so helpful for some of the following projects.

A5xx LRZ Support:

Adreno 5xx adds another cute optimization called "LRZ". (Presumably "low resolution Z (depth buffer)". I've spent a some time r/e'ing this feature and implementing support for it in freedreno. It is a neat new hw trick that a5xx has, which serves two purposes.

The basic idea is to have a per-quad depth value so that in the binning pass primitives can be rejected (per tile) based on depth (ie. reject more early). But then recycle the LRZ buffer in draw phase to function as for-free depth pre-pass (ie. reject earlier primitives based on the z value of later primitives).

The benefit depends on how well optimized the game is. Ie. games that are well optimized for traditional GPU architectures (ie. sorting geometry, already doing depth pre-passes, etc) won't benefit as much.. but this helps a lot for badly written games that relied on per-pixel deferred rendering.

Overall, for things like stk/xonotic, it seems like a ~5-10% win.

edit: I forgot to mention, this isn't enabled by default as it causes some issues (which seem like a sort of z-fighting) with 0ad. Other than that, I haven't found anything that it doesn't work with. To enable: FD_MESA_DEBUG=lrz. It would be nice if there were some way to have driver specific flags in driconf to control things like this.

The main remaining performance trick for a5xx is UBWC (ie. bandwidth compression) + tiled textures. I've worked out mostly how UBWC works (in particular texture layout, at least for 2d textures + mipmap, but I think we can infer how 2d arrays, 3d, etc, work from that). Most of the infrastructure for upload/download blits (to convert to/from linear) should be easier thanks to the reorder support. We'll see if I actually find time to implement it before the mesa 17.2 branch point.

Standardized Embedded Nonsense Hacks

Anyone who has dealt with arm (non-server) devices, should be familiar with the silly-embedded-nonsense-hacks world. In particular the non-standard boot-chain which makes it difficult for distro's to support the plethora of arm boards (let alone phones/tablets/etc) out there without per-board support. Which was fine in the early days, but N boards times M distro's, it really doesn't scale.

Thanks to work by Mateusz Kulikowski, we now have u-boot support for dragonboard 410c. It's been on my TODO list to play with for a while. But more recently I realized that u-boot, thanks to the work of many others, can provide enough of EFI runtime-services interface for grub to work. This means that it is a path forward for standardized distros on aarch64 (like fedora and opensuse), which expect UEFI, to boot on boards which don't otherwise have UEFI firmware.

So I decided to spend a bit of time pretending to be a crack smoking firmware engineer. (Not literally, of course.. that would be stupid!)

After fixing some linker script bugs with u-boot's db410c support vs efi_runtime section, and debugging some issues with grub finding the boot disk with the help of Peter Jones (the resident grub/EFI expert who conveniently sits near me), and a couple other misc u-boot fixes, I had a fedora 26 alpha image booting on the db410c.

The next step was figuring out display, so we could have grub boot menu on screen, like you would expect on a grown-up platform. As it turns out, on most devices, lk (little kernel, ie. what normally loads the kernel+initrd on snapdragon android devices) already supports lighting up the display, since most/all android devices put up the initial splash-screen before the kernel is loaded. Unfortunately this was not the case with the db410c's lk. But Archit (qcom engineer who has contributed a whole lot of drm/msm and other drm patches) pointed me at a different lk branch (among the 100's) which had msm8916 display + adv7533 dsi->hdmi bridge (like what db410c uses). After digging through a convoluted git history, I was able to track down the relevant gpio/i2c/adv7533 patches to port to the lk branch used on db410c.

After that, I added support for lk to populate a framebuffer node, using the simple-framebuffer bindings to pass the pre-configured scanout buffer (+dimensions) to u-boot. This plus a new simplefb video driver for u-boot, enables u-boot to expose display support to grub via the EFI GOP protocol. (Along the way I had to add 32bpp rgb support to lk since u-boot and grub don't understand packed 24bpp rgb.)

All this got to the point of:

This is a fedora image, booting off of usb disk (ie. not just rootfs on usb disk, but also grub/kernel/initrd/dtb). With graphical grub menu to select which kernel to boot, just like you would expect on a PC. The grubaa64.efi here is vanilla distro boot-loader, and from the point of view of the distro image, lk/u-boot is just the platform's firmware which somehow provides the UEFI interface the distro media expects. It is worth pointing out some advantages of a traditional lk->kernel boot chain:

booting from USB, network, etc (which lk cannot do)
doesn't require kernel packed in custom boot.img partition which is board specific
booting installer image (ie. from sd-card or network)

When the kernel starts, in early boot, it is using efifb, just like it would on a PC. (Ie. so you can see what is going on on-screen before hw specific drm driver kernel module is loaded).

There are still a few rough edges. The drm/msm driver and msm clk drivers are a bit surprised when some clks are already enabled when the kernel starts, and the display is already light up.. now we have a good reason to fix some of those issues. And right now we don't have a good way to load a newer device tree binary (dtb) after a distro kernel update (ie. without updating u-boot, aka "the firmware"). (For simple SoC's maybe a pre-baked dtb for the life of the board is sufficient... I have my doubts about that for SoCs as complex as the various snapdragon's, if for no other reason that we haven't even figured out how to model all the features of the existing SoCs in devicetree.) One idea is for u-boot to pass to grub the name of the board dtb file to load via EFI variables. I've sent a very early RFC to add EFI variable support in u-boot. We'll see how this goes, in the mean time there might be more "firmware" upgrades needed than you'd normally expect on a mature platform like x86.

For now, my lk + u-boot work is here:

and prebulit "firmware" is here. For now you will need to edit distro grub.cfg to add 'devicetree' commands to load appropriate dtb since what is included with u-boot.img is a very minimal fdt (ie. just enough for the drivers in u-boot).

Wednesday, November 16, 2016

a quick note for users/distros

At this point, I haven't pushed a new release tag for xf86-video-freedreno to update to latest xserver ABI. I'm inclined not to. If you are using a modern xserver you probably want to be using xf86-video-modesetting + glamor. It has more features (dri3, xv, etc) and better performance. And GL support on a3xx/a4xx is pretty solid. So distros with a modern xserver might as well drop the xf86-video-freedreno package.

The one case where xf86-video-freedreno is still useful is bringing up a new generation of adreno, since it can do dri2 with pure-sw fallbacks for all the EXA ops. But if that is what you are doing, I guess you know how to git clone and build.

The possible alternative is to push a patch that makes xf86-video-freedreno still build, but only probe (with latest xserver ABI) if some "ForceLoad" type option is given in xorg.conf, otherwise fallback to modesetting/glamor. I can't think of a good reason to do this at the moment. But as always, questions/comments/suggestions welcome.

Saturday, July 30, 2016

dirty tricks for moar fps!

This weekend I landed a patchset in mesa to add support for resource shadowing and batch re-ordering in freedreno. What this is, will take a bit of explaining, but the tl;dr: is a nice fps boost in many games/apps.

But first, a bit of background about tiling gpu's: the basic idea of a tiler is to render N draw calls a tile at a time, with a tile's worth of the "framebuffer state" (ie. each of the MRT color bufs + depth/stencil) resident in an internal tile buffer. The idea is that most of your memory traffic is to/from your color and z/s buffers. So rather than rendering each of your draw calls in it's entirety, you split the screen up into tiles and repeat each of the N draws for each tile to fast internal/on-chip memory. This avoids going back to main memory for each of the color and z/s buffer accesses, and enables a tiler to do more with less memory bandwidth. But it means there is never a single point in the sequence of draws.. ie. draw #1 for tile #2 could happen after draw #2 for tile #1. (Also, that is why GL_TIMESTAMP queries are bonkers for tilers.)

For purpose of discussion (and also how things are named in the code, if you look), I will define a tile-pass, ie. rendering of N draws for each tile in succession (or even if multiple tiles are rendered in parallel) as a "batch".

Unfortunately, many games/apps are not written with tilers in mind. There are a handful of common anti-patterns which force a driver for a tiling gpu to flush the current batch. Examples are unnecessary FBO switches, and texture or UBO uploads mid-batch.

For example, with a 1920x1080 r8g8b8a8 render target, with z24s8 depth/stencil buffer, an unnecessary batch flush costs you 16MB of write memory bandwidth, plus another 16MB of read when we later need to pull the data back into the tile buffer. That number can easily get much bigger with games using float16 or float32 (rather than 8 bits per component) intermediate buffers, and/or multiple render targets. Ie. two MRT's with float16 internal-format plus z24s8 z/s would be 40MB write + 40MB read per extra flush.

So, take the example of a UBO update, at a point where you are not otherwise needing to flush the batch (ie. swapbuffers or FBO switch). A straightforward gl driver for a tiler would need to flush the current batch, so each of the draws before the UBO update would see the old state, and each of the draws after the UBO update would see the new state.

Enter resource shadowing and batch reordering. Two reasonably big (ie. touches a lot of the code) changes in the driver which combine to avoid these extra batch flushes, as much as possible.

Resource shadowing is allocating a new backing GEM buffer object (BO) for the resource (texture/UBO/VBO/etc), and if necessary copying parts of the BO contents to the new buffer (back-blit).

So for the example of the UBO update, rather than taking the 16MB+16MB (or more) hit of a tile flush, why not just create two versions of the UBO. It might involve copying a few KB's of UBO (ie. whatever was not overwritten by the game), but that is a lot less than 32MB?

But of course, it is not that simple. Was the buffer or texture level mapped with GL_MAP_INVALIDATE_BUFFER_BIT or GL_MAP_INVALIDATE_RANGE_BIT? (Or GL API that implies the equivalent, although fortunately as a gallium driver we don't have to care so much about all the various different GL paths that amount to the same thing for the hw.) For a texture with mipmap levels, we unfortunately don't know at the time where we need to create the new shadow BO whether the next GL calls will glGenerateMipmap() or upload the remaining mipmap levels. So there is a bit of complexity in handling all the cases properly. There may be a few more cases we could handle without falling back to flushing the current batch, but for now we handle all the common cases.

The batch re-ordering component of this allows any potential back-blits from the shadow'd BO to the new BO (when resource shadowing kicks in), to be split out into a separate batch. The resource/dependency tracking between batches and resources (ie. if various batches need to read from a given resource, we need to know that so they can be executed before something writes to the resource) lets us know which order to flush various in-flight batches to achieve correct results. Note that this is partly because we use util_blitter, which turns any internally generated resource-shadowing back-blits into normal draw calls (since we don't have a dedicated blit pipe).. but this approach also handles the unnecessary FBO switch case for free.

Unfortunately, the batch re-ordering required a bit of an overhaul about how cmdstream buffers are handled, which required changes in all layers of the stack (mesa + libdrm + kernel). The kernel changes are in drm-next for 4.8 and libdrm parts are in the latest libdrm release. And while things will continue to work with a new userspace and old kernel, all these new optimizations will be disabled.

(And, while there is a growing number of snapdragon/adreno SBC's and phones/tablets getting upstream attention, if you are stuck on a downstream 3.10 kernel, look here.)

And for now, even with a new enough kernel, for the time being reorder support is not enabled by default. There are a couple more piglit tests remaining to investigate, but I'll probably flip it to be enabled by default (if you have a new enough kernel) before the next mesa release branch. Until then, use FD_MESA_DEBUG=reorder (and once the default is switched, that would be FD_MESA_DEBUG=noreorder to disable).

I'll cover the implementation and tricks to keep the CPU overhead of all this extra bookkeeping small later (probably at XDC2016), since this post is already getting rather long. But the juicy bits: ~30% gain in supertuxkart (new render engine) and ~20% gain in manhattan are the big winners. In general at least a few percent gain in most things I looked at, generally in the 5-10% range.

Wednesday, May 4, 2016

Freedreno (not so) periodic update

Since I seem to be not so good at finding time for blog updates recently, this update probably covers a greater timespan than it should, and some of this is already old news ;-)

Already quite some time ago, but in case you didn't already notice: with the mesa 11.1 release, freedreno now supports up to (desktop) gl3.1 on both a3xx and a4xx (in addition to gles3). Which is high enough to show up on the front page at glxinfo. (Which, btw, is a useful tool to see exactly which gl/gles extensions are supported by which version of mesa on various different hw.)

A couple months back, I spent a bit of time starting to look at performance. On master now (so will be in 11.3), we have timestamp and time-elapsed query support for a4xx, and I may expose a few more performance counters (mostly for the benefit of gallium HUD). I still need to add support for a3xx, but already this is useful to help profile. In addition, I've cobbled together a simple fdperf cmdline tool:

I also got around to (finally) implementing hw binning support for a4xx, which for *some* games can have a pretty big perf boost:

glmark2 'refract' bench (an extreme example): 31fps -> 124fps
xonotic (med): 44.4fps -> 50.3fps
supertuxkart (new render engine): 15fps -> 19fps

More recently I've started to run the dEQP gles3 tests against freedreno. Initially the results where not too good, but since then I've fixed a couple thousand test cases.. fortunately it was just a few bugs and a couple missing workaround for hw bug/limitations (depending on how you look at it) which counted for the bulk of the fails. Now we are at 98.9% pass (or 99.5% if you don't count the 'skips' against the pass ratio). These fixes have also helped piglit, where we are now up to 98.3% pass. These figures are a4xx, but most of the fixes apply to a3xx as well.

I've also made some improvements in ir3 (shader compiler for a3xx and later) so the code it generates is starting to be pretty decent. The immediate->const lowering that I just pushed helps reduce register pressure in a lot of cases. We still need support for spilling, but at least now shadertoy (which is some sort of cruel joke against shader compiler writers) isn't a complete horror show:

In other cool news, in case you had not already seen: Rob Herring and John Stultz from linaro have been doing some cool work, with Rob getting android running on an upstream kernel plus mesa running on db410c and qemu (with freedreno and virtgl), and John taking all that, and getting it all running on a nexus7 tablet. (And more recently, getting wifi working as well.) I had the opportunity to see this in person when I was at Linaro Connect in March. It might not seem impressive if you are unfamiliar with the extent to which android device kernels diverge from upstream, but to see an upstream kernel running on an actual device with only ~50patches is quite a feat:

The UI was actually reasonably fast, despite not yet using overlays to bypass GPU for composition. But as ongoing work in drm/kms for explicit fencing, and mesa EGL_ANDROID_native_fence_sync land, we should be able to get hw composition working.

Saturday, August 15, 2015

freedreno - mesa 11.0 progress update, OpenGLES3 and more

So the big news for the upcoming mesa 11.0 release is gl4.x support for radeon and nouveau. Which has been in the works for a long time, and a pretty tremendous milestone (and the reason that the next mesa release is 11.0 rather than 10.7). But on the freedreno side of things, we haven't been sitting still either. In fact, with the transform-feedback support I landed a couple weeks ago (for a3xx+a4xx), plus MRT+z32s8 support for a4xx (Ilia landed the a3xx parts of those a while back), we now support OpenGLES 3.0[1] on both adreno 3xx and 4xx!!

In addition, with the TBO support that landed a few days ago, plus handful of other fixes in the last few days, we have the new antarctica gl3.1 render engine for supertuxkart working!

Note that you need to use MESA_GL_VERSION_OVERRIDE=3.1 and MESA_GLSL_VERSION_OVERRIDE=140, since while we support everything that stk needs, we don't yet support everything needed to advertise gl3.1. (But hey, according to qualcomm, adreno 3xx doesn't even support higher than gles3.0.. I guess we'll have to show them ;-))

The nice thing to see about this working, is that it is utilizing pretty much all of the recent freedreno features (transform feedback, MRT, UBO's, TBO's, etc).

Of course, the new render engine is considerably more heavyweight compared to older versions of stk. But I think there is some low hanging fruit on the stk engine side of things to reclaim some of those lost fps.

update: oh, and the first time around, I completely forgot to mention that qualcomm has recently published *some* gpu docs, for a3xx, for the dragonboard 410c. Not quite as extensive as what broadcom has published for vc4, but it gives us all the a3xx registers, which is quite a lot more than any other SoC vendor has done to date :-)

[1] minus MSAA.. There is a bigger task, which is on the TODO list, to teach mesa st about some extensions to support MSAA resolve on tile->mem.. such as EXT_multisampled_render_to_texture, plus perhaps driconf option to enable it for apps that are not aware, which would make MSAA much more useful on a tiling gpu. Until then, mesa doesn't check MSAA for gles3, and if it did we could advertise PIPE_CAP_FAKE_SW_MSAA. Plus, who really cares about MSAA on a 5" 4k screen?