Monday, June 23, 2014

Fire in the (root) hole!

This will, I think, be the first time blogging about something quite so retroactively, but for reasons which should be apparent, I could not blog about this little adventure until now.  This is the story of CVE-2014-0972 (QCIR-2014-00004-1), and (at least part of) how I was able to install fedora on my firetv:


Back in April, I bought myself a Fire TV, with the thought that it would make a nice fedora xbmc htpc setup, complete with open src drivers, to replace my aging pandaboard.  But, of course, as delivered the Fire TV is locked down with no root access.

At the same time, there was a feature of the downstream android kernel gpu driver (kgsl), per-context pagetables, which had been on my TODO list for the upstream drm/msm driver for a while now.  But, I needed to understand better what kgsl was doing and the interactions with the hardware, in particular the behaviour of the CP (command processor), in order to convince myself that such a feature was safe.  People generally frown on introducing root holes in the upstream kernel, and I didn't exactly have documentation about the hardware.  So it was time to roll up my sleeves and get some hands-on experience (translation: try to poke and crash the gpu in lots of different ways and try to make sense of the result).

Into the rabbit hole..

The modern snapdragon SoCs use IOMMUs everywhere.  Including the GPU.  To implement per-context gpu pagetables, basically all the driver needs to do is to bang a few IOMMU registers to change the pagetable base addr and invalidate the TLB.  But this must be done when you are sure the GPU is not still trying to access memory mapped in the old page tables.  Since a GPU is a highly asynchronous device, it would be a big performance hit to stall until GPU ringbuffer drains, then reprogram IOMMU, then resume the GPU with commands from the new context.  To avoid this performance hit, kgsl maps some of the IOMMU registers into the GPU's virtual address space, and emits commands into the ringbuffer for the CP to write the necessary registers to switch pagetables and invalidate TLB.

It was this reprogramming of IOMMU from the GPU itself which I needed to understand better.  Anyone who understands GPU's would have the initial reaction that this is extremely dangerous.  But kgsl was, it seemed, taking some protections.  However, I needed to be sure I properly understood how this worked, to see if there was something that was overlooked.

The GPU, in fact, has two hw contexts which it can switch between.  Essentially it is in some ways similar to supervisor vs user context on a CPU.  The way kgsl uses this is to map the IOMMU registers into the supervisor context, but not user contexts.  The ringbuffer is mapped into all the user contexts, plus supervisor context, at the same device virtual address.  The idea being that if the ringbuffer is mapped in the same position in all contexts, you can safely context switch from commands in the ringbuffer.

To do this, kgsl emits commands for the CP to write a special bit in CP_STATE_DEBUG_INDEX to switch to the "supervisor" context.  Then commands to write IOMMU registers, followed by a write to CP_STATE_DEBUG_INDEX to switch back to user context.  (I'm over-simplifying slightly, as there are some barriers needed to account for asynchronous writes.)  But userspace constructed commands never execute from the ringbuffer, instead the kernel puts an IB (indirect branch) into the ringbuffer to jump to the userspace constructed cmdstream buffer.  This userspace cmdstream buffer is never mapped into supervisor context, or into other users' contexts.  So in theory, if userspace tried to write CP_STATE_DEBUG_INDEX to switch to supervisor mode (and gain access to the IOMMU registers), the GPU would immediately page fault, since the cmdstream it was in the middle of executing is no longer mapped.  Ok, so far, so good.
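As a rough illustration of that intended protection, here is a toy model.  It is not kgsl code or real hardware behaviour: the class, the addresses and the command names are all made up, and it fetches strictly one command at a time.  Commands fetched from an address that is mapped in both contexts survive the switch, while a cmdstream mapped only in the user context faults on the very next fetch:

```python
class PageFault(Exception):
    pass


class ToyCP:
    """Toy command processor: fetches one command at a time, no prefetch."""

    def __init__(self, mappings):
        self.mappings = mappings      # context name -> set of mapped addresses
        self.context = "user"

    def run(self, cmds):
        executed = []
        for addr, cmd in cmds:
            # Every fetch is checked against the *current* context's mappings.
            if addr not in self.mappings[self.context]:
                raise PageFault(f"{cmd!r} at {addr:#x} not mapped in {self.context} context")
            if cmd == "switch_to_supervisor":
                self.context = "supervisor"
            elif cmd == "switch_to_user":
                self.context = "user"
            executed.append(cmd)
        return executed


# Made-up device virtual addresses, for illustration only.
RINGBUFFER, USER_IB, IOMMU_REGS = 0x1000, 0x2000, 0x3000
mappings = {
    "user":       {RINGBUFFER, USER_IB},
    "supervisor": {RINGBUFFER, IOMMU_REGS},
}

# Kernel-built sequence living in the ringbuffer: every fetch succeeds.
ok = ToyCP(mappings).run([
    (RINGBUFFER, "switch_to_supervisor"),
    (RINGBUFFER, "write_iommu_regs"),
    (RINGBUFFER, "switch_to_user"),
])

# The same switch attempted from a userspace cmdstream: the next fetch faults.
try:
    ToyCP(mappings).run([
        (USER_IB, "switch_to_supervisor"),
        (USER_IB, "write_iommu_regs"),
    ])
    faulted = False
except PageFault:
    faulted = True
```

The whole scheme rests on the assumption baked into `run()`: that the command after the switch is fetched only after the switch has taken effect.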

Where it breaks down..

From my attempts at switching to supervisor mode from IB1, and deciphering the fault address where the gpu crashed, and iommu register dumps, I could tell that the next few commands after the switch to supervisor mode were executed without problem.. there is some prefetch/pipelining!

But much more conveniently, while poking around, I realized that there were a couple pages mapped globally (in supervisor and all user contexts), which were mapped writable in user contexts.  I used the so-called "setstate" buffer.  So I simply had to construct a cmdstream buffer to write the commands I wanted to execute into the setstate buffer, and then do an IB to that buffer and do the supervisor switch in IB2.

Ok.. but to do anything useful with this, I'd need a reasonable chunk of physically contiguous pages, at a known physical address.. in particular 16K for first level pagetables and 16K second level pagetables.  Fortunately ION comes to the rescue here, with its physically contiguous carveouts at known physical addresses.  In this case, allocate from the multimedia pool when there is no video playback, etc, going on.  This way ION allocates from the beginning of the carveout pool, a known address.

Into this buffer, construct a new set of pagetables, which map whatever physical address you want to read/write (hint, any of kernel lowmem), a replacement page for the setstate buffer (since we don't know the original setstate buffer's physical address.. which means we actually have two copies of the commands copied into setstate buffer, one copied via gpu to original setstate page, and one written directly by cpu in the replacement setstate page).

The proof of concept that I made simply copied the string "Kilroy was here" into a kernel buffer.  But quite easily any random app downloaded from an untrusted source could access any memory, become root, etc.  Not the sort of thing you want falling into the wrong hands.

Once I managed to prove to myself that I understood properly how the hw was working, I wrote up a short report, and submitted it (plus proof of concept) to the qualcomm security team.

Now that the vulnerability is no longer embargoed, I've made available the proof of concept and report here.

Originally I planned to (once fixes were pushed out, so as to not put someone who did not intend to root their device at risk) release a jailbreak based on this vulnerability.  But once towelroot was released, there was no longer a need for me to turn this into an actual firetv jailbreak.  Which saves me from having to figure out how to make an apk.

Parting thoughts..

  1. Well, knowledge about physical addresses and contiguous memory in userspace, while it might not be a security problem in and of itself, sure helps turn other theoretical exploits into actual exploits.
  2. As far as downstream vendor drivers go, the kgsl driver is actually pretty decent, in terms of code quality, etc.  I've seen far worse.  Admittedly this was not a trivial hole.  But imagine what issues lurk in other downstream gpu/camera/video/etc drivers.  Security is often not simple, and I really doubt whether the other downstream drivers are getting a critical look (from good-guys who will report the issue responsibly).
  3. I used to think of the whole one-kernel-branch-per-device wild-west ways of android as a bit of a headache.  Now I realize it is a security nightmare.  An important part of platform security is being able to react quickly when (not if) vulnerabilities are found.  In the desktop/server world, CVEs are usually not embargoed for more than a week.. that is all you need, since fortunately we don't need a different kernel for each different make and model of server, laptop, etc.  In the mobile device world, it is quite a different story!

Tuesday, May 13, 2014

Freedreno turns gl 2.0 today!

I've just pushed to upstream mesa support for occlusion query, which means that freedreno now advertises OpenGL 2.0:

OpenGL vendor string: freedreno
OpenGL renderer string: Gallium 0.4 on FD320
OpenGL version string: 2.0 Mesa 10.3.0-devel (git-00fcf8b)
OpenGL shading language version string: 1.20

Note that this is desktop OpenGL.  Freedreno has supported OpenGLES 2.0 for quite a long time now.

Implementing occlusion query was a bit interesting due to the way the tiling works on adreno.  We have to track query results per tile.  I've written up a bit of a description about how it works on the wiki: Hardware Queries

Looks like next up is sRGB support which gets us up to GL 2.1.  And then the fun begins with work on GL/GLES 3.0 :-)

EDIT: turns out sRGB texture support is pretty easy.  So now we are GL 2.1.  (GL/GLES 3.0 also needs sRGB render target support which is a bit more involved.  But that is just one of several features needed for 3.0).

Friday, March 7, 2014

mesa git repo for f20

a quick PSA:

For those using my prebuilt freedreno binaries for fedora, there is now a much better way.  Nicolas Chauvet has created a repo w/ latest mesa which will work with freedreno:

Big thanks Nicolas!

Wednesday, February 5, 2014

freedreno: new compiler

Complementing the hw binning support which landed earlier this year, and is now enabled by default, I've recently pushed the initial round of new-compiler work to mesa.  Initially I was going to keep it on a branch until I had a chance to sort out a better register allocation (RA) algorithm, but the improved instruction scheduling fixed so many bugs that I decided it should be merged in its current form.

Or explained another way, ever since fedora updated to supertuxkart 0.8.1, about half the tracks had rendering problems and/or triggered gpu hangs.  The new compiler fixed all those problems (and more).  And I like supertuxkart :-)


The original a3xx compiler was more of a simple TGSI translator.  It translated each TGSI opcode into a simple sequence of one or more native instructions.  There was a fixed (per-shader) mapping between TGSI INPUT, OUTPUT, and TEMP vec4 register files to the native (flat) scalar register file.  A not-insignificant part of the code was lowering of TGSI opcodes that relate more closely to old ARB shader instructions (SCS - Sine Cosine, LIT - Light Coefficients, etc) than to the instruction set of any modern GPU.  That lowering was relatively generic in concept, but not in implementation.

The simple TGSI translator approach works fine with simple shader ISA's.  It worked ok for a2xx, other than slightly suboptimal register usage.  But the problem is that a3xx (and a4xx) is not such a simple instruction set architecture.  In particular, the instruction scheduling required that the compiler be aware of the shader instruction pipeline(s).  

This was obvious pretty early on in the reverse engineering stage.  But in the early days of the gallium a3xx support, there were too many other things to do... spending the needed time on the compiler then was not really an option.  Instead the "use lots of nop's and hope for the best" strategy was employed.

And while it worked as a stop-gap solution, it turns out that there are a lot of edge cases where "hope for the best" does not really work out that well in practice.  After debugging a number of rendering bugs and piglit failures which all traced back to instruction scheduling problems, it was becoming clear that it was time for a more permanent solution.

In with the new:

First thing I wanted to do before adding a lot more complexity is to rip out a bunch of code.  With that in mind I implemented a generic TGSI lowering pass, to replace about a dozen opcodes with sequences of equivalent simpler instructions.  This probably should be made configurable and moved to util, I think most of the lowerings would be useful to other gallium drivers.
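To make the idea of a lowering pass concrete, here is a minimal sketch in Python.  It is not the mesa code: the tuple representation of instructions and the function names are invented for the example, and the real pass operates on TGSI tokens.  It rewrites SCS, assuming the TGSI semantics of cos(src.x) in dst.x, sin(src.x) in dst.y, 0 in dst.z and 1 in dst.w, into simpler per-component instructions and passes everything else through untouched:

```python
def lower_scs(instr):
    """Lower one (opcode, dst, src) tuple; only SCS is rewritten."""
    op, dst, src = instr
    if op != "SCS":
        return [instr]
    return [
        ("COS", f"{dst}.x", f"{src}.x"),
        ("SIN", f"{dst}.y", f"{src}.x"),
        ("MOV", f"{dst}.z", "0.0"),
        ("MOV", f"{dst}.w", "1.0"),
    ]


def lower(shader):
    """Run the lowering over a whole (toy) shader, preserving order."""
    out = []
    for instr in shader:
        out.extend(lower_scs(instr))
    return out


shader = [("MOV", "TEMP[0]", "IN[0]"), ("SCS", "TEMP[1]", "TEMP[0]")]
lowered = lower(shader)
```

After a pass like this, the rest of the compiler only ever sees the simpler opcodes, which is what allows the per-opcode handling for the lowered instructions to be deleted.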

Once the handling of the now unneeded TGSI opcodes was removed, I copied fd3_compiler to fd3_compiler_old.  Originally the plan was to remove this before pushing upstream.  I just wanted a way to compare the results from the original compiler to the new compiler to help during testing and debugging.  But currently shaders with relative addressing need to fall back to the old compiler, so it stays for now.

The next step was to turn ir3 (the a3xx IR), which originates from the fdre-a3xx shader assembler into something more useful.  The approach I settled on (mostly to ease the transition) was to add a few extra "meta-instructions" to hold some additional information which would be needed in later passes, including Φ (Phi) instructions where a result depends on flow control.  Plus a few extra instruction and register flags, the important one being IR3_REG_SSA, used for src register nodes to indicate that the register node points to the dependent instruction.  Now what used to be the compiler (well, roughly 2/3rds of it) is the front-end.  Instead of producing a linear sequence of instructions fed directly to the assembler/codegen, the frontend is now generating a graph of instructions modified by subsequent passes until we have something suitable for codegen.

For each output, we keep the pointer to the instruction which generates that value (at the scalar level), which in turn has the pointer to the instructions generating its srcs/inputs, and so on.  As before, the front end is generating sequences of scalar instructions for each (written) component in a TGSI vector instruction.  Although now instructions whose result is not used simply have nobody pointing to them so they naturally vanish.
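A small sketch of why unused instructions "naturally vanish" in such a graph, using a made-up `Instr` class rather than the real ir3 structures: only instructions reachable from the outputs are ever visited, so nothing has to explicitly delete the dead ones.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class Instr:
    """Hypothetical stand-in for an IR node; srcs point at producing instructions."""
    name: str
    srcs: list = field(default_factory=list)


def live_instructions(outputs):
    """Walk src pointers from the outputs and return the names of everything reached."""
    live, stack = [], list(outputs)
    while stack:
        instr = stack.pop()
        if instr not in live:
            live.append(instr)
            stack.extend(instr.srcs)
    return {i.name for i in live}


x = Instr("mov x")
y = Instr("mov y")
mul = Instr("mul", [x, y])
unused = Instr("add (result never read)", [x, y])   # nobody points at this one
out = Instr("out", [mul])

live = live_instructions([out])
```

Here `unused` exists as an object, but since no output depends on it, later passes that start from the outputs never see it.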

At the same time, mostly to preserve my sanity while debugging, but partially also to make nifty pictures, I implemented an "ir3 dumper" which would dump out the graph in .dot syntax:

The first pass eliminates some redundant moves (some of which come from the front end, some from TGSI itself).  Probably the front end could be a bit more clever about not inserting unneeded moves, but since TGSI has separate INPUT/OUTPUT/TEMP register files, there will always be some extra moves which need eliminating.

After that, I calculate a "depth" for each instruction, where the depth is the number of instruction cycles/slots required to compute that value:

    dd(instr, n): depth(instr->src[n]) + delay(instr->src[n], instr)
    depth(instr): 1 + max(dd(instr, 0), ..., dd(instr, N))

where delay(p,c) gives the required number of instruction slots between an instruction which produces a value and an instruction which consumes a value.
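The two definitions above translate fairly directly into code.  The following is a sketch only: the `Instr` class is hypothetical, and `delay()` is simplified to depend only on the producer, with made-up latency values, whereas the real delay depends on both producer and consumer.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class Instr:
    name: str
    srcs: list = field(default_factory=list)
    latency: int = 0   # made-up: slots a consumer must wait after this instr


def delay(producer, consumer):
    # Simplifying assumption: the delay depends only on the producer.
    return producer.latency


def depth(instr, memo=None):
    """depth(instr) = 1 + max over srcs of (depth(src) + delay(src, instr))."""
    memo = {} if memo is None else memo
    if instr not in memo:
        dd = [depth(s, memo) + delay(s, instr) for s in instr.srcs]
        memo[instr] = 1 + max(dd, default=0)
    return memo[instr]


a = Instr("a", latency=3)      # a "slow" producer
b = Instr("b")
c = Instr("c", srcs=[a, b])    # 1 + max(1 + 3, 1 + 0) = 5
d = Instr("d", srcs=[c, b])    # 1 + max(5 + 0, 1 + 0) = 6
```

The memo dictionary matters in practice: values are shared between many consumers, so without it the same subgraph would be re-walked once per consumer.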

The depth is used for scheduling.  The short version of how it works is to recursively schedule output instructions with the greatest depth until no more instructions can be scheduled (more delay slots needed).  For instructions with multiple inputs/srcs, the unscheduled src instruction with the greatest depth is scheduled first.  Once we hit a point where there are some delay slots to fill, we switch to the next deepest output, and so on until the needed delay slots are filled.  If there are no instructions that can be scheduled, then we insert nop's.
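A simplified sketch of the recursive part of this, not the actual mesa implementation: outputs are visited deepest first, each instruction's srcs are scheduled deepest first, and the instruction is then placed once its delay requirements are met.  Unlike the scheduler described above, this version never tries to fill delay slots with work from the next deepest output; it always pads with nop's.  The `Instr` class and the latency values are made up, and `delay()` depends only on the producer.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class Instr:
    name: str
    srcs: list = field(default_factory=list)
    latency: int = 0   # made-up: slots a consumer must wait after this instr


def delay(producer, consumer):
    return producer.latency   # simplifying assumption


def depth(instr, memo):
    if instr not in memo:
        dd = [depth(s, memo) + delay(s, instr) for s in instr.srcs]
        memo[instr] = 1 + max(dd, default=0)
    return memo[instr]


def schedule(outputs):
    memo, slot_of, order = {}, {}, []

    def emit(instr):
        if instr in slot_of:
            return
        # Deepest unscheduled src first.
        for src in sorted(instr.srcs, key=lambda s: depth(s, memo), reverse=True):
            emit(src)
        # Earliest slot that leaves delay(src, instr) slots after every src.
        earliest = max((slot_of[s] + delay(s, instr) + 1 for s in instr.srcs), default=0)
        while len(order) < earliest:
            order.append("nop")      # nothing else to do in this sketch
        slot_of[instr] = len(order)
        order.append(instr.name)

    for out in sorted(outputs, key=lambda o: depth(o, memo), reverse=True):
        emit(out)
    return order


a = Instr("a", latency=3)
b = Instr("b")
c = Instr("c", srcs=[a, b])
d = Instr("d", srcs=[c, b])
result = schedule([d])   # ['a', 'b', 'nop', 'nop', 'c', 'd']
```

Note that `b` already fills one of the three slots needed after `a`; the remaining two become nop's, which is exactly where the real scheduler would go looking at other outputs for something useful to do.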

Once the graph is scheduled, we have a linear sequence of instructions, at which point we do RA.  I won't say too much about that now, since it is already a long post and I'll probably change the algorithm.  It is worth noting that some register assignment algorithms can coalesce unneeded moves.  Although moves factor into the scheduling decisions for the a3xx ISA, so I'm not really sure that this is too useful to me.

The end result: thanks to a combination of removal of scalar instructions to calculate TGSI vec4 register components which are unused, plus removal of unnecessary moves, plus scheduling other instructions rather than filling with nop's everywhere, for non-trivial shaders it is not uncommon to see the new compiler use ~33% of the number of instructions, and half the number of registers, compared to the original compiler.


Validating compilers is hard.  Piglit has a number of tests to exercise relatively specific features.  But with games, it isn't always the case that an incorrect shader produces (visually) incorrect results.  And visually incorrect results are not always straightforward to trace back to the problem.  Ie. games typically have many shaders and many draw calls, so tracking down the problematic draw and its shaders is not always easy.

So I wrote a very simplistic emulator for testing the output of the compiler.  I captured the TGSI dumps of all the shaders from various apps (ST_DEBUG=tgsi).  The test app would assemble the TGSI, feed into both the old and new compiler, then run same sets of randomized inputs through the resulting shaders and compare outputs.
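The comparison step can be sketched roughly as follows.  This is only an outline of the idea, with plain Python functions standing in for "shader compiled by the old compiler, run in the emulator" and its new-compiler counterpart; the function names, input ranges and tolerance are assumptions made for the example.

```python
import math
import random


def compare(old_shader, new_shader, n_inputs, trials=100, seed=0, tol=1e-6):
    """Feed the same random inputs to both shaders; return inputs that disagree."""
    rng = random.Random(seed)          # fixed seed so failures are reproducible
    mismatches = []
    for _ in range(trials):
        inputs = [rng.uniform(-10.0, 10.0) for _ in range(n_inputs)]
        a, b = old_shader(inputs), new_shader(inputs)
        same = len(a) == len(b) and all(
            math.isclose(x, y, rel_tol=tol, abs_tol=tol) for x, y in zip(a, b)
        )
        if not same:
            mismatches.append(inputs)
    return mismatches


# Stand-ins for the emulated output of each compiler.
old = lambda v: [v[0] * v[1] + v[2]]
new = lambda v: [v[2] + v[1] * v[0]]      # reordered, but equivalent
broken = lambda v: [v[0] * v[1] - v[2]]   # a miscompile

good_result = compare(old, new, n_inputs=3)
bad_result = compare(old, broken, n_inputs=3)
```

Returning the failing inputs, rather than just a pass/fail flag, makes it possible to replay a single bad case while debugging.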

There are a few cases where differing output is expected, since the new compiler has slightly more well defined undefined behaviour for shaders that use uninitialized values... to avoid invalid pointers in the graph produced by the front-end, uninitialized values get a 'mov Rdst, immed{0.0}' instruction.  So there are some cases where the resulting shader needs to be manually validated.  But in general this let me test (and debug) the new compiler with 100's of shaders in a relatively short amount of time.


So the obvious question, what does this all mean in terms of performance?  Well, start with the easy results, es2gears[1]:
  • original compiler: ~435fps
  • new compiler: ~539fps
With supertuxkart, the result is a bit easier to show in pictures.  Part of the problem is that the tracks that are heavy enough on the GPU to not be purely CPU limited, didn't actually work before with the original compiler.  That plus, as far as I know, there is no simple benchmark mode which spits out a number at the end, as with xonotic.  So I used the trace points + timechart approach, mentioned in a previous post.

    supertuxkart -f --track fortmagma --profile-laps=1

I manually took one second long captures, in as close to the same spot as possible (just after light turns green):

    ./perf timechart record -a -g -o sleep 1

In this case I was running on an apq8074/a330 device, fwiw.  Our starting point is:

Then once hw binning is in place, we are starting to look more CPU limited than anything:

And with addition of new compiler, the GPU is idle more of the time, but since the GPU is no longer the bottleneck (on the less demanding tracks) there isn't too much change in framerate:

Still, it could help power if the GPU can shut off sooner, and other levels which push the GPU harder benefit.

With binning plus improved compiler, there should not be any more huge performance gaps compared to the blob compiler.  Without linux blob drivers, there is no way to make a real apples to apples comparison, but remaining things that could be improved should be a few percent here and there.  Which is a good thing.  There are still plenty of missing features and undiscovered bugs, I'm sure.  But I'm hopeful that we can at least have things in good shape for a3xx before the first a4xx devices ship ;-)

[1] Windowed apps would benefit somewhat from XA support in DDX, avoiding stall for GPU to complete before sw blit (memcpy) to front buffer.. but the small default window size for 'gears means that hw binning does not have much impact.  The remaining figures are for fullscreen 1280x720.

Wednesday, January 8, 2014

freedreno update: new year edition

Time for another freedreno update.  hw binning support, and fun with gallium HUD.


The big news is that hw binning pass support (for a3xx) is working.   This is a pre-pass for all the draws which generates a visibility stream (ie. basically which vertices apply to which tiles) used to speed up the tile rendering step by filtering out non visible vertices for a given tile.
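Conceptually, the binning pass answers "which primitives can touch which tile?".  The sketch below is a CPU-side illustration of that question only, not of how the hardware encodes its visibility stream: it bins triangles conservatively by bounding box, and the function name, screen size and tile size are all invented for the example.

```python
import math


def bin_triangles(triangles, width, height, tile_w, tile_h):
    """Return {(tile_x, tile_y): [triangle indices]} using bounding boxes."""
    tiles_x = -(-width // tile_w)     # ceiling division
    tiles_y = -(-height // tile_h)
    bins = {(tx, ty): [] for ty in range(tiles_y) for tx in range(tiles_x)}
    for idx, tri in enumerate(triangles):
        xs = [p[0] for p in tri]
        ys = [p[1] for p in tri]
        # Clamp the covered tile range to the screen; off-screen triangles
        # end up with an empty range and land in no bin at all.
        x0 = max(math.floor(min(xs)) // tile_w, 0)
        x1 = min(math.floor(max(xs)) // tile_w, tiles_x - 1)
        y0 = max(math.floor(min(ys)) // tile_h, 0)
        y1 = min(math.floor(max(ys)) // tile_h, tiles_y - 1)
        for ty in range(y0, y1 + 1):
            for tx in range(x0, x1 + 1):
                bins[(tx, ty)].append(idx)
    return bins


triangles = [
    [(1, 1), (10, 1), (1, 10)],        # fits entirely in the top-left tile
    [(20, 20), (50, 20), (20, 50)],    # bounding box straddles all four tiles
    [(100, 100), (120, 100), (100, 120)],  # completely off-screen
]
bins = bin_triangles(triangles, width=64, height=64, tile_w=32, tile_h=32)
```

When rendering tile (1, 1), only triangle 1 needs to be processed; without binning, every tile would have to run through all three.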

tl;dr: games or anything with a healthy vertex loading (ie. not window managers) are showing 35-45% fps boost.

Currently it is not enabled by default.  I'd like some time for it to get more testing before it is enabled by default.  For now, use the FD_MESA_DEBUG environment variable to enable it, ie:

  FD_MESA_DEBUG=binning supertuxkart

Also, since I was looking for a way to correlate fps with various other statistics (in particular batches per second vs frames per second), I started playing with the gallium performance monitor HUD (heads-up-display).  With the addition of a few driver custom queries, I had what I needed:

The driver custom queries:
  • draw-calls
  • batches - number of batches per second, sum of batches-sysmem plus batches-gmem
  • batches-gmem - a set of tiles in GMEM rendered, for each tile (optionally) system mem -> gmem (restore), plus N draws, plus gmem -> system mem (resolve); value in batches per second
  • batches-sysmem - draws to system memory (GMEM bypass) per second
  • restores - number of GMEM batches that required restore per second
So the above screenshot was generated with:

 export GALLIUM_HUD=cpu0+cpu1+cpu2+cpu3,fps+batches-sysmem+batches-gmem+restores,draw-calls
 export FD_MESA_DEBUG=binning
 supertuxkart -s 1280x720 --demo-mode 1

The binning and query support are on mesa master.

Sunday, November 24, 2013

freedreno update

It's been a while since I've posted an update about the progress of freedreno.. so no major/big headlines, just lots of small stuff.

Mesa 10

I finally polished up the support for emulating (via index buffer) GL_QUAD and other desktop GL primitives which aren't supported in hardware by adreno.  This is needed for gnome-shell and compiz (and probably other compositing window managers using opengl).  The u_primconvert utility could be handy in case any of the other upcoming drivers for SoC GPU's need to emulate any GL primitives which are not in GLES.  This, plus some other fixes needed for latest gnome-shell in fedora 20 were merged prior to the mesa 10.0 branch point, meaning that once Mesa 10 trickles into distributions, you should be able to use distro packaged freedreno rather than needing to rebuild mesa from git.
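The index buffer trick is easiest to see for quads.  This is a sketch of the general idea, not of u_primconvert itself: the function name is made up, and the exact vertex order the real utility emits (which matters for things like flat shading) may differ.  Each quad (v0, v1, v2, v3) is drawn as the two triangles (v0, v1, v2) and (v0, v2, v3) by generating indices into the unmodified vertex buffer:

```python
def quads_to_triangle_indices(num_vertices):
    """Build a triangle-list index buffer for a non-indexed draw of quads."""
    indices = []
    # Ignore trailing vertices that do not form a complete quad.
    for base in range(0, num_vertices - num_vertices % 4, 4):
        indices += [base, base + 1, base + 2,
                    base, base + 2, base + 3]
    return indices


two_quads = quads_to_triangle_indices(8)
# two_quads == [0, 1, 2, 0, 2, 3, 4, 5, 6, 4, 6, 7]
```

The vertex data itself is never copied or rewritten, only the (small) index buffer is generated, which is what makes this kind of emulation cheap.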


Since last blog post, I've added support for relative addressing (needed by chromium gl rendering, and a bunch of piglit tests), and fixed a whole bunch of little bugs or missing bits.  And I've started publishing piglit results.  Don't read too much into the absolute numbers, the all_es2 tests from Tom Gall's gles2-all branch still have a number of bogus tests (ie. shaders with precision specifier issues, etc), so not all the failures are freedreno bugs.  But there has been an increase in passes (and no more crashers) over last few months.

I do really badly need a better collection of GLES2 tests ;-)


The IFC6410 is finally shipping out in larger numbers, as more folks in #freedreno are starting to receive their boards.  This board has been my primary freedreno dev platform for a while now.  If you are looking for a nice small SBC type ARM board with open source graphics, this is a pretty sweet little board.  Pico-itx, APQ8064 (1.5GHz quad core krait + adreno 320), 2GiB DDR3, SATA and gigabit-ethernet (hooked up via pci-e, not usb :-)).  Only downside is upstream kernel support for APQ8064 is pretty non-existent[1], there is only a downstream msm-3.4 based kernel (see ifc6410-drm branch).

And more recently I received a bStem board.  This board is more targeted at robotics (bunch of sensors, FPGA, and various add on boards for motor/RC control, etc).  But it has APQ8060A (1.7GHz dual core krait + adreno 320), and the typical hdmi and usb connectors.  I've pushed initial kernel msm drm/kms support to the bstem-drm branch.  I'm using the same Fedora 20 filesystem that I use with the ifc.

[1] APQ8x74 (aka snapdragon 800) seems to be getting into better shape in upstream kernel, so hopefully we start seeing APQ8074 versions of some of these boards at some point.

Adreno 4xx

Last week qualcomm announced their first adreno 420 device.  We knew this was coming, since support has been starting to show up in qualcomm's downstream android kernel driver (kgsl) in the last few months.  It unfortunately doesn't contain nearly as many useful hints as kgsl did for 2xx and 3xx, but it does give us a few register names.  And fwiw, more recent versions of the android blob userspace GLES drivers appear to have support for 4xx.

The recent announcements don't give too many details, but previous leaked specs indicate DX11 feature-set, and this seems to be backed up by a handful of register names we can see from downstream kgsl driver.  (ie. hull/tessellator/domain/geometry shaders, etc).

From what I can tell so far, 4xx appears to be same shader ISA as 3xx (phew!), but pretty much all registers change or at least move, and a lot more features in hw.  So hopefully shouldn't take as long to figure out compared to 3xx (which had both new shader ISA plus register reshuffling).. at least for getting basics running.

Since the recent blob drivers have 4xx support, it should be possible to make a reasonable amount of progress on 4xx r/e before we can get our hands on actual devices.  Of course, there is still much to do on 3xx, so for the time being 4xx is not a priority.

Mailing List, etc

Since more folks are starting to play with freedreno (on IFC6410 and other devices), the whole email-questions-directly-to-rob thing is starting to look like it might not scale too well in the long run.  And, asking questions on IRC doesn't work out too well if you don't have a bip or screen setup to keep your connection alive until someone has a chance to wake up and answer.  So now we have a mailing list:

That plus steadily improving docs and info on the wiki should hopefully help.

Saturday, September 14, 2013

freedreno update: moar fps!

Now that msm drm/kms kernel driver is merged upstream, I've spent the last few weeks on a bit of a debugging / fixing spree.  (Yes, an odd way to start a post about performance/profiling.)  I added proper support for mipmaps/cubemaps/etc (multi-slice resources), killed a few gpu lockup bugs, installed a bunch of games and went looking for and fixing rendering issues.  I've put together a status table on the freedreno wiki.

In the process, I noticed some games, such as supertuxkart, which had low fps, also had unusually low gpu utilization (30-50%).  Now, a new graphics driver stack will always have lots of room for optimization (which is certainly true of freedreno).  The key is to know which optimization to work on first.  It does no good to make the shader compiler generate 2x faster shaders (which I think is currently possible) if that is just going to take you from 30-50% utilization to 15-25% utilization at roughly the same fps.  So before we get to the fun optimizations, we need to take care of any of the cpu side bottlenecks in the driver.

Now the linux perf tool is pretty nice just for identifying purely cpu bottlenecks.  In fact it showed me pretty quickly that the upstream IOMMU framework struggles with gpu type workloads.  Mapping/unmapping individual pages is not really the way to do it.  On the downstream msm-3.4 based android kernel, we have iommu_map_range() and iommu_unmap_range()[1]... using these instead is worth 2-3 fps in xonotic, and probably more in supertuxkart, but we'll come back to that.

But perf tool does not really help much with gpu or cpu/gpu interactions, at least not by itself.  So, first I added some trace points in the kernel drm/kms driver.. in particular, I put tracepoints:
  1. tracing the fence # when work is submitted to the gpu, and when we get the completion interrupt.
  2. tracing the fence # when cpu waits on a fence and when it finishes waiting
  3. and when pageflip is requested and when it completes (after rendering completes and after vsync)
And then I hacked up the perf timechart tool to display gpu information in the timechart, for a nice timeline overview.  Currently I have it looking for the msm trace events, but I think that it would be useful to have a small set of generic trace events which all the drm drivers can use, so that tools won't have to be looking for driver specific traces.  I think what I have is a reasonable start, but probably needs a bit of work to handle gpu's that have multiple rings, etc.
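To give a feel for what can be derived from the submit/complete pair of trace events, here is a small sketch that turns a list of events into total GPU busy time.  The event tuple format and the names "submit" and "complete" are invented for the example (they are not the actual msm trace event names), and it assumes every submitted fence eventually gets its own completion event.

```python
def gpu_busy_time(events):
    """events: iterable of (timestamp, kind, fence) with kind 'submit' or 'complete'.

    The GPU is considered busy whenever at least one fence is outstanding.
    """
    busy = 0.0
    outstanding = set()
    busy_since = None
    for ts, kind, fence in sorted(events):
        if kind == "submit":
            if not outstanding:
                busy_since = ts          # idle -> busy transition
            outstanding.add(fence)
        elif kind == "complete":
            outstanding.discard(fence)
            if not outstanding and busy_since is not None:
                busy += ts - busy_since  # busy -> idle transition
                busy_since = None
    return busy


events = [
    (0.0, "submit", 1),
    (2.0, "submit", 2),
    (3.0, "complete", 1),
    (5.0, "complete", 2),
    (8.0, "submit", 3),
    (9.0, "complete", 3),
]
busy = gpu_busy_time(events)       # busy for 0..5 and 8..9
utilization = busy / 10.0          # over a 10 time-unit capture window
```

The gap between 5.0 and 8.0 in this made-up trace is the interesting part: that is time where the gpu has nothing to do, which is what the timechart makes visible.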

With that, I fired up supertuxkart again (in demo mode so it will drive itself), and then perf timechart record for a couple seconds to capture a short trace:

You can see above, there is a new bar at the top, below the cpu bars, for the gpu, showing when the gpu is active.  And a green overlay bar on the gpu showing where pageflip has been requested (typically right after rendering submitted), and when pageflip completes (next vblank after rendering completes).  And below, in the per-process bars, a yellow overlay marker when the process is pending on a fence (waiting for some gpu rendering to complete).

And immediately we can see that the bottleneck is a fence that supertuxkart is stalling on before it is able to submit rendering for the next frame.  After a little bit of poking, I realized that I should implement support for PIPE_TRANSFER_DISCARD_WHOLE_RESOURCE in the freedreno gallium driver.  If this usage bit is set, it is a hint to the gallium driver that the previous buffer contents do not need to be preserved after the upload.  So in cases that the backing gem buffer object (bo) is still busy (referenced by previous rendering which is not yet complete), it is better to just delete the bo and create a new one, rather than stalling the cpu.  The drm driver holds a ref for bo's that are associated to gpu rendering which has not yet completed, so the pages for the old bo don't go away until the gpu is finished with them.
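The decision being made at map time can be sketched like this.  It is a toy model, not the freedreno code: `BO`, `Resource` and `map_for_write` are hypothetical names, and the "stall" is just a counter rather than an actual wait on a fence.

```python
class BO:
    """Toy buffer object; busy means the GPU still references it."""
    def __init__(self, size):
        self.size = size
        self.busy = False


class Resource:
    def __init__(self, size):
        self.bo = BO(size)
        self.stalls = 0

    def map_for_write(self, discard_whole_resource=False):
        if self.bo.busy:
            if discard_whole_resource:
                # Old contents are not needed: swap in a fresh bo and let the
                # GPU finish with the old one in its own time.
                self.bo = BO(self.bo.size)
            else:
                # Contents must be preserved: wait for the GPU (modelled
                # here as a counter plus clearing the busy flag).
                self.stalls += 1
                self.bo.busy = False
        return self.bo


res = Resource(4096)
old_bo = res.bo
old_bo.busy = True                       # still referenced by pending rendering

new_bo = res.map_for_write(discard_whole_resource=True)
```

In the toy model `old_bo` simply stays alive as long as something references it; in the real driver that role is played by the reference the drm driver holds until rendering completes.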

With this change, things have improved, but there is still a bottleneck:

(note that the timescale differs between these three timecharts, since the capture duration differed)

Oddly we see a lot of activity on kworker (workqueue worker thread in the kernel).  This is mainly retire_worker, in particular releasing the reference that the driver holds to bo's for rendering which is now completed.  After a bit more digging, it turns out that supertuxkart is creating on the order of 150-200 transient buffers per frame.  Unref'ing these, unmapping from IOMMU and cpu, and deleting backing pages for that many buffers takes some time.  Even with some optimization in the kernel, there is still going to be a lot of overhead in the associated vma setup/teardown (since many of these buffers are used for vertex/attribute upload, and will need to be mmap'd), zeroing out pages before the next allocation, etc.

So borrowing an idea from i915, I implemented a bo cache in userspace, in libdrm_freedreno.  On new allocations, we round up to the next bucket size, and if there is an unused buffer in the bucket cache which is not still busy, we take that buffer instead of allocating a new one.  (If I add a BO_FOR_RENDERING flag, like i915, I could take a still-busy gem bo for cases where I know cpu access will not be needed... by the time the gpu starts writing to the buffer, it will be no longer busy.)
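A minimal sketch of such a bucketed cache, to show the shape of the idea.  This is not the libdrm_freedreno implementation: the class names are made up, and power-of-two buckets starting at 4096 bytes are an assumption chosen for simplicity, so the real bucket sizes may well differ.

```python
class BO:
    def __init__(self, size):
        self.size = size
        self.busy = False     # True while the GPU is still using the buffer


class BOCache:
    def __init__(self):
        self.buckets = {}     # bucket size -> list of cached, released BOs

    @staticmethod
    def bucket_size(size, minimum=4096):
        """Round a request up to the next (assumed power-of-two) bucket."""
        bucket = minimum
        while bucket < size:
            bucket *= 2
        return bucket

    def alloc(self, size):
        bsize = self.bucket_size(size)
        bucket = self.buckets.get(bsize, [])
        for i, bo in enumerate(bucket):
            if not bo.busy:
                return bucket.pop(i)      # reuse instead of a fresh allocation
        return BO(bsize)                  # nothing suitable cached

    def release(self, bo):
        self.buckets.setdefault(bo.size, []).append(bo)


cache = BOCache()
a = cache.alloc(5000)        # rounded up to the 8192 bucket
cache.release(a)
b = cache.alloc(6000)        # same bucket, and `a` is idle, so it is reused

b.busy = True                # pretend the GPU is still rendering from it
cache.release(b)
c = cache.alloc(8000)        # cached buffer is busy, so a new one is created
```

Rounding up wastes some memory per buffer, but it is what makes a released buffer likely to match a later request of a slightly different size.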

With this, things look much better:

As you can see, the gpu is nearly continuously occupied.  And a nice benefit is a drop in cpu utilization.  To do this properly, I need to add a MADVISE style ioctl in msm drm/kms driver, so userspace can advise the kernel that it is keeping a bo around in a cache, and that the kernel is free to free the backing pages under memory pressure, tear down the cpu mapping, etc.  This will prevent the wrath of the OOM killer :-)

So now with the bottlenecks in the driver worked out, future work to make the gpu render faster (ie, hw binning pass, shader compiler optimizations, etc) will actually bring a meaningful benefit.

[1] just fwiw, the ideal IOMMU API would give me a way to make multiple map/unmap updates without tlb/etc flush.  This should be even better than the map/unmap_range variants.  I know when I'm submitting rendering jobs which reference the buffers to the GPU, so I have good points for a batch IOMMU update flush.