High Performance Batching w/ GFX

honey the codewitch

Apr 25, 2022

MIT

Use batching to increase performance during complex rendering operations.

Introduction

GFX has had draw destination batching since its inception, but you couldn't readily take advantage of it directly, and GFX itself could use it only in limited situations. This limited your ability to push a rectangular window of pixels to the display as fast as possible. Usually, you'd have to allocate a temporary bitmap, write to it, and then send it all at once, memory permitting.

That approach has three problems: it adds code complexity, it requires enough memory to complete the operation, and it has no fallback mechanism.

User level batching solves those problems by handling the bitmap technique for you, while falling back to driver level batching when there's not enough memory to use bitmaps.

With this article, I will endeavor to explain how to use it.

Concepts

The slowest part of displaying any graphics is communicating over the bus. This raw I/O is ultimately the determining factor of performance, but you can increase performance if you can reduce this traffic.

Little IoT display controllers almost all work roughly the same way, and due to the way they work, sending a pixel takes a significant amount of overhead, while sending a whole rectangle of pixels takes very little extra overhead.

Batching in GFX is the process of specifying a rectangular window to write to, and then writing out the pixels left to right, top to bottom, without specifying the coordinates of the individual pixels. Not specifying the coordinates is where we see the reduction in bus traffic.
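To put rough numbers on that, here is a minimal sketch of the traffic involved. It assumes an ILI9341-style command set, where setting a window takes a column address command, a page address command, and a memory write command. The function names are hypothetical and not part of GFX, and the model ignores chip select and D/C line toggling, so treat the results as ballpark figures only.

```cpp
#include <cassert>
#include <cstdint>

// Simplified model (an assumption for illustration):
//   column address set: 1 command byte + 4 parameter bytes
//   page address set:   1 command byte + 4 parameter bytes
//   memory write:       1 command byte
constexpr uint32_t window_overhead_bytes = (1 + 4) + (1 + 4) + 1;
// RGB565 pixels are 2 bytes each
constexpr uint32_t bytes_per_pixel = 2;

// every pixel sets its own 1x1 window before writing (hypothetical helper)
inline uint32_t per_pixel_traffic(uint32_t width, uint32_t height) {
  assert(width > 0 && height > 0);
  return width * height * (window_overhead_bytes + bytes_per_pixel);
}

// one window for the whole rectangle, then a stream of pixels
// (hypothetical helper)
inline uint32_t batched_traffic(uint32_t width, uint32_t height) {
  assert(width > 0 && height > 0);
  return window_overhead_bytes + width * height * bytes_per_pixel;
}
```

Under this model, refreshing a full 320x240 screen costs 998,400 bytes pixel by pixel versus 153,611 bytes batched, roughly 6.5 times less traffic.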

While low level (draw destination level) batching is more efficient than writing individual pixels, you can do better still by sending bitmaps. This isn't because of bus traffic, but because of SPI transaction overhead on the MCU itself, which we also want to reduce. Basically, sending a whole stream of data at once is faster than sending, say, 16 bits of it at a time, even though the traffic on the bus is the same.

The bottom line is that bitmaps are the preferred mechanism, memory permitting. We want to fall back to low level batching if there's no memory for the bitmaps, and only go pixel by pixel if batching isn't supported at all, or if the destination supports blting, in which case there's no bus to worry about.
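The preference order above can be sketched as a small decision function. This is only an illustration of the reasoning, not how GFX is implemented internally. Every name here (write_strategy, destination_caps, choose_strategy) is hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical names for illustration only; these are not GFX types.
enum class write_strategy {
  bitmap,        // buffer into a bitmap, then send it in one transfer
  driver_batch,  // set a window on the device, then stream pixels
  per_pixel      // write each pixel individually
};

struct destination_caps {
  bool supports_blt;    // memory-backed destination, so no bus is involved
  bool supports_batch;  // driver exposes window-based batching
};

inline write_strategy choose_strategy(const destination_caps& caps,
                                      std::size_t bitmap_bytes_needed,
                                      std::size_t bytes_available) {
  // no bus to optimize for, so plain pixel writes are fine
  if (caps.supports_blt) return write_strategy::per_pixel;
  // preferred path, memory permitting
  if (bitmap_bytes_needed <= bytes_available) return write_strategy::bitmap;
  // not enough memory: fall back to driver level batching
  if (caps.supports_batch) return write_strategy::driver_batch;
  // last resort
  return write_strategy::per_pixel;
}
```

In practice, you don't make this choice yourself. The point of user level batching is that the library picks the fallback for you.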

One of the other advantages of using bitmaps is that GFX can do asynchronous DMA transfers as you write out pixels, so that it's sending in the background while you're writing, increasing efficiency. In order to take advantage of this, you have to set up your driver's bus to enable DMA.
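One common way to get that overlap is to alternate between two buffers, filling one while the other is being sent. The sketch below models the idea in plain C++. It is an assumption about the general technique rather than a description of GFX's internals. The class name, the callback, and the buffer sizes are all hypothetical, and the callback here runs synchronously where real hardware would start a DMA transfer and return immediately.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical sketch: none of these names come from GFX.
class ping_pong_writer {
 public:
  // stands in for kicking off a background (DMA) send
  using transfer_fn =
      std::function<void(const uint16_t* data, std::size_t count)>;

  ping_pong_writer(std::size_t pixels_per_buffer, transfer_fn start_transfer)
      : capacity_(pixels_per_buffer),
        start_transfer_(std::move(start_transfer)) {
    assert(capacity_ > 0);
    buffers_[0].reserve(capacity_);
    buffers_[1].reserve(capacity_);
  }

  // append a pixel; hand the buffer off as soon as it is full
  void write(uint16_t pixel) {
    buffers_[active_].push_back(pixel);
    if (buffers_[active_].size() == capacity_) flush();
  }

  // send whatever is pending, even a partially filled buffer
  void commit() {
    if (!buffers_[active_].empty()) flush();
  }

 private:
  void flush() {
    // On real hardware, we would first wait for any transfer that is
    // still reading from the other buffer before reusing it.
    start_transfer_(buffers_[active_].data(), buffers_[active_].size());
    active_ ^= 1;  // keep writing into the other buffer
    buffers_[active_].clear();
  }

  std::size_t capacity_;
  transfer_fn start_transfer_;
  std::vector<uint16_t> buffers_[2];
  int active_ = 0;
};
```

Writing 10 pixels through a writer with 4-pixel buffers produces transfers of 4, 4, and then 2 pixels once commit() is called.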

The Demo

And now onto the meat:

main.cpp

First include our headers and import the namespaces:

#include <Arduino.h>
// the bus framework
#include <tft_io.hpp>
// the display driver
#include <ili9341.hpp>
// GFX for C++14
#include <gfx_cpp14.hpp>
// the font
#include "DEFTONE.hpp"
// driver/bus namespace
using namespace arduino;
// GFX namespace
using namespace gfx;

Now here's our wiring and configuration. The project supports the ESP WROVER KIT 4.1, or a standard ESP32 with the display wired to the default pins for VSPI: MOSI 23, MISO 19, SCLK 18. In addition the other pins are CS 5, DC 2, RST 4, and BCKL 15.

#define LCD_HOST    VSPI
#ifdef ESP_WROVER_KIT // don't change these
#define PIN_NUM_MISO 25
#define PIN_NUM_MOSI 23
#define PIN_NUM_CLK  19
#define PIN_NUM_CS   22

#define PIN_NUM_DC   21
#define PIN_NUM_RST  18
#define PIN_NUM_BCKL 5
#define BCKL_HIGH false
#else // change these to your setup. below is fastest
#define PIN_NUM_MISO 19
#define PIN_NUM_MOSI 23
#define PIN_NUM_CLK  18
#define PIN_NUM_CS   5

#define PIN_NUM_DC   2
#define PIN_NUM_RST  4
#define PIN_NUM_BCKL 15
#define BCKL_HIGH true
#endif

If you don't see any display, try changing BCKL_HIGH to false, as some displays require it to be pulled low instead of high.

Now we declare our bus and driver using the above settings:

// declare the bus for the driver
// must use tft_spi_ex in order to
// enable DMA for async operations
using bus_t = tft_spi_ex<LCD_HOST,
                        PIN_NUM_CS,
                        PIN_NUM_MOSI,
                        PIN_NUM_MISO,
                        PIN_NUM_CLK,
                        SPI_MODE0,
                        false,
                        320*240*2+8,
                        2>;
// declare the driver
using lcd_t = ili9341<PIN_NUM_DC,
                      PIN_NUM_RST,
                      PIN_NUM_BCKL,
                      bus_t,
                      1,
                      BCKL_HIGH,
                      400, // 40MHz writes
                      200>; // 20MHz reads

On the ESP32, the DMA figure is derived by computing the total bytes to hold the framebuffer plus 8, in this case 320*240*2+8 because each pixel is 2 bytes at 320x240. We like DMA channel 2 on this platform because sometimes DMA channel 1 is used for other purposes.
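As a quick check of that arithmetic, here is a small compile-time helper. The name dma_size is hypothetical, and rounding the bit depth up to whole bytes is an assumption on my part for depths that aren't a multiple of 8.

```cpp
#include <cassert>
#include <cstddef>

// bytes for one full frame plus the 8 extra bytes described above
// (hypothetical helper, not part of the bus framework)
constexpr std::size_t dma_size(std::size_t width,
                               std::size_t height,
                               std::size_t bits_per_pixel) {
  return width * height * ((bits_per_pixel + 7) / 8) + 8;
}

// 320x240 at 16 bits per pixel, as used in this article
static_assert(dma_size(320, 240, 16) == 153608, "unexpected DMA size");
```

The result, 153608, is the same value as the 320*240*2+8 expression passed to tft_spi_ex above.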

The following just makes it so we can get X11 colors for this display easily. For example, we can do color_t::sky_blue.

// easy access to the color enum
using color_t = color<typename lcd_t::pixel_type>;

After that, we have some settings that dictate the behavior of the application:

// app settings

// true to draw a gradient, false for
// a checkerboard
const bool gradient = false;
// true to use async batching
const bool async = true;
// the text
const char* text = "hello world!";
// the text height in pixels
const uint16_t text_height = 75;
// the font
const open_font& text_font = DEFTONE_ttf;

Hopefully, the comments make it clear, but you can always play around with the values to get a better idea of how they work.

On to the global variables, of which we have several:

// globals
lcd_t lcd;
float hue;
srect16 text_rect;
float text_scale;

These hold the driver instance, the current hue, the rectangle where the text will be drawn, and the precomputed scale factor for the text, respectively.

Next we just premeasure the text in setup. Since the text is always the same, we do it here so it's only done once.

void setup() {
  Serial.begin(115200);
  // premeasure the text and center it
  // so we don't have to every time
  text_scale = text_font.scale(text_height);
  text_rect = text_font.measure_text(ssize16::max(),
                                    spoint16::zero(),
                                    text,
                                    text_scale).
                                      bounds().
                                      center((srect16)lcd.
                                                      bounds());
}
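If you're curious what the centering step amounts to, the following standalone sketch shows the arithmetic. The rect16 struct is a hypothetical stand-in that assumes inclusive corner coordinates. It is not the GFX srect16 type, and center_in is not a GFX function.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a signed rectangle with inclusive corners
struct rect16 {
  int16_t x1, y1, x2, y2;
  int16_t width() const { return x2 - x1 + 1; }
  int16_t height() const { return y2 - y1 + 1; }
};

// returns inner moved so that it sits centered within outer
inline rect16 center_in(const rect16& inner, const rect16& outer) {
  const int16_t x = outer.x1 + (outer.width() - inner.width()) / 2;
  const int16_t y = outer.y1 + (outer.height() - inner.height()) / 2;
  return rect16{x, y,
                int16_t(x + inner.width() - 1),
                int16_t(y + inner.height() - 1)};
}
```

For example, a 200x75 text box centered on a 320x240 screen ends up with its top left corner at (60, 82).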

Finally, we get to the good stuff, the batching:

void loop() {
  // current background color
  hsv_pixel<24> px(true,hue,1,1);
  
  // use batching here
  auto ba = (async)?draw::batch_async(lcd,lcd.bounds()):
                  draw::batch(lcd,lcd.bounds());
  
  if(gradient) {
    // draw a gradient
    for(int y = 0;y<lcd.dimensions().height;++y) {
      px.template channelr<channel_name::S>(((double)y)/lcd.bounds().y2);
      for(int x = 0;x<lcd.dimensions().width;++x) {
        px.template channelr<channel_name::V>(((double)x)/lcd.bounds().x2);
        ba.write(px);
      }
    }
  } else {
    // draw a checkerboard pattern
    for(int y = 0;y<lcd.dimensions().height;y+=16) {
      for(int yy=0;yy<16;++yy) {
        for(int x = 0;x<lcd.dimensions().width;x+=16) {
          for(int xx=0;xx<16;++xx) {
            if (0 != ((x + y) % 32)) {
              ba.write(px);
            } else {
              ba.write(color_t::white);
            }
          }
        }
      }
    }
  }
  // commit what we wrote
  ba.commit();
  // offset the hue by half the total hue range
  float hue2 = hue+.5;
  if(hue2>1.0) {
    hue2-=1.0;
  }
  px=hsv_pixel<24>(true,hue2,1,1);
  // convert the HSV pixel to RGBA because HSV doesn't antialias
  // and so we can alpha blend
  rgba_pixel<32> px2;
  convert(px,&px2);
  // set the alpha channel to 75%
  px2.channelr<channel_name::A>(.75);
  // draw the text
  draw::text(lcd,text_rect,spoint16::zero(),text,text_font,text_scale,px2);
  // increment the hue
  hue+=.1;
  if(hue>1.0) {
    hue = 0.0;
  }
}

You can see most of the complexity here is the actual drawing, and part of that comes from the requirement that the pixels be written in order from left to right, top to bottom. The checkerboard in particular required a little creativity via those quadruple nested loops. The actual batching is quite simple, requiring three calls: batch<>() or batch_async<>(), write<>(), and commit(). If you don't commit explicitly, the batch will be committed when it goes out of scope. However, if you attempt to draw other things while a batch is in progress, the results are undefined, so it's best to commit explicitly to avoid that situation.
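As an aside, the ordering requirement doesn't force the quadruple loop. The sketch below is standalone C++ with hypothetical function names, and it uses a vector of flags in place of real pixels. It reproduces the order produced by the nested loops and compares it to a plain two-loop version that derives the color from the tile coordinates instead.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// true = hue-colored pixel, false = white, in the order a batch expects:
// left to right within a row, rows from top to bottom

// the quadruple loop from the demo (dimensions must be multiples of 16)
inline std::vector<bool> checker_nested(int width, int height) {
  assert(width % 16 == 0 && height % 16 == 0);
  std::vector<bool> out;
  out.reserve(static_cast<std::size_t>(width) * height);
  for (int y = 0; y < height; y += 16)
    for (int yy = 0; yy < 16; ++yy)
      for (int x = 0; x < width; x += 16)
        for (int xx = 0; xx < 16; ++xx)
          out.push_back(0 != ((x + y) % 32));
  return out;
}

// two loops: the tile column plus the tile row is odd on colored tiles
inline std::vector<bool> checker_flat(int width, int height) {
  assert(width > 0 && height > 0);
  std::vector<bool> out;
  out.reserve(static_cast<std::size_t>(width) * height);
  for (int y = 0; y < height; ++y)
    for (int x = 0; x < width; ++x)
      out.push_back(((x / 16 + y / 16) & 1) != 0);
  return out;
}
```

Both functions produce the same sequence for a 320x240 screen. The two-loop form also copes with dimensions that aren't multiples of 16, at the cost of two divisions per pixel.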

It should be noted that there's a much faster way to draw the checkerboard pattern without batching: simply draw the squares using draw::filled_rectangle<>(). That takes less bus traffic than updating every pixel of the display, which is what we do when we batch in this case. It was done this way here to demonstrate batching.

History

25th April, 2022 - Initial submission