
GFX Sidebar: Inside the ILI9341 Display Driver on the ESP-IDF

6 May 2021 · MIT · 26 min read
Explore the inner workings of a highly capable IoT display driver for the ESP32
Dive into an implementation of a display driver for the popular ILI9341 implemented on the ESP32/ESP-IDF. The driver has full drawing support, batched operations and asynchronously queued operations for improved performance.

ILI9341 demo

Introduction

In GFX Part 4, I introduced an ILI9341 display driver but I did not explain how it worked. Since it's a very capable driver supporting the full gamut of write operations, it would be nice to understand it in order to implement your own. To that end, we're going to explore the code for this driver. Furthermore, given all of the functionality of the driver, getting into it should hopefully be interesting.

Building this Mess

You'll need Visual Studio Code with the Platform IO extension installed. You'll need an ESP32 with a connected ILI9341 LCD display. I recommend the Espressif ESP-WROVER-KIT development board which has an integrated display and several other pre-wired peripherals, plus an integrated debugger and a superior USB to serial bridge with faster upload speeds. They can be harder to find than a standard ESP32 devboard, but I found them at JAMECO and Mouser for about $40 USD. They're well worth the investment if you do ESP32 development. The integrated debugger, though very slow compared to a PC, is faster than you can get with an external JTAG probe attached to a standard WROVER devboard.

This project is set up for the above kit by default. If you're using a generic ESP32, you'll have to set your configuration to the generic-esp32 setting (as listed in the platformio.ini file). Make sure to select the appropriate configuration when you build. In addition, you'll need to change the SPI pin settings near the start of main.cpp to match your wiring. The defaults are for the ESP-WROVER-KIT. You'll also need to change the extra pins specific to the ILI9341, like the DC, RST and backlight pins.

Before you can run it, you must Upload Filesystem Image under the Platform IO sidebar - Tasks.

Note: The Platform IO IDE is kind of cantankerous sometimes. The first time you open the project, you'll probably need to go to the Platform IO icon on the left side - it looks like an alien. Click it to open up the sidebar and look under Quick Access|Miscellaneous for Platform IO Core CLI. Click it, and then when you get a prompt type pio run to force it to download necessary components and build. You shouldn't need to do this again, unless you start getting errors again while trying to build.

Conceptualizing this Mess

Structurally, the GFX library has no special knowledge of any display drivers, but display drivers must know about GFX in order to bind to its drawing functionality. Therefore, any display driver that serves as a draw target for GFX has a dependency on GFX. GFX itself, however, has no dependency on any display drivers.

Because there are GFX bindings on the driver, some small amount of familiarity with GFX is assumed. It should be noted, however, that the GFX bindings themselves are simply thin wrappers over the underlying driver functionality, and they can potentially be stripped out to eliminate the dependency on GFX.

The display drivers themselves may support various kinds of operations, as indicated by the caps member. GFX uses this member to determine how to call the draw target. Because a draw target may be practically anything that can bind to GFX, different targets naturally have different capabilities. For example, bitmaps do not expose asynchronous operations because it wouldn't make sense to - writes to bitmap memory are already nearly instant, and making them asynchronous would just add overhead. That's a different story for a display device connected over a relatively slow bus. In that case, performing operations asynchronously means you can be drawing the next thing while the bus is still sending data from the last thing. The caps structure simply lets GFX know what the draw target is capable of.
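To make the idea concrete, here's a minimal sketch of how a capabilities descriptor lets calling code choose a code path per target. The member names and values here are illustrative, not the actual GFX caps type:

```cpp
#include <cassert>
#include <cstring>

// Hypothetical capabilities descriptor; the real GFX caps member may
// differ in name, members and layout.
struct caps_sketch {
    bool async;  // supports asynchronously queued operations
    bool batch;  // supports batched writes
    bool read;   // supports reading pixels back
};

// An in-RAM bitmap: writes are nearly instant, so no async; reads work.
constexpr caps_sketch bitmap_caps{false, true, true};
// A bus-connected display like this ILI9341 driver: async pays off,
// but reading isn't implemented.
constexpr caps_sketch lcd_caps{true, true, false};

// GFX-style decision: take the queued path only when the target supports it.
const char* pick_fill_path(const caps_sketch& c) {
    return c.async ? "queued fill" : "blocking fill";
}
```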

The drivers also expose a pixel_type alias which indicates the type of pixel the driver's framebuffer supports. It should be noted that pixel formats are not switchable at runtime with this library; you must commit to a pixel type at compile time. This driver currently only supports RGB565, so the pixel type is not configurable. The display device itself supports two more pixel formats, but the driver doesn't support them yet.
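Since the driver commits to RGB565, it's worth seeing how that format packs: 5 bits of red, 6 of green and 5 of blue in a 16-bit word. This is general knowledge about the format rather than driver-specific API:

```cpp
#include <cassert>
#include <cstdint>

// Pack 8-bit-per-channel RGB into RGB565: 5 bits red, 6 bits green,
// 5 bits blue, in one 16-bit word.
constexpr uint16_t rgb565(uint8_t r, uint8_t g, uint8_t b) {
    return static_cast<uint16_t>(((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3));
}

// Pure white packs to 0xFFFF and pure black to 0x0000, which is why
// 0xFFFF*((x+y)%2) makes a black/white dither later in the article.
```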

Drivers must support writing, and may support reading. The ILI9341 driver currently does not support reading even though the device does. The reason is that the datasheet isn't clear about the read commands, and I haven't figured out how to get data back from the device yet.

In terms of writing, the basic operations are clearing a rectangle, filling a rectangle, writing a source such as a bitmap to the frame buffer, and writing a pixel to the frame buffer.

In addition to that, a driver may support batching writes such that you can specify a write window rectangle up front and then do buffered writes of the pixels, and commit when finished. Doing so is significantly more efficient than writing pixel by pixel. If the driver supports batch operations GFX will use them in many if not most cases.

On top of that, a driver can support asynchronous versions of its methods. Using these, operations can be dispatched and control returned before the operation is complete. Control is not necessarily returned immediately if there's no more room in the queue for a transaction. In that case, the function will wait until the queue has a free slot. Even in that case, queuing can provide a significant optimization opportunity wherein you can begin drawing the next frame while the current frame is being sent to the display.

On the ILI9341 driver, the raw driver methods and the GFX bindings that wrap them are exposed on the same class. They didn't have to be, but doing this simplified the code, even if it means the interface is kind of cluttered. Since the driver is low level, I didn't prioritize making the surface area of it "clean" and minimal. In general, the driver's low level functions are named like noun_verb while the GFX bindings are named like verb_noun.

Using this Mess

Before we dive into how it works, we should talk about how to use it.

Since the display uses SPI to connect to the ESP32, first we have to initialize the SPI bus:

C++
spi_master spi_host(&error_result_code,
                    LCD_HOST,
                    PIN_NUM_CLK,
                    PIN_NUM_MISO,
                    PIN_NUM_MOSI,
                    GPIO_NUM_NC,
                    GPIO_NUM_NC,
                    DMA_SIZE,
                    DMA_CHAN);

You can see we've got an error_result_code followed by what looks like a bunch of #defined values (that's exactly what they are) passed to the constructor. For the most part, these represent your GPIO pin numbers for your SPI. Currently, I do not automatically assign them to sane defaults (which differ between ESP32 kinds), so you must specify them. We also declare the maximum size of the DMA transfers here. For graphics, this should be at least as large as your display buffer or the largest bitmap you plan to blit to the screen, whichever is larger, plus 8 bytes. We specify a DMA channel as well, since we'll need DMA for fast transfers. The error_result_code, meanwhile, can be null, or it can be the address of a value that receives the result/status of the SPI initialization.
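To make the sizing rule concrete, here's the arithmetic for a 320x240 RGB565 display. The helper function is mine, for illustration only, not part of the driver:

```cpp
#include <cassert>
#include <cstddef>

// The rule above: the DMA size must cover the largest single transfer,
// plus 8 bytes of transaction overhead. dma_size() is an illustrative
// helper, not part of the driver.
constexpr std::size_t dma_size(std::size_t width,
                               std::size_t lines,
                               std::size_t bytes_per_pixel) {
    return width * lines * bytes_per_pixel + 8;
}

// For the demo later in the article, 16 lines of a 320-pixel-wide
// RGB565 display: 320 * 16 * 2 + 8 = 10248 bytes.
```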

The driver itself is a template, and in order to use it, you first have to fill in all the template parameters and establish a concrete type with it:

C++
using lcd_type = ili9341<LCD_HOST,
                        PIN_NUM_CS,
                        PIN_NUM_DC,
                        PIN_NUM_RST,
                        PIN_NUM_BCKL>;

Here, we've used a bunch of #defined values to set the pin numbers. This will work as long as those values are defined such that they match your wiring on your board. I've excluded some of the parameters that you usually don't need to set, like the buffer size and the timeout. You can see here we've aliased this declaration as lcd_type. You'll want an alias since we need to refer to the type as we use the driver.

Now it's time to instantiate the driver:

C++
lcd_type lcd;

That's all there is to instantiating it. We don't need any constructor parameters because we handled all the necessary information by passing it as template arguments.

The LCD will not be initialized on construction, because I've had stability problems with doing I/O before app_main() in the globals section. Because of that, it uses lazy initialization such that it is initialized on the first call to a graphics function, or when you call initialize(), which will force an initialization if one didn't already occur.
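The lazy-initialization pattern itself is simple enough to sketch in isolation. This is an illustration of the idea only; the names and members here are mine, not the driver's actual internals:

```cpp
#include <cassert>

// Illustration of lazy initialization: nothing touches the hardware
// until the first call that needs it.
struct lazy_driver_sketch {
    bool m_initialized = false;
    int init_count = 0;  // counts how many times the real init ran

    // Idempotent: the real hardware bring-up runs at most once, so
    // every entry point can call this unconditionally.
    bool initialize() {
        if (!m_initialized) {
            // ... hardware I/O would happen here, safely after app_main() ...
            ++init_count;
            m_initialized = true;
        }
        return true;
    }

    // Every drawing call starts by making sure the device is up.
    bool pixel_write_stub() {
        return initialize();
    }
};
```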

First up on the class itself, there are several static members that get things like the width and height of the display, as well as providing access to the values of all the template arguments that were passed in.

As I mentioned, there are two layers of driver calls on the ili9341<> template class. First, we have the low level native layer, and then we have the GFX bindings that use that layer and allow the device to be drawn to by GFX. This is on a single class for simplicity and expediency. The driver isn't designed to be used directly, but it can be. Consequently, it's not designed to have a clean API footprint, but rather, a functional one with a full set of features but a minimum of fluff.

The Native Layer

Since this isn't an article about GFX itself, we're going to start with the native layer.

The Basics

Keep in mind that the native layer is driver specific. Other drivers may expose their functionality in an entirely different manner. That works because GFX only cares about GFX bindings.

As far as the native layer goes, we again have a couple of basic conceptual operations: we can write a pixel to the display, or we can write bitmap data to the frame buffer. Simple enough, right? Right.

First up, we have pixel_write(x,y,pixel_color), allowing you to write a single pixel to the display.

Next, we have frame_write(x1,y1,x2,y2,bmp_data), allowing you to write an in-memory bitmap to a portion of the display.

C++
for(uint16_t y=0;y<lcd_type::height;++y) {
    for(uint16_t x=0;x<lcd_type::width;++x) {
        // alternate white and black
        uint16_t v=0xFFFF*((x+y)%2);
        if(lcd_type::result::success!=lcd.pixel_write(x,y,v)) {
            printf("write pixel failed\r\n");
            y=lcd_type::height;
            return;
        }
    }
}

Above, we write a black and white hatched/dithered pattern across the entire display by alternating between the two colors. Easy enough, and so slow it will make you want to get out and push. It's slow because it floods the SPI bus. Flooding the bus can trigger a watchdog timeout, although in my experience it doesn't poke it hard enough to reboot the machine - it just lodges a complaint over the serial port.

There's got to be a better way to fill the screen than this. Fortunately, there is.

Batching

Enter batching. As I said, writing one pixel at a time generates a lot of SPI traffic, and the I/O causes significant overhead. It's fine to use pixel_write() when you have to, but there are much more efficient ways to write in many cases. The primary one is batching. Batching works by setting a rectangular target area and then writing pixel colors out left to right, top to bottom until the target area is filled (for example, an 8x8 area requires 64 pixels to be written). Since the output is buffered, when you're done you commit any remaining data.

Let's do the code from above, better, faster!

C++
// batched, much faster and more reliable
lcd.batch_write_begin(0,0,lcd_type::width-1,lcd_type::height-1);
for(uint16_t y=0;y<lcd_type::height;++y) {
    for(uint16_t x=0;x<lcd_type::width;++x) {
        uint16_t v=0xFFFF*((x+y)%2);
        if(lcd_type::result::success!=lcd.batch_write(&v,1)) {
            printf("write pixel failed\r\n");
            y=lcd_type::height;
            return;
        }
    }
}
lcd.batch_write_commit();

The relevant changes are the first and last lines, which establish the scope of our batch operation, and the call to batch_write() inside it, which takes two parameters. The first is an array of one or more pixel values, and the second is the count of values in that array. Since it's C++, we can simply pass the scalar value's address and treat it as a length-1 array. I love C++. I should note that passing multiple values at once isn't that much faster, since underneath all of this the batch writes are buffered anyway, but it does save the minuscule overhead of calling batch_write() as many times as you otherwise would, so you can use it if you want.

Make sure you commit your batches! Almost all operations will already commit pending batches when they have to before starting whatever it is they're doing, but it's not always necessary and these routines are often as lazy as possible, so if you don't commit, then your final data may collect dust in the buffer until another operation finally requires it to be written to the display. When this happens depends on several factors, including how many SPI transactions the display is set up for.
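To illustrate why uncommitted data can linger, here's a toy model of a batch buffer (names are illustrative, not the driver's internals): writes auto-flush only when the buffer fills, so a final partial buffer sits there until commit() is called:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Toy model of a batch buffer: writes auto-flush only when the buffer
// is full; a trailing partial buffer waits for an explicit commit().
struct batch_buffer_sketch {
    std::vector<uint16_t> buf;
    std::size_t capacity;
    int flushes = 0;  // stands in for "data actually sent to the display"

    explicit batch_buffer_sketch(std::size_t cap) : capacity(cap) {
        buf.reserve(cap);
    }
    void flush() {
        buf.clear();
        ++flushes;
    }
    void write(uint16_t px) {
        buf.push_back(px);
        if (buf.size() == capacity) flush();  // only flushes when full
    }
    void commit() {
        if (!buf.empty()) flush();  // sends whatever is left over
    }
};
```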

Asynchronous Queuing

If you need to, you can sometimes squeeze even more speed out of your app using asynchronously queued driver operations. These aren't actually "faster" than the unqueued variety. In fact, they carry a little more overhead. The advantage is that they return sooner - sometimes immediately - and process the remaining work in the background. This frees up your code to calculate the next frame while the operation completes. The SPI host itself continues the transfers using DMA memory in the background. All of the above operations have queued counterparts, although queued_pixel_write() is not very useful in practice. It was provided for completeness and consistency.

Below is an example of what not to do. It is the same code as the above, except using queued calls:

C++
// queued/asynchronous batching - SLOWER than the above
lcd.queued_batch_write_begin(0,0,lcd_type::width-1,lcd_type::height-1);
for(uint16_t y=0;y<lcd_type::height;++y) {
    for(uint16_t x=0;x<lcd_type::width;++x) {
        // here's the reason this is no good:
        // we're simply not doing enough work
        // below to make it worthwhile. The
        // only way queued operations are
        // worth it is if your work between
        // driver calls is more than the 
        // additional overhead incurred by
        // queuing. Since all we're doing
        // below is some very basic math,
        // this doesn't pay for itself
        uint16_t v=0xFFFF*((x+y)%2);
        if(lcd_type::result::success!=lcd.queued_batch_write(&v,1)) {
            printf("write pixel failed\r\n");
            y=lcd_type::height;
            return;
        }           
    }
}
lcd.queued_batch_write_commit();

Let me see if I can explain the comments in the middle a little more. What's happening here is that we're underutilizing our asynchronicity. In order for there to be a net performance win, your CPU must be kept busy for as much of the time as possible between queued operations, because those operations happen in the background. Think about it: you want the backgrounded/queued operation to be running at the same time your CPU is doing a lot of work. That way, both things are "spinning" as fast as possible and everything runs in parallel. The above isn't very parallel. The CPU isn't being used enough between calls for it to be worth the extra overhead of queuing background work. That's the bottom line.

The queuing features work best for large transfers. The more data you're sending, the better. For batching operations, this is dictated by the BufferSize template parameter/buffer_size member. A bigger value makes for more efficient batching transfers but takes more memory. The default is 64 bytes, which is on the small end of reasonable. Don't mess with it until you profile. For queued frame writes however, the buffer_size is irrelevant. The "buffer" is actually the bitmap data you passed in, so it will transfer that. The bitmap data size limit is dictated by the DMA transfer size specified in the SPI master (spi_master) host.

The reason larger transfers are better is your CPU gets more breathing room as the bus takes longer to send more data. So for example if your bus took 100ns to send some data, that's 100ns you have free to do what you want while the data is sending. The less time the bus transfer takes to complete, the more frequently the CPU has to step in to queue more data, increasing the overhead. I hope that makes sense.
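A quick back-of-the-envelope illustration, assuming a 26 MHz SPI clock (a common ILI9341 write speed; your configuration may differ):

```cpp
#include <cassert>
#include <cstdint>

// Time a transfer occupies the bus, in microseconds, assuming the
// clock rate is the only limiting factor (roughly true for SPI).
constexpr double transfer_us(uint32_t bytes, double clock_hz) {
    return bytes * 8.0 / clock_hz * 1e6;
}

// At 26 MHz, one RGB565 pixel (2 bytes) holds the bus for under a
// microsecond, while a 16-line, 10240-byte block holds it for roughly
// 3 ms - thousands of times more free CPU time per transaction setup.
```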

Don't worry: while I've shown you how not to do asynchronously queued operations, before the end of the article, we'll explore the right way to do it.

The Native API

Here is a brief description of the API calls for the native driver functionality.

  • host_id - The identifier for the SPI host used by the driver
  • pin_cs - The CS pin
  • pin_dc - The DC pin
  • pin_rst - The RST pin
  • pin_backlight - The BKL pin
  • buffer_size - The buffer size for batching operations
  • width - The width of the display in pixels
  • height - The height of the display in pixels
  • max_transactions - The number of transactions allowable by the driver
  • timeout - The timeout for queued operations
  • result - An enumeration of error codes reported by the driver
  • initialized() - Indicates if the driver has been initialized
  • initialize() - Forces the driver to initialize itself. Normally the driver waits until it is first used.
  • frame_write() - Writes bitmap data to part of the frame buffer
  • queued_frame_write() - An asynchronously queued version of the above
  • frame_fill() - Fills part of the frame buffer with a color
  • queued_frame_fill() - An asynchronously queued version of the above
  • batch_write_begin() - Begins a batch write operation for the target frame buffer window
  • queued_batch_write_begin() - An asynchronously queued version of the above
  • batch_write() - Writes one or more pixels to the batch
  • queued_batch_write() - An asynchronously queued version of the above
  • batch_write_commit() - Sends any remaining data in the batch buffer
  • queued_batch_write_commit() - An asynchronously queued version of the above
  • pixel_write() - Plots a pixel to the frame buffer
  • queued_pixel_write() - An asynchronously queued version of the above
  • queued_wait() - Waits for all pending queued operations to complete

The GFX Binding Layer

The native layer is fine, but GFX has no idea what to do with it. In order for GFX to be able to draw to the screen, the driver must expose its functionality in such a way that GFX can use it. To that end, this driver has a GFX layer exposed on the same class, which delegates to the native layer to do most of the heavy lifting.

GFX was designed to be able to use almost any driver, even ones that don't have all of the features of this one. Therefore, a GFX enabled driver exposes a member that tells GFX what features the driver supports. GFX will then use that information to decide how to call the driver.

Beyond that the GFX driver exposes similar functionality as the underlying driver because that just makes sense. GFX supports batching, and asynchronous operations, as well as pixel and frame writes as long as the underlying driver supports it, but it exposes them a little bit differently. For one thing, it uses GFX types like rect16 and gfx_result to take parameters and report result codes, so that GFX can use it in a generalized way.

Let's cover the bindings now:

  • type - Indicates the type itself. This isn't actually used by GFX, but it's exposed by convention
  • pixel_type - indicates the type of pixel this driver supports. Note that it's fixed at compile time. This is because in order to efficiently handle different pixel formats, code must be generated to deal with specific formats rather than computing those operations at runtime. The bottom line is it's either compile time, or slowing graphics down considerably.
  • caps - This reports the capabilities of the driver that I touched on earlier. It can be queried to figure out which features the driver supports.
  • dimensions() - indicates the dimensions of the display
  • bounds() - indicates the bounding rectangle for the display
  • point() - writes a pixel at the specified location
  • point_async() - an asynchronous version of the above
  • fill() - fills a portion of the display with the specified pixel
  • fill_async() - an asynchronous version of the above
  • clear() - zeroes a portion of the display. For this driver, it's equivalent to filling a region with black, but some pixel formats are not black when zeroed. If this driver used such a format, the result would not be black. For some drivers, this operation is faster than a fill, but not for this one.
  • begin_batch() - begins a batch operation to the target rectangle
  • begin_batch_async() - an asynchronous version of the above
  • write_batch() - writes a pixel to the current batching operation
  • write_batch_async() - asynchronous version of the above
  • commit_batch() - commits any unwritten batched data
  • commit_batch_async() - asynchronous version of the above
  • write_frame<>() - writes source data to the frame buffer
  • write_frame_async<>() - asynchronous version of the above

As I've said, for the most part, these are just thin wrappers that delegate to the native layer. However, the exceptions are the template methods write_frame<>() and write_frame_async<>() which write source data to a part of the frame buffer. The concept of a "source" in GFX is loose, and it means roughly, something that supports GFX bindings for reading. Theoretically this source could be a bitmap or another display driver so long as that driver supported reading - this one does not. Due to the loose nature of a "source", its type must be taken as a template argument and the methods must be templatized.

Coding this Mess

Now we get into how it works.

The Demo

Here's a video of the demo output. As you can see, it is doing fluid full screen animation in real time. This is possible because of the asynchronous queuing which allows us to process the next frame while we're still sending the previous frame over the SPI bus. You can do it synchronously with a simple modification of the code and see the difference. The asynchronous queuing gives us a heck of a throughput increase in this case, but it's in large part because of how much data we're sending. As I said before, the more data you send the better, and here we're sending exactly 320x16 pixels or exactly 10kB in each transfer, which is actually quite a lot for this little device. While that 10kB transfer is going on we can work on computing the next 16 lines, which we'll send to the display as well in the next iteration.

It should be noted that I shamelessly lifted the effect here and idea for the demo from a public domain offering by Espressif for a demo that ships burned into the ESP-WROVER-KIT's firmware when you unbox it. However, that code was in C, not C++ and didn't use my driver or my GFX library, so I adapted it. What's left is just the core effect and the concept of queuing 16 lines at a time while computing that effect. Everything else, except where otherwise noted, is my code.

I won't be covering the entire demo here because it's a few files. I'll cover the important stuff.

First, the initialization, which we do in the global scope - I always put my devices in the global scope since hardware is effectively global as a concept:

C++
// To speed up transfers, every SPI transfer sends as much data as possible. 
// This define specifies how much. More means more memory use, but less 
// overhead for setting up / finishing transfers. Must evenly divide 240.
#define PARALLEL_LINES 16 // max in this case is 24 before we run out of RAM

// configure the spi bus. Must be done before the driver
spi_master spi_host(nullptr,
                    LCD_HOST,
                    PIN_NUM_CLK,
                    PIN_NUM_MISO,
                    PIN_NUM_MOSI,
                    GPIO_NUM_NC,
                    GPIO_NUM_NC,
                    PARALLEL_LINES*320*2+8,
                    DMA_CHAN);

// alias our driver config
using lcd_type = ili9341<LCD_HOST,
                        PIN_NUM_CS,
                        PIN_NUM_DC,
                        PIN_NUM_RST,
                        PIN_NUM_BCKL>;

// instantiate the driver:
lcd_type lcd;

Here's the meat of the demo. It spins a loop and alternates between two buffers: it works on one while transferring the other, then switches, working on the second while transferring the first, back and forth until all the lines are processed, and then it does it all again, forever:

C++
// Simple routine to generate some patterns and send them to
// the LCD. Don't expect anything too impressive. Because the
// SPI driver handles transactions in the background, we can
// calculate the next line while the previous one is being sent.
static void display_pretty_colors()
{
    uint16_t *lines[2];
    //Allocate memory for the pixel buffers
    for (int i=0; i<2; i++) {
        lines[i]=(uint16_t*)heap_caps_malloc(
        320*PARALLEL_LINES*sizeof(uint16_t), MALLOC_CAP_DMA);
        assert(lines[i]!=NULL);
    }
    int frame=0;
    //Indexes of the line currently being sent to the LCD and the line we're calculating.
    int sending_line=-1;
    int calc_line=0;

    while(true) {
        ++frame;
        for (int y=0; y<240; y+=PARALLEL_LINES) {
            //Calculate a line.
            pretty_effect_calc_lines(lines[calc_line], y, frame, PARALLEL_LINES);
            //Finish up the sending process of the previous line, if any
            //if (sending_line!=-1) lcd.queued_wait();//send_line_finish(spi);
            //Swap sending_line and calc_line
            sending_line=calc_line;
            calc_line=(calc_line==1)?0:1;
            //Send the line we currently calculated.
            // queued_frame_write works better the larger the transfer size.
            lcd.queued_frame_write(0,
                y,
                lcd_type::width-1,
                y+PARALLEL_LINES-1,
                (uint8_t*)lines[sending_line]);
            //The line set is queued up for sending now; the actual sending happens in the
            //background. We can go on to calculate the next line set as long as we do not
            //touch line[sending_line]; the SPI sending process is still reading from that.
        }
    }
}

Now let's move to the actual ili9341.hpp header and explore the class therein.

SPI Management

The SPI can be kind of tricky. There are some rules to working with SPI under the ESP32:

  1. SPI devices essentially must always be accessed from the same thread
  2. SPI polling transactions are more efficient than queued operations but they block
  3. The SPI device cannot execute polling transactions while there are pending queued transactions
  4. The SPI transaction structures and memory must be kept around for the duration of the transaction
  5. There must be one completion call for every queued transaction
  6. There cannot be more transactions than there are max_transactions.

Number 1 is simple. The driver is not thread safe, so nothing needs to be done.

Number 2 is important because it means we shouldn't always use queued transactions.

Number 3 can be dicey. What we do is make sure we've committed any pending asynchronous operations before we start a polling operation. This is handled inside send_transaction(), although in certain cases, we commit manually.

Number 4 is interesting. While it's the caller's responsibility to hold their data around, the transaction structures themselves must also be held onto. We solve this by allocating as many transaction structures as there are max_transactions. These are in an array class member called m_trans. Whenever we begin a new transaction we use the next free slot in the array, wrapping around if we hit the length. This creates a round-robin sequential scheduling scheme where the oldest structure becomes the next structure. Since a transaction must be finished before a new transaction is started if we are at max_transactions, we will always have an available slot in the array. It turns out this was easy. This is handled by send_next_cmd() and send_next_data().
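Here's a toy model of that round-robin scheme combined with rule 6 (all names are illustrative, not the driver's actual members): slots are handed out modulo max_transactions, and when the pool is exhausted we first retire the oldest transaction:

```cpp
#include <cassert>
#include <cstddef>

// Toy model of rules 4 and 6 together: a fixed pool of transaction
// slots handed out round-robin, with a forced wait when the pool is
// exhausted.
struct trans_pool_sketch {
    static constexpr std::size_t max_transactions = 7;
    std::size_t next = 0;       // next slot to hand out
    std::size_t in_flight = 0;  // queued transactions not yet completed
    int waits = 0;              // how often we had to block on the oldest

    std::size_t acquire() {
        if (in_flight == max_transactions) {
            // Rule 6: no free slot, so wait for the oldest transaction
            // to complete (modeled here as decrementing the count).
            --in_flight;
            ++waits;
        }
        std::size_t slot = next;
        next = (next + 1) % max_transactions;  // wrap: oldest slot is reused
        ++in_flight;
        return slot;
    }
};
```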

Number 5 dictates that we force you to call a completion method (queued_wait()) or otherwise have some way of managing transactions such that they are completed when finished. That latter bit is actually handled by the same mechanism that handles number 6.

Number 6 is handled by keeping track of the number of queued transactions that are running. When that number is at max_transactions and we need another transaction, we first wait on a previous transaction in order to free up a slot. This, too, is handled in send_transaction().

All of this delegates to the spi_device class which handles lower level I/O.

Let's take a look at the code.

C++
result send_transaction(spi_transaction_t* trans,bool queued,bool skip_batch_commit=false) {
    // initialize the display if necessary
    result r = initialize();
    if(result::success!=r)
        return r;
    spi_result rr;
    spi_transaction_t tmp;
    // if we're not queuing but there are queued transactions currently
    // we have to flush everything:
    bool batch_committed=false;
    if(!queued && 0!=m_queued_transactions) {
        // commit the batch if we have to
        if(!skip_batch_commit && 0!=m_batch_left) {
            r=commit_batch_internal(&tmp,true);
            if(result::success!=r)
                return r;
            batch_committed=true;
            // wait for everything to complete
            if(0!=m_queued_transactions)
                r= queued_wait();
        } else {
            // wait for everything to complete
            r=queued_wait();
        }
        if(result::success!=r)
            return r;
        
    } 
    // commit the batch if necessary and we haven't already
    if(!batch_committed && !skip_batch_commit&&0!=m_batch_left) {
        if(!queued) {
            r=commit_batch_internal(&tmp,false);
            if(result::success!=r)
                return r;
        } else {
            // HACK: We can't use tmp here because
            // the transaction won't complete immediately
            // so what we have to do is forcibly open
            // a new slot in m_trans, move the current
            // *trans to the new slot, and then replace
            // *the current slot* with the batch commit
            r=ensure_free_transaction();
            if(result::success!=r)
                return r;
            size_t next_free = (m_queued_transactions+1)%max_transactions;
            memcpy(&m_trans[next_free],trans,sizeof(spi_transaction_t));
            r=commit_batch_internal(trans,true);
            if(result::success!=r) {
                return r;
            }
            trans = &m_trans[next_free];
            m_batch_left=0;
        }
        m_batch_left=0;
        batch_committed=true;
    }
    // now actually send the transaction
    if(queued) {
        r=ensure_free_transaction();
        if(result::success!=r)
            return r;
        rr = m_spi.queue_transaction(trans,timeout);
    } else {
        rr = m_spi.transaction(trans,true);
    }
    if(spi_result::success!=rr) {
        return xlt_err(rr);
    }
    if(queued)
        ++m_queued_transactions;
    return result::success;
}

You can see this routine is rather complicated, but it's responsible for several things. It handles lazy initialization of the display, it manages the queued transactions, and it automatically commits the uncommitted batch, if any.

The nasty part is the hack section. Basically, we need to keep the transaction around for the duration of the queued operation, so we can't use tmp, which lives on the stack. Normally, we don't pick the next m_trans[x] slot in this routine - it's already been picked for us by the time we're called from send_next_command() or send_next_data(). Here, however, we have to recycle that slot to store our batch commit transaction, then claim the next free slot and fill it with trans. Essentially, we're queuing two transactions rather than one in this case.

When queuing, this routine uses a FIFO scheme that always keeps a free transaction slot at the ready: as you queue a new transaction, the oldest one is retired if necessary to free up a slot. That way, your transactions flow smoothly from one to the next rather than all of them stalling once the queue fills. However, if you currently have queued transactions waiting to finish and you want to do a non-queued operation, the routine waits for all queued transactions to finish before executing the transaction, in order to satisfy rule #3.
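That FIFO policy can be modeled in a few lines. The sketch below is illustrative only: the names max_transactions and m_queued_transactions mirror the driver's members, but retiring the oldest transaction is reduced to a counter decrement standing in for waiting on the ESP-IDF SPI queue.

C++
```cpp
#include <cassert>
#include <cstddef>

// Toy model of the "always keep a free slot" FIFO policy. Retiring the
// oldest transaction is modeled as a counter decrement; the real driver
// waits on the SPI queue for the oldest result instead.
struct queue_model {
    static const size_t max_transactions = 7; // hypothetical queue depth
    size_t m_queued_transactions = 0;
    size_t retired = 0; // how many old transactions we had to retire

    // before queuing, retire the oldest transaction if the queue is full,
    // so there's always a free slot at the ready
    void ensure_free_transaction() {
        if (m_queued_transactions + 1 >= max_transactions) {
            --m_queued_transactions;
            ++retired;
        }
    }
    void queue_one() {
        ensure_free_transaction();
        ++m_queued_transactions;
    }
};
```

Queuing ten transactions against a hypothetical depth of seven retires four old ones along the way, one per new transaction once the queue is primed.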

All of this is reached by delegation from send_next_command() and send_next_data(), which simply choose a free slot and prepare the next transaction. They also set the transaction's user defined data field to 0 for commands and 1 for data. That signals our pre-transaction callback to set the DC line low (0) or high (1), because the display uses the DC line to distinguish commands from data. It's unfortunate, really, because if it weren't necessary to control the DC line, we could pack several commands into one transaction for maximum efficiency. Too bad, but I try to make up for some of the difference by managing the transactions as efficiently as possible.
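As a sketch of how the user field drives the DC line, here's a minimal stand-in for the pre-transaction callback. The DC GPIO is modeled as a plain variable and fake_transaction stands in for the ESP-IDF spi_transaction_t; the real callback would set the level of the configured DC pin instead.

C++
```cpp
#include <cassert>
#include <cstdint>

static int dc_line = -1; // models the level of the DC GPIO

// stand-in for spi_transaction_t; only the user field matters here
struct fake_transaction {
    void* user; // set to 0 for commands, 1 for data, as described above
};

// analogous to the driver's SPI pre-transaction callback: drive the DC
// line from the transaction's user-defined field before the bytes go out
void pre_transaction_cb(fake_transaction* t) {
    dc_line = (int)(intptr_t)t->user;
}
```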

This is basically the heart of our actual SPI communication layer for the driver.

The only exception to this is the initialization, which uses raw SPI writes to spi_device to load the static const array of commands at the bottom of the source file into the display. Initialization is never queued - it's always synchronous, using the faster polling transactions that block, favoring speed over throughput for that phase.
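The command table idea looks something like the following sketch. The struct layout and the specific entries are assumptions for illustration (the 0xFF terminator is a common convention in ILI9341 init code), not a copy of the driver's actual table.

C++
```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// hypothetical init table entry: one command byte plus its parameter bytes
struct init_cmd {
    uint8_t cmd;
    uint8_t data[16];
    uint8_t databytes; // parameter count; 0xFF marks the end of the table
};

static const init_cmd ili_init[] = {
    {0x3A, {0x55}, 1},   // COLMOD: 16 bits per pixel
    {0x36, {0x40}, 1},   // MADCTL: memory access control
    {0x29, {0}, 0},      // display on
    {0, {0}, 0xFF},      // terminator
};

// walk the table, returning how many commands would be sent; the real init
// loop issues a blocking polling transaction for each command and data run
size_t run_init_table(const init_cmd* t) {
    size_t sent = 0;
    while (t->databytes != 0xFF) {
        ++sent; // would send t->cmd, then t->databytes parameter bytes
        ++t;
    }
    return sent;
}
```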

Primary LCD Operations

On top of the communication layer sits our display command layer, and it's here where we translate method calls for manipulating the display into SPI transactions on our communication layer. This is the Native Layer we covered earlier.

One thing you'll notice is a lot of delegation to private XXXXX_impl() methods. Those internal methods hold the meat of an implementation shared between the synchronous and queued operations, which we expose as two separate public methods, for example batch_write() and queued_batch_write(), rather than giving the public methods a bool queued parameter like the internal routines have.
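The pattern looks roughly like this sketch, with a stand-in result enum. The public names batch_write() and queued_batch_write() come from the text, while the _impl() body here is just a placeholder.

C++
```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

enum struct result { success, io_error };

struct driver_model {
    bool last_was_queued = false;
    // private in the real driver: the shared meat of both public methods
    result batch_write_impl(const uint16_t* pixels, size_t count, bool queued) {
        (void)pixels; (void)count;
        last_was_queued = queued; // placeholder for buffering and sending
        return result::success;
    }
    // public pair: no bool parameter leaks into the public interface
    result batch_write(const uint16_t* pixels, size_t count) {
        return batch_write_impl(pixels, count, false);
    }
    result queued_batch_write(const uint16_t* pixels, size_t count) {
        return batch_write_impl(pixels, count, true);
    }
};
```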

There's also an oddity where we call the batch_write_begin() method(s) from routines like frame_write() and queued_frame_fill() without actually performing a batch or committing one. The reason is that all the batch begin routines do is commit any pending batch, set the address window, and switch the display to write mode. We repurpose them to do exactly that, and then, rather than performing a batch, we take over from there and essentially hijack the rest of it, bypassing it and replacing it with our own writes. I could have made a separate, more appropriately named routine and delegated to it, but that would have simply increased the size of the source without providing much benefit.

Here's queued_frame_write():

C++
// queues a frame write operation. The bitmap data must be valid 
// for the duration of the operation (until queued_wait())
result queued_frame_write(uint16_t x1,
                        uint16_t y1, 
                        uint16_t x2, 
                        uint16_t y2,
                        uint8_t* bmp_data,
                        bool preflush=false) {
    // normalize values
    uint16_t tmp;
    if(x1>x2) {
        tmp=x1;
        x1=x2;
        x2=tmp;
    }
    if(y1>y2) {
        tmp=y1;
        y1=y2;
        y2=tmp;
    }
    if(x1>=width || y1>=height)
        return result::success;
    result r;
    if(preflush) {
        // flush any pending batches or 
        // transactions if necessary:
        r=batch_write_commit_impl(true);
        if(result::success!=r) {
            return r;
        }
        r=queued_wait();
        if(result::success!=r)
            return r;
    }
    // set the address window - we don't actually do a batch
    // here, but we use this for our own purposes
    r=batch_write_begin_impl(x1,y1,x2,y2,true);
    if(result::success!=r)
        return r;
    
    r=send_next_data(bmp_data,
                    (x2-x1+1)*(y2-y1+1)*2,
                    true);
    
    // When we are here, the SPI driver is busy (in the background) 
    // getting the transactions sent. That happens mostly using DMA, 
    // so the CPU doesn't have much to do here. We're not going to 
    // wait for the transaction to finish because we may as well spend
    // the time doing something else. When that is done, we can call
    // queued_wait(), which will wait for the transfers to be done.
    // otherwise, the transactions will be queued as the old ones finish
    return r;  
}

That's actually pretty straightforward. Preflush lets us commit any pending batch and then drain all pending transactions before starting this one. Sometimes that can be helpful, depending on the circumstances, so that you start with a fresh, empty transaction queue before writing. Keep in mind that one frame write takes 6 transactions, because we toggle the DC line low and high between command and data sends, as the display device demands.

Batching is a bit more interesting, because this class controls an internal buffer that gets sent every time it fills. By default, the buffer is only 32 pixels long, which is surprisingly reasonable for standard use cases like drawing your application's screen. It might not be as effective for real-time animation, but you can always increase the buffer size to lower your bus overhead. For intensive animation, asynchronously queued frame writes are the better option anyway, so the defaults are geared more toward the standard application UI use case.
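To put numbers on that tradeoff, here's some back-of-envelope arithmetic. The five setup transactions come from counting the sends in batch_write_begin_impl() (three commands plus two four-byte data writes); the helper itself is illustrative and not part of the driver.

C++
```cpp
#include <cassert>
#include <cstddef>

// transactions needed to batch-fill a region: 5 for the address window and
// memory-write command, plus one data transaction per buffer flush
constexpr size_t batch_transactions(size_t pixels, size_t buffer_pixels) {
    return 5 + (pixels + buffer_pixels - 1) / buffer_pixels;
}
```

Filling a 320x240 screen with the default 32-pixel buffer costs 2,405 transactions; growing the buffer to 256 pixels drops that to 305, which is why bumping the buffer size lowers bus overhead for batching-heavy code.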

Anyway, first things first, when we begin a batch, we have to set the address window and turn write mode on:

C++
result batch_write_begin_impl(uint16_t x1,
                            uint16_t y1,
                            uint16_t x2,
                            uint16_t y2,
                            bool queued) {
    // normalize values
    uint16_t tmp;
    if(x1>x2) {
        tmp=x1;
        x1=x2;
        x2=tmp;
    }
    if(y1>y2) {
        tmp=y1;
        y1=y2;
        y2=tmp;
    }
    //Column Address Set
    result r=send_next_command(0x2A,queued);
    if(result::success!=r)
        return r;
    uint8_t tx_data[4];
    tx_data[0]=x1>>8;             //Start Col High
    tx_data[1]=x1&0xFF;           //Start Col Low
    tx_data[2]=x2>>8;             //End Col High
    tx_data[3]=x2&0xff;           //End Col Low
    r=send_next_data(tx_data,4,queued,true);
    if(result::success!=r)
        return r;
    //Page address set
    r=send_next_command(0x2B,queued,true);
    if(result::success!=r)
        return r;
    tx_data[0]=y1>>8;        //Start page high
    tx_data[1]=y1&0xff;      //start page low
    tx_data[2]=y2>>8;        //end page high
    tx_data[3]=y2&0xff;      //end page low
    r=send_next_data(tx_data,4,queued,true);
    // Memory write
    return send_next_command(0x2C,queued,true);
}

There's nothing to it other than a few commands. At this point, we haven't done any bookkeeping for the actual batch operation, aside from committing any pending batch that was already present (via the first send_next_command() call).

When we do a batch write, that's when it gets fun:

C++
result batch_write_impl(const uint16_t* pixels,
                        size_t count,
                        bool queued) {
    if(!m_initialized)
        return result::io_error;
    result r;
    size_t index = m_batch_left;
    if(index==buffer_size/2) {
        r=send_next_data(m_buffer,buffer_size,queued,true);
        if(result::success!=r) {
            return r;
        }
        m_batch_left=0;
        index = 0;
    }
    uint16_t* p=((uint16_t*)m_buffer)+index;
    while(0<count) {    
        *p=*pixels;
        --count;
        ++m_batch_left;
        ++pixels;
        ++p;
        if(m_batch_left==(buffer_size/2)) {
            r=send_next_data(m_buffer,buffer_size,queued,true);
            if(result::success!=r)
                return r;
            p=(uint16_t*)m_buffer;
            m_batch_left=0;
        }
    }
    return result::success;
}

What we're doing here is copying the incoming pixels into the buffer at the current index, and every time the buffer fills, we send it.

I'd show you the batch write commit, but it's hardly worth it: all it does is send any remaining data in the buffer as a final transaction.
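Since the commit is that simple, a self-contained model of the whole buffer cycle fits in a few lines. "Sent" transactions are collected into a vector instead of going over SPI; the member names mirror the driver's, but the bodies are reconstructions for illustration only.

C++
```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct batch_model {
    static const size_t buffer_size = 64; // 32 pixels, as in the driver
    uint8_t m_buffer[buffer_size];
    size_t m_batch_left = 0;              // pixels waiting in the buffer
    std::vector<size_t> sent;             // byte counts of "flushed" transactions

    void send_next_data(const uint8_t*, size_t size) { sent.push_back(size); }

    // copy incoming pixels into the buffer, flushing whenever it fills
    void batch_write(const uint16_t* pixels, size_t count) {
        uint16_t* p = (uint16_t*)m_buffer + m_batch_left;
        while (count--) {
            *p++ = *pixels++;
            if (++m_batch_left == buffer_size / 2) {
                send_next_data(m_buffer, buffer_size);
                p = (uint16_t*)m_buffer;
                m_batch_left = 0;
            }
        }
    }
    // commit: send whatever remains as one final, smaller transaction
    void batch_write_commit() {
        if (m_batch_left) {
            send_next_data(m_buffer, m_batch_left * 2);
            m_batch_left = 0;
        }
    }
};
```

Writing 40 pixels and then committing produces exactly two transactions: one full 64-byte flush and a final 16-byte remainder.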

Let's see it in action with frame_fill()'s implementation:

C++
result frame_fill_impl(uint16_t x1,
                    uint16_t y1, 
                    uint16_t x2,
                    uint16_t y2,
                    uint16_t color,
                    bool queued) {
    // normalize values
    uint16_t tmp;
    if(x1>x2) {
        tmp=x1;
        x1=x2;
        x2=tmp;
    }
    if(y1>y2) {
        tmp=y1;
        y1=y2;
        y2=tmp;
    }
    uint16_t w = x2-x1+1;
    uint16_t h = y2-y1+1;
    result r=batch_write_begin_impl(x1,y1,x2,y2,queued);
    if(result::success!=r)
        return r;
    size_t pc=w*h;
    while(pc>0) {
        r=batch_write_impl(&color,1,queued);
        if(result::success!=r)
            return r;
        --pc;
    }
    r=batch_write_commit_impl(queued);
    return r;           
}

The only reason we call the _impl() methods directly is that they take a bool parameter for queued, same as this routine, so it's less code.

Last, and certainly least, we have the unbatched write of a pixel. This should be used as a last resort, since it generates a lot of bus traffic - enough to tickle the task watchdog timer and barf to the serial port if you do it a lot.

C++
result pixel_write_impl(uint16_t x,
                        uint16_t y,
                        uint16_t color,
                        bool queued) {
    // check values
    if(x>=width || y>=height)
        return result::success;
    
    // set the address window. we're not
    // actually batching here.
    result r=batch_write_begin_impl(x,y,x,y,queued);
    if(result::success!=r)
        return r;
    return send_next_data((uint8_t*)&color,2,queued);
}

As you can see, it incurs the same overhead as an entire frame write! The device simply does not offer an abbreviated command for getting or setting a single pixel.
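To quantify "last resort", consider the transaction counts from the text: six transactions for a frame write, and the same six for a single unbatched pixel. These helpers are illustrative arithmetic only (the 32-pixel buffer figure is the driver's default).

C++
```cpp
#include <cassert>
#include <cstddef>

// six transactions per pixel: the full address-window setup plus one data
// write, exactly like a 1x1 frame write
constexpr size_t pixel_by_pixel_transactions(size_t pixels) {
    return pixels * 6;
}

// the batched equivalent: 5 setup transactions plus one flush per 32 pixels
constexpr size_t batched_transactions(size_t pixels) {
    return 5 + (pixels + 31) / 32;
}
```

Filling a 100x100 region pixel-by-pixel costs 60,000 transactions; a single batched fill of the same region needs just 318.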

GFX Bindings

I'll be covering the GFX bindings for drivers in a future article when I cover GFX driver integration as part of the GFX library article series.

Points of Interest

The ST7789V display is nearly identical in terms of the commands it accepts, so I'll be expanding this driver to be able to control both.

History

  • 6th May, 2021 - Initial submission

License

This article, along with any associated source code and files, is licensed under The MIT License

