Big Stuff on Little Widgets: An Exploration of Doing More With Less

honey the codewitch

Mar 25, 2022

CPOL

On PC platforms, just getting it done is king. On IoT, the trick is doing it in the first place.

ESP32 WROVER IoT device

Introduction

When you start to do more than basic things on IoT, you can quickly run into a wall. Typically, if you try to attack programming problems in a traditional manner, that wall will be RAM based - meaning you don't have enough of it. RAM is probably the most important factor, because there is no virtual memory. You have what you have and without enough of it, your code simply will not function.

Next you have I/O use. If you require it for, say, updating a display, your code will be dependent on the speed of your display bus. On an IoT device, that means serious latency as a percentage of execution time.

Finally, you have to deal with CPU usage, which while limited, probably offers a bit more "headroom" in terms of the performance of your code. After all, code that runs slowly still runs at least, and the CPU is still much faster than I/O, so there's that.

There is the additional issue of program space, which is also a major factor - without enough of it, your code can't be used on that device. However, we won't be deep diving into that in this article. Generally, avoid overusing things like templates, factor repetitive code into routines, and you should be fine.

Planning Your Project

Know Your Environment

RAM: Whatcha Got?

Before you even begin to design, you should be able to tell me how much available RAM (not total RAM, but available RAM) you have upon your entry point being hit. I'll ask you simply because I'm difficult like that, but more importantly, you need that figure. Once you have that number, keep it around, since it will help you make design decisions. Without it, you may as well code blindfolded. At least, doing that would keep the rest of us entertained.

What's more difficult is knowing how much stack you have. Like total RAM, it varies by platform. Rather than try to find out up front, I've simply gone with a trial and error approach, because that's how the professionals do it, or so I'm told. (I hope there are no professionals reading this.)

I/O: Storage and Peripherals and Drivers, Oh My!

First and foremost, do you have storage available, and what properties does it have? Is it non-volatile? Is it removable or is it reliably available? If the latter, how much working space does it have? How fast is it? Does it come in pink? All of this is important because you may be able to use storage to offload work that would otherwise consume RAM. If certain conditions are met, such as the storage being reliably available, then this is crucial, or at the very least fashionable, to take into account when planning your software.

Secondly, do your I/O or your peripherals require RAM in order to operate? For example, some display drivers may require a frame buffer to hold display data. If so, subtract that from the available RAM you have. You're also going to need to account for bus latency when using peripherals on performance critical code paths. If you don't, your code will be so slow you'll want to get out and push.
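As a rough worked example of that subtraction, here is a sketch that assumes a hypothetical 320x240 display at 16 bits per pixel and an illustrative 150kB of free RAM. Neither figure comes from a particular device, and the function names are made up for this example.

```cpp
#include <cstddef>

// Bytes needed for a full frame buffer, rounding each row up to whole bytes.
constexpr std::size_t frame_buffer_bytes(std::size_t width,
                                         std::size_t height,
                                         std::size_t bits_per_pixel) {
    return ((width * bits_per_pixel + 7) / 8) * height;
}

// True when `needed` fits in `available` and still leaves `reserve` bytes over.
constexpr bool fits(std::size_t available, std::size_t needed, std::size_t reserve) {
    return available >= needed && available - needed >= reserve;
}

// Hypothetical figures for illustration only.
constexpr std::size_t free_ram_at_entry = 150 * 1024;
constexpr std::size_t color_buffer = frame_buffer_bytes(320, 240, 16);
constexpr std::size_t mono_buffer = frame_buffer_bytes(128, 64, 1);
```

With these made-up numbers, the full color buffer comes to 153,600 bytes, which is every last byte of the 150kB, while a 128x64 monochrome buffer costs only 1kB. Figures like that are what push you toward drawing in smaller pieces instead of buffering a whole frame.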

CPU: Is Using it All Even Worth It?

Does your platform have more than one core you plan on utilizing? If so, you need to consider the overhead of multithreaded programming, both in terms of runtime performance as well as development effort and complexity. Without a decent debugger, it's also a great way to drive you mad, if you're planning on going that route anyway. Remember to factor in the stack space for each additional thread and subtract that from your available RAM. Are we having fun yet?

Rather than all of that mess, consider coding using a single core and thread, and save yourself the power consumption, complexity and additional RAM use. You can always use "cooperative multithreading" as long as you plan your code for it. We'll cover that further down.

Efficiency and Power: Slow and Steady Wins the Race

The most efficient line of code is one that never gets executed. It's the fastest code, and perhaps more importantly, it is also the code that uses the least amount of power. Your device may run on batteries. As if you didn't have enough worries, now you have to consider how often you are going to annoy your end user by making them recharge your gadget.

At least half the time, speed isn't your priority - lower power consumption is. Fortunately, as I alluded to above, the two goals typically dovetail with each other. There are areas, however, where you have to make design decisions that trade speed for battery life, like avoiding the second core.

Also, remember to turn off your peripherals, like your Bluetooth radio, when you're not using them.

Pitfalls

Arrays: They Are Not Your Friends

Seriously though, arrays, or really any memory backed buffer takes precious RAM. Hoarding RAM is a healthy attitude to have, so long as the habit doesn't spill over into real life. You should put in considerable effort to avoid non-trivial sized arrays.

You can use techniques like streaming and coroutines to engage in incremental processing rather than having to load an entire buffer at once and process it. We'll be covering those and other techniques as we get further along. For the most part, avoid designing your interfaces to take arrays (or pointers to buffers), or you'll wish you had.

Heap: Avoiding Smashing it Into Little Pieces

So you think you are hot stuff, with your 150kB of free RAM up to this point? I bet you do, and you'll continue to think that until you try to do void* foo = malloc(150*1024); and it fails on you.

Why would it fail if you have that much RAM?

The answer is because in life, nothing worth having comes easy. More specifically, it's because you don't have 150kB of contiguous space available. Instead you might have, say, a 100kB region and a 50kB region available, or two 75kB regions, or even three 50kB regions, but you don't have a single 150kB region. This is due to heap fragmentation, which works like hard drive fragmentation except you can't defragment your heap, for exactly the same reason you can't perform a brain transplant: too much is wired directly into it. Your code holds raw pointers to live allocations, so nothing can be moved.

Normally on a PC platform, you probably don't notice fragmentation because your 32GB machine has heap all day long. You only notice heap fragmentation when you start running out of heap. Well on an IoT device, you're pretty much always "running out" in that you're operating in very low memory conditions. Heap fragmentation is always your enemy now. You have to code defensively to avoid it.
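To make the failure concrete, here is a toy model of a heap as a row of 1kB blocks. It is not a real allocator, and the names are invented for illustration. It only shows how total free space and the largest contiguous free region can differ.

```cpp
#include <cstddef>

// Toy heap made of 1kB blocks; true means "in use". Illustration only.
struct ToyHeap {
    static constexpr std::size_t size_kb = 200;
    bool used[size_kb] = {};

    // First fit: claim the first run of `kb` free blocks.
    // Returns the starting block, or -1 if no contiguous run is big enough.
    int alloc(std::size_t kb) {
        if (kb == 0) return -1;
        std::size_t run = 0;
        for (std::size_t i = 0; i < size_kb; ++i) {
            run = used[i] ? 0 : run + 1;
            if (run == kb) {
                std::size_t start = i + 1 - kb;
                for (std::size_t j = start; j <= i; ++j) used[j] = true;
                return static_cast<int>(start);
            }
        }
        return -1;
    }

    // Marks `kb` blocks starting at `start` as free again.
    void release(int start, std::size_t kb) {
        for (std::size_t j = 0; j < kb; ++j) used[start + j] = false;
    }

    std::size_t total_free() const {
        std::size_t n = 0;
        for (std::size_t i = 0; i < size_kb; ++i) if (!used[i]) ++n;
        return n;
    }

    std::size_t largest_free() const {
        std::size_t best = 0, run = 0;
        for (std::size_t i = 0; i < size_kb; ++i) {
            run = used[i] ? 0 : run + 1;
            if (run > best) best = run;
        }
        return best;
    }
};

// Returns true when 150kB is free in total yet a 150kB request still fails.
bool fragmentation_demo() {
    ToyHeap heap;
    int a = heap.alloc(75);
    int b = heap.alloc(50);   // long-lived allocation stuck in the middle
    int c = heap.alloc(75);
    heap.release(a, 75);
    heap.release(c, 75);
    (void)b;                  // never released; that is the whole problem
    return heap.total_free() == 150 &&
           heap.largest_free() == 75 &&
           heap.alloc(150) == -1;
}
```

The one small allocation that outlives its neighbors splits the free space into two 75kB halves, so the 150kB request has nowhere to go.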

You'll want to avoid lots of little mallocs that stay around for a long time. If you do use heap, it's best to use it briefly and then toss it, unless it's a large allocation. Try to code accordingly.
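One way to code accordingly, sketched here under the assumption that your small allocations can all die at the same time, is a bump arena: make one large allocation up front, carve the little ones out of it, and release everything together. The class name and interface are illustrative and not taken from any library.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Minimal bump arena: one malloc, many small carve-outs, one free.
class Arena {
    std::uint8_t* m_base;
    std::size_t m_capacity;
    std::size_t m_used;
public:
    explicit Arena(std::size_t capacity)
        : m_base(static_cast<std::uint8_t*>(std::malloc(capacity))),
          m_capacity(m_base ? capacity : 0),
          m_used(0) {}
    ~Arena() { std::free(m_base); }
    Arena(const Arena&) = delete;
    Arena& operator=(const Arena&) = delete;

    // Returns nullptr when the arena is exhausted. Sizes are rounded up so
    // every returned pointer stays aligned for any fundamental type.
    void* alloc(std::size_t size) {
        const std::size_t align = alignof(std::max_align_t);
        const std::size_t rounded = (size + align - 1) / align * align;
        if (rounded < size || rounded > m_capacity - m_used) return nullptr;
        void* result = m_base + m_used;
        m_used += rounded;
        return result;
    }

    // Forgets every allocation at once; the big block itself is kept.
    void reset() { m_used = 0; }

    std::size_t used() const { return m_used; }
};
```

The heap only ever sees one allocation and one free, so the little pieces can't scatter themselves between longer-lived blocks. The tradeoff is that you can't free an individual item, only the whole arena.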

The STL: To STL or Not to STL?

Some platforms may not have the STL available or at least it's not complete or conformant. In this case, you'll have to make do, and probably need some fundamental boilerplate code and abstract base classes of your own to fill in where you need to in its absence. Have fun learning how some of the basic functions of the STL are implemented, because you'll be rewriting a bit of it yourself.

I personally prefer to avoid the STL on IoT platforms for a number of reasons, not the least of which is it's really difficult to get it to use the heap responsibly. It tends to lead to heap fragmentation.

Resource Management: Nothing Lives Forever (Hopefully)

Resources (memory, CPU time, bus, semaphores, etc.) are precious.

Remember the kid's game hot potato? Treat your resources like hot potatoes - use them and get rid of them as fast as you can. The less you hold around in one part of your code at any given time, the more other parts of your code can do at that same time. There's metaphorically a ticking clock on every buffer and every handle you use, so use it and be done with it.
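In C++, one way to enforce the hot potato rule is to tie a resource to a scope so it can't outlive its usefulness. The following is a minimal sketch with made-up names (ScopedBuffer, checksum_with_scratch), not code from any of my libraries.

```cpp
#include <cstddef>
#include <cstdlib>

// Owns a heap buffer for exactly the lifetime of a scope.
class ScopedBuffer {
    void* m_data;
public:
    explicit ScopedBuffer(std::size_t size) : m_data(std::malloc(size)) {}
    ~ScopedBuffer() { std::free(m_data); }
    ScopedBuffer(const ScopedBuffer&) = delete;
    ScopedBuffer& operator=(const ScopedBuffer&) = delete;
    void* data() const { return m_data; }   // nullptr if the allocation failed
};

// Sums `count` bytes produced by `fill`, holding the scratch buffer only
// for as long as it is actually needed. Returns 0 if the allocation fails.
unsigned long checksum_with_scratch(std::size_t count,
                                    void (*fill)(unsigned char*, std::size_t)) {
    unsigned long sum = 0;
    {
        ScopedBuffer scratch(count);
        unsigned char* bytes = static_cast<unsigned char*>(scratch.data());
        if (bytes == nullptr) return 0;
        fill(bytes, count);
        for (std::size_t i = 0; i < count; ++i) sum += bytes[i];
    }   // the buffer is freed here, before any further work happens
    return sum;
}
```

The inner braces are the point: the memory goes back the moment the sum is known, rather than lingering until the function returns.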

Decreasing CPU Load: Grab the Cache and Dash

Cache is everything in IoT. For one thing, on many devices (the ESP32 among them), your programs don't run directly from program flash space. They must be copied into memory a bit at a time as they execute.

Furthermore, you can save power and/or increase performance by using hashtables to cache the results of complicated calculations on critical code paths. Used judiciously, this can increase your efficiency by orders of magnitude.

Techniques

Stream a Little Stream

Some months ago, I got overly ambitious and decided to implement TrueType fonts under Arduino. The issue is that your average TTF file is about 250kB. On most IoT gadgets, you are not loading that into RAM.

Short of requiring it to be loaded into program flash space, you can stream it from another source, like an SD card. I took some public domain code for rendering TrueType and converted it all to use a stream instead of a memory buffer. This took some doing, but it paid off in spades. The downside is it's at least marginally slower, even when streaming from RAM, because a stream requires you to copy the data out into temporary RAM before you can use it. Still, streaming can make the impossible possible. It's a good tool to have in your toolbelt.

htcw_io can give you basic stream functionality in your projects without reliance on the STL.
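To show the general shape of the technique, here is a minimal sketch of a pull-style stream and a consumer that never holds more than 16 bytes at a time. The InputStream interface below is hypothetical and is not the htcw_io API. MemoryStream is just a stand-in source so the example is self-contained; on a device it would wrap something like a file on an SD card.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical minimal input stream. Not the htcw_io interface.
struct InputStream {
    // Copies up to `size` bytes into `out`; returns the count (0 at the end).
    virtual std::size_t read(std::uint8_t* out, std::size_t size) = 0;
    virtual ~InputStream() {}
};

// Stand-in source backed by an array.
class MemoryStream : public InputStream {
    const std::uint8_t* m_data;
    std::size_t m_size;
    std::size_t m_pos;
public:
    MemoryStream(const std::uint8_t* data, std::size_t size)
        : m_data(data), m_size(size), m_pos(0) {}

    std::size_t read(std::uint8_t* out, std::size_t size) override {
        std::size_t n = m_size - m_pos;
        if (n > size) n = size;
        for (std::size_t i = 0; i < n; ++i) out[i] = m_data[m_pos + i];
        m_pos += n;
        return n;
    }
};

// Sums every byte in the stream using a fixed 16-byte working buffer,
// no matter how large the underlying data is.
std::uint32_t sum_bytes(InputStream& in) {
    std::uint8_t chunk[16];
    std::uint32_t sum = 0;
    std::size_t n;
    while ((n = in.read(chunk, sizeof(chunk))) > 0) {
        for (std::size_t i = 0; i < n; ++i) sum += chunk[i];
    }
    return sum;
}
```

Notice that sum_bytes takes a stream rather than a pointer and a length. That is the interface decision that lets a 250kB input get by on 16 bytes of working RAM, at the cost of the extra copy mentioned above.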

Cooperate With Coroutines

These can be a little difficult to manually implement because they usually require turning while loops into gotos and otherwise building a state machine. The idea is that instead of doing a large amount of work all at once, your routine should process a portion of the work at a time, processing the next portion each time it is called. Obviously that requires keeping some state and, at least in most cases, the aforementioned state machine.

Doing this allows you to avoid having your code block for long periods.
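Here is a small sketch of the idea: a long loop rewritten as a resumable task. The work itself (summing squares) is just a placeholder, and the class name is invented for the example.

```cpp
#include <cstdint>

// A long-running loop rewritten as a resumable task. Each call to step()
// does at most `budget` iterations and then returns to the caller.
class SumSquaresTask {
    enum State { Running, Done };
    State m_state;
    std::uint32_t m_next;    // next value to process
    std::uint32_t m_last;    // final value to process
    std::uint64_t m_total;   // result so far
public:
    explicit SumSquaresTask(std::uint32_t last)
        : m_state(last == 0 ? Done : Running), m_next(1), m_last(last), m_total(0) {}

    // Returns true while there is more work to do.
    bool step(std::uint32_t budget) {
        if (m_state == Done) return false;
        while (budget-- > 0) {
            m_total += static_cast<std::uint64_t>(m_next) * m_next;
            if (m_next == m_last) {
                m_state = Done;
                return false;
            }
            ++m_next;
        }
        return true;
    }

    bool done() const { return m_state == Done; }
    std::uint64_t total() const { return m_total; }
};
```

The loop counter that used to be a local variable has become a member, which is the "keeping some state" part. Your main loop would call step() with a small budget once per pass, leaving time for everything else in between.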

Similarly, you'll want to avoid calls like delay() in your code, preferring to check the time with, say, Arduino's millis() function in an if statement to see if the time has elapsed. That way, you can continue to do useful work while you're waiting instead of just blocking.
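A minimal sketch of that pattern follows. To keep it compilable off-device, the current time is passed in rather than read directly; on Arduino you would pass the value returned by millis(). The class name is made up, and the code assumes a 32-bit millisecond counter.

```cpp
#include <cstdint>

// Reports when `interval_ms` has elapsed, without ever blocking.
class IntervalTimer {
    std::uint32_t m_interval;
    std::uint32_t m_last;
public:
    IntervalTimer(std::uint32_t interval_ms, std::uint32_t now_ms)
        : m_interval(interval_ms), m_last(now_ms) {}

    // Returns true once per elapsed interval. The unsigned subtraction keeps
    // the comparison correct even when the millisecond counter wraps around.
    bool expired(std::uint32_t now_ms) {
        if (static_cast<std::uint32_t>(now_ms - m_last) >= m_interval) {
            m_last = now_ms;
            return true;
        }
        return false;
    }
};
```

In a loop() function, the check would look something like if (timer.expired(millis())) followed by the periodic work, with the rest of the loop free to carry on when the timer hasn't fired.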

In general, the idea is not to block, or rather, to block for as little time as possible at any given point.

Throw Cache at It

Internally, your little MCU is caching code. To maximize this effect, you'll want to keep your routines brief and your locality of reference high. In other words, try to avoid jumping around except for short distances if you want your code to all stay in the cache, and avoid creating huge routines or loops where possible. Obviously, you need to do what you need to do, but keeping critical code in RAM can dramatically speed it up. The IRAM_ATTR attribute can be applied to a function on the ESP32 to keep it loaded into RAM. Use this sparingly, as you're robbing Peter to pay Paul: it takes precious SRAM.
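Applying the attribute looks roughly like the sketch below. The fallback define exists only so the snippet compiles off-device; it assumes that on the ESP32 the platform headers have already defined IRAM_ATTR before this point. The function and counter names are invented.

```cpp
#include <cstdint>

// Off-device builds have no IRAM_ATTR, so define it away. On the ESP32 this
// assumes the platform headers were included first and already define it.
#ifndef IRAM_ATTR
#define IRAM_ATTR
#endif

volatile std::uint32_t g_pulse_count = 0;

// A short, hot routine (an interrupt handler, for example) is the kind of
// thing worth spending SRAM on.
void IRAM_ATTR on_pulse() {
    g_pulse_count = g_pulse_count + 1;
}
```

Keep functions marked this way tiny. Every byte of code placed in RAM is a byte you can't use for data.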

Aside from that, you may want to cache the results of complicated calculations in order to save power and/or increase performance. Using a hashtable is the go-to technique for this.

htcw_data contains a simple STL-free hashtable suitable for caching.
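For a sense of how little it takes, here is a sketch of a fixed-size, direct-mapped cache in front of a slow function. This is not the htcw_data hashtable or its API. It is a simpler structure where each key maps to exactly one slot and overwrites whatever was there. The slow_calc function is a stand-in for whatever expensive calculation you actually have.

```cpp
#include <cstddef>
#include <cstdint>

// Stand-in for an expensive calculation. The counter makes cache hits visible.
static unsigned g_slow_calls = 0;
std::uint32_t slow_calc(std::uint32_t x) {
    ++g_slow_calls;
    std::uint32_t r = x;
    for (int i = 0; i < 1000; ++i) r = r * 1664525u + 1013904223u;
    return r;
}

// Fixed-size, direct-mapped cache. No heap use, so no fragmentation.
class CalcCache {
    static constexpr std::size_t slot_count = 64;
    struct Slot {
        std::uint32_t key;
        std::uint32_t value;
        bool valid;
    };
    Slot m_slots[slot_count];

    // Multiplicative hash; the top 6 bits select one of the 64 slots.
    static std::size_t slot_for(std::uint32_t key) {
        return static_cast<std::uint32_t>(key * 2654435761u) >> 26;
    }
public:
    CalcCache() : m_slots() {}

    // Returns the cached result when present, otherwise computes and stores it.
    std::uint32_t get(std::uint32_t key) {
        Slot& slot = m_slots[slot_for(key)];
        if (slot.valid && slot.key == key) return slot.value;
        slot.key = key;
        slot.value = slow_calc(key);
        slot.valid = true;
        return slot.value;
    }
};
```

Two keys that land in the same slot will keep evicting each other, so this suits workloads where a handful of inputs repeat often. A real hashtable handles collisions properly at the cost of more code and memory.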

Keeping Your Threads from Unraveling

If you must use preemptive multithreading, keep in mind that you don't have a PC's debugging capabilities, and so debugging race conditions goes from really hard to almost impossible.

Due to this, your priority should be to keep things simple, predictable and robust. I stole an idea from .NET and use "synchronization contexts" and thread pools to do my threading.

FreeRTOS Thread Pack can provide you with these tools. It's ESP32 only, although it can pretty readily be adapted to other platforms that run FreeRTOS.
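The core of a synchronization context is small enough to sketch: a worker posts functions into a queue, and the main loop runs them, so the posted code always executes on the main thread. The version below is not the FreeRTOS Thread Pack API. It is a single-producer, single-consumer ring buffer built on std::atomic, which assumes your toolchain provides <atomic>, and every name in it is invented for illustration.

```cpp
#include <atomic>
#include <cstddef>

// One unit of work: a function plus the argument to hand it.
struct WorkItem {
    void (*fn)(void*);
    void* arg;
};

// Safe for exactly one posting thread and one running thread.
class SyncContext {
    static constexpr std::size_t capacity = 8;   // one slot always stays empty
    WorkItem m_items[capacity];
    std::atomic<std::size_t> m_head;   // next slot to read; consumer writes it
    std::atomic<std::size_t> m_tail;   // next slot to write; producer writes it
public:
    SyncContext() : m_items(), m_head(0), m_tail(0) {}

    // Producer side. Returns false when the queue is full.
    bool post(void (*fn)(void*), void* arg) {
        const std::size_t tail = m_tail.load(std::memory_order_relaxed);
        const std::size_t next = (tail + 1) % capacity;
        if (next == m_head.load(std::memory_order_acquire)) return false;
        m_items[tail].fn = fn;
        m_items[tail].arg = arg;
        m_tail.store(next, std::memory_order_release);
        return true;
    }

    // Consumer side. Runs everything queued so far; returns how many ran.
    std::size_t run_pending() {
        std::size_t ran = 0;
        std::size_t head = m_head.load(std::memory_order_relaxed);
        while (head != m_tail.load(std::memory_order_acquire)) {
            const WorkItem item = m_items[head];
            head = (head + 1) % capacity;
            m_head.store(head, std::memory_order_release);
            item.fn(item.arg);
            ++ran;
        }
        return ran;
    }
};
```

Because the posted functions only ever run inside run_pending(), anything they touch on the main thread needs no locking. That is where the simplicity and predictability come from.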

Conclusion

Coding for IoT sometimes requires wildly different priorities and coding techniques than development for PCs and servers. Hopefully, these concepts and techniques are useful to you in your future projects.

History

25th March, 2022 - Initial submission