32-bit pointers in a 64-bit world

cppnow

3.18/5 (6 votes)

Aug 25, 2008

CPOL

3 min read

76967

224

64 bit pointers are wasteful if you don't need to access TBs of RAM

Download source code - 2.7 KB

Introduction

64 bit is wonderful - we are no longer limited to just 2GB (per Win32 process) - you can now address a full 8TB of memory (per Win64 process)... which is great if your application really needs 8TB of memory.

If you don't need the full address space, 64-bit can be wasteful: pointers take twice the memory, which may actually result in a slower application, especially when using a pointer-laden data structure such as a binary tree (larger structures take more cache space).

A simple scheme is presented that allows using 32-bit "pointers" to access up to 64GB of memory on a 64-bit machine.

Encoding 64-bit pointers in 32-bits

The main idea of sptr is data alignment - modern processors feel much more comfortable when accessing "aligned" addresses: accessing a char at address 0xC0001 means more work than accessing an int at 0xC0000.

malloc() (which does the heavy lifting for new) only returns aligned addresses (see http://msdn.microsoft.com/en-us/library/ycsb6wwf.aspx) - a pointer returned by malloc() will always be aligned to a 16-byte boundary on Visual Studio - the 4 lowest bits will always be 0.

Therefore, at least for pointers we allocate on the heap, we can just "delete" the 4 lowest bits (by shifting right) without losing information.

If all memory addresses were allocated from the "bottom" (the highest bits are also 0), we could just encode a pointer as follows:

#include <stddef.h>  // for uintptr_t
typedef unsigned __int32 uint32_t;

uint32_t encode(void* p)
{
  uintptr_t i = reinterpret_cast<uintptr_t>(p) >> 4;
  return static_cast<uint32_t>(i);
}

(uintptr_t is an integral data type that is guaranteed to be the same size as a pointer.)

This will encode any (16-byte aligned) address in the range 0x00 0000 0000 - 0x10 0000 0000 into a 32-bit integer.

However, the Operating System is free to map physical memory into other virtual memory segments - for example, mapping kernel memory "at the top", so it is preferable to keep such a map ourselves:

#include <boost/thread/mutex.hpp>
#include <boost/thread/locks.hpp>

class sptr_base
{   
protected:
  static const uint32_t ALIGNMENT_BITS = 4;
  static const uint32_t ALIGNMENT = (1<<ALIGNMENT_BITS);
  static const uintptr_t ALIGNMENT_MASK = ALIGNMENT - 1;

protected:
  static uintptr_t     _seg_map[ALIGNMENT];
  static uintptr_t     _segs;
  static boost::mutex  _m;
  inline static uintptr_t ptr2seg(uintptr_t p)
  {
    p &= ~0xFFFFFFFFULL; // Keep only high part
    uintptr_t s = _segs;
    uintptr_t i = 0;
    for (; i < s; ++i)
      if (_seg_map[i] == p)
    return i;

    // Not found - now we do it the "right" way (mutex and all)
    boost::lock_guard<boost::mutex> lock(_m);
    for (i = 0; i < s; ++i)
      if (_seg_map[i] == p)
        return i;

    i = _segs++;
    if (_segs > ALIGNMENT)
      throw bad_segment("Segment out of range");

    _seg_map[i] = p;
    return i;
  }
};

ptr2seg() maps the high bits of a pointer to one of a few segments. The code refrains from using a mutex when reading the map, based on the atomicity of integer assignment.

The actual pointer encode/decode is then as follows:

#include <boost/pool/detail/singleton.hpp>
typedef boost::details::pool::singleton_default<segment_mapper> the_segment_mapper;

uint32_t encode(const_native_pointer_type ptr)
{
  uintptr_t p = reinterpret_cast<uintptr_t>(ptr);
  if ((p & ALIGNMENT_MASK) != 0)
    throw bad_alignment("Pointer is not aligned");
  return (uint32_t)(ptr2seg(p) + p);
}

inline native_pointer_type decode(uint32_t e)
{
   uintptr_t el = e;
   uintptr_t ptr = (_seg_map[el & ALIGNMENT_MASK] + el) & ~ALIGNMENT_MASK;
   return reinterpret_cast<native_pointer_type>(ptr);
}

sptr

sptr gives a pointer-like interface to the "compressed" pointer. It allows accessing our "pointer" similar to a real pointer:

sptr<int> p = new int;
*p = 5;
delete p;

// Conversions from/to pointers and pointer arithmetics work
int* q = new int;
p = q;
p++;
q = p;

Limitations

sptr has several limitations (which may make it inappropriate in many places).

First, sptr can't access the whole 64-bit address space (duh!). It is appropriate only for applications requiring just a bit more than the 32-bit address space (server machines are usually limited to 16GB-32GB of memory; Desktops usually can't deal with more than 8GB).

Second, although an address allocated on the heap will be aligned properly, pointer arithmetic may complicate things:

sptr<char> buf = new char[100];

will work just fine as new returns an aligned address.

However, something like:

for (sptr<char> p = buf; p != buf + 100; ++p)
   *p = '0';

will not work as p will now point to an unaligned address (the sptr implementation will assert at compile time).

However, in most cases, a standard pointer can be used:

for (char* p = buf; p != buf + 100; ++p)
   *p = '0';

Third, mix sptr and multiple-inheritence with caution.

For example, given:

struct A { char i; };
struct B { int j; };
struct C : public A, public B { };

This may or may not work (depending on structure alignment):

sptr<C> c = new C;
sptr<B> b = c;

This is because b does not point to the same address as c (this is not the place for a discussion on multiple-inheritance, but you can try the following code to convince yourself):

C* c = new C;
B* b = c; 
std::cout << ((uintptr_t)b - (uintptr_t)c) << std::endl;

There is no problem accessing the object via a standard pointer:

sptr<C> c = new C;
B* b = c;

Performance

A small benchmark is provided with the code, implementing a simple linked-list which is repeatedly traversed.

The list's node is:

struct node
{
   int val;
   node_ptr prev;
   node_ptr next;
};

The size of the node is therefore 20 bytes with standard pointers vs. 12 bytes with sptr.

However, this comes at a tradeoff - the extra bit shuffling impacts performance - the standard pointer version will perform 1.5x faster on the linked-list benchmark.

Accessing the map accounts for much of this performance hit. If it could be guaranteed that all addresses are allocated from the region 0x00 0000 0000 - 0x10 0000 0000, we could discard the map, significantly improving performance.