Robust C++: Operational Aspects

Greg Utas

5.00/5 (4 votes)

Oct 28, 2020

GPL3

15 min read

11435

150

The well-tempered server

Introduction

In the articles that I've written about the Robust Services Core (RSC), the code shown in the article is simplified from what appears in the download. The reason for this is that the full version contains things that are not relevant to the article's topic. So why are they there?

The answer is that, although RSC can be used as a framework for many types of applications, one of these is large, embedded systems, such as servers. These types of systems must be configured for each site where they are deployed. To offer high availability, they must inform operators of problems, allow them to take corrective action, and provide debugging information so that developers can pinpoint and fix the software faults that made it into a release. This article covers the capabilities that serve those purposes. Specifically, it will discuss

configuration parameters
command line parameters
statistics
logs
alarms
trace tools

Those things are irrelevant in many other articles, but here they are the focus. Most of them would actually be aspects in aspect-oriented programming. But RSC is written in C++, so trying to consolidate their occurrences instead of just sprinkling them throughout the code would only add overhead and create an inscrutable mess.

This article uses object pools to provide examples of how the above capabilities are used. If you haven't read the article about object pools, taking a quick look at it will help you to understand what's going on in this one. And if this is the first time that you're reading an article about RSC, please take a few minutes to read this preface.

Configuration Parameters

Configuration parameters are the primary way to customize a software load. Each configuration parameter is read from the configuration file soon after the system begins to initialize. The file's format is simple. A line that begins with a slash (/) is treated as a comment and is ignored. Otherwise, the system creates a CfgTuple instance for each line in the file. Each line contains two strings: a parameter's name (its key) and its value.

As the system continues to initialize, its subsystems create subclasses of CfgParm, which contain the following data members:

the CfgTuple (tuple_) associated with the parameter: this is found using the parameter's key
a string (default_) that sets the parameter's value if the configuration file did not specify one
a string (expl_) that explains the parameter's purpose
the restart level (level_) required to change the parameter's value when the system is in service

If the CfgTuple for a newly created CfgParm wasn't found, an instance is created using its default value. This occurs when the configuration file did not contain a key-value entry for the parameter.

CfgParm::CfgParm(c_string key, c_string def, c_string expl) :
   tuple_(nullptr),
   default_(def),
   expl_(expl),
   level_(RestartNone)
{
   auto reg = Singleton<CfgParmRegistry>::Instance();
   tuple_ = reg->FindTuple(key);
   if(tuple_ == nullptr)
   {
      tuple_ = new CfgTuple(key, default_);
      reg->BindTuple(*tuple_);
   }
}

Immediately after a CfgParm is created, it must be registered with the singleton CfgParmRegistry so that its value can be set from its CfgTuple:

void CfgParmRegistry::BindParm(CfgParm& parm)
{
   auto key0 = strLower(parm.Key());

   //  Register parameters by key, in alphabetical order.
   //
   CfgParm* prev = nullptr;

   for(auto next = parmq_.First(); next != nullptr; parmq_.Next(next))
   {
      auto key1 = strLower(next->Key());
      if(key0 < key1) break;
      prev = next;
   }

   parmq_.Insert(prev, parm);
   parm.SetFromTuple();
}

Setting a value is a two-step process: set it as the parameter's next value, and then make it the current value. The steps are separated because, sometimes, the new value can only be pre-loaded, after which the system must be restarted to apply it. (A restart is a partial reinitialization of the system, as described Robust C++: Initialization and Restarts.)

bool CfgParm::SetFromTuple()
{
   auto input = tuple_->Input();

   if(SetNext(input))
   {
      SetCurr();
      return true;
   }

   SetNext(default_);  // input was invalid
   SetCurr();
   return false;
}

CfgParm is actually a virtual class that is subclassed to support different types of parameters:

CfgBoolParm supports a bool parameter.
CfgFlagParm supports a bit in a std::bitset.
CfgIntParm supports an int parameter.
CfgStrParm supports a std::string parameter.

CLI Commands

The CLI provides the >cfgparms command for accessing configuration parameters:

Let's list all the configuration parameters and their values:

The rest of this section uses the configuration parameter for the size of an object pool to illustrate how one of these parameters is created and used.

Creating a CfgTuple for an ObjectPool's Size

One of the entries in the configuration file is

NumOfMsgBuffers 2

This defines the size (2K) of the MsgBuffer pool, which provides blocks for the objects used during inter-thread messaging. When CfgParmRegistry::LoadTuples is invoked early on during system initialization, it creates a CfgTuple when it reaches the above entry. Its key is "NumOfMsgBuffers" and its value is "2".

Creating a CfgParm for an ObjectPool's Size

A little later during initialization, the following line in NbModule::Startup creates the pool for MsgBuffers:

Singleton<MsgBufferPool>::Instance()->Startup(level);

Each object pool is a singleton whose constructor invokes the base ObjectPool constructor, which in turn creates the pool's ObjectPoolSizeCfg. This is the parameter that defines the pool's size; it is derived from CfgIntParm. The sequence looks like this, omitting code that isn't involved in creating the parameter:

MsgBufferPool::MsgBufferPool() :
   ObjectPool(MsgBufferObjPoolId, MemDynamic, BlockSize, "MsgBuffers") { }


ObjectPool::ObjectPool
   (ObjectPoolId pid, MemoryType mem, size_t size, const string& name) :
   name_(name.c_str()),
   key_("NumOf" + name_),
   targSegmentsCfg_(nullptr)
{
   targSegmentsCfg_.reset(new ObjectPoolSizeCfg(this));
   Singleton<CfgParmRegistry>::Instance()->BindParm(*targSegmentsCfg_);
}

ObjectPoolSizeCfg::ObjectPoolSizeCfg(ObjectPool* pool) :
   CfgIntParm(pool->key_.c_str(), "1", 0,
      ObjectPool::MaxSegments, "number of segments of 1K objects") { }

CfgIntParm::CfgIntParm(c_string key, c_string def, word min, word max,
   c_string expl) : CfgParm(key, def, expl) { }

When the ObjectPool constructor invokes CfgParmRegistry::BindParm, the parameter's value is set by finding the CfgTuple associated with its key. The pool's initial set of blocks is allocated when Startup is finally invoked on the MsgBufferPool singleton, which inherits that function from its base class:

void ObjectPool::Startup(RestartLevel level)
{
   AllocBlocks();
}

And AllocBlocks creates the number of blocks specified by the pool's configuration parameter.

Increasing an ObjectPool's Size

The size of an object pool can be increased at run-time using the CLI command >cfgparms set, which is followed by a key and value:

nb>cfgparms set NumOfMsgBuffers 4

This leads to the invocation of the following:

void ObjectPoolSizeCfg::SetCurr()
{
   CfgIntParm::SetCurr();

   //  If the pool contains no blocks, it is currently being constructed,
   //  so do nothing.  But if it already contains blocks, expand its size
   //  to the new value.
   //
   if(pool_->currSegments_ > 0)
   {
      pool_->AllocBlocks();
   }
}

Decreasing an ObjectPool's Size

The above CLI command is also used to decrease the size of an object pool:

nb>cfgparms set NumOfMsgBuffers 3

This again leads to the invocation of ObjectPoolSizeCfg::SetCurr. But when AllocBlocks is invoked this time, nothing happens. The reason is that the blocks in an object pool are allocated in segments, and all segments could contain in-use blocks. So although the reduced size is recorded, it will not take effect until a restart frees all of the existing blocks and reallocates them.

Command Line Parameters

Although the configuration file is read very early during system initialization, there are situations in which a customizable value is needed before this occurs. Although these situations should rarely occur, they must be supported by command line parameters.

At present, RSC's only command line parameter is one that defines the size of the protected heap. This heap is used when creating objects that will be write-protected once the system is in service. Some of these objects are created before the configuration file is read, so the protected heap must be created at that time. The size of this heap can therefore be customized with a command line parameter.

When main is entered, RSC saves each command line parameter in memory that is immediately write-protected. This safeguards its value and allows it to be looked up when needed. See main.cpp and MainArgs.h.

Statistics

Performance statistics provide insight to the system's internal behavior. This allows administrators to

verify that the system is performing as expected
monitor the system's throughput
calculate the system's peak capacity
determine how many resources are needed when running at peak capacity
uncover unexpected behaviors

Base classes for individual performance statistics are defined in Statistics.h:

Statistic is the virtual base class for all statistics. It contains several atomic integers that support the statistic.
Counter supports a statistic that is incremented when an event occurs.
Accumulator supports a statistic to which a positive value is added when an event occurs.
HighWatermark supports a statistic that tracks the greatest observed value of something.
LowWatermark supports a statistic that tracks the least observed value of something.

When a statistic is created, it gets added to the global StatisticsRegistry. Because the system contains a large number of statistics, each one is also added to a StatisticsGroup, whose purpose is to display related statistics together. When one of these groups is created, it also gets added to the global StatisticsRegistry.

All of the system's statistics are reported in a log that, by default, appears every 15 minutes. A report is also generated immediately before a system restart. After one of these reports, each statistic is merged into an "overall" value, saved in a "previous" value, and finally cleared to prepare for the next reporting interval. The following values can therefore be seen for each statistic:

curr: its value, up to this point, during the current reporting interval
prev: its final value during the previous reporting interval
all: its overall value across all reporting intervals

To illustrate the use of statistics, let's return to object pools. Each pool provides the following statistics:

a LowWatermark for the fewest available blocks
a Counter for the number of successful allocations
a Counter for the number of deallocations
a Counter for the number of unsuccessful allocations (no available blocks)
a Counter for the number of orphaned blocks recovered by the audit
a Counter for the number of times that the pool's size was automatically expanded
a LowWatermark for the fewest spare bytes left in a block after constructing an object within it

As an aside, I recall being told that it was a mistake to provide internally verifiable statistics! For example, if there are three statistics such that A=B+C should always hold, an obscure bug could occasionally cause this identity to not quite hold. This will disconcert some customers and result in a bug report of exaggerated importance.

CLI Commands

The CLI provides the >stats command for accessing statistics:

Let's list all the statistics groups:

Now to look at the object pool statistics:

Not much to see. That's because, by default, the report is "brief", in which case statistics still in their initial state aren't displayed. The system just initialized, so only some MsgBuffers (for inter-thread communication) have been used. But if we run the traffic generator for POTS calls to load test the system for a while, we can see that much more has occurred:

Creating a Statistic

Each Statistic subclass defines a typedef that wraps the statistic in a std::unique_ptr. A set of related statistics is allocated by defining them as members of a class that is, in turn, allocated as a member of the class that uses those statistics.

The base class Statistic derives from Dynamic, which uses a heap that is freed during most system restarts (partial reinitalizations). Statistics are therefore allocated dynamically because most of the classes that use them survive a restart that frees their statistics. During the shutdown phase of such a restart, the class must nullify the pointer that manages the statistics. And during the startup phase, it must reallocate them. Here is the code that is associated with managing an object pool's statistics:

typedef std::unique_ptr<Counter> CounterPtr;
typedef std::unique_ptr<LowWaterMark> LowWatermarkPtr;

//  Statistics for each object pool.
//
class ObjectPoolStats : public Dynamic
{
public:
   ObjectPoolStats();

   LowWatermarkPtr lowCount_;
   CounterPtr      allocCount_;
   CounterPtr      freeCount_;
   CounterPtr      failCount_;
   CounterPtr      auditCount_;
   CounterPtr      expansions_;
   LowWatermarkPtr lowExcess_;
};


ObjectPoolStats::ObjectPoolStats()
{
   lowCount_.reset(new LowWatermark("fewest remaining blocks"));
   allocCount_.reset(new Counter("successful allocations"));
   freeCount_.reset(new Counter("deallocations"));
   failCount_.reset(new Counter("unsuccessful allocations"));
   auditCount_.reset(new Counter("blocks recovered by audit"));
   expansions_.reset(new Counter("number of times pool was expanded"));
   lowExcess_.reset(new LowWatermark("size of block minus largest object"));
}

ObjectPool::ObjectPool(...)  // lots of code deleted
{
   stats_.reset(new ObjectPoolStats);
}

void ObjectPool::Shutdown(RestartLevel level)
{
   //  An object pool resides in protected memory.  If the restart will free
   //  it, do nothing.  If the restart will free the statistics, nullify the
   //  unique_ptr that manages them.
   //
   if(Restart::ClearsMemory(MemType())) return;
   FunctionGuard guard(Guard_MemUnprotect);
   Restart::Release(stats_);
}

void ObjectPool::Startup(RestartLevel level)
{
   //  If we don't have any statistics, they were freed during a restart, so
   //  reallocate them.
   //
   FunctionGuard guard(Guard_MemUnprotect);
   if(stats_ == nullptr) stats_.reset(new ObjectPoolStats);
}

Updating a Statistic

Each Statistic subclass provides a function for updating its value:

Counter::Incr()
Accumulator::Add(size_t count)
LowWatermark::Update(size_t count)
HighWatermark::Update(size_t count)

So when an object pool allocates a block, updating the associated statistic is as simple as

stats_->allocCount_->Incr();

Logs

Logs are the primary way that a system provides status updates. Some logs highlight problems that operators may be able to fix, whereas others are simply informational. There are also software logs, which are meaningless to operators but which help developers with debugging.

A large system can generate many types of logs, so it is helpful to categorize them. A log group contains all the logs related to a specific subsystem. Object pools, for example, have their own log group. RSC's log subsystem defines the class LogGroup, and each instance registers a short string that identifies the logs in that group (e.g., OBJ for logs associated with object pools).

Each log is also categorized by the circumstances that generate it. Regardless of the group that it belongs to, it is also identified using one of the enumerators defined by LogType:

A TroubleLog highlights a problem that may be fixed by operator intervention (100-199).
A ThresholdLog indicates that a level has been reached or exceeded (200-299).
A StateLog reports a state change or the progress of a background activity (300-399).
A PeriodicLog reports information (e.g., statistics) at regular intervals (400-499).
An InfoLog reports an event that does not require intervention (500-699).
A MiscLog is one that doesn't fit into one of the above categories (700-899).
A DebugLog is forwarded to developers to aid in debugging (900-999).

The integer ranges mean, for example, that each trouble log in a given log group uses an integer in the range 100-199 to uniquely identify itself. The log's system-wide identifier suffixes the log's identifier to its group. For example, OBJ200 must be a threshold log for object pools; it is, in fact, the log generated when the number of available blocks in a pool falls below a defined threshold.

CLI commands

The CLI provides the >logs command for accessing logs:

Let's list all log groups, the logs defined for object pools, and then see what one of those logs means:

Some of the other log commands are used during testing or log floods:

count simply displays the total number of logs generated so far.
suppress stops generating all logs in a specified group.
throttle only generates every nth occurrence of a specified log.

To check that no logs occurred during a test, a script uses count to compare the number of logs reported before and after the test. If some log might or might not occur during the test, throttle or suppress will stop it from appearing. The latter commands are also used to reduce the number of logs if the system experiences a log flood.

Logs are placed in a log buffer until written to a log file, and a new log buffer is created during each restart. The remaining log commands manage the log buffers:

buffers indicates which log buffers contain logs; it can also display them.
write streams a buffer's pending logs to its log file.
free deallocates a log buffer.

Logs survive restarts so that they can help to debug the onset of a system-initiated restart. Although the system tries to write all pending logs to a log file during a restart, this might not always occur. The above commands are then used to write the pending logs to disk and free any log buffers that are no longer needed.

Defining a Log

A log must belong to a log group, so the latter must be created first. Here is the code that defines the object pool log group and its various logs:

enum LogType
{
   TroubleLog = 100,    // 100-199: fault; intervention may be possible
   ThresholdLog = 200,  // 200-299: level reached or exceeded
   StateLog = 300,      // 300-399: state change or progress update
   PeriodicLog = 400,   // 400-499: automatic report
   InfoLog = 500,       // 500-699: no intervention required
   MiscLog = 700,       // 700-899: other types of logs
   DebugLog = 900       // 900-999: to help debug software
};

constexpr LogId ObjPoolExpansionFailed = TroubleLog;
constexpr LogId ObjPoolBlocksInUse = ThresholdLog;
constexpr LogId ObjPoolExpanded = StateLog;
constexpr LogId ObjPoolQueueCorrupt = DebugLog;
constexpr LogId ObjPoolQueueCount = DebugLog + 1;
constexpr LogId ObjPoolBlockRecovered = DebugLog + 2;
constexpr LogId ObjPoolBlocksRecovered = DebugLog + 3;

fixed_string ObjPoolLogGroup = "OBJ";

auto group = new LogGroup(ObjPoolLogGroup, "Object Pools");
new Log(group, ObjPoolExpansionFailed, "Object pool expansion failed");
new Log(group, ObjPoolBlocksInUse, "Object pool blocks in use");
new Log(group, ObjPoolExpanded, "Object pool size expanded");
new Log(group, ObjPoolQueueCorrupt, "Object pool queue corrupt");
new Log(group, ObjPoolQueueCount, "Object pool queue count incorrect");
new Log(group, ObjPoolBlockRecovered, "Object pool block recovered");
new Log(group, ObjPoolBlocksRecovered, "Object pool blocks recovered");

The LogGroup constructor registers the group with the global LogGroupRegistry, and the Log constructor registers the log with its LogGroup.

Generating a Log

The functions Log::Create and Log::Submit are used to generate a log at run-time. Create allocates and returns a std::ostringstream that is wrapped by a std::unique_ptr. This allows log-specific information to be added before submitting the log. The arguments to Create were defined in the code above, namely the string that identifies the log group, followed by the log's identifier within its group:

auto log = Log::Create(ObjPoolLogGroup, ObjPoolExpanded);

if(log != nullptr)
{
   *log << Log::Tab << "pool=" << name_;
   *log << "  new segments=" << currSegments_;
   Log::Submit(log);
}

Each log begins with a standard header that Create inserts. The header contains the log's identifier, the time, the system's name, and the log's sequence number (the 12^th log) since the last restart. The above log therefore looks like this:

Alarms

When a log is generated, it also has the ability to update an alarm. An alarm is raised when operator intervention may resolve the problem highlighted by the log. An alarm remains active until it is cleared, which occurs when the problem no longer exists. Only trouble and threshold logs can set an alarm, but any type of log can clear one.

Each alarm has a severity level: critical, major, minor, or off. A log that clears an alarm looks like a regular log. But if the alarm is on, one or more asterisks precede the standard log header to specify the alarm's severity: *** for a critical, ** for a major, and * for a minor alarm.

CLI commands

The CLI provides the >alarms command for accessing alarms:

If we list all of the system's alarms, each one's state is also shown, so looking for asterisks tells us whether any alarm is active. The explain command provides more information about an alarm:

There is also the clear command, which is used to reset an alarm if the system failed to do so.

Defining an Alarm

The ObjectPool constructor invokes EnsureAlarm:

void ObjectPool::EnsureAlarm()
{
   //  If the high usage alarm is not registered, create it.
   //
   auto reg = Singleton<AlarmRegistry>::Instance();
   auto alarmName = "OBJPOOL" + std::to_string(Pid());
   alarm_ = reg->Find(alarmName);

   if(alarm_ == nullptr)
   {
      auto alarmExpl = "High percentage of in-use " + name_;
      FunctionGuard guard(Guard_ImmUnprotect);
      alarm_ = new Alarm(alarmName.c_str(), alarmExpl.c_str(), 30);
   }
}

Because they are an extension of the log subsystem, alarms survive all restarts. But object pools don't, which is why EnsureAlarm checks if the alarm already exists before creating it.

To create an alarm, its name and a brief explanation must be provided, as well as a delay (here, 30 seconds). The delay is for hysteresis. An alarm's severity can be increased immediately, but it will not be decreased until the delay interval has passed. Without this delay, rapidly changing conditions could cause a log flood.

Raising an Alarm

An object pool updates its alarm by invoking the following:

void ObjectPool::UpdateAlarm()
{
   //  The alarm level is determined by the number of available blocks
   //  compared to the total number of blocks allocated:
   //    o critical: less than 1/32nd available
   //    o major: less than 1/16th available
   //    o minor: less than 1/8th available
   //    o none: more than 1/8th available
   //
   dyn_->delta_ = 0;
   auto status = NoAlarm;

   if(dyn_->availCount_ <= (dyn_->totalCount_ >> 5))
      status = CriticalAlarm;
   else if(dyn_->availCount_ <= (dyn_->totalCount_ >> 4))
      status = MajorAlarm;
   else if(dyn_->availCount_ <= (dyn_->totalCount_ >> 3))
      status = MinorAlarm;

   auto log = alarm_->Create(ObjPoolLogGroup, ObjPoolBlocksInUse, status);
   if(log != nullptr) Log::Submit(log);
}

This is similar to generating a log, but it uses Alarm::Create instead of Log::Create and also provides the alarm's severity level.

Invoking the above function every time a block is allocated from, or returned to, its pool would add considerable overhead. The pool therefore maintains a running count (delta_) and only invokes the function once the number of available blocks has increased or decreased by a net of 50. Note also that Alarm::Create returns nullptr unless a log should actually be generated (that is, when the alarm's severity has changed and the hysteresis delay, if applicable, has passed).

When the alarm appears, it looks like a regular log, but with one or more asterisks prefixed:

Trace Tools

Trace tools support debugging by recording various types of events in a buffer whose contents can be dumped once tracing is stopped. RSC's trace tools use a common framework so that events from all enabled tools appear in an integrated dump that orders events by the time when they occurred.

To use trace tools in a live system, filters must be provided to significantly reduce the number of events that will be recorded. This prevents the system from slowing down to such an extent that its response time becomes unacceptable, and it also prevents the trace buffer from quickly filling up with events that are irrelevant to the problem being debugged.

When the trace tool for object pools is enabled, it records an event when a pool's block is

allocated
freed
claimed (to prevent it from being recovered by the object pool audit)
recovered

RSC also provides many other trace tools. For more details, see Debugging Live Systems.

History

11^th November, 2020: Expand section on Statistics; add sections on Logs and Alarms
28^th October, 2020: Initial version