Retro Game Cartography: Part Deux

Mladen Janković

5.00/5 (14 votes)

Sep 9, 2021

CPOL

22 min read

7373

226

Improved algorithm for reconstructing game world map from captured game play

Introduction

This article describes the effort to improve performance of the original algorithm for recreating game world maps. Original implementations is described in this article.

Improvements are in made in several segments of the algorithm:

run-time performance
reduction of frame "misplacement"
artifact filtering

To achieve these goals there were several changes in approach to the problem. Changes range from implementation details to more fundamental modifications of the algorithm.

First and the most obvious change is switch from managed F# code to native C++ code. In addition to that, AVX2 is also employed to improve run-time performance of low-level operations on images.

The most fundamental change in the approach is that the algorithm no longer keeps a global view (keypoints database) of the world, nor it tries to match frames against it. Instead each successive frame is compared only to its predecessor and only relative change in position is calculated. Stream of consecutive frames that can be matched form a fragment, or part of the map. If a new frame cannot match the previous one, a new fragment is started. At the end, fragments are matched against each other and spliced together. This solves some issues when it comes to run-time performance, fluctuating keypoints and erratic placement of frames, but it also introduces new problems which are going to be discussed later in the article.

Additional improvements are also made to foreground extraction and artifact filtering to further reduce random pixels left from moving objects. To improve foreground extraction, the map building is split in two phases. The first one builds a map and applies only the most basic filtering. More involved filtering is left for the second phase that goes over the all frames again and tries to identify and remove all moving objects. Finally simple statistical analysis is applied to remove remaining artifacts - random pixels were not correctly identified as foreground.

Building Fragments

This version of the algorithm does not keep a single global map of the world, instead it tries to construct a map of local regions called fragments. Current position in the fragment is kept and updated with each new frame. To get the position of the next frame, the relative displacement (or offset) from the previous frame is calculated and the current position is updated by applying this. Content of the frame is then written to the fragment at the calculated position. If offset cannot be calculated, the current fragment is sealed and the frame is moved to a new fragment.

To calculate offset between two frames, keypoints are extracted from a frame and then they are matched against keypoints of the previous frame. Matching algorithm tries to ensure that two frames are actually next to each other and only then calculates the offset between them.

Flow of fragment collection/build is illustrated in the following diagram:

Fragment is a sequence of consecutive frames that are successfully matched and forms one piece of the world map. It is represented by matrix that stores history of each pixel. Every element of the matrix is a "dot" that tracks how many times each color appeared at the location. Adding frame (or "blitting") into the region just increases the count for the color that frame had for the pixel. Fragments can be "blended" into images. The process simply takes the most frequent color from each dot as the pixel value of the output image. Blending also outputs a bit mask that tells which dots actually observed frames/color data and which are "empty" and can be ignored/filtered. Each frame added to the fragment is also preserved in a compressed format with its location information, since this data will be needed again in later phases of the algorithm.

fgm namespace wraps code dealing with fragment representation and manipulation. fragment class is the most important one.

collector from frc namespace represents the loop that collects and builds fragments from a stream of frames. Collector takes a feeder object that is producing frames. In the example application, the feeder produces frames by reading them from directory.

Keypoint Extraction

Keypoint extraction has been changed significantly from the first version of the algorithm. To identify keypoints, the median filter is applied twice to the frame with two different window sizes: 3x3 and 5x5, then the two resulting images and the original image are compared.

We define keypoint(i,j) as a predicate that is true if pixel is identified as keypoint (original_ij stands for pixel of the original image, while median_ij represents pixel of image that went through median filter):

$\begin{aligned} keypoint(i,j) \Rightarrow original_{ij} \neq \overset{3x3}{median}_{ij} \wedge \overset{3x3}{median}_{ij} \neq \overset{5x5}{median}_{ij} \end{aligned}$

weight(i,j) is a function that maps keypoint to its weight for pixel (i,j):

$\begin{aligned} \operatorname{weight}(i,j) = \begin{cases} 2,\text{ if } keypoint(i,j) \land original_{ij} \neq \overset{5x5}{median}_{ij} \\ 1,\text{ if } keypoint(i,j) \land original_{ij} = \overset{5x5}{median}_{ij} \\ 0,\text{ otherwise} \end{cases} \end{aligned}$

input image

keypoints with weight = 1	keypoints with weight = 2

keypoints combined

Usually only keypoints with higher weight value are used for matching and offset calculation, but in regions of frame sparsely populated with those, the algorithm falls back and uses keypoints with lower weight as well. Once a keypoint is identified, its description is encoded by extracting pixels from the surrounding 5x5 window and concatenating weight to the resulting byte array to generate 104-bit code.

When extracting keypoints, the frame is "split" by a grid of a certain configuration. Each extracted keypoint is assigned to one or more regions of the grid. Later when the algorithm tries to match two frames it will only try to compare keypoints belonging to the same regions.

Median Filter for Native Images

Native images are using native color codes for pixels. Actual numeric value of the color code does not correspond to its real RGB color or its intensity or anything "physical", so they are not suitable for doing filtering of any kind. To make images suitable for processing, color codes of pixels are converted to values that correspond to brightness of their native color (the format is called native ordered values). The conversion is done in the following way:

native color code acts as input format
RGB conversion is done using lookup table
intensity of RGB color is calculated based on standard formula
ordered value is equal to index of the color in the array of all colors sorted by their intensity

Before the median filter is applied, native image is converted to this "ordered value" format, then the filter is applied. This format can also be converted to the native one if needed.

Matching Frames

Keypoint matching works by calculating all offsets between each keypoint in the current frame and corresponding keypoints in the previous frame. Corresponding keypoints are the ones with the same code residing in the same regions of their frames. Process keeps the count of each distinct offset that appears.

If we let K_P be set of all keypoints of the previous frame and we define functions code and region that maps keypoint to its code and region to which it belongs, we can identify set C_P that contains all matched keypoints between two frames:

$\begin{aligned} C_P(c) \in \{p \in K_P \vert \forall p[\operatorname{code}(p) = \operatorname{code}(c) \wedge \operatorname{region}(p) = \operatorname{region}(c)]\} \end{aligned}$

We also define function dist_θ that maps pair of keypoints to their in specified dimension as (pos_θ is a function that maps keypoint to one coordinate of its location):

$\begin{aligned} \operatorname{dist}_{\theta}(c,p) = \operatorname{pos_{\theta}}(c) - \operatorname{pos_{\theta}}(p) \text{ where } \theta \in \{x, y\} \end{aligned}$

Now we can build a sequence M_P(c) of all offsets which can be used for matching a single keypoint c of the current frame with corresponding keypoints from the previous frame:

$\begin{aligned} M_P(c)=\langle \langle \Delta x, \Delta y \rangle \in \mathbb{Z}^2\vert (\forall p \in C_P(c))[\Delta x = \operatorname{dist_x}(c, p) \wedge \Delta y = \operatorname{dist_y}(c,p)] \rangle \end{aligned}$

Finally the sequence M of all possible offsets can be defined as:

$\begin{aligned} M = M_P(c_1) ^\frown M_P(c_2) ^\frown... ^\frown M_P(c_n) \text { where } c \in K_C \end{aligned}$

Once all possible offset are defined, we can assign scores to them using score function:

$\begin{aligned} \operatorname{score}(\Delta x, \Delta y) = \sum_{i=1}^{|M|}[M_i=(\Delta x, \Delta y)] \end{aligned}$

When matching two frames, keypoints are grouped in 8 regions. These regions are divided by a 4x2 grid. Each region has 16 pixels overlap with neighboring regions. Since keypoints matching is local to the frame region, overlapping is necessary to avoid exclusion of keypoints located near borders of the regions. If there are enough keypoints of weight 2 in the region, only those will be used. If a region is sparsely populated with such keypoints, matching will switch to using keypoints of weight 1.

Once scores of all offsets for the regions are obtained, vote power is assigned to them. Top three offsets in each region are selected, based on their score. The offset with the highest score gets vote power of 3, while the 3rd ranked offset gets vote power of 1. Finally vote power of each offset is summed across the regions. Offset with greatest total power is declared as the offset of the whole frame, but only if its total power value is larger than then the next smaller one by at least 4 points (half the number of regions). This structure of voting power is established to prevent a single keypoint-rich region from overpowering all other sparsely populated regions and to remove possible ambiguity if there are multiple offsets with close value of vote power. This way multiple regions must agree on the actual offset of the whole frame and if they fail to do so, the frame will be declared as non-matching and a new fragment will be spawned.

Fragment Splicing

Output of the algorithm's first phase is a set of fragments. Each fragment represents part of the game world map made from consecutive frames that could be matched. These fragments have to be spliced together in the next step.

Process of splicing fragments into a map starts by matching each fragment against all others. Fragment matching works in a similar fashion to frame matching. Keypoint identification and extraction are the same, but there is no grid splitting fragment into regions that group keypoints (or stated differently grid is of size 1x1). Matching keypoints is also similar, but it has different policies to ensure that false matches are filtered. This process outputs a queue containing pairs of fragments that could be matched along with calculated offsets and matching scores.

Once the queue is built, the pair with the highest score is selected and its fragments are spliced together. Splicing is done by blitting one fragment into the other (blitting fragments preserve all pixel history information). Fusion produces a new fragment and the two source fragments have to be retired. It means the queue has to be purged of all pairs referencing one of the source fragments. After queue purge, this new fragment is matched against remaining fragments and matching pairs are added back to the queue. The splicing process loops back to select the next highest scoring pair and the process stops once the queue is empty (no pair of fragments has left that could be matched).

Matching Fragments

Keypoints are extracted from the fragment in the same way they are extracted from the frames. To extract keypoints, a fragment is blended to get the native image from the "dot" matrix. Keypoints are not grouped into regions like in case of frame matching, which means all keypoints in the fragment are compared against all keypoints in another fragment. Another difference to frame matching is that keypoints of all weights are used.

Each pair of matching keypoints produces a possible matching offset. Offset with the greatest count of matching keypoints is declared as a winner. Once the offset is selected, it has to be verified. To verify it, an overlapping region of the two fragments has to be calculated. This region of interest then is split into cells of fixed size and each keypoint from both fragments is assigned to a cell which encompasses the position of that keypoint. Matching is valid if at least 2/3 of cells in the regions of interest are populated by at least one matching keypoint. Cells not populated by any keypoints from both fragments are ignored from the total count of cells in the region of interest.

We choose a pair of keypoints that are matching, one from each fragment and label them as l₀ and r₀. Then offset of the frames can be defined as:

$\begin{aligned} \Delta p_x = \operatorname{pos_x}(l_0) - \operatorname{pos_x}(r_0), \, \Delta p_y = \operatorname{pos_y}(l_0) - \operatorname{pos_y}(r_0) \end{aligned}$

Next we define predicate match(l,r) that is true if keypoint l from the first fragment is matching keypoint r from the second, under assumption that l₀ and r₀ are matching:

$\begin{aligned} loc_\theta(l, r) \Leftrightarrow \operatorname{pos_\theta}(l) - \operatorname{pos_\theta}(r) = \Delta p_\theta \text{, where } \theta \text{ is } x, y \\ match(l,r) \Leftrightarrow \operatorname{code}(l) = \operatorname{code}(r) \land loc_x(l, r) \land loc_y(l, r) \end{aligned}$

We also need to define functions cell_l(l) and cell_r(r) that maps a keypoint to a cell of region of interest to which it belongs (λ represents the cell size):

$\begin{aligned} \operatorname{cell_l}(l) = \left \langle \left \lfloor \frac{\operatorname{pos_x}(l) + \Delta {p_x}^-}{\lambda} \right \rfloor, \left \lfloor \frac{\operatorname{pos_y}(l) + \Delta {p_y}^-}{\lambda} \right \rfloor \right \rangle \\ \operatorname{cell_r}(r) = \left \langle \left \lfloor \frac{\operatorname{pos_x}(r) + \Delta {p_x}^+}{\lambda} \right \rfloor, \left \lfloor \frac{\operatorname{pos_y}(r) + \Delta {p_y}^+}{\lambda} \right \rfloor \right \rangle \end{aligned}$

We use d_x and d_y that represent height and width of region of interest (while width and height of fragments are represented by width_l, width_r, height_l and height_r)

$\begin{aligned} d_x = \operatorname{min}(width_l, width_r - \Delta p_x) - \Delta {p_x}^+ \\ d_y = \operatorname{min}(height_l, height_r -\Delta p_y) - \Delta {p_y}^+ \end{aligned}$

Set C_O of all cells in the region of interest can be defined as:

$\begin{aligned} C_O \in \left \{\langle x, y \rangle \in \mathbb{N} \times \mathbb{N} \vert x < \left \lceil \frac{d_x}{\lambda} \right \rceil \wedge y < \left \lceil \frac{d_y}{\lambda} \right \rceil \right \} \end{aligned}$

If we have K_L and K_R as sets of all keypoints belonging to these two fragments that we are matching, then set C_E of ignored cells is can be defined:

$\begin{aligned} C_E \in \{c \in C_O \vert \neg [\exists l \in K_L, \exists r \in K_R, (\operatorname{cell_l}(l) = c \lor \operatorname{cell_r}(r) = c)] \} \end{aligned}$

C_M is a set of all the cells that contains at least one pair of matching keypoints:

$\begin{aligned} C_M \in \{c \in C_O \vert \exists l \in K_L, \exists r \in K_R, \operatorname{match}(l,r) \wedge \operatorname{cell_l}(l) = c) \} \end{aligned}$

Now that we have all these cell sets defined we can define predicate valid that is true if proposed offset (Δp_x, Δp_y) is valid:

$\begin{aligned} valid \Rightarrow \frac{\vert C_M \vert}{\vert C_O \setminus C_E \vert } \geq \frac{2}{3} \end{aligned}$

If this offset cannot be validated, fragments are declared as non-matching and they are not considered as candidates for splicing.

kpm namespace is also home for the fragment matching algorithm which is implemented by another overload of match function.

Foreground Filtering

Moving objects leave all kinds of artifacts over the map image, so in order to have a clean map they have to be filtered out.

Simplest form of foreground filtering is implemented by the fragment blending process. It takes the most frequent color of each pixel in a fragment and writes it to image output. The process removes most artifacts created by moving objects from the output, but there are still some remaining from objects that are moving at a slower pace, that are more static or that are moving in more confined spaces.

After splicing fragments, all historic data about each pixel will be captured in a single fragment. Once all the data is available, fragments can be blended to get a preliminary background that has most of the moving object removed. This background image will be used for identifying contours of foreground objects in the source frames.

The sequence of frames for each fragment is replayed one more time and frames are blitted into the new fragment. This time, frames are compared against the preliminary background of the original fragment. Each mismatched pixel is marked and contours to which such pixels belong are extracted and filtered out. The whole rectangle enclosing the foreground contour is removed from the frame and is not blitted into the target fragment.

This process will also exclude some contours belonging to the background. To limit consequences of this unwanted effect, only the contours whose sizes are comparable to native sprite sizes are candidates for filtering, all larger contours having are still blitted, even if they have mismatched pixels.

background

input frame

extracted foreground	foreground mask

This process is the reason (compressed) copies of source frames are kept in memory: so sequence could be replied. To save processing time , the position of each frame within the fragment is preserved. Fragment splicing affects position of these frames relative to fragment origin, but each time fragments are spliced, relative frame positions are adjusted. This is computationally much cheaper than trying to read and match frames against fragments again, but it consumes more memory since content of the frames have to be preserved.

Namespace fde contains extractor class that is responsible for extracting foreground contours from frame. mask function will generate a blit mask from extracted contours, or rather their enclosing rectangles.

fdf namespace has filter function that will do the filtering of the foreground.

Artifact Filtering

After removing the foreground with the process described in the previous section there is still going to be some artifact remaining. The remaining artifacts are identified by determining the frequency of each pixel sequence of a certain size . For each pixel in the fragment's image, a sequence of neighboring pixels is created and a count for each unique sequence is kept. Only pixels belonging to the same axis are used for forming the sequences. Sequence counting is done for both axes and the counts are kept separately. Then the number is added together for each pixel and those pixels with lower numbers are marked as potential artifacts.

We define H_{<x, y>} and V_{<x, y>} as horizontal and vertical sequences of colors assigned to pixels neighboring to x,y like this (we use n to represent size of the filter and k is the number of preceding or succeeding pixels that are taken into the account by filter):

$\begin{aligned} k = \left \lfloor \frac{n}{2} \right \rfloor \\H_{\langle x,y \rangle} = \left \langle p_{\left \langle x-k,y \right \rangle},p_{\left \langle x-k+1, y \right \rangle}, ..., p_{\left \langle x, y \right \rangle }, ..., p_{\left \langle x+k-1, y \right \rangle }, p_{\left \langle x+k, y \right \rangle } \right \rangle \\ V_{\langle x,y \rangle} = \left \langle p_{\left \langle x,y-k \right \rangle},p_{\left \langle x, y-k+1 \right \rangle}, ..., p_{\left \langle x, y \right \rangle }, ..., p_{\left \langle x, y+k-1 \right \rangle }, p_{\left \langle x, y+k \right \rangle } \right \rangle \end{aligned}$

Then we define P as set of all coordinates in the image (width and height as size of the fragment):

$\begin{aligned} P \in \{ \langle x, y \rangle \in \mathbb{N}^2 \vert ( k \leq x < width - k ) \land ( k \leq y < height - k )\} \end{aligned}$

Now we can define function heat(x,y) that maps pixel x,y to its heat value:

$\begin{aligned} \operatorname{heat}(x,y) = \frac{\sum_{p \in P} \left ( \left [H_p = H_{\langle x,y \rangle} \right ] + \left [V_p = V_{\langle x,y \rangle} \right ] \right )}{2} \end{aligned}$

Finally we define predicate artifact(x,y) that is true if pixel is potential artifact:

$\begin{aligned} artifact(x,y) \Rightarrow \sqrt{\operatorname{heat}(x,y)} > 4 \end{aligned}$

input image

heatmap

identified artifacts

For each pixel that is a potential artifact, Gaussian blur-like filter is applied. This filter performs blurring on all historical data available for the pixel and its neighbors in the fragment. Each color is separated in its own plane and then filter is applied plane by plane. Once all plains are filtered, the color represented by the plane which has the greatest value is selected as pixel color. Planes representing colors which are not present in the history of the targeted pixel are ignored.

Standard Gaussian function G(x,y) is defined as:

$\begin{aligned} G(x,y) = \frac{1}{2 \pi \sigma ^ 2}e^{- \frac {x^2 + y^2}{2 \sigma^2}} \end{aligned}$

We need auxiliary functions count(x,y,z) and total(x,y) that are mapping pixel's position to its history: number of samples in which specified color appeared as value of the pixel and total number of samples captured for the pixel. Value of each plane z for pixel with coordinates (x₀, y₀) is calculated as:

$\begin{aligned} B(z) = \sum_{x=x_0-6\sigma }^{x_0+6\sigma}\sum_{y=y_0-6\sigma }^{y_0+6\sigma} \left [ \frac{\operatorname{count}(x,y,z) \ast G(x, y)}{\operatorname{total}(x,y)} \right ] \end{aligned}$

With this we can calculate values of all planes in fragment's "dot" as (N is total number of colors in the pallet):

$\begin{aligned} D = \langle B(z)\rangle_{z=1}^{N} \end{aligned}$

We select an element from D such that D_i has the greatest value and we use the index of that element as the color value of the pixel.

arf namespace hosts filter function that performs filtering on the fragment and produces native image.

Contour Extraction

Process of extracting contours remains the same as in the original algorithm, so it will not be discussed further.

ctr and cte namespaces wrap code dealing with representation of contour and algorithm for their extraction. ctr::contour class is representation of a contour which is stored as a sequence of its edge pixels. cte::extractor class wraps contour extraction algorithm.

Active Window Scanning

Active window scanning process remains largely unchanged. Each successive frame is scanned for changes from the previous one and each changed pixel is marked in a dedicated bitmap. This way the bitmap accumulates all the changes across time. With each new frame algorithm locates the largest contour in the bitmap. Once the largest contour's area becomes stable, its enclosing rectangle is declared as the active window.

aws namespace wraps active window scanning functionality. scan function goes over frames in the provided feed and tries to identify the window.

Supporting Stuff

Following sections are describing supporting code that is (re)used by higher-level algorithm code.

Memory Allocation

For parts of the algorithm that operate on data local to the current frame, arena allocator is employed as another low-level optimization. For each frame a new arena is allocated that is slightly larger than the highest watermark of any of the previous frames. Memory arena is discarded once it is no longer needed. In some cases it is discarded when processing has finished for the current frame. In other cases, the arena has to be kept until the end of the successive frame (for example when the algorithm processes a stream of frames and has to match two frames to determine their offset).

Code dealing with memory management and allocators is located in all namespace. frame_allocator and memory_pool classes implement arena allocator pattern. memory_stack and memory_swing can be used for managing arenas that are overlapping in time.

Common Datatypes

cdt namespace contains all common data types that are representing spatial information used in various parts of the algorithm.

Following types are defined:

point - class templates that represents coordinate in two dimensions space (usually represents location of a pixel in the image)
offset_t - alias that represent signed coordinate point type or a delta (usually represents relative location of two frames)
dimensions - class template that represent two dimensional size (usually represents size of an image/fragment)
limits - class template that represent single dimensional limits
region - class template that represent two dimensional limits (usually represents margins of a frame/fragment or region of overlap between frame/fragment)

Matrix Representation and Manipulation

Code dealing with matrices is located in mrl namespace. matrix is class that encapsulates a matrix and provides some basic operation. It is used for representing images in different formats. This type is allocator aware.

Image and Pixel Types

cpl namespace defines all necessary pixel types and sid namespace contains some standard image types based on those pixel types.

Several pixel formats are provided:

cpl::rgb_cc - (RGB, component color) pixel of a single RGB color component
cpl::rgb_bc - (RGB, blended color) RGB24 pixel
cpl::rgb_pc - (RGB, packed color) pixel represented by tuple of three RGB components
cpl::nat_cc - (native, color coded) pixel as color code encoded by the value on the native platform
cpl::nat_ov - (native, ordered value) pixel that stores index of color in native color intensity table
cpl::grs_iv - (greyscale, intensity value) pixel stores intensity as floating point value
cpl::mon_bv - (monochrome, bit value) monochrome pixel

There are also utilities that perform conversions between pixel formats (names are self-explanatory):

native_to_blend
native_to_ordered
native_to_intensity

ordered_to_native
ordered_to_blend
ordered_to_intensity

pack_to_blend
pack_to_intentisy
pack_to_intensity

blend_to_pack
blend_to_intensity

intensity_to_pack
intensity_to_blend

Image is nothing more than a matrix of specific types of pixels. sid has two namespaces: mon and nat. They define two most commonly used image formats: monochromatic and native. Each name space has dimg_t alias that defines image using default allocator and aimg_t templated alias which takes allocator type as its parameter.

Image Compression

icd namespace has definitions of interface for services that provides image (de)compression needed by the algorithm and nic namespace contains implementation of these interfaces for (de)compression of images in native image format.

nic::compress and nic::decompress offers a rather simple and not so efficient way to compress native images.

icd::compressor and icd::decompressor concepts are provided as interface for (de)compression algorithm so more efficient implementation (in terms of size at least) can be provided.

Frame Feed and Map Builder

Interface of frame feed, on which the rest of the algorithm relies as source of frames is defined in ifd namespace. Namespace mpb contains map builder.

ifd::feeder is a concept that defines the interface of the feeder that is used by active window scanner or frame collector to read a sequence of frames. Feeder returns ifd::frame which contains native image and frame order number.

mpb::builder class sits at the top and unifies all the steps of the algorithm and support classes. It takes an instance of an adapter object which has to be provided by the application. It has to provide frame feed, image compression callback and other settings to the algorithm. Adapter also offers callbacks for the application to hook into the execution of the algorithm.

Utilities

nil namespace has utilities for storing, loading and converting images in native format. On the other hand ful namespace contains utilities to deal with fragments.

nil::read_raw function loads native image from the file and nil::write_png writes native image in PNG format. ful::read loads from, while ful::write stores collection of fragments to the disk.