Analyzing profiling data from vsperfcmd

Mattias Högström

5.00/5 (4 votes)

Jan 2, 2013

Ms-PL

8 min read

39863

1349

Building a viewer and analyzing profiling data reports from vsperfcmd.

VsPerfView - Callee view

Introduction

This is a spin-off from another article Profiling of C++-Applications in Visual Studio for Free where the author created a simple viewer to show profiling data. My contribution is adding basic analysis to make viewing more interesting, and adding support for trace profiling data.

Background

In the first article, we learn how to compile the app for sample profiling, and use the tools that comes with all versions of visual studio, even the Express version. If you have the premium or ultimate version of Visual Studio, you will have a nice Viewer and Analyzer builtin, but for the rest of us, although we have also the profiling tools available, we don't have a viewer, and profiling without a decent viewer doesn't make much sense. Many thanks to the original article for showing us the way and giving us a viewer. Let's try to make some improvements.

Below is a screenshot of the old viewer

ProfileResultViewer - Callee view

On my machine the viewer has a minor aestethic flaw. The profiling data gets generated in my locale using comma as the decimal separator. This confuses the parser and shows that wmmain consumes a 10000 % instead of 100%. C

What is more annoying is that wmain is appearing several times. To make the viewer more useful, I would like to see wmainCRTStartup as the root node, and see wmain under it. Then I would like to be able to drill down for further details. Below is the the new view using the same data file as above.

VsPerfView - Callee view

I added some colors to the output. In my initial idea, I was going to use more colors, to mark hotspots and possible cpu hogs, but also to mark false positives. A false positive is for example the function WaitForMultipleObjects. That is a function that doesn't consume cpu cycles. It is a sort of Wait/Sleep that is only released by the OS, when a special condition is met or a time out occurs. This function could be given a light grey color in order to mark it as unimportant.

VsPerfView - false positive

As you can see the thread spends 51.97% of the time in WaitForMultipleObjects, but the CPU hog is not there. If your application has many threads, it is in fact most likely that most of the threads are in some waiting state. A C++/.Net app might have the following threads: background worker thread, asynchronous IO worker thread, garbage collector, and a GUI thread waiting for interaction.

Even a console input call like readline, eventually calls either WaitForSingleObject or WaitForMultipleObjects.

How it was done

Profiling Data Format

The data format is quite simple. It is a hierarchical structure oriented around Root nodes, which has a caller and a callee as sub nodes.

Root","_wmain"
"Caller","_wmainCRTStartup"
"Callee","ExpensiveMethodB(void)"
"Callee","ExpensiveMethodC(void)"

This patterns is repeated for every caller and callee. The callee ExpensiveMethodB is also a root note.

Root","ExpensiveMethodB(void)"
"Caller","_wmain"
"Callee","Data::~Data(void)"
"Callee","Data::Data(int)"

Without a filter, we will get repeated nodes if we display all Root nodes. We will try to avoid these redundant nodes.

Analyzing the profiling data

Step 1 - Parsing and graphical representation

The parsing was pretty straight forward, and I thought that a TreeView would be suitable for representing the call stacks.

Step 2 - Finding the entrypoints

One approach is to search for the function _wmainCRTStartup and set it as Root node. Unfortunately this approach has flaws. This function has a tendency to switch name depending on the version of the compiler. Sometimes it is called _wmainCRTStartup and sometimes _tmainCRTStartup, I have even found a few more.

In its extension we will also have to find all the entrypoints of all the remaining threads. If we break into a running app with windbg, the entrypoint of all threads are RtlUserThreadStart, this function doesn't even appear as node.

I ended up with a simple solution. All nodes that don't have any Callers ought to be entrypoints, i.e. the start of independent threads (including wmain).

Step 3 - Expanding a node

Expanding the root node, I had to fill in the information for the callees. I used the information available in the Root node.

"Root","_wmain",14,0,"100,00","0,00"
"Caller","_wmainCRTStartup",14,0,"100,00","0,00"
"Callee","ExpensiveMethodB(void)",4,0,"28,57","0,00"
"Callee","ExpensiveMethodC(void)",10,0,"71,43","0,00"

VsPerfView - Callee view

This looks indeed correct.

Step 4 - Oh no, The values are wrong

When I drilled down even further, going into ExpensiveMethodB and ExpensiveMethodC I noticed pretty soon that the values didn't make sense.

Look, the call to std::list::_Tidy(void) is 42.86% in each branch.

VsPerfView - wrong values

Let's look at the Root node to see what went wrong

"Root","std::list::_Tidy(void)",6,0,"42,86","0,00"
"Caller","Data::~Data(void)",6,0,"42,86","0,00"
"Callee","_free",6,0,"42,86","0,00"

I see. My first approach was too naive. Data::~Data(void) is the only caller, which spends 42.86% time in total, but in our application Data::~Data(void) is called from two distinct functions. When we drill down into these functions we need to calculate how much each branch contribute.

The problem is that it is not enough to just look at the contribution from the Caller, but also from the Caller's caller, and in its extension also the caller's caller's caller. In fact we have to go as far back as to the root node.

Jusy thinking about calculating a new value based on a chain of callers, traversing up and down the tree made me dizzy. Luckily, I found an easy solution to it.

Step 5 - Correcting the values

In the root node. The value is correct. On the second level, the parent (root node) is correct. As long as the parent value is correctly calculated, we can reuse that value. Looking in the TreeView, and parsing the value of the parent caller was a simple solution, probably not the most elegant one. Then I calculated a new value and adjusted it based on the parent value in the TreeView.

double AdjustedParentCallerValue = double.Parse(TreeViewItem.Item[1]);
double weight = AdjustedParentCallerValue / Current caller value;
double newAdjustedValue = Callee value * weight;

Weight is the ratio of how much the current branch contributes to the callee time. Then this value has to be multiplied with the real value. Now the calculations look better.

VsPerfView - corrected values

Words of caution

The adjusted values is an approximation. In theory there can be differences. If callers of function X doesn't trigger the same callees when they execute, it is wrong to calculate a distribution ratio. It is also wrong to use the callee as a subnode for the callers that don't go down that branch. Unfortunately, we have run into a limitation here. The Vsperfreport tool that we use to generate the report, doesn't contain that information. The .vsp file that is a huge raw file probably contains it, but it is in a binary format, unknown to me.

Instrumentation Profiling

The first article showed how to make Sample profiling. That is a technique that stops the process at a fixed interval, and logs the current callstack of the running processes. That is why the function WaitForMultipleObjects is overrepresented. Sampling has a few drawbacks.

If your program executes for a very short time you might not get any samples at all.
Some functions may be missed, if they run between two samples.
Sampling doesn't work in virtual machines.

I develop on a Mac, running Windows in a virtual environment. I didn't get any samples at all. I found out why here. The CPU counters which the vsperf profiler uses are not virtualised or exposed. But to the rescue, I also found out that instrumentation profiling works in virtual machines. That is a method that modifies the binary by inserting prelude and postlude code in each function body. This method is very exact and doesn't miss calls as the sampling does. The disadvantage is that the raw profiling data files, very fast become enormous (gigs of data). The generated summary report is still manageable in size. A 2 GB data file generated a 450 KB report. That report is parsed in fractions of a second be the viewer.

Tracing with instrumented binaries

Prerequisites

You will need the Vs2010 or Vs2012 performance tools. They are normally installed here.

"C:\Program Files (x86)\Microsoft Visual Studio 10.0\Team Tools\Performance Tools"
"C:\Program Files (x86)\Microsoft Visual Studio 11.0\Team Tools\Performance Tools"

Otherwise it can be downloaded from Vs2010 preformance tools

Support for CallerCalleeSummary from Tracing

The summary report from a trace is almost identical to a report from sample profiling. The app VsPerfView automatically detects the correct format.

The app was developed and tested under vs2010. When ran a sampling profiling at work under vs2012, my viewer failed to read the file. The reason was that Vs2012 has added some extra columns, e.g. average values. A very good addition. In a future it would be nice to add them also to the viewer. I also added support for the vs2012 data format.

How to do trace profiling

The tool VsInstr.exe has to be run on the exe and its dlls. It produces new binaries with trace enabled. These new binaries are much much slower than the original. So be patient.

Preparing the binary

SET pt="C:\Program Files (x86)\Microsoft Visual Studio 10.0\Team Tools\Performance Tools"
%pt%\vsinstr VSPerfNativeSampleApp.exe

Starting a trace

SET pt="C:\Program Files (x86)\Microsoft Visual Studio 10.0\Team Tools\Performance Tools"
%pt%\VSPerfClrEnv /traceon 
%pt%\vsperfcmd /start:trace /output:my_trace_data.vsp
VSPerfNativeSampleApp.exe
%pt%\vsperfcmd /shutdown
%pt%\VSPerfClrEnv /off
%pt%\vsperfreport /summary:all my_trace_data.vsp 
  /symbolpath:"srv*C:\Symbols*http://msdl.microsoft.com/download/symbols"

Using the code

Compiling with VS2010 Professional

This project was built with VS2010. The VS2010 Express version might need to install F# CTP.

Compiling under VS2012

If you have the professional version of VS2012, except for the automatic conversion, you will have to manually re-link the references to the latest version of FSharp.Core. This assembly is unfortunately at the time of writing not available in VS2012 Express. That is one of the reasons I developed it in VS2010.

History

2 Jan 2013 - Initial publication.
3 Jan 2013 - Clarification regarding tracing and virtual machines. Clarified the names of the downloads.
17 Jan 2013 - Corrected the filename in the instructions. Changed my_sampled_data.vsp to my_trace_data.vsp.