Click here to Skip to main content
15,887,135 members
Articles / Formating

Calculating record size of a record in an XLSB file

Rate me:
Please Sign up or sign in to vote.
4.94/5 (5 votes)
11 Nov 2013CPOL8 min read 19.3K   3   6
An understanding of the record structure for a string record in the XLSB file format.

Introduction

This article is in response to an under documented feature of the Microsoft XLSB file format regarding reading string data. This article sheds light on how the user can understand how to write code to read the XLSB string data. The purpose of this article is to show the reader how to calculate the record size of a record type in the XLSB file format.

Background

This is a very advanced topic. It requires that the user has already written a substantial amount of code that can handle the XLSB file format. The XLSB file format is really a ZIP archive that contains many files with many types of information. For this article we are only interested in the worksheet file information.

Using the code  

This article doesn't use code or even pseudo-code. It simply sheds light on the data structure of the string record. 

Points of Interest

This article is for anyone who has ever had to write readers or converters of the Excel file format. Currently I have written or worked on XLS 95, XLS 97-2003, XLSX, and XLSB. I have these formats under my belt and consider myself an expert on many aspects of the file format though I am not much of a user or expert of the software itself. In my day to day employment I have to support the Excel file format.

The data structure

A record in the XLSB file format is a smart data structure. It allows for future record types and support for legacy record types. I find the record format quite advanced and smartly designed. "Each binary record is a variable-length sequence of bytes. A binary record consists of three components: a record type, a record size, and the record data that is specific to that record type." -- From the documentation XLSB documentation.

  • The record type is variable length up to 2 bytes and is combined as 7 bit numbers.
  • The record size is also variable length up to 4 bytes and is combined as 7 bit numbers.
  • The record data follows and its length, in bytes, is the record size.

When reading the data one can skip all the record types of the data they are not interested in and parse those records they are interested in. String data comes in many formats. It can exist in a string table or as a stand alone record. Because the string data is variable length in nature it is extremely important to be able to precisely calculate the record size in order to correctly read the record data.  At the end of the article a formula is given that can be used to calculate any record length of any record type. This article will examine this with the context of string data.

The documentation gives a poor description and example of how to process a small string and they give the formula B2 * 128 + B1. B1 is the first byte but using only the first 7 bits plus the second byte multiplied by 128, but only using the first 7 bits of the data. With this formula we can only read strings totaling 127 * 128 + 127 in length. This works out to 16,383 bytes of data. Since the strings are stored as Unicode data this length is really halved and the string is limited to 8,191 Unicode characters. This is probably sufficient for most cells of data. 

What do we do when we have a string longer than that? If we follow the documentation we come away wanting more examples and documentation. However, we are left wanting.... What are we to do? In my case I had to debug a crashing application because it was trying to read in strings that are larger than this size. Fortunately I was able to reverse engineer the process and package it up in a nice article for future generations of coders who want to read Excel data.   

A string comes in many different record types but underneath it all is a basic string representation. For this example we are looking at record type 62. This record is formally known as the BRTCELLSTRING and it is one of the record types that Excel uses when transferring clipboard data between applications. The record contains cell and string information. This record is given by specifications 2.4.282, 2.5.9, and 2.5.128 of the file format.  In each image the left most "0" represents the first bit and the rightmost "1" represents the 32 bit of the data structure. The components are stacked to logical order and size. Some bits are not used and are marked as reserved. 

679614/BrtCellRString.jpg

679614/Cell.jpg

679614/RichStr.jpg 

Let's look at the crashing application and the string it was trying to read. We know that the format of the string. See BrtCellRString above. The first portion is a cell. See Cell above. The next portion is the value. See RichStr above. 

Now we need to read the data dwSizeStrRun in the RichStr (2.5.128) structure which tells us the length of the string. Theoretically it should be possible to reverse engineer the overarching record size using just the string length element since the rest of the data is a fixed length. 

Looking at the structures again we see that the Cell is always 8 bytes, 4 bytes for the column and 4 for the style. The value is variable length with 1 byte for information, 4 bytes for the length dwSizeStrRun, and then the string data. 

Now we can read the string length and compute the record size and make sense of the record size bytes.

In the crash I was analyzing dwSizeStrRun contained the value 12402. Using the knowledge of the record and knowing this string is not phonetic we can compute the value that the record size should tell us.

Starting by knowing that the string is a Unicode string we must double this amount to get the length in bytes. Therefore 12402 * 2 = 24804. Now we add 13 to this to account for the size of the fields containing dwSizeStrRun (4 bytes), A and B (1 byte), iStyleRef and A (4 bytes), and column (4 bytes). We get an expected record size of 24817 bytes.

Now, in my sample data the record size of the overarching file is given by the 3 bytes 241, 193, and 1. Once again how did we know we only have 3 bytes to determine the record size? We know this because any byte of the record size which is greater than or equal to 128 means that another byte follows. We read, based on this logic, for up to 4 bytes. In the example the 3rd byte has the value of 1 so we stop.

How in the world do the bytes 241, 193, and 1 get translated into the record size of 24817? It's easy if you know how...

  • Step 1 - The documentation tells us that only the first 7 bits are important in the value so for values greater than or equal to 128 we subtract 128. This gives us the values 113, 65, and 1 
  • Step 2 - We are told that the first byte is the least significant byte and the last byte is the most significant byte so we reverse the order. This gives us the sequence of 1, 65, and 113.
  • Step 3 - We are told that these are 7 bit numbers so let's look at the bits, in order of the sequence above

1 = 0000001, 65 = 1000001, 113 = 1110001.

Combined they are 000000110000011110001

Now let's look at the binary value of 24817. 

 24817 = 110000011110001

Let's compare our two binary numbers: 110000011110001 == 110000011110001 

Voila! They are the same so now we know how to manipulate the record size bytes to match with the string length. Those record size bytes, once manipulated form a BCD value. BCD is a binary coded decimal. Let's turn this information into a formula.  The formula must ignore the 8th bit and must concatenate the (at most) 4 bytes together to make a record length. The 8th bit is 128 and all the other bits of interest combine to the value of 127. We can use "And" logic to mask out the high bit by only looking at the bits of interest. We can use "bit shift" logic to position our bits for our binary coded decimal value. We can use "Or" logic to stitch everything back together. 

Here goes our formula: 

Record Size = ((B4 & 127) << 21) | ((B3 & 127) << 14) | ((B2 & 127) << 7) | (B1 & 127) 

 I doubt you will get the same gratification out of reverse engineering a file format to fill in the gaps of documentation that I needed to do for my work. However, I hope this formula really helps someone and my purpose for writing this article was to document an under documented but extremely critical feature of the XLSB file format.

Cheers! - Andy 

History 

  • 11/7/2013 - The article was penned by me.
  • 11/8/2013 - Edited to add more text and fix the formula because I missed a parenthesis.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


Written By
Founder
United States United States
This member has not yet provided a Biography. Assume it's interesting and varied, and probably something to do with programming.

Comments and Discussions

 
BugThe record size can be 1, 2, or 3 bytes as well, so your formula isn't correct in general Pin
Member 105696223-Feb-14 17:37
Member 105696223-Feb-14 17:37 
GeneralRe: The record size can be 1, 2, or 3 bytes as well, so your formula isn't correct in general Pin
Andy Bantly4-Feb-14 2:37
Andy Bantly4-Feb-14 2:37 
GeneralRe: The record size can be 1, 2, or 3 bytes as well, so your formula isn't correct in general Pin
Member 105696224-Feb-14 2:52
Member 105696224-Feb-14 2:52 
GeneralRe: The record size can be 1, 2, or 3 bytes as well, so your formula isn't correct in general Pin
Andy Bantly4-Feb-14 2:59
Andy Bantly4-Feb-14 2:59 
GeneralRe: The record size can be 1, 2, or 3 bytes as well, so your formula isn't correct in general Pin
Member 105696224-Feb-14 3:11
Member 105696224-Feb-14 3:11 
GeneralRe: The record size can be 1, 2, or 3 bytes as well, so your formula isn't correct in general Pin
Andy Bantly4-Feb-14 4:12
Andy Bantly4-Feb-14 4:12 

General General    News News    Suggestion Suggestion    Question Question    Bug Bug    Answer Answer    Joke Joke    Praise Praise    Rant Rant    Admin Admin   

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.