Legacy Detail Member Binary Format
Whereas the light binary format represents everything about a given pivot table, the legacy binary format conceptually consists of a number of named sources, each of which consists of a number of named variables, each of which is a 1-dimensional array of numbers or strings or a mix. Thus, the legacy binary member format is quite simple.
This section uses the same context-free grammar notation as in the previous section, with the following additions:
-
vAF(X)
In a version 0xaf legacy member,X
; in other versions, nothing. (The legacy member header indicates the version; see below.) -
vB0(X)
In a version 0xb0 legacy member,X
; in other versions, nothing.
A legacy detail member .bin
has the following overall format:
LegacyBinary =>
00 byte[version] int16[n-sources] int32[member-size]
Metadata*[n-sources]
#Data*[n-sources]
#Strings?
version
is a version number that affects the interpretation of some
of the other data in the member. Versions 0xaf and 0xb0 are known. We
will refer to "version 0xaf" and "version 0xb0" members later on.
A legacy member consists of n-sources
data sources, each of which
has Metadata and Data.
member-size
is the size of the legacy binary member, in bytes.
The Data and Strings above are commented out because the Metadata has some oddities that mean that the Data sometimes seems to start at an unexpected place. The following section goes into detail.
Metadata
Metadata =>
int32[n-values] int32[n-variables] int32[data-offset]
vAF(byte*28[source-name])
vB0(byte*64[source-name] int32[x])
A data source has n-variables
variables, each with n-values
data
values.
source-name
is a 28- or 64-byte string padded on the right with
0-bytes. The names that appear in the corpus are very generic: usually
tableData
for pivot table data or source0
for chart data.
A given Metadata's data-offset
is the offset, in bytes, from the
beginning of the member to the start of the corresponding Data. This
allows programs to skip to the beginning of the data for a particular
source. In every case in the corpus, the Data follow the Metadata in
the same order, but it is important to use data-offset
instead of
reading sequentially through the file because of the exception described
below.
One SPV file in the corpus has legacy binary members with version
0xb0 but a 28-byte source-name
field (and only a single source). In
practice, this means that the 64-byte source-name
used in version 0xb0
has a lot of 0-bytes in the middle followed by the variable-name
of
the following Data. As long as a reader treats the first 0-byte in the
source-name
as terminating the string, it can properly interpret these
members.
The meaning of x
in version 0xb0 is unknown.
Numeric Data
Data => Variable*[n-variables]
Variable => byte*288[variable-name] double*[n-values]
Data follow the Metadata
in the legacy binary format, with sources
in the same order (but readers should use the data-offset
in
Metadata
records, rather than reading sequentially). Each Variable
begins with a variable-name
that generally indicates its role in the
pivot table, e.g. "cell", "cellFormat", "dimension0categories",
"dimension0group0", followed by the numeric data, one double per
datum. A double with the maximum negative double -DBL_MAX
represents the system-missing value SYSMIS
.
String Data
Strings => SourceMaps[maps] Labels
SourceMaps => int32[n-maps] SourceMap*[n-maps]
SourceMap => string[source-name] int32[n-variables] VariableMap*[n-variables]
VariableMap => string[variable-name] int32[n-data] DatumMap*[n-data]
DatumMap => int32[value-idx] int32[label-idx]
Labels => int32[n-labels] Label*[n-labels]
Label => int32[frequency] string[label]
Each variable may include a mix of numeric and string data values.
If a legacy binary member contains any string data, Strings
is present;
otherwise, it ends just after the last Data element.
The string data overlays the numeric data. When a variable includes
any string data, its Variable represents the string values with a
SYSMIS
or NaN placeholder. (Not all such values need be
placeholders.)
Each SourceMap
provides a mapping between SYSMIS
or NaN values in
source source-name
and the string data that they represent.
n-variables
is the number of variables in the source that include
string data. More precisely, it is the 1-based index of the last
variable in the source that includes any string data; thus, it would
be 4 if there are 5 variables and only the fourth one includes string
data.
A VariableMap
repeats its variable's name, but variables are always
present in the same order as the source, starting from the first
variable, without skipping any even if they have no string values.
Each VariableMap
contains DatumMap
nonterminals, each of which
maps from a 0-based index within its variable's data to a 0-based
label index, e.g. pair value-idx
= 2, label-idx
= 3, means that
the third data value (which must be SYSMIS
or NaN) is to be replaced
by the string of the fourth Label.
The labels themselves follow the pairs. The valuable part of each
label is the string label
. Each label also includes a frequency
that reports the number of DatumMaps
that reference it (although
this is not useful).