An SPSS system file holds a set of cases and dictionary information that describes how they may be interpreted. The system file format dates back more than 40 years and has evolved greatly over that time to support new features, but in a way that facilitates interchange between even the oldest and newest versions of the software. This chapter describes the system file format.
System files use four data types: 8-bit characters, 32-bit integers, 64-bit integers, and 64-bit floating-point numbers, called here char, int32, int64, and flt64, respectively. Data is not necessarily aligned on a word or double-word boundary: the long variable names record (see Long Variable Names Record) and the very long string record (see Very Long String Record) have arbitrary byte length and can therefore cause all data coming after them in the file to be misaligned.
Integer data in system files may be big-endian or little-endian. A reader may detect the endianness of a system file by examining layout_code in the file header record (see layout_code).
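The check described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the section does not say which values layout_code takes, so the expected values 2 and 3 are an assumption drawn from the description of the file header record, and the function name detect_integer_endianness is hypothetical. The caller is expected to pass the 4 raw bytes of layout_code.

```python
import struct

# Assumption (not stated in this section): layout_code is an int32 that is
# normally 2 and occasionally 3.
EXPECTED_LAYOUT_CODES = (2, 3)


def detect_integer_endianness(layout_code_bytes, expected=EXPECTED_LAYOUT_CODES):
    """Return "big" or "little" given the 4 raw bytes of layout_code.

    Raises ValueError if neither byte order yields an expected value.
    """
    if len(layout_code_bytes) != 4:
        raise ValueError("layout_code must be exactly 4 bytes")
    # Decode the same bytes both ways and keep the order that makes sense.
    for name, fmt in (("big", ">i"), ("little", "<i")):
        (value,) = struct.unpack(fmt, layout_code_bytes)
        if value in expected:
            return name
    raise ValueError("unrecognized layout_code; cannot determine endianness")
```

Because the expected values are small, only one byte order can decode to them, so the test is unambiguous whenever it succeeds.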
Floating-point data in system files may nominally be in IEEE 754, IBM, or VAX formats. A reader may detect the floating-point format in use by examining bias in the file header record (see bias).
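A minimal sketch of this kind of detection, limited to IEEE 754, might look like the following. It assumes that bias ordinarily holds the value 100.0, which this section does not state; the function name detect_ieee_float_endianness is hypothetical, and IBM and VAX formats are deliberately not decoded here.

```python
import struct

# Assumption (not stated in this section): bias is ordinarily 100.0.
EXPECTED_BIAS = 100.0


def detect_ieee_float_endianness(bias_bytes, expected=EXPECTED_BIAS):
    """Return "big" or "little" if the 8 raw bytes of bias decode to the
    expected value as an IEEE 754 double, otherwise None.

    None means the file uses a different floating-point format or an
    unusual bias; this sketch cannot tell those cases apart.
    """
    if len(bias_bytes) != 8:
        raise ValueError("bias must be exactly 8 bytes")
    for name, fmt in (("big", ">d"), ("little", "<d")):
        (value,) = struct.unpack(fmt, bias_bytes)
        if value == expected:
            return name
    return None
```

A fuller reader would try the IBM and VAX encodings of the expected value before giving up.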
PSPP detects big-endian and little-endian integer formats in system files and translates as necessary. PSPP also detects the floating-point format in use, as well as the endianness of IEEE 754 floating-point numbers, and translates as needed. However, only IEEE 754 numbers with the same endianness as integer data in the same file have actually been observed in system files, and it is likely that other formats are obsolete or were never used.
System files use a few floating-point values for special purposes:
The system-missing value (SYSMIS) is represented by the negative number of largest magnitude in the floating-point format (-DBL_MAX).
HIGHEST is used as the high end of a missing value range with an unbounded maximum. It is represented by the largest possible positive number (DBL_MAX).
LOWEST is used as the low end of a missing value range with an unbounded minimum. It was originally represented by the negative number of second-largest magnitude (in IEEE 754 format, 0xffeffffffffffffe). System files written by SPSS 21 and later instead use the negative number of largest magnitude (-DBL_MAX), the same value as SYSMIS. This does not lead to ambiguity because LOWEST appears in system files only in missing value ranges, which never contain SYSMIS.
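The three special values can be written down concretely. The sketch below assumes a platform where Python floats are IEEE 754 doubles; the helper name is_lowest is hypothetical, and it is meant to be applied only to the low end of a missing value range, where SYSMIS never appears.

```python
import struct
import sys

DBL_MAX = sys.float_info.max

SYSMIS = -DBL_MAX   # system-missing value
HIGHEST = DBL_MAX   # open upper end of a missing value range

# LOWEST as originally written, using the bit pattern given in the text.
LOWEST_OLD = struct.unpack(">d", bytes.fromhex("ffeffffffffffffe"))[0]
# LOWEST as written by SPSS 21 and later: the same value as SYSMIS.
LOWEST_NEW = -DBL_MAX


def is_lowest(range_low):
    """Return True if the low end of a missing value range means LOWEST.

    Both the old and the new representation are accepted.
    """
    return range_low in (LOWEST_OLD, LOWEST_NEW)
```

Accepting both representations lets a reader handle files written before and after the change without knowing which version of SPSS produced them.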
System files may use most character encodings based on an 8-bit unit. UTF-16 and UTF-32, based on wider units, appear to be unacceptable. rec_type in the file header record is sufficient to distinguish between ASCII-based and EBCDIC-based encodings. The best way to determine the specific encoding in use is to consult the character encoding record (see Character Encoding Record), if present, and failing that, character_code in the machine integer info record (see Machine Integer Info Record). The same encoding should be used for the dictionary and the data in the file, although it is possible to artificially synthesize files that use different encodings (see Character Encoding Record).
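The order of preference described above can be sketched as follows. Several details are assumptions rather than facts from this section: the rec_type strings "$FL2" and "$FL3", the use of Python's cp037 codec as a stand-in for an EBCDIC-based encoding, and the function names rec_type_family and choose_encoding_source, which are hypothetical.

```python
# Assumption (not stated in this section): rec_type holds one of these strings.
ASSUMED_REC_TYPES = ("$FL2", "$FL3")


def rec_type_family(rec_type, magics=ASSUMED_REC_TYPES):
    """Classify the 4 raw bytes of rec_type as "ascii", "ebcdic", or None."""
    for magic in magics:
        if rec_type == magic.encode("ascii"):
            return "ascii"
        # cp037 is one EBCDIC code page, used here only to spell the magic.
        if rec_type == magic.encode("cp037"):
            return "ebcdic"
    return None


def choose_encoding_source(encoding_record=None, character_code=None):
    """Return (source, value) following the preference order in the text.

    encoding_record is the name from the character encoding record, if the
    file has one; character_code is the value from the machine integer info
    record, if the file has one.
    """
    if encoding_record:
        return ("character encoding record", encoding_record)
    if character_code is not None:
        return ("machine integer info record", character_code)
    return ("unknown", None)
```

For example, choose_encoding_source("UTF-8", 1252) prefers the character encoding record, while choose_encoding_source(None, 1252) falls back to character_code. Translating a character_code value into an encoding name is left out because this section does not define that mapping.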