DATA LIST

Used to read text or binary data, DATA LIST is the most fundamental data-reading command. Even the more sophisticated input methods use DATA LIST commands as a building block. Understanding DATA LIST is important to understanding how to use PSPP to read your data files.

There are two major variants of DATA LIST, which are fixed format and free format. In addition, free format has a minor variant, list format, which is discussed in terms of its differences from vanilla free format.

Each form of DATA LIST is described in detail below.

See GET DATA for a command that offers a few enhancements over DATA LIST and that may be substituted for DATA LIST in many situations.

DATA LIST FIXED

DATA LIST [FIXED]
        {TABLE,NOTABLE}
        [FILE='FILE_NAME' [ENCODING='ENCODING']]
        [RECORDS=RECORD_COUNT]
        [END=END_VAR]
        [SKIP=RECORD_COUNT]
        /[line_no] VAR_SPEC...

where each VAR_SPEC takes one of the forms
        VAR_LIST START-END [TYPE_SPEC]
        VAR_LIST (FORTRAN_SPEC)

DATA LIST FIXED is used to read data files that have values at fixed positions on each line of single-line or multiline records. The keyword FIXED is optional.

The FILE subcommand must be used if input is to be taken from an external file. It may be used to specify a file name as a string or a file handle. If the FILE subcommand is not used, then input is assumed to be specified within the command file using BEGIN DATA...END DATA. The ENCODING subcommand may only be used if the FILE subcommand is also used. It specifies the character encoding of the file. See INSERT, for information on supported encodings.

The optional RECORDS subcommand, which takes a single integer as an argument, is used to specify the number of lines per record. If RECORDS is not specified, then the number of lines per record is calculated from the list of variable specifications later in DATA LIST.

The END subcommand is only useful in conjunction with INPUT PROGRAM.

The optional SKIP subcommand specifies a number of records to skip at the beginning of an input file. It can be used to skip over a row that contains variable names, for example.

DATA LIST can optionally output a table describing how the data file is read. The TABLE subcommand enables this output, and NOTABLE disables it. The default is to output the table.

The list of variables to be read from the data list must come last. Each line in the data record is introduced by a slash (/). Optionally, a line number may follow the slash. Following, any number of variable specifications may be present.

Each variable specification consists of a list of variable names followed by a description of their location on the input line. Sets of variables may be with TO, e.g. VAR1 TO VAR5. There are two ways to specify the location of the variable on the line: columnar style and FORTRAN style.

In columnar style, the starting column and ending column for the field are specified after the variable name, separated by a dash (-). For instance, the third through fifth columns on a line would be specified 3-5. By default, variables are considered to be in F format. (Use SET FORMAT to change the default.)

In columnar style, to use a variable format other than the default, specify the format type in parentheses after the column numbers. For instance, for alphanumeric A format, use (A).

In addition, implied decimal places can be specified in parentheses after the column numbers. As an example, suppose that a data file has a field in which the characters 1234 should be interpreted as having the value 12.34. Then this field has two implied decimal places, and the corresponding specification would be (2). If a field that has implied decimal places contains a decimal point, then the implied decimal places are not applied.

Changing the variable format and adding implied decimal places can be done together; for instance, (N,5).

When using columnar style, the input and output width of each variable is computed from the field width. The field width must be evenly divisible into the number of variables specified.

FORTRAN style is an altogether different approach to specifying field locations. With this approach, a list of variable input format specifications, separated by commas, are placed after the variable names inside parentheses. Each format specifier advances as many characters into the input line as it uses.

Implied decimal places also exist in FORTRAN style. A format specification with D decimal places also has D implied decimal places.

In addition to the standard formats, FORTRAN style defines some extensions:

  • X
    Advance the current column on this line by one character position.

  • T<X>
    Set the current column on this line to column <X>, with column numbers considered to begin with 1 at the left margin.

  • NEWREC<X>
    Skip forward <X> lines in the current record, resetting the active column to the left margin.

  • Repeat count
    Any format specifier may be preceded by a number. This causes the action of that format specifier to be repeated the specified number of times.

  • (SPEC1, ..., SPECN)
    Use () to group specifiers together. This is most useful when preceded by a repeat count. Groups may be nested.

    FORTRAN and columnar styles may be freely intermixed. Columnar style leaves the active column immediately after the ending column specified. Record motion using NEWREC in FORTRAN style also applies to later FORTRAN and columnar specifiers.

Example 1

DATA LIST TABLE /NAME 1-10 (A) INFO1 TO INFO3 12-17 (1).

BEGIN DATA.
John Smith 102311
Bob Arnold 122015
Bill Yates  918 6
END DATA.

Defines the following variables:

  • NAME, a 10-character-wide string variable, in columns 1 through 10.

  • INFO1, a numeric variable, in columns 12 through 13.

  • INFO2, a numeric variable, in columns 14 through 15.

  • INFO3, a numeric variable, in columns 16 through 17.

The BEGIN DATA/END DATA commands cause three cases to be defined:

CaseNAMEINFO1INFO2INFO3
1John Smith102311
2Bob Arnold122015
3Bill Yates9186

The TABLE keyword causes PSPP to print out a table describing the four variables defined.

Example 2

DATA LIST FILE="survey.dat"
        /ID 1-5 NAME 7-36 (A) SURNAME 38-67 (A) MINITIAL 69 (A)
        /Q01 TO Q50 7-56
        /.

Defines the following variables:

  • ID, a numeric variable, in columns 1-5 of the first record.

  • NAME, a 30-character string variable, in columns 7-36 of the first record.

  • SURNAME, a 30-character string variable, in columns 38-67 of the first record.

  • MINITIAL, a 1-character string variable, in column 69 of the first record.

  • Fifty variables Q01, Q02, Q03, ..., Q49, Q50, all numeric, Q01 in column 7, Q02 in column 8, ..., Q49 in column 55, Q50 in column 56, all in the second record.

Cases are separated by a blank record.

Data is read from file survey.dat in the current directory.

DATA LIST FREE

DATA LIST FREE
        [({TAB,'C'}, ...)]
        [{NOTABLE,TABLE}]
        [FILE='FILE_NAME' [ENCODING='ENCODING']]
        [SKIP=N_RECORDS]
        /VAR_SPEC...

where each VAR_SPEC takes one of the forms
        VAR_LIST [(TYPE_SPEC)]
        VAR_LIST *

In free format, the input data is, by default, structured as a series of fields separated by spaces, tabs, or line breaks. If the current DECIMAL separator is DOT, then commas are also treated as field separators. Each field's content may be unquoted, or it may be quoted with a pairs of apostrophes (') or double quotes ("). Unquoted white space separates fields but is not part of any field. Any mix of spaces, tabs, and line breaks is equivalent to a single space for the purpose of separating fields, but consecutive commas will skip a field.

Alternatively, delimiters can be specified explicitly, as a parenthesized, comma-separated list of single-character strings immediately following FREE. The word TAB may also be used to specify a tab character as a delimiter. When delimiters are specified explicitly, only the given characters, plus line breaks, separate fields. Furthermore, leading spaces at the beginnings of fields are not trimmed, consecutive delimiters define empty fields, and no form of quoting is allowed.

The NOTABLE and TABLE subcommands are as in DATA LIST FIXED above. NOTABLE is the default.

The FILE, SKIP, and ENCODING subcommands are as in DATA LIST FIXED above.

The variables to be parsed are given as a single list of variable names. This list must be introduced by a single slash (/). The set of variable names may contain format specifications in parentheses. Format specifications apply to all variables back to the previous parenthesized format specification.

An asterisk on its own has the same effect as (F8.0), assigning the variables preceding it input/output format F8.0.

Specified field widths are ignored on input, although all normal limits on field width apply, but they are honored on output.

DATA LIST LIST

DATA LIST LIST
        [({TAB,'C'}, ...)]
        [{NOTABLE,TABLE}]
        [FILE='FILE_NAME' [ENCODING='ENCODING']]
        [SKIP=RECORD_COUNT]
        /VAR_SPEC...

where each VAR_SPEC takes one of the forms
        VAR_LIST [(TYPE_SPEC)]
        VAR_LIST *

With one exception, DATA LIST LIST is syntactically and semantically equivalent to DATA LIST FREE. The exception is that each input line is expected to correspond to exactly one input record. If more or fewer fields are found on an input line than expected, an appropriate diagnostic is issued.