Data Screening and Transformation

Once data has been entered, it is often desirable, or even necessary, to transform it in some way before performing analysis upon it. At the very least, it's good practice to check for errors.

Identifying incorrect data

Data from real sources is rarely error-free. PSPP has a number of procedures which can be used to help identify data which might be incorrect.

The DESCRIPTIVES command is used to generate simple linear statistics for a dataset. It is also useful for identifying potential problems in the data. The example file physiology.sav contains a number of physiological measurements of a sample of healthy adults selected at random. However, the data entry clerk made a number of mistakes when entering the data. The following example illustrates the use of DESCRIPTIVES to screen this data and identify the erroneous values:

PSPP> get file='/usr/local/share/pspp/examples/physiology.sav'.
PSPP> descriptives sex, weight, height.

For this example, PSPP produces the following output:

                  Descriptive Statistics
┌─────────────────────┬──┬───────┬───────┬───────┬───────┐
│                     │ N│  Mean │Std Dev│Minimum│Maximum│
├─────────────────────┼──┼───────┼───────┼───────┼───────┤
│Sex of subject       │40│    .45│    .50│Male   │Female │
│Weight in kilograms  │40│  72.12│  26.70│  -55.6│   92.1│
│Height in millimeters│40│1677.12│ 262.87│    179│   1903│
│Valid N (listwise)   │40│       │       │       │       │
│Missing N (listwise) │ 0│       │       │       │       │
└─────────────────────┴──┴───────┴───────┴───────┴───────┘

The most interesting column in the output is the minimum value. The weight variable has a minimum value of less than zero, which is clearly erroneous. Similarly, the height variable's minimum value seems to be very low. In fact, it is more than 5 standard deviations from the mean, and is a seemingly bizarre height for an adult person.
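The distance of each minimum from its mean can be checked by hand from the table. The following Python sketch is only an illustration outside PSPP: the figures are copied from the Descriptive Statistics output above, and the helper name z_score is hypothetical.

```python
def z_score(value, mean, std_dev):
    """Return how many standard deviations `value` lies from `mean`."""
    return (value - mean) / std_dev

# Figures copied from the Descriptive Statistics table above.
height_z = z_score(179, mean=1677.12, std_dev=262.87)
weight_z = z_score(-55.6, mean=72.12, std_dev=26.70)

print(f"height minimum: {height_z:.2f} standard deviations from the mean")
print(f"weight minimum: {weight_z:.2f} standard deviations from the mean")
```

Under these figures the height minimum lies roughly 5.7 standard deviations below the mean, and the weight minimum roughly 4.8 below.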

We can look deeper into these discrepancies by issuing an additional EXAMINE command:

PSPP> examine height, weight /statistics=extreme(3).

This command produces the following additional output (in part):

                   Extreme Values
┌───────────────────────────────┬───────────┬─────┐
│                               │Case Number│Value│
├───────────────────────────────┼───────────┼─────┤
│Height in millimeters Highest 1│         14│ 1903│
│                              2│         15│ 1884│
│                              3│         12│ 1802│
│                     ──────────┼───────────┼─────┤
│                      Lowest  1│         30│  179│
│                              2│         31│ 1598│
│                              3│         28│ 1601│
├───────────────────────────────┼───────────┼─────┤
│Weight in kilograms   Highest 1│         13│ 92.1│
│                              2│          5│ 92.1│
│                              3│         17│ 91.7│
│                     ──────────┼───────────┼─────┤
│                      Lowest  1│         38│-55.6│
│                              2│         39│ 54.5│
│                              3│         33│ 55.4│
└───────────────────────────────┴───────────┴─────┘

From this new output, you can see that the lowest value of height is 179 (which we suspect to be erroneous), but the second lowest is 1598, which the DESCRIPTIVES output shows is within 1 standard deviation of the mean. Similarly, the lowest value of weight is negative, but its second lowest value is plausible. This suggests that the two extreme values are outliers and probably represent data entry errors.

The output also identifies the case numbers for each extreme value, so we can see that cases 30 and 38 are the ones with the erroneous values.

Dealing with suspicious data

If possible, suspect data should be checked and re-measured. However, this may not always be feasible, in which case the researcher may decide to disregard these values. PSPP has a feature for missing values, whereby data can assume the special value 'SYSMIS', and will be disregarded in future analysis. You can set the two suspect values to the SYSMIS value using the RECODE command.

PSPP> recode height (179 = SYSMIS).
PSPP> recode weight (LOWEST THRU 0 = SYSMIS).

The first command says that for any observation which has a height value of 179, that value should be changed to the SYSMIS value. The second command says that any weight values of zero or less should be changed to SYSMIS. From now on, they will be ignored in analysis.
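As a rough illustration of what the two RECODE commands do to individual values, the following Python sketch mimics them outside PSPP. It assumes None as a stand-in for SYSMIS; the function names and the sample values are invented for the example and are not taken from physiology.sav.

```python
SYSMIS = None  # stand-in for PSPP's system-missing value (an assumption of this sketch)

def recode_height(height):
    """Mimic: recode height (179 = SYSMIS)."""
    return SYSMIS if height == 179 else height

def recode_weight(weight):
    """Mimic: recode weight (LOWEST THRU 0 = SYSMIS)."""
    return SYSMIS if weight is not SYSMIS and weight <= 0 else weight

# Invented sample values, not the contents of physiology.sav.
heights = [1598, 179, 1903]
weights = [54.5, -55.6, 92.1]

print([recode_height(h) for h in heights])
print([recode_weight(w) for w in weights])
```

Only the matching values are replaced; every other value passes through unchanged.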

If you now re-run the DESCRIPTIVES or EXAMINE commands from the previous section, you will see a data summary with more plausible parameters. You will also notice that the data summaries indicate the two missing values.

Inverting negatively coded variables

Data entry errors are not the only reason for wanting to recode data. The sample file hotel.sav comprises data gathered from a customer satisfaction survey of clients at a particular hotel. The following commands load the file and display its variables and their attributes:

PSPP> get file='/usr/local/share/pspp/examples/hotel.sav'.
PSPP> display dictionary.

It yields the following output:

                                   Variables
┌────┬────────┬─────────────┬────────────┬─────┬─────┬─────────┬──────┬───────┐
│    │        │             │ Measurement│     │     │         │ Print│ Write │
│Name│Position│    Label    │    Level   │ Role│Width│Alignment│Format│ Format│
├────┼────────┼─────────────┼────────────┼─────┼─────┼─────────┼──────┼───────┤
│v1  │       1│I am         │Ordinal     │Input│    8│Right    │F8.0  │F8.0   │
│    │        │satisfied    │            │     │     │         │      │       │
│    │        │with the     │            │     │     │         │      │       │
│    │        │level of     │            │     │     │         │      │       │
│    │        │service      │            │     │     │         │      │       │
│v2  │       2│The value for│Ordinal     │Input│    8│Right    │F8.0  │F8.0   │
│    │        │money was    │            │     │     │         │      │       │
│    │        │good         │            │     │     │         │      │       │
│v3  │       3│The staff    │Ordinal     │Input│    8│Right    │F8.0  │F8.0   │
│    │        │were slow in │            │     │     │         │      │       │
│    │        │responding   │            │     │     │         │      │       │
│v4  │       4│My concerns  │Ordinal     │Input│    8│Right    │F8.0  │F8.0   │
│    │        │were dealt   │            │     │     │         │      │       │
│    │        │with in an   │            │     │     │         │      │       │
│    │        │efficient    │            │     │     │         │      │       │
│    │        │manner       │            │     │     │         │      │       │
│v5  │       5│There was too│Ordinal     │Input│    8│Right    │F8.0  │F8.0   │
│    │        │much noise in│            │     │     │         │      │       │
│    │        │the rooms    │            │     │     │         │      │       │
└────┴────────┴─────────────┴────────────┴─────┴─────┴─────────┴──────┴───────┘

                              Value Labels
┌────────────────────────────────────────────────────┬─────────────────┐
│Variable Value                                      │      Label      │
├────────────────────────────────────────────────────┼─────────────────┤
│I am satisfied with the level of service           1│Strongly Disagree│
│                                                   2│Disagree         │
│                                                   3│No Opinion       │
│                                                   4│Agree            │
│                                                   5│Strongly Agree   │
├────────────────────────────────────────────────────┼─────────────────┤
│The value for money was good                       1│Strongly Disagree│
│                                                   2│Disagree         │
│                                                   3│No Opinion       │
│                                                   4│Agree            │
│                                                   5│Strongly Agree   │
├────────────────────────────────────────────────────┼─────────────────┤
│The staff were slow in responding                  1│Strongly Disagree│
│                                                   2│Disagree         │
│                                                   3│No Opinion       │
│                                                   4│Agree            │
│                                                   5│Strongly Agree   │
├────────────────────────────────────────────────────┼─────────────────┤
│My concerns were dealt with in an efficient manner 1│Strongly Disagree│
│                                                   2│Disagree         │
│                                                   3│No Opinion       │
│                                                   4│Agree            │
│                                                   5│Strongly Agree   │
├────────────────────────────────────────────────────┼─────────────────┤
│There was too much noise in the rooms              1│Strongly Disagree│
│                                                   2│Disagree         │
│                                                   3│No Opinion       │
│                                                   4│Agree            │
│                                                   5│Strongly Agree   │
└────────────────────────────────────────────────────┴─────────────────┘

The output shows that all of the variables v1 through v5 are measured on a 5-point Likert scale, with 1 meaning "Strongly Disagree" and 5 meaning "Strongly Agree". However, some of the questions are positively worded (v1, v2, v4) and others are negatively worded (v3, v5). To perform meaningful analysis, we need to recode the variables so that they all measure in the same direction. We could use the RECODE command, with syntax such as:

recode v3 (1 = 5) (2 = 4) (4 = 2) (5 = 1).

However, an easier and more elegant way uses the COMPUTE command. Since the variables are Likert variables in the range (1 ... 5), subtracting their value from 6 has the effect of inverting them:

compute VAR = 6 - VAR.

The following section uses this technique to recode the variables v3 and v5. After applying COMPUTE for both variables, all subsequent commands will use the inverted values.
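That the subtraction gives the same result as the explicit RECODE mapping can be confirmed for every point on the scale. The Python sketch below is an illustration outside PSPP; the name invert_likert and its scale_max parameter are hypothetical.

```python
# Mapping written out by the RECODE form shown above; 3 is left unchanged.
recode_map = {1: 5, 2: 4, 3: 3, 4: 2, 5: 1}

def invert_likert(value, scale_max=5):
    """Invert a 1..scale_max response; for scale_max=5 this is 6 - value."""
    return (scale_max + 1) - value

for response in range(1, 6):
    # Both approaches must agree for every point on the scale.
    assert invert_likert(response) == recode_map[response]
    print(response, "->", invert_likert(response))
```

Applying the inversion twice returns the original value, so running the COMPUTE command a second time by accident would undo the recoding.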

Testing data consistency

A sensible check to perform on survey data is the calculation of reliability. This gives the statistician some confidence that the questionnaires have been completed thoughtfully. If you examine the labels of variables v1, v3 and v4, you will notice that they ask very similar questions. One would therefore expect the values of these variables (after recoding) to closely follow one another, and we can test that with the RELIABILITY command. The following example shows a PSPP session where the user recodes negatively scaled variables and then requests reliability statistics for v1, v3, and v4.

PSPP> get file='/usr/local/share/pspp/examples/hotel.sav'.
PSPP> compute v3 = 6 - v3.
PSPP> compute v5 = 6 - v5.
PSPP> reliability v1, v3, v4.

This yields the following output:

Scale: ANY

Case Processing Summary
┌────────┬──┬───────┐
│Cases   │ N│Percent│
├────────┼──┼───────┤
│Valid   │17│ 100.0%│
│Excluded│ 0│    .0%│
│Total   │17│ 100.0%│
└────────┴──┴───────┘

    Reliability Statistics
┌────────────────┬──────────┐
│Cronbach's Alpha│N of Items│
├────────────────┼──────────┤
│             .81│         3│
└────────────────┴──────────┘

As a rule of thumb, many statisticians consider a value of Cronbach's Alpha of 0.7 or higher to indicate reliable data.

Here, the value is 0.81, which suggests a high degree of reliability among variables v1, v3 and v4, so the data and the recoding that we performed are vindicated.
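For readers who want to see what the statistic measures, the following Python sketch computes Cronbach's Alpha with the standard textbook formula, k/(k-1) * (1 - sum of item variances / variance of the summed scores). It is an illustration only: the text does not state how RELIABILITY computes its result, the function name cronbach_alpha is hypothetical, and the responses are invented rather than taken from hotel.sav.

```python
from statistics import variance

def cronbach_alpha(items):
    """Cronbach's alpha for a list of items, each a list of per-respondent scores."""
    k = len(items)
    item_variances = sum(variance(scores) for scores in items)
    # Total score of each respondent across all items.
    totals = [sum(case) for case in zip(*items)]
    return k / (k - 1) * (1 - item_variances / variance(totals))

# Invented responses from five respondents to three items (not hotel.sav).
consistent = [[1, 2, 3, 4, 5], [1, 2, 3, 4, 5], [1, 2, 3, 4, 5]]
inconsistent = [[1, 2, 3, 4, 5], [5, 4, 3, 2, 1], [3, 3, 4, 2, 3]]

print(round(cronbach_alpha(consistent), 2))
print(cronbach_alpha(inconsistent) < 0.7)
```

Items that agree perfectly give an alpha of 1, while the second data set, in which one item runs in the opposite direction, falls well below the 0.7 rule of thumb. This is what would be expected if a negatively worded question were left unrecoded.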

Testing for normality

Many statistical tests rely upon certain properties of the data. One common property, upon which many linear tests depend, is normality: the data must have been drawn from a normal distribution. It is therefore necessary to check for normality before deciding upon the test procedure to use. One way to do this uses the EXAMINE command.

In the following example, a researcher was examining the failure rates of equipment produced by an engineering company. The file repairs.sav contains the mean time between failures (mtbf) of some items of equipment subject to the study. Before performing linear analysis on the data, the researcher wanted to ascertain that the data is normally distributed.

PSPP> get file='/usr/local/share/pspp/examples/repairs.sav'.
PSPP> examine mtbf /statistics=descriptives.

This produces the following output:

                                  Descriptives
┌──────────────────────────────────────────────────────────┬─────────┬────────┐
│                                                          │         │  Std.  │
│                                                          │Statistic│  Error │
├──────────────────────────────────────────────────────────┼─────────┼────────┤
│Mean time between        Mean                             │     8.78│    1.10│
│failures (months)       ──────────────────────────────────┼─────────┼────────┤
│                         95% Confidence Interval Lower    │     6.53│        │
│                         for Mean                Bound    │         │        │
│                                                 Upper    │    11.04│        │
│                                                 Bound    │         │        │
│                        ──────────────────────────────────┼─────────┼────────┤
│                         5% Trimmed Mean                  │     8.20│        │
│                        ──────────────────────────────────┼─────────┼────────┤
│                         Median                           │     8.29│        │
│                        ──────────────────────────────────┼─────────┼────────┤
│                         Variance                         │    36.34│        │
│                        ──────────────────────────────────┼─────────┼────────┤
│                         Std. Deviation                   │     6.03│        │
│                        ──────────────────────────────────┼─────────┼────────┤
│                         Minimum                          │     1.63│        │
│                        ──────────────────────────────────┼─────────┼────────┤
│                         Maximum                          │    26.47│        │
│                        ──────────────────────────────────┼─────────┼────────┤
│                         Range                            │    24.84│        │
│                        ──────────────────────────────────┼─────────┼────────┤
│                         Interquartile Range              │     6.03│        │
│                        ──────────────────────────────────┼─────────┼────────┤
│                         Skewness                         │     1.65│     .43│
│                        ──────────────────────────────────┼─────────┼────────┤
│                         Kurtosis                         │     3.41│     .83│
└──────────────────────────────────────────────────────────┴─────────┴────────┘

A normal distribution has a skewness and kurtosis of zero (kurtosis here meaning excess kurtosis). The skewness of mtbf in the output above (1.65, nearly four times its standard error of .43) makes it clear that the mtbf figures have a strong positive skew and are therefore not drawn from a normally distributed variable. Positive skew can often be compensated for by applying a logarithmic transformation, as in the following continuation of the example:

PSPP> compute mtbf_ln = ln (mtbf).
PSPP> examine mtbf_ln /statistics=descriptives.

which produces the following additional output:

                                Descriptives
┌────────────────────────────────────────────────────┬─────────┬──────────┐
│                                                    │Statistic│Std. Error│
├────────────────────────────────────────────────────┼─────────┼──────────┤
│mtbf_ln Mean                                        │     1.95│       .13│
│       ─────────────────────────────────────────────┼─────────┼──────────┤
│        95% Confidence Interval for Mean Lower Bound│     1.69│          │
│                                         Upper Bound│     2.22│          │
│       ─────────────────────────────────────────────┼─────────┼──────────┤
│        5% Trimmed Mean                             │     1.96│          │
│       ─────────────────────────────────────────────┼─────────┼──────────┤
│        Median                                      │     2.11│          │
│       ─────────────────────────────────────────────┼─────────┼──────────┤
│        Variance                                    │      .49│          │
│       ─────────────────────────────────────────────┼─────────┼──────────┤
│        Std. Deviation                              │      .70│          │
│       ─────────────────────────────────────────────┼─────────┼──────────┤
│        Minimum                                     │      .49│          │
│       ─────────────────────────────────────────────┼─────────┼──────────┤
│        Maximum                                     │     3.28│          │
│       ─────────────────────────────────────────────┼─────────┼──────────┤
│        Range                                       │     2.79│          │
│       ─────────────────────────────────────────────┼─────────┼──────────┤
│        Interquartile Range                         │      .88│          │
│       ─────────────────────────────────────────────┼─────────┼──────────┤
│        Skewness                                    │     -.37│       .43│
│       ─────────────────────────────────────────────┼─────────┼──────────┤
│        Kurtosis                                    │      .01│       .83│
└────────────────────────────────────────────────────┴─────────┴──────────┘

The COMPUTE command in the first line above performs the logarithmic transformation. Rather than redefining the existing variable, this use of COMPUTE defines a new variable mtbf_ln which is the natural logarithm of mtbf. The final command in this example calls EXAMINE on this new variable. The results show that the skewness of mtbf_ln (-0.37) is now within one standard error of zero and that its kurtosis (0.01) is very close to zero. This provides some confidence that the mtbf_ln variable is normally distributed and thus safe for linear analysis. If no suitable transformation can be found, it would be worth considering an appropriate non-parametric test instead of a linear one. See NPAR TESTS for information about non-parametric tests.
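The effect of a logarithmic transformation on skewness can also be seen on a small scale. The Python sketch below is an illustration outside PSPP. It uses the bias-adjusted sample skewness formula common in statistical packages, which is assumed here rather than confirmed to be the one EXAMINE uses. The function name sample_skewness is hypothetical and the data is invented, not taken from repairs.sav.

```python
import math
from statistics import mean, stdev

def sample_skewness(values):
    """Bias-adjusted sample skewness: n/((n-1)(n-2)) * sum(((x - mean)/s)**3)."""
    n = len(values)
    m, s = mean(values), stdev(values)
    return n / ((n - 1) * (n - 2)) * sum(((x - m) / s) ** 3 for x in values)

# Invented right-skewed data (not repairs.sav): each value doubles the last.
raw = [1, 2, 4, 8, 16, 32, 64]
logged = [math.log(x) for x in raw]  # same role as: compute mtbf_ln = ln (mtbf).

print(f"skewness before: {sample_skewness(raw):.2f}")
print(f"skewness after:  {sample_skewness(logged):.2f}")
```

Because the invented values grow by a constant factor, their logarithms are evenly spaced and the skewness vanishes entirely. Real data, such as mtbf above, will usually retain some residual skew after the transformation.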