Basic statistical treatments of sociolinguistic data

Last updated 30 April 2004

By Richard Hudson 

Let's assume that you have collected data on 4 speakers' use of two sociolinguistic variables:

  • (h), the use or non-use of /h/ in words where /h/ is possible; there are two variants:
    • (h):/h/ (/h/ present)
    • (h):0 (/h/ absent)
  • (t), the pronunciation of the consonant which in RP is /t/; there are three variants:
    • (t):[t]
    • (t):[?] (glottal stop)
    • (t):0 (no consonant at all)

For each variable and each speaker you have at least 50 tokens (instances).

1. Frequency tables

You should first present your raw data as one frequency table (also called a 'contingency table') for each variable. If you are using Word, open a table (click the table icon on the tool bar) with 3 (or 4) x 5 cells (allow extra columns and lines for labels and totals), like this:

Variable (h)

              speaker
variant       A     B     C     D
/h/          35    23    37     6
0            28    37    15    45
Total        63    60    52    51

Table 1. Frequencies for (h)

 

Variable (t)

              speaker
variant       A     B     C     D
[t]          13    40     2    53
[?]          28    15    49     0
0            11     8     3     0
Total        52    63    54    53

Table 2. Frequencies for (t)

2. Percentage tables

Next convert these frequencies into percentages so that you can compare scores of different speakers. The percentages you want are column percentages, showing each variant, for each speaker, as a percentage of all the tokens of the same variable for that speaker. For example, speaker A produced a total of 63 tokens of the (h) variable, of which 35 were /h/; so 35/63 x 100 = 56% of A's tokens of (h) had the /h/ variant.

In reporting percentages, you do not need to report the same number of cells as for frequencies because the column percentages always add up to 100; so once you know that A's (h):/h/ = 56%, you know that A's other variant on the same variable is 100 - 56 = 44%. However, you do need to report the size of the total so that we can reconstruct the frequencies from the percentages, and (equally important) so that we know how seriously to take the figures. Percentages based on 500 tokens deserve far more respect than the same percentages based on only 50 tokens. Consequently, the percentage tables will give:

  • All but one of the variant percentages for each speaker (don't bother to report decimal fractions of a percentage point, e.g. 55.6%; the accuracy is spurious in this kind of work).
  • The total number of tokens for each variable for each speaker. This is called N (the total Number).

You can easily calculate the percentages with a pocket calculator, but if you want to transfer at this stage to a spreadsheet (see below), do. The spreadsheet will calculate the percentages for you.
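If you would rather let a short program do the arithmetic, here is a minimal sketch in Python. It assumes the frequencies for (t) are typed in exactly as they appear in Table 2; the names `frequencies` and `column_percentages` are invented for this illustration and are not part of any package.

```python
# Raw frequencies for variable (t), copied from Table 2 (speakers A to D).
frequencies = {
    "[t]": [13, 40, 2, 53],
    "[?]": [28, 15, 49, 0],
    "0":   [11, 8, 3, 0],
}

def column_percentages(table):
    """Return (percentages, totals): each cell as a whole-number
    percentage of its own column total N."""
    n_columns = len(next(iter(table.values())))
    # N for each speaker is the sum of that speaker's column.
    totals = [sum(row[i] for row in table.values()) for i in range(n_columns)]
    percentages = {
        variant: [round(100 * count / n) for count, n in zip(row, totals)]
        for variant, row in table.items()
    }
    return percentages, totals

percentages, totals = column_percentages(frequencies)
# Report all but one of the variants, plus N.
for variant in ["[t]", "[?]"]:
    print(variant, percentages[variant])
print("N", totals)
```

The totals are worked out from the cells rather than typed in, which guards against slips in adding up.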

 

Variable (h)

              speaker
variant       A     B     C     D
/h/          56%   38%   71%   12%
N            63    60    52    51

Table 3. Percentages for (h)

 

Variable (t)

              speaker
variant       A     B     C     D
[t]          25%   63%    4%  100%
[?]          54%   24%   91%    0%
N            52    63    54    53

Table 4. Percentages for (t)

You can now use these tables to compare the speakers. Don't simply repeat the figures in prose ("Table 4 shows that A uses 25% [t] but 54% [?], whereas B uses ....") - tables are by far the most efficient way of presenting the figures. What tables cannot do is to interpret the figures - to answer the question:

So what?

3. Interpreting the figures

This is where you need prose comments such as:

  • On both variables, speakers C and D are at opposite extremes. This is clearest in Table 4, where D always uses [t] and C almost always uses [?], but it is also true in Table 3, where C is the highest user of /h/ and D is the lowest.
  • For all speakers, (t):0 is either rare or non-existent. The main choice for (t) is between [t] (D's only form) and [?] (C's strong favourite), and B prefers [t] while A prefers [?].

At this stage all you are doing is finding trends: finding that speakers are different (or similar). Even at this stage the trends may be clear enough for you to be certain they exist, and after stage 4 you will be able to be even more confident about presenting them as facts (though it would be wise to present them modestly, as facts about these speakers on this particular occasion, without trying to generalise much further).

It is important to keep this stage separate from a much more ambitious stage where you offer explanations. In this kind of research project your explanations can never be more than intelligent guesses. For example, why are speakers C and D so extreme? Maybe it's because one is male and the other female, or because one is a graduate and the other left school at 16, or because one comes from London and the other from Norfolk, or ..... Don't worry if you can't think of any plausible explanation - that's often true in research, and that's what drives research forward. Given a large research grant and research team you might be able to produce evidence which clearly favours one explanation over all the others; but until then it is enough to flag your results as puzzling and suggesting further research.

4. Checking and reporting significance

How 'significant' are your figures? This is where you need to use some elementary statistical techniques, as 'significance' is a technical term from statistics. A pattern in your results is significant if it is unlikely to have been produced merely by chance. If you toss a coin ten times and it lands heads up 7 times, does this show that the coin is biased? If the result is statistically significant, you can conclude that it probably is; if not, the result could easily be just chance. Statistical significance is a clear and objective measure which you can calculate with the help of simple tests. It is expressed as the probability (p) of getting a pattern at least as marked as the one observed merely by chance, and the main test for the kinds of observations that you have is called the Chi-square test (named after the Greek letter χ, which looks like X; its name is written "chi" and is pronounced /kai/ in English).
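To make the coin example concrete, here is a small sketch in Python that works out the chance of getting 7 or more heads in 10 tosses of a fair coin. The function name `prob_at_least` is invented for this illustration, and the calculation is one-sided (it counts only an excess of heads, not an excess of tails).

```python
from math import comb

def prob_at_least(heads, tosses):
    """Chance of getting `heads` or more heads in `tosses` tosses of a fair coin."""
    # Count the sequences with k heads for every k from `heads` upwards,
    # then divide by the number of equally likely sequences (2 ** tosses).
    favourable = sum(comb(tosses, k) for k in range(heads, tosses + 1))
    return favourable / 2 ** tosses

p = prob_at_least(7, 10)
print(f"p = {p:.3f}")  # p = 0.172
```

A fair coin gives 7 or more heads about 17 times in 100 runs of ten tosses, so this result on its own is no evidence that the coin is biased.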

Here's how to calculate the significance of your data. The data you are concerned with are Tables 1 and 2, because the test works on raw frequencies, not percentages.

  • Open http://www.physics.csbsju.edu/stats/Index.html in a separate window, so you can easily move between it and this page.
  • Find "X2 Contingency Tables" (NOT "X2 Test: Observed and Expected Counts").
  • First click "info" for an explanation of what you are about to do.
  • Then click "Calculate".
  • Enter the number of rows and columns for the raw data in your table (i.e. excluding headings and totals) and click "Submit".
  • Enter your data in the blank cells and click "Calculate".
  • It will show a page of information, with your results at the very end.

For my Table 1 I get this:

chi-square = 41.1

degrees of freedom = 3

probability = 0.000

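This means that the probability of getting the figures in Table 1 merely by chance is (virtually) zero, so the pattern of results in Table 1 is highly significant - i.e. there must be an explanation (though it's not your job to find one). My Table 2 gives the same verdict: its probability is also (virtually) zero, though with 6 degrees of freedom rather than 3, because it has three rows of variants instead of two.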
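If you are curious about what the web site is doing, or want to check its answer, the calculation can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the site's own code: the function names are invented here, the probability comes from a standard formula for the upper tail of the chi-square distribution, and every expected frequency is assumed to be greater than zero.

```python
from math import erfc, exp, gamma, sqrt

def chi_square(table):
    """Return (chi-square, degrees of freedom) for a table of raw frequencies."""
    row_totals = [sum(row) for row in table]
    column_totals = [sum(column) for column in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for row, row_total in zip(table, row_totals):
        for observed, column_total in zip(row, column_totals):
            # Frequency expected if every speaker used the variants alike.
            expected = row_total * column_total / grand_total
            statistic += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(column_totals) - 1)
    return statistic, df

def chi_square_probability(statistic, df):
    """Probability of a chi-square value this large or larger arising by chance."""
    half = statistic / 2
    # Start from the known result for df = 2 (even) or df = 1 (odd) ...
    if df % 2 == 0:
        a, q = 1.0, exp(-half)
    else:
        a, q = 0.5, erfc(sqrt(half))
    # ... and add one term for each further two degrees of freedom.
    while a < df / 2:
        q += half ** a * exp(-half) / gamma(a + 1)
        a += 1
    return q

# Cells of Table 1 only: no headings, no totals.
table_1 = [
    [35, 23, 37, 6],   # (h):/h/
    [28, 37, 15, 45],  # (h):0
]
statistic, df = chi_square(table_1)
p = chi_square_probability(statistic, df)
print(f"chi-square = {statistic:.1f}, df = {df}, p = {p:.3f}")
# chi-square = 41.1, df = 3, p = 0.000
```

Note that only the raw cells go in; the program works out the row and column totals for itself, just as the web site does.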

All this hard work produces a vital (though tiny) bit of your final report: when you introduce the relevant table you write "The differences in this table are ...", where the dots stand for one of the following, depending on the figure for p:

  • "highly significant" if p is less than 0.005;
  • "significant" if p is between 0.005 and 0.05;
  • "not significant" if p is greater than 0.05.

(The figure for p ranges from 1.0 (complete certainty) to 0.0 (virtual impossibility); 0.05 means that merely by chance you can expect such figures 5 times in 100 experiments, and 0.005 means 5 times in 1000. The standard cut-off for probability in social sciences is 0.05.) After this significance statement, give the figures from the web site like this:

(chi-square = ..., df = .., p = ...)
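The rule for choosing the wording can also be written out as a short Python sketch. The function names `significance_label` and `report` are invented for this illustration, the statistic is spelled "chi-square" as in the web site's output, and a p of exactly 0.05 is treated here as "significant", which is an assumption since the list above does not say.

```python
def significance_label(p):
    """Choose the wording for the report from the probability p."""
    if p < 0.005:
        return "highly significant"
    if p <= 0.05:
        return "significant"
    return "not significant"

def report(chi_square, df, p):
    """Build the sentence that introduces a table in the final report."""
    return (f"The differences in this table are {significance_label(p)} "
            f"(chi-square = {chi_square:.1f}, df = {df}, p = {p:.3f})")

# Figures for Table 1 as given by the web site.
print(report(41.1, 3, 0.000))
```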

Good calculating!! If it hurts, it's doing you good - you're on the way to being numerate. If not, congratulations on your numeracy skills.