I am pasting here an interesting discussion I read on the Corpora List a few days ago. I think the issue of corpus size is relevant to many of us. Here is the discussion in its entirety:
9 Aug
In large corpora, it is very often impossible to analyse every single occurrence of a given phenomenon. Therefore, one often needs to reduce the amount of data via (random) sampling in order to take a more qualitative look at large quantities of data.
I’ve seen several times that samples of 200 occurrences/examples/tokens are chosen, each of which is then individually examined. An early example of this approach is Jennifer Coates’s study “The Semantics of the Modal Auxiliaries” (1983).
Does anybody know if this kind of sampling has inherent advantages (besides the fact that it reduces the quantity of work)? Are there statistical reasons for choosing 200 tokens? (Why not 100 or 500?)
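One statistical handle on the “why 200, not 100 or 500” question is the margin of error of an estimated proportion. A minimal sketch (not from the thread; it assumes a simple random sample and a normal approximation):

```python
import math

def margin_of_error(p: float, n: int, z: float = 1.96) -> float:
    """Approximate 95% margin of error for a proportion p estimated from n tokens."""
    return z * math.sqrt(p * (1 - p) / n)

# worst case (p = 0.5) gives the widest interval
for n in (100, 200, 500):
    print(n, round(margin_of_error(0.5, n), 3))
# → 100 0.098
#   200 0.069
#   500 0.044
```

On this reading, 200 tokens pins a proportion down to roughly ±7 percentage points at worst; halving that interval requires roughly four times the sample, which may be one pragmatic reason the literature settles on numbers in this range.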
I’d be grateful for documentation on this (or any other practical kind of) sampling. Thank you in advance!
University of Geneva
%%%%% Answer 1
This is an excellent question, and the answer is seldom “200.” It depends on several factors.
The first is that corpora are almost always a sample of some kind. When you generalize from your corpus, what are you generalizing to?
The second is that corpora almost always have some internal structure of their own. If you simply grab 200 occurrences at random, are you oversampling one or more texts, speakers/authors, subgenres? The answer to that depends on your hypothesis, and the theory that it is embedded in.
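One common guard against oversampling a single text or author is to stratify the draw by source rather than sampling the hits as one flat pool. A hypothetical sketch (the `text_id` field and the quota scheme are invented for illustration, not part of the answer above):

```python
import random
from collections import defaultdict

def stratified_sample(hits, n, seed=1):
    """Draw about n hits spread as evenly as possible across source texts.

    Each hit is a dict with a 'text_id' key; a text with fewer hits than
    its quota simply contributes everything it has.
    """
    rng = random.Random(seed)
    by_text = defaultdict(list)
    for hit in hits:
        by_text[hit["text_id"]].append(hit)
    quota = max(1, n // len(by_text))
    sample = []
    for group in by_text.values():
        sample.extend(rng.sample(group, min(quota, len(group))))
    rng.shuffle(sample)
    return sample[:n]
```

Whether stratifying is appropriate at all depends, as the answer says, on your hypothesis and the theory it is embedded in.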
There is a large literature on sampling in the social sciences. All that stuff about p-values and chi-squares is basically aimed at answering your question. It goes back to Laplace’s question, “Can we get a good estimate of the population of the French empire without counting everyone?” and Student’s question, “How many batches of Guinness do I have to examine to properly evaluate this strain of barley?”
I’ve written more about sampling on my blog:
but in general I encourage you to consult a statistician with experience in social science sampling. From a glance at your university’s website, it looks like you have some good people.
%%%%% Answer 2
#1 angus raises the legitimate problem of the skewing of randomly selected examples through disproportionate occurrences of a language feature in one source text or a subset of source texts… however you can monitor such skewing in any corpus software that allows the display of source text ids in the concordance display screen… and therefore adjust for it in your analysis…
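The same monitoring can be done programmatically: tabulate the source-text ids of the sampled concordance lines and flag any text contributing a disproportionate share. A small sketch (the 10% threshold is an arbitrary illustration):

```python
from collections import Counter

def skew_report(source_ids, threshold=0.10):
    """Return {text_id: share} for source texts contributing more than
    `threshold` of the sampled concordance lines."""
    counts = Counter(source_ids)
    total = len(source_ids)
    return {t: c / total for t, c in counts.items() if c / total > threshold}

# e.g. one novel supplying 60 of 200 sampled lines would be flagged:
ids = ["novel_A"] * 60 + [f"text_{i}" for i in range(140)]
print(skew_report(ids))   # {'novel_A': 0.3}
```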
#2 there is also the consideration of how accurate you want your analysis/statements to be… working with much smaller corpora, i think Sinclair suggested that our ultimate aim should be to account for every single example… he and others at that time (e.g. Stubbs, Tognini-Bonelli?) offered an alternative technique: take one concordance screenful, and note the numbers of whichever features you are interested in; take a second screenful and do the same; in general, the proportion of those features will stabilise after a few screenfuls… but more details will appear, indicating sub-features in relation to those features… at some point, when you are satisfied with the depth of analysis of the original features (and here the corpus frequency of the node word will be relevant) you can estimate the percentage of the total occurrences that you have analysed, and decide whether the rate of change has stabilised sufficiently for your purposes…?
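The screenful-by-screenful procedure can be simulated: keep a running proportion of the feature of interest and stop once adding another screenful changes it by less than some tolerance. A sketch (the screenful size and tolerance are illustrative choices, not from the thread):

```python
def proportion_stabilises(labels, screenful=25, tol=0.02):
    """Read hits one concordance screenful at a time; return (hits_read,
    running_proportion) once the proportion changes by less than tol
    between consecutive screenfuls, or after all hits are read.

    `labels` is a sequence of 0/1 judgements, one per concordance line.
    """
    seen = feature = 0
    prev = None
    for start in range(0, len(labels), screenful):
        batch = labels[start:start + screenful]
        feature += sum(batch)
        seen += len(batch)
        prop = feature / seen
        if prev is not None and abs(prop - prev) < tol:
            return seen, prop
        prev = prop
    return seen, feature / seen
```

In practice the stopping point is a judgement call, exactly as described above; the tolerance only makes that judgement explicit.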
combining these two strategies (statistical/probability calculations as per angus… plus manual inspection/annotation) may provide the triangulation you need?
%%%%% Answer 3
Ramesh and Angus have already made excellent points about estimating the stability of the distribution of phenomena within your sample, so I won’t say anything about that. But I wanted to add one thing about errors in your search and estimating the error rate.
Especially in a scenario where you run multiple queries, each meant to give you a count of some variant, you may want to use the entire result set (many more hits than you can read). If you are working on an alternation, or on a phenomenon with multiple known alternative realizations, you may be able to say something using the entire dataset, provided you have good reason to believe that each query is highly accurate.
I think in these kinds of cases, having a limited, manually analyzed random subset just to estimate the error rate for each query can be very valuable. I also agree that there is nothing special about a number like 200; I’ve used this strategy with 1,000 hits per construction and published results based on entire datasets when the error rate was below 1% (i.e. fewer than 10 spurious hits per 1,000 randomly dispersed items), reporting the fully automatic counts for each variant. In any case, the most important thing is to state openly what you are doing – if it doesn’t make sense, peer reviewers will let you know 🙂
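The “1,000 hits, error rate below 1%” reasoning can be made explicit with a confidence bound on the error rate estimated from the manually checked subset. A sketch (normal approximation, plus the rule of three for the zero-error case; an illustration, not the author’s actual procedure):

```python
import math

def error_rate_upper_bound(errors, checked, z=1.645):
    """One-sided ~95% upper bound on a query's error rate, from a
    manually checked random subset of `checked` hits containing
    `errors` spurious ones (normal approximation)."""
    if errors == 0:
        return 3.0 / checked          # "rule of three" approximation
    p = errors / checked
    return p + z * math.sqrt(p * (1 - p) / checked)

# 5 spurious hits in a checked subset of 1,000:
print(round(error_rate_upper_bound(5, 1000), 4))   # 0.0087
```

So a subset with 5 errors in 1,000 checked hits is consistent with a true error rate safely under the 1% threshold mentioned above.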
Dr. Amir Zeldes
Asst. Prof. for Computational Linguistics
Department of Linguistics
1437 37th St. NW
Washington, DC 20057
%---- the end