Keywords, Multiword phrases and Key Features
This interface lets you identify what is distinctive
about one subcorpus when compared with a reference corpus.
- Keywords: the words which occur more
frequently in the subcorpus of interest in comparison
with the reference corpus (as made famous by WordSmith
and other concordance software).
- Key Phrases: the most frequent phrases of a given length
in the corpus of interest.
- Key Features: the features which occur more frequently
in a subcorpus when compared with some reference corpus. For instance,
we can compare a group of low-proficiency learners to the rest of the
learners to see what they do differently (at the selected annotation layer).
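As a rough illustration of the Key Phrases idea, the sketch below counts every phrase of a given length in a list of tokens. It is not CorpusTool's own code: the function name `count_phrases` and the sample sentence are made up for the example, and tokenisation is assumed to be a simple split on whitespace.

```python
from collections import Counter

def count_phrases(tokens, length):
    """Count every run of `length` consecutive tokens (hypothetical helper)."""
    # Zip the token list against shifted copies of itself to form the phrases.
    phrases = zip(*(tokens[i:] for i in range(length)))
    return Counter(" ".join(phrase) for phrase in phrases)

# Made-up sample text, split on whitespace (an assumption about tokenisation).
tokens = "the cat sat on the mat and the cat slept".split()

# The three most frequent two-word phrases; "the cat" occurs twice.
top_phrases = count_phrases(tokens, 2).most_common(3)
```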
Using the Interface
Select the "Keywords" pane from the selector at the top of the
CorpusTool interface, then:
- Select an option from the "Type of Study" menu.
- The "Containing Unit" field: select the unit within which you want to
explore keywords/phrases/features. E.g., to see what is
special in clauses in Editorials, select "clause in_segment editorial".
- The "Compare With" field: select the reference corpus to compare
with the containing unit. There are three options:
  - "Everything else in project": the software counts the words/features in all the project's
  files, and subtracts from this the counts in the focus corpus (the "containing unit" subcorpus).
  - "Specific subset of Corpus": you then specify a subset of the corpus to be
  compared with the focus corpus.
  - "Other UAMCT Project": the software loads a separate project, counts words/features within that corpus,
  and compares them to the focus corpus.
- (If 'Phrase'): "Phrase Length": the number of words in the multiword phrases you wish to see.
- (If 'Key Features'): "Layer of Interest": this defines the network whose features you wish to see.
We might for instance specify the "containing unit" as Editorials, and the "Layer of Interest" as "Grammar".
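The "Everything else in project" option described above amounts to a subtraction of counts. The sketch below shows one way to picture it, not how CorpusTool does it internally: the name `reference_counts` and all the numbers are invented for the example.

```python
from collections import Counter

def reference_counts(project_counts, focus_counts):
    """Counts for everything in the project outside the focus corpus."""
    # Counter subtraction drops any term whose count falls to zero or below.
    return project_counts - focus_counts

# Invented counts: the whole project, and the focus ("containing unit") subcorpus.
project = Counter({"the": 120, "policy": 30, "goal": 12})
focus = Counter({"the": 40, "policy": 25})

reference = reference_counts(project, focus)
# reference now holds the: 80, policy: 5, goal: 12
```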
Calculation of 'Keyness' of keywords
In UAM CT, the keyness of a term is calculated
as the relative frequency of the term in the
subcorpus of interest divided by the relative
frequency of the term in the reference corpus.
Relative frequency is the count of the term in the subcorpus divided by
the number of terms in that subcorpus.
Basically, a term with a keyness value of 2.0 occurs
twice as often in the corpus of interest as it does
in the reference corpus.
If the keyness value is over 100, a value of 100
is used.
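The calculation just described can be sketched as follows. The function names are hypothetical. The text does not say what happens when the term never occurs in the reference corpus, so returning the cap of 100 in that case is an assumption made here to avoid dividing by zero.

```python
def relative_frequency(count, total_terms):
    """Count of the term divided by the number of terms in the subcorpus."""
    return count / total_terms

def keyness(focus_count, focus_total, ref_count, ref_total, cap=100.0):
    """Ratio of the two relative frequencies, limited to `cap`."""
    if ref_count == 0:
        # Assumption: the source does not cover this case; use the cap.
        return cap
    focus_rf = relative_frequency(focus_count, focus_total)
    ref_rf = relative_frequency(ref_count, ref_total)
    return min(focus_rf / ref_rf, cap)

# 20 hits per 1000 terms against 10 hits per 1000 terms: about twice as often.
example = keyness(20, 1000, 10, 1000)
```

Working with relative frequencies rather than raw counts means the two corpora do not have to be the same size.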
A term must appear in more than one text to
be included as a keyword; this stops consistent
misspellings or personal names from rising to the
top of the list. The condition is ignored if the
project has only 1 or 2 files.
Where fewer than 20 instances of the term occur
in the combination of the two subcorpora, the keyness
value is reduced in proportion to the shortfall:
with 10 hits the keyness is halved,
with 15 it is reduced by 25%, etc.
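The two examples given (10 hits halves the value, 15 hits reduces it by 25%) are consistent with scaling the keyness by the combined count divided by 20. The sketch below assumes that the scaling is linear for every count under 20, which the text suggests but does not state outright. The function name is hypothetical.

```python
def adjust_for_low_counts(keyness, combined_count, threshold=20):
    """Scale keyness down when the combined count is under `threshold`."""
    if combined_count >= threshold:
        return keyness
    # Assumed linear scaling: 10 hits -> x0.5, 15 hits -> x0.75.
    return keyness * combined_count / threshold

halved = adjust_for_low_counts(4.0, 10)     # 2.0
reduced = adjust_for_low_counts(4.0, 15)    # 3.0
unchanged = adjust_for_low_counts(4.0, 20)  # 4.0
```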
A term will be included in the keyword list
only if it occurs 8 or more times in the combination of
the subcorpus of interest and the reference corpus.
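Taken together, the inclusion conditions for keywords could be checked along these lines. This is only a sketch: the function and parameter names are invented, and the caller is assumed to know how many texts the term appears in.

```python
def is_eligible_keyword(combined_count, texts_containing, project_files,
                        min_count=8):
    """Apply the minimum-count and more-than-one-text conditions."""
    # The term must occur 8 or more times across both subcorpora.
    if combined_count < min_count:
        return False
    # With only 1 or 2 files in the project, the spread condition is ignored.
    if project_files <= 2:
        return True
    # Otherwise the term must appear in more than one text.
    return texts_containing > 1

# A frequent term confined to one text of a ten-file project is excluded.
excluded = is_eligible_keyword(30, texts_containing=1, project_files=10)
```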
Calculation of 'Keyness' of features
The calculation of keyness of features is similar
to that of keywords, with minor exceptions. Basically the
annotations of segments in the focus corpus are treated
as words in a text.
The main exception is that while a word is required
to occur in more than one text, this is
not required for features (but the feature still needs
to occur 8 or more times).
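To make "treated as words in a text" concrete, the sketch below flattens the features of each annotated segment into a single count table and applies only the minimum-count condition. The segment data and names are invented, and representing a segment as a plain list of feature names is an assumption, not CorpusTool's storage format.

```python
from collections import Counter

def feature_counts(segments):
    """Count features across segments as if they were words in a text."""
    return Counter(feature for segment in segments for feature in segment)

def is_eligible_feature(combined_count, min_count=8):
    """Features need 8 or more occurrences; no more-than-one-text condition."""
    return combined_count >= min_count

# Invented annotations: each segment is a list of feature names.
segments = [
    ["clause", "finite", "declarative"],
    ["clause", "finite", "interrogative"],
    ["clause", "nonfinite"],
]
counts = feature_counts(segments)  # "clause" is counted 3 times
```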