Keywords, Multiword phrases and Key Features

This interface allows you to extract what is special about one subcorpus as compared to another reference corpus.

  1. Keywords: the words which occur more frequently in the subcorpus of interest in comparison with the reference corpus (as made famous by WordSmith and other concordance software).
  2. Key Phrases: the most frequent phrases of a given length in the corpus of interest.
  3. Key Features: the features which occur more frequently in a subcorpus when compared with some reference corpus. For instance, we can compare a group of low-proficiency learners to the rest of the learners to see what they do differently (at that layer).

Using the Interface

Select the "Keywords" pane from the selector at the top of the CorpusTool interface, then:

  1. Select an option from the "Type of Study" menu.
  2. The "Containing Unit" field: select the unit within which you want to explore keywords, phrases or features. For example, to see what is special about clauses in Editorials, select "clause in_segment editorial".
  3. The "Compare With" field: select the reference corpus against which the containing unit is compared. Here you have three options:
  4. (If 'Phrase'): "Phrase Length": the number of words in the multiple word phrases you wish to see.
  5. (If 'Key Features'): "Layer of Interest": the network whose features you wish to see. We might, for instance, specify the "Containing Unit" as Editorials and the "Layer of Interest" as "Grammar".

Calculation of 'Keyness' of keywords

In UAM CT, the keyness of a term is calculated as the relative frequency of the term in the subcorpus of interest divided by the relative frequency of the term in the reference corpus. Relative frequency is the count of the term in the subcorpus divided by the number of terms in that subcorpus.

Basically, a term with a keyness value of 2.0 occurs twice as often in the corpus of interest as it does in the reference corpus.
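
The basic ratio can be sketched in a few lines of Python. This is only an illustration of the definition above, not the code UAM CT itself runs; the function and parameter names are invented for the example.

```python
def relative_frequency(count, total_terms):
    """Count of a term divided by the number of terms in the subcorpus."""
    return count / total_terms


def keyness(count_focus, total_focus, count_ref, total_ref):
    """Relative frequency in the subcorpus of interest divided by the
    relative frequency in the reference corpus.

    Assumes the term occurs at least once in the reference corpus;
    the text above does not say what happens when it does not.
    """
    return (relative_frequency(count_focus, total_focus)
            / relative_frequency(count_ref, total_ref))


# Hypothetical counts: 30 hits in 10,000 terms versus 15 hits in 10,000 terms.
print(keyness(30, 10_000, 15, 10_000))
```

With these made-up counts the term is twice as frequent in the subcorpus of interest, so its keyness is 2.0.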

If the keyness value is over 100, a value of 100 is used.

A term must appear in more than one text to be included as a keyword; this stops consistent misspellings or personal names from rising to the top of the list. The condition is ignored if the project has only 1 or 2 files.

Where fewer than 20 instances of the term occur in the two subcorpora combined, the keyness value is scaled down in proportion to the shortfall: with 10 hits the keyness is halved, with 15 hits it is reduced by 25%, and so on.

A term will be included in the keyword list only if it occurs 8 or more times in the combination of the subcorpus of interest and the reference corpus.
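
Taken together, the rules above might be combined as in the following Python sketch. It is a reading of the description, not UAM CT's actual code: the names are invented, the order in which the cap and the low-count reduction are applied is an assumption, and so is the treatment of a term that never occurs in the reference corpus.

```python
CAP = 100.0              # keyness values above this are reported as 100
FULL_WEIGHT_COUNT = 20   # below this combined count, keyness is scaled down
MIN_COMBINED_COUNT = 8   # below this combined count, the term is dropped


def adjusted_keyness(count_focus, total_focus, count_ref, total_ref,
                     texts_with_term, files_in_project):
    """Return the adjusted keyness of a term, or None if it is excluded."""
    combined = count_focus + count_ref
    if combined < MIN_COMBINED_COUNT:
        return None
    # The more-than-one-text rule is ignored for projects of 1 or 2 files.
    if files_in_project > 2 and texts_with_term < 2:
        return None

    rel_focus = count_focus / total_focus
    rel_ref = count_ref / total_ref
    # Assumption: a term absent from the reference corpus gets the cap.
    raw = CAP if rel_ref == 0 else rel_focus / rel_ref
    value = min(raw, CAP)

    # Assumption: the low-count reduction is applied after the cap.
    if combined < FULL_WEIGHT_COUNT:
        value *= combined / FULL_WEIGHT_COUNT
    return value


# Hypothetical term: 8 hits in the focus subcorpus, 2 in the reference,
# each subcorpus 1,000 terms long, found in 3 texts of a 10-file project.
print(adjusted_keyness(8, 1_000, 2, 1_000, texts_with_term=3, files_in_project=10))
```

In this example the raw ratio is 4, but only 10 hits exist in total, so the value is halved to 2.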

Calculation of 'Keyness' of features

The calculation of keyness of features is similar to that of keywords, with minor exceptions. Basically, the annotations of segments in the focus corpus are treated as words in a text.

The main exception is that while a word must occur in more than one text, this is not required for features (but the feature still needs to occur 8 or more times).
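
As a rough illustration of treating annotations as words, the Python sketch below counts every feature attached to a segment as one token and then takes the same frequency ratio. It is a guess at the general idea, not UAM CT's code. The data layout (one list of feature names per segment) and all names are invented, relative frequency is assumed to be taken over the total number of feature annotations, and the cap at 100 and the reduction for low counts are left out for brevity.

```python
from collections import Counter


def feature_counts(segments):
    """Count each feature annotation as if it were a word token."""
    return Counter(feature for segment in segments for feature in segment)


def feature_keyness(focus_segments, ref_segments, min_combined=8):
    """Map each feature in the focus corpus to its unadjusted keyness."""
    focus = feature_counts(focus_segments)
    ref = feature_counts(ref_segments)
    focus_total = sum(focus.values())
    ref_total = sum(ref.values())

    result = {}
    for feature, n_focus in focus.items():
        n_ref = ref.get(feature, 0)
        if n_focus + n_ref < min_combined:
            continue  # fewer than 8 occurrences overall: not listed
        if n_ref == 0:
            continue  # assumption: skip features absent from the reference
        result[feature] = (n_focus / focus_total) / (n_ref / ref_total)
    return result


# Hypothetical annotations: each inner list holds one segment's features.
focus = [["clause", "passive"]] * 6 + [["clause", "active"]] * 2
reference = [["clause", "passive"]] * 2 + [["clause", "active"]] * 6
print(feature_keyness(focus, reference))
```

Note that there is no check on how many texts a feature appears in, in line with the exception described above.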