Back to ComputerTerms, InformationRetrieval
Keys:
Stemming is automated conflation (fusing or combining) of word variants, usually by reducing words to a common root. For example, needed-->need and engineering-->engineer. A stemmer may also produce disoriented-->oriented, which might be an example of overstemming because the two words differ in meaning.
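Stemming also provides a simple way to compress the index: only the stems are stored rather than every variant of each word.
Stemming will, in general, increase recall at the cost of decreased precision: a query matches more variant forms, but some of the extra matches are unrelated to what the user wanted.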
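The effect on index size and on recall/precision can be seen in a small sketch. Everything here is an assumption made for illustration: `toy_stem` is a hypothetical three-rule suffix stripper, not a real stemmer, and the four documents are invented.

```python
def toy_stem(word):
    """Hypothetical stemmer: strip the first matching suffix from a short list."""
    for suffix in ("ing", "ed", "s"):
        # Keep at least three characters so that "need" is not cut to "ne".
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_index(docs, normalize):
    """Map each normalized term to the set of document ids containing it."""
    index = {}
    for doc_id, words in docs.items():
        for word in words:
            index.setdefault(normalize(word), set()).add(doc_id)
    return index


# Invented documents, each reduced to a short word list.
DOCS = {
    1: ["engineer", "needed"],
    2: ["engineering", "need"],
    3: ["engineers", "needs"],
    4: ["news"],
}

raw_index = build_index(DOCS, lambda w: w)
stem_index = build_index(DOCS, toy_stem)

print(len(raw_index), len(stem_index))       # number of index terms
print(sorted(raw_index["engineer"]))         # exact match only
print(sorted(stem_index[toy_stem("engineer")]))  # all variants match
print(sorted(stem_index[toy_stem("new")]))   # "news" is conflated with "new"
```

With this toy data the stemmed index holds 3 terms instead of 7, and a query for "engineer" finds documents 1, 2 and 3 instead of only document 1 (higher recall). The last line shows the cost: a query for "new" now matches the document about "news", which is arguably a false match (lower precision).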
Methods of Conflation
https://www.scotnpatti.com/images/conflationmethods.jpg
Automatic approaches:
- Affix algorithms remove suffixes and/or prefixes from terms, leaving a stem. The name "stemmer" derives from this method, which is the most common. These are usually iterative longest-match stemmers, which remove the longest possible string from a word according to a set of rules, leaving a stem. The problem is that the wrong words may be conflated, for example skies matching ski instead of sky. Two methods are used to overcome this problem:
  - Recoding: a context-sensitive rule AxC --> AyC, where A and C are the surrounding context and x is rewritten to y, so that skies --> skyes and then the es can be removed.
  - Partial matching: only the n initial characters of the stems are used in comparing them. For example, we might say that two stems match if all but the last two characters match.
- Successor algorithms use the frequency of letter sequences in a body of text as the basis of stemming.
- n-gram methods conflate terms based on the number of digrams or n-grams they share. This takes the successor approach a little further: words are broken up into n-grams (sequences of n letters), and the number of n-grams two words have in common is counted.
- Lookup table: stemming can also be done by looking each term up in a table of terms and their stems. The method is simple, but the stems have to be determined either manually or by some automatic method that may introduce inaccuracies, because the English language is so complex.
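The affix approach, with recoding and partial matching, might be sketched as follows. The suffix list, the single recoding rule and the minimum stem length are assumptions chosen so that the skies example works; a real affix stemmer uses a much larger rule set.

```python
# Hypothetical rule set for illustration only.
SUFFIXES = ("ing", "ed", "es", "s")

# Recoding rules of the form AxC --> AyC, stored as (A, x, y, C).
RECODING_RULES = (("sk", "i", "y", "es"),)


def recode(word):
    """Apply the first recoding rule whose context A...C matches the word ending."""
    for before, old, new, after in RECODING_RULES:
        pattern = before + old + after
        if word.endswith(pattern):
            return word[: len(word) - len(pattern)] + before + new + after
    return word


def strip_longest_suffix(word, min_stem=3):
    """Remove the longest listed suffix that leaves at least min_stem characters."""
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


def stem(word, use_recoding=True):
    """Iterative longest-match stemming: strip suffixes until nothing changes."""
    if use_recoding:
        word = recode(word)
    while True:
        shorter = strip_longest_suffix(word)
        if shorter == word:
            return word
        word = shorter


def partial_match(stem_a, stem_b, n):
    """Partial matching: compare only the first n characters of two stems."""
    return stem_a[:n] == stem_b[:n]


print(stem("skies", use_recoding=False))  # ski
print(stem("skies"))                      # sky
print(stem("engineering"))                # engineer
print(partial_match("ski", "sky", 2))     # True
```

Without recoding, skies is cut to ski. With the recoding rule it first becomes skyes and is then cut to sky. Partial matching takes the other route: it leaves the stems alone and treats ski and sky as the same when only the first two characters are compared.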
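For the successor approach, one common formulation counts the "successor variety" of each prefix of a word: the number of distinct letters that follow that prefix in a body of text. The sketch below assumes that formulation and cuts the word at the first peak in the counts. The word list is invented for illustration, and peak cutting is only one of several possible segmentation rules.

```python
def successor_varieties(word, corpus):
    """For each prefix of word, count the distinct letters that follow it in corpus."""
    counts = []
    for i in range(1, len(word) + 1):
        prefix = word[:i]
        # Words that end exactly at the prefix contribute no following letter.
        followers = {w[i] for w in corpus if w.startswith(prefix) and len(w) > i}
        counts.append(len(followers))
    return counts


def peak_cut(word, corpus):
    """Return the prefix ending at the first count that exceeds both neighbours."""
    counts = successor_varieties(word, corpus)
    for i in range(1, len(counts) - 1):
        if counts[i] > counts[i - 1] and counts[i] > counts[i + 1]:
            return word[: i + 1]
    return word  # no peak found: leave the word unstemmed


# Invented body of text, reduced to a word list.
CORPUS = ["able", "ape", "beatable", "fixable", "read", "readable",
          "reading", "reads", "red", "rope", "ripe"]

print(successor_varieties("readable", CORPUS))  # [3, 2, 1, 3, 1, 1, 1, 0]
print(peak_cut("readable", CORPUS))             # read
```

The count rises to 3 after "read" because three different letters (a, i, s) follow it in the word list, which suggests a boundary between the stem read and the suffix able.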
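The n-gram approach needs a way to turn "number of shared digrams" into a similarity score. The sketch below assumes Dice's coefficient over the sets of unique digrams, which is one common choice and is not named in the notes above. The threshold at which two words count as the same term is left to the reader.

```python
def digrams(word):
    """Return the set of unique adjacent letter pairs in word."""
    return {word[i:i + 2] for i in range(len(word) - 1)}


def dice_similarity(word_a, word_b):
    """Dice's coefficient: 2 * shared / (unique in a + unique in b)."""
    a, b = digrams(word_a), digrams(word_b)
    if not a and not b:
        return 0.0  # words too short to have any digrams
    return 2 * len(a & b) / (len(a) + len(b))


print(sorted(digrams("statistics")))
print(dice_similarity("statistics", "statistical"))  # 0.8
print(dice_similarity("statistics", "engineer"))     # 0.0
```

Here statistics has 7 unique digrams, statistical has 8, and they share 6, giving 2 * 6 / (7 + 8) = 0.8. Computing this score for every pair of terms gives a similarity matrix that can then be clustered into conflation groups.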
Criteria for Judging Stemmers
- Correctness
- Retrieval effectiveness
- Compression performance
- Overstemming
- Understemming
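Overstemming and understemming can be counted if a hand-made grouping of words is available. The sketch below is one possible way to do it, under stated assumptions: `GOLD_GROUPS` is an invented grouping of words that should share a stem, and `toy_stem` is a hypothetical three-rule suffix stripper standing in for the stemmer under test.

```python
from itertools import combinations


def toy_stem(word):
    """Hypothetical stemmer under test: strip the first matching suffix."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Invented gold standard: words in the same group should share a stem.
GOLD_GROUPS = {
    "need": ["need", "needed", "needs"],
    "new": ["new", "newer"],
    "news": ["news"],
}


def count_errors(groups, stem):
    """Return (overstemming, understemming) counts over all word pairs."""
    labelled = [(word, concept)
                for concept, words in groups.items()
                for word in words]
    over = under = 0
    for (w1, c1), (w2, c2) in combinations(labelled, 2):
        same_stem = stem(w1) == stem(w2)
        if c1 == c2 and not same_stem:
            under += 1  # related words that were kept apart
        elif c1 != c2 and same_stem:
            over += 1   # unrelated words that were merged
    return over, under


print(count_errors(GOLD_GROUPS, toy_stem))  # (1, 1)
```

With this data the toy stemmer makes one overstemming error (news is merged with new) and one understemming error (newer is not merged with new, because there is no rule for the suffix er).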