Revision 6 as of 2004-04-08 16:24:35
Back to ComputerTerms, InformationRetrieval
A stoplist is a list of words that, for reasons of volume or ["Precision"] and ["Recall"], are not included in the index and hence are not searchable, e.g. "and", "or", "not".
There are two ways to filter stoplist words from an input token stream:
- Examine lexical analyzer output and remove any stopwords
- Remove stopwords as part of the lexical analysis: This is one of the more efficient ways to implement a StopList 
If we implement the first method, we must look up every token the lexical analyzer produces in a stoplist structure. Hashing is probably the fastest way to do this. We can even fold hashing into the lexical analysis itself by computing the hash code while the token is being generated. One issue remains: unless we are using a perfect hashing algorithm, a hash hit must still be confirmed by comparing the token against the stored stopword.
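The first method can be sketched in a few lines of Python. This is only an illustration under assumptions: the `tokenize` function is a toy stand-in for a real lexical analyzer, the names `STOPLIST`, `tokenize` and `filter_stopwords` are made up for this example, and the stoplist holds just the three words mentioned above. Python's `frozenset` is a hash table, and since it is not a perfect hash it performs the confirming string comparison internally.

```python
import re

# Hypothetical stoplist; the page only names "and", "or", "not" as examples.
STOPLIST = frozenset({"and", "or", "not"})

def tokenize(text):
    """Toy lexical analyzer (an assumption): yield lowercase alphabetic tokens."""
    for match in re.finditer(r"[a-z]+", text.lower()):
        yield match.group()

def filter_stopwords(tokens, stoplist=STOPLIST):
    """Drop tokens found in the stoplist; one hash lookup per token."""
    return (tok for tok in tokens if tok not in stoplist)

# Tokens that would go on to be indexed.
indexable = list(filter_stopwords(tokenize("Precision and recall, not volume")))
```

Note that every token, stopword or not, pays for a lookup after it has already been scanned once by the lexer.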
The second method is better: we have to do lexical analysis anyway, and recognizing even a large stoplist during that pass can be done at almost no extra cost. This was shown (in my IR book) using a lexical analyzer generator that produces a minimum-state deterministic finite automaton.
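To give a feel for the second method, here is a rough hand-written sketch, not the generated automaton the book describes. It assumes a trie (an unminimized deterministic automaton) built from the same three example stopwords, and the names `build_trie` and `lex_without_stopwords` are invented for this example. The scanner advances the automaton one state per character while it is reading the token, so no separate lookup is needed once the token ends.

```python
# Hypothetical stoplist; the page only names "and", "or", "not" as examples.
STOPWORDS = ("and", "or", "not")

def build_trie(words):
    """Build a trie: each node maps a character to a child node.
    The key None marks a node where a complete stopword ends."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[None] = True
    return root

def lex_without_stopwords(text, trie):
    """Scan text once, returning lowercase alphabetic tokens that are not stopwords."""
    tokens = []
    chars = []      # characters of the token being scanned
    state = trie    # current automaton state; None means "cannot be a stopword"
    for ch in text.lower() + " ":   # trailing space flushes the last token
        if ch.isalpha():
            chars.append(ch)
            if state is not None:
                state = state.get(ch)   # follow the transition, or fall off the trie
        else:
            is_stopword = state is not None and None in state
            if chars and not is_stopword:
                tokens.append("".join(chars))
            chars = []
            state = trie
    return tokens

TRIE = build_trie(STOPWORDS)
result = lex_without_stopwords("Precision and recall, not volume", TRIE)
```

A lexical analyzer generator would go further and merge these transitions into the token-recognizing automaton itself, then minimize the number of states.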
