Back to ComputerTerms, InformationRetrieval

See Also: StopWords

Lexical analysis is the process of converting an input stream of characters into a stream of words or tokens. '''Tokens''' are groups of characters with collective significance. '''Lexical analysis is the first stage of automated indexing and of query processing'''.

Issues:

   * Digits: Numbers are not usually allowed as tokens, but we might allow words to contain digits as long as they don't start with one. There are exceptions, of course!
   * Hyphens: Consistency is important, but there will be problems nonetheless, whether hyphenated words are kept whole or split into their parts.
   * ",._?`~" and other punctuation may be an integral part of a word. How we deal with this depends on the kind of information we are indexing!
   * Case: Usually just make everything lower case!
   * Choosing delimiters is also very important: usually any white space and any unrecognized punctuation or control characters are delimiters.
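
The issues above can be sketched as a small tokenizer. This is only an illustration under stated assumptions, not a reference implementation: the function name `tokenize` and the `keep_hyphens` flag are hypothetical, only ASCII letters and digits count as word characters, and every other character except the hyphen is treated as a delimiter.

```python
import re

# Assumption: anything that is not an ASCII letter, digit, or hyphen is a delimiter.
DELIMITERS = re.compile(r"[^A-Za-z0-9-]+")


def tokenize(text, keep_hyphens=True):
    """Split text into lower-cased tokens, dropping tokens that start with a digit."""
    tokens = []
    for chunk in DELIMITERS.split(text):
        chunk = chunk.strip("-")  # ignore leading and trailing hyphens
        # Hyphen policy: keep the hyphenated word whole, or split it into parts.
        parts = [chunk] if keep_hyphens else chunk.split("-")
        for part in parts:
            # Digits: allowed inside a word, but not as its first character.
            if part and not part[0].isdigit():
                tokens.append(part.lower())  # Case: fold everything to lower case
    return tokens


if __name__ == "__main__":
    print(tokenize("The B12 vitamin, state-of-the-art in 1984!"))
    print(tokenize("The B12 vitamin, state-of-the-art in 1984!", keep_hyphens=False))
```

Whichever hyphen policy is chosen, the same policy has to be applied to both documents and queries, otherwise terms produced at indexing time will not match terms produced at query time.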

Implementation:

   1. Use a lexical analyzer generator like lex: This is the best approach when the lexical analyzer is complicated.
   1. Write a lexical analyzer by hand, ad hoc: The worst solution; it will likely have subtle errors and may not be efficient.
   1. Write a lexical analyzer by hand as a finite state machine: A reasonable approach, and the one our book chose to implement.
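
To make the third option concrete, here is a minimal sketch of a hand-written finite state machine lexer. It is not the book's code: the state names, the function name `fsm_tokenize`, and the choice to treat hyphens as delimiters are all assumptions made for illustration.

```python
from enum import Enum


class State(Enum):
    BETWEEN = 0    # skipping delimiters between tokens
    IN_WORD = 1    # inside a token that started with a letter
    IN_NUMBER = 2  # inside a run that started with a digit (discarded)


def fsm_tokenize(text):
    """Scan text one character at a time and return lower-cased tokens."""
    tokens = []
    current = []
    state = State.BETWEEN
    # A trailing space acts as a sentinel delimiter so the last token is flushed.
    for ch in text + " ":
        if state is State.BETWEEN:
            if ch.isalpha():
                current = [ch.lower()]
                state = State.IN_WORD
            elif ch.isdigit():
                state = State.IN_NUMBER
            # any other character is a delimiter: stay in BETWEEN
        elif state is State.IN_WORD:
            if ch.isalnum():
                current.append(ch.lower())
            else:
                tokens.append("".join(current))
                state = State.BETWEEN
        else:  # State.IN_NUMBER
            if not ch.isalnum():
                state = State.BETWEEN
    return tokens


if __name__ == "__main__":
    print(fsm_tokenize("Lex 2nd pass, B12!"))
```

Each character is examined exactly once and the only memory is the current state and the partial token, which is why this style tends to be both efficient and easier to check than ad hoc string handling.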
