Back to ComputerTerms

Topic: InformationRetrieval

= How to create an inverted file representation =

Step 1: Parse the documents to extract tokens. Save each token together with its document ID (duplicates are allowed). The file is formatted in two columns: '''Term, Document Number'''. Note: we may optionally record each token's position within the document as well, if we plan to run any proximity tests.

Step 2: Sort the file alphabetically by term.

Step 3: Aggregate the duplicates for each '''Document''', so that every (term, document) pair appears once along with its count. The file is now formatted in three columns: '''Term, Document Number, Frequency'''.

What you now have is an inverted file implementation. It can be split into a '''Lexicon (Dictionary)''' and a '''Postings file'''.

== Example ==

||Document||||Keywords||
||1||||CS(2), UNL(3), Ferguson(5), Lincoln(2)||
||2||||Lincoln(3), CS(4), Computer(6)||
||3||||CS(3)||
||4||||university(2), UNL(2), CS(1)||
||5||||Ferguson(1)||

'''Here is the inverted file:'''

||Term||||Document Number||||Frequency||
||Computer||||2||||6||
||CS||||1||||2||
||CS||||2||||4||
||CS||||3||||3||
||CS||||4||||1||
||Ferguson||||1||||5||
||Ferguson||||5||||1||
||Lincoln||||1||||2||
||Lincoln||||2||||3||
||university||||4||||2||
||UNL||||1||||3||
||UNL||||4||||2||

To see the split into a Lexicon and a Postings file, see: PostingsFile
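
The three steps above can be sketched in a few lines of Python. This is only an illustration, not a reference implementation: the names `DOCS` and `build_inverted_file` are made up for this page, the parsing in Step 1 is simulated by repeating each keyword as many times as its frequency in the example, and the sort is assumed to be case-insensitive because that is the order the example table uses.

```python
from itertools import groupby

# Example documents from this page, stored as keyword -> frequency.
# (Hypothetical input format; a real parser would produce tokens from text.)
DOCS = {
    1: {"CS": 2, "UNL": 3, "Ferguson": 5, "Lincoln": 2},
    2: {"Lincoln": 3, "CS": 4, "Computer": 6},
    3: {"CS": 3},
    4: {"university": 2, "UNL": 2, "CS": 1},
    5: {"Ferguson": 1},
}


def build_inverted_file(docs):
    """Return rows of (term, document number, frequency)."""
    # Step 1: one (term, doc_id) pair per token occurrence, duplicates allowed.
    pairs = [
        (term, doc_id)
        for doc_id, counts in docs.items()
        for term, n in counts.items()
        for _ in range(n)
    ]

    # Step 2: sort alphabetically by term (case-insensitive, an assumption),
    # then by document number so that duplicates end up next to each other.
    pairs.sort(key=lambda pair: (pair[0].lower(), pair[1]))

    # Step 3: collapse each run of identical pairs into a single row
    # carrying the length of the run as the frequency.
    return [
        (term, doc_id, len(list(run)))
        for (term, doc_id), run in groupby(pairs)
    ]


if __name__ == "__main__":
    for term, doc_id, freq in build_inverted_file(DOCS):
        print(f"{term}\t{doc_id}\t{freq}")
```

Under these assumptions the rows come out in the same order and with the same frequencies as the inverted file table in the example. Sorting before aggregating is what lets Step 3 work in a single pass over adjacent duplicates; a hash-based counter would also work but would skip the explicit sort that the steps describe.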