Topic: InformationRetrieval
How to create an inverted file representation
Step 1: Parse the documents to extract tokens, and save each token together with its Document ID (duplicates are allowed). The resulting file has two columns: Term, Document Number. Note: if proximity tests are needed, the location of each token within the document can be recorded as well.
Step 2: Sort the file alphabetically by term and, within each term, by document number.
Step 3: Aggregate the duplicates for each document. The file now has three columns: Term, Document Number, Frequency.
What you now have is an inverted file implementation.
This can be split into a Lexicon (Dictionary) and a Postings file.
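The three steps above can be sketched in a few lines of Python. This is only an illustration under simple assumptions: tokens are obtained by splitting on whitespace, the sort ignores case, and the names `build_inverted_file` and `documents` are made up for this sketch.

```python
from itertools import groupby


def build_inverted_file(documents):
    """Build (term, document number, frequency) rows.

    `documents` is assumed to map a document number to its text.
    """
    # Step 1: one (term, document number) pair per token; duplicates are kept.
    pairs = [
        (token, doc_id)
        for doc_id, text in documents.items()
        for token in text.split()
    ]
    # Step 2: sort by term (ignoring case), then by document number.
    # The exact term is part of the key so that identical pairs end up adjacent.
    pairs.sort(key=lambda p: (p[0].lower(), p[0], p[1]))
    # Step 3: collapse each run of identical pairs into a single row with a count.
    return [
        (term, doc_id, len(list(group)))
        for (term, doc_id), group in groupby(pairs)
    ]


rows = build_inverted_file({1: "CS UNL CS UNL UNL", 2: "Lincoln CS"})
for row in rows:
    print(row)
```

Because the aggregation relies on `groupby`, it only works on sorted input, which mirrors why Step 2 comes before Step 3.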
Example
| Document | Keywords (frequency) |
| 1 | CS(2), UNL(3), Ferguson(5), Lincoln(2) |
| 2 | Lincoln(3), CS(4), Computer(6) |
| 3 | CS(3) |
| 4 | university(2), UNL(2), CS(1) |
| 5 | Ferguson(1) |
Here is the resulting inverted file:
| Term | Document Number | Frequency |
| Computer | 2 | 6 |
| CS | 1 | 2 |
| CS | 2 | 4 |
| CS | 3 | 3 |
| CS | 4 | 1 |
| Ferguson | 1 | 5 |
| Ferguson | 5 | 1 |
| Lincoln | 1 | 2 |
| Lincoln | 2 | 3 |
| university | 4 | 2 |
| UNL | 1 | 3 |
| UNL | 4 | 2 |
To see how this is split into a Lexicon and a Postings file, see: PostingsFile
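As a rough preview of that split, the sketch below separates the example rows into a lexicon and a postings list. The layout is an assumption and may differ from the one described on PostingsFile: here each lexicon entry holds the term, the number of documents containing it, and the offset of its first posting, while each posting holds a document number and a frequency. The name `split_inverted_file` is made up for this sketch.

```python
def split_inverted_file(rows):
    """Split sorted (term, document number, frequency) rows.

    Returns (lexicon, postings) in the assumed layout:
      lexicon  -> (term, number of documents, offset of first posting)
      postings -> (document number, frequency)
    """
    lexicon = []
    postings = []
    for term, doc_id, freq in rows:
        if lexicon and lexicon[-1][0] == term:
            # Same term as the previous row: bump its document count.
            prev_term, n_docs, offset = lexicon[-1]
            lexicon[-1] = (prev_term, n_docs + 1, offset)
        else:
            # New term: its postings start at the current end of the list.
            lexicon.append((term, 1, len(postings)))
        postings.append((doc_id, freq))
    return lexicon, postings


# The inverted file from the example above.
rows = [
    ("Computer", 2, 6),
    ("CS", 1, 2), ("CS", 2, 4), ("CS", 3, 3), ("CS", 4, 1),
    ("Ferguson", 1, 5), ("Ferguson", 5, 1),
    ("Lincoln", 1, 2), ("Lincoln", 2, 3),
    ("university", 4, 2),
    ("UNL", 1, 3), ("UNL", 4, 2),
]
lexicon, postings = split_inverted_file(rows)
for entry in lexicon:
    print(entry)
```

With this layout, the postings for a term are the slice `postings[offset:offset + n_docs]`, so a lookup only needs to search the much smaller lexicon.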
