Differences between revisions 6 and 17 (spanning 11 versions)
Revision 6 as of 2004-04-08 01:39:40
Size: 282
Editor: yakko
Comment:
Revision 17 as of 2021-10-26 16:08:37
Size: 2423
Editor: 216
Comment:
   * ClusteringAlgorithms
   * LexicalAnalysis
   * [[Precision]]
   * [[Recall]]
   * SuperImposedCoding
'''This link provides a nice glossary of terms http://www.cs.jhu.edu/~weiss/glossary.html'''

== Description ==

Examples: Library catalogs

Generally the '''data''' are organized as a collection of '''documents'''.

== Querying ==

Querying of unstructured textual data is referred to as '''Information Retrieval'''. It covers the following areas:

   * Querying based on keywords
   * The relevance of documents to the query
   * The analysis, classification and indexing of documents.

Queries are formed from keywords and the logical connectives ''and'', ''or'', and ''not''. The ''and'' connective is implicit: keywords listed without a connective must all be present.
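
As a rough illustration of keyword queries with the three connectives, here is a minimal sketch in Python. It assumes documents are already split into lists of words and uses a small inverted index (a map from each word to the set of documents containing it). The function names `build_index`, `query_and`, `query_or` and `query_not` are made up for this example.

```python
def build_index(docs):
    """Map each word to the set of ids of the documents that contain it."""
    index = {}
    for doc_id, terms in enumerate(docs):
        for term in terms:
            index.setdefault(term, set()).add(doc_id)
    return index


def query_and(index, keywords, all_ids):
    """Documents containing every keyword (the implicit connective)."""
    result = set(all_ids)
    for keyword in keywords:
        result &= index.get(keyword, set())
    return result


def query_or(index, keywords):
    """Documents containing at least one keyword."""
    result = set()
    for keyword in keywords:
        result |= index.get(keyword, set())
    return result


def query_not(index, keyword, all_ids):
    """Documents that do not contain the keyword."""
    return set(all_ids) - index.get(keyword, set())


# Hypothetical three-document collection.
docs = [
    "library catalog search".split(),
    "library opening hours".split(),
    "catalog of hours".split(),
]
index = build_index(docs)
all_ids = range(len(docs))

print(query_and(index, ["library", "catalog"], all_ids))  # {0}
print(query_or(index, ["search", "hours"]))               # {0, 1, 2}
print(query_not(index, "library", all_ids))               # {2}
```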

'''Full Text''': in full-text retrieval, all words in a document are treated as ''keywords''. We use '''term''' to refer to the words in a document, since all words are keywords.

Given a document ''d'' and a term ''t'', one way of defining the relevance ''r'' is

$$$r(d,t)=\log\left(1+\frac{n(d,t)}{n(d)}\right)$$$

n(d) denotes the number of terms in the document, and n(d,t) denotes the number of occurrences of term t in the document d.
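
The formula above can be sketched in Python as follows. This is only an illustration: it assumes the document is already split into a list of terms, it uses the natural logarithm (the text does not state a base), and the name `term_relevance` is invented for the example.

```python
import math
from collections import Counter


def term_relevance(doc_terms, term):
    """Return r(d, t) = log(1 + n(d, t) / n(d)) for a tokenized document."""
    n_d = len(doc_terms)                 # n(d): number of terms in the document
    if n_d == 0:
        return 0.0                       # assumption: an empty document scores zero
    n_dt = Counter(doc_terms)[term]      # n(d, t): occurrences of the term in d
    return math.log(1 + n_dt / n_d)


doc = "the cat sat on the mat".split()
print(term_relevance(doc, "the"))   # log(1 + 2/6)
print(term_relevance(doc, "dog"))   # 0.0, the term does not occur
```

Because of the logarithm, doubling the number of occurrences of a term less than doubles its relevance.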

KEY: In the information retrieval community, the relevance of a document to a term is referred to as '''term frequency''', regardless of the exact formula used.

'''Inverse document frequency''' (IDF) is defined as:

$$$IDF = \frac{1}{n(t)}$$$

where n(t) denotes the number of documents that contain the term t.

The IDF is low when the term is found in many of the documents and high when it is found in only a few. A term with a high IDF distinguishes documents from one another, so it is probably a good term to use in a query.
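
A small Python sketch of this definition, assuming the collection is a list of tokenized documents. The names `document_frequency` and `idf` are invented here, and returning 0.0 for a term that occurs nowhere is a choice made for the example, since 1/n(t) is undefined when n(t) is zero.

```python
def document_frequency(docs, term):
    """n(t): the number of documents that contain the term."""
    return sum(1 for doc_terms in docs if term in doc_terms)


def idf(docs, term):
    """IDF = 1 / n(t), as defined in the text."""
    n_t = document_frequency(docs, term)
    if n_t == 0:
        return 0.0   # assumption: a term found in no document gets IDF 0
    return 1 / n_t


# Hypothetical collection of three tokenized documents.
docs = [
    ["the", "cat"],
    ["the", "dog"],
    ["the", "bird", "cat"],
]
print(idf(docs, "the"))   # in all 3 documents, lowest IDF
print(idf(docs, "cat"))   # in 2 documents
print(idf(docs, "dog"))   # in 1 document, highest IDF
```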

Thus the '''relevance''' of a document ''d'' to a set of terms ''Q'' is defined as

$$$r(d,Q)=\sum_{t \in Q}\frac{r(d,t)}{n(t)}$$$

This can be refined by letting the user weight the terms of the query:

$$$r(d,Q)=\sum_{t \in Q}\frac{w(t) r(d,t)}{n(t)}$$$

where w(t) is a weight specified by the user.
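
Putting the pieces together, the following Python sketch scores and ranks a small collection against a query. It is an illustration under assumptions: documents are lists of terms, the logarithm is natural, terms without a user weight default to w(t) = 1 (which gives the unweighted formula), and the names `relevance` and `rank` are invented for the example.

```python
import math


def relevance(doc_terms, docs, query, weights=None):
    """r(d, Q): sum over t in Q of w(t) * r(d, t) / n(t)."""
    weights = weights or {}
    score = 0.0
    for term in query:
        n_t = sum(1 for other in docs if term in other)   # n(t)
        if n_t == 0 or not doc_terms:
            continue          # assumption: such terms contribute nothing
        r_dt = math.log(1 + doc_terms.count(term) / len(doc_terms))
        score += weights.get(term, 1.0) * r_dt / n_t
    return score


def rank(docs, query, weights=None):
    """Return document indices ordered from most to least relevant."""
    return sorted(
        range(len(docs)),
        key=lambda i: relevance(docs[i], docs, query, weights),
        reverse=True,
    )


# Hypothetical collection and query.
docs = [
    "information retrieval ranks documents".split(),
    "database systems store documents".split(),
    "retrieval of information from documents by information need".split(),
]
query = ["information", "retrieval"]

print(rank(docs, query))   # [0, 2, 1]
```

The second document contains neither query term and scores zero. The third contains both, but it is twice as long as the first, so its term frequencies are lower.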

'''KEY: Stop words''' are words that are not indexed, such as ''and'', ''or'', ''the'', and ''a''.
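
Dropping stop words before indexing might look like the sketch below. The stop list contains only the four examples given above; real stop lists are longer. The name `index_terms` and the lowercase, whitespace-based tokenization are assumptions made for the example.

```python
# Illustrative stop list, taken from the examples in the text.
STOP_WORDS = {"and", "or", "the", "a"}


def index_terms(text):
    """Split text into lowercase words and drop the stop words."""
    return [word for word in text.lower().split() if word not in STOP_WORDS]


print(index_terms("The cat and a dog"))   # ['cat', 'dog']
```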

'''Proximity''': if the terms occur close to each other in the document, the document would be ranked higher than if they occur far apart. We could (although we don't) modify the formula $$r(d,Q)$$ to take proximity into account.
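
The formula on this page does not use proximity, but how close two terms are can be measured simply, for instance as the smallest gap in word positions between them. The Python sketch below shows one such measure. The name `min_distance` is invented here, and how the result would feed into a ranking is left open.

```python
def min_distance(doc_terms, t1, t2):
    """Smallest gap in word positions between t1 and t2, or None if either is absent."""
    positions1 = [i for i, word in enumerate(doc_terms) if word == t1]
    positions2 = [i for i, word in enumerate(doc_terms) if word == t2]
    if not positions1 or not positions2:
        return None
    return min(abs(i - j) for i in positions1 for j in positions2)


near = "information retrieval systems".split()
far = "information about storage and later retrieval".split()

print(min_distance(near, "information", "retrieval"))   # 1
print(min_distance(far, "information", "retrieval"))    # 5
```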

Back to ComputerTerms


InformationRetrieval (last edited 2021-10-26 16:08:37 by 216)