
Chapter 1 + Section 2.1 Introduction


attachment:InformationRetreivalProcess.jpg

Information Retrieval Process


  • Three Models of Browsing
    • Flat
    • Structure guided
    • Hypertext

Section 2.2 A taxonomy of Information Retrieval Models

  • Predicting which documents are relevant is usually dependent on a ranking algorithm.

  • The three classic models in information retrieval are:
    • Boolean Model: In the boolean model, documents and queries are represented as sets of index terms; thus we say this model is set theoretic.

    • Vector Model: In the vector model documents and queries are represented as vectors in a t-dimensional space, thus we say that the model is algebraic.

    • Probabilistic Model: The framework for modeling document and query representations is based on probability theory, and thus we say that the model is probabilistic.

Section 2.3 Retrieval: Ad hoc and Filtering

The following is the formal definition for IR from MIR, p. 23.

Definition An information retrieval model is a quadruple [D, Q, F, R(q_i, d_j)] where

  • D is a set composed of logical views (or representations) for the documents in the collection.

  • Q is a set composed of logical views (or representations) for the user information needs. Such representations are called queries

  • F is a framework for modeling document representations, queries and their relationships.

  • R(q_i, d_j) is a ranking function which associates a real number with a query $q_i \in Q$ and a document representation $d_j \in D$. Such ranking defines an ordering among the documents with regard to the query $q_i$.

Section 2.5.1 Basic Concepts of Classic IR

  • Each document is described by a set of representative keywords called index terms.

  • Index terms are usually nouns. Why? Because verbs, adjectives, connectives, etc. have little meaning on their own.
  • Index terms have weights described as follows:

Definition Let $t$ be the number of index terms in the system and $k_i$ be a generic index term. $K = \{k_1, ..., k_t\}$ is the set of all index terms. A weight $w_{i,j} > 0$ is associated with each index term $k_i$ of a document $d_j$. For an index term which does not appear in the document text, $w_{i,j} = 0$. With document $d_j$ is associated an index term vector $\vec{d}_j = (w_{1,j}, w_{2,j}, ..., w_{t,j})$. Further, let $g_i$ be a function that returns the weight associated with the index term $k_i$ in any $t$-dimensional vector (i.e., $g_i(\vec{d}_j) = w_{i,j}$).
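
As a toy illustration of this definition (the term names and weights here are invented, and indexing is 1-based to match the notation):

```python
# Index term vector for a document over K = {k1, ..., kt}, with g_i
# returning the weight of term k_i in any t-dimensional vector.
K = ["information", "retrieval", "model"]  # t = 3 index terms (invented)

d_j = (0.8, 0.5, 0.0)  # w_{3,j} = 0: the third term does not appear in d_j

def g(i, vec):
    """g_i(d_j) = w_{i,j} (1-based index i, as in the definition)."""
    return vec[i - 1]

print(g(1, d_j))  # weight of k_1 in d_j
print(g(3, d_j))  # 0.0: term absent from the document
```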

Section 2.5.2 Boolean Model

  • index terms are weighted either 0 or 1.
  • Queries are expressed in DisjunctiveNormalForm. For example, the query $[q = k_a \wedge (k_b \vee \neg k_c)]$ can be written in disjunctive normal form as $[\vec{q}_{dnf} = (1,1,1) \vee (1,1,0) \vee (1,0,0)]$, where each component is a binary vector of weights over $(k_a, k_b, k_c)$.

Definition For the Boolean model, the index term weight variables are all binary, i.e., $w_{i,j} \in \{0,1\}$. A query $q$ is a conventional boolean expression. Let $\vec{q}_{dnf}$ be the disjunctive normal form for the query $q$. Further, let $\vec{q}_{cc}$ be any of the conjunctive components of $\vec{q}_{dnf}$. The similarity of a document $d_j$ to the query $q$ is defined as

$$$sim(d_j,q) = \begin{cases} 1 & \text{if } \exists \vec{q}_{cc} \mid (\vec{q}_{cc} \in \vec{q}_{dnf}) \wedge (\forall k_i, g_i(\vec{d}_j) = g_i(\vec{q}_{cc})) \\ 0 & \text{otherwise} \end{cases}$$$

If sim(dj,q)=1 then the Boolean model predicts that the document dj is relevant to the query q (it might not be). Otherwise, the prediction is that the document is not relevant.
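
A minimal sketch of this similarity check: a document matches the query iff its binary weights on the query's index terms equal some conjunctive component of the DNF. The example query is the three-term $k_a \wedge (k_b \vee \neg k_c)$ from above; everything else is invented for illustration.

```python
# Boolean model: sim(d_j, q) = 1 iff some conjunctive component of q_dnf
# matches the document's binary weights on the query's index terms.

def boolean_sim(doc_weights, q_dnf):
    """doc_weights: tuple of 0/1 weights over the query's index terms.
    q_dnf: list of conjunctive components, each a tuple of 0/1."""
    return 1 if any(cc == doc_weights for cc in q_dnf) else 0

# DNF of q = ka AND (kb OR NOT kc), over terms (ka, kb, kc):
q_dnf = [(1, 1, 1), (1, 1, 0), (1, 0, 0)]

print(boolean_sim((1, 1, 0), q_dnf))  # 1: matches component (1,1,0)
print(boolean_sim((0, 1, 1), q_dnf))  # 0: no component matches
```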

Section 2.5.3 Vector Model

  • Weights are non-binary, i.e., real numbers.
  • This allows us to calculate a degree of similarity.

Definition For the vector model, the weight $w_{i,j}$ associated with a pair $(k_i, d_j)$ is positive and non-binary. Further, the index terms in the query are also weighted. Let $w_{i,q}$ be the weight associated with the pair $[k_i, q]$, where $w_{i,q} \geq 0$. Then, the query vector $\vec{q}$ is defined as $\vec{q} = (w_{1,q}, ..., w_{t,q})$ where $t$ is the total number of index terms in the system. As before, the vector for a document $d_j$ is represented by $\vec{d}_j = (w_{1,j}, ..., w_{t,j})$.

Thus we have a document $d_j$ and user query $q$ represented as $t$-dimensional vectors. Similarity is defined as the cosine of the angle between the document vector and the query vector as follows:

attachment:csce810_2_2.png

Figure 2
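
The cosine of the angle between the two vectors can be sketched directly (the weight vectors here are invented; a zero vector is given similarity 0 to avoid division by zero):

```python
import math

def cosine_sim(d, q):
    """Cosine of the angle between document and query weight vectors:
    sim(d_j, q) = (d . q) / (|d| * |q|)."""
    dot = sum(wd * wq for wd, wq in zip(d, q))
    norm_d = math.sqrt(sum(w * w for w in d))
    norm_q = math.sqrt(sum(w * w for w in q))
    if norm_d == 0 or norm_q == 0:
        return 0.0
    return dot / (norm_d * norm_q)

print(cosine_sim([1.0, 2.0, 0.0], [1.0, 2.0, 0.0]))  # close to 1.0 (same direction)
print(cosine_sim([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))  # 0.0 (orthogonal vectors)
```

Since weights are non-negative, the similarity ranges from 0 to 1, which makes the threshold-based ranking below straightforward.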

Having defined similarity, we now have a method for ranking, and we can determine what is relevant by setting a threshold on the degree of similarity. This is essentially a clustering algorithm that clusters documents similar to the query in one set and documents dissimilar to the query in the other set. We say the IR problem can be reduced to a clustering problem: determining which documents belong to the relevant set A and which do not. There are two main issues to resolve in a clustering problem:

  1. intra-cluster similarity --> term frequency (tf): One needs to determine what are the features which better describe the objects in set A.

  2. inter-cluster dissimilarity --> inverse document frequency (idf): One needs to determine what are the features which better distinguish the objects in set A from the remaining objects in collection C (the set of all documents).

Definition

Let $N$ be the total number of documents in the system and $n_i$ be the number of documents in which the index term $k_i$ appears. Let $freq_{i,j}$ be the raw frequency of term $k_i$ in the document $d_j$ (i.e., the number of times the term $k_i$ is mentioned in the text of the document $d_j$). Then, the normalized frequency $f_{i,j}$ of term $k_i$ in document $d_j$ is given by:

$$$f_{i,j} = \frac{freq_{i,j}}{\max_{l}(freq_{l,j})}$$$

where the maximum is computed over all terms which are mentioned in the text of the document $d_j$. If the term $k_i$ does not appear in the document $d_j$ then $f_{i,j} = 0$. Further, let $idf_i$, the inverse document frequency for $k_i$, be given by:

$$$idf_{i} = \log\frac{N}{n_i}$$$

The best known term weighting schemes use weights which are given by

$$$w_{i,j} = f_{i,j} \times \log\frac{N}{n_i}$$$

or by a variation of this formula. Such term-weighting strategies are called tf-idf schemes.

One such term-weighting formula was given by Salton and Buckley as

$$$w_{i,q} = \left( 0.5 + \frac{0.5 \, freq_{i,q}}{\max_{l}(freq_{l,q})}\right) \times \log\frac{N}{n_i}$$$
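
The tf-idf weighting can be sketched on a toy collection (the documents and term names below are invented, and real systems would tokenize and stem the text first):

```python
import math
from collections import Counter

# Toy collection of pre-tokenized documents (invented for illustration).
docs = [
    ["information", "retrieval", "model"],
    ["vector", "model", "model"],
    ["boolean", "retrieval"],
]
N = len(docs)
# n_i: number of documents containing term k_i
n = Counter()
for d in docs:
    for term in set(d):
        n[term] += 1

def tf_idf(term, doc):
    """w_{i,j} = f_{i,j} * log(N / n_i), with f_{i,j} normalized by the
    maximum raw frequency of any term in the document."""
    freqs = Counter(doc)
    f = freqs[term] / max(freqs.values())
    return f * math.log(N / n[term])

print(tf_idf("model", docs[1]))    # max-frequency term in doc, idf = log(3/2)
print(tf_idf("boolean", docs[2]))  # appears in only one doc, idf = log 3
```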

Advantages

  1. Term-weighting scheme improves retrieval performance
  2. Partial matching allows retrieval of documents that approximate the query conditions
  3. The cosine ranking formula sorts the documents according to their degree of similarity

Disadvantages

  1. Does not account for term dependencies
  2. Answer sets are difficult to improve without query expansion or relevance feedback.

Section 2.5.4 Probabilistic Model

Main Idea: Given a user query, there is a set of documents which contains exactly the relevant documents and no others.

  1. There exists an ideal answer set to a query.

Given a set of index terms we guess at the properties of the ideal set and retrieve that set. This becomes the guess or initial set of documents.

Assumption (Probabilistic Principle):

Given a user query q and a document dj in the collection, the probabilistic model tries to estimate the probability that the user will find the document interesting (i.e. relevant). The model assumes that this probability of relevance depends on the query and the document representations only. Further, the model assumes that there is a subset of all documents which the user prefers as the answer set for the query q. Such an ideal answer set is labeled R and should maximize the overall probability of relevance to the user. Documents in the set R are predicted to be relevant to the query. Documents not in this set are predicted to be non-relevant.

Given a query q, the probabilistic model assigns to each document dj, as a measure of its similarity to the query, the ratio:

$$$sim(d_j,q) = \frac{P(d_j \text{ relevant to } q)}{P(d_j \text{ non-relevant to } q)}$$$

which computes the odds of the document dj being relevant to the query q. Taking the odds of relevance as the rank minimizes the probability of an erroneous judgement.

Definition

For the probabilistic model, the index term weight variables are all binary, i.e., $w_{i,j} \in \{0,1\}$, $w_{i,q} \in \{0,1\}$. A query $q$ is a subset of index terms. Let $R$ be the set of documents known (or initially guessed) to be relevant. Let $\bar{R}$ be the complement of $R$ (i.e. the set of initially guessed non-relevant documents). Let $P(R|\vec{d}_j)$ be the probability that the document $d_j$ is relevant to the query $q$ and $P(\bar{R}|\vec{d}_j)$ be the probability that $d_j$ is non-relevant to $q$. The similarity $sim(d_j,q)$ of the document $d_j$ to the query $q$ is defined as the ratio:

$$$sim(d_j,q) = \frac{P(R|\vec{d}_j)}{P(\bar{R}|\vec{d}_j)}$$$

Using Bayes' rule,

$$$sim(d_j,q) = \frac{P(\vec{d}_j|R) \times P(R)}{P(\vec{d}_j|\bar{R}) \times P(\bar{R})}$$$

For more info see p. 32 of MIR. They assume that $P(R)$ and $P(\bar{R})$ are the same for all the documents in the collection. This doesn't make sense to me unless we assume that about 50% of the documents are relevant. The other option is to say that a few are definitely relevant and a few are definitely irrelevant, and the rest are in a different category of quasi-relevant. Even this doesn't make sense though, because we named the two sets $R$ and $\bar{R}$. Big question!

If we leave off $P(R)$ and $P(\bar{R})$ we get:

$$$sim(d_j,q) \sim \frac{P(\vec{d}_j|R)}{P(\vec{d}_j|\bar{R})}$$$

Assuming independence of index terms,

$$$sim(d_j,q) \sim \frac{\left( \prod_{g_i(\vec{d}_j)=1} P(k_i|R)\right) \times \left( \prod_{g_i(\vec{d}_j)=0} P(\bar{k}_i|R)\right)}{\left( \prod_{g_i(\vec{d}_j)=1} P(k_i|\bar{R})\right) \times \left( \prod_{g_i(\vec{d}_j)=0} P(\bar{k}_i|\bar{R})\right)}$$$

where $g_i(\vec{d}_j)=1$ means the weight of query keyword $k_i$ in document $d_j$ is 1, and $P(k_i|R)$ stands for the probability that the index term $k_i$ is present in a document randomly selected from the set $R$. $P(\bar{k}_i|R)$ stands for the probability that the index term $k_i$ is not present in a document randomly selected from the set $R$. And of course $R$ is the initial guess at the relevant set. With some massaging of the above equation, we can get:

$$$sim(d_j,q) \sim \sum_{i=1}^{t} w_{i,q} \times w_{i,j} \times \left( \log\frac{P(k_i|R)}{1-P(k_i|R)} + \log\frac{1-P(k_i|\bar{R})}{P(k_i|\bar{R})}\right)$$$
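
This ranking sum can be sketched directly; the weight vectors and probability estimates below are placeholder values invented for illustration (in practice the probabilities come from the estimation procedure described next):

```python
import math

def prob_sim(w_q, w_d, p_r, p_nr):
    """sim(d_j, q) ~ sum_i w_iq * w_ij * (log(P(ki|R)/(1 - P(ki|R)))
                                        + log((1 - P(ki|Rbar))/P(ki|Rbar)))"""
    return sum(
        wq * wd * (math.log(pr / (1 - pr)) + math.log((1 - pnr) / pnr))
        for wq, wd, pr, pnr in zip(w_q, w_d, p_r, p_nr)
    )

# Binary weights for a 3-term query and document; made-up probabilities.
w_q = [1, 1, 0]
w_d = [1, 0, 1]
p_r = [0.5, 0.5, 0.5]   # e.g., the initial assumption P(ki|R) = 0.5
p_nr = [0.1, 0.2, 0.3]  # e.g., estimates of P(ki|Rbar)

# Only term 1 has both w_iq = 1 and w_ij = 1, so the sum has one term:
print(prob_sim(w_q, w_d, p_r, p_nr))  # log(1) + log(0.9/0.1) = log 9
```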

In the very beginning (i.e. immediately after the query specification), there are no retrieved documents. Thus, one has to make simplifying assumptions such as

  • assume that $P(k_i|R)$ is constant for all index terms $k_i \in q$
  • assume that the distribution of index terms among the non-relevant documents can be approximated by the distribution of index terms among all the documents in the collection.

From these two assumptions we get:

$$$P(k_i|R) = 0.5$$$

$$$P(k_i|\bar{R}) = \frac{n_i}{N}$$$

where $n_i$ is the number of documents that contain the keyword $k_i$ and $N$ is the total number of documents in the collection. Now that we have initial values for the probabilities, we can use the previous equation to rank the documents and determine our first guess at the set of documents that belong in $R$. Call this initial set $V$, determined by the top $r$ ranked documents. Define $V_i$ to be the subset of $V$ containing keyword $k_i$. Now we are ready to improve our guesses for $P(k_i|R)$ and $P(k_i|\bar{R})$. We accomplish this with the following assumptions:

  • We can approximate $P(k_i|R)$ by the distribution of index term $k_i$ among the documents retrieved so far.
  • We can approximate $P(k_i|\bar{R})$ by considering all the non-retrieved documents as non-relevant.

From these two assumptions we get:

$$$P(k_i|R) = \frac{|V_i|}{|V|}$$$

$$$P(k_i|\bar{R}) = \frac{n_i - |V_i|}{N - |V|}$$$

This process can be repeated without any human intervention to improve our guesses for the probabilities. In practice, for small values of $|V_i|$ and $|V|$ (e.g., $|V|=1$ and $|V_i|=0$), we adjust the above two equations by adding a correction factor:

$$$P(k_i|R) = \frac{|V_i| + 0.5}{|V| + 1}$$$

$$$P(k_i|\bar{R}) = \frac{n_i - |V_i| + 0.5}{N - |V| + 1}$$$

or

$$$P(k_i|R) = \frac{|V_i| + \frac{n_i}{N}}{|V| + 1}$$$

$$$P(k_i|\bar{R}) = \frac{n_i - |V_i| + \frac{n_i}{N}}{N - |V| + 1}$$$
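
The whole bootstrap (initial estimates, rank, re-estimate from the top-$r$ set, repeat) can be sketched as follows. The toy collection, binary vectors, cutoff $r = 2$, and the 0.5 adjustment factor are all assumptions made for illustration:

```python
import math

# Sketch of the probabilistic model's iterative estimation, assuming
# binary document vectors, a top-r cutoff, and 0.5 adjustment factors.
docs = [
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 1],
    [0, 1, 1],
    [0, 0, 0],
]
N, t = len(docs), 3
query = [1, 1, 0]  # binary query weights over the t index terms
n = [sum(d[i] for d in docs) for i in range(t)]  # docs containing k_i

def rank(p_r, p_nr):
    """Order document indices by the probabilistic ranking sum."""
    def sim(d):
        return sum(
            query[i] * d[i]
            * (math.log(p_r[i] / (1 - p_r[i]))
               + math.log((1 - p_nr[i]) / p_nr[i]))
            for i in range(t)
        )
    return sorted(range(N), key=lambda j: sim(docs[j]), reverse=True)

# Initial guesses: P(ki|R) = 0.5, P(ki|Rbar) = n_i / N.
p_r = [0.5] * t
p_nr = [n[i] / N for i in range(t)]
for _ in range(3):  # repeat without human intervention
    V = rank(p_r, p_nr)[:2]  # top r = 2 documents form the guessed set V
    V_i = [sum(docs[j][i] for j in V) for i in range(t)]
    p_r = [(V_i[i] + 0.5) / (len(V) + 1) for i in range(t)]
    p_nr = [(n[i] - V_i[i] + 0.5) / (N - len(V) + 1) for i in range(t)]

print(rank(p_r, p_nr))  # final ranking of document indices
```

On this toy collection the ranking stabilizes after one update, which is the fixed-point behavior the iteration relies on in practice.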

unl/Csce810Chapter2 (last edited 2020-01-26 18:49:25 by 68)