Oracle8i interMedia Text Reference Release 8.1.5 A67843-01
This appendix describes the scoring algorithm for word queries. You obtain the score using the SCORE operator.
To calculate a relevance score for a returned document in a word query, Oracle uses an inverse frequency algorithm based on Salton's formula.
Inverse frequency scoring assumes that frequently occurring terms in a document set are "noise" terms, and so these terms are scored lower. For a document to score high, the query term must occur frequently in the document but infrequently in the document set as a whole.
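The idea can be sketched in a few lines of Python. This appendix does not state the formula itself; the sketch assumes the form 3f(1 + log10(N/n)) given in later Oracle Text documentation, where f is the number of occurrences of the term in the document, N is the number of documents in the set, and n is the number of documents containing the term. The function name `relevance_score` and the cap at 100 are illustrative assumptions, not part of any Oracle API.

```python
import math


def relevance_score(f: int, N: int, n: int) -> float:
    """Assumed inverse frequency score: 3f(1 + log10(N/n)), capped at 100.

    f: occurrences of the term in the document
    N: number of documents in the document set
    n: number of documents that contain the term
    """
    if f <= 0 or n <= 0:
        return 0.0
    return min(100.0, 3 * f * (1 + math.log10(N / n)))


# Same term frequency in the document, different rarity in the set.
common = relevance_score(f=5, N=1000, n=1000)  # term appears in every document
rare = relevance_score(f=5, N=1000, n=10)      # term appears in 10 documents
print(common, rare)
```

Under this assumption the rarer term scores higher even though both terms occur the same number of times in the document.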
The following table illustrates Oracle's inverse frequency scoring. The first column shows the number of documents in the document set, and the second column shows the number of occurrences of the term in the document necessary to score 100.
This table assumes that only one document in the set contains the query term.
Number of Documents in Document Set | Occurrences of Term in Document Needed to Score 100 |
---|---|
1 | 34 |
5 | 20 |
10 | 17 |
50 | 13 |
100 | 12 |
500 | 10 |
1,000 | 9 |
10,000 | 7 |
100,000 | 5 |
1,000,000 | 4 |
The table shows that if only one document contained the query term and there were five documents in the set, the term would have to occur 20 times in the document to score 100. If there were 1,000,000 documents in the set, the term would have to occur only 4 times in the document to score 100.
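As a rough cross-check of the table, the following sketch computes the smallest number of occurrences needed to reach a score of 100 when only one document contains the term. It assumes the formula 3f(1 + log10(N/n)) from later Oracle Text documentation, which this appendix does not state; `occurrences_needed` is a hypothetical helper name. Under that assumption the sketch reproduces the table rows up to 10,000 documents, but yields 6 and 5 for 100,000 and 1,000,000 documents where the table lists 5 and 4, so treat it as an approximation of this release's behavior.

```python
import math


def occurrences_needed(N: int, n: int = 1, target: float = 100.0) -> int:
    """Smallest f with 3f(1 + log10(N/n)) >= target (assumed formula)."""
    per_occurrence = 3 * (1 + math.log10(N / n))
    return math.ceil(target / per_occurrence)


# Document-set sizes taken from the table in this appendix.
for size in (1, 5, 10, 50, 100, 500, 1_000, 10_000):
    print(f"{size:>7,} documents: {occurrences_needed(size)} occurrences")
```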
You have 5000 documents dealing with chemistry in which the term chemical occurs at least once in every document. The term chemical thus occurs frequently in the document set.
You have a document that contains 5 occurrences of chemical and 5 occurrences of the term hydrogen. No other document contains the term hydrogen. The term hydrogen thus occurs infrequently in the document set.
Because chemical occurs so frequently in the document set, its score for the document is lower than that of hydrogen, which is infrequent in the document set as a whole. The score for hydrogen is therefore higher than that of chemical, even though both terms occur 5 times in the document.
Inverse frequency scoring also means that adding documents that contain hydrogen lowers the score for that term in the document, and adding more documents that do not contain hydrogen raises the score.
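The chemical and hydrogen example can be worked through numerically. The sketch below again assumes the formula 3f(1 + log10(N/n)) from later Oracle Text documentation (not stated in this appendix), and the document counts used for the "adding documents" cases are made-up illustrative values.

```python
import math


def relevance_score(f: int, N: int, n: int) -> float:
    """Assumed inverse frequency score: 3f(1 + log10(N/n)), capped at 100."""
    return min(100.0, 3 * f * (1 + math.log10(N / n)))


# 5000 documents; chemical is in all of them, hydrogen in only one.
chemical = relevance_score(f=5, N=5000, n=5000)
hydrogen = relevance_score(f=5, N=5000, n=1)

# Hypothetical: add 9 documents that also contain hydrogen.
hydrogen_more_matches = relevance_score(f=5, N=5009, n=10)

# Hypothetical: add 5000 documents that do not contain hydrogen.
hydrogen_larger_set = relevance_score(f=5, N=10000, n=1)

print(f"chemical: {chemical:.1f}")
print(f"hydrogen: {hydrogen:.1f}")
print(f"hydrogen after adding matching documents: {hydrogen_more_matches:.1f}")
print(f"hydrogen after adding non-matching documents: {hydrogen_larger_set:.1f}")
```

Under these assumptions hydrogen outscores chemical, its score drops when more documents contain it, and its score rises when the set grows without it.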
Because the scoring algorithm is based on the number of documents in the document set, inserting, updating, or deleting documents in the document set is likely to change the score for any given term before and after the DML.
If DML is heavy, you or your Oracle administrator must optimize the index. Perfect relevance ranking is obtained by executing a query right after optimizing the index.
If DML is light, Oracle still gives fairly accurate relevance ranking.
In either case, you or your Oracle administrator must synchronize the index either with ALTER INDEX or by running ctxsrv in the background.
See Also:
For more information about ALTER INDEX, see ALTER INDEX in Chapter 2. For more information about ctxsrv, see "ctxsrv" in Chapter 11.