Oracle8i interMedia Text Reference Release 8.1.5 A67843-01
This chapter describes the CTX_DOC PL/SQL package for requesting document services. The CTX_DOC package includes the following procedures and functions: FILTER, GIST, HIGHLIGHT, MARKUP, PKENCODE, and THEMES.
Use the CTX_DOC.FILTER procedure to generate either a plain text or HTML version of a document, which is stored in a result table. This procedure is generally called after a query, from which you identify the document to be filtered.
CTX_DOC.FILTER(
  index_name IN VARCHAR2,
  textkey    IN VARCHAR2,
  restab     IN VARCHAR2,
  query_id   IN NUMBER DEFAULT 0,
  plaintext  IN BOOLEAN DEFAULT FALSE);
For index_name, specify the name of the index associated with the text column containing the document identified by textkey.
For textkey, specify the unique identifier (usually the primary key) for the document.
The textkey parameter can be a single column textkey or an encoded specification for a composite (multiple column) textkey.
For restab, specify the name of the result table where the filtered document is stored.
See Also: For more information about the structure of the filter result table, see "Filter Table" in Appendix B.
For query_id, specify an identifier for the row inserted into restab.
For plaintext, specify TRUE to generate a plaintext version of the document. Specify FALSE to generate an HTML version of the document if you are using the INSO filter or indexing HTML documents.
Create the filter result table to store the filtered document as follows:
create table filtertab (query_id number, document clob);
To obtain a plaintext version of the document with textkey 20, issue the following statement:
begin ctx_doc.filter('newsindex', 20, 'filtertab', 0, TRUE); end;
Before CTX_DOC.FILTER is called, the result table specified in restab must exist.
When textkey is a composite textkey, you must encode the composite textkey string using CTX_DOC.PKENCODE.
When query_id is specified, all rows with the same query_id are deleted from restab before new rows are generated with query_id.
When query_id is not specified or set to NULL, it defaults to 0. You must manually truncate the table specified in restab.
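As a rough sketch of how query_id can keep several filtered documents apart in one result table, the following SQL*Plus-style session reuses the newsindex index and the filtertab table from the example above. The second textkey (21) and the query_id values 1 and 2 are hypothetical, chosen only for illustration.

```sql
-- Filter two documents into the same result table, one query_id each.
-- Textkey 21 and the query_id values are illustrative assumptions.
begin
  ctx_doc.filter('newsindex', '20', 'filtertab', 1, TRUE);
  ctx_doc.filter('newsindex', '21', 'filtertab', 2, TRUE);
end;
/

-- Read back the plaintext version of document 20 only.
select document from filtertab where query_id = 1;
```

Because each call supplies its own query_id, rerunning either call replaces only the row for that document.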
Use the CTX_DOC.GIST procedure to generate a Gist and theme summaries for a document. You can generate paragraph-level or sentence-level Gists/theme summaries.
CTX_DOC.GIST(
  index_name    IN VARCHAR2,
  textkey       IN VARCHAR2,
  restab        IN VARCHAR2,
  query_id      IN NUMBER DEFAULT 0,
  glevel        IN VARCHAR2 DEFAULT 'P',
  pov           IN VARCHAR2 DEFAULT NULL,
  numParagraphs IN NUMBER DEFAULT 16,
  maxPercent    IN NUMBER DEFAULT 10);
For index_name, specify the name of the index associated with the text column containing the document identified by textkey.
For textkey, specify the textkey (usually the primary key) of the document to be processed. The textkey parameter can be a single column textkey or an encoded specification for a multiple column textkey.
For restab, specify the name of the result table used to store the output generated by GIST.
See Also: For more information about the structure of the Gist result table, see "Gist Table" in Appendix B.
For query_id, specify an identifier for the row(s) inserted into restab.
For glevel, specify the type of Gist/theme summary to produce. The possible values are P for paragraph-level and S for sentence-level. The default is P.
For pov, specify whether a Gist or a single theme summary is generated. The type of Gist/theme summary generated (sentence-level or paragraph-level) depends on the value specified for glevel.
To generate a Gist for the document, specify a value of 'GENERIC' for pov. To generate a theme summary for a single theme in a document, specify the theme as the value for pov.
If you specify a NULL value for pov, this procedure generates a Gist for the document and a theme summary for each document theme (up to 50).
Note: The pov parameter is case sensitive. To return a Gist for a document, specify 'GENERIC' in all uppercase. To return a theme summary, specify the theme exactly as it is generated for the document. Only the themes generated by CTX_DOC.THEMES for a document can be used as input for pov.
For numParagraphs, specify the maximum number of document paragraphs (or sentences) selected for the document Gist/theme summaries. The default is 16.
Note: The numParagraphs parameter is used only when it yields a smaller Gist/theme summary size than the maxPercent parameter does.
For maxPercent, specify the maximum number of document paragraphs (or sentences) selected for the document Gist/theme summaries as a percentage of the total paragraphs (or sentences) in the document. The default is 10.
Note: The maxPercent parameter is used only when it yields a smaller Gist/theme summary size than the numParagraphs parameter does.
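The interaction between numParagraphs and maxPercent amounts to taking the smaller of two limits. The queries below work through that arithmetic for two hypothetical documents of 200 and 50 paragraphs, using the default settings. They only illustrate the rule; how Oracle rounds fractional results is not specified here.

```sql
-- 200 paragraphs: maxPercent allows 200 * 10 / 100 = 20, numParagraphs allows 16,
-- so numParagraphs is the effective limit.
select least(16, 200 * 10 / 100) as max_paragraphs from dual;   -- 16

-- 50 paragraphs: maxPercent allows 50 * 10 / 100 = 5, which is smaller than 16,
-- so maxPercent is the effective limit.
select least(16, 50 * 10 / 100) as max_paragraphs from dual;    -- 5
```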
The following example creates a Gist table called CTX_GIST:
create table CTX_GIST (query_id number, pov varchar2(80), gist CLOB);
The following example returns a default-sized, paragraph-level Gist for document 34 as well as a theme summary for each theme in the document:
begin ctx_doc.gist('newsindex',34,'CTX_GIST',1,glevel => 'P'); end;
The following example generates a non-default size Gist of at most ten paragraphs:
begin ctx_doc.gist('newsindex',34,'CTX_GIST',1,glevel => 'P',pov => 'GENERIC', numParagraphs => 10); end;
The following example generates a Gist whose number of paragraphs is at most ten percent of the total paragraphs in the document:
begin ctx_doc.gist('newsindex',34,'CTX_GIST',1, glevel =>'P',pov => 'GENERIC', maxPercent => 10); end;
The following example returns a paragraph-level theme summary for insects for document 34. The default theme summary size is returned.
begin ctx_doc.gist('newsindex',34,'CTX_GIST',1,glevel =>'P', pov => 'insects'); end;
By default, this procedure generates up to 50 themes for a document. As a result, CTX_DOC.GIST creates a maximum of 51 gists for each document: one theme summary for each theme and one Gist for the entire document.
When textkey is a composite textkey, you must encode the composite textkey string using the CTX_DOC.PKENCODE function, as shown in the example for CTX_DOC.PKENCODE.
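To inspect what GIST wrote, you might query the result table along the following lines. This is a sketch against the CTX_GIST table and query_id 1 used in the examples above. It assumes that the row holding the document Gist is stored with a pov value of GENERIC, which should be confirmed against "Gist Table" in Appendix B.

```sql
-- List the points of view stored for query_id 1:
-- one row per theme summary, plus the document Gist.
select pov from CTX_GIST where query_id = 1;

-- Preview the first 200 characters of the document Gist.
-- Assumption: the Gist row carries pov = 'GENERIC'.
select dbms_lob.substr(gist, 200, 1) as gist_preview
from CTX_GIST
where query_id = 1
  and pov = 'GENERIC';
```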
Use the CTX_DOC.HIGHLIGHT procedure to generate highlight offsets for a document. The offset information is generated for the terms in the document that satisfy the query you specify. These highlighted terms are either the words that satisfy a word query or the themes that satisfy an ABOUT query.
You can generate highlight offsets for either plaintext or HTML versions of the document. You can apply the offset information to the same documents filtered with CTX_DOC.FILTER.
You usually call this procedure after a query, from which you identify the document to be processed.
CTX_DOC.HIGHLIGHT(
  index_name IN VARCHAR2,
  textkey    IN VARCHAR2,
  text_query IN VARCHAR2 DEFAULT NULL,
  restab     IN VARCHAR2 DEFAULT NULL,
  query_id   IN NUMBER DEFAULT 0,
  plaintext  IN BOOLEAN DEFAULT FALSE);
For index_name, specify the name of the index associated with the text column containing the document identified by textkey.
For textkey, specify the unique identifier (usually the primary key) for the document.
The textkey parameter can be a single column textkey or an encoded specification for a composite (multiple column) textkey.
For text_query, specify the original query expression used to retrieve the document. If NULL, no highlights are generated.
For restab, specify the name of the result table where highlight offsets are stored.
See Also: For more information about the structure of the highlight result table, see "Highlight Table" in Appendix B.
For query_id, specify the identifier used to identify the row inserted into restab.
For plaintext, specify TRUE to generate plaintext offsets for the document.
Specify FALSE to generate HTML offsets for the document if you are using the INSO filter or indexing HTML documents.
Create the highlight table to store the highlight offset information:
create table hightab(query_id number, offset number, length number);
To obtain HTML highlight offset information for document 20 for the word dog:
begin ctx_doc.highlight('newsindex', 20, 'dog', 'hightab', 0, FALSE); end;
Assuming the index newsindex has a theme component, you obtain HTML highlight offset information for the theme query of politics by issuing the following statement:
begin ctx_doc.highlight('newsindex', 20, 'about(politics)', 'hightab', 0, FALSE); end;
The output of this statement is the set of offsets to the highlighted words and phrases that represent the theme of politics in the document.
Before CTX_DOC.HIGHLIGHT is called, the result table specified in restab must exist.
When textkey is a composite textkey, you must encode the composite textkey string using the CTX_DOC.PKENCODE function.
If text_query includes wildcards, stemming, or fuzzy matching that result in stopwords being returned, HIGHLIGHT does not highlight the stopwords.
If text_query contains the threshold operator, the operator is ignored. The HIGHLIGHT procedure always returns highlight information for the entire result set.
When query_id is specified, all rows with the same query_id are deleted from restab before new rows are generated with query_id.
When query_id is not specified or set to NULL, it defaults to 0. You must manually truncate the table specified in restab.
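The sketch below shows one way the offsets might be applied to a document filtered with CTX_DOC.FILTER, using the filtertab and hightab tables created in the examples above. Both calls request plaintext so that the offsets match the stored text. The query_id value 5 is hypothetical, and the assumption that offsets are character positions counted from 1 should be checked against "Highlight Table" in Appendix B.

```sql
-- Filter and highlight the same document as plaintext under one query_id
-- so that the offsets line up with the stored text.
begin
  ctx_doc.filter('newsindex', '20', 'filtertab', 5, TRUE);
  ctx_doc.highlight('newsindex', '20', 'dog', 'hightab', 5, TRUE);
end;
/

-- Extract each highlighted term from the filtered text.
-- Assumption: offsets are character positions counted from 1.
select h.offset, h.length,
       dbms_lob.substr(f.document, h.length, h.offset) as term
from hightab h, filtertab f
where h.query_id = 5
  and f.query_id = 5
order by h.offset;
```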
The CTX_DOC.MARKUP procedure takes a query specification and a document textkey and returns a version of the document in which the query terms are marked-up. These marked-up terms are either the words that satisfy a word query or the themes that satisfy an ABOUT query.
The marked-up output can be either plaintext or HTML.
You can use one of the pre-defined tagsets for marking highlighted terms, including a tag sequence that enables HTML navigation.
You usually call CTX_DOC.MARKUP after a query, from which you identify the document to be processed.
CTX_DOC.MARKUP(
  index_name IN VARCHAR2,
  textkey    IN VARCHAR2,
  text_query IN VARCHAR2,
  restab     IN VARCHAR2,
  query_id   IN NUMBER DEFAULT 0,
  plaintext  IN BOOLEAN DEFAULT FALSE,
  tagset     IN VARCHAR2 DEFAULT 'TEXT_DEFAULT',
  starttag   IN VARCHAR2 DEFAULT NULL,
  endtag     IN VARCHAR2 DEFAULT NULL,
  prevtag    IN VARCHAR2 DEFAULT NULL,
  nexttag    IN VARCHAR2 DEFAULT NULL);
For index_name, specify the name of the index associated with the text column containing the document identified by textkey.
For textkey, specify the unique identifier (usually the primary key) for the document.
The textkey parameter can be a single column textkey or an encoded specification for a composite (multiple column) textkey.
For text_query, specify the original query expression used to retrieve the document.
For restab, specify the name of the result table where the marked-up document is stored.
See Also: For more information about the structure of the markup result table, see "Markup Table" in Appendix B.
For query_id, specify the identifier used to identify the row inserted into restab.
For plaintext, specify TRUE to generate a plaintext marked-up document. Specify FALSE to generate a marked-up HTML version of the document if you are using the INSO filter or indexing HTML documents.
For tagset, specify one of the pre-defined tagsets: TEXT_DEFAULT (the default), HTML_DEFAULT, or HTML_NAVIGATE. Each tagset defines the four tags starttag, endtag, prevtag, and nexttag.
For starttag, specify the character(s) inserted by MARKUP to indicate the start of a highlighted term.
The sequence of starttag, endtag, prevtag and nexttag with respect to the highlighted word is as follows:
... prevtag starttag word endtag nexttag...
For endtag, specify the character(s) inserted by MARKUP to indicate the end of a highlighted term.
For prevtag, specify the markup sequence that defines the tag that navigates the user to the previous highlight.
In the markup sequences prevtag and nexttag, you can specify the following offset variables which are set dynamically:
Offset Variable | Value |
---|---|
%CURNUM | the current offset number |
%PREVNUM | the previous offset number |
%NEXTNUM | the next offset number |
See the description of the HTML_NAVIGATE tagset for an example.
For nexttag, specify the markup sequence that defines the tag that navigates the user to the next highlight.
Within the markup sequence, you can use the same offset variables you use for prevtag. See the explanation for prevtag and the HTML_NAVIGATE tagset for an example.
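As an illustration of the starttag and endtag parameters described above, the following call marks each highlighted term in italics instead of relying on a pre-defined tagset. It is a sketch only: it assumes that explicitly supplied starttag and endtag values take the place of those defined by the tagset, which is not spelled out here and should be verified. The tag strings and the query_id value 2 are hypothetical, and the markuptab table is assumed to exist with the columns query_id and document.

```sql
-- Mark up the words dog or cat in document 23 with custom tags.
-- Assumption: starttag/endtag override the tags of the default tagset.
begin
  ctx_doc.markup(index_name => 'my_index',
                 textkey    => '23',
                 text_query => 'dog|cat',
                 restab     => 'markuptab',
                 query_id   => 2,
                 starttag   => '<I>',
                 endtag     => '</I>');
end;
/
```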
Create the highlight markup table to store the marked-up document as follows:
create table markuptab (query_id number, document clob);
To create HTML highlight markup for the words dog or cat for document 23, issue the following statement:
begin ctx_doc.markup(index_name => 'my_index', textkey => '23', text_query => 'dog|cat', restab => 'markuptab', query_id => 1, tagset => 'HTML_DEFAULT'); end;
To create HTML highlight markup for the theme of politics for document 23, issue the following statement:
begin ctx_doc.markup(index_name => 'my_index', textkey => '23', text_query => 'about(politics)', restab => 'markuptab', query_id => 1, tagset => 'HTML_DEFAULT'); end;
Before CTX_DOC.MARKUP is called, the result table specified in restab must exist.
When textkey is a composite textkey, you must encode the composite textkey string using the CTX_DOC.PKENCODE function.
If text_query includes wildcards, stemming, or fuzzy matching that result in stopwords being returned, MARKUP does not highlight the stopwords.
If text_query contains the threshold operator, the operator is ignored. The MARKUP procedure always returns highlight information for the entire result set.
When query_id is specified, all rows with the same query_id are deleted from restab before new rows are generated with query_id.
When query_id is not specified or set to NULL, it defaults to 0. You must manually truncate the table specified in restab.
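The statements below sketch how the markup result table from the examples above might be read and then cleaned up by hand. They assume the markuptab table and the query_id value 1 used earlier; whether to delete selected rows or truncate the whole table is a choice left to the application.

```sql
-- Read the marked-up document produced with query_id 1.
select document from markuptab where query_id = 1;

-- Remove only the rows written under the default query_id of 0.
delete from markuptab where query_id = 0;
commit;

-- Or empty the result table entirely.
truncate table markuptab;
```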
The CTX_DOC.PKENCODE function converts a composite textkey list into a single string and returns the string.
The string created by PKENCODE can be used as the primary key parameter textkey in other CTX_DOC procedures, such as CTX_DOC.THEMES and CTX_DOC.GIST.
CTX_DOC.PKENCODE(
  pk1  IN VARCHAR2,
  pk2  IN VARCHAR2 DEFAULT NULL,
  pk3  IN VARCHAR2 DEFAULT NULL,
  pk4  IN VARCHAR2 DEFAULT NULL,
  pk5  IN VARCHAR2 DEFAULT NULL,
  pk6  IN VARCHAR2 DEFAULT NULL,
  pk7  IN VARCHAR2 DEFAULT NULL,
  pk8  IN VARCHAR2 DEFAULT NULL,
  pk9  IN VARCHAR2 DEFAULT NULL,
  pk10 IN VARCHAR2 DEFAULT NULL,
  pk11 IN VARCHAR2 DEFAULT NULL,
  pk12 IN VARCHAR2 DEFAULT NULL,
  pk13 IN VARCHAR2 DEFAULT NULL,
  pk14 IN VARCHAR2 DEFAULT NULL,
  pk15 IN VARCHAR2 DEFAULT NULL,
  pk16 IN VARCHAR2 DEFAULT NULL)
RETURN VARCHAR2;
Each pk argument (pk1 through pk16) specifies a column element in the composite textkey list. You can encode at most 16 column elements.
PKENCODE returns a string that represents the encoded value of the composite textkey.
begin ctx_doc.gist('newsindex',CTX_DOC.PKENCODE('smith', 14), 'CTX_GIST'); end;
In this example, smith and 14 constitute the composite textkey value for the document.
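When the same composite textkey is passed to more than one CTX_DOC call, it may be convenient to encode it once and reuse the result. The following is a minimal sketch; the variable name encoded_key and its declared length of 2000 are hypothetical, and the CTX_GIST and CTX_THEMES result tables are assumed to exist already.

```sql
declare
  -- Hypothetical holder for the encoded composite textkey.
  encoded_key varchar2(2000);
begin
  -- Encode the two-column key (smith, 14) once.
  encoded_key := ctx_doc.pkencode('smith', '14');

  -- Reuse the encoded key as the textkey argument.
  ctx_doc.themes('newsindex', encoded_key, 'CTX_THEMES', 1);
  ctx_doc.gist('newsindex', encoded_key, 'CTX_GIST', 1);
end;
/
```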
The CTX_DOC.THEMES procedure generates a list of up to fifty themes for a document. Each theme is stored as a row in a result table specified by the user.
CTX_DOC.THEMES(index_name IN VARCHAR2, textkey IN VARCHAR2, restab IN VARCHAR2, query_id IN NUMBER DEFAULT 0, full_themes IN BOOLEAN DEFAULT FALSE);
For index_name, specify the name of the index on the text column that stores the document whose themes are to be listed.
For textkey, specify the textkey (usually the primary key) of the document (row) to be processed. The textkey parameter can be a single column textkey or an encoded specification for a multiple column textkey.
For restab, specify the name of the result table used to store the output generated by THEMES.
See Also: For more information about the structure of the theme result table, see "Theme Table" in Appendix B.
For query_id, specify the identifier used to identify the row(s) inserted into restab.
For full_themes, specify whether this procedure generates a single theme or a hierarchical list of parent themes (full themes) for each document theme.
Specify TRUE for this procedure to write full themes to the THEME column of the result table.
Specify FALSE for this procedure to write single theme information to the THEME column of the result table. This is the default.
The following example creates a theme table called CTX_THEMES:
create table CTX_THEMES (query_id number, theme varchar2(2000), weight number);
To obtain a list of themes where each element in the list is a single theme, issue:
begin ctx_doc.themes('newsindex',34,'CTX_THEMES',1,full_themes => FALSE); end;
To obtain a list of themes where each element in the list is a hierarchical list of parent themes, issue:
begin ctx_doc.themes('newsindex',34,'CTX_THEMES',1,full_themes => TRUE); end;
When textkey is a composite textkey, you must encode the composite textkey string using the CTX_DOC.PKENCODE function.
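Once THEMES has populated the result table, the themes can be read back with an ordinary query. The sketch below uses the CTX_THEMES table and query_id 1 from the examples above. Sorting by weight in descending order assumes that a larger weight marks a more prominent theme, which should be confirmed against "Theme Table" in Appendix B.

```sql
-- List the themes generated for query_id 1.
-- Assumption: a larger weight indicates a more prominent theme.
select theme, weight
from CTX_THEMES
where query_id = 1
order by weight desc;

-- Count the themes stored for the document (THEMES generates up to fifty).
select count(*) as theme_count
from CTX_THEMES
where query_id = 1;
```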