8
CTX_DOC Package

This chapter describes the CTX_DOC PL/SQL package for requesting document services. The CTX_DOC package includes the following procedures and functions:

Name	Description
FILTER	Generates a plain text or HTML version of a document
GIST	Generates a Gist and theme summaries for a document
HIGHLIGHT	Generates plain text or HTML highlighting offset information for a document
MARKUP	Generates a plain text or HTML version of a document with query terms highlighted
PKENCODE	Encodes a composite textkey string (value) for use in other CTX_DOC procedures
THEMES	Generates a list of themes for a document

FILTER

Use the CTX_DOC.FILTER procedure to generate either a plain text or HTML version of a document, which is stored in a result table. This procedure is generally called after a query, from which you identify the document to be filtered.

Syntax

 CTX_DOC.FILTER(
          index_name  IN VARCHAR2, 
          textkey     IN VARCHAR2, 
          restab      IN VARCHAR2, 
          query_id    IN NUMBER   DEFAULT 0,
          plaintext   IN BOOLEAN  DEFAULT FALSE);

index_name

Specify the name of the index associated with the text column containing the document identified by textkey.

textkey

Specify the unique identifier (usually the primary key) for the document.

The textkey parameter can be a single column textkey or an encoded specification for a composite (multiple column) textkey.

restab

Specify the name of the result table where the filtered document is stored.

See Also:
For more information about the structure of the filter result table, see "Filter Table" in Appendix B.

query_id

Specify an identifier to use to identify the row inserted into restab.

plaintext

Specify TRUE to generate a plaintext version of the document. Specify FALSE to generate an HTML version of the document if you are using the INSO filter or indexing HTML documents.

Example

Create the filter result table to store the filtered document as follows:

create table filtertab (query_id  number,   
                        document  clob);

To obtain a plaintext version of the document with textkey 20, issue the following statement:

begin 
ctx_doc.filter('newsindex', 20, 'filtertab', 0, TRUE);
end;

Notes

Before CTX_DOC.FILTER is called, the result table specified in restab must exist.

When textkey is a composite textkey, you must encode the composite textkey string using CTX_DOC.PKENCODE.

When query_id is specified, all rows with the same query_id are deleted from restab before new rows are generated with query_id.

When query_id is not specified or set to NULL, it defaults to 0. You must manually truncate the table specified in restab.
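
After FILTER runs, the filtered text can be read back from the result table by query_id. The following is a minimal sketch that assumes the filtertab table and the newsindex index from the example above. The query_id value 5 is arbitrary, and the trailing / assumes the block is run from SQL*Plus.

```sql
-- Filter document 20 to plain text and tag the result row with query_id 5.
begin
  ctx_doc.filter('newsindex', '20', 'filtertab', 5, TRUE);
end;
/

-- Read back the filtered document written by the call above.
select document
  from filtertab
 where query_id = 5;

-- Manual cleanup once the result is no longer needed.
truncate table filtertab;
```

Using a distinct query_id for each request makes it possible to keep several filtered documents in the same result table and select each one individually.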

GIST

Use the CTX_DOC.GIST procedure to generate a Gist and theme summaries for a document. You can generate paragraph-level or sentence-level Gists/theme summaries.

Syntax

CTX_DOC.GIST(
              index_name    IN VARCHAR2, 
              textkey       IN VARCHAR2, 
              restab        IN VARCHAR2, 
              query_id      IN NUMBER DEFAULT 0,
              glevel        IN VARCHAR2 DEFAULT 'P',
              pov           IN VARCHAR2 DEFAULT NULL,
              numParagraphs IN NUMBER DEFAULT 16,
              maxPercent    IN NUMBER DEFAULT 10);

index_name

Specify the name of the index associated with the text column containing the document identified by textkey.

textkey

Specify the textkey (usually the primary key) of the document to be processed. The parameter textkey can be a single column textkey or an encoded specification for a multiple column textkey.

restab

Specify the name of the result table used to store the output generated by GIST.

See Also:
For more information about the structure of the Gist result table, see "Gist Table" in Appendix B.

query_id

Specify an identifier to use to identify the row(s) inserted into restab.

glevel

Specify the type of Gist/theme summary to produce. The possible values are:

P for paragraph
S for sentence

The default is P.

pov

Specify whether a Gist or a single theme summary is generated. The type of Gist/theme summary generated (sentence-level or paragraph-level) depends on the value specified for glevel.

To generate a Gist for the document, specify a value of 'GENERIC' for pov. To generate a theme summary for a single theme in a document, specify the theme as the value for pov.

If you specify a NULL value for pov, this procedure generates a Gist for the document and a theme summary for each document theme (up to 50).

Note:
The pov parameter is case sensitive. To return a Gist for a document, specify 'GENERIC' in all uppercase. To return a theme summary, specify the theme exactly as it is generated for the document.
Only the themes generated by CTX_DOC.THEMES for a document can be used as input for pov.

numParagraphs

Specify the maximum number of document paragraphs (or sentences) selected for the document Gist/theme summaries. The default is 16.

Note:
The numParagraphs parameter is used only when this parameter yields a smaller Gist/theme summary size than the Gist/theme summary size yielded by the maxPercent parameter.

maxPercent

Specify the maximum number of document paragraphs (or sentences) selected for the document Gist/theme summaries as a percentage of the total paragraphs (or sentences) in the document. The default is 10.

Note:
The maxPercent parameter is used only when this parameter yields a smaller Gist/theme summary size than the Gist/theme summary size yielded by the numParagraphs parameter.
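
The interaction between numParagraphs and maxPercent amounts to taking the smaller of the two limits. The block below is only an arithmetic illustration of that rule and does not call CTX_DOC. The helper gist_limit is hypothetical, and the use of FLOOR for the percentage is an assumption, because the rounding rule is not stated here. Output requires SET SERVEROUTPUT ON in SQL*Plus.

```sql
declare
  num_paragraphs constant number := 16;  -- numParagraphs default
  max_percent    constant number := 10;  -- maxPercent default

  -- Hypothetical helper: the smaller of the absolute and percentage limits.
  -- FLOOR is an assumed rounding rule, not documented behavior.
  function gist_limit(total_paragraphs in number) return number is
  begin
    return least(num_paragraphs,
                 floor(total_paragraphs * max_percent / 100));
  end;
begin
  -- 200 paragraphs: 10 percent is 20, so numParagraphs (16) is the smaller limit.
  dbms_output.put_line(gist_limit(200));
  -- 50 paragraphs: 10 percent is 5, so maxPercent is the smaller limit.
  dbms_output.put_line(gist_limit(50));
end;
/
```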

Examples

Gist Table

The following example creates a Gist table called CTX_GIST:

create table CTX_GIST (query_id  number,
                       pov       varchar2(80), 
                       gist      CLOB);

Gists

The following example returns a default-sized, paragraph-level Gist for document 34, as well as a theme summary for each theme in the document:

begin
ctx_doc.gist('newsindex',34,'CTX_GIST',1,glevel => 'P');
end;

The following example generates a non-default size Gist of at most ten paragraphs:

begin
ctx_doc.gist('newsindex',34,'CTX_GIST',1,glevel => 'P',pov => 'GENERIC', 
numParagraphs => 10);
end;

The following example generates a Gist whose number of paragraphs is at most ten percent of the total paragraphs in the document:

begin 
ctx_doc.gist('newsindex',34,'CTX_GIST',1, glevel =>'P',pov => 'GENERIC', 
maxPercent => 10);
end;

Theme Summary

The following example returns a paragraph level theme summary for insects for document 34. The default theme summary size is returned.

begin
ctx_doc.gist('newsindex',34,'CTX_GIST',1,glevel =>'P', pov => 'insects');
end;
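
Because pov accepts only themes that CTX_DOC.THEMES generates for the document, one way to avoid a mismatch is to read the theme from the theme result table and pass it on unchanged. This is a sketch under several assumptions: the CTX_THEMES and CTX_GIST tables from the examples in this chapter already exist, and a higher weight is taken to mean a more prominent theme.

```sql
declare
  v_theme varchar2(2000);
begin
  -- Generate single themes for document 34 into CTX_THEMES under query_id 1.
  ctx_doc.themes('newsindex', '34', 'CTX_THEMES', 1, full_themes => FALSE);

  -- Take the first theme by descending weight (assumed to be the most
  -- prominent theme) and stop after one row.
  for r in (select theme
              from CTX_THEMES
             where query_id = 1
             order by weight desc) loop
    v_theme := r.theme;
    exit;
  end loop;

  -- Request a paragraph-level theme summary only if a theme was found.
  if v_theme is not null then
    ctx_doc.gist('newsindex', '34', 'CTX_GIST', 1,
                 glevel => 'P', pov => v_theme);
  end if;
end;
/
```

Passing the stored value directly preserves its exact case and spelling, which matters because pov is case sensitive.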

Notes

By default, this procedure generates up to 50 themes for a document. As a result, CTX_DOC.GIST creates a maximum of 51 gists for each document: one theme summary for each theme and one Gist for the entire document.

When textkey is a composite textkey, you must encode the composite textkey string using the CTX_DOC.PKENCODE function, as shown in the example in the PKENCODE section.

HIGHLIGHT

Use the CTX_DOC.HIGHLIGHT procedure to generate highlight offsets for a document. The offset information is generated for the terms in the document that satisfy the query you specify. These highlighted terms are either the words that satisfy a word query or the themes that satisfy an ABOUT query.

You can generate highlight offsets for either plaintext or HTML versions of the document. You can apply the offset information to the same documents filtered with CTX_DOC.FILTER.

You usually call this procedure after a query, from which you identify the document to be processed.

Syntax

CTX_DOC.HIGHLIGHT(
          index_name  IN VARCHAR2, 
          textkey     IN VARCHAR2, 
          text_query  IN VARCHAR2 DEFAULT NULL, 
          restab      IN VARCHAR2 DEFAULT NULL, 
          query_id    IN NUMBER   DEFAULT 0,
          plaintext   IN BOOLEAN  DEFAULT FALSE);

index_name

Specify the name of the index associated with the text column containing the document identified by textkey.

textkey

Specify the unique identifier (usually the primary key) for the document.

The textkey parameter can be a single column textkey or an encoded specification for a composite (multiple column) textkey.

text_query

Specify the original query expression used to retrieve the document. If NULL, no highlights are generated.

restab

Specify the name of the result table where highlight offsets are stored.

See Also:
For more information about the structure of the highlight result table, see "Highlight Table" in Appendix B.

query_id

Specify the identifier used to identify the row inserted into restab.

plaintext

Specify TRUE to generate plaintext offsets for the document.

Specify FALSE to generate HTML offsets of the document if you are using the INSO filter or indexing HTML documents.

Examples

Create Highlight Table

Create the highlight table to store the highlight offset information:

create table hightab(query_id number, 
                     offset number, 
                     length number);

Word Highlight Offsets

To obtain HTML highlight offset information for document 20 for the word dog:

begin
ctx_doc.highlight('newsindex', 20, 'dog', 'hightab', 0, FALSE);
end;

Theme Highlight Offsets

Assuming the index newsindex has a theme component, you obtain HTML highlight offset information for the theme query of politics by issuing the following statement:

begin
ctx_doc.highlight('newsindex', 20, 'about(politics)', 'hightab', 0, FALSE);
end;

The output of this statement is the set of offsets to the highlighted words and phrases that represent the theme of politics in the document.

Notes

Before CTX_DOC.HIGHLIGHT is called, the result table specified in restab must exist.

When textkey is a composite textkey, you must encode the composite textkey string using the CTX_DOC.PKENCODE function.

If text_query includes wildcards, stemming, or fuzzy matching that results in stopwords being returned, HIGHLIGHT does not highlight the stopwords.

If text_query contains the threshold operator, the operator is ignored. The HIGHLIGHT procedure always returns highlight information for the entire result set.

When query_id is specified, all rows with the same query_id are deleted from restab before new rows are generated with query_id.

When query_id is not specified or set to NULL, it defaults to 0. You must manually truncate the table specified in restab.
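
Once HIGHLIGHT has populated the result table, the offsets can be listed in document order and applied to the filtered text. The sketch below assumes the hightab and filtertab tables from the examples in this chapter, and filters and highlights the same document with the same plaintext setting. Treating each offset as a 1-based character position suitable for DBMS_LOB.SUBSTR is an assumption; check it against the Highlight Table description in Appendix B.

```sql
-- Produce the HTML version and the HTML offsets for the same document.
begin
  ctx_doc.filter('newsindex', '20', 'filtertab', 0, FALSE);
  ctx_doc.highlight('newsindex', '20', 'dog', 'hightab', 0, FALSE);
end;
/

-- List the highlight offsets in document order.
select h.offset, h.length
  from hightab h
 where h.query_id = 0
 order by h.offset;

-- Extract each highlighted term from the filtered document.
-- Assumption: h.offset is a 1-based character position.
select h.offset,
       dbms_lob.substr(f.document, h.length, h.offset) as term
  from hightab h, filtertab f
 where h.query_id = 0
   and f.query_id = 0
 order by h.offset;
```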

MARKUP

The CTX_DOC.MARKUP procedure takes a query specification and a document textkey and returns a version of the document in which the query terms are marked-up. These marked-up terms are either the words that satisfy a word query or the themes that satisfy an ABOUT query.

The marked-up output can be either plaintext or HTML.

You can use one of the pre-defined tagsets for marking highlighted terms, including a tag sequence that enables HTML navigation.

You usually call CTX_DOC.MARKUP after a query, from which you identify the document to be processed.

Syntax

CTX_DOC.MARKUP(
          index_name  IN VARCHAR2,
          textkey     IN VARCHAR2,
          text_query  IN VARCHAR2,
          restab      IN VARCHAR2,
          query_id    IN NUMBER   DEFAULT 0,
          plaintext   IN BOOLEAN  DEFAULT FALSE,
          tagset      IN VARCHAR2 DEFAULT 'TEXT_DEFAULT',
          starttag    IN VARCHAR2 DEFAULT NULL,
          endtag      IN VARCHAR2 DEFAULT NULL,
          prevtag     IN VARCHAR2 DEFAULT NULL,
          nexttag     IN VARCHAR2 DEFAULT NULL);

index_name

Specify the name of the index associated with the text column containing the document identified by textkey.

textkey

Specify the unique identifier (usually the primary key) for the document.

The textkey parameter can be a single column textkey or an encoded specification for a composite (multiple column) textkey.

text_query

Specify the original query expression used to retrieve the document.

restab

Specify the name of the result table where the marked-up document is stored.

See Also:
For more information about the structure of the markup result table, see "Markup Table" in Appendix B.

query_id

Specify the identifier used to identify the row inserted into restab.

plaintext

Specify TRUE to generate a plaintext marked-up version of the document. Specify FALSE to generate a marked-up HTML version of the document if you are using the INSO filter or indexing HTML documents.

tagset

Specify one of the following pre-defined tagsets. The second and third columns show how the four different tags are defined for each tagset:

Tagset	Tag	Tag Value
TEXT_DEFAULT	starttag	<<<
	endtag	>>>
	prevtag
	nexttag
HTML_DEFAULT	starttag	<B>
	endtag	</B>
	prevtag
	nexttag
HTML_NAVIGATE	starttag	<A NAME=ctx%CURNUM><B>
	endtag	</B></A>
	prevtag	<A HREF=#ctx%PREVNUM><</A>
	nexttag	<A HREF=#ctx%NEXTNUM>></A>

starttag

Specify the character(s) inserted by MARKUP to indicate the start of a highlighted term.

The sequence of starttag, endtag, prevtag and nexttag with respect to the highlighted word is as follows:

... prevtag starttag word endtag nexttag...

endtag

Specify the character(s) inserted by MARKUP to indicate the end of a highlighted term.

prevtag

Specify the markup sequence that defines the tag that navigates the user to the previous highlight.

In the markup sequences prevtag and nexttag, you can specify the following offset variables which are set dynamically:

Offset Variable	Value
%CURNUM	the current offset number
%PREVNUM	the previous offset number
%NEXTNUM	the next offset number

See the description of the HTML_NAVIGATE tagset for an example.

nexttag

Specify the markup sequence that defines the tag that navigates the user to the next highlight tag.

Within the markup sequence, you can use the same offset variables you use for prevtag. See the explanation for prevtag and the HTML_NAVIGATE tagset for an example.
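
The tag parameters can be illustrated with a call that supplies its own start and end tags. This is a sketch only: the index, textkey, and table names are taken from the examples in this chapter, the <I> and </I> values are arbitrary, and it assumes that explicitly supplied tags take precedence over the corresponding tags of the chosen tagset, which is not stated here.

```sql
begin
  -- Mark each hit for dog or cat in italics instead of the tagset default.
  -- Assumption: starttag and endtag override the tagset's own values.
  ctx_doc.markup(index_name => 'my_index',
                 textkey    => '23',
                 text_query => 'dog|cat',
                 restab     => 'markuptab',
                 query_id   => 2,
                 tagset     => 'HTML_DEFAULT',
                 starttag   => '<I>',
                 endtag     => '</I>');
end;
/
```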

Examples

Markup Table

Create the highlight markup table to store the marked-up document as follows:

create table markuptab (query_id  number,   
                        document  clob);

Word Highlighting in HTML

To create HTML highlight markup for the words dog or cat for document 23, issue the following statement:

begin
  ctx_doc.markup(index_name => 'my_index',
                 textkey    => '23',
                 text_query => 'dog|cat',
                 restab     => 'markuptab',
                 query_id   => 1,
                 tagset     => 'HTML_DEFAULT');
end;

Theme Highlighting in HTML

To create HTML highlight markup for the theme of politics for document 23, issue the following statement:

begin
  ctx_doc.markup(index_name => 'my_index',
                 textkey    => '23',
                 text_query => 'about(politics)',
                 restab     => 'markuptab',
                 query_id   => 1,
                 tagset     => 'HTML_DEFAULT');
end;
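
The same pattern works with the HTML_NAVIGATE tagset, after which the marked-up document can be read from the result table. The following sketch assumes the markuptab table and the my_index index from the examples above; the query_id value 3 is arbitrary.

```sql
begin
  -- Mark up document 23 with navigation anchors around each hit.
  ctx_doc.markup(index_name => 'my_index',
                 textkey    => '23',
                 text_query => 'dog|cat',
                 restab     => 'markuptab',
                 query_id   => 3,
                 tagset     => 'HTML_NAVIGATE');
end;
/

-- Retrieve the marked-up document written by the call above.
select document
  from markuptab
 where query_id = 3;
```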

Notes

Before CTX_DOC.MARKUP is called, the result table specified in restab must exist.

When textkey is a composite textkey, you must encode the composite textkey string using the CTX_DOC.PKENCODE function.

If text_query includes wildcards, stemming, or fuzzy matching that results in stopwords being returned, MARKUP does not highlight the stopwords.

If text_query contains the threshold operator, the operator is ignored. The MARKUP procedure always returns highlight information for the entire result set.

When query_id is specified, all rows with the same query_id are deleted from restab before new rows are generated with query_id.

When query_id is not specified or set to NULL, it defaults to 0. You must manually truncate the table specified in restab.

PKENCODE

The CTX_DOC.PKENCODE function converts a composite textkey list into a single string and returns the string.

The string created by PKENCODE can be used as the primary key parameter textkey in other CTX_DOC procedures, such as CTX_DOC.THEMES and CTX_DOC.GIST.

Syntax

CTX_DOC.PKENCODE(
         pk1    IN VARCHAR2,
         pk2    IN VARCHAR2 DEFAULT NULL,
         pk3    IN VARCHAR2 DEFAULT NULL,
         pk4    IN VARCHAR2 DEFAULT NULL, 
         pk5    IN VARCHAR2 DEFAULT NULL, 
         pk6    IN VARCHAR2 DEFAULT NULL,
         pk7    IN VARCHAR2 DEFAULT NULL,
         pk8    IN VARCHAR2 DEFAULT NULL,
         pk9    IN VARCHAR2 DEFAULT NULL,
         pk10   IN VARCHAR2 DEFAULT NULL,
         pk11   IN VARCHAR2 DEFAULT NULL,
         pk12   IN VARCHAR2 DEFAULT NULL,
         pk13   IN VARCHAR2 DEFAULT NULL,
         pk14   IN VARCHAR2 DEFAULT NULL,
         pk15   IN VARCHAR2 DEFAULT NULL,
         pk16   IN VARCHAR2 DEFAULT NULL)
RETURN VARCHAR2;

pk1-pk16

Each PK argument specifies a column element in the composite textkey list. You can encode at most 16 column elements.

Returns

String that represents the encoded value of the composite textkey.

Examples

begin 
ctx_doc.gist('newsindex',CTX_DOC.PKENCODE('smith', 14), 'CTX_GIST');
end;

In this example, smith and 14 constitute the composite textkey value for the document.
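
When several CTX_DOC procedures are called for the same document, the encoded key can be computed once and reused. This is a minimal sketch that assumes the CTX_THEMES and CTX_GIST result tables from the examples in this chapter; the variable name v_key and its declared length of 2000 are illustrative choices.

```sql
declare
  v_key varchar2(2000);  -- assumed to be large enough for the encoded key
begin
  -- Encode the two-column textkey once.
  v_key := ctx_doc.pkencode('smith', '14');

  -- Reuse the encoded key wherever a textkey is expected.
  ctx_doc.themes('newsindex', v_key, 'CTX_THEMES', 1);
  ctx_doc.gist('newsindex', v_key, 'CTX_GIST', 1);
end;
/
```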

THEMES

The CTX_DOC.THEMES procedure generates a list of up to fifty themes for a document. Each theme is stored as a row in a result table specified by the user.

Syntax

CTX_DOC.THEMES(index_name      IN VARCHAR2,
               textkey         IN VARCHAR2,
               restab          IN VARCHAR2,
               query_id        IN NUMBER DEFAULT 0,
               full_themes     IN BOOLEAN DEFAULT FALSE);

index_name

Specify the name of the index associated with the text column containing the document for which the list of themes is generated.

textkey

Specify the textkey (usually the primary key) of the document (row) to be processed. The parameter textkey can be a single column textkey or an encoded specification for a multiple column textkey.

restab

Specify the name of the result table used to store the output generated by THEMES.

See Also:
For more information about the structure of the theme result table, see "Theme Table" in Appendix B.

query_id

Specify the identifier used to identify the row(s) inserted into restab.

full_themes

Specify whether this procedure generates a single theme or a hierarchical list of parent themes (full themes) for each document theme.

Specify TRUE for this procedure to write full themes to the THEME column of the result table.

Specify FALSE for this procedure to write single theme information to the THEME column of the result table. This is the default.

Examples

Theme Table

The following example creates a theme table called CTX_THEMES:

create table CTX_THEMES (query_id number, 
                         theme varchar2(2000), 
                         weight number);

Single Themes

To obtain a list of themes where each element in the list is a single theme, issue:

begin
ctx_doc.themes('newsindex',34,'CTX_THEMES',1,full_themes => FALSE);
end;

Full Themes

To obtain a list of themes where each element in the list is a hierarchical list of parent themes, issue:

begin
ctx_doc.themes('newsindex',34,'CTX_THEMES',1,full_themes => TRUE);
end;
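
After THEMES runs, each theme is available as one row in the result table. The query below is a sketch that assumes the CTX_THEMES table and the query_id of 1 used in the examples above. Ordering by descending weight assumes that a higher weight indicates a more prominent theme.

```sql
-- List the themes generated for query_id 1.
-- Assumption: a higher weight marks a more prominent theme.
select theme, weight
  from CTX_THEMES
 where query_id = 1
 order by weight desc;
```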

Notes

When textkey is a composite textkey, you must encode the composite textkey string using the CTX_DOC.PKENCODE function.
