3
Indexing

This chapter introduces the concepts necessary for understanding the index preference objects supplied with interMedia Text.

The following topics are discussed in this chapter:

Overview

When you use CREATE INDEX to create an index or ALTER INDEX to manage an index, you can optionally specify indexing preferences, stoplists, and section groups in the parameter string. Specifying a preference, stoplist, or section group answers one of the following questions about the way Oracle indexes text:

Preference Class	Description
Datastore	How are your documents stored?
Filter	How can the documents be converted to plaintext?
Lexer	What language is being indexed?
Wordlist	How should stem and fuzzy queries be expanded?
Storage	How should the index tables be stored?
Stop List	What words or themes are not to be indexed?
Section Group	Is querying within sections enabled and how are the document sections defined?

This chapter describes the options you have for setting each preference. You enable an option by creating a preference with one of the objects described in this chapter.

For example, to specify that your documents are stored in external files, you can create a datastore preference called mydatastore using the FILE_DATASTORE object and specify mydatastore as the datastore preference in the parameter string of CREATE INDEX.
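The example described above might look like the following minimal sketch. The table docs, its filename column, and the index name docs_idx are hypothetical names chosen for illustration; only mydatastore and FILE_DATASTORE come from the text.

```sql
begin
  -- Create a datastore preference named mydatastore from the FILE_DATASTORE object.
  ctx_ddl.create_preference('mydatastore', 'FILE_DATASTORE');
end;
/

-- Name the preference in the parameter string of CREATE INDEX.
-- docs and filename are hypothetical; filename holds the external file names.
create index docs_idx on docs(filename) indextype is context
  parameters('datastore mydatastore');
```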

Creating Preferences

To create a datastore, lexer, filter, wordlist, or storage preference, use the CTX_DDL.CREATE_PREFERENCE procedure and specify one of the objects described in this chapter. For some objects, you can also set attributes with CTX_DDL.SET_ATTRIBUTE.

To create a stoplist, use CTX_DDL.CREATE_STOPLIST.

To create section groups, use CTX_DDL.CREATE_SECTION_GROUP and specify a section group type.
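As a rough sketch, a stoplist and a section group might be created as follows. The names my_stoplist and my_sections are hypothetical, as are the table and column in the CREATE INDEX statement. CTX_DDL.ADD_STOPWORD and the HTML_SECTION_GROUP type are not described in this section and are used here only as assumptions for illustration.

```sql
begin
  -- Stoplist: words that are not to be indexed.
  ctx_ddl.create_stoplist('my_stoplist');
  ctx_ddl.add_stopword('my_stoplist', 'the');
  ctx_ddl.add_stopword('my_stoplist', 'of');

  -- Section group: the second argument is the section group type.
  ctx_ddl.create_section_group('my_sections', 'HTML_SECTION_GROUP');
end;
/

-- Hypothetical table and column; both objects are named in the parameter string.
create index docs_idx on docs(text) indextype is context
  parameters('stoplist my_stoplist section group my_sections');
```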

Datastore Objects

Use the datastore objects to specify how your text is stored. To create a data storage preference, you must use one of the following objects:

Object	Use When
DIRECT_DATASTORE	Data is stored internally in the text column. Each row is indexed as a single document.
DETAIL_DATASTORE	Data is stored internally in the text column. The document consists of one or more rows in a detail table, with header information stored in a master table.
FILE_DATASTORE	Data is stored externally in operating system files. File names are stored in the text column.
URL_DATASTORE	Data is stored externally in files located on an intranet or the Internet. Uniform Resource Locators (URLs) are stored in the text column.
USER_DATASTORE	Documents are synthesized at index time by a user-defined stored procedure.

DIRECT_DATASTORE

Use the DIRECT_DATASTORE object for text stored directly in the database. It has no attributes.

DETAIL_DATASTORE

Use the DETAIL_DATASTORE object for text stored directly in the database in detail tables, with the textkey column located in the master table.

DETAIL_DATASTORE has the following attributes:

Attribute	Attribute Value
binary	Specify TRUE for Oracle to add no newline character after each detail row. Specify FALSE for Oracle to add a newline character (\n) after each detail row automatically.
detail_table	Specify name of detail table (OWNER.TABLE if necessary)
detail_key	Specify name of detail table foreign key column(s)
detail_lineno	Specify name of detail table sequence column.
detail_text	Specify name of detail table text column.

Example Master/Detail Tables

This example illustrates how master and detail tables are related to each other.

Master Table

Master tables define the documents in a master/detail relationship. You assign an identifying number to each document. The following table is an example master table, called my_master:

Column Name	Column Type	Description
article_id	NUMBER	Document ID, unique for each document. (Primary Key)
author	VARCHAR2(30)	Author of document.
title	VARCHAR2(50)	Title of Document
body	CHAR(1)	Dummy column to specify in CREATE INDEX.

Detail Table

Detail tables contain the text for a document, whose content is usually stored across a number of rows. The following detail table my_detail is related to the master table my_master with the article_id column. This column identifies the master document to which each detail row (sub-document) belongs.

Column Name	Column Type	Description
article_id	NUMBER	Document ID that relates to master table.
seq	NUMBER	Sequence of document in the master document defined by article_id.
text	CLOB	Document text.
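The two tables described above could be created with statements along these lines. This is a sketch only: the column names and types come from the tables above, while the primary key and foreign key constraints are assumptions based on the column descriptions.

```sql
-- Master table: one row per document.
create table my_master (
  article_id number primary key,   -- document ID, unique for each document
  author     varchar2(30),         -- author of document
  title      varchar2(50),         -- title of document
  body       char(1)               -- dummy column to specify in CREATE INDEX
);

-- Detail table: one or more rows of text per document.
create table my_detail (
  article_id number references my_master(article_id),  -- relates to master table
  seq        number,                                   -- order of the row within the document
  text       clob                                      -- document text
);
```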

Attributes

In this example, the DETAIL_DATASTORE attributes have the following values:

Attribute	Attribute Value
binary	TRUE
detail_table	my_detail
detail_key	article_id
detail_lineno	seq
detail_text	text

You use CTX_DDL.CREATE_PREFERENCE to create a preference with DETAIL_DATASTORE. You use CTX_DDL.SET_ATTRIBUTE to set the attributes for this preference as described above. The following example shows how this is done:

begin
  -- Create the preference from the DETAIL_DATASTORE object.
  ctx_ddl.create_preference('my_detail_pref', 'DETAIL_DATASTORE');
  -- Set each attribute to the value listed above.
  ctx_ddl.set_attribute('my_detail_pref', 'binary', 'true');
  ctx_ddl.set_attribute('my_detail_pref', 'detail_table', 'my_detail');
  ctx_ddl.set_attribute('my_detail_pref', 'detail_key', 'article_id');
  ctx_ddl.set_attribute('my_detail_pref', 'detail_lineno', 'seq');
  ctx_ddl.set_attribute('my_detail_pref', 'detail_text', 'text');
end;

Index

To index the document defined in this master/detail relationship, you specify a column in the master table with CREATE INDEX. The column you specify must be one of the allowable types.

This example uses the body column, whose function is to allow the creation of the master/detail index and to improve readability of the code. The my_detail_pref preference is set to DETAIL_DATASTORE with the required attributes:

CREATE INDEX myindex on my_master(body) indextype is context 
parameters('datastore my_detail_pref');

In this example, you can also specify the title or author column to create the index. However, if you do so, changes to these columns will trigger a re-index operation.

FILE_DATASTORE

The FILE_DATASTORE object is used for text stored in files accessed through the local file system.

FILE_DATASTORE has the following attribute(s):

Attribute	Attribute Values
path	path1:path2:...:pathn

path

Specify the location of text files that are stored externally in a file system.

You can specify multiple paths for path, with each path separated by a colon (:). File names are stored in the text column in the text table. If path is not used to specify a path for external files, Oracle requires the path to be included in the file names stored in the text column.
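A minimal sketch of setting path follows. The preference name my_file_pref and both directory names are hypothetical.

```sql
begin
  ctx_ddl.create_preference('my_file_pref', 'FILE_DATASTORE');
  -- Two hypothetical directories, separated by a colon.
  ctx_ddl.set_attribute('my_file_pref', 'path', '/docs/reports:/docs/memos');
end;
/
```

With a preference like this, the text column would hold only file names; without the path attribute, each stored file name would need to include its path.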

URL_DATASTORE

Use the URL_DATASTORE object for text stored:

in files on the World Wide Web (accessed through HTTP or FTP)
in files in the local file system (accessed through the file protocol)

You store each URL in a single text field.

URL Syntax

The syntax of a URL you store in a text field must comply with the RFC 1738 specification. The syntax of this specification is as follows:

[URL:]<access_scheme>://[<user_id>:<password>@]<host_name>[:<port_number>]/[<url_path>]

The access_scheme string is either ftp, http, or file.

As part of the RFC 1738 specification, the following restriction holds for the URL syntax:

The URL must contain only printable ASCII characters. Non-printable ASCII characters and multibyte characters must be escaped with the %xx notation, where xx is the hexadecimal representation of the special character.

URL_DATASTORE Attributes

URL_DATASTORE has the following attributes:

Attribute	Attribute Values
timeout	Specify timeout in seconds. The valid range is 15 to 3600 seconds. The default is 30.
maxthreads	Specify maximum number of threads that can be running simultaneously. Use a number between 1 and 1024. Default is 8.
urlsize	Specify maximum length of URL string in bytes. Use a number between 32 and 65535. Defaults to 256.
maxurls	Specify maximum size of URL buffer. Use a number between 32 and 65535. Defaults to 256.
maxdocsize	Specify maximum document size. Use a number between 256 and 2,147,483,647 bytes (2 gigabytes). Defaults to 2,000,000.
http_proxy	Specify host name of HTTP proxy server.
ftp_proxy	Specify host name of FTP proxy server.
no_proxy	Specify domain for no proxy server. Use a comma-separated string of up to 16 domain names.

timeout

Specify the length of time, in seconds, that a network operation such as a connect or read waits before timing out and returning a timeout error to the application. The valid range for timeout is 15 to 3600 and the default is 30.

Note:
Since timeout is at the network operation level, the total timeout may be longer than the time specified for timeout.

maxthreads

Specify the maximum number of threads that can be running at the same time. The valid range for maxthreads is 1 to 1024 and the default is 8.

urlsize

Specify the maximum length, in bytes, that the URL data store supports for URLs stored in the database. If a URL is over the maximum length, an error is returned. The valid range for urlsize is 32 to 65535 and the default is 256.

Note:
The values specified for maxurls and urlsize, when multiplied, cannot exceed 5,000,000.
In other words, the maximum size of the memory buffer (maxurls * urlsize) for the URL object is approximately 5 megabytes.

maxurls

Specify the maximum number of rows that the internal buffer can hold for HTML documents (rows) retrieved from the text table. The valid range for maxurls is 32 to 65535 and the default is 256.

maxdocsize

Specify the maximum size, in bytes, that the URL data store supports for accessing HTML documents whose URLs are stored in the database. The valid range for maxdocsize is 1 to 2,147,483,647 (2 gigabytes) and the default is 2,000,000.

http_proxy

Specify the fully-qualified name of the host machine that serves as the HTTP proxy (gateway) for the machine on which interMedia Text is installed. This attribute must be set if the machine is in an intranet that requires authentication through a proxy server to access Web files located outside the firewall.

ftp_proxy

Specify the fully-qualified name of the host machine that serves as the FTP proxy (gateway) for the machine on which interMedia Text is installed. This attribute must be set if the machine is in an intranet that requires authentication through a proxy server to access Web files located outside the firewall.

no_proxy

Specify a string of domains (up to sixteen, separated by commas) which are found in most, if not all, of the machines in your intranet. When one of the domains is encountered in a host name, no request is sent to the machine(s) specified for ftp_proxy and http_proxy. Instead, the request is processed directly by the host machine identified in the URL.

For example, if the string us.oracle.com, uk.oracle.com is entered for no_proxy, any URL requests to machines that contain either of these domains in their host names are not processed by your proxy server(s).
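The attributes above might be combined as in the following sketch. The preference name my_url_pref and the proxy host name are hypothetical, and the numeric values are arbitrary choices within the documented ranges.

```sql
begin
  ctx_ddl.create_preference('my_url_pref', 'URL_DATASTORE');
  ctx_ddl.set_attribute('my_url_pref', 'timeout', '60');       -- seconds, per network operation
  ctx_ddl.set_attribute('my_url_pref', 'maxthreads', '16');
  -- maxurls * urlsize = 512 * 1024 = 524,288, which is under the 5,000,000 limit.
  ctx_ddl.set_attribute('my_url_pref', 'urlsize', '1024');
  ctx_ddl.set_attribute('my_url_pref', 'maxurls', '512');
  -- Hypothetical proxy host; requests to the no_proxy domains bypass it.
  ctx_ddl.set_attribute('my_url_pref', 'http_proxy', 'proxy.example.com');
  ctx_ddl.set_attribute('my_url_pref', 'no_proxy', 'us.oracle.com,uk.oracle.com');
end;
/
```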

USER_DATASTORE

Use the USER_DATASTORE object to define stored procedures that synthesize documents during indexing. For example, a user procedure might synthesize author, date, and text columns into one document to have the author and date information be part of the indexed text.

The USER_DATASTORE has the following attribute:

Attribute	Attribute Value
procedure	Specify the name of the procedure that synthesizes the document to be indexed.

procedure

Specify the name of the procedure that synthesizes the document to be indexed. This specification must be in the form PROCEDURENAME or PACKAGENAME.PROCEDURENAME. The schema owner name is constrained to CTXSYS, so specifying owner name is not necessary.

The procedure you specify must have the following parameters:

(IN ROWID, IN OUT CLOB)

The procedure is called once for each row indexed. Given the rowid of the current row, the procedure must write the text of the document into the CLOB locator.

The following constraints apply to the procedure you specify:

the procedure must be owned by CTXSYS
the procedure must be executable by the index owner
the procedure cannot issue DDL or transaction control statements like COMMIT
the procedure cannot be a safe callout or call a safe callout
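The following sketch shows one way such a procedure might be written. Everything named here is an assumption for illustration: the procedure my_synth_proc, the preference my_user_pref, and a table scott.articles with author and text columns of type VARCHAR2 (short enough for the combined string to fit the buffer) and a pub_date column of type DATE. The procedure and the grant are run as CTXSYS, which would also need SELECT access to the table.

```sql
-- Owned by CTXSYS, as the constraints above require.
create or replace procedure my_synth_proc(rid in rowid, tlob in out clob) is
  buf varchar2(4000);
begin
  for r in (select author, pub_date, text
              from scott.articles
             where rowid = rid) loop
    -- Combine the three columns into one document.
    buf := r.author || ' ' || to_char(r.pub_date, 'YYYY-MM-DD') || ' ' || r.text;
    -- Write the synthesized text into the CLOB locator.
    dbms_lob.writeappend(tlob, length(buf), buf);
  end loop;
end;
/

-- The procedure must be executable by the index owner (hypothetical user scott).
grant execute on my_synth_proc to scott;

begin
  ctx_ddl.create_preference('my_user_pref', 'USER_DATASTORE');
  -- No owner prefix is needed because the owner is constrained to CTXSYS.
  ctx_ddl.set_attribute('my_user_pref', 'procedure', 'my_synth_proc');
end;
/
```

The procedure issues no DDL and no COMMIT, in keeping with the constraints listed above.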

Filter Objects

Use the filter objects to create preferences that determine how text is filtered for indexing. Filters allow word processor and formatted documents, as well as plain text and HTML documents, to be indexed.

For formatted documents, Oracle stores documents in their native format and uses filters to build temporary plain text or HTML versions of the documents. Oracle indexes the plain text/HTML version of the formatted document.

To create a filter preference, you must use one of the following objects:

Filter Preference Object	Description
NULL_FILTER	ASCII filter.
INSO_FILTER	Inso filter for filtering formatted documents.
USER_FILTER	User-defined filter
CHARSET_FILTER	Character set converting filter.

NULL_FILTER

Use the NULL_FILTER object to specify that plain text is stored in the text column and no filtering needs to be performed. NULL_FILTER has no attributes.

INSO_FILTER

The Inso filter is a universal filter that filters most document formats. Use it for indexing single and mixed format columns. The INSO_FILTER has no attributes.

See Also:
For a list of the formats supported by INSO_FILTER and to learn more about how to set up your environment to use this filter, see Appendix C, "Supported Filter Formats".

USER_FILTER

Use the USER_FILTER object to specify an external filter for filtering documents in a column. USER_FILTER has the following attribute:

Attribute	Attribute Values
command	filter executable

command

Specify the executable for the single external filter used to filter all text stored in a column. If more than one document format is stored in the column, the external filter specified for command must recognize and handle all such formats.

The executable you specify must go in the $ORACLE_HOME/ctx/bin directory. You must create your user-filter executable with two parameters: the first is the name of the input file to be read, and the second is the name of the output file to be written to.

If all the document formats are supported by the INSO_FILTER, use INSO_FILTER instead of USER_FILTER unless additional tasks besides filtering are required for the documents.
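As a sketch, a USER_FILTER preference might be created as follows. The preference name my_filter_pref and the executable name myfilter.sh are hypothetical; the executable is assumed to reside in $ORACLE_HOME/ctx/bin and to accept an input file name and an output file name as its two parameters.

```sql
begin
  ctx_ddl.create_preference('my_filter_pref', 'USER_FILTER');
  -- Name of the hypothetical filter executable in $ORACLE_HOME/ctx/bin.
  ctx_ddl.set_attribute('my_filter_pref', 'command', 'myfilter.sh');
end;
/
```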

CHARSET_FILTER

Use the CHARSET_FILTER to convert documents from one character set to the database character set.

The CHARSET_FILTER has the following attribute:

Attribute	Attribute Value
charset	Specify the NLS name of source character set. Specify JAAUTO for Japanese character set auto-detection. This filter automatically detects the custom character specification in JA16EUC or JA16SJIS and converts to the database character set. This filter is useful in Japanese when your data files have mixed character sets.

See Also:
For more information about the supported NLS character sets, see Oracle8i National Language Support Guide.
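For instance, a preference that uses Japanese character set auto-detection might be sketched as follows. The preference name my_charset_pref is hypothetical.

```sql
begin
  ctx_ddl.create_preference('my_charset_pref', 'CHARSET_FILTER');
  -- JAAUTO detects JA16EUC or JA16SJIS and converts to the database character set.
  ctx_ddl.set_attribute('my_charset_pref', 'charset', 'JAAUTO');
end;
/
```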

Lexer Objects

Use the lexer preference to specify the language of the text to be indexed. To create a lexer preference, you must use one of the following objects:

Object	Description
BASIC_LEXER	Lexer used for extracting tokens from text in languages, such as English and most western European languages, that use single-byte character sets.
CHINESE_VGRAM_LEXER	Lexer used for extracting tokens from Chinese-language text.
JAPANESE_VGRAM_LEXER	Lexer used for extracting tokens from Japanese-language text.
KOREAN_LEXER	Lexer used for extracting tokens from Korean-language text.

BASIC_LEXER

Use the BASIC_LEXER object to identify tokens for creating Text indexes for English and all other supported single-byte languages.

The BASIC_LEXER is also used to enable base-letter conversion, composite word indexing, case-sensitive indexing and alternate spelling for single-byte languages that have extended character sets.

In English, you can use the BASIC_LEXER to enable theme indexing.

Note:
Any changes made to tokens before Text indexing (e.g. removing of characters, base-letter conversion) are also performed on the query terms in a Text query. This ensures that the query terms match the form of the tokens in the Text index.

BASIC_LEXER has the following attributes:

Attribute	Attribute Values
continuation	characters (string)
numgroup	characters (string)
numjoin	characters (string)
printjoins	characters (string)
punctuations	characters (string)
skipjoins	characters (string)
startjoins	non-alphanumeric characters that occur at the beginning of a token (string)
endjoins	non-alphanumeric characters that occur at the end of a token (string)
whitespace	characters (string)
newline	NEWLINE (\n) CARRIAGE_RETURN (\r)
base_letter	NO (disabled)
	YES (enabled)
mixed_case	NO (disabled)
	YES (enabled)
composite	NO (no composite word indexing, default)
	GERMAN (German composite word indexing)
	DUTCH (Dutch composite word indexing)
index_themes	YES (enabled)
	NO (disabled)
index_text	YES (enabled)
	NO (disabled)
alternate_spelling	GERMAN (German alternate spelling)
	DANISH (Danish alternate spelling)
	SWEDISH (Swedish alternate spelling)

Note:
The BASIC_LEXER object attributes that use character strings can contain multiple characters. Each character in the string serves as a distinct character for that type of attribute.
For example, if the string '*_.-' is specified for the printjoins attribute, each individual character ('*', '_', '.', and '-') in the string is treated as a joining character that is included in the index entry for a token in which the character occurs.
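The printjoins setting from the note above could be applied as in this sketch. The preference name my_lexer is hypothetical.

```sql
begin
  ctx_ddl.create_preference('my_lexer', 'BASIC_LEXER');
  -- Each of the four characters in the string is treated as a separate printjoins character.
  ctx_ddl.set_attribute('my_lexer', 'printjoins', '*_.-');
end;
/
```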

continuation

Specify the characters that indicate a word continues on the next line and should be indexed as a single token. The most common continuation characters are hyphen '-' and backslash '\'.

numgroup

Specify a single character that, when it appears in a string of digits, indicates that the digits are groupings within a larger single unit.

For example, the comma ',' might be defined as a numgroup character because it often indicates a grouping of thousands when it appears in a string of digits.

numjoin

Specify the characters that, when they appear in a string of digits, cause Oracle to index the string of digits as a single unit or word.

For example, the period '.' may be defined as a numjoin character because it often serves as the decimal point when it appears in a string of digits.

Note:
The default values for numjoin and numgroup are determined by the NLS initialization parameters that are specified for the database.
In general, a value need not be specified for either numjoin or numgroup when creating a Lexer preference for the BASIC_LEXER object.

printjoins

Specify the non-alphanumeric characters that, when they appear anywhere in a word (beginning, middle, or end), are processed as alphanumeric and included with the token in the Text index. This includes printjoins that occur consecutively.

For example, if the hyphen '-' and underscore '_' characters are defined as printjoins, terms such as pseudo-intellectual and _file_ are stored in the Text index as pseudo-intellectual and _file_.

Note:
If a printjoins character is also defined as a punctuations character, the character is only processed as an alphanumeric character if the character immediately following it is a standard alphanumeric character or has been defined as a printjoins or skipjoins character.

punctuations

Specify the non-alphanumeric characters that, when they appear at the end of a word, indicate the end of a sentence. The defaults are period '.', question mark '?', and exclamation point '!'.

Characters that are defined as punctuations are removed from a token before text indexing; however, if a punctuations character is also defined as a printjoins character, the character is only removed if it is the last character in the token and it is immediately preceded by the same character.

For example, if the period (.) is defined as both a printjoins and a punctuations character, the following transformations take place during both indexing and querying:

Token	Indexed Token
.doc	.doc
dog.doc	dog.doc
dog..doc	dog..doc
dog.	dog
dog...	dog..

In addition, BASIC_LEXER uses punctuations characters in conjunction with newline and whitespace characters to determine sentence and paragraph delimiters for sentence/paragraph searching.

skipjoins

Specify the non-alphanumeric characters that, when they appear within a word, identify the word as a single token; however, the characters are not stored with the token in the Text index.

For example, if the hyphen character '-' is defined as a skipjoins character, the word pseudo-intellectual is stored in the Text index as pseudointellectual.

Note:
printjoins and skipjoins are mutually exclusive. The same characters cannot be specified for both attributes.

startjoins/endjoins

Specify the characters that, when encountered as the first character in a token, explicitly identify the start of the token. The character, as well as any other startjoins characters that immediately follow it, is included in the Text index entry for the token. In addition, the first startjoins character in a string of startjoins characters implicitly ends the previous token.

endjoins specifies the characters that, when encountered as the last character in a token, explicitly identify the end of the token. The character, as well as any other startjoins characters that immediately follow it, is included in the Text index entry for the token.

The following rules apply to both startjoins and endjoins:

the characters specified for startjoins/endjoins cannot occur in any of the other attributes for BASIC_LEXER.
startjoins/endjoins characters can occur only at the beginning/end of tokens

whitespace

Specify the characters that are treated as blank spaces between tokens. BASIC_LEXER uses whitespace characters in conjunction with punctuations and newline characters to identify character strings that serve as sentence delimiters for sentence/paragraph searching.

The predefined, default values for whitespace are 'space' and 'tab'; these values cannot be changed. Specifying characters as whitespace characters adds to these defaults.

newline

Specify the characters that indicate the end of a line of text. BASIC_LEXER uses newline characters in conjunction with punctuations and whitespace characters to identify character strings that serve as paragraph delimiters for sentence/paragraph searching.

The only valid values for newline are NEWLINE and CARRIAGE_RETURN (for carriage returns). The default is NEWLINE.

base_letter

Specify whether characters that have diacritical marks (umlauts, cedillas, acute accents, etc.) are converted to their base form before being stored in the Text index. The default is NO (base-letter conversion disabled).

mixed_case

Specify whether the lexer converts the tokens in Text index entries to all uppercase or stores the tokens exactly as they appear in the text. The default is NO (tokens converted to all uppercase).

Note:
Oracle ensures Text queries match the case-sensitivity of the index being queried. As a result, if you enable case-sensitivity for your Text index, queries against the index are always case-sensitive.

composite

Specify whether composite word indexing is disabled or enabled for either GERMAN or DUTCH text. The default is NO (composite word indexing disabled).

Note:
The composite and mixed_case attributes are mutually exclusive; composite indexes do not support case-sensitivity.

index_themes

Specify YES to index theme information in English. This makes ABOUT queries more precise. The index_themes and index_text attributes cannot both be NO.

index_text

Specify YES to index word information. The index_themes and index_text attributes cannot both be NO.

alternate_spelling

Specify either GERMAN, DANISH, or SWEDISH to enable alternate spelling in one of these languages. By default, alternate spelling is enabled in all three languages.

See Also:
For more information about the alternate spelling conventions Oracle uses, see Appendix F, "Alternate Spelling Conventions".
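Several of the attributes above might be combined for German text as in this sketch. The preference name my_german_lexer is hypothetical, and the choice of attributes is only one plausible configuration.

```sql
begin
  ctx_ddl.create_preference('my_german_lexer', 'BASIC_LEXER');
  -- Enable German composite word indexing and German alternate spelling.
  ctx_ddl.set_attribute('my_german_lexer', 'composite', 'GERMAN');
  ctx_ddl.set_attribute('my_german_lexer', 'alternate_spelling', 'GERMAN');
  -- mixed_case is left at its default (NO) because composite and
  -- mixed_case are mutually exclusive.
end;
/
```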

CHINESE_VGRAM_LEXER

The CHINESE_VGRAM_LEXER object identifies tokens in Chinese text for creating Text indexes. It has no attributes.

JAPANESE_VGRAM_LEXER

The JAPANESE_VGRAM_LEXER object identifies tokens in Japanese for creating Text indexes. It has no attributes.

KOREAN_LEXER

The KOREAN_LEXER object identifies tokens in Korean text for creating Text indexes. When you use the KOREAN_LEXER, specify the following attributes:

KOREAN_LEXER Attribute	Attribute Values
verb	Specify TRUE or FALSE to index verb
adjective	Specify TRUE or FALSE to index adjective
adverb	Specify TRUE or FALSE to index adverb
onechar	Specify TRUE or FALSE to index one character
number	Specify TRUE or FALSE to index number
udic	Specify TRUE or FALSE to index user dictionary
xdic	Specify TRUE or FALSE to index x-user dictionary
composite	Specify TRUE or FALSE to index composites
morpheme	Specify TRUE or FALSE for morphological analysis
toupper	Specify TRUE or FALSE to convert English to uppercase
tohangeul	Specify TRUE or FALSE to convert hanja to hangeul
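A KOREAN_LEXER preference might be set up as in the following sketch. The preference name my_korean_lexer is hypothetical, and the TRUE/FALSE choices are arbitrary illustrations rather than recommended settings.

```sql
begin
  ctx_ddl.create_preference('my_korean_lexer', 'KOREAN_LEXER');
  ctx_ddl.set_attribute('my_korean_lexer', 'verb', 'TRUE');        -- index verbs
  ctx_ddl.set_attribute('my_korean_lexer', 'adjective', 'TRUE');   -- index adjectives
  ctx_ddl.set_attribute('my_korean_lexer', 'number', 'FALSE');     -- do not index numbers
  ctx_ddl.set_attribute('my_korean_lexer', 'tohangeul', 'TRUE');   -- convert hanja to hangeul
end;
/
```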

Wordlist Object

Use the wordlist preference to enable the advanced query options such as stemming and fuzzy matching for your language. To create a wordlist preference, you must use BASIC_WORDLIST, which is the only object available.

BASIC_WORDLIST

Use the BASIC_WORDLIST object to enable stemming and fuzzy matching for Text indexes.

See Also:
For more information about the stem and fuzzy operators, see Chapter 4, "Query Operators".

BASIC_WORDLIST has the following attributes:

Table 3-1

Attribute	Attribute Values
stemmer	Specify which language stemmer to use. You can specify one of the following: NULL (no stemming), ENGLISH (English inflectional), DERIVATIONAL (English derivational), DUTCH, FRENCH, GERMAN, ITALIAN, SPANISH
fuzzy_match	Specify which fuzzy matching cluster to use. You can specify one of the following: GENERIC, JAPANESE_VGRAM, KOREAN, CHINESE_VGRAM, ENGLISH, DUTCH, FRENCH, GERMAN, ITALIAN, SPANISH, OCR
fuzzy_score	Specify a default lower limit of fuzzy score. Specify a number between 0 and 80. Setting fuzzy score means scores below this number are not produced.
fuzzy_numresults	Specify the maximum number of fuzzy expansions. Use a number between 0 and 5000.

stemmer

Specify the stemmer used for word stemming in Text queries. When there is no stemmer for a language, the default is NULL. With the NULL stemmer, the $ operator is ignored in queries.

fuzzy_match

Specify which fuzzy matching routines are used for the column. Fuzzy matching is currently supported for English, Japanese, and, to a lesser extent, the Western European languages.

The default for fuzzy_match is GENERIC.

Note:
The fuzzy_match attribute values for Chinese and Korean are dummy attribute values that prevent the English and Japanese fuzzy matching routines from being used on Chinese and Korean text.

fuzzy_score

Specify a default lower limit for the fuzzy score, as a number between 0 and 80. Expansions that score below this number are not produced.

The fuzzy score measures how close an expanded word is to the query word: the higher the score, the better the match. Use this attribute to limit fuzzy expansions to the best matches.

fuzzy_numresults

Specify the maximum number of fuzzy expansions. Use a number between 0 and 5000.

Setting a limit restricts the fuzzy expansion to that number of the best-matching words.
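
As a minimal sketch of how these attributes fit together, the following block creates a wordlist preference and then names it in the parameter string of CREATE INDEX. The names mywordlist, myindex, docs, and text are hypothetical, the attribute values are illustrative picks from the ranges listed above, and the trailing slash assumes the block is run from SQL*Plus.

```sql
begin
  -- mywordlist is a hypothetical preference name
  ctx_ddl.create_preference('mywordlist', 'BASIC_WORDLIST');
  -- English inflectional stemming for the stem operator
  ctx_ddl.set_attribute('mywordlist', 'STEMMER', 'ENGLISH');
  -- English fuzzy matching cluster
  ctx_ddl.set_attribute('mywordlist', 'FUZZY_MATCH', 'ENGLISH');
  -- illustrative values: drop expansions scoring below 60,
  -- and keep at most 100 expansions
  ctx_ddl.set_attribute('mywordlist', 'FUZZY_SCORE', '60');
  ctx_ddl.set_attribute('mywordlist', 'FUZZY_NUMRESULTS', '100');
end;
/

-- docs(text) is a hypothetical table and text column
create index myindex on docs(text)
  indextype is ctxsys.context
  parameters ('wordlist mywordlist');
```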

Storage Objects

Use the storage preference to specify tablespace and creation parameters for tables associated with a Text index. The system provides a single storage object called BASIC_STORAGE:

Object Description

BASIC_STORAGE

Indexing object used to specify the tablespace and creation parameters for the database tables and indexes that constitute a Text index.

BASIC_STORAGE

The BASIC_STORAGE object specifies the tablespace and creation parameters for the database tables and indexes that constitute a Text index.

The clause you specify is added to the internal CREATE TABLE statement (CREATE INDEX for i_index_clause) at index creation. You can specify most allowable clauses, such as storage, LOB storage, or partitioning.

However, do not specify an index-organized table clause.

See Also:
For more information about how to specify CREATE TABLE and CREATE INDEX clauses, see their command syntax specification in Oracle8i SQL Reference.

BASIC_STORAGE has the following attributes:

BASIC_STORAGE Attribute Attribute Value

i_table_clause

Parameter clause for dr$<indexname>$I table creation. Specify storage and tablespace clauses to add to the end of the internal CREATE TABLE statement.
The I table is the index data table.

k_table_clause

Parameter clause for dr$<indexname>$K table creation. Specify storage and tablespace clauses to add to the end of the internal CREATE TABLE statement.
The K table is the keymap table.

r_table_clause

Parameter clause for dr$<indexname>$R table creation. Specify storage and tablespace clauses to add to the end of the internal CREATE TABLE statement.
The R table is the rowid table.

n_table_clause

Parameter clause for dr$<indexname>$N table creation. Specify storage and tablespace clauses to add to the end of the internal CREATE TABLE statement.
The N table is the negative list table.

i_index_clause

Parameter clause for dr$<indexname>$X index creation. Specify storage and tablespace clauses to add to the end of the internal CREATE INDEX statement.

Storage Default Behavior

By default, BASIC_STORAGE attributes are not set. In such cases, the Text index tables are created in the index owner's default tablespace. Consider the following statement, issued by user IUSER, with no BASIC_STORAGE attributes set:

create index IOWNER.idx on TOWNER.tab(b) indextype is ctxsys.context;

In this example, the index tables are created in IOWNER's default tablespace.

Storage Example

The following example specifies that the index tables and the index are to be created in the foo tablespace with an initial extent of 1K:

begin
  -- create the storage preference, then set one clause per index object
  ctx_ddl.create_preference('mystore', 'BASIC_STORAGE');
  ctx_ddl.set_attribute('mystore', 'I_TABLE_CLAUSE',
                        'tablespace foo storage (initial 1K)');
  ctx_ddl.set_attribute('mystore', 'K_TABLE_CLAUSE',
                        'tablespace foo storage (initial 1K)');
  ctx_ddl.set_attribute('mystore', 'R_TABLE_CLAUSE',
                        'tablespace foo storage (initial 1K)');
  ctx_ddl.set_attribute('mystore', 'N_TABLE_CLAUSE',
                        'tablespace foo storage (initial 1K)');
  ctx_ddl.set_attribute('mystore', 'I_INDEX_CLAUSE',
                        'tablespace foo storage (initial 1K)');
end;
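
To put the mystore preference to use, you name it after the storage keyword in the parameter string of CREATE INDEX. The following is a sketch only: myindex, docs, and text are hypothetical names, and the dictionary query is one possible way to check where the index tables were placed, not a documented verification step.

```sql
-- docs(text) is a hypothetical table and text column
create index myindex on docs(text)
  indextype is ctxsys.context
  parameters ('storage mystore');

-- Assumption: the index tables follow the dr$<indexname>$ naming
-- pattern described above and are visible to the index owner.
select table_name, tablespace_name
  from user_tables
 where table_name like 'DR$MYINDEX$%';
```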

Section Group Types

You can specify one of the following group types to create a section group with CTX_DDL.CREATE_SECTION_GROUP:

Section Group Preference Description

NULL_SECTION_GROUP

This is the default. Use this group type when you define no sections or when you define only SENTENCE or PARAGRAPH sections.

BASIC_SECTION_GROUP

Use this group type for defining sections where the start and end tags are of the form <A> and </A>.

HTML_SECTION_GROUP

Use this group type for defining sections in HTML documents.

XML_SECTION_GROUP

Use this group type for defining sections in XML-style tagged documents.

NEWS_SECTION_GROUP

Use this group type for defining sections in newsgroup-formatted documents according to the RFC 1036 specification.
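
As a rough illustration, the following sketch creates a section group of type HTML_SECTION_GROUP and names it in the parameter string of CREATE INDEX. The names htmgroup, heading, myindex, docs, and text are hypothetical. The sketch also assumes that the CTX_DDL.ADD_ZONE_SECTION procedure is available in your release for mapping a tag to a section name.

```sql
begin
  -- htmgroup is a hypothetical section group name
  ctx_ddl.create_section_group('htmgroup', 'HTML_SECTION_GROUP');
  -- assumption: define a zone section named heading for the H1 tag
  ctx_ddl.add_zone_section('htmgroup', 'heading', 'H1');
end;
/

-- docs(text) is a hypothetical table and text column
create index myindex on docs(text)
  indextype is ctxsys.context
  parameters ('section group htmgroup');
```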

System-Defined Preferences

When you install interMedia Text, some indexing preferences are created. You can use these preferences in the parameter string of CREATE INDEX or define your own. These preferences are divided into the following categories:

Data Storage

CTXSYS.DEFAULT_DATASTORE

This preference uses the DIRECT_DATASTORE object. It is used to create indexes for text columns in which the text is stored directly in the column.

CTXSYS.FILE_DATASTORE

This preference uses the FILE_DATASTORE object.

CTXSYS.URL_DATASTORE

This preference uses the URL_DATASTORE object.

Filter

CTXSYS.NULL_FILTER

This preference uses the NULL_FILTER object.

CTXSYS.INSO_FILTER

This preference uses the INSO_FILTER object.

Lexer

CTXSYS.DEFAULT_LEXER

This preference defaults to the lexer required for the language you specify during your database setup.

Section Group

CTXSYS.NULL_SECTION_GROUP

This preference uses the NULL_SECTION_GROUP object.

CTXSYS.HTML_SECTION_GROUP

This preference uses the HTML_SECTION_GROUP object.

Stoplist

CTXSYS.DEFAULT_STOPLIST

This stoplist preference defaults to the stoplist of the language specified during your database setup.

See Also:
For a complete list of the stop words in the supplied stoplists, see Appendix E, "Supplied Stoplists".

Storage

CTXSYS.DEFAULT_STORAGE

This storage preference uses the BASIC_STORAGE object.

Wordlist

CTXSYS.DEFAULT_WORDLIST

This preference uses the language stemmer for the language specified during your database setup. If your language is not listed in Table 3-1, this preference defaults to the NULL stemmer and the GENERIC fuzzy matching attribute.
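
The system-defined preferences can be named directly in the parameter string. The following sketch, in which myindex, docs, and text are hypothetical names, asks for the INSO filter and the HTML section group explicitly instead of relying on defaults:

```sql
-- docs(text) is a hypothetical table and text column
create index myindex on docs(text)
  indextype is ctxsys.context
  parameters ('filter ctxsys.inso_filter section group ctxsys.html_section_group');
```

Any preference class left out of the parameter string falls back to the corresponding system default.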

System Parameters

General

When you install interMedia Text, in addition to the system-defined preferences, the following system parameters are set:

System Parameter Description

MAX_INDEX_MEMORY

This is the maximum indexing memory that can be specified in the parameter string of CREATE INDEX and ALTER INDEX.

DEFAULT_INDEX_MEMORY

This is the default indexing memory used with CREATE INDEX and ALTER INDEX.

LOG_DIRECTORY

This is the directory for ctx_output files.

You can view system defaults with the CTX_PARAMETERS view. You can change defaults using the CTX_ADM.SET_PARAMETER procedure.

Default Index Parameters

The following default parameters are used when you do not specify preferences in the parameter string of CREATE INDEX. Each default parameter names a pre-defined preference to use for data storage, filtering, lexing, and so on.

Viewing Default Values

You can view system defaults with the CTX_PARAMETERS view.

Changing Default Values

You can change a default value using the CTX_ADM.SET_PARAMETER procedure to name another preference, custom or pre-defined, to use as default.
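
A minimal sketch of both steps follows. It assumes that the CTX_PARAMETERS view exposes the columns PAR_NAME and PAR_VALUE, that you are connected as a user allowed to run CTX_ADM procedures (typically CTXSYS), and that mylexer is a hypothetical lexer preference you have already created.

```sql
-- list the current system parameters and default preferences
select par_name, par_value
  from ctx_parameters;

-- make the hypothetical preference mylexer the default lexer
begin
  ctx_adm.set_parameter('default_lexer', 'mylexer');
end;
/
```

Indexes created afterward without a lexer preference in the parameter string would then use mylexer; a change like this is not expected to alter indexes that already exist.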

System Parameter	Used When	Default Value
DEFAULT_DATASTORE	No datastore preference specified in parameter string of CREATE INDEX.	CTXSYS.DEFAULT_DATASTORE
DEFAULT_FILTER_FILE	No filter preference specified in parameter string of CREATE INDEX, and either of the following conditions is true: your documents are stored in external files (BFILEs) or you specify a datastore preference that uses FILE_DATASTORE.	CTXSYS.INSO_FILTER
DEFAULT_FILTER_BINARY	No filter preference specified in parameter string of CREATE INDEX, and Oracle detects that the text column datatype is RAW, LONG RAW, or BLOB.	CTXSYS.INSO_FILTER
DEFAULT_FILTER_TEXT	No filter preference specified in parameter string of CREATE INDEX, and Oracle detects that the text column datatype is either LONG, VARCHAR2, VARCHAR, CHAR, or CLOB.	CTXSYS.NULL_FILTER
DEFAULT_SECTION_HTML	No section group specified in parameter string of CREATE INDEX, and either of the following conditions is true: your datastore preference uses URL_DATASTORE or your filter preference uses INSO_FILTER.	CTXSYS.HTML_SECTION_GROUP
DEFAULT_SECTION_TEXT	No section group specified in parameter string of CREATE INDEX, and you use neither URL_DATASTORE nor INSO_FILTER.	CTXSYS.NULL_SECTION_GROUP
DEFAULT_STORAGE	No storage preference specified in parameter string of CREATE INDEX.	CTXSYS.DEFAULT_STORAGE
DEFAULT_LEXER	No lexer preference specified in parameter string of CREATE INDEX.	CTXSYS.DEFAULT_LEXER
DEFAULT_STOPLIST	No stoplist specified in parameter string of CREATE INDEX.	CTXSYS.DEFAULT_STOPLIST
DEFAULT_WORDLIST	No wordlist preference specified in parameter string of CREATE INDEX.	CTXSYS.DEFAULT_WORDLIST
