10
ConText Data Dictionary

This chapter provides reference information for the ConText data dictionary objects provided with ConText.

The topics discussed in this chapter are:

Tiles, Tile Attributes, and Attribute Values: Indexing

The following section lists all of the Tiles that can be used to create indexing preferences for use in policies, along with the attributes and attribute values for each indexing Tile. Brief descriptions of the Tile attributes and examples are also provided.

The indexing Tiles are grouped alphabetically by preference category:

Preference Category	Tiles
Data Store Category	DIRECT
	MASTER DETAIL
	OSFILE
	URL
Filter Category	BLASTER FILTER
	FILTER NOP
	HTML FILTER
	USER FILTER
Lexer Category	BASIC LEXER
	CHINESE V-GRAM LEXER
	JAPANESE V-GRAM LEXER
	KOREAN LEXER
	THEME LEXER
Engine Category	GENERIC ENGINE
	ENGINE NOP
Wordlist Category	GENERIC WORD LIST
Stoplist Category	GENERIC STOP LIST

Note:

The Compressor category is not listed because data compression is not supported in this release of ConText. Predefined NULL Compressor Tiles and preferences are used as defaults in any policies created.

Data Store Category

The Data Store category contains the following Tiles:

Tile	Attributes	Attribute Values
DIRECT	none	N/A
MASTER DETAIL	BINARY	0 (plain text)
		1 (binary text)
OSFILE	PATH	path1:path2:...:pathn
URL	TIMEOUT	seconds (0 to 3600, default 30)
	MAXTHREADS	number of threads (1 to 1024, default 8)
	MAXURLS	buffer length in bytes (1 to 4294967295, default 256)
	URLSIZE	URL length (32 to 65535, default 256)
	MAXDOCSIZE	document size (1 to 4294967295, default 2000000)
	HTTP_PROXY	host name
	NO_PROXY	string (up to 16 strings, separated by commas)

MASTER DETAIL Tile Attribute(s)

The binary attribute specifies whether the text in a master detail table is in plain text format (0) or binary format (1).

Plain text uses a newline character to mark the end of each line. Binary format does not use newline characters to mark the end of a line.
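As a minimal sketch, a preference for a master detail table that holds plain text might be created as shown here. It assumes the set_attribute/create_preference pattern used in the Data Store Example later in this section. The preference name MD_PLAIN and its description are hypothetical, and passing the value as the string '0' is an assumption based on how other examples in this chapter pass numeric values.

```sql
begin
  -- 0 = plain text (newline at the end of each line), 1 = binary
  ctx_ddl.set_attribute ('BINARY', '0');
  -- MD_PLAIN and the description are illustrative, not predefined objects
  ctx_ddl.create_preference ('MD_PLAIN', 'Plain text in a master detail table', 'MASTER DETAIL');
end;
```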

OSFILE Tile Attribute(s)

The path attribute specifies the location of text files that are stored externally in a file system.

Multiple paths can be specified for the path attribute, with each path separated by a colon (:). File names are stored in the text column in the text table. If the path attribute is not used to specify a path for external files, ConText requires the path to be included in the file names stored in the text column.

Note:

If text is stored in external files rather than in a database, the files must be accessible from the host machine on which the ConText server is running.

This can be accomplished by storing the files in the file system for the host machine or by mounting the file system where the files are stored to the host machine.
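The colon-separated form of the path attribute can be sketched as follows. The two directories and the preference name MULTI_PATH are hypothetical, and the paths use UNIX syntax; the calls follow the pattern of the Data Store Example later in this section.

```sql
begin
  -- two hypothetical UNIX directories, separated by a colon
  ctx_ddl.set_attribute ('PATH', '/private/mydocs:/shared/docs');
  -- MULTI_PATH is an illustrative preference name
  ctx_ddl.create_preference ('MULTI_PATH', 'Documents in two directories', 'OSFILE');
end;
```

With a preference like this in a policy, the text column would be expected to hold file names without a directory prefix, since the directories are supplied by the path attribute.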

URL Tile Attribute(s)

The timeout attribute specifies the length of time, in seconds, that a network operation such as 'connect' or 'read' waits before timing out and returning a timeout error to the application. The valid range for timeout is 0 to 3600 and the default is 30.

Note:

Since timeout is at the network operation level, the total timeout may be longer than the time specified for timeout.

The maxthreads attribute specifies the maximum number of threads that can be running at the same time. The valid range for maxthreads is 1 to 1024 and the default is 8.

Note:

The upper range of maxthreads corresponds to the number of file descriptors that the operating system can process at one time. If the number of files the operating system can process at one time is less than the value set, an invalid socket error may be returned.

The maxurls attribute specifies the maximum number of rows that the internal buffer can hold for HTML documents (rows) retrieved from the text table. The valid range for maxurls is 1 to 4294967295 and the default is 256.

The urlsize attribute specifies the maximum length, in bytes, that the URL data store supports for URLs stored in the database. If a URL is over the maximum set, an error is returned. The valid range for urlsize is 32 to 65535 and the default is 256.

The maxdocsize attribute specifies the maximum size, in bytes, that the URL data store supports for accessing HTML documents whose URLs are stored in the database. The valid range for maxdocsize is 1 to 4294967295 and the default is 2000000 (2 MB).

The http_proxy attribute specifies the fully-qualified name of the host machine that serves as the proxy (gateway) for the machine on which ConText is installed.

The no_proxy attribute specifies the strings (up to sixteen, separated by commas) which, when encountered in a host name, cause the URL data store to ignore the machine as a proxy machine.

For example, if the string 'us.oracle.com, uk.oracle.com' is entered for no_proxy, any machines that contain either of these domains in their host names are ignored as proxy machines.
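A sketch of a URL preference that overrides some defaults and names a proxy is shown here. The proxy host name, the preference name URL_VIA_PROXY, and the chosen values are hypothetical; the values are passed as strings on the assumption that set_attribute accepts them as it does in the Filter and Wordlist examples in this chapter. Attributes that are not set are assumed to keep their defaults.

```sql
begin
  -- wait up to 60 seconds per network operation (valid range 0 to 3600)
  ctx_ddl.set_attribute ('TIMEOUT', '60');
  -- run at most 4 threads at the same time (upper limit 1024)
  ctx_ddl.set_attribute ('MAXTHREADS', '4');
  -- hypothetical fully-qualified proxy host name
  ctx_ddl.set_attribute ('HTTP_PROXY', 'www-proxy.example.com');
  -- host names containing either string are not treated as proxy machines
  ctx_ddl.set_attribute ('NO_PROXY', 'us.oracle.com, uk.oracle.com');
  ctx_ddl.create_preference ('URL_VIA_PROXY', 'URLs retrieved through a proxy', 'URL');
end;
```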

Data Store Example

The following example creates a preference named doc_pref for the OSFILE Tile:

begin
  ctx_ddl.set_attribute ('PATH', '/private/mydocs');
  ctx_ddl.create_preference ('DOC_PREF', 'Path for my documents', 'OSFILE');
end;

Note:

This example illustrates usage of OSFILE for documents stored in a UNIX-based environment.

The directory path syntax may be different for other environments.

Filter Category

The Filter category contains the following Tiles:

Tile	Attributes	Attribute Values
BLASTER FILTER	EXECUTABLE	format id (number), filter executable, sequence (number)
	FORMAT	0 or 999 (No filter -- plain/ASCII text)
		1 or 4 (Word Perfect for Windows 5.x; Word Perfect for DOS 5.0, 5.1)
		2 (MS Word for DOS 5.0, 5.5)
		5 (Word Perfect for Windows 6.x; Word Perfect for DOS 6.0)
		6 (MS Word for Mac 3, 4, 5.x)
		7 (MS Word for Windows 2)
		8 (AMIPRO for Windows 1, 2, 3)
		9 (Lotus 1-2-3 for Windows 2, 3, 4, 5; Lotus 1-2-3 for DOS 4, 5)
		11 (MS Word for Windows 6.x, 7.0)
		13 (Xerox XIF for UNIX 5, 6)
		997 (Autorecognize)
FILTER NOP	none	N/A
HTML FILTER	CODE_CONVERSION	0 (disabled)
		1 (enabled)
USER FILTER	COMMAND	filter executable

BLASTER FILTER Tile Attribute(s)

The format attribute specifies the internal filter used for filtering text stored in a text column.

The executable attribute specifies the external filters that are used to filter text stored in a mixed-format text column. It has three values that must be specified:

format_id (document format for the external filter)
filter_executable (name of executable that performs the filtering for the document format)
sequence_num (identifier for the executable and document format used in the preference)

Note:

format and executable cannot both be set in the same preference.

See Also:

For a list of the format IDs supported by the executable attribute, see "Supported Formats for Mixed-Format Columns" in this chapter.

HTML FILTER Tile Attribute(s)

The code_conversion attribute specifies whether code conversion is enabled for documents which contain Japanese ASCII text with HTML tags.

Code conversion is required for Japanese HTML documents if the documents use more than one of the three character sets supported for HTML text in Japanese. If code conversion is enabled, all Japanese HTML documents are converted to a single, common character set before indexing.

The default for code_conversion is 0 (disabled).

Note:

For mixed-format columns that use Autorecognize (BLASTER Tile, format attribute = 997) or use external filters (BLASTER Tile, executable attribute) for all formats except HTML, code conversion is always enabled.
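Enabling code conversion for a column of Japanese HTML documents might look like the following sketch. The preference name HTML_JA is hypothetical, and the calls assume the same pattern as the Filter Example later in this section.

```sql
begin
  -- 1 = convert all Japanese HTML documents to a single character set before indexing
  ctx_ddl.set_attribute ('CODE_CONVERSION', '1');
  -- HTML_JA is an illustrative preference name
  ctx_ddl.create_preference ('HTML_JA', 'Japanese HTML with code conversion', 'HTML FILTER');
end;
```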

USER FILTER Tile Attribute(s)

The command attribute specifies the executable for the single external filter used to filter all text stored in a column. If more than one document format is stored in the column, the external filter must recognize and handle all such formats.
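As a rough illustration, a preference that routes all filtering through one external program could be defined as follows. The executable name my_filter and the preference name MY_USER_FILTER are hypothetical; whether the value may include a directory path is not stated here, so only a bare executable name is shown.

```sql
begin
  -- my_filter is a hypothetical executable that must handle every format stored in the column
  ctx_ddl.set_attribute ('COMMAND', 'my_filter');
  ctx_ddl.create_preference ('MY_USER_FILTER', 'Single external filter for all documents', 'USER FILTER');
end;
```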

Filter Example

The following example creates a preference named word6 for the BLASTER FILTER Tile:

begin
  ctx_ddl.set_attribute ('FORMAT', '11');
  ctx_ddl.create_preference ('WORD6', 'Microsoft Word docs', 'BLASTER FILTER');
end;

Lexer Category

The Lexer category contains the following Tiles:

Tile	Attributes	Attribute Values
BASIC LEXER	PUNCTUATIONS	characters (string)
	PRINTJOINS	characters (string)
	SKIPJOINS	characters (string)
	NUMJOIN	characters (string)
	NUMGROUP	characters (string)
	CONTINUATION	characters (string)
	BASE_LETTER	0 (disabled)
		1 (enabled)
CHINESE V-GRAM LEXER	HANZI_INDEXING	1
		2
JAPANESE V-GRAM LEXER	KANJI_INDEXING	1
		2
KOREAN LEXER	none	N/A
THEME LEXER	none	N/A

Note:

The character strings for each BASIC LEXER Tile attribute can contain multiple characters. Each character in the string serves as a punctuation, join, or continuation character.

For example, if the string '.?!' is specified for the punctuations attribute, each individual character ('.', '?', '!') in the string is treated by ConText as a sentence delimiter during indexing and queries.

BASIC LEXER Tile Attribute(s)

punctuations specifies the characters that indicate the end of a sentence.

printjoins specifies the characters that join words together when they appear between the words with no blank spaces. Words that contain printjoin characters are stored in the text index exactly as they appear in the text.

For example, if a hyphen '-' is defined as a printjoin character, the word pseudo-intellectual is stored in the text index as pseudo-intellectual.

skipjoins specifies the characters that join words together, but the characters are not stored in the text index.

For example, if a hyphen '-' is defined as a skipjoin character, the word pseudo-intellectual is stored in the text index as pseudointellectual.

Note:

printjoins and skipjoins are mutually exclusive. The same characters cannot be specified for both attributes.

numjoin specifies the characters that, when they appear in a string of digits, cause ConText to index the string of digits as a single unit or word.

For example, a period '.' may be defined as a numjoin character because it often serves as a decimal point when it appears in a string of digits.

numgroup specifies the characters that, when they appear in a string of digits, indicate that the digits are groupings within a larger single unit.

For example, a comma ',' may be defined as a numgroup character because it often indicates a grouping of thousands when it appears in a string of digits.

Note:

The default values for numjoin and numgroup are determined by the NLS initialization parameters that are specified for the database.

In general, a value does not need to be specified for either numjoin or numgroup when creating a Lexer preference for the BASIC LEXER Tile.

continuation specifies the characters that indicate a word continues on the next line and should be indexed as a single token. The most common continuation characters are a hyphen '-' and a backslash '\'.

base_letter specifies whether characters that have diacritical marks (umlauts, cedillas, acute accents, etc.) are converted to their base form for text indexing and text queries.
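Several BASIC LEXER attributes can be combined in one preference. The sketch below assumes that multiple set_attribute calls may precede a single create_preference call, as in the Engine Example in this chapter; the preference name SKIP_HYPHEN_LEXER is hypothetical.

```sql
begin
  -- each character is treated as a sentence delimiter
  ctx_ddl.set_attribute ('PUNCTUATIONS', '.?!');
  -- hyphen as a skipjoin: pseudo-intellectual is stored as pseudointellectual
  ctx_ddl.set_attribute ('SKIPJOINS', '-');
  -- 1 = convert characters with diacritical marks to their base form
  ctx_ddl.set_attribute ('BASE_LETTER', '1');
  ctx_ddl.create_preference ('SKIP_HYPHEN_LEXER', 'Skipjoin hyphen, base letters', 'BASIC LEXER');
end;
```

Because printjoins and skipjoins are mutually exclusive, a preference like this one should not also list the hyphen in PRINTJOINS.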

CHINESE V-GRAM LEXER Tile Attribute(s)

The hanzi_indexing attribute specifies the length of the character groups used for pattern matching while indexing.

A value of 1 for hanzi_indexing indicates that the Chinese lexer examines each character individually to determine token boundaries.

A value of 2 for hanzi_indexing indicates that the lexer examines characters in pairs to determine token boundaries.

The default is 2.

JAPANESE V-GRAM LEXER Tile Attribute(s)

The kanji_indexing attribute specifies the length of the character groups used for pattern matching while indexing.

A value of 1 for kanji_indexing indicates that the Japanese lexer examines each character individually to determine token boundaries.

A value of 2 for kanji_indexing indicates that the lexer examines pairs of characters to determine token boundaries.

The default is 2.
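A preference that overrides the default and has the Japanese lexer examine one character at a time might be sketched as follows. The preference name JA_SINGLE_CHAR is hypothetical, and the value is passed as a string by analogy with the other examples in this chapter.

```sql
begin
  -- 1 = examine each character individually; 2 (the default) = examine pairs
  ctx_ddl.set_attribute ('KANJI_INDEXING', '1');
  ctx_ddl.create_preference ('JA_SINGLE_CHAR', 'Japanese lexer, single characters', 'JAPANESE V-GRAM LEXER');
end;
```

The same pattern would apply to the CHINESE V-GRAM LEXER Tile with the hanzi_indexing attribute.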

Lexer Example

The following example creates a preference named doc_link for the BASIC LEXER Tile:

begin
  ctx_ddl.set_attribute     ('PRINTJOINS', '-*/');
  ctx_ddl.create_preference ('DOC_LINK', 'Dash, star, slash', 'BASIC LEXER' );
end;

Engine Category

The Engine category contains the following Tiles:

Tile	Attributes	Attribute Values
GENERIC ENGINE	INDEX_MEMORY	memory in bytes (integer)
	OPTIMIZE_DEFAULT	default ConText index optimization method
	I1T_TABLESPACE, I1T_STORAGE, I1T_OTHER_PARMS	tablespace, STORAGE clause, and other table creation parameters for token table
	I1I_TABLESPACE, I1I_STORAGE, I1I_OTHER_PARMS	tablespace, STORAGE clause, and other index creation parameters for index on token table
	KTB_TABLESPACE, KTB_STORAGE, KTB_OTHER_PARMS	tablespace, STORAGE clause, and other table creation parameters for mapping table
	KID_TABLESPACE, KID_STORAGE, KID_OTHER_PARMS, KIK_TABLESPACE, KIK_STORAGE, KIK_OTHER_PARMS	tablespace, STORAGE clause, and other index creation parameters for indexes on mapping table
	LST_TABLESPACE, LST_STORAGE, LST_OTHER_PARMS	tablespace, STORAGE clause, and other table creation parameters for control table
	LIX_TABLESPACE, LIX_STORAGE, LIX_OTHER_PARMS	tablespace, STORAGE clause, and other index creation parameters for index on control table
	SQR_TABLESPACE, SQR_STORAGE, SQR_OTHER_PARMS	tablespace, STORAGE clause, and other table creation parameters for SQE results table
	SRI_TABLESPACE, SRI_STORAGE, SRI_OTHER_PARMS	tablespace, STORAGE clause, and other index creation parameters for index on SQE results table
	SQE_TABLESPACE, SQE_STORAGE, SQE_OTHER_PARMS	tablespace, STORAGE clause, and other table creation parameters for SQE definition table (NOT USED)
	SEI_TABLESPACE, SEI_STORAGE, SEI_OTHER_PARMS	tablespace, STORAGE clause, and other index creation parameters for index on SQE definition table (NOT USED)
ENGINE NOP	none	N/A

GENERIC ENGINE Tile Attribute(s)

index_memory specifies the amount of memory, in bytes, allocated for indexing.

Note:

When specifying a value for index_memory in a preference, specify as much real (not virtual) memory as is available on the machine which is running the ConText server that will be creating indexes.

For parallel indexing, the memory specified should be the amount of available memory divided evenly among the number of ConText servers that will perform the indexing in parallel.
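The division for parallel indexing can be worked through with illustrative numbers. Assume, hypothetically, that 48 MB (50331648 bytes) of real memory is available and that four ConText servers will index in parallel; each server then gets 50331648 / 4 = 12582912 bytes. The preference name PARALLEL_ENGINE is also hypothetical.

```sql
begin
  -- 50331648 bytes available / 4 servers = 12582912 bytes per server (assumed figures)
  ctx_ddl.set_attribute ('INDEX_MEMORY', 12582912);
  ctx_ddl.create_preference ('PARALLEL_ENGINE', 'Memory for 4 parallel servers', 'GENERIC ENGINE');
end;
```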

optimize_default specifies the type of optimization used when CTX_DDL.OPTIMIZE_INDEX is called without an optimization type. If no value is specified for optimize_default, the default is DEFRAGMENT_TO_TWO_TABLE.

i1t_tablespace, ktb_tablespace, and lst_tablespace specify the tablespaces used for the ConText index tables created during indexing.

sqr_tablespace specifies the tablespace used for the stored query expression result (SQR) table that is created, but not populated, during indexing. The SQR table for a policy stores the results of stored query expressions for the policy.

i1i_tablespace, kid_tablespace, kik_tablespace, and lix_tablespace specify the tablespaces used for the Oracle indexes generated for each ConText index table during indexing.

sri_tablespace specifies the tablespace used for the Oracle index generated for each SQR table.

Note:

For each TABLESPACE attribute that is not specified when creating an Engine preference, the text table owner's default tablespace is used for storing the ConText index objects (tables and indexes).

i1t_storage, ktb_storage, and lst_storage specify the STORAGE clauses used to create the ConText index tables during ConText indexing.

sqr_storage specifies the STORAGE clause used to create the stored query expression result (SQR) table during ConText indexing.

i1i_storage, kid_storage, kik_storage, and lix_storage specify the STORAGE clauses used to create the Oracle indexes for each ConText index table.

sri_storage specifies the STORAGE clause used to create the Oracle index for each SQR table.

i1t_other_parms, ktb_other_parms, and lst_other_parms specify any additional parameters used to create the ConText index tables during ConText indexing.

sqr_other_parms specifies any additional parameters used to create the stored query expression result (SQR) table during ConText indexing.

i1i_other_parms, kid_other_parms, kik_other_parms, and lix_other_parms specify any additional parameters used to create the Oracle indexes for each ConText index table.

sri_other_parms specifies any additional parameters used to create the Oracle index for each SQR table.

Note:

In particular, the other_parms attributes are used to specify a value for the PARALLEL clause in the CREATE TABLE/INDEX command. The PARALLEL clause determines the degree of parallelism used by the Oracle8 parallel query option for operations such as generating Oracle indexes.

sqe/sei_tablespace, sqe/sei_storage, and sqe/sei_other_parms are not used by ConText because SQE tables and their accompanying Oracle indexes are not used for storing SQE definitions (all SQE definitions are stored in a system table owned by CTXSYS). As a result, values are not required for these attributes.

See Also:

For descriptions of the tables and indexes that constitute a ConText index, see Appendix C, "ConText Index Tables and Indexes".

For more information about the storage clauses and other parameters that can be specified for a database table/index, see the CREATE TABLE and CREATE INDEX commands in Oracle8 Server SQL Reference.

For more information about the parallel query option in Oracle8, see Oracle8 Server Tuning.

For more information about SQEs, see Oracle8 ConText Cartridge Application Developer's Guide.

Engine Example

The following example creates a preference named doc_engine for the GENERIC ENGINE Tile:

begin
  ctx_ddl.set_attribute ('INDEX_MEMORY',   30000000 );
  ctx_ddl.set_attribute ('I1T_TABLESPACE', 'DOCUMENTS' );
  ctx_ddl.set_attribute ('I1T_STORAGE',' initial 10M next 2M
                         maxextents 10');
  ctx_ddl.set_attribute ('I1T_OTHER_PARMS',' pctfree 20');
  ctx_ddl.set_attribute ('I1I_OTHER_PARMS',' parallel 2');
  ctx_ddl.create_preference ('DOC_ENGINE', 'Test case',
                             'GENERIC ENGINE' );
end;

Wordlist Category

The Wordlist category contains the following Tiles:

Tile	Attributes	Attribute Values
GENERIC WORD LIST	STCLAUSE	STORAGE clause for Soundex wordlist table
	INSTCLAUSE	STORAGE clause for index on Soundex wordlist table
	SOUNDEX_AT_INDEX	0 (disabled)
		1 (enabled)
	STEMMER	1 (English)
		2 (English -- derivational)
		3 (Dutch)
		4 (French)
		5 (German)
		6 (Italian)
		7 (Spanish)
	FUZZY_MATCH	1 (English and other Western European languages)
		2 (Japanese)
		3 (Korean)
		4 (Chinese)

GENERIC WORD LIST Tile Attribute(s)

The stclause attribute specifies the STORAGE clause used to create the Soundex wordlist table during ConText indexing. The Soundex wordlist table is only created if Soundex is enabled through the soundex_at_index attribute.

The instclause attribute specifies the STORAGE clause used to create the Oracle index for the Soundex wordlist table.

The soundex_at_index attribute specifies whether ConText generates Soundex word mappings and stores them in the Soundex wordlist table during text indexing. If Soundex word mappings are not generated and stored in the wordlist table during indexing, queries that use Soundex are not expanded.

The stemmer attribute specifies the stemmer used for word stemming in text queries. For all the supported languages, the stemmers return standard inflected forms of a word, such as the plural form (e.g. department --> departments).

For English, an additional stemmer is provided which returns standard inflected forms and derived forms (e.g. department --> departments, departmentalize).

The default for stemmer is 1 (inflectional English).

The fuzzy_match attribute specifies which fuzzy matching routines are used for the column. Fuzzy matching is currently supported for English, Japanese, and, to a lesser extent, the Western European languages.

The default for fuzzy_match is 1.

Note:

The fuzzy_match attribute values for Chinese and Korean are dummy attribute values that prevent the English and Japanese fuzzy matching routines from being used on Chinese and Korean text.
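A Wordlist preference for a column of German text might be sketched as follows. The preference name GERMAN_WORDLIST is hypothetical, the Tile name is spelled as it appears in the Wordlist category table, and the values are passed as strings by analogy with the other examples in this chapter.

```sql
begin
  -- 5 = German stemmer
  ctx_ddl.set_attribute ('STEMMER', '5');
  -- 1 = fuzzy matching for English and other Western European languages (the default)
  ctx_ddl.set_attribute ('FUZZY_MATCH', '1');
  ctx_ddl.create_preference ('GERMAN_WORDLIST', 'German stemming and fuzzy matching', 'GENERIC WORD LIST');
end;
```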

See Also:

For more information about the expansion methods supported by ConText, see "WordList Category" in Chapter 5, "Understanding the ConText Data Dictionary".

For more information about expansion methods in queries, see Oracle8 ConText Cartridge Application Developer's Guide.

Wordlist Example

The following example creates a preference named soundex_yes for the GENERIC WORD LIST Tile:

begin
  ctx_ddl.set_attribute('SOUNDEX_AT_INDEX', '1');
  ctx_ddl.create_preference('SOUNDEX_YES',
                            'Will build the soundex mapping during indexing',
                            'GENERIC WORD LIST');
end;

Stoplist Category

The Stoplist category contains the following Tiles:

Tile	Attributes	Attribute Values
GENERIC STOP LIST	STOP_WORD	word (string), sequence (number)

GENERIC STOP LIST Tile Attribute(s)

The stop_word attribute has two values that must be specified:

the word that ConText does not include in the text index
the sequence for the word

Sequence is a value from 1 to 4095 and is used in a text index to record the stop words that precede and follow an indexed term. ConText records up to eight preceding stop words and eight following stop words for each indexed term. This enables text queries for phrases which contain stop words.

For example, consider the sentence "he is at the top of the class" where at, the, top, and of are stop words. The sequences for each of the stop words are recorded as part of the text index entry for the term class, which allows users to include stopwords in a query (e.g. 'top of the class').

Stoplist Example

The following example creates a preference named mini_stop_list for the GENERIC STOP LIST Tile:

begin
  ctx_ddl.set_attribute     ('STOP_WORD', 'A',   1);
  ctx_ddl.set_attribute     ('STOP_WORD', 'AND', 2);
  ctx_ddl.set_attribute     ('STOP_WORD', 'THE', 3);
  ctx_ddl.create_preference ('MINI_STOP_LIST', 'Small', 'GENERIC STOP LIST' );
end;

Tiles, Tile Attributes, and Attribute Values: Text Loading

The following section lists all of the Tiles which can be used to create text loading preferences for use in sources. The section also lists the attributes and attribute values for each text loading Tile. In addition, a brief description of the Tile attributes and examples are provided.

The text loading Tiles are grouped alphabetically by preference category:

Preference Category	Tiles
Reader Category	DIRECTORY READER
Engine Category	GENERIC LOADER
Translator Category	NULL TRANSLATOR
	USER TRANSLATOR

Reader Category

The Reader category contains the following Tiles:

Tile	Attributes	Attribute Values
DIRECTORY READER	DIRECTORIES	pathname for the directory where text loading files are located

DIRECTORY READER Tile Attribute(s)

The directories attribute specifies the full pathname for the directory that the ConText server with the Loader personality scans when looking for new files to load into a column in a table or view.

The structure of the value for pathname will vary depending on the directory naming conventions used by your operating system.
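A Reader preference could be sketched as follows, assuming that text loading preferences are created with the same ctx_ddl calls as the indexing preferences shown earlier in this chapter. The directory and the preference name INCOMING_READER are hypothetical, and the path uses UNIX syntax.

```sql
begin
  -- hypothetical directory scanned by the ConText server with the Loader personality
  ctx_ddl.set_attribute ('DIRECTORIES', '/private/incoming');
  ctx_ddl.create_preference ('INCOMING_READER', 'Scan the incoming directory', 'DIRECTORY READER');
end;
```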

Engine Category

The Engine (Text Loading) category contains the following Tiles:

Tile	Attributes	Attribute Values
GENERIC LOADER	none	N/A

The GENERIC LOADER Tile does not have any attributes. In general, preferences do not need to be created for the Engine category, since the GENERIC LOADER Tile does not have attributes that can be set by the user.

Translator Category

The Translator category contains the following Tiles:

Tile	Attributes	Attribute Values
NULL TRANSLATOR	SEPARATE	N/A
USER TRANSLATOR	COMMAND	translator executable

NULL TRANSLATOR Tile Attribute(s)

The separate attribute specifies that the load files do not contain the actual text of the documents to be loaded, but, rather, contain pointers to separate files where the text of the documents is stored.

See Also:

For more information about how the separate option works for loading text, see "ctxload Utility" in Chapter 9, "Executables and Utilities".

USER TRANSLATOR Tile Attribute(s)

The command attribute specifies the name of the executable used to translate a load file into the format required by ctxload.

Note:

The specified translator executable must be stored in the appropriate directory in the Oracle home directory.

For example, in a UNIX-based environment, all translator executables must be stored in $ORACLE_HOME/ctx/bin.

In a Windows NT environment, the translator executables must be stored in ORACLE_HOME\BIN.

For more information about the required location of executable files, see the Oracle8 installation documentation for your operating system.
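A Translator preference that names a custom executable might look like the following sketch. It assumes the same ctx_ddl calls used for indexing preferences; the executable name my_translator and the preference name MY_TRANSLATOR are hypothetical. Only the executable name is given, on the assumption that ConText locates it in the directory described in the note above.

```sql
begin
  -- my_translator is a hypothetical executable stored in the required Oracle home directory
  ctx_ddl.set_attribute ('COMMAND', 'my_translator');
  ctx_ddl.create_preference ('MY_TRANSLATOR', 'Convert load files for ctxload', 'USER TRANSLATOR');
end;
```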

Predefined and Default Preferences: Indexing

ConText provides the following predefined indexing preferences, grouped according to preference category:

Preference Category	Predefined Preferences	Default
Data Store Category	DEFAULT_DIRECT_DATASTORE	***
	DEFAULT_OSFILE
	DEFAULT_URL
	MD_BINARY
	MD_TEXT
Filter Category	AUTOB
	DEFAULT_NULL_FILTER	***
	HTML_FILTER
	WW6B
Lexer Category	DEFAULT_LEXER	***
	KOREAN
	THEME_LEXER
	VGRAM_CHINESE_1
	VGRAM_CHINESE_2
	VGRAM_JAPANESE_1
	VGRAM_JAPANESE_2
Engine Category	DEFAULT_INDEX	***
Wordlist Category	KOREAN_WORDLIST
	NO_SOUNDEX	***
	SOUNDEX
	VGRAM_CHINESE_WORDLIST
	VGRAM_JAPANESE_WORDLIST
Stoplist Category	DEFAULT_STOPLIST	***
	NO_STOPLIST

Data Store Category

The following section provides descriptions of the predefined preferences for the Data Store category.

Note:

DEFAULT_DIRECT_DATASTORE is the default preference for the Data Store preference category.

DEFAULT_DIRECT_DATASTORE

The DEFAULT_DIRECT_DATASTORE preference calls the DIRECT Tile which is used to indicate that text is stored directly in the text column of a text table.

DEFAULT_DIRECT_DATASTORE does not use any Tile attributes because the DIRECT Tile does not have attributes.

DEFAULT_OSFILE

The DEFAULT_OSFILE preference calls the OSFILE Tile which is used to indicate that text is stored as files in a file system.

DEFAULT_OSFILE uses the PATH Tile attribute and a hard-coded set of dummy directory paths to indicate the directories in which the text files are located.

The hard-coded paths are /oracle/data, /oracle/data2, and /oracle/data3, delimited by colons in the attribute value.

Note:

The DEFAULT_OSFILE preference requires modification to reflect the actual paths for your text files before the preference can be used in a policy.

DEFAULT_URL

The DEFAULT_URL preference calls the URL Tile which is used to indicate that text is stored as URLs.

DEFAULT_URL uses all of the attribute defaults for the URL Tile:

timeout of 30 seconds
up to 8 HTTP threads handled simultaneously
up to 256 HTML documents can be accessed simultaneously
the maximum length of a URL stored in the text column is 256 bytes
the maximum size of an HTML file that the URL data store will access without error is 2 megabytes
no proxy server

MD_BINARY

The MD_BINARY preference calls the MASTER DETAIL Tile which is used to indicate text is stored in a master detail table.

MD_BINARY uses the BINARY Tile attribute and a value of YES to indicate that the text in the table is stored in binary format.

MD_TEXT

The MD_TEXT preference calls the MASTER DETAIL Tile which is used to indicate text is stored in a master detail table.

MD_TEXT uses the Tile attribute BINARY and a value of NO to indicate that the text in the table is stored as ASCII text.

Filter Category

The following section provides descriptions of the predefined preferences for the Filter category.

Note:

DEFAULT_NULL_FILTER is the default preference for the Filter preference category.

AUTOB

The AUTOB preference calls the BLASTER FILTER Tile which specifies an internal filter used to extract text from formatted documents in a text column.

AUTOB uses the FORMAT Tile attribute and a value of 997 to indicate that ConText uses the autorecognize filter to extract text. It can be used to filter text in a column that contains the following document formats:

Document Format	Version
AmiPro for Windows	1, 2, 3
ASCII	N/A
HTML	1, 2, 3
Lotus 123 for DOS	4, 5
Lotus 123 for Windows	2, 3, 4, 5
Microsoft Word for Windows	2, 6.x
Microsoft Word for DOS	5.0, 5.5
Microsoft Word for MAC	3, 4, 5.x
Word Perfect for Windows	5.x, 6.x
Word Perfect for DOS	5.0, 5.1, 6.0
Xerox XIF for UNIX	5, 6

DEFAULT_NULL_FILTER

The DEFAULT_NULL_FILTER preference calls the FILTER NOP Tile which indicates that the text column in a text table contains plain, unformatted (ASCII) text and does not require filtering for indexing and highlighting.

DEFAULT_NULL_FILTER does not use any Tile attributes because the FILTER NOP Tile does not have attributes.

HTML_FILTER

The HTML_FILTER preference calls the HTML FILTER Tile and can be used to filter documents in a column that contains only HTML-formatted documents.

WW6B

The WW6B preference calls the BLASTER FILTER Tile and specifies that the Microsoft Word for Windows 6 internal filter is used to extract text from Word for Windows 6 documents in a text column.

WW6B uses the format Tile attribute and a value of 11 to indicate ConText uses the Word for Windows 6 filter to extract text. It can be used in a column that contains only Word for Windows 6-formatted documents.

Lexer Category

The following section provides descriptions of the predefined preferences for the Lexer category.

Note:

DEFAULT_LEXER is the default preference for the Lexer preference category.

DEFAULT_LEXER

The predefined DEFAULT_LEXER preference calls the BASIC LEXER Tile, which indicates the lexer settings used to identify word and sentence boundaries for text indexing and text queries.

DEFAULT_LEXER uses the following Tile attributes and values to indicate the lexer settings:

Attribute	Values
punctuations	. ? !
printjoins	NULL (indicates no characters defined as printjoins for the BASIC LEXER; instead, printjoins determined by NLS initialization parameters)
skipjoins	NULL (indicates no characters defined as skipjoins for the BASIC LEXER; instead, skipjoins determined by NLS initialization parameters)
continuation	- \

KOREAN

The KOREAN preference calls the KOREAN LEXER Tile and can be used for parsing Korean text. It has no attributes.

VGRAM_CHINESE_1 and VGRAM_CHINESE_2

The VGRAM_CHINESE preferences call the CHINESE V-GRAM LEXER Tile which indicates the preferences can be used for parsing Chinese text.

The 1 or 2 indicates that the preference uses either method 1 or 2 for identifying tokens in Chinese text (hanzi_indexing attribute).

VGRAM_JAPANESE_1 and VGRAM_JAPANESE_2

The VGRAM_JAPANESE preferences call the JAPANESE V-GRAM LEXER Tile which indicates the preferences can be used for parsing Japanese text.

The 1 or 2 indicates that the preference uses either method 1 or 2 for identifying tokens in Japanese text (kanji_indexing attribute).

THEME_LEXER

The predefined THEME_LEXER preference calls the THEME LEXER Tile, which indicates the preference can be used in a column policy to create theme indexes for a column.

The THEME_LEXER preference does not set any attributes because the THEME LEXER Tile does not have any attributes.

Engine Category

The following section provides descriptions of the predefined preferences for the Engine category.

DEFAULT_INDEX

The DEFAULT_INDEX preference calls the GENERIC ENGINE Tile which is used to specify the amount of memory reserved for indexing.

DEFAULT_INDEX uses the index_memory attribute to specify the amount of memory allocated for indexing: 12582912 bytes (12 MB).

Wordlist Category

The following section provides descriptions of the predefined preferences for the Wordlist category.

Note:

NO_SOUNDEX is the default preference for the Wordlist preference category.

NO_SOUNDEX

The NO_SOUNDEX preference contains the GENERIC WORD LIST Tile which specifies whether Soundex word mappings are generated during text indexing. Soundex can be used in text queries to expand the query to include words that sound similar to the query terms.

NO_SOUNDEX uses the soundex_at_index Tile attribute and a value of 0 to indicate that ConText does not generate Soundex word mappings during text indexing.

SOUNDEX

The SOUNDEX preference contains the GENERIC WORD LIST Tile, which specifies whether Soundex word mappings are generated during text indexing. Soundex can be used in text queries to expand the query to include words that sound similar to the query terms.

SOUNDEX uses the soundex_at_index Tile attribute and a value of 1 to indicate that ConText generates Soundex word mappings during text indexing.
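To illustrate what a Soundex word mapping is, the following Python sketch implements the classic American Soundex coding. This is an assumption made for illustration only: this chapter does not state which Soundex variant ConText uses, and the function name soundex is hypothetical rather than a ConText routine.

```python
# Classic American Soundex digit groups (assumed variant, see lead-in).
_CODES = {}
for _letters, _digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(word):
    """Return the four-character Soundex code for word ('' if no letters)."""
    letters = [c for c in word.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    result = letters[0]                 # the first letter is kept as-is
    prev = _CODES.get(letters[0], "")   # its code still suppresses repeats
    for ch in letters[1:]:
        if ch in "HW":
            continue                    # H and W do not separate equal codes
        code = _CODES.get(ch, "")       # vowels and Y have no code
        if code and code != prev:
            result += code
        prev = code                     # a vowel resets the previous code
    return (result + "000")[:4]         # pad or truncate to four characters


print(soundex("Robert"))   # prints R163
print(soundex("Rupert"))   # prints R163
print(soundex("Smith") == soundex("Smythe"))  # prints True
```

Words that share a code, such as Robert and Rupert above, are the kind of similar-sounding terms that a Soundex query expansion can bring together.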

Stoplist Category

The following section provides descriptions of the predefined preferences for the Stoplist category.

Note:

DEFAULT_STOPLIST is the default preference for the Stoplist preference category.

DEFAULT_STOPLIST

The DEFAULT_STOPLIST preference specifies a list of stop words for the GENERIC STOP LIST Tile.

The preference sets the stop_word attribute once for each of the following stop words:

STOPWORD	SEQ	STOPWORD	SEQ	STOPWORD	SEQ	STOPWORD	SEQ
A	3	COULD	70	MR	18	SUCH	69
ABOUT	34	FOR	8	MRS	20	THAN	43
AFTER	63	FROM	17	MS	21	THAT	9
ALL	62	HAD	51	MZ	19	THE	7
ALSO	50	HAS	29	NO	71	THEIR	47
AN	27	HAVE	32	NOT	61	THERE	67
ANY	76	HE	24	ONLY	72	THEY	37
AND	5	HER	45	OF	1	THIS	35
ARE	28	HIS	44	ON	12	TO	2
AS	14	IF	58	ONE	40	WAS	26
AT	13	IN	4	OR	33	WE	57
BE	23	INC	48	OTHER	54	WERE	52
BECAUSE	66	INTO	75	OUT	59	WHEN	65
BEEN	49	IS	10	OVER	64	WHICH	36
BUT	30	IT	11	S	6	WHO	42
BY	16	ITS	22	SO	73	WILL	31
CAN	68	LAST	56	SAYS	41	WITH	15
CO	60	MORE	38	SHE	25	WOULD	39
CORP	53	MOST	74	SOME	55	UP	46

NO_STOPLIST

The NO_STOPLIST preference contains the GENERIC STOP LIST Tile and specifies that no list of stop words is used during text indexing. All words that ConText encounters are stored in the text index.

NO_STOPLIST sets no stop_word attributes, which indicates that no stop words are used during indexing.
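The difference between DEFAULT_STOPLIST and NO_STOPLIST can be sketched in a few lines of Python. The sketch assumes whitespace tokenization and case-insensitive matching, neither of which is specified in this section, and it uses only a handful of the stop words from the table above. The names SAMPLE_STOP_WORDS and index_tokens are hypothetical.

```python
# A few stop words taken from the DEFAULT_STOPLIST table (not the full list).
SAMPLE_STOP_WORDS = {"A", "AND", "OF", "THE", "TO", "IN", "IS"}


def index_tokens(text, stop_words):
    """Return the tokens that would be kept for indexing.

    Assumption: tokens are split on whitespace and compared in upper case.
    """
    return [tok for tok in text.upper().split() if tok not in stop_words]


sentence = "the history of the database is long"

# With a stop list, common words are dropped before indexing.
with_stoplist = index_tokens(sentence, SAMPLE_STOP_WORDS)
print(with_stoplist)  # prints ['HISTORY', 'DATABASE', 'LONG']

# With an empty stop list (the NO_STOPLIST case), every word is kept.
without_stoplist = index_tokens(sentence, set())
print(len(without_stoplist))  # prints 7
```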

Predefined and Default Preferences: Text Loading

ConText provides the following predefined text loading preferences for the three preference categories for sources:

Preference Category	Predefined Preferences	Default
Reader Category	DEFAULT_READER	***
Engine Category	DEFAULT_LOADER	***
Translator Category	DEFAULT_TRANSLATOR	***

Reader Category

The following section provides descriptions of the predefined preferences for the Reader category.

DEFAULT_READER

The DEFAULT_READER preference uses the DIRECTORY READER Tile, which has a dummy directory set for the Tile.

Note:

Because it is unknown which directory contains the files to be loaded and path names are operating-system specific, this preference is provided as a default only and should not be used when creating a source.

Before creating a source, you should create your own Reader preference that specifies the directory where your files to be loaded are located.

Engine Category

The following section provides descriptions of the predefined preferences for the Text Loading Engine category.

DEFAULT_LOADER

The DEFAULT_LOADER preference uses the GENERIC LOADER Tile, which indicates the preference can be used to load text from files in an operating system directory.

Translator Category

The following section provides descriptions of the predefined preferences for the Translator category.

DEFAULT_TRANSLATOR

The DEFAULT_TRANSLATOR preference uses the NULL TRANSLATOR Tile, which indicates no translation is performed on the files to be loaded, because the files are in the format required by ctxload.

Template Policies

The following section provides a brief description of the template policies provided with ConText.

The template policies are owned by CTXSYS. A template policy can be specified as the source policy for a policy during creation.

ConText provides the following template policies:

DEFAULT_POLICY

The DEFAULT_POLICY policy can be used to create a policy which uses all of the default preferences:

Default Preferences	Characteristics
DEFAULT_DIRECT_DATASTORE	Text stored in database
DEFAULT_NULL_FILTER	No filter (text stored in plain, ASCII format)
DEFAULT_LEXER	Basic lexer (standard punctuation and continuation characters, no printjoin or skipjoin characters)
DEFAULT_INDEX	Indexing memory = 12582912 bytes, default storage/other clauses for ConText index tables and indexes
NO_SOUNDEX	No Soundex word mappings stored during text indexing
DEFAULT_STOPLIST	Stoplist is active, default list of stop words

Note:

DEFAULT_POLICY is the default for source_policy in CREATE_POLICY and CREATE_TEMPLATE_POLICY in the CTX_DDL package.

TEMPLATE_AUTOB

The TEMPLATE_AUTOB policy can be used to create a policy for a text column that contains documents in mixed formats. The autorecognize Blaster filter is used to automatically identify the format of each document in a column and, if the format is supported by ConText, extract the text of the document for indexing.

TEMPLATE_AUTOB uses the AUTOB predefined preference and all the remaining default preferences.

TEMPLATE_DIRECT

The TEMPLATE_DIRECT policy can be used to create a policy for indexing basic text stored in a text column.

It uses all the default preferences.

TEMPLATE_LONGTEXT_STOPLIST_OFF

The TEMPLATE_LONGTEXT_STOPLIST_OFF policy can be used to create a policy that does not use a stopword list during indexing.

It uses the NO_STOPLIST predefined preference and all the remaining default preferences.

TEMPLATE_LONGTEXT_STOPLIST_ON

The TEMPLATE_LONGTEXT_STOPLIST_ON policy can be used to create a policy that uses a stopword list during indexing.

It uses the DEFAULT_STOPLIST predefined preference and all the remaining default preferences.

TEMPLATE_MD

The TEMPLATE_MD policy can be used to create a policy for indexing plain text stored in the detail column in a master-detail table.

It uses the MD_TEXT predefined preference and all the remaining default preferences.

TEMPLATE_MD_BIN

The TEMPLATE_MD_BIN policy can be used to create a policy for indexing binary text stored in the detail column in a master-detail table.

It uses the MD_BINARY predefined preference and all the remaining default preferences.

TEMPLATE_WW6B

The TEMPLATE_WW6B policy can be used to create a policy for indexing text formatted for Microsoft Word for Windows 6.

It uses the WW6B predefined preference and all the remaining default preferences.
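Each template policy above can be read as the set of default preferences with at most one preference swapped out. The Python sketch below models that reading with plain dictionaries. It is an illustration of the descriptions in this section, not ConText code: the names DEFAULT_PREFERENCES, TEMPLATE_OVERRIDES, and resolve_template are hypothetical, and the category assigned to each override follows the category under which this chapter lists the predefined preference.

```python
# Default preference for each indexing category (from DEFAULT_POLICY).
DEFAULT_PREFERENCES = {
    "Data Store": "DEFAULT_DIRECT_DATASTORE",
    "Filter": "DEFAULT_NULL_FILTER",
    "Lexer": "DEFAULT_LEXER",
    "Engine": "DEFAULT_INDEX",
    "Wordlist": "NO_SOUNDEX",
    "Stoplist": "DEFAULT_STOPLIST",
}

# What each template policy changes relative to the defaults.
TEMPLATE_OVERRIDES = {
    "DEFAULT_POLICY": {},
    "TEMPLATE_AUTOB": {"Filter": "AUTOB"},
    "TEMPLATE_DIRECT": {},
    "TEMPLATE_LONGTEXT_STOPLIST_OFF": {"Stoplist": "NO_STOPLIST"},
    "TEMPLATE_LONGTEXT_STOPLIST_ON": {"Stoplist": "DEFAULT_STOPLIST"},
    "TEMPLATE_MD": {"Data Store": "MD_TEXT"},
    "TEMPLATE_MD_BIN": {"Data Store": "MD_BINARY"},
    "TEMPLATE_WW6B": {"Filter": "WW6B"},
}


def resolve_template(name):
    """Return the full category-to-preference mapping for a template policy."""
    return {**DEFAULT_PREFERENCES, **TEMPLATE_OVERRIDES[name]}


print(resolve_template("TEMPLATE_MD")["Data Store"])  # prints MD_TEXT
print(resolve_template("TEMPLATE_MD")["Stoplist"])    # prints DEFAULT_STOPLIST
```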

Supported Formats for Mixed-Format Columns

The following section lists all of the formats that ConText supports for columns that use external filters for processing documents in more than one format.

For each format, the format ID is also listed. This is the value that must be specified when creating a Filter preference using the BLASTER FILTER Tile with the executable attribute.

Note:

To index documents in any of these formats using external filters, the external filter must exist and the executable for the filter must be specified in a Filter preference using the executable attribute.

See Also:

For more information about using format IDs in Filter preferences, see "Creating Filter Preferences" in Chapter 6, "Setting Up and Managing Text".

Document Format	Format ID
AmiPro 1.x - 3.1	19
AmiPro Graphics SDW Samna Draw	62
ASCII	90
AT&T Crystal Writer	46
AutoCAD (DXF, DXB)	53
CEOwrite 3.0	78
Computer Graphics Metafile (CGM)	79
CorelDraw 2.x and 3.x	59
CTOS DEF	75
DBase IV 1.0; DBase III, III +	37
DCA/FFT - Final Form Text	27
DCA/RFT - Revisable Form Text	0
Digital DX	15
Digital WPS-PLUS	47
EBCDIC	89
Enable 1.1, 2.0, 2.15	11
Encapsulated PostScript Preview; Encapsulated PostScript Bitmap	66
First Choice 3.0 Data Base	13
FrameMaker (MIF) 3.0; FrameMaker (MIF) 3.0 Win	42
Framework III, 1.0, 1.1	22
FullWrite Professional 1.0x	31
GIF (Graphical Interchange Format)	51
Harvard Graphics	87
HP Graphics Language (HPGL)	83
HTML Level 1, 2, 3	91
IBM Writing Assistant 1.0	16
IGES	52
Interleaf 5.2; Interleaf 5.2 - 6.0	32
JPEG (Joint Photographic Experts Group)	58
Legacy 1.x, 2.0	41
Lotus 123 4.x; Lotus 123 3.0; Lotus 123 1A, 2.0, 2.1	20
Lotus Freelance	85
Lotus Manuscript 2.0, 2.1	26
Lotus PIC	67
Macintosh Paint	88
Microsoft Windows Paint 2.x	70
Macintosh QuickDraw (PICT)	64
MacWrite 4.5 - 5.0	29
MacWrite II 1.0 - 1.1	30
Mass 11, Version 8.0 - 8.33	36
MastSoft Graphics (MSG)	49
Micrografx Designer (DRW)	60
MS Access 2.0	39
MS Excel 5.0 - 6.0; MS Excel 4.0; MS Excel 3.0; MS Excel 2.1	21
MS Powerpoint for Windows 2, 3, 4	84
MS RTF; MS RTF (ANSI Char Set)	17
MS Word for DOS 6.0; MS Word for DOS 5.0, 5.5; MS Word for DOS 4.0; MS Word for DOS 3.0, 3.1	8
MS Word for Mac 5.0, 5.1; MS Word for Mac 4.0; MS Word for Mac 3.0	28
MS Word for Windows 2.0; MS Word for Windows 1.x	18
MS Word for Windows 6.0; MS Word for Mac 6.0	68
MS Works for Windows 3.0	69
MS Write for Windows 3.x	7
MultiMate 4; MultiMate Advantage II; MultiMate Advantage I; MultiMate 3.3	6
Navy DIF (GSA)	35
OfficePower 7; OfficePower 6	44
OfficeWriter 6.0 - 6.2; OfficeWriter 5.0; OfficeWriter 4.0	9
OS/2 Bitmap; Windows Bitmap (BMP); Windows RLE	63
Paradox 3.5, 4.0	38
PC Paintbrush (PCX)	71
PDF (Adobe Acrobat)	57
PeachText 5000 2.1.2	82
PFS:First Choice 3.0; PFS:First Choice 2.0; PFS:First Choice 1.0; PFS:WRITE Ver C; Professional Write 2.0 - 2.2; Professional Write 1.0	12
Quattro Pro DOS; Quattro Pro Windows	45
Q&A 4.0; Q&A Write 1.x, Q&A 3.0	10
Rapid File 1.0	23
RGIP	61
Samna Word IV & IV + 1.0, 2.0	25
Sun Raster Graphics	65
TIFF (Tagged Image File Format)	50
Uniplex V7 - V8	77
Volkswriter 3, 4	74
Wang PC, Version 3	24
Wang WITA	55
Windows Clipboard	72
Windows ICON	73
Windows Metafile (WMF)	48
WiziDraw	86
WiziWord	56
Word For Word Intermediate Communications format (COM)	34
WordPerfect for Windows 6.1; WordPerfect for Windows 6.0; WordPerfect 6.0	1
WordPerfect 5.1 (Mail Merge)	2
WordPerfect for Windows 5.x; WordPerfect 5.1; WordPerfect 5.0	3
WordPerfect Graphics 1 (WPG)	4
WordPerfect Graphics 2 (WPG)	5
WordPerfect 4.2; WordPerfect 4.1	80
WordPerfect Mac 1.0	81
WordPerfect Mac 3.0; WordPerfect Mac 2.1; WordPerfect Mac 2.0	33
WordStar 5.0, 5.5, 6.0, 7.0	40
WordStar 2000, Rel 3.0	14
WriteNow 3.0	54
Xerox - XIF 5.0, 6.0	43
XyWrite IV; XyWrite III Plus	76
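When preparing Filter preferences for several formats, it can help to keep the format IDs in a small lookup table. The Python sketch below copies a few rows from the table above into a dictionary. The names FORMAT_IDS, format_id, and format_name are hypothetical helpers for illustration and are not part of ConText.

```python
# A few rows copied from the format table above (not the full list).
FORMAT_IDS = {
    "ASCII": 90,
    "EBCDIC": 89,
    "HTML Level 1, 2, 3": 91,
    "PDF (Adobe Acrobat)": 57,
    "MS RTF; MS RTF (ANSI Char Set)": 17,
    "MS Word for Windows 6.0; MS Word for Mac 6.0": 68,
}


def format_id(document_format):
    """Return the format ID for a document format named as in the table."""
    return FORMAT_IDS[document_format]


def format_name(fid):
    """Return the document format for a format ID, or None if not listed here."""
    for name, value in FORMAT_IDS.items():
        if value == fid:
            return name
    return None


print(format_id("PDF (Adobe Acrobat)"))  # prints 57
print(format_name(90))                   # prints ASCII
```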
