Oracle ConText Cartridge Administrator's Guide Release 2.0 A54628_01
This chapter provides reference information for the ConText data dictionary objects provided with ConText.
The topics discussed in this chapter are:
The following section lists all of the Tiles which can be used to create indexing preferences for use in policies. The section also lists the attributes and attribute values for each indexing Tile. In addition, a brief description of the Tile attributes and examples are provided.
The indexing Tiles are grouped alphabetically by preference category:
The Data Store category contains the following Tiles:
The binary attribute specifies whether the text in a master detail table is in plain text format (0) or binary format (1).
Text in plain text uses newline characters at the end of each line to indicate the end of the line. In contrast, binary format does not use newline characters to indicate the end of the line.
The path attribute specifies the location of text files that are stored externally in a file system.
Multiple paths can be specified for the path attribute, with each path separated by a colon (:). File names are stored in the text column in the text table. If the path attribute is not used to specify a path for external files, ConText requires the path to be included in the file names stored in the text column.
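As a minimal sketch of the multiple-path form, the following block sets PATH to two directories separated by a colon. The directory names, the preference name DOC_PATHS, and its description are hypothetical; the calls follow the same pattern as the other preference examples in this chapter.

```sql
begin
  -- Two hypothetical UNIX directories, separated by a colon
  ctx_ddl.set_attribute('PATH', '/private/mydocs:/private/archive');
  -- DOC_PATHS is an illustrative preference name, not a predefined one
  ctx_ddl.create_preference('DOC_PATHS', 'Two search paths for documents', 'OSFILE');
end;
```

With a preference like this, the text column would presumably need to hold only the file names, not the full paths.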
The timeout attribute specifies the length of time, in seconds, that a network operation such as 'connect' or 'read' waits before timing out and returning a timeout error to the application. The valid range for timeout is 0 to 3600 and the default is 30.
Note:
Since timeout applies to each network operation, the total wait may be longer than the time specified for timeout. For example, with the default of 30, a connect followed by a read could wait up to 60 seconds in total.
The maxthreads attribute specifies the maximum number of threads that can be running at the same time. The valid range for maxthreads is 1 to 1024 and the default is 8.
The maxurls attribute specifies the maximum number of rows that the internal buffer can hold for HTML documents (rows) retrieved from the text table. The valid range for maxurls is 1 to 4294967295 and the default is 256.
The urlsize attribute specifies the maximum length, in bytes, that the URL data store supports for URLs stored in the database. If a URL is over the maximum set, an error is returned. The valid range for urlsize is 32 to 65535 and the default is 256.
The maxdocsize attribute specifies the maximum size, in bytes, that the URL data store supports for accessing HTML documents whose URLs are stored in the database. The valid range for maxdocsize is 1 to 4294967295 and the default is 200000 (2 Mb).
The http_proxy attribute specifies the fully-qualified name of the host machine that serves as the proxy (gateway) for the machine on which ConText is installed.
The no_proxy attribute specifies the strings (up to sixteen, separated by commas) which, when encountered in a host name, cause the URL data store to ignore the machine as a proxy machine.
For example, if the string 'us.oracle.com, uk.oracle.com' is entered for no_proxy, any machines that contain either of these domains in their host names are ignored as proxy machines.
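The URL attributes described above could be combined into a single preference along the following lines. This is a sketch only: the preference name WEB_DOCS, the proxy host name, and the chosen values are hypothetical, and passing numeric values as quoted strings is an assumption based on the other examples in this chapter.

```sql
begin
  -- Wait up to 60 seconds per network operation (valid range 0 to 3600)
  ctx_ddl.set_attribute('TIMEOUT', '60');
  -- Allow up to 16 threads at the same time (valid range 1 to 1024)
  ctx_ddl.set_attribute('MAXTHREADS', '16');
  -- Hypothetical fully-qualified name of the proxy (gateway) host
  ctx_ddl.set_attribute('HTTP_PROXY', 'proxy.example.com');
  -- Host names containing these strings are ignored as proxy machines
  ctx_ddl.set_attribute('NO_PROXY', 'us.oracle.com, uk.oracle.com');
  ctx_ddl.create_preference('WEB_DOCS', 'URL store behind a proxy', 'URL');
end;
```

Attributes that are not set here (maxurls, urlsize, maxdocsize) would keep their defaults.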
The following example creates a preference named doc_pref for the OSFILE Tile:
begin
  -- Directory in which the external text files are stored
  ctx_ddl.set_attribute('PATH', '/private/mydocs');
  ctx_ddl.create_preference('DOC_PREF', 'Path for my documents', 'OSFILE');
end;
Note:
This example illustrates usage of OSFILE for documents stored in a UNIX-based environment. The directory path syntax may be different for other environments.
The Filter category contains the following Tiles:
The format attribute specifies the internal filter used for filtering text stored in a text column.
The executable attribute specifies the external filters that are used to filter text stored in a mixed-format text column. It has three values that must be specified:
Note:
format and executable cannot both be set in the same preference.
See Also:
For a list of the format IDs supported by the executable attribute, see "Supported Formats for Mixed-Format Columns" in this chapter.
The code_conversion attribute specifies whether code conversion is enabled for documents which contain Japanese ASCII text with HTML tags.
Code conversion is required for Japanese HTML documents if the documents use more than one of the three character sets supported for HTML text in Japanese. If code conversion is enabled, all Japanese HTML documents are converted to a single, common character set before indexing.
The default for code_conversion is 0 (disabled).
The command attribute specifies the executable for the single external filter used to filter all text stored in a column. If more than one document format is stored in the column, the external filter must recognize and handle all such formats.
The following example creates a preference named word6 for the BLASTER FILTER Tile:
begin
  -- Format 11 selects the Word for Windows 6 internal filter
  ctx_ddl.set_attribute('FORMAT', '11');
  ctx_ddl.create_preference('WORD6', 'Microsoft Word docs', 'BLASTER FILTER');
end;
The Lexer category contains the following Tiles:
punctuations specifies the characters that indicate the end of a sentence.
printjoins specifies the characters that join words together when they appear between the words with no blank spaces. Words that contain printjoin characters are stored in the text index exactly as they appear in the text.
For example, if a hyphen '-' is defined as a printjoin character, the word pseudo-intellectual is stored in the text index as pseudo-intellectual.
skipjoins specifies the characters that join words together, but the characters are not stored in the text index.
For example, if a hyphen '-' is defined as a skipjoin character, the word pseudo-intellectual is stored in the text index as pseudointellectual.
Note:
printjoins and skipjoins are mutually exclusive. The same characters cannot be specified for both attributes.
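The difference between the two attributes can be sketched in a single preference that uses a different character for each. The preference name DOC_JOINS and the choice of characters are hypothetical, and the sketch assumes that both attributes belong to the BASIC LEXER Tile.

```sql
begin
  -- Hyphens join words but are dropped from the stored token
  ctx_ddl.set_attribute('SKIPJOINS', '-');
  -- Underscores join words and are kept in the stored token
  ctx_ddl.set_attribute('PRINTJOINS', '_');
  ctx_ddl.create_preference('DOC_JOINS', 'Skip hyphen, print underscore', 'BASIC LEXER');
end;
```

Under these settings, pseudo-intellectual would be stored as pseudointellectual, while a word such as file_name would be stored unchanged.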
numjoin specifies the characters that, when they appear in a string of digits, cause ConText to index the string of digits as a single unit or word.
For example, a period '.' may be defined as a numjoin character because it often serves as a decimal point when it appears in a string of digits.
numgroup specifies the characters that, when they appear in a string of digits, indicate that the digits are groupings within a larger single unit.
For example, a comma ',' may be defined as a numgroup character because it often indicates a grouping of thousands when it appears in a string of digits.
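A preference that applies both number-related attributes might look like the following. The preference name DOC_NUMBERS is hypothetical, and the sketch assumes that numjoin and numgroup are set on the BASIC LEXER Tile in the same way as printjoins.

```sql
begin
  -- Period acts as the decimal point inside a string of digits
  ctx_ddl.set_attribute('NUMJOIN', '.');
  -- Comma marks groupings (such as thousands) inside a string of digits
  ctx_ddl.set_attribute('NUMGROUP', ',');
  ctx_ddl.create_preference('DOC_NUMBERS', 'Decimal point and thousands comma', 'BASIC LEXER');
end;
```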
continuation specifies the characters that indicate a word continues on the next line and should be indexed as a single token. The most common continuation characters are a hyphen '-' and a backslash '\'.
base_letter specifies whether characters that have diacritical marks (umlauts, cedillas, acute accents, etc.) are converted to their base form for text indexing and text queries.
The hanzi_indexing attribute specifies the length of the character groups used for pattern matching while indexing.
A value of 1 for hanzi_indexing indicates that the Chinese lexer examines each character individually to determine token boundaries.
A value of 2 for hanzi_indexing indicates that the lexer examines characters in pairs to determine token boundaries.
The default is 2.
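As an illustration, a preference that overrides the default and examines Chinese characters one at a time could be sketched as follows. The preference name CHINESE_SINGLE is hypothetical, and the Tile name is taken from the description of the predefined Chinese lexer preferences later in this chapter.

```sql
begin
  -- 1 = examine each character individually (the default is 2)
  ctx_ddl.set_attribute('HANZI_INDEXING', '1');
  ctx_ddl.create_preference('CHINESE_SINGLE', 'Examine one character at a time', 'CHINESE V-GRAM LEXER');
end;
```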
The kanji_indexing attribute specifies the length of the character groups used for pattern matching while indexing.
A value of 1 for kanji_indexing indicates that the Japanese lexer examines each character individually to determine token boundaries.
A value of 2 for kanji_indexing indicates that the lexer examines pairs of characters to determine token boundaries.
The default is 2.
The following example creates a preference named doc_link for the BASIC LEXER Tile:
begin
  -- Hyphen, asterisk, and slash are kept inside indexed words
  ctx_ddl.set_attribute('PRINTJOINS', '-*/');
  ctx_ddl.create_preference('DOC_LINK', 'Dash, star, slash', 'BASIC LEXER');
end;
The Engine category contains the following Tiles:
index_memory specifies the amount of memory, in bytes, allocated for indexing.
optimize_default specifies the type of optimization used when CTX_DDL.OPTIMIZE_INDEX is called without an optimization type. If no value is specified for optimize_default, the default is DEFRAGMENT_TO_TWO_TABLE.
i1t_tablespace, ktb_tablespace, and lst_tablespace specify the tablespaces used for the ConText index tables created during indexing.
sqr_tablespace specifies the tablespace used for the stored query expression result (SQR) table that is created, but not populated, during indexing. The SQR table for a policy stores the results of stored query expressions for the policy.
i1i_tablespace, kid_tablespace, kik_tablespace, and lix_tablespace specify the tablespaces used for the Oracle indexes generated for each ConText index table during indexing.
sri_tablespace specifies the tablespace used for the Oracle index generated for each SQR table.
i1t_storage, ktb_storage, and lst_storage specify the STORAGE clauses used to create the ConText index tables during ConText indexing.
sqr_storage specifies the STORAGE clause used to create the stored query expression result (SQR) table during ConText indexing.
i1i_storage, kid_storage, kik_storage, and lix_storage specify the STORAGE clauses used to create the Oracle indexes for each ConText index table.
sri_storage specifies the STORAGE clause used to create the Oracle index for each SQR table.
i1t_other_parms, ktb_other_parms, and lst_other_parms specify any additional parameters used to create the ConText index tables during ConText indexing.
sqr_other_parms specifies any additional parameters used to create the stored query expression result (SQR) table during ConText indexing.
i1i_other_parms, kid_other_parms, kik_other_parms, and lix_other_parms specify any additional parameters used to create the Oracle indexes for each ConText index table.
sri_other_parms specifies any additional parameters used to create the Oracle index for each SQR table.
sqe/sei_tablespace, sqe/sei_storage, and sqe/sei_other_parms are not used by ConText because SQE tables and their accompanying Oracle indexes are not used for storing SQE definitions (all SQE definitions are stored in a system table owned by CTXSYS). As a result, values are not required for these attributes.
See Also:
For descriptions of the tables and indexes that constitute a ConText index, see Appendix C, "ConText Index Tables and Indexes". For more information about the storage clauses and other parameters that can be specified for a database table/index, see the CREATE TABLE and CREATE INDEX commands in Oracle8 Server SQL Reference. For more information about the parallel query option in Oracle8, see Oracle8 Server Tuning. For more information about SQEs, see Oracle8 ConText Cartridge Application Developer's Guide.
The following example creates a preference named doc_engine for the GENERIC ENGINE Tile:
begin
  -- Memory, in bytes, allocated for indexing
  ctx_ddl.set_attribute('INDEX_MEMORY', 30000000);
  -- Tablespace, STORAGE clause, and extra parameters for the I1T table
  ctx_ddl.set_attribute('I1T_TABLESPACE', 'DOCUMENTS');
  ctx_ddl.set_attribute('I1T_STORAGE', ' initial 10M next 2M maxextents 10');
  ctx_ddl.set_attribute('I1T_OTHER_PARMS', ' pctfree 20');
  -- Extra parameters for the Oracle index on the I1T table
  ctx_ddl.set_attribute('I1I_OTHER_PARMS', ' parallel 2');
  ctx_ddl.create_preference('DOC_ENGINE', 'Test case', 'GENERIC ENGINE');
end;
The Wordlist category contains the following Tiles:
The stclause attribute specifies the STORAGE clause used to create the Soundex wordlist table during ConText indexing. The Soundex wordlist table is only created if Soundex is enabled through the soundex_at_index attribute.
The instclause attribute specifies the STORAGE clause used to create the Oracle index for the Soundex wordlist table.
The soundex_at_index attribute specifies whether ConText generates Soundex word mappings and stores them in the Soundex wordlist table during text indexing. If Soundex word mappings are not generated and stored in the wordlist table during indexing, queries that use Soundex are not expanded.
The stemmer attribute specifies the stemmer used for word stemming in text queries. For all the supported languages, the stemmers return standard inflected forms of a word, such as the plural form (e.g. department --> departments).
For English, an additional stemmer is provided which returns standard inflected forms and derived forms (e.g. department --> departments, departmentalize).
The default for stemmer is 1 (inflectional English).
The fuzzy_match attribute specifies which fuzzy matching routines are used for the column. Fuzzy matching is currently supported for English, Japanese, and, to a lesser extent, the Western European languages.
The default for fuzzy_match is 1.
Note:
The fuzzy_match attribute values for Chinese and Korean are dummy attribute values that prevent the English and Japanese fuzzy matching routines from being used on Chinese and Korean text.
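A Wordlist preference that states the stemmer and fuzzy matching settings explicitly, rather than relying on the defaults, might be sketched as follows. The preference name ENGLISH_WORDLIST is hypothetical, and only the documented default value of 1 is used for each attribute because the other valid values are not listed here.

```sql
begin
  -- 1 = inflectional English stemmer (the documented default)
  ctx_ddl.set_attribute('STEMMER', '1');
  -- 1 = the documented default fuzzy matching routine
  ctx_ddl.set_attribute('FUZZY_MATCH', '1');
  ctx_ddl.create_preference('ENGLISH_WORDLIST', 'Default stemmer and fuzzy matching', 'GENERIC WORDLIST');
end;
```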
See Also:
For more information about the expansion methods supported by ConText, see "WordList Category" in Chapter 5, "Understanding the ConText Data Dictionary". For more information about expansion methods in queries, see Oracle8 ConText Cartridge Application Developer's Guide.
The following example creates a preference named soundex_yes for the GENERIC WORDLIST Tile:
begin
  -- 1 = generate Soundex word mappings during text indexing
  ctx_ddl.set_attribute('SOUNDEX_AT_INDEX', '1');
  ctx_ddl.create_preference('SOUNDEX_YES', 'Will build the soundex mapping during indexing', 'GENERIC WORDLIST');
end;
The Stoplist category contains the following Tiles:
Tile | Attributes | Attribute Values |
---|---|---|
GENERIC STOP LIST | STOP_WORD | word (string), sequence (number) |
The stop_word attribute has two values that must be specified:
Sequence is a value from 1 to 4095 and is used in a text index to record the stop words that precede and follow an indexed term. ConText records up to eight preceding stop words and eight following stop words for each indexed term. This enables text queries for phrases which contain stop words.
For example, consider the sentence "he is at the top of the class" where at, the, top, and of are stop words. The sequences for each of the stop words are recorded as part of the text index entry for the term class, which allows users to include stopwords in a query (e.g. 'top of the class').
The following example creates a preference named mini_stop_list for the GENERIC STOP LIST Tile:
begin
  -- Each call adds one stop word with its sequence number
  ctx_ddl.set_attribute('STOP_WORD', 'A', 1);
  ctx_ddl.set_attribute('STOP_WORD', 'AND', 2);
  ctx_ddl.set_attribute('STOP_WORD', 'THE', 3);
  ctx_ddl.create_preference('MINI_STOP_LIST', 'Small', 'GENERIC STOP LIST');
end;
The following section lists all of the Tiles which can be used to create text loading preferences for use in sources. The section also lists the attributes and attribute values for each text loading Tile. In addition, a brief description of the Tile attributes and examples are provided.
The text loading Tiles are grouped alphabetically by preference category:
Preference Category | Tiles |
---|---|
Reader | DIRECTORY READER |
Engine (Text Loading) | GENERIC LOADER |
Translator | NULL TRANSLATOR, USER TRANSLATOR |
The Reader category contains the following Tiles:
Tile | Attributes | Attribute Values |
---|---|---|
DIRECTORY READER | DIRECTORIES | pathname for the directory where text loading files are located |
The directories attribute specifies the full pathname for the directory that the ConText server with the Loader personality scans when looking for new files to load into a column in a table or view.
The structure of the value for pathname will vary depending on the directory naming conventions used by your operating system.
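A Reader preference could be sketched as shown below. This assumes that text loading preferences are created with the same ctx_ddl calls as the indexing preferences shown earlier in this chapter, which is not stated here; the directory and the preference name DOC_READER are hypothetical.

```sql
begin
  -- Hypothetical UNIX directory scanned for new files to load
  ctx_ddl.set_attribute('DIRECTORIES', '/private/loadfiles');
  ctx_ddl.create_preference('DOC_READER', 'Directory scanned for load files', 'DIRECTORY READER');
end;
```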
The Engine (Text Loading) category contains the following Tiles:
Tile | Attributes | Attribute Values |
---|---|---|
GENERIC LOADER | ** none ** | N/A |
The GENERIC LOADER Tile does not have any attributes. In general, preferences do not need to be created for the Engine category, since the GENERIC LOADER Tile does not have attributes that can be set by the user.
The Translator category contains the following Tiles:
Tile | Attributes | Attribute Values |
---|---|---|
NULL TRANSLATOR | SEPARATE | N/A |
USER TRANSLATOR | COMMAND | translator executable |
The separate attribute specifies that the load files do not contain the actual text of the documents to be loaded, but, rather, contain pointers to separate files where the text of the documents is stored.
See Also:
For more information about how the separate option works for loading text, see "ctxload Utility" in Chapter 9, "Executables and Utilities".
The command attribute specifies the name of the executable used to translate a load file into the format required by ctxload.
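A Translator preference that names a custom executable might look like the following sketch. It assumes that Translator preferences are created with the same ctx_ddl calls as indexing preferences; the executable name my_translator and the preference name DOC_TRANSLATOR are hypothetical.

```sql
begin
  -- Hypothetical executable that converts load files to the ctxload format
  ctx_ddl.set_attribute('COMMAND', 'my_translator');
  ctx_ddl.create_preference('DOC_TRANSLATOR', 'Custom load file translator', 'USER TRANSLATOR');
end;
```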
ConText provides the following predefined indexing preferences, grouped according to preference category:
The following section provides descriptions of the predefined preferences for the Data Store category.
Note:
DEFAULT_DIRECT_DATASTORE is the default preference for the Data Store preference category.
The DEFAULT_DIRECT_DATASTORE preference calls the DIRECT Tile which is used to indicate that text is stored directly in the text column of a text table.
DEFAULT_DIRECT_DATASTORE does not use any Tile attributes because the DIRECT Tile does not have attributes.
The DEFAULT_OSFILE preference calls the OSFILE Tile which is used to indicate that text is stored as files in a file system.
DEFAULT_OSFILE uses the PATH Tile attribute and a hard-coded set of dummy directory paths to indicate the directories in which the text files are located.
The hard-coded paths, delimited by colons, are: /oracle/data, /oracle/data2, /oracle/data3.
Note:
The DEFAULT_OSFILE preference requires modification to reflect the actual paths for your text files before the preference can be used in a policy.
The DEFAULT_URL preference calls the URL Tile which is used to indicate that text is stored as URLs.
DEFAULT_URL uses all of the attribute defaults for the URL Tile:
The MD_BINARY preference calls the MASTER DETAIL Tile which is used to indicate text is stored in a master detail table.
MD_BINARY uses the BINARY Tile attribute and a value of YES to indicate that the text in the table is stored in binary format:
The MD_TEXT preference calls the MASTER DETAIL Tile which is used to indicate text is stored in a master detail table.
MD_TEXT uses the Tile attribute BINARY and a value of NO to indicate that the text in the table is stored as ASCII text.
The following section provides descriptions of the predefined preferences for the Filter category.
Note:
DEFAULT_NULL_FILTER is the default preference for the Filter preference category.
The AUTOB preference calls the BLASTER FILTER Tile which specifies an internal filter used to extract text from formatted documents in a text column.
AUTOB uses the FORMAT Tile attribute and a value of 997 to indicate that ConText uses the autorecognize filter to extract text. It can be used to filter text in a column that contains the following document formats:
The DEFAULT_NULL_FILTER preference calls the FILTER NOP Tile which indicates that the text column in a text table contains plain, unformatted (ASCII) text and does not require filtering for indexing and highlighting.
DEFAULT_NULL_FILTER does not use any Tile attributes because the FILTER NOP Tile does not have attributes.
The HTML_FILTER preference calls the HTML FILTER Tile and can be used to filter documents in a column that contains only HTML-formatted documents.
The WW6B preference calls the BLASTER FILTER Tile and specifies that the Microsoft Word for Windows 6 internal filter is used to extract text from Word for Windows 6 documents in a text column.
WW6B uses the format Tile attribute and a value of 11 to indicate ConText uses the Word for Windows 6 filter to extract text. It can be used in a column that contains only Word for Windows 6-formatted documents.
The following section provides descriptions of the predefined preferences for the Lexer category.
Note:
DEFAULT_LEXER is the default preference for the Lexer preference category.
The predefined DEFAULT_LEXER preference calls the BASIC LEXER Tile, which indicates the lexer settings used to identify word and sentence boundaries for text indexing and text queries.
DEFAULT_LEXER uses the following Tile attributes and values to indicate the lexer settings:
The KOREAN preference calls the KOREAN LEXER Tile and can be used for parsing Korean text. It has no attributes.
The VGRAM_CHINESE preferences call the CHINESE V-GRAM LEXER Tile which indicates the preferences can be used for parsing Chinese text.
The 1 or 2 indicates that the preference uses either method 1 or 2 for identifying tokens in Chinese text (hanzi_indexing attribute).
The VGRAM_JAPANESE preferences call the JAPANESE V-GRAM LEXER Tile which indicates the preferences can be used for parsing Japanese text.
The 1 or 2 indicates that the preference uses either method 1 or 2 for identifying tokens in Japanese text (kanji_indexing attribute).
The predefined THEME_LEXER preference calls the THEME LEXER Tile, which indicates the preference can be used in a column policy to create theme indexes for a column.
The THEME_LEXER preference does not set any attributes because the THEME LEXER Tile does not have any attributes.
The following section provides descriptions of the predefined preferences for the Engine category.
The DEFAULT_INDEX preference calls the GENERIC ENGINE Tile which is used to specify the amount of memory reserved for indexing.
DEFAULT_INDEX uses the index_memory attribute to specify the amount of memory allocated for indexing: 12582912 bytes (12 MB).
The following section provides descriptions of the predefined preferences for the Wordlist category.
Note:
NO_SOUNDEX is the default preference for the Wordlist preference category.
The NO_SOUNDEX preference contains the GENERIC WORDLIST Tile which specifies whether Soundex word mappings are generated during text indexing. Soundex can be used in text queries to expand the query to include words that sound similar to the query terms.
NO_SOUNDEX uses the soundex_at_index Tile attribute and a value of 0 to indicate that ConText does not generate Soundex word mappings during text indexing.
The SOUNDEX preference contains the GENERIC WORDLIST Tile which specifies whether Soundex word mappings are generated during text indexing. Soundex can be used in text queries to expand the query to include words that sound similar to the query terms.
SOUNDEX uses the soundex_at_index Tile attribute and a value of 1 to indicate that ConText generates Soundex word mappings during text indexing.
The following section provides descriptions of the predefined preferences for the Stoplist category.
Note:
DEFAULT_STOPLIST is the default preference for the Stoplist preference category.
The DEFAULT_STOPLIST preference specifies a list of stop words for the GENERIC STOP LIST Tile.
The preference calls the stop_word attribute for each of the following stop words:
The NO_STOPLIST preference contains the GENERIC STOP LIST Tile and specifies that no list of stop words is used during text indexing. All words that ConText encounters are stored in the text index.
NO_STOPLIST contains no stop_word attributes, which indicates that no stop words are used during indexing.
ConText provides the following predefined text loading preferences for the three preference categories for sources:
Preference Category | Predefined Preferences | Default |
---|---|---|
Reader | DEFAULT_READER | *** |
Engine (Text Loading) | DEFAULT_LOADER | *** |
Translator | DEFAULT_TRANSLATOR | *** |
The following section provides descriptions of the predefined preferences for the Reader category.
The DEFAULT_READER preference uses the DIRECTORY READER Tile, which has a dummy directory set for the Tile.
The following section provides descriptions of the predefined preferences for the Text Loading Engine category.
The DEFAULT_LOADER preference uses the GENERIC LOADER Tile, which indicates the preference can be used to load text from files in an operating system directory.
The following section provides descriptions of the predefined preferences for the Translator category.
The DEFAULT_TRANSLATOR preference uses the NULL TRANSLATOR Tile, which indicates no translation is performed on the files to be loaded, because the files are in the format required by ctxload.
The following section provides a brief description of the template policies provided with ConText.
The template policies are owned by CTXSYS. A template policy can be specified as the source policy for a policy during creation.
ConText provides the following template policies:
The DEFAULT_POLICY policy can be used to create a policy which uses all of the default preferences:
Note:
DEFAULT_POLICY is the default for source_policy in CREATE_POLICY and CREATE_TEMPLATE_POLICY in the CTX_DDL package.
The TEMPLATE_AUTOB policy can be used to create a policy for a text column that contains documents in mixed formats. The autorecognize Blaster filter is used to automatically identify the format of each document in a column and, if the format is supported by ConText, extract the text of the document for indexing.
TEMPLATE_AUTOB uses the AUTOB predefined preference and all the remaining default preferences.
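A policy based on a template might be created along these lines. Only the source_policy parameter is named in this chapter; the policy_name and colspec parameter names are assumptions and should be checked against the CTX_DDL.CREATE_POLICY specification. The policy name MIXED_DOCS and the column DOCS.TEXT are hypothetical.

```sql
begin
  ctx_ddl.create_policy(
    policy_name   => 'MIXED_DOCS',             -- hypothetical policy name (assumed parameter)
    colspec       => 'DOCS.TEXT',              -- hypothetical table.column (assumed parameter)
    source_policy => 'CTXSYS.TEMPLATE_AUTOB'   -- template policy owned by CTXSYS
  );
end;
```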
The TEMPLATE_DIRECT policy can be used to create a policy for indexing basic text stored in a text column.
It uses all the default preferences.
The TEMPLATE_LONGTEXT_STOPLIST_OFF policy can be used to create a policy that does not use a stopword list during indexing.
It uses the NO_STOPLIST predefined preference and all the remaining default preferences.
The TEMPLATE_LONGTEXT_STOPLIST_ON policy can be used to create a policy that uses a stopword list during indexing.
It uses the DEFAULT_STOPLIST predefined preference and all the remaining default preferences.
The TEMPLATE_MD policy can be used to create a policy for indexing plain text stored in the detail column in a master-detail table.
It uses the MD_TEXT predefined preference and all the remaining default preferences.
The TEMPLATE_MD_BIN policy can be used to create a policy for indexing binary text stored in the detail column in a master-detail table.
It uses the MD_BINARY predefined preference and all the remaining default preferences.
The TEMPLATE_WW6B policy can be used to create a policy for indexing text formatted for Microsoft Word for Windows 6.
It uses the WW6B predefined preference and all the remaining default preferences.
The following section lists all of the formats that ConText supports for columns that use external filters for processing documents in more than one format.
For each format, the format ID is also listed. This is the value that must be specified when creating a Filter preference using the BLASTER FILTER Tile with the executable attribute.
See Also:
For more information about using format IDs in Filter preferences, see "Creating Filter Preferences" in Chapter 6, "Setting Up and Managing Text".