2 Text Concepts and Definitions

This chapter explains the fundamental concepts that underlie ConText text and theme processing. The following topics are covered in this chapter:

Documents
Text Storage
Text Retrieval
Query Methods
Query Expressions
Stored Query Expressions
Hitlists
Scoring
Result Tables

Documents

In this manual, the terms documents and text are used interchangeably. However, text is a more general term referring to any collection of unstructured data stored in a database column or in an external system file.

The term document, however, has two specific and distinct meanings:

In the traditional sense, a document is any collection of text taken together as a uniquely identifiable entity, such as a book, newspaper, magazine, specification, letter, etc. In this sense also, a document is any single external system file.
In the ConText environment, a document refers to the specific text stored in one column of one row of the database (i.e., one cell of the database) or a pointer within a cell that identifies an external system file.
If the document is stored in a Master/Detail table, the rows of the detail represent logical subsets of the document such as chapters of a book or paragraphs of an article. In this case, each row of the master table identifies one document.

In this manual, the word document refers only to the second definition above.

Text Storage

ConText supports two methods of text storage:

internal
external

See Also:

For more information about text storage in ConText, see Oracle8 ConText Cartridge Administrator's Guide.

Internal Storage

Documents stored inside the database reside in a text column. A text column can be any standard column that stores unstructured textual data within an Oracle database table.

Documents in a text column can consist of plain text (i.e., ASCII) or formatted text (e.g., Microsoft Word, WordPerfect). In addition, each document in a text column can be in a different format.

External Storage

Besides storing text in an Oracle database, ConText can process text that is stored in operating system files. ConText considers this an indirect data store, because the text column for the table contains a pointer to the external file rather than the actual text.

The pointer can be:

a file name and path, which the operating system uses to access files stored locally.
a uniform resource locator (URL), which is used to access HTML files stored either locally or on the World Wide Web.

Querying, retrieval, and linguistic processing for external files are identical to the processing for documents stored internally. However, because external documents have no direct link back to the column in the database, a change made to a document is not recorded automatically in the table.

Text Retrieval

The objective of a query is to identify documents that are most relevant to the user's needs by searching for text in the document collection and then retrieving those documents for the user.

This section discusses:

search options
text queries
theme queries

Search Options

There are several search options available for querying text, including:

exact word or phrase
logical combinations of words and phrases
associations:
- the stem of a word or phrase
- fuzzy match of a word or phrase that allows for misspellings
- words that sound similar to each other
- words specified as LIKE expressions
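
As an illustration of the last option, the following Python sketch expands a LIKE expression against a list of words. It assumes the standard SQL wildcards (% for any run of characters, _ for a single character); the function names and the sample vocabulary are hypothetical, and the sketch does not show how ConText itself evaluates LIKE expressions.

```python
import re
from typing import List


def like_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate a SQL LIKE pattern into a compiled regular expression.

    '%' matches any run of characters and '_' matches exactly one character;
    every other character is matched literally.
    """
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")
        elif ch == "_":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.IGNORECASE)


def expand_like(pattern: str, vocabulary: List[str]) -> List[str]:
    """Return the vocabulary words that match the LIKE pattern, in input order."""
    regex = like_to_regex(pattern)
    return [word for word in vocabulary if regex.fullmatch(word)]


# Hypothetical vocabulary, standing in for the words found in indexed documents.
words = ["compute", "computer", "computing", "commute", "cat", "cot"]
print(expand_like("comput%", words))
print(expand_like("c_t", words))
```

The pattern is matched against the whole word, so "comput%" selects words that begin with "comput" and "c_t" selects three-letter words only.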

Text Queries

A text query is a means for encoding search criteria so that the text can be searched efficiently and relevant documents retrieved. Before you can execute a query on a text column, you must index the column.

See Also:

For more information about creating text indexes for columns, see Oracle8 ConText Cartridge Administrator's Guide.

To retrieve relevant documents, a text query must accomplish three tasks:

Identify documents in the text table that meet the conditions of the text query.
Calculate a score to determine the relevance of each document that meets the search criteria.
Return the rows of the text table that contain the relevant documents for display or other use in the application.

The first two tasks produce a list of documents that meet the search criteria with the corresponding score for each document. This list is called the hitlist. The third task returns to the user selected rows and columns of the text table for each document in the hitlist.
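
The three tasks can be pictured with a small Python sketch. This is a toy model and not ConText code: the table contents, the function names, and the use of a plain occurrence count as the score are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical text table: textkey -> row. Names and contents are illustrative only.
TEXT_TABLE: Dict[int, Dict[str, str]] = {
    1: {"title": "Pets", "text": "The cat sat with another cat."},
    2: {"title": "Weather", "text": "Rain is expected on Monday."},
    3: {"title": "Vets", "text": "Bring your cat for a checkup."},
}


def count_term(text: str, term: str) -> int:
    """Count whole-word, case-insensitive occurrences of term in text."""
    words = [w.strip(".,;:!?").lower() for w in text.split()]
    return words.count(term.lower())


def build_hitlist(table: Dict[int, Dict[str, str]], term: str) -> List[Tuple[int, int]]:
    """Tasks 1 and 2: find the matching documents and score each one."""
    hitlist = []
    for textkey, row in table.items():
        occurrences = count_term(row["text"], term)
        if occurrences > 0:                          # task 1: document meets the criteria
            hitlist.append((textkey, occurrences))   # task 2: attach a score
    # Highest score first; ties are broken by textkey for a stable order.
    return sorted(hitlist, key=lambda hit: (-hit[1], hit[0]))


def fetch_rows(table, hitlist, columns):
    """Task 3: return the selected columns of each row named in the hitlist."""
    return [
        {"textkey": key, "score": score, **{c: table[key][c] for c in columns}}
        for key, score in hitlist
    ]


hitlist = build_hitlist(TEXT_TABLE, "cat")
print(hitlist)
print(fetch_rows(TEXT_TABLE, hitlist, ["title"]))
```

The hitlist holds only textkeys and scores; the document columns are looked up separately in the third task.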

The three tasks required to retrieve documents can be accomplished using a two-step query, a one-step query, or an in-memory query. All three methods produce exactly the same results. You choose a method depending on the needs of the application.

In addition, ConText allows you to return the number of hits for a query in place of the actual hitlist. This can be useful for queries that produce very long hitlists.

Theme Queries

In addition to querying English-language documents by words or phrases (text query), you can query these documents by theme, or by their main concepts.

Theme queries work similarly to text queries in that you must create an index (in this case, a theme index) for the documents before you can query them. Theme queries differ from text queries in that you need not provide the word patterns for the search. ConText interprets your query conceptually according to its view of the world and returns an appropriate document hitlist based on theme, along with a measure of how relevant each document is to the query.

You can use the standard query methods to perform theme queries, namely one-step, two-step, and in-memory. In a theme query, you can use most of the operators you use in regular text queries.

See Also:

For more information about theme queries, see "Using Theme Queries" in Chapter 5.

For more information about creating text indexes for columns, see Oracle8 ConText Cartridge Administrator's Guide.

Query Methods

ConText supports three different methods for performing queries:

two-step
one-step
in-memory

In addition, ConText provides a method for counting query hits without performing an actual query.

Two-step Queries

Two-step queries use a PL/SQL procedure in the first step to create a hitlist and store the results in a specified hitlist result table.

The second step uses a SELECT statement to select the results from the result table. In addition, the hitlist table can be joined with the original table to return more detailed document information. In the two-step method, the physical hitlist table is available to the application program.
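
The shape of a two-step query can be sketched in Python using the standard sqlite3 module as a stand-in for the database. Everything here is an assumption made for illustration: the table and column names are hypothetical, the fill_hitlist function only plays the role of the PL/SQL procedure, and the score is a plain occurrence count.

```python
import sqlite3

# In-memory SQLite database standing in for the text table and the hitlist result table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE docs (textkey INTEGER PRIMARY KEY, title TEXT, body TEXT);
    CREATE TABLE hitlist (textkey INTEGER, score INTEGER);
    INSERT INTO docs VALUES (1, 'Pets', 'cat cat dog');
    INSERT INTO docs VALUES (2, 'Weather', 'rain sun');
    INSERT INTO docs VALUES (3, 'Vets', 'dog cat');
""")


def fill_hitlist(conn, term):
    """Step 1: store (textkey, score) for each matching document in the hitlist table."""
    conn.execute("DELETE FROM hitlist")
    rows = conn.execute("SELECT textkey, body FROM docs").fetchall()
    for textkey, body in rows:
        score = body.split().count(term)
        if score > 0:
            conn.execute("INSERT INTO hitlist VALUES (?, ?)", (textkey, score))


def select_results(conn):
    """Step 2: join the hitlist table with the text table to add document details."""
    return conn.execute(
        "SELECT d.textkey, d.title, h.score "
        "FROM hitlist h JOIN docs d ON d.textkey = h.textkey "
        "ORDER BY h.score DESC, d.textkey"
    ).fetchall()


fill_hitlist(conn, "cat")
print(select_results(conn))
```

Because the hitlist is an ordinary table in this sketch, the application can select from it again, join it with other tables, or keep it for later use after the query has run.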

See Also:

For more information about using two-step queries, see "Using Two-Step Queries" in Chapter 3.

One-step Queries

In a one-step query, you create a single SQL statement that uses the ConText query functions to search for relevant documents and return a record set of selected rows and columns of the text table directly to the user.

The hitlist is processed by ConText using internal result tables. As a result, you do not have to create result tables before running a one-step query; however, the internal result tables are not available to the application program.

See Also:

For more information about using one-step queries, see "Using One-Step Queries" in Chapter 3.

In-memory Queries

In-memory queries use a buffer and a CONTAINS cursor to the buffer to return query results, rather than the result tables used in two-step and one-step queries. As a result, in-memory queries are generally faster than two-step and one-step queries for shorter hitlists.

In an in-memory query, you open a cursor to the query buffer and run a query. ConText writes the results of the query to the buffer. You fetch the results, then close the cursor.

Results can be returned in order of their textkeys or sorted by score.
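
The open, fetch, and close sequence can be sketched as a small Python class. This is a toy model under stated assumptions: the class and method names are hypothetical, the buffer is an ordinary list, and the score is a plain occurrence count. It is not the ConText API.

```python
from typing import Dict, List, Optional, Tuple


class InMemoryQuery:
    """Toy stand-in for an in-memory query: open, fetch hit by hit, close."""

    def __init__(self, docs: Dict[int, str]):
        self.docs = docs
        self._buffer: Optional[List[Tuple[int, int]]] = None

    def open(self, term: str, sort_by_score: bool = False) -> None:
        """Run the query and write (textkey, score) hits to the buffer."""
        hits = [
            (key, text.lower().split().count(term.lower()))
            for key, text in sorted(self.docs.items())   # textkey order
        ]
        hits = [hit for hit in hits if hit[1] > 0]
        if sort_by_score:
            # list.sort is stable, so equal scores stay in textkey order.
            hits.sort(key=lambda hit: -hit[1])
        self._buffer = hits

    def fetch(self) -> Optional[Tuple[int, int]]:
        """Return the next hit, or None when the buffer is exhausted."""
        if self._buffer is None:
            raise RuntimeError("cursor is not open")
        return self._buffer.pop(0) if self._buffer else None

    def close(self) -> None:
        """Release the buffer."""
        self._buffer = None


docs = {3: "dog cat", 1: "cat dog", 2: "rain", 4: "cat cat cat"}
query = InMemoryQuery(docs)
query.open("cat", sort_by_score=True)
results = []
while True:
    hit = query.fetch()
    if hit is None:
        break
    results.append(hit)
query.close()
print(results)
```

No result table is involved at any point; the hits exist only in the buffer and are gone once the cursor is closed.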

See Also:

For more information about using in-memory queries, see "Using In-Memory Queries" in Chapter 3.

Counting Query Hits

In addition to two-step, one-step, and in-memory queries, you can use the CTX_QUERY.COUNT_HITS function to return the number of hits for a query without generating scores for the hits or returning the textkeys for the documents. The documents can be stored in a local or remote database. Counting query hits is generally much faster than performing a full query and can be used to audit queries to ensure large and unmanageable hitlists are not returned.
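
The auditing use described above can be sketched in Python as follows. The functions and the sample documents are hypothetical stand-ins, not CTX_QUERY calls: count_hits only counts matching documents, and guarded_query runs the full query only when that count stays under a limit chosen by the application.

```python
from typing import Dict, List, Tuple


def matches(text: str, term: str) -> bool:
    """True if term occurs as a whole word in text (case-insensitive)."""
    return term.lower() in text.lower().split()


def count_hits(docs: Dict[int, str], term: str) -> int:
    """Count matching documents without computing scores or collecting textkeys."""
    return sum(1 for text in docs.values() if matches(text, term))


def guarded_query(docs: Dict[int, str], term: str, max_hits: int) -> List[Tuple[int, int]]:
    """Run the full query only when the hit count is manageable."""
    hits = count_hits(docs, term)
    if hits > max_hits:
        raise ValueError(f"query would return {hits} hits; limit is {max_hits}")
    return [
        (key, text.lower().split().count(term.lower()))
        for key, text in sorted(docs.items())
        if matches(text, term)
    ]


docs = {1: "cat dog", 2: "rain", 3: "dog cat", 4: "cat cat cat"}
print(count_hits(docs, "cat"))
print(guarded_query(docs, "cat", max_hits=5))
```

Calling guarded_query(docs, "cat", max_hits=2) raises ValueError in this sketch, because three documents match.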

See Also:

For more information about counting query hits, see "Counting Query Hits" in Chapter 3.

Query Expressions

Query expressions are made up of words and phrases (query terms) combined with operators and other special characters to produce search criteria. Operators specify the relative importance of the query terms, define relationships between those terms, control how the search is performed, and determine how much output is returned.

The most basic kind of query expression is a single word or phrase, which returns documents with a score based on the number of occurrences of that word or phrase. More complex expressions allow the user to weight certain terms, search for words that sound like each other, and find all of the words based on a particular root.

ConText provides a rich vocabulary of operators that can be used to create query expressions that meet many complex user needs.

See Also:

For more information about query expressions, see Chapter 4, "Understanding Query Expressions".

Stored Query Expressions

A stored query expression (SQE) is a named query expression that has been stored in database tables along with the results of the query.

You can combine queries by referencing an SQE within the query expression of another query. Using an SQE in a query results in faster execution of the query because the results are already stored in the database.

Stored query expressions can also be used to perform interactive queries, in which an initial query is refined using one or more additional queries.
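
The idea of refining a stored query can be sketched in Python. This is a toy model under stated assumptions: the dictionary stands in for the database tables that hold the SQE, the function names are hypothetical, and refinement is modelled as a simple AND with one extra word.

```python
from typing import Dict, Set, Tuple

# Hypothetical document collection: textkey -> text.
DOCS: Dict[int, str] = {
    1: "oracle database text",
    2: "text retrieval",
    3: "oracle text retrieval",
}

# SQE name -> (query expression, stored set of matching textkeys).
stored_queries: Dict[str, Tuple[str, Set[int]]] = {}


def run_query(term: str) -> Set[int]:
    """Evaluate a single-word query against the whole collection."""
    return {key for key, text in DOCS.items() if term in text.split()}


def store_query(name: str, term: str) -> None:
    """Save the expression together with its results under a name."""
    stored_queries[name] = (term, run_query(term))


def refine(sqe_name: str, term: str) -> Set[int]:
    """AND a new term with a stored query, reusing the stored results."""
    _, stored_hits = stored_queries[sqe_name]
    # Only the documents already in the stored results are examined.
    return {key for key in stored_hits if term in DOCS[key].split()}


store_query("about_oracle", "oracle")
print(sorted(refine("about_oracle", "retrieval")))
```

The refinement step never rescans the full collection for the stored expression, which is the source of the speed-up in this sketch.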

See Also:

For more information about using stored query expressions, see "Stored Query Expressions" in Chapter 4.

Hitlists

Whenever a query is executed, ConText generates a list of all the documents that meet the search criteria together with a score to indicate the relative importance of the document with regard to the search criteria. This is a hitlist.

In a two-step query, the hitlist is created explicitly and returned to the user as a result table that must have been allocated by the application program.

In a one-step query, the hitlist is generated and processed internally by ConText. The results of the query, including the generated scores, are returned to the user as a record set of selected documents; the hitlist is not available as a separate table.

In an in-memory query, the hitlist is stored in memory and is returned through a loop that fetches the individual hits from memory.

Scoring

Scoring is the method ConText uses to indicate which of the documents in the hitlist are most closely related to the user's needs based on the search criteria. The score is based on a numerical analysis of the occurrences of the query expression.

For example, a document that contains the search expression 10 times is considered more relevant than one that contains the expression only 5 times.

In basic queries, the score is calculated as the number of times a chosen search word appears in the document, and the score can be used to order the hitlist so that the highest scoring documents appear first. In more complex queries, the score is affected by various relationships between words and phrases; weights applied to various elements of the search expression also affect the score by giving more or less emphasis to the occurrence of those terms within the document.

Scores are generated by the general purpose text engine during queries (text or theme). The engine calculates a relevance score for each cell in the text column that meets the search criteria. The upper bound of the score value is 100, and each row meeting the criteria is assigned a score between 1 and 100.
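
The following Python sketch models a score that is based on occurrence counts and bounded at 100. The capping rule and the weighting rule (multiplying each term's count by its weight) are assumptions made for illustration; the text above states only that basic scores reflect the number of occurrences, that weights change the emphasis of terms, and that scores fall between 1 and 100.

```python
from typing import Dict

MAX_SCORE = 100  # upper bound stated in the text


def occurrences(text: str, term: str) -> int:
    """Count whole-word, case-insensitive occurrences of term in text."""
    return text.lower().split().count(term.lower())


def basic_score(text: str, term: str) -> int:
    """Occurrence count, capped at MAX_SCORE (the cap is an assumed rule)."""
    return min(occurrences(text, term), MAX_SCORE)


def weighted_score(text: str, weights: Dict[str, int]) -> int:
    """Sum of weight * occurrences over all terms, capped at MAX_SCORE (assumed rule)."""
    total = sum(weight * occurrences(text, term) for term, weight in weights.items())
    return min(total, MAX_SCORE)


doc = "cat dog cat bird cat"
print(basic_score(doc, "cat"))
print(weighted_score(doc, {"cat": 2, "dog": 1}))
print(basic_score("cat " * 250, "cat"))
```

A document for which this function returns 0 does not meet the search criteria and would not appear in the hitlist, which is why the lowest score in a hitlist is 1.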

In two-step queries, the score is generated by the CTX_QUERY.CONTAINS procedure and stored in a result table called the hitlist table.

In one-step queries, the score is generated internally by the CONTAINS function and returned by the SCORE function.

In in-memory queries, score is one of the output arguments specified when running the query and is returned when the hits are retrieved.

Result Tables

Result tables are storage areas used by ConText to store output from user queries. These tables are allocated by the application program or procedure and exist until they are released by the application.

Result tables store the following:

output of a two-step query.
highlighting output for viewing query terms in documents.
linguistic output.

Result tables are also used in one-step queries; however, the tables used in one-step queries are internal tables that are allocated by ConText and cannot be accessed from the application program.

You can create result tables using the SQL command CREATE or using the CTX_QUERY.GETTAB function.

See Also:

For more information about creating and using result tables, see "Hitlist Result Tables" in Chapter 3.

For more information about the structure of result tables, see Appendix A, "Result Tables".

For more information about generating linguistic output, see "Generating Linguistic Output" in Chapter 8.
