Supporting Multilingual Databases with Unicode



Overview of Unicode
Dealing with many different languages in the same application or database has been complicated and difficult for a long time. To overcome the limitations of existing character encodings, several organizations began working on the creation of a global character set in the late 1980s. The need for this became even greater with the development of the World Wide Web in the mid-1990s. The Internet has changed how we do business today, with an emphasis on the global market that has made a universal character set a major requirement. This global character set needs to contain all major living scripts, support legacy data and implementations, and be simple enough that a single implementation of a product is sufficient for worldwide use. This character set should also support multilingual users and organizations, conform to international standards, and enable world-wide interchange of data. This global character set exists, is in wide use, and is called Unicode.

What is Unicode?

Unicode is a universal encoded character set that allows you to store information from any language using a single character set. Unicode provides a unique code value for every character, regardless of the platform, program, or language.

The Unicode standard has been adopted by many software and hardware vendors. Many operating systems and browsers now support Unicode. Unicode is required by modern standards such as XML, Java, JavaScript, LDAP, CORBA 3.0, and WML. It is also synchronized with the ISO/IEC 10646 standard.

Oracle started supporting Unicode as a database character set in Oracle7. In Oracle9i, Unicode support has been expanded so that you can find the right solution for your globalization needs. Oracle9i supports Unicode 3.0, the third version of the Unicode standard.

Unicode Encoding

There are two different ways to encode Unicode 3.0 characters:

  • UTF-16 Encoding
  • UTF-8 Encoding
UTF-16 Encoding

UTF-16 encoding is the 16-bit encoding of Unicode in which the character codes 0x0000 through 0x007F contain the ASCII characters. One Unicode character can be 2 bytes or 4 bytes in this encoding. Characters from both European and most Asian scripts are represented in 2 bytes. Surrogate pairs (described below) are represented in 4 bytes.

UTF-8 Encoding

UTF-8 is the 8-bit encoding of Unicode. It is a variable-width encoding in which the character codes 0x00 through 0x7F have the same meaning as ASCII. One Unicode character can be 1 byte, 2 bytes, 3 bytes or 4 bytes in this encoding. Characters from the European scripts are represented in either 1 or 2 bytes, while characters from most Asian scripts are represented in 3 bytes. Surrogate pairs (described below) are represented in 4 bytes.
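
To see this variable-width behavior directly, you can compare character and byte lengths with the built-in LENGTH and LENGTHB functions. The following is a minimal sketch, assuming a database whose character set is UTF8 or AL32UTF8:

    SELECT LENGTH('abc')   AS ascii_chars,
           LENGTHB('abc')  AS ascii_bytes,   -- 3 bytes: ASCII characters use 1 byte each
           LENGTH('für')   AS latin_chars,
           LENGTHB('für')  AS latin_bytes    -- 4 bytes: the umlaut takes 2 bytes in UTF-8
    FROM dual;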

Surrogate Pairs

Unicode can be extended to encode more than 1 million characters. These extended characters are represented with surrogate pairs. Surrogate pairs are designed to allow representation of characters defined in Unicode 3.0 and in future extensions of the Unicode standard. Surrogate pairs require 4 bytes in the UTF-8 and UTF-16 encodings.

Implementing a Unicode Solution in the Database

You can store Unicode characters in an Oracle9i database in two ways. The following sections explain how to use the two Unicode solutions and how to choose between them:

  • Enabling Multilingual Support with Unicode Databases
  • Enabling Multilingual Support with Unicode Datatypes
You can create a Unicode database that allows you to store UTF-8 encoded characters in SQL CHAR datatypes (CHAR, VARCHAR2, CLOB, and LONG). If you prefer to implement Unicode support incrementally, you can store Unicode data in either the UTF-16 or UTF-8 encoding form in SQL NCHAR datatypes (NCHAR, NVARCHAR2, and NCLOB). The SQL NCHAR datatypes are called Unicode datatypes because they are used for storing Unicode data only.
  • Enabling Multilingual Support with Unicode Databases
    The Oracle9i database has the concept of a database character set, which specifies the encoding to be used in the SQL CHAR datatypes as well as the metadata such as table names, column names, and SQL statements. A Unicode database is a database with UTF-8 as the database character set. There are three Oracle character sets that implement the UTF-8 encoding. The first two are designed for ASCII-based platforms while the third one should be used on EBCDIC platforms.

    AL32UTF8

    The AL32UTF8 character set encodes characters in one to three bytes. Surrogate pairs require four bytes. It is for ASCII-based platforms.

    UTF8

    The UTF8 character set encodes characters in one to three bytes. Surrogate pairs require six bytes. It is for ASCII-based platforms.

    UTFE

    The UTFE character set should be used as the database character set on EBCDIC platforms to support the UTF-8 encoding.

    Example: Creating a Database with a Unicode Character Set
    To create a database with the UTF8 character set, use the CREATE DATABASE statement. For example:

    
    
    CREATE DATABASE myunicodedatabase
        CONTROLFILE REUSE
        LOGFILE '/u01/oracle/utfdb/redo01.log' SIZE 1M REUSE,
                '/u01/oracle/utfdb/redo02.log' SIZE 1M REUSE
        DATAFILE '/u01/oracle/utfdb/system01.dbf' SIZE 10M REUSE
            AUTOEXTEND ON
            NEXT 10M MAXSIZE 200M
        CHARACTER SET UTF8;
    
    
    Enabling Multilingual Support with Unicode Datatypes

    An alternative to storing Unicode data in the database is to use the SQL NCHAR datatypes. You can store Unicode characters in columns of these datatypes regardless of how the database character set has been defined. The NCHAR datatype has been redefined in Oracle9i to be a Unicode datatype exclusively; in other words, it stores data in a Unicode encoding only. You can use the SQL NCHAR datatypes in the same way you use the SQL CHAR datatypes. For example, you can create a table using the NVARCHAR2 and NCHAR datatypes as follows:

    
    CREATE TABLE product
     (id          NUMBER,
      ename       NCHAR(100),
      description NVARCHAR2(1000));
      
    The encoding used in the SQL NCHAR datatypes is specified as the national character set of the database. You can specify one of the following two Oracle character sets as the national character set:

    AL16UTF16

    This is the default character set for SQL NCHAR datatypes. The character set encodes Unicode data in the UTF-16 encoding.

    UTF8

    When UTF8 is specified for SQL NCHAR datatypes, the data stored in the SQL NCHAR datatypes is in the UTF-8 encoding.

    By default, data is stored in the UTF-16 encoding in the SQL NCHAR datatypes, and the length specified in the NCHAR and NVARCHAR2 columns is always in the number of characters instead of the number of bytes.
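
    The character-based sizing can be seen in a minimal sketch such as the following (the table name is illustrative); with AL16UTF16 as the national character set, the 10-character limit below corresponds to at most 20 bytes of storage:

    CREATE TABLE translated_names (
      id   NUMBER,
      name NVARCHAR2(10)    -- the limit is 10 characters, not 10 bytes
    );

    -- LENGTH counts characters, LENGTHB counts bytes of the stored value
    INSERT INTO translated_names VALUES (1, N'グローバル化');
    SELECT LENGTH(name) AS chars, LENGTHB(name) AS bytes FROM translated_names;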

    You can specify the national character set for the SQL NCHAR datatypes when you create a database using the CREATE DATABASE statement. The following command creates a database with WE8ISO8859P1 as the database character set and AL16UTF16 as the national character set.

    Example Creating a Database with a National Character Set

    
    
    CREATE DATABASE myunicodedatabase
      CONTROLFILE REUSE
      LOGFILE '/u01/oracle/utfdb/redo01.log' SIZE 1M REUSE,
              '/u01/oracle/utfdb/redo02.log' SIZE 1M REUSE
      DATAFILE '/u01/oracle/utfdb/system01.dbf' SIZE 10M REUSE
        AUTOEXTEND ON
        NEXT 10M MAXSIZE 200M
      CHARACTER SET WE8ISO8859P1
      NATIONAL CHARACTER SET AL16UTF16;
    
      
    This example also illustrates the fact that you can store Unicode data in a non-Unicode database by using SQL NCHAR datatypes.

    How to Choose Between a Unicode Database and a Unicode Datatype Solution

    In order to choose the right Unicode solution for your database, you need to consider the following:

    1. Programming environment

      What are the main programming languages used in your applications? How do they support Unicode?

    2. Ease of migration

      How easily can your data and applications be migrated to take advantage of the Unicode solution?

    3. Performance

      How much performance overhead are you willing to accept in order to use Unicode in the database?

    4. Type of data

      Is your data mostly Asian or European? Do you need to store multilingual documents into LOB columns?

    5. Type of applications

      What type of applications are you implementing: a packaged application or a customized end-user application?

    This section describes some general guidelines for choosing a Unicode database or a Unicode datatype solution. The final decision largely depends on your exact environment and requirements.

    When Should You Use a Unicode Database?

    You should use a Unicode database when:

    1. You need easy code migration for Java or PL/SQL.

      If your existing application is mainly written in Java and PL/SQL and your main concern is to minimize the code change required to support multiple languages, you may want to use a Unicode database solution. As long as the datatypes used to store your data remain SQL CHAR datatypes, the Java and PL/SQL code accessing these columns does not need to change.

    2. You need easy data migration from ASCII.

      If the database character set of your existing database is US7ASCII, you may want to choose a Unicode database solution because you can migrate your database using a single ALTER DATABASE statement. No data conversion is required because ASCII is a subset of UTF-8.

    3. You have evenly distributed multilingual data.

      If the multilingual data will be evenly distributed in your existing schema tables and you are not sure which ones will contain multilingual data, then you should use the Unicode database solution because it does not require you to identify which columns store what data.

    4. Your SQL statements and PL/SQL code contain Unicode data.

      If you need to write SQL statements and PL/SQL code in Unicode, you must use the Unicode database solution. SQL statements and PL/SQL code are converted into the database character set before being processed. If your SQL statements and PL/SQL code contain characters that cannot be converted to your database character set, those characters will be lost. A common place where you would use Unicode data in a SQL statement is in a string literal (a brief sketch follows this list).

    5. You want to store multilingual documents as BLOBs and use Oracle Text for content searching.

      You must use a Unicode database in this case. The BLOB data is converted to the database character set before being indexed by Oracle Text. If your database character set is not UTF8, then data will be lost when the documents contain characters that cannot be converted to the database character set.
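
    As a brief illustration of the string literal case in point 4 above, the following sketch inserts a literal containing non-ASCII characters (the table name is illustrative). The UNISTR function, which builds a string from Unicode escape sequences, is another way to write such literals:

    INSERT INTO product_names (product_id, name)
    VALUES (101, 'Straßenkarte für München');

    -- UNISTR expands \XXXX escapes into the corresponding Unicode characters
    SELECT UNISTR('Stra\00DFenkarte') FROM dual;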

    After you decide to use a Unicode database solution, you need to decide which UTF-8 character set is best for your database.

    When Should You Use Unicode Datatypes?

    You should use Unicode datatypes when:

    1. You want to add multilingual support incrementally.

      If you want to add Unicode support to your existing database without migrating the character set, you should consider using Unicode datatypes to store Unicode data. You can add columns of the SQL NCHAR datatypes to existing tables or new tables to support multiple languages incrementally (see the sketch after this list).

    2. You want to build a packaged application.

      If you are building a packaged application that will be sold to customers, you may want to build the application using SQL NCHAR datatypes. This is because the SQL NCHAR datatype is a reliable Unicode datatype in which the data is always stored in Unicode, and the length of the data is always specified in UTF-16 code units. As a result, you need only test the application once, and your application will run on customer databases with any database character set.

    3. You want better performance with single-byte database character sets.

      If performance is your biggest concern, you should consider using a single-byte database character set and storing Unicode data in the SQL NCHAR datatypes. Databases using a multibyte database character set such as UTF8 have a performance overhead.

    4. You require UTF-16 support in Windows clients.

      If your applications are written in Visual C/C++ or Visual Basic running on Windows, you may want to use the SQL NCHAR datatypes because you can store UTF-16 data in these datatypes in the same way that you store it in the wchar_t buffer in Visual C/C++ and string buffer in Visual Basic. You can avoid buffer overflow in your client applications because the length of the wchar_t and string datatypes match the length of the SQL NCHAR datatypes in the database.
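
    As noted in point 1 above, columns of the SQL NCHAR datatypes can be added to an existing table without changing the database character set. A minimal sketch, with illustrative table and column names:

    ALTER TABLE customers ADD (
      name_national    NVARCHAR2(100),   -- Unicode-capable columns added alongside
      address_national NVARCHAR2(400)    -- the existing CHAR/VARCHAR2 columns
    );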

    Unicode Case Studies
    This section describes a few typical scenarios for storing Unicode characters in an Oracle9i database.

    Unicode Solution with a Unicode Database

    An American company running a Java application would like to add German and French support in the next release of their application, and add Japanese support at a later time. The company currently has the following system configuration:

    • The existing database has a database character set of US7ASCII.
    • All character data in the existing database is composed of ASCII characters.
    • PL/SQL stored procedures are used in the database.
    • The database is around 300 GB.
    • There is a nightly downtime of 4 hours.
    In this case, a typical solution is to choose UTF8 for the database character set because of the following reasons:

    The database is very large and the scheduled downtime is short. How fast the database can be migrated to support Unicode is vital. Because the database is in US7ASCII, the easiest and fastest way of enabling the database to support Unicode is to switch the database character set to UTF8 by issuing the ALTER DATABASE statement. No data conversion is required because US7ASCII is a subset of UTF8.

    Because most of the code is written in Java and PL/SQL, changing the database character set to UTF8 is unlikely to break existing code. Unicode support will be automatically enabled in the application.

    Because the application will support French, German, and Japanese, there will be few surrogate characters. Both AL32UTF8 and UTF8 are suitable.

    Unicode Solution with Unicode Datatypes

    A European company running its applications mainly on Windows platforms wants to add new Windows applications written in Visual C/C++, which will use the existing database to support Japanese and Chinese customer names. The company currently has the following system configuration:

    • The existing database has a database character set of WE8ISO8859P1.
    • All character data in the existing database is composed of Western European characters.
    • The database is around 50 GB.
    In this case, a typical solution is to use NCHAR and NVARCHAR2 datatypes to store Unicode characters, and keep WE8ISO8859P1 as the database character set and use AL16UTF16 as the national character set. The reasons for this are:

    • Migrating the existing database to a Unicode database would require data conversion because the database character set is WE8ISO8859P1 (a Latin-1 character set), which is not a subset of UTF8. As a result, there would be some overhead in converting the data to UTF8.
    • The additional languages are supported in new applications only, so there is no dependency on the existing applications or schemas. It is simpler to use the Unicode datatype in the new schema and keep the existing schemas unchanged.
    • Only customer name columns require Unicode support. Using a single NCHAR column meets the customer's requirements without migrating the entire database.
    • Because the languages to be supported are mostly Asian languages, AL16UTF16 should be used as the national character set so that disk space is used more efficiently.
    • Lengths are treated in terms of characters in the SQL NCHAR datatypes. This is the same as the way they are treated when using wchar_t strings in Windows C/C++ programs. This reduces programming complexity.
    • Existing applications using the existing schemas are unaffected.
    Unicode Solution with Both a Unicode Database and Unicode Datatypes

    A Japanese company wants to develop a new Java application on Oracle9i. The company projects that the application will support as many languages as possible in the long run.

    In order to store documents as is, the company decided to use the BLOB datatype to store documents of multiple languages.

    The company may also want to generate UTF-8 XML documents from the relational data for business-to-business data exchange.

    The back-end has Windows applications written in C/C++ using ODBC to access the Oracle database.

    In this case, the typical solution is to create a Unicode database using AL32UTF8 as the database character set and use the SQL NCHAR datatypes to store multilingual data. The national character set should be set to AL16UTF16. The reasons for this solution are:

    When documents of different languages are stored as BLOBs, Oracle Text requires the database character set to be one of the UTF-8 character sets. Because the applications may retrieve relational data as UTF-8 XML format (where surrogate characters are stored as four bytes), AL32UTF8 should be used as the database character set to avoid redundant data conversion when UTF-8 data is retrieved or inserted.

    Because applications are new and written in both Java and Windows C/C++, the company should use the SQL NCHAR datatype for its relational data as both Java and Windows support the UTF-16 character datatype and the length of a character string is always measured in the number of characters.

    If most of the data is for Asian languages, AL16UTF16 should be used in conjunction with the SQL NCHAR datatypes because AL16UTF16 offers better performance and storage efficiency.

    Migrating Data to Unicode
    It is important to separate the task of character set migration from the task of database version migration. For example, if you have an Oracle8i non-Unicode database and you want to migrate it to an Oracle9i Unicode database, you must first migrate it to Oracle9i and then migrate the data to Unicode.

    This section describes how to migrate your data to Unicode in Oracle9i. "Migrating to Use the NCHAR Datatypes" describes how to migrate non-Unicode SQL CHAR datatypes to SQL NCHAR datatypes. It also describes how to migrate pre-Oracle9i SQL NCHAR datatypes to Oracle9i SQL NCHAR datatypes.

    Before you actually migrate your data to Unicode, you need to identify areas of possible data character set conversions and truncation of data. Oracle strongly recommends that you analyze your database using the Character Set Scanner Utility for possible problems before actually migrating the database.
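
    The Character Set Scanner is invoked as the csscan command-line utility. The following one-line sketch is recalled from the Oracle9i utility; treat the parameter names as assumptions and verify them against the documentation for your release:

    csscan system/<password> FULL=Y TOCHAR=UTF8 LOG=utf8scan CAPTURE=Y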

    Migrating to a Unicode Database
    There are three general approaches when migrating data from non-Unicode character set to Unicode:

    • Full Export and Import
    • The ALTER DATABASE CHARACTER SET Statement
    • The ALTER DATABASE CHARACTER SET Statement and Selective Imports and Exports
    Full Export and Import

    In most cases, a full export and import is recommended to properly convert all data to a Unicode character set. It is important to be aware of data truncation issues because character datatype columns might need to be extended before importing to handle the increase of data byte size.

    The steps to migrate to Unicode using a full export and import are:

    1. Scan the database to identify columns that need to be extended. Use the Character Set Scanner Utility.
    2. Export the entire database.
    3. Create a new database using either UTF8 or AL32UTF8 on ASCII-based platforms, using UTFE on EBCDIC platforms.
    4. Create the tables identified in step 1 with extended column sizes.
    5. Import the .DMP file exported in step 2 into the new database (a command-line sketch of the export and import follows these steps).
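
    The export and import in steps 2 and 5 use the exp and imp utilities. The following is a sketch only; connection details and file names are placeholders:

    exp system/<password> FULL=Y FILE=full.dmp LOG=exp.log
    imp system/<password> FULL=Y IGNORE=Y FILE=full.dmp LOG=imp.log
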
    The ALTER DATABASE CHARACTER SET Statement

    If, and only if, the current database character set is US7ASCII and all the data is in the 7-bit range, you can use the ALTER DATABASE CHARACTER SET statement to expedite migration to a Unicode database character set. Note that this approach cannot be taken on EBCDIC platforms because UTFE is not a strict superset of any EBCDIC character set.

    Use the following steps to migrate to Unicode using the ALTER DATABASE CHARACTER SET statement:

    • Scan the database to make sure all data is in the 7-bit range. Use the Character Set Scanner Utility.
    • Change the database character set to UTF8 or AL32UTF8 using the ALTER DATABASE CHARACTER SET statement (see the sketch below).
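
    A minimal sketch of the statement in context is shown below. The fully documented procedure also requires the database to be opened in restricted mode with no active users, so verify the exact steps against the Globalization Support Guide for your release:

    SHUTDOWN IMMEDIATE;
    STARTUP MOUNT;
    ALTER SYSTEM ENABLE RESTRICTED SESSION;
    ALTER DATABASE OPEN;
    ALTER DATABASE CHARACTER SET UTF8;
    SHUTDOWN IMMEDIATE;
    STARTUP;
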
    The ALTER DATABASE CHARACTER SET Statement and Selective Imports and Exports

    Another approach is to issue an ALTER DATABASE CHARACTER SET statement followed by selective imports. This method can be used when the distribution of convertible data is known and the data is stored within a small number of tables.

    The steps to migrate to Unicode using selective imports are:

    1. Scan the database to identify tables that contain convertible data.
    2. Export those tables identified in step 1.
    3. Delete all rows from the tables identified in step 1.
    4. Change the database character set to UTF8 or AL32UTF8 with the ALTER DATABASE CHARACTER SET statement.
    5. Import the dump files into the database.
    Migrating to Use the NCHAR Datatypes
    Release 8.0 of the Oracle Server introduced a national character (NCHAR) datatype that allows a second, alternate character set in addition to the database character set. The NCHAR datatype supports a number of special, fixed-width Asian character sets that were introduced to provide higher-performance processing of Asian character data.

    In Oracle9i, the SQL NCHAR datatypes are limited to the Unicode character set encoding (UTF8 and AL16UTF16) only. Any other Oracle8 Server character sets that were available under the NCHAR datatype, including Asian character sets (for example, JA16SJISFIXED), will no longer be supported.

    The migration steps for existing NCHAR, NVARCHAR2, and NCLOB columns through export and import are as follows:

    • Export all SQL NCHAR columns from the Oracle8 or Oracle8i database.
    • Drop the SQL NCHAR columns.
    • Upgrade the database to Oracle9i.
    • Import the SQL NCHAR columns into Oracle9i.
    The Oracle9i migration utility can also convert your Oracle8 and Oracle8i NCHAR columns to Oracle9i NCHAR columns. A SQL NCHAR upgrade script called utlchar.sql is supplied with the migration utility. You should run it at the end of the migration to convert your Oracle8 and Oracle8i NCHAR columns to the new Oracle9i NCHAR columns. Once the script has been executed, the data cannot be downgraded, because there is no downgrade SQL NCHAR script. The only way to move back to Oracle8 or Oracle8i is to drop all NCHAR columns, downgrade the database, and import the old NCHAR data from a previous Oracle8 or Oracle8i export file. Make sure you have a backup (export file) of your Oracle8 or Oracle8i NCHAR data in case you need to downgrade your database in the future.

    To take advantage of the new Unicode NCHAR datatypes, you can also use the Export and Import utilities to migrate SQL CHAR columns to SQL NCHAR columns:

    1. Export the SQL CHAR columns that you want to convert to SQL NCHAR.
    2. Drop the columns that were just exported.
    3. Import the columns as SQL NCHAR columns.
    Designing Database Schemas to Support Multiple Languages
    In addition to choosing a Unicode solution, the following should also be taken into consideration when the database schema is designed to support multiple languages:

    • Specifying Column Limits
    • Storing Data of Multiple Languages
    • Storing Documents in LOBs
    Specifying Column Limits

    When you use NCHAR and NVARCHAR2 datatypes for storing multilingual data, the column limit specified for a column is always in character semantics (which is in terms of the number of Unicode code units).

    When you use CHAR and VARCHAR2 datatypes for storing multilingual data, the column limit specified for each column is, by default, in number of bytes. If the database needs to support Thai, Arabic, or multibyte languages such as Chinese and Japanese, the limits for the CHAR, VARCHAR, and VARCHAR2 columns may need to be extended. This is because the number of bytes required to encode these languages in UTF8 or AL32UTF8 may be significantly larger than those for English and Western European languages. For example, one Thai character in the Thai character set requires 3 bytes in UTF8 or AL32UTF8. In addition, the maximum limits for CHAR, VARCHAR, and VARCHAR2 datatypes are 2000 bytes, 4000 bytes, and 4000 bytes respectively. If applications need to store more than 4000 bytes, you should use the CLOB datatype for the data.
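
    Oracle9i also lets you declare CHAR and VARCHAR2 limits in character rather than byte semantics, which avoids multiplying byte limits by hand. A sketch with illustrative names:

    CREATE TABLE thai_products (
      id          NUMBER,
      description VARCHAR2(100 CHAR)   -- 100 characters, up to 300 bytes in UTF8
    );

    -- Descriptions that may exceed 4000 bytes belong in a CLOB instead
    ALTER TABLE thai_products ADD (long_description CLOB);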

    Storing Data of Multiple Languages

    The Unicode character set includes characters of most written languages around the world, but it does not tell you the language to which a given character belongs. In other words, a character such as ä does not indicate whether it is French or German. In order to present information in the language a user desires, data stored in a Unicode database should be accompanied by information identifying the language to which it belongs.

    There are many ways for a database schema to relate data to a language. Here is one example.

    Store Language Information with the Data

    For data such as product descriptions or product names, you can add a language column (language_id) of CHAR or VARCHAR2 datatype to the product table to identify the language of the corresponding product information. This enables accessing applications to retrieve the information in the desired language. The possible values for this language column are the 3-letter abbreviations of the valid NLS_LANGUAGE values of the database.
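
    A sketch of such a product table follows (column sizes and the constraint name are illustrative); the language_id value is intended to match what sys_context('USERENV','LANG') returns for the session:

    CREATE TABLE products_table (
      product_id   NUMBER        NOT NULL,
      language_id  VARCHAR2(3)   NOT NULL,   -- e.g. 'US', 'F', 'D', 'JA'
      product_name NVARCHAR2(100),
      CONSTRAINT products_table_pk PRIMARY KEY (product_id, language_id)
    );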

    You can also create a view to select the data of the current language. For example:

    CREATE OR REPLACE VIEW product AS
    SELECT product_id, product_name
    FROM products_table
    WHERE language_id = sys_context('USERENV','LANG');

    Select Translated Data Using Fine-Grained Access Control

    Fine-grained access control allows you to limit the degree to which a user can view information in a table or view. Typically, this is done by appending a WHERE clause. Once you add a WHERE clause as a fine-grained access policy to a table or view, Oracle9i automatically appends the WHERE clause to any SQL statements on the table at run time so that only those rows satisfying the WHERE clause can be accessed.

    You can use this feature to avoid specifying the desired language of a user in the WHERE clause in each and every SELECT statement in your applications. The following WHERE clause limits the view of a table to the rows corresponding to the desired language of a user:

    WHERE language_id = sys_context('userenv', 'LANG')

    When you specify this WHERE clause as a fine-grained access policy for your product_table as follows:

    DBMS_RLS.ADD_POLICY ('scott', 'product_table', 'lang_policy', 'scott',
    'language_id = sys_context(''userenv'', ''LANG'')', 'select');

    Then any SELECT statement on the product_table table will automatically have the WHERE clause appended at run time.
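
    Note that in released versions of the DBMS_RLS package, the fifth argument of ADD_POLICY names a policy function that returns the predicate, rather than taking the predicate text directly. A fuller version of the example therefore looks like the following sketch (the function and policy names are illustrative):

    -- Policy function returning the predicate to be appended
    CREATE OR REPLACE FUNCTION lang_predicate (
      obj_schema VARCHAR2,
      obj_name   VARCHAR2
    ) RETURN VARCHAR2 IS
    BEGIN
      RETURN 'language_id = sys_context(''USERENV'', ''LANG'')';
    END;
    /

    -- Register the policy against the table
    BEGIN
      DBMS_RLS.ADD_POLICY(
        object_schema   => 'scott',
        object_name     => 'product_table',
        policy_name     => 'lang_policy',
        function_schema => 'scott',
        policy_function => 'lang_predicate',
        statement_types => 'select');
    END;
    /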

    Storing Documents in LOBs

    You can store documents in multiple languages in CLOB, NCLOB or BLOB and set up Oracle Text to enable content search for the documents.

    Data in CLOB columns is always stored as UTF-16 internally when the database character set is of varying width, such as UTF8 or AL32UTF8. Document contents are converted to UTF-16 when they are inserted into a CLOB column. This means that the storage space required for an English document doubles when the data is converted. Storage for an Asian language document, such as Japanese, in a CLOB column requires less storage space than the same document in a LONG column using UTF8 (typically around 30% less, depending on the contents of the document).

    Documents in NCLOB are also stored as UTF-16 regardless of the database character set or national character set. The storage space requirement is the same as in CLOB. Document contents are converted to UTF-16 when they are inserted into a NCLOB column. If you want to store multilingual documents in a non-Unicode database, you should choose NCLOB. However, content search on NCLOB is not yet supported.

    Documents in BLOB format are stored as they are. No data conversion occurs during insert and retrieval. However, SQL string manipulation functions (such as LENGTH or SUBSTR) and collation functions (such as NLS_SORT and ORDER BY) are not applicable to the BLOB datatype.

    Creating Multi-Lexers
    The first step in creating the multi-lexer is the creation of language-specific lexer preferences for each language supported. The following example creates English, German, and Japanese lexers with PL/SQL procedures:

    ctx_ddl.create_preference('english_lexer', 'basic_lexer');
    ctx_ddl.set_attribute('english_lexer','index_themes','yes');
    ctx_ddl.create_preference('german_lexer', 'basic_lexer');
    ctx_ddl.set_attribute('german_lexer','composite','german');
    ctx_ddl.set_attribute('german_lexer','alternate_spelling','german');
    ctx_ddl.set_attribute('german_lexer','mixed_case','yes');
    ctx_ddl.create_preference('japanese_lexer', 'JAPANESE_VGRAM_LEXER');

    Once the language-specific lexer preferences are created, they need to be gathered together under a single multi-lexer preference. First, create the multi-lexer preference, using the MULTI_LEXER object:

    ctx_ddl.create_preference('global_lexer','multi_lexer');

    Now we must add the language-specific lexers to the multi-lexer preference using the add_sub_lexer call:

    ctx_ddl.add_sub_lexer('global_lexer', 'german', 'german_lexer');
    ctx_ddl.add_sub_lexer('global_lexer', 'japanese', 'japanese_lexer');
    ctx_ddl.add_sub_lexer('global_lexer', 'default','english_lexer');

    This nominates the german_lexer preference to handle German documents, the japanese_lexer preference to handle Japanese documents, and the english_lexer preference to handle everything else, using DEFAULT as the language.

    Building Indexes for Documents Stored as CLOBs

    The multi-lexer decides which lexer to use for each row based on a language column. This is a character column in the table which stores the language of the document in the text column. You should use the Oracle language name to identify the language of a document in this column. For instance, if you use CLOBs to store your documents, then you must add the language column to the table where the documents are stored:

    
    
    CREATE TABLE globaldoc ( 
      doc_id    NUMBER       PRIMARY KEY, 
      language  VARCHAR2(30), 
      text      CLOB 
      ); 
    
      
    To create an index for this table, use the multi-lexer preference and specify the name of the language column:

    CREATE INDEX globalx ON globaldoc(text)
    indextype IS ctxsys.context
    parameters ('lexer global_lexer
    language column language');

    Creating Indexes for Documents Stored as BLOBs
    In addition to the language column, a character set column and a format column must be added to the table where your documents are stored. The character set column stores the character set of the documents using the Oracle character set names. The format column specifies whether a document is a text or binary document. For instance, your table would look like:

    
    CREATE TABLE globaldoc ( 
       doc_id       NUMBER       PRIMARY KEY, 
       language     VARCHAR2(30), 
       characterset VARCHAR2(30), 
       format       VARCHAR2(10), 
       text         BLOB 
      ); 
    
    
    With the format column, you can put word-processing or spreadsheet documents into the table and specify binary in the format column. For text documents such as HTML, XML, and plain text, specify text in the format column. With the character set column, you can store text documents in different character sets.

    When you create the index, specify the names of the format and character set columns:

    
    
    CREATE INDEX globalx ON globaldoc(text) 
      indextype is ctxsys.context 
      parameters ('filter inso_filter 
                    lexer global_lexer 
                    language column language 
                    format  column format 
                    charset column characterset'); 
    
    	
    You may use the charset_filter if all documents are in text format.

    Source: Oracle Documentation