| ||
|
free Oracle DBA tutorial
Oracle Jobs Ask A Question SQL Statement Tuning Backup and Recovery Concepts Oracle 11g New Features Oracle E Suite & Others Oracle Data Guard Oracle DBA FAQ |
What is Unicode?
Unicode is a universal encoded character set that allows Oracle DBA to store information from any language using a single character set. Unicode provides a unique code value for every character, regardless of the platform, program, or language.
The Unicode standard has been adopted by many software and hardware vendors. Many operating systems and browsers now support Unicode. Unicode is required by modern standards such as XML, Java, JavaScript, LDAP, CORBA 3.0, and WML. It is also synchronized with the ISO/IEC 10646 standard.
Oracle started supporting Unicode as a database character set in Oracle7. In Oracle9i, Unicode support has been expanded so that Oracle DBA can find the right solution for your globalization needs. Oracle9i supports Unicode 3.0, the third version of the Unicode standard.
Unicode Encoding
There are two different ways to encode Unicode 3.0 characters:
UTF-16 encoding is the 16-bit encoding of Unicode in which the character codes 0x0000 through 0x007F contain the ASCII characters. One Unicode character can be 2 bytes or 4 bytes in this encoding. Characters from both European and most Asian scripts are represented in 2 bytes. Surrogate pairs (described below) are represented in 4 bytes.
UTF-8 Encoding
UTF-8 is the 8-bit encoding of Unicode. It is a variable-width encoding in which the character codes 0x00 through 0x7F have the same meaning as ASCII. One Unicode character can be 1 byte, 2 bytes, 3 bytes or 4 bytes in this encoding. Characters from the European scripts are represented in either 1 or 2 bytes, while characters from most Asian scripts are represented in 3 bytes. Surrogate pairs (described below) are represented in 4 bytes.
Surrogate Pairs
Oracle DBA can extend Unicode to encode more than 1 million characters. These extended characters are called surrogate pairs. Surrogate pairs are designed to allow representation of characters in Unicode 3.0 and future extensions of the Unicode standard. Surrogate pairs require 4 bytes in UTF-8 and UTF-16 encoding.
Implementing a Unicode Solution in the Database
Oracle DBA can store Unicode characters in an Oracle9i database in two ways. The following sections explain how to use the two Unicode solutions and how to choose between them:
AL32UTF8
The AL32UTF8 character set encodes characters in one to three bytes. Surrogate pairs require four bytes. It is for ASCII-based platforms.
UTF8
The UTF8 character set encodes characters in one to three bytes. Surrogate pairs require six bytes. It is for ASCII-based platforms.
UTFE
The UTFE character set should be used as the database character set on EBCDIC platforms to support the UTF-8 encoding.
Exampl Creating a Database with a Unicode Character Set
An alternative to storing Unicode data in the database is to use the SQL NCHAR datatypes. You can store Unicode characters into columns of these datatypes regardless of how the database character set has been defined. The NCHAR datatype has been redefined in Oracle9i to be a Unicode datatype exclusively. In other words, it stores data in the Unicode encoding only. Oracle DBA can use the SQL NCHAR datatypes in the same way you use the SQL CHAR datatypes. You can create a table using the NVARCHAR2 and NCHAR datatypes as follows:
AL16UTF16
This is the default character set for SQL NCHAR datatypes. The character set encodes Unicode data in the UTF-16 encoding.
UTF8
When UTF8 is specified for SQL NCHAR datatypes, the data stored in the SQL datatypes is in UTF-8 encoding.
By default, data is stored in the UTF-16 encoding in the SQL NCHAR datatypes, and the length specified in the NCHAR and NVARCHAR2 columns is always in the number of characters instead of the number of bytes.
You can specify the national character set for the SQL NCHAR datatypes when you create a database using the CREATE DATABASE statement. The following command creates a database with WE8ISO8859P1 as the database character set and AL16UTF16 as the national character set.
Example Creating a Database with a National Character Set
How to Choose Between a Unicode Database and a Unicode Datatype Solution
In order to choose the right Unicode solution for your database, you need to consider the following:
What are the main programming languages used in your applications? How do they support Unicode?
How easily can your data and applications be migrated to take advantage of the Unicode solution?
How much performance overhead are you willing to accept in order to use Unicode in the database?
Is your data mostly Asian or European? Do you need to store multilingual documents into LOB columns?
What type of applications are you implementing: a packaged application or a customized end-user application?
When Should You Use a Unicode Database?
You should use a Unicode database when:
If your existing application is mainly written in Java and PL/SQL and your main concern is to minimize the code change required to support multiple languages, you may want to use a Unicode database solution. As long as the datatypes used to stored your data remain as the SQL CHAR datatypes, the Java and PL/SQL accessing these columns do not need to change.
If the database character set of your existing database is US7ASCII, you may want to choose a Unicode database solution because you can migrate your database using a single ALTER DATABASE statement. No data conversion is required because ASCII is a subset of UTF-8.
If the multilingual data will be evenly distributed in your existing schema tables and you are not sure which ones will contain multilingual data, then you should use the Unicode database solution because it does not require you to identify which columns store what data.
If you need to write SQL statements and PL/SQL code in Unicode, you must use the Unicode database solution. SQL statements and PL/SQL code are converted into the database character set before being processed. If your SQL statements and PL/SQL code contain characters that cannot be converted to your database character set, those characters will be lost. A common place where you would use Unicode data in a SQL statement is in a string literal.
You must use a Unicode database in this case. The BLOB data is converted to the database character set before being indexed by Oracle Text. If your database character set is not UTF8, then data will be lost when the documents contain characters that cannot be converted to the database character set.
You should use Unicode datatypes when:
If you want to add Unicode support to your existing database without migrating the character set, you should consider using Unicode datatypes to store Unicode. You can add columns of the SQL NCHAR datatypes to existing tables or new tables to support multiple languages incrementally.
If you are building a packaged application that will be sold to customers, you may want to build the application using SQL NCHAR datatypes. This is because the SQL NCHAR datatype is a reliable Unicode datatype in which the data is always stored in Unicode, and the length of the data is always specified in UTF-16 code units. As a result, you need only test the application once, and your application will run on customer databases with any database character set.
If performance is your biggest concern, you should consider using a single-byte database character set and storing Unicode data in the SQL NCHAR datatypes. Databases using a multibyte database character set such as UTF8 have a performance overhead.
If your applications are written in Visual C/C++ or Visual Basic running on Windows, you may want to use the SQL NCHAR datatypes because you can store UTF-16 data in these datatypes in the same way that you store it in the wchar_t buffer in Visual C/C++ and string buffer in Visual Basic. You can avoid buffer overflow in your client applications because the length of the wchar_t and string datatypes match the length of the SQL NCHAR datatypes in the database.
Unicode Solution with a Unicode Database
An American company running a Java application would like to add German and French support in the next release of their application, and add Japanese support at a later time. The company currently has the following system configuration:
The database is very large and the scheduled downtime is short. How fast the database can be migrated to support Unicode is vital. Because the database is in US7ASCII, the easiest and fastest way of enabling the database to support Unicode is to switch the database character set to UTF8 by issuing the ALTER DATABASE statement. No data conversion is required because US7ASCII is a subset of UTF8.
Because most of the code is written in Java and PL/SQL, changing the database character set to UTF8 is unlikely to break existing code. Unicode support will be automatically enabled in the application.
Because the application will support French, German, and Japanese, there will be few surrogate characters. Both AL32UTF8 and UTF8 are suitable.
Unicode Solution with Unicode Datatypes
A European company running its applications mainly on Windows platforms wants to add new Windows applications written in Visual C/C++, which will use the existing database to support Japanese and Chinese customer names. The company currently has the following system configuration:
A Japanese company wants to develop a new Java application on Oracle9i. The company projects that the application will support as many languages as possible in the long run.
In order to store documents as is, the company decided to use the BLOB datatype to store documents of multiple languages.
The company may also want to generate UTF-8 XML documents from the relational data for business-to-business data exchange.
The back-end has Windows applications written in C/C++ using ODBC to access the Oracle database.
In this case, the typical solution is to create a Unicode database using AL32UTF8 as the database character set and use the SQL NCHAR datatypes to store multilingual data. The national character set should be set to AL16UTF16. The reasons for this solution are:
When documents of different languages are stored as BLOBs, Oracle Text requires the database character set to be one of the UTF-8 character sets. Because the applications may retrieve relational data as UTF-8 XML format (where surrogate characters are stored as four bytes), AL32UTF8 should be used as the database character set to avoid redundant data conversion when UTF-8 data is retrieved or inserted.
Because applications are new and written in both Java and Windows C/C++, the company should use the SQL NCHAR datatype for its relational data as both Java and Windows support the UTF-16 character datatype and the length of a character string is always measured in the number of characters.
If most of the data is for Asian languages, AL16UTF16 should be used in conjunction with the SQL NCHAR datatypes because AL16UTF16 offers better performance and storage efficiency.
This section describes how to migrate your data to Unicode in Oracle9i. "Migrating to Use the NCHAR Datatypes" describes how to migrate non-Unicode SQL CHAR datatypes to SQL NCHAR datatypes. It also describes how to migrate pre-Oracle9i SQL NCHAR datatypes to Oracle9i SQL NCHAR datatypes.
Before you actually migrate your data to Unicode, you need to identify areas of possible data character set conversions and truncation of data. Oracle strongly recommends that you analyze your database using the Character Set Scanner Utility for possible problems before actually migrating the database.
In most cases, a full export and import is recommended to properly convert all data to a Unicode character set. It is important to be aware of data truncation issues because character datatype columns might need to be extended before importing to handle the increase of data byte size.
The steps to migrate to Unicode using a full export and import are:
If, and only if, the current database character set is US7ASCII and all the data is in the 7-bit range, you can use the ALTER DATABASE CHARACTER SET statement to expedite migration to a Unicode database character set. Note that this approach cannot be taken on EBCDIC platforms because UTFE is not a strict superset of any EBCDIC character set.
Use the following steps to migrate to Unicode using the ALTER DATABASE CHARACTER SET statement:
Another approach is to issue an ALTER DATABASE CHARACTER SET statement followed by selective imports. This methods can be used when the distributions of convertible data are known and they are stored within a small number of tables.
The steps to migrate to Unicode using selective imports are:
In Oracle9i, the SQL NCHAR datatypes are limited to the Unicode character set encoding (UTF8 and AL16UTF16) only. Any other Oracle8 Server character sets that were available under the NCHAR datatype, including Asian character sets (for example, JA16SJISFIXED), will no longer be supported.
The migration steps for existing NCHAR, NVARCHAR, and NCLOB columns through export and import are as follows:
To take advantage of the new Unicode NCHAR datatypes, you can also use the Export and Import utilities to migrate SQL CHAR columns to SQL NCHAR columns:
When you use NCHAR and NVARCHAR2 datatypes for storing multilingual data, the column limit specified for a column is always in character semantics (which is in terms of the number of Unicode code units).
When you use CHAR and VARCHAR2 datatypes for storing multilingual data, the column limit specified for each column is, by default, in number of bytes. If the database needs to support Thai, Arabic, or multibyte languages such as Chinese and Japanese, the limits for the CHAR, VARCHAR, and VARCHAR2 columns may need to be extended. This is because the number of bytes required to encode these languages in UTF8 or AL32UTF8 may be significantly larger than those for English and Western European languages. For example, one Thai character in the Thai character set requires 3 bytes in UTF8 or AL32UTF8. In addition, the maximum limits for CHAR, VARCHAR, and VARCHAR2 datatypes are 2000 bytes, 4000 bytes, and 4000 bytes respectively. If applications need to store more than 4000 bytes, you should use the CLOB datatype for the data.
Storing Data of Multiple Languages
The Unicode character set includes characters of most written languages around the world, but it does not tell you the language to which a given character belongs. In other words, a character such as ä does not contain information about whether it is a French or German character. In order to provide information in the language a user desires, data stored in a Unicode database should accompany the language information to which the data belongs.
There are many ways for a database schema to relate data to a language. Here is one example.
Store Language Information with the Data
For data such as product descriptions or product names, you can add a language column (language_id) of CHAR or VARCHAR2 datatype to the product table to identify the language of the corresponding product information. This enables accessing applications to retrieve the information in the desired language. The possible values for this language column are the 3-letter abbreviations of the valid NLS_LANGUAGE values of the database.
You can also create a view to select the data of the current language. For example:
CREATE OR REPLACE VIEW product AS
Select Translated Data Using Fine-Grained Access Control
Fine-grained access control allows you to limit the degree to which a user can view information in a table or view. Typically, this is done by appending a WHERE clause. Once you add a WHERE clause as a fine-grained access policy to a table or view, Oracle9i automatically appends the WHERE clause to any SQL statements on the table at run time so that only those rows satisfying the WHERE clause can be accessed.
You can use this feature to avoid specifying the desired language of an user in the WHERE clause in each and every SELECT statement in your applications. The following WHERE clause limits the view of a table to the rows corresponding to the desired language of a user:
WHERE language_id = sys_context('userenv', 'LANG')
When you specify this WHERE clause as a fine-grained access policy for your product_table as follows:
DBMS_RLS.ADD_POLICY ('scott', 'product_table', 'lang_policy', 'scott',
Then any SELECT statement on the table product_table will automatically append the WHERE clause.
You can store documents in multiple languages in CLOB, NCLOB or BLOB and set up Oracle Text to enable content search for the documents.
Data in CLOB columns is always stored as UTF-16 internally when the database character set is of varying width, such as UTF8 or AL32UTF8. Document contents are converted to UTF-16 when they are inserted into a CLOB column. This means that the storage space required for an English document doubles when the data is converted. Storage for an Asian language document, such as Japanese, in a CLOB column requires less storage space than the same document in a LONG column using UTF8 (typically around 30% less, depending on the contents of the document).
Documents in NCLOB are also stored as UTF-16 regardless of the database character set or national character set. The storage space requirement is the same as in CLOB. Document contents are converted to UTF-16 when they are inserted into a NCLOB column. If you want to store multilingual documents in a non-Unicode database, you should choose NCLOB. However, content search on NCLOB is not yet supported.
Documents in BLOB format are stored as they are. No data conversion occurs during insert and retrieval. However, SQL string manipulation functions (such as LENGTH or SUBSTR) and collation functions (such as NLS_SORT and ORDER BY) are not applicable to the BLOB datatype.
You can also create a view to select the data of the current language. For example:
CREATE OR REPLACE VIEW product AS
Select Translated Data Using Fine-Grained Access Control
Fine-grained access control allows you to limit the degree to which a user can view information in a table or view. Typically, this is done by appending a WHERE clause. Once you add a WHERE clause as a fine-grained access policy to a table or view, Oracle9i automatically appends the WHERE clause to any SQL statements on the table at run time so that only those rows satisfying the WHERE clause can be accessed.
You can use this feature to avoid specifying the desired language of an user in the WHERE clause in each and every SELECT statement in your applications. The following WHERE clause limits the view of a table to the rows corresponding to the desired language of a user:
WHERE language_id = sys_context('userenv', 'LANG')
When you specify this WHERE clause as a fine-grained access policy for your product_table as follows:
DBMS_RLS.ADD_POLICY ('scott', 'product_table', 'lang_policy', 'scott',
Then any SELECT statement on the table product_table will automatically append the WHERE clause.
Data in CLOB columns is always stored as UTF-16 internally when the database character set is of varying width, such as UTF8 or AL32UTF8. Document contents are converted to UTF-16 when they are inserted into a CLOB column. This means that the storage space required for an English document doubles when the data is converted. Storage for an Asian language document, such as Japanese, in a CLOB column requires less storage space than the same document in a LONG column using UTF8 (typically around 30% less, depending on the contents of the document).
Documents in NCLOB are also stored as UTF-16 regardless of the database character set or national character set. The storage space requirement is the same as in CLOB. Document contents are converted to UTF-16 when they are inserted into a NCLOB column. If you want to store multilingual documents in a non-Unicode database, you should choose NCLOB. However, content search on NCLOB is not yet supported.
Documents in BLOB format are stored as they are. No data conversion occurs during insert and retrieval. However, SQL string manipulation functions (such as LENGTH or SUBSTR) and collation functions (such as NLS_SORT and ORDER BY) are not applicable to the BLOB datatype.
ctx_ddl.create_preference('english_lexer', 'basic_lexer');
Once the language-specific lexer preferences are created, they need to be gathered together under a single multi-lexer preference. First, create the multi-lexer preference, using the MULTI_LEXER object:
ctx_ddl.create_preference('global_lexer','multi_lexer');
Now we must add the language-specific lexers to the multi-lexer preference using the add_sub_lexer call:
ctx_ddl.add_sub_lexer('global_lexer', 'german', 'german_lexer');
This nominates the german_lexer preference to handle German documents, the japanese_lexer preference to handle French documents, and the english_lexer preference to handle everything else, using DEFAULT as the language.
Building Indexes for Documents Stored as CLOBs
The multi-lexer decides which lexer to use for each row based on a language column. This is a character column in the table which stores the language of the document in the text column. You should use the Oracle language name to identify the language of a document in this column. For instance, if you use CLOBs to store your documents, then you must add the language column to the table where the documents are stored:
CREATE INDEX globalx ON globaldoc(text)
When you create the index, specify the names of the format and character set columns:
More Tutorials on Oracle dba ... Want to share or request Oracle Tutorial articles to become a Oracle DBA. Direct your requests to webmaster@oracleonline.info |
|