Choosing and Migrating Character Sets
Previous Chapter | Next Chapter
Choosing the appropriate database character set for your database is an important decision and requires taking into account many factors. Some of these factors are:
A related topic is choosing a new character set for an existing database, which is called migrating character sets. Migrating from one database character set to another involves additional considerations beyond those of simply choosing a character set. In particular, it is a complex planning process with the goal of minimizing the possibility of losing data because of data truncation and character set conversions during the migration.
- The type of data Oracle DBA need to store
- The number of languages the database character set can represent
- The different sizing requirements of each character set and their performance implications
When the database is created using byte semantics, the sizes of character datatypes CHAR and VARCHAR2 are specified in bytes, not characters. Hence, the specification CHAR(20) in a table definition allows 20 bytes for storing character data. This is acceptable when the database character set uses a single-byte character encoding scheme because the number of characters will be equivalent to the number of bytes. If the database character set uses a multibyte character encoding scheme, however, there is no such correspondence. That is, the number of bytes no longer equals the number of characters because a character can consist of one or more bytes. This situation can cause problems.
During migration to a new character set, it is important to verify the column widths of existing CHAR and VARCHAR columns because they might need to be extended to support encoding that requires multibyte storage. If the character set width differs during the import process, truncation of data can occur if conversion causes expansion of data. In it, ä (a with an umlaut) is a single-byte character in WE8MSWIN1252, but it becomes a double-byte character in UTF8. Also, the Euro symbol goes from one byte to three bytes in this conversion.
The maximum number of bytes for CHAR and VARCHAR2 data types are 2000 and 4000 respectively. If the data columns in the new destination character set require more than 2000 and 4000 bytes, Oracle DBA need to change your schema.
The following are some known restrictions caused by data truncation:
Within the database data dictionary, schema object names cannot exceed 30 bytes in length. Schema objects are tables, clusters, views, indexes, synonyms, tablespaces, and usernames. Renaming schema objects is required if they exceed 30 bytes in the new database character set. For example, one Thai character in the Thai national character set requires 1 byte, but, in UTF8, it requires 3 bytes. So, if Oracle DBA have defined a table with 11 Thai characters, then this table name must be shortened to 10 or fewer Thai characters when changing the database character set to UTF8.
If your existing Oracle usernames or passwords are created based on characters that will change in size in the target character set, these users will experience login difficulties due to authentication failures after the migration to a new character set. This is because the encrypted usernames and passwords stored in the data dictionary are not updated during migration to a new character set. For example, assuming the current database character set is WE8MSWIN1252 and the target database character set is UTF8, the username scött (o with an
When CHAR data contains characters that will be expanded after migration to a new character set, space padding will not be removed during database export by default. This means that these rows will be rejected upon import into the database with the new character set. The workaround is to set the BLANK_TRIMMING initialization parameter to TRUE prior to the import.
Character Set Conversions
When migrating to a new database character set, the Export and Import utilities can handle character set conversions from the original database character set to the new database character set. However, character set conversions can sometimes cause data loss or data corruption. For example, if Oracle DBA are migrating from character set A to character set B, the destination character set B should be a superset of character set A. Characters that are not available in character set B will be converted to replacement characters, which are usually specified as ? or ¿ or other linguistically-related characters. For example, ä (a with an umlaut) can be converted to a. Replacement characters are defined by the target character set. Figure 10-2 shows a sample conversion where the copyright and Euro symbols are converted to ? and ä to a.
To reduce the risk of losing data, choose a destination character set with similar character repertoires, if possible. Migrating to Unicode can be an attractive option because UTF8 contains characters from most legacy character sets.
Another scenario that can cause the loss of data is migrating a database containing data of a different character set from that of the database character set. Users can insert data into the database from another character set if the client NLS_LANG character set setting is the same as the database character set. When these settings are the same, Oracle assumes that the data being sent or received is from the same character set, so no validations or conversions are performed.
This can lead to two possible data inconsistency problems. One problem occurs when a database contains data from another character set but the same codepoints exist in both character sets. For example, if the database character set is WE8ISO8859P1 and the end user Chinese Windows NT client's NLS_LANG setting is SIMPLIFIED CHINESE_CHINA.WE8ISO8859P1, then all multibyte Chinese data (from the ZHS16GBK character set) is stored as multiples of single-byte WE8ISO8859P1 data. This means that Oracle will treat these characters as single-byte WE8ISO8859P1 characters. Hence all SQL string manipulation functions such as SUBSTR or LENGTH will be based on bytes rather than characters. All bytes constituting ZHS16GBK data are legal WE8ISO8859P1 codes. If such a database is migrated to another character set, for example, UTF8, character codes will be converted as if they were in WE8ISO8859P1. This way, each of the two bytes of a ZHS16GBK character will be converted separately, yielding meaningless values in UTF8.
The second possibility is having data from mixed character sets inside the database. For example, if the data character set is WE8MSWIN1252, and two separate Windows clients using German and Greek are both using the NLS_LANG character set setting as WE8MSWIN1252, then the database will contain a mixture of German and Greek characters. Figure 10-4 shows how different clients can use different character sets in the same database.
Database Character Set Migration
Database character set migration has two distinct stages:
- Data Scanning
- Conversion of Data
Before Oracle DBA actually migrate your character set, you need to identify areas of possible database character set conversions and truncation of data. This step is called data scanning.
Data scanning identifies the amount of effort required to migrate data into the new character encoding scheme prior to the change of the database character set. Some examples of what are found during a data scan are the number of schema objects where the column widths need to be expanded and the extent of the data that does not exist in the target repertoire. This information will assist in determining the best approach for the conversion of the database character set.
Conversion of Data
There are generally three approaches in migrating data from one database character set to another.
In most cases, a full export or import is recommended to properly convert all data to a new character set. It is important to be aware of data truncation issues because character data type columns might need to be extended prior to import to handle the increase in size required. Existing PL/SQL code should be reviewed to ensure all byte-based SQL functions such as LENGTHB, SUBSTRB, and INSTRB, and PL/SQL CHAR and VARCHAR2 declarations are still valid. However, if, and only if, the new character set is a strict superset of the current character set, you can use the ALTER DATABASE CHARACTER SET statement to expedite migration to a new database character set. The target character set is a strict superset if, and only if, each and every character in the source character set is available in the target character set with the same corresponding codepoint value. For instance, because US7ASCII is a strict subset of UTF8, then an ALTER DATABASE CHARACTER SET statement can be used to upgrade the database character set from US7ASCII to UTF8.
ALTER DATABASE CHARACTER SET Statement Restrictions
In Oracle9i, CLOB data is stored as UCS-2 (2-byte fixed-width Unicode) for multibyte database character sets. For single-byte database character sets, CLOB data is stored as the database character set. Because the ALTER DATABASE CHARACTER SET statement does not perform any data conversion, if the database character set is migrated from single-byte to multiple byte using ADCS, then CLOB columns will remain in the original database character set encoding. This introduces data inconsistency in the CLOB columns. Likewise, if you migrate from one Unicode national character set to another, the SQL NCHAR datatype columns will be corrupted.
The migration procedure for CLOB and the SQL NCHAR datatype columns is:
Export the tables containing CLOB and SQL NCHAR columns.
Drop the tables containing CLOB and SQL NCHAR columns.
Use the ALTER DATABASE CHARACTER SET and ALTER DATABASE NATIONAL CHARACTER SET statements.
The syntax is:
ALTER DATABASE [db_name] CHARACTER SET new_character_set;
ALTER DATABASE [db_name] NATIONAL CHARACTER SET new_NCHAR_character_set;
The database name is optional. The character set name should be specified without quotes. For example:
ALTER DATABASE CHARACTER SET UTF8;
To change the database character set, perform the following steps:
Shut down the database, using either a SHUTDOWN IMMEDIATE or a SHUTDOWN NORMAL statement.
Do a full backup.
Complete the following statements:
ALTER SYSTEM ENABLE RESTRICTED SESSION;
ALTER SYSTEM SET JOB_QUEUE_PROCESSES=0;
ALTER SYSTEM SET AQ_TM_PROCESSES=0;
ALTER DATABASE OPEN;
ALTER DATABASE CHARACTER SET ;
SHUTDOWN IMMEDIATE; -- or NORMAL
To change the national character set, replace the ALTER DATABASE CHARACTER SET statement with the ALTER DATABASE NATIONAL CHARACTER SET statement. You can issue both statements together if desired.
When using Oracle9i Real Application Clusters, ensure that no other Oracle background processes are running, with the exception of the one session through which a user is connected, before attempting to issue the ALTER DATABASE CHARACTER SET statement. Use the following SQL statement to verify your environment:
SELECT SID, SERIAL#, PROGRAM FROM V$SESSION;
Setting the initialization parameter CLUSTER_DATABASE to FALSE allows the character set change to go through. This is required in an Oracle9i Real Application Cluster environment; an exclusive startup is not sufficient.
The last approach is to perform an ALTER DATABASE CHARACTER SET statement followed by selective imports. This method is best suited for a known distribution of convertible data that is stored within a small number of tables. A full export and import will be too expensive in this scenario. For example, suppose you have a 100GB database with over 300 tables, but only 3 tables requires character set conversions. The rest of the data is of the same encoding as the destination character set. The 3 tables can be exported and imported back to the new database after issuing the ALTER DATABASE CHARACTER SET statement.
Incorrect data conversion can lead to data corruption, so perform a full backup of the database before attempting to migrate the data to a new character set.
Previous Chapter | Next Chapter
More Tutorials on Oracle dba ...
Source :Oracle Documentation
Liked it ? Want to share it ? Social Bookmarking
Want to share or request Oracle Tutorial articles to become a Oracle DBA. Direct your requests