free Oracle DBA tutorial Oracle Jobs
Ask A Question
SQL Statement Tuning
Backup and Recovery Concepts
Oracle 11g New Features
Oracle E Suite & Others
Oracle Data Guard
Oracle DBA FAQ

Choosing a Character Set

Previous Chapter | Next Chapter

Character Set Encoding
When computer systems process characters, they use numeric codes instead of the graphical representation of the character. For example, when the database stores the letter A, it actually stores a numeric code that is interpreted by software as that letter. These numeric codes are important in all databases. They are especially important when working in a global environment because of the need to convert between different character sets.

What is an Encoded Character Set?
An encoded character set is specified when you create a database. The choice of character set determines what languages can be represented in the database. This choice influences how you create the database schema and develop applications that process character data. It also influences interoperability with operating system resources and database performance.

A group of characters (for example, alphabetic characters, ideographs, symbols, punctuation marks, and control characters) can be encoded as an encoded character set. An encoded character set assigns unique numeric codes to each character in the character repertoire. Table 2-1 shows examples of characters that are assigned a numeric code value.

Table 2-1 Encoded Characters in the ASCII Character Set

Character Description Code Value
! Exclamation Mark 21
# Number Sign 23
$ Dollar Sign 24
1 Number 1 31
2 Number 2 32
3 Number 3 33
A Uppercase A 41
B Uppercase B 42
C Uppercase C 43
a Lowercase a 61
b Lowercase b 62
c Lowercase c 63

There are many different coded character sets used throughout the computer industry. Oracle supports most national, international, and vendor-specific encoded character set standards. The complete list of character sets supported by Oracle is listed in Appendix A, "Locale Data". Character sets differ in the following ways:

  • The number of characters available
  • The characters available (the character repertoire)
  • The writing scripts and the languages represented
  • The code values assigned to each character
  • The encoding scheme used to represent a character

Which Characters to Encode?

When you choose a character set, first decide what languages you wish to store in the database. The characters that are encoded in a character set depend on the writing systems that are represented.

Writing Systems
A writing system can be used to represent a language or group of languages. For the purposes of this book, writing systems can be classified into two categories: phonetic and ideographic.

Phonetic Writing Systems
Phonetic writing systems consist of symbols that represent different sounds associated with a language. Greek, Latin, Cyrillic, and Devanagari are all examples of phonetic writing systems based on alphabets. Note that alphabets can represent more than one language. For example, the Latin alphabet can represent many Western European languages such as French, German, and English.

Characters associated with a phonetic writing system (alphabet) can typically be encoded in one byte because the character repertoire is usually smaller than 256 characters.

Ideographic Writing Systems
Ideographic writing systems consist of ideographs or pictographs that represent the meaning of a word, not the sounds of a language. Chinese and Japanese are examples of ideographic writing systems that are based on tens of thousands of ideographs. Languages that use ideographic writing systems may use a syllabary as well. Syllabaries provide a mechanism for communicating phonetic information along with the pictographs when necessary. For instance, Japanese has two syllabaries: Hiragana, normally used for grammatical elements, and Katakana, normally used for foreign and onomatopoeic words.

Characters associated with an ideographic writing system typically must be encoded in more than one byte because the character repertoire has tens of thousands of characters.

Punctuation, Control Characters, Numbers, and Symbols
In addition to encoding the script of a language, other special characters, such as punctuation marks, need to be encoded such as punctuation marks (for example, commas, periods, and apostrophes), numbers (for example, Arabic digits 0-9), special symbols (for example, currency symbols and math operators) and control characters for computers (for example, carriage returns, tabs, and NULL).

Writing Direction
Most Western languages are written left to right from the top to the bottom of the page. East Asian languages are usually written top to bottom from the right to the left of the page, though exceptions are frequently made for technical books translated from Western languages. Arabic and Hebrew are written right to left from the top to the bottom.

Another consideration is that numbers reverse direction in Arabic and Hebrew. So even though the text is written right to left, numbers within the sentence are written left to right. For example, "I wrote 32 books" would be written as "skoob 32 etorw I". Regardless of the writing direction, Oracle stores the data in logical order. Logical order means the order used by someone typing a language, not how it looks on the screen.

How are Characters Encoded?

Different types of encoding schemes have been created by the computer industry. The character set you choose affects what kind of encoding scheme will be used. This is important because different encoding schemes have different performance characteristics, and these characteristics can influence your database schema and application development requirements. The character set you choose will typically use one of the following types of encoding schemes:

  • Single-Byte Encoding Schemes
    • 7-Bit Encoding Schemes
    • 8-Bit Encoding Schemes
  • Multibyte Encoding Schemes
    • Fixed-Width Multibyte Encoding Schemes
    • Variable-Width Multibyte Encoding Schemes
Single-Byte Encoding Schemes
Single byte encoding schemes are the most efficient encoding schemes available. They take up the least amount of space to represent characters and are easy to process and program with because one character can be represented in one byte.

7-Bit Encoding Schemes

Single-byte 7-bit encoding schemes can define up to 128 characters and normally support just one language. One of the most common single-byte character sets, used since the early days of computing, is ASCII (American Standard Code for Information Interchange).

8-Bit Encoding Schemes

Single-byte 8-bit encoding schemes can define up to 256 characters and often support a group of related languages

Multibyte Encoding Schemes
Multibyte encoding schemes are needed to support ideographic scripts used in Asian languages like Chinese or Japanese because these languages use thousands of characters. These schemes use either a fixed number of bytes to represent a character or a variable number of bytes per character.

Fixed-Width Multibyte Encoding Schemes

In a fixed-width multibyte encoding scheme, each character is represented by a fixed number of n bytes, where n is greater than or equal to two.

Variable-Width Multibyte Encoding Schemes

A variable-width encoding scheme uses one or more bytes to represent a single character. Some multibyte encoding schemes use certain bits to indicate the number of bytes that will represent a character. For example, if two bytes is the maximum number of bytes used to represent a character, the most significant bit can be toggled to indicate whether that byte is a single-byte character or the first byte of a double-byte character. In other schemes, control codes differentiate single-byte from double-byte characters. Another possibility is that a shift-out code is used to indicate that the subsequent bytes are double-byte characters until a shift-in code is encountered.

Oracle's Naming Convention for Character Sets

Oracle uses the following naming convention for character set names:

<#_of_bits_representing_a_character>[S | C]

Note that UTF8 and UTFE are exceptions to this naming convention.

Some examples are:

  • US7ASCII is the U.S. 7-bit ASCII character set
  • WE8ISO8859P1 is the Western European 8-bit ISO 8859 Part 1 character set
  • JA16SJIS is the Japanese 16-bit Shifted Japanese Industrial Standard character set
  • The optional "S" or "C" at the end of the character set name is used to differentiate character sets that can be used only on the server (S) or only on the client (C).
  • On Macintosh platforms, the server character set should always be used. The Macintosh client character sets are obsolete. On EBCDIC platforms, if available, the "S" version should be used on the server and the "C" version on the client.

Previous Chapter | Next Chapter

More Tutorials on Oracle dba ...

Source :Oracle Documentation

Liked it ? Want to share it ? Social Bookmarking

Add to: Mr. Wong Add to: BoniTrust Add to: Newsider Add to: Digg Add to: Del.icio.us Add to: Reddit Add to: Jumptags Add to: StumbleUpon Add to: Slashdot Add to: Netscape Add to: Furl Add to: Yahoo Add to: Spurl Add to: Google Add to: Blinklist Add to: Technorati Add to: Newsvine Information

Want to share or request Oracle Tutorial articles to become a Oracle DBA. Direct your requests to webmaster@oracleonline.info