Unicode - Overview
Character Sets
Application Server ABAP supports both Unicode and non-Unicode systems.
Non-Unicode systems are ABAP systems in
which one character is usually represented by one byte. Unicode systems
are ABAP systems that are based on a Unicode character format and an
appropriate operating system and database.
Before Unicode, SAP used various codes to represent characters in
different scripts, such as ASCII or EBCDIC as single-byte code pages,
or double-byte code pages:
ASCII (American Standard Code for Information Interchange)
encodes every character with one byte. This means that a maximum of 256
characters can be represented. (Strictly speaking, standard ASCII
uses only 7 bits per character and can therefore represent only
128 characters; the extension to 8 bits was introduced with
ISO-8859.) Examples of common code pages are ISO-8859-1 for Western
European scripts and ISO-8859-5 for Cyrillic scripts.
EBCDIC (Extended Binary Coded Decimal Interchange Code) also
encodes each character using one byte, and can therefore also represent
256 characters. For example, EBCDIC 0697/0500 is an IBM
format that has been used on the AS/400 platform (now known as
IBM System i) for Western European scripts.
Double-byte code pages require between 1 and 2 bytes per
character. This enables the representation of up to 65,536 characters,
of which only 10,000 to 15,000 characters are normally used. For
example, the code page SJIS is used for Japanese and BIG5 for
traditional Chinese scripts.
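The byte counts described for these code pages can be checked with a
short Python sketch. It is an illustration only and not ABAP: the codec
names from the Python standard library (iso8859_1, iso8859_5, cp500,
shift_jis, big5) are used as stand-ins for the code pages named in the
text, and treating cp500 as equivalent to EBCDIC 0697/0500 is an
assumption made for this example.

```python
# Illustration only: Python codecs stand in for the code pages named in the text.

def byte_lengths(text: str, encoding: str) -> list:
    """Return the number of bytes each character of `text` needs in `encoding`."""
    return [len(ch.encode(encoding)) for ch in text]

# Single-byte code pages: every character takes exactly one byte, but the
# byte value for the same character can differ between code pages.
ascii_a = "A".encode("ascii")    # 0x41 in ASCII
ebcdic_a = "A".encode("cp500")   # 0xC1 in EBCDIC code page 500

# The same byte value stands for different characters in different code pages.
raw = bytes([0xE4])
western = raw.decode("iso8859_1")   # Latin small letter a with diaeresis
cyrillic = raw.decode("iso8859_5")  # Cyrillic small letter ef

# Double-byte code pages: one byte for ASCII characters, two bytes for the
# Japanese and Chinese characters used here.
sjis_lengths = byte_lengths("A日本", "shift_jis")
big5_lengths = byte_lengths("A中", "big5")

print(ascii_a.hex(), ebcdic_a.hex())
print(western, cyrillic)
print(sjis_lengths, big5_lengths)
```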
Using these character sets, all languages can be handled individually in
one AS ABAP. Difficulties arise if
texts from different incompatible character sets are mixed in one
central system. The exchange of data between systems with incompatible
character sets can also lead to problems.
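The exchange problem can be reproduced in a few lines of Python. This is
a sketch under stated assumptions, not SAP code: the two "systems" are
simulated by encoding on one side and decoding on the other, and the
helper names transfer and can_store are hypothetical.

```python
# Illustration only: two "systems" that use incompatible single-byte code pages.

def transfer(text: str, sender_encoding: str, receiver_encoding: str) -> str:
    """Encode text on the sender, then decode the raw bytes on the receiver."""
    return text.encode(sender_encoding).decode(receiver_encoding)

def can_store(text: str, encoding: str) -> bool:
    """Report whether every character of `text` exists in the given code page."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

russian = "Привет"

# Cyrillic sender, Western European receiver: the bytes arrive unchanged,
# but the receiver interprets them as different characters.
garbled = transfer(russian, "iso8859_5", "iso8859_1")

# Same code page on both sides: the text survives.
intact = transfer(russian, "iso8859_5", "iso8859_5")

print(garbled)                                      # not the original text
print(intact == russian)
print(can_store(russian, "iso8859_1"))              # Cyrillic in a Western page
print(can_store("Grüße " + russian, "iso8859_5"))   # mixed text in a Cyrillic page
```

Neither code page can hold the mixed German and Russian text, which is
the situation that arises when texts from incompatible character sets
meet in one central system.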
The solution to this problem is the use of a character set that includes
all characters at once. This is realized by
Unicode (ISO/IEC 10646) with the character set
UCS. Various Unicode character formats are possible for the Unicode
character set, for example a UTF format such as UTF-8, in which a
character can occupy between one and four bytes, or UCS-2,
in which a character always occupies two bytes. The system code page of
a Unicode system is UTF-16, and the programming
language supports the character format UCS-2.
UCS-2 mostly matches the UTF-16 format and includes all its
characters except for those from the
surrogate area. The restriction to UCS-2 means that a character
is always assumed to have a length of two bytes. This generally
produces problems only if character strings are truncated in the middle
of a character representation or if individual characters from sets of
characters are compared in character string processing.
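The difference between the character formats, and the truncation problem
in the surrogate area, can be sketched in Python. The example is an
illustration only: Python's utf-8 and utf-16-le codecs stand in for the
character formats discussed above, and the helper names are hypothetical.

```python
# Illustration only: bytes per character in UTF-8 and 2-byte units in UTF-16.

def utf8_length(ch: str) -> int:
    """Number of bytes the character occupies in UTF-8."""
    return len(ch.encode("utf-8"))

def utf16_code_units(text: str) -> int:
    """Number of 2-byte units, which a UCS-2 view counts as 'characters'."""
    return len(text.encode("utf-16-le")) // 2

def truncate_code_units(text: str, units: int) -> bytes:
    """Cut the UTF-16 representation after `units` 2-byte units."""
    return text.encode("utf-16-le")[: units * 2]

# The last sample lies outside the Basic Multilingual Plane, so UTF-16
# needs a surrogate pair (two 2-byte units) for it.
samples = ["A", "ä", "€", "\U0001F600"]
utf8_sizes = [utf8_length(ch) for ch in samples]
utf16_units = [utf16_code_units(ch) for ch in samples]

# Truncating after two units keeps 'a' and only half of the surrogate pair.
text = "a\U0001F600"
cut = truncate_code_units(text, 2)
try:
    cut.decode("utf-16-le")
    truncated_ok = True
except UnicodeDecodeError:
    truncated_ok = False

print(utf8_sizes)
print(utf16_units)
print(truncated_ok)
```

For text that stays within the two-byte range, the unit count and the
character count agree, which is why the UCS-2 assumption usually goes
unnoticed.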
Using Unicode offers the following benefits:
A single Unicode system can cover all business processes in all
countries concerned.
Data can be transferred between different Unicode systems without loss
of information.
Unicode systems can show more than one language at once on a single user
interface.
Switching to Unicode
Before Unicode support was introduced, many ABAP programmers assumed
that one character corresponded to one byte. Before a non-Unicode system
is converted to Unicode, therefore, ABAP programs must be changed
wherever an explicit or implicit assumption is made about the internal
length of a character. This mainly affects character and byte string
processing and access to structures. The latter case applies because
flat structures in non-Unicode
programs are handled like character-like data objects, and some
programming techniques exploit this fact. The
Unicode fragment view was
introduced to enable the handling of structures.
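Why byte-oriented access to a structure breaks can be shown with a small
Python analogy. This is not ABAP and does not model the Unicode fragment
view; the record layout (a 4-character field followed by a 4-byte
integer), the constant NAME_WIDTH, and the helper names are hypothetical
and chosen only for illustration.

```python
import struct

# Illustration only: a flat record with a 4-character field followed by a
# 4-byte integer, laid out once with 1 byte and once with 2 bytes per character.

NAME_WIDTH = 4  # field width in characters

def pack_record(name: str, count: int, encoding: str) -> bytes:
    """Lay out the character field in `encoding`, followed by a little-endian int."""
    field = name.ljust(NAME_WIDTH)[:NAME_WIDTH].encode(encoding)
    return field + struct.pack("<i", count)

def read_count_at(record: bytes, byte_offset: int) -> int:
    """Read the integer at a fixed byte offset, as byte-oriented code would."""
    return struct.unpack_from("<i", record, byte_offset)[0]

single_byte = pack_record("ABCD", 7, "iso8859_1")
two_byte = pack_record("ABCD", 7, "utf-16-le")

# Code written for the single-byte layout assumes the integer starts at byte 4.
print(len(single_byte), read_count_at(single_byte, 4))
# In the 2-byte layout, byte 4 still belongs to the character field.
print(len(two_byte), read_count_at(two_byte, 4))
# The integer now starts after NAME_WIDTH * 2 bytes.
print(read_count_at(two_byte, NAME_WIDTH * 2))
```

Any technique that addresses structure components by a hard-coded byte
offset therefore depends on the number of bytes per character.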
ABAP supports this conversion using new syntax rules and new language
constructs, whereby emphasis was placed on retaining as much of the
existing source code as possible. As a preparation for the conversion to
Unicode - but also independently of whether a system will actually be
converted to Unicode - the checkbox Unicode checks active can be
selected in the program properties. The transaction
UCCHECK supports the activation of this check for existing
programs. If this property is set, the program is identified as a
Unicode program. In a Unicode program, an additional stricter syntax
check is performed than in non-Unicode programs. In some cases,
statements must also be enhanced by using new additions. A syntactically
correct Unicode program will normally run with the same semantics and
the same results in Unicode and non-Unicode systems. (Exceptions to this
rule are low-level programs that query and evaluate the number of bytes
per character.) Programs that are required to run in both systems should
therefore also be tested on both platforms.
In a Unicode system, only Unicode programs can be executed. Before
converting to a Unicode system, the profile parameter
abap/unicode_check should be set to
"ON" so that only the execution of Unicode programs is
permitted. Non-Unicode programs can only be executed in non-Unicode
systems. All language constructs that have been introduced for Unicode
programs can, however, also be used in non-Unicode programs.
Experience has shown that existing programs written without errors
mostly fulfill the new Unicode rules and therefore
require very little modification. Conversely, most programs that require
significant changes were written in an error-prone and therefore
questionable programming style. Even if no conversion to a Unicode
system is planned, Unicode programs are preferable because they are
easier to maintain and less prone to errors. Just as outdated and
dangerous language constructs are
declared obsolete and are no longer permitted for use in ABAP
Objects, the rules for Unicode programs also offer increased security
when programming, for example when working with character fields and
mixed structures. This applies particularly for the storage of external
data (for example using the file interface), which has been completely
reviewed for use in Unicode programs. When creating a new program, SAP
therefore recommends that the program is always identified as a Unicode
program; older programs can be converted to Unicode in stages.
Notes
The program RSUNISCAN_FINAL can be
used instead of transaction UCCHECK.
A Unicode program with correct syntax must be checked for functional
correctness in Unicode systems (and possibly also in non-Unicode
systems). This can be done using unit tests
from ABAP Unit and associated test
coverage checks from Coverage Analyzer.
Documentation extract taken from SAP system, © Copyright SAP AG. All rights reserved