UTF-8 Encoding

Basics

Applications in CODESYS can process a wide variety of characters, for example to output an error message in various languages, or to display a visualization in a language selected by the user that accepts user input in a wide variety of languages, characters, or symbols.

If a comprehensive character set is not necessary, or if a project should not be changed, then strings encoded in Latin-1 format can still be used.

Table 3. Character set tables

Character Set: ASCII
Code Page Number: 20127
Description:
  • 128 characters
  • Suitable for English texts
Character Encoding: 7-bit encoded characters

Character Set: DOS-Latin-1
Code Page Number: 819, 850
Description:
  • Complies with ISO 8859
  • Suitable for Western European languages in the Windows command line window
Character Encoding: 8-bit encoded characters

Character Set: Latin-1
Code Page Number: 28591
Description:
  • Complies with ISO 8859-1
  • Often used for HTML pages; contains äöüß, but not € or, for example, some special French characters
Character Encoding: 8-bit encoded characters

Character Set: Windows 1252 Encoding
Code Page Number: 1252
Description:
  • Default Windows character set for Western European countries
  • Windows uses the UTF-16 format internally
  • Contains all characters from ISO 8859-1 and ISO 8859-15, but partly with different encoding
Character Encoding: 8-bit encoded characters

Character Set: Unicode
Code Page Number: (none)
Description:
  • Universal character set for all possible languages, including historical languages, Braille, music, or emojis
  • More than 100,000 characters can be displayed.
  • Each character has a numeric code.
  • In contrast to ASCII, a separation is made between the assignment of code points to characters and the encoding of the characters.
  • Numeric codes < 128 are ASCII compatible
  • Numeric codes < 256 are ISO 8859-1 compatible
  For more information, see: https://home.unicode.org/
Character Encoding: Unicode 14.0: 144,697 characters

Character Set: UTF-16
Code Page Number: 1200
Description:
  • Encoding format of Unicode characters based on 16-bit units
  • Used in some operating systems (Windows, OS X) and programming languages (Java, .NET) for internal character representation
  • Note that different computer architectures store the bytes of each 16-bit unit in a different order (little-endian byte order for UTF-16LE).
Character Encoding: 16-bit encoded characters. Each character is encoded in either 2 bytes or 4 bytes.

Character Set: UTF-8
Code Page Number: 65001
Description:
  • Byte-oriented encoding format of Unicode characters
  • Most widespread
  • Used in GNU/Linux and Unix operating systems, and in various Internet services (email, web, browser)
  • Compatible with ASCII characters in the first 128 characters (0–127)
Character Encoding: Tuple of 8-bit words per character. Each character is encoded in 1 to 4 bytes.
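The encoded sizes listed in the table can be checked outside CODESYS. The following is a minimal sketch in Python, used here only to make the byte-level behavior visible; it is not CODESYS code. The sample characters and the helper name `encoded_lengths` are chosen for illustration and do not come from the CODESYS documentation.

```python
# Illustrative samples (an assumption, not from the CODESYS help):
# one character for each UTF-8 length class.
samples = ["A", "\u00fc", "\u20ac", "\U0001F600"]  # A, ü, €, 😀


def encoded_lengths(ch):
    """Return the byte count of ch in UTF-8 and in UTF-16 (little endian)."""
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-le"))


for ch in samples:
    utf8_len, utf16_len = encoded_lengths(ch)
    print(f"U+{ord(ch):04X}: UTF-8 {utf8_len} byte(s), UTF-16 {utf16_len} byte(s)")

# Byte order changes the order of the bytes within each 16-bit unit.
little = "\u00fc".encode("utf-16-le")
big = "\u00fc".encode("utf-16-be")

# The euro sign exists in Windows 1252 but not in Latin-1 (ISO 8859-1).
euro_cp1252 = "\u20ac".encode("cp1252")
try:
    "\u20ac".encode("latin-1")
    euro_in_latin1 = True
except UnicodeEncodeError:
    euro_in_latin1 = False
```

The ASCII character needs one byte in UTF-8, while the emoji needs four bytes in both UTF-8 and UTF-16.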



UTF-8 in CODESYS

Tip

UTF-8 encoding covers the complete Unicode character set, which makes it the most comprehensive encoding available for the STRING data type. Therefore, it is recommended that you enable UTF-8 encoding for new projects as well as for existing projects that will be used in a new context.

Table 4. Project-wide encoding in CODESYS

Data Type    Compile Option: UTF8 Encoding for STRING    Which encoding is used project-wide?

STRING       Enabled                                     UTF-8

STRING       Disabled                                    Windows 1252 encoding (default Windows encoding), Latin-1

WSTRING      Enabled                                     UTF-16

WSTRING      Disabled                                    UTF-16



In CODESYS, the STRING data type can be encoded in Latin-1 or UTF-8 formats. The WSTRING data type always encodes its characters as Unicode in UTF-16.

Encoding a single string literal in UTF-8 format

Even if the project-wide encoding format is set to Latin-1, you can encode a single literal in UTF-8 format. To do this, add the UTF8# type prefix to the literal.

{attribute 'monitoring_encoding' := 'UTF-8'}
strVarUtf8: STRING := UTF8#'你好,世界!ÜüÄäÖö';

For more information, see: Constant: UTF8# String; Pragma Attribute: monitoring_encoding
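To see why such a literal needs UTF-8, it helps to compare character count, byte count, and Latin-1 coverage. The following is a minimal sketch in Python, not CODESYS code. The sample text and the helper `fits_latin1` are hypothetical and chosen only for illustration.

```python
# Hypothetical sample text: two Chinese characters, a space, and two umlauts.
text = "\u4f60\u597d \u00dc\u00fc"  # 你好 Üü

utf8_bytes = text.encode("utf-8")
char_count = len(text)        # number of characters
byte_count = len(utf8_bytes)  # number of bytes the text occupies in UTF-8


def fits_latin1(s):
    """Return True if every character of s can be encoded in Latin-1."""
    try:
        s.encode("latin-1")
        return True
    except UnicodeEncodeError:
        return False


print(char_count, byte_count)
print(fits_latin1("\u00dc\u00fc"), fits_latin1(text))
```

The umlauts alone fit into Latin-1, but the Chinese characters do not. The UTF-8 form also occupies more bytes than the text has characters, which matters if the declared length of a STRING variable counts bytes (an assumption to verify for your CODESYS version).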

String conversion for UTF-8 encoding

If you have enabled UTF-8 encoding project-wide, then you can use the string conversion functions as usual.

String manipulation

Use library functions to manipulate your strings.

When a STRING variable contains only ASCII characters, manipulating it by index access often produces the desired result. Avoid this construct anyway. It is poor programming style, and with UTF-8 encoding a single character can occupy several bytes, so index access can overwrite part of a character and corrupt the string.
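The effect of index access on a multi-byte character can be reproduced on raw bytes. The following is a minimal sketch in Python that mimics byte-wise index access; it is not CODESYS code, and the helpers `overwrite_byte` and `is_valid` are hypothetical names.

```python
def overwrite_byte(text, encoding, index, new_char):
    """Encode text, overwrite the byte at index, and return the raw bytes."""
    data = bytearray(text.encode(encoding))
    data[index] = ord(new_char)
    return bytes(data)


def is_valid(raw, encoding):
    """Return True if raw decodes without error in the given encoding."""
    try:
        raw.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


# Replace the second byte of "Tür" with "u" in both encodings.
latin1_result = overwrite_byte("T\u00fcr", "latin-1", 1, "u")
utf8_result = overwrite_byte("T\u00fcr", "utf-8", 1, "u")

print(latin1_result, is_valid(latin1_result, "latin-1"))
print(utf8_result, is_valid(utf8_result, "utf-8"))
```

In Latin-1, the umlaut is one byte and the result is the intended text. In UTF-8, the umlaut is two bytes, so overwriting only the first of them leaves a stray continuation byte behind and the string is no longer valid UTF-8.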

UTF-8 encoding only for project-wide configuration

UTF-8 encoding is used if the project-wide compile option UTF8 encoding for STRING is enabled. Library functions and add-ons then also follow this setting.

If you use individual UTF-8 encoded strings, then you need to make sure that they are interpreted correctly wherever they are used. For example, if the project-wide option is not enabled, the OPC server converts a string variable to UTF-8 before transferring it to a client. A value such as UTF8#'äöü', which is already UTF-8 encoded, is then converted a second time and misinterpreted. Similar problems can arise when outputting strings in the visualization.
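The misinterpretation can be reproduced on raw bytes. The following is a minimal sketch in Python, not CODESYS code. It assumes that the receiving component treats the stored bytes as Latin-1 before converting them to UTF-8. That is one plausible reading of the behavior described above, not a statement about the actual OPC server implementation.

```python
original = "\u00e4\u00f6\u00fc"  # äöü

# Bytes held by a string that is already UTF-8 encoded.
stored = original.encode("utf-8")

# Assumption: the bytes are interpreted as Latin-1 ...
misread = stored.decode("latin-1")

# ... and then converted to UTF-8 before being transferred.
sent = misread.encode("utf-8")

# The client decodes UTF-8 correctly but receives the wrong characters.
received = sent.decode("utf-8")

print(len(stored), len(sent))
print(received == original)
```

Each umlaut arrives as two unrelated characters, and the transferred data is twice as long as the stored data.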