UTF-8 Encoding

Basics

Applications in CODESYS can process a wide variety of characters, for example to output an error message in various languages, or to display a visualization in a language selected by the user that accepts user input in a wide variety of languages, characters, or symbols.

If a comprehensive character set is not necessary, or if a project should not be changed, then strings encoded in Latin-1 format can still be used.

Table 3. Character set tables

Character Set	Code Page Number	Description	Character Encoding
ASCII	20127	128 characters. Suitable for English texts.	7-bit encoded characters
DOS-Latin-1	819, 850	Complies with ISO 8859. Suitable for Western European languages in the Windows command line window.	8-bit encoded characters
Latin-1	28591	Complies with ISO 8859-1. Often used for HTML pages with äöüß, but without € and, for example, without special French characters.	8-bit encoded characters
Windows 1252 Encoding	1252	Default Windows character set for Western European countries. (Windows uses the UTF-16 format internally.) Contains all characters from ISO 8859-1 and ISO 8859-15, but partly with different encoding.	8-bit encoded characters
Unicode		Universal character set for all possible languages, including historical languages, Braille, music, or emojis. More than 100,000 characters can be represented. Each character has a numeric code (code point). In contrast to ASCII, the assignment of code points to characters is separate from the encoding of the characters. Numeric codes < 128 are ASCII compatible. Numeric codes < 256 are ISO 8859-1 compatible. For more information, see: https://home.unicode.org/
Unicode 14.0		144,697 characters
UTF-16	1200	Encoding format of Unicode characters. Used in some operating systems (Windows, OS X) and programming languages (Java, .NET) for internal character representation. Note that different computer architectures store the bytes of each 16-bit unit in a different order. UTF-16LE uses little-endian byte order.	16-bit encoded characters. The characters are encoded in either 2 bytes or 4 bytes.
UTF-8	65001	Byte-oriented encoding format of Unicode characters. Most widespread. Used in GNU/Linux and Unix operating systems, and in various Internet services (email, web, browser). Compatible with ASCII in the first 128 characters (0–127).	Sequence of 8-bit units per character. The characters are encoded with variable length, from 1 to 4 bytes.
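
The byte counts listed for UTF-8 and UTF-16 can be checked outside of CODESYS. The following is a minimal sketch in Python, not IEC code; Python is used only because it exposes the raw bytes easily. The helper name `encoded_sizes` and the sample characters are illustrative assumptions.

```python
def encoded_sizes(ch: str) -> dict:
    """Return how many bytes one character occupies in UTF-8 and in UTF-16."""
    return {
        "utf-8": len(ch.encode("utf-8")),
        # "utf-16-le" fixes the byte order, so no byte order mark is prepended
        "utf-16": len(ch.encode("utf-16-le")),
    }


if __name__ == "__main__":
    # ASCII letter, umlaut, euro sign, CJK character, emoji
    for ch in ("A", "ä", "€", "你", "😀"):
        print(ch, encoded_sizes(ch))
```

In this sketch, the ASCII letter needs 1 byte in UTF-8 and the emoji needs 4, while UTF-16 uses 2 bytes for everything except the emoji, which needs 4.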

UTF-8 in CODESYS

Tip

Of the encodings available for STRING in CODESYS, UTF-8 has the most comprehensive character set. Therefore, it is recommended that you enable UTF-8 encoding for new projects as well as for existing projects to be used in a new context.

Table 4. Project-wide encoding in CODESYS

Data Type	Compile Option: UTF8 Encoding for STRING	Which encoding is used project-wide?
`STRING`	Enabled	UTF-8
`STRING`	Disabled	Windows 1252 encoding (default Windows encoding) / Latin-1
`WSTRING`	Enabled	UTF-16
`WSTRING`	Disabled	UTF-16

In CODESYS, the STRING data type can be encoded in Latin-1 or UTF-8 formats. The WSTRING data type always encodes its characters as Unicode in UTF-16.
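
To make the difference between the two STRING encodings concrete, the following Python sketch (again an illustration outside of CODESYS, not IEC code) checks whether a text can be represented in Latin-1 at all and how many bytes it occupies in each encoding. The helper names `fits_latin1` and `byte_length` are hypothetical.

```python
def fits_latin1(text: str) -> bool:
    """Return True if every character of the text exists in Latin-1."""
    try:
        text.encode("latin-1")
        return True
    except UnicodeEncodeError:
        return False


def byte_length(text: str, encoding: str) -> int:
    """Return the number of bytes the text occupies in the given encoding."""
    return len(text.encode(encoding))


if __name__ == "__main__":
    for sample in ("ÜüÄäÖö", "你好", "€"):
        print(sample, "fits Latin-1:", fits_latin1(sample),
              "| UTF-8 bytes:", byte_length(sample, "utf-8"))
```

The umlauts fit into Latin-1 with one byte each, but need two bytes each in UTF-8. The Chinese characters and the euro sign cannot be represented in Latin-1 at all.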

Encoding a single string literal in UTF-8 format

Even if the project-wide encoding format is set to Latin-1, you can encode a single literal in UTF-8 format. To do this, add the UTF8# type prefix to the literal.

VAR
    {attribute 'monitoring_encoding' := 'UTF-8'}
    strVarUtf8: STRING := UTF8#'你好,世界!ÜüÄäÖö';
END_VAR

For more information, see: Constant: UTF8# String; Pragma Attribute: monitoring_encoding

String conversion for UTF-8 encoding

If you have enabled UTF-8 encoding project-wide, then you can use the string conversion functions as usual.

String manipulation

Use library functions to manipulate your strings.

When STRING variables are manipulated, index access often leads to the desired result as long as the string contains only ASCII characters. Avoid this construct anyway. It is not only poor programming style: with UTF-8 encoding, an index addresses a single byte and not a whole character, so index access can corrupt multi-byte characters and lead to unwanted string manipulation.
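
Why index access is risky with UTF-8 can be illustrated at byte level. The following Python sketch is not IEC code and only models a string as a byte buffer. The helper names `overwrite_byte` and `is_valid_utf8` and the sample words are assumptions made for illustration.

```python
def overwrite_byte(data: bytes, index: int, value: int) -> bytes:
    """Replace the byte at the given index, as an index access on a string buffer would."""
    buf = bytearray(data)
    buf[index] = value
    return bytes(buf)


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes form a well-formed UTF-8 sequence."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Pure ASCII text: one byte per character, so the index matches the character position.
ascii_result = overwrite_byte("Motor".encode("utf-8"), 0, ord("R"))

# 'ü' occupies two bytes in UTF-8, so index 1 hits only the first half of it.
utf8_result = overwrite_byte("Tür".encode("utf-8"), 1, ord("o"))

if __name__ == "__main__":
    print(ascii_result, is_valid_utf8(ascii_result))
    print(utf8_result, is_valid_utf8(utf8_result))
```

In the second case the orphaned continuation byte of 'ü' remains in the buffer, so the result is no longer valid UTF-8.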

UTF-8 encoding only for project-wide configuration

UTF-8 encoding is used when the project-wide compile option UTF8 Encoding for STRING is enabled. Library functions and add-ons then also follow this setting.

If you use individual UTF-8 encoded strings, then you need to make sure that they are interpreted correctly wherever they are used. For example, if the project-wide setting is not enabled, a string variable in the OPC server is converted to UTF-8 before being transferred to a client. A value such as UTF8#'äöü', which is already UTF-8 encoded, would then be converted a second time and misinterpreted. Similar problems can arise when outputting strings in the visualization.
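
The effect of such a misinterpretation can be modeled as follows. This Python sketch is not IEC code and does not reproduce the actual OPC server implementation. It only assumes that bytes which are already UTF-8 are treated as Latin-1 text and converted to UTF-8 once more. The function name `encode_again_as_if_latin1` is hypothetical.

```python
def encode_again_as_if_latin1(utf8_bytes: bytes) -> bytes:
    """Treat UTF-8 bytes as Latin-1 text and convert that text to UTF-8 again."""
    return utf8_bytes.decode("latin-1").encode("utf-8")


original = "äöü"
encoded_once = original.encode("utf-8")                   # 6 bytes, correct UTF-8
encoded_twice = encode_again_as_if_latin1(encoded_once)   # 12 bytes, double encoded

# What a client sees when it decodes the double-encoded bytes as UTF-8
seen_by_client = encoded_twice.decode("utf-8")

if __name__ == "__main__":
    print(original, "->", seen_by_client)
```

Under this assumption the client receives valid UTF-8, but the text reads 'Ã¤Ã¶Ã¼' instead of 'äöü'.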