Character set - Translation Encyclopaedia

Character set in IT

Computers and digital circuits can only store and process the symbols 0 and 1 (binary digits, or bits). Each character is therefore stored as a sequence of bits, known as a bit code. For a basic repertoire of roughly 100 characters (digits, unaccented letters, punctuation marks, a few symbols and control characters), 7 bits are sufficient, because 7 bits yield 128 distinct codes. Accented letters such as umlauts, formula characters and the characters of other scripts need more than 7 bits. The character set determines which character corresponds to which bit code. Because the Internet is used internationally, character codes must be standardised to ensure a smooth exchange of data regardless of language.
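To make the idea of a bit code concrete, the following minimal Python sketch prints the 7-bit code of a few characters. The helper name bit_code is purely illustrative and not part of any standard; the codes shown are those of ASCII, which Python's ord() returns for these characters.

```python
def bit_code(char: str, width: int = 7) -> str:
    """Return the code of a single character as a string of 0s and 1s.

    bit_code is a hypothetical helper used only for this illustration.
    """
    code = ord(char)
    if code >= 2 ** width:
        raise ValueError(f"{char!r} does not fit into {width} bits")
    return format(code, f"0{width}b")


for ch in ("A", "a", "7", "?"):
    print(ch, bit_code(ch))
# A 1000001
# a 1100001
# 7 0110111
# ? 0111111
```

An umlaut such as ä has a code above 127, so bit_code("ä") raises a ValueError with the default width of 7.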

Development of the character set

The idea of giving meaning to signals evolved early on. With the development of electrical telegraphy in 1837, electrical pulses were used for the first time to transmit characters. To understand the transmitted message, the receiver first had to convert the signals back into characters. For this purpose, devices that turned signals into legible text were developed: first pointer telegraphs in the 19th century, later teleprinters in the early 20th century. Coding was revolutionised in the 1870s by the French engineer Émile Baudot, who represented text as a sequence of five binary digits. The 32 possible signals, formed by combinations of 5 keys, had to be entered by the sender themselves. This was the birth of the first 5-bit character set. Because computers needed a larger repertoire for data processing, the 7-bit character set ASCII was published in 1963 and remained the standard character set in IT for a long time. EBCDIC, an early 8-bit character set, was created by IBM at around the same time as ASCII and is still used on some mainframe computers. With 8 bits, 256 different codes can be assigned. To represent all the languages of the world in a single character set, a universal character set was developed from the end of the 1980s: Unicode.
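The code counts mentioned above follow directly from the number of bits: n bits give 2 to the power of n codes. The short Python sketch below computes them and also shows that ASCII and EBCDIC assign different codes to the same letter. The choice of "cp037" is an assumption for illustration; it is only one of several EBCDIC code pages that Python ships as a codec.

```python
# Number of distinct codes for the bit widths mentioned in the text.
code_space = {bits: 2 ** bits for bits in (5, 7, 8)}
print(code_space)  # {5: 32, 7: 128, 8: 256}

# The same letter has different codes in ASCII and in EBCDIC.
# "cp037" is one EBCDIC code page among several (assumed here for illustration).
ascii_a = "A".encode("ascii")
ebcdic_a = "A".encode("cp037")
print(ascii_a.hex(), ebcdic_a.hex())  # 41 c1
```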

ASCII, ISO and Unicode…

A character set as used on a computer comprises not only the repertoire of characters, but also the rules for encoding them. The best-known character encodings are ASCII, the ISO/IEC 8859 family and the international standard Unicode. In addition, there are vendor-specific character sets and national variants.

ASCII character set

ASCII stands for ‘American Standard Code for Information Interchange’ and is one of the earliest character sets; it uses 7 bits per character. Its 128 codes cover the Latin alphabet in upper and lower case, the ten Arabic numerals, punctuation marks and a set of control characters. Over the years, various 8-bit extensions of ASCII added umlauts, frame (box-drawing) characters and other symbols in the upper 128 positions. However, these extensions never followed a single uniform standard, which is why characters outside the original 7-bit range can be displayed incorrectly when such files are exchanged.
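The exchange problem can be reproduced in a few lines of Python. This is a sketch under stated assumptions: the sample word and the two code pages (cp437, the original IBM PC code page, and cp1252, a Windows code page) were picked for illustration and are just two of many 8-bit extensions.

```python
text = "Grüße"

# Plain 7-bit ASCII cannot represent the umlaut or the sharp s.
print(text.isascii())  # False
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
print(ascii_ok)  # False

# One and the same byte means different things in two 8-bit extensions.
raw = bytes([0x84])
as_cp437 = raw.decode("cp437")
as_cp1252 = raw.decode("cp1252")
print(as_cp437, as_cp1252)  # ä „
```

A file written with one extension and read with another therefore shows the wrong character wherever a byte above 127 occurs.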

ISO 8859 character set

The ISO 8859 family comprises 15 different 8-bit character sets. Each is based on the ASCII code and extends it for a particular language area; together they cover most European languages as well as Arabic, Hebrew, Thai and Turkish. However, because each of these character sets is limited to 256 codes, none of them can represent the characters of all languages at once. ISO 8859 is therefore no longer being developed further and has largely been superseded by Unicode.
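The 256-code limit means that the upper half of each ISO 8859 character set is reused for a different script. The following Python sketch decodes one byte with three members of the family; the byte value and the three parts chosen (Latin-1, Cyrillic, Greek) are arbitrary picks for illustration.

```python
raw = bytes([0xE4])

# Decode the same byte with three parts of ISO 8859.
decoded = {codec: raw.decode(codec) for codec in ("iso8859_1", "iso8859_5", "iso8859_7")}
for codec, char in decoded.items():
    print(codec, char)
# iso8859_1 ä
# iso8859_5 ф
# iso8859_7 δ

# The lower half (ASCII) is identical in all three.
print({codec: b"A".decode(codec) for codec in decoded})
```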

Unicode character set

The most important character set in computing today is Unicode. It is an international standard that aims to cover the characters of all known writing systems and symbol sets, with the goal of eliminating incompatible encodings between countries. Each Unicode character has a stable code (its code point) and fixed properties, such as its character category or its upper- and lower-case counterparts. Unicode also provides data for sorting (collating) text. Unicode is continually being extended and is not bound by the 8-bit, 256-character limit of ISO 8859. Once a code has been assigned, it is never removed, because this is the only way to ensure the longevity of digital data. The encodings for Unicode include UTF-8, UTF-16 and UTF-32 (also known as UCS-4).
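Code points and character properties can be inspected with Python's standard unicodedata module. The sketch below is only an illustration; the helper name describe and the sample characters are assumptions, not part of Unicode itself.

```python
import unicodedata


def describe(char: str) -> tuple:
    """Return code point, official name, general category and upper-case form.

    describe is a hypothetical helper used only for this illustration.
    """
    return (
        f"U+{ord(char):04X}",
        unicodedata.name(char),
        unicodedata.category(char),
        char.upper(),
    )


for ch in ("A", "ä", "€", "7"):
    print(describe(ch))
# ('U+0041', 'LATIN CAPITAL LETTER A', 'Lu', 'A')
# ('U+00E4', 'LATIN SMALL LETTER A WITH DIAERESIS', 'Ll', 'Ä')
# ('U+20AC', 'EURO SIGN', 'Sc', '€')
# ('U+0037', 'DIGIT SEVEN', 'Nd', '7')
```

The category codes are Unicode's own abbreviations: Lu is an upper-case letter, Ll a lower-case letter, Sc a currency symbol and Nd a decimal digit.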

Unicode coding

In addition to the characters themselves, Unicode defines a number of encodings, the Unicode Transformation Formats (UTF). These make the entire Unicode character set usable on a website, for example. Modern web pages declare UTF-8 as their character encoding in the page header, which gives access to all characters. UTF-16 is used internally by numerous operating systems.
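How the transformation formats differ can be seen by encoding the same characters with each of them. The following Python sketch counts the bytes per character; the sample characters are arbitrary, and the little-endian codec variants are used only so that no byte order mark is added to the count.

```python
# Sample characters: ASCII letter, umlaut, euro sign, emoji (U+1F600).
samples = ("A", "ä", "€", "\U0001F600")
encodings = ("utf-8", "utf-16-le", "utf-32-le")

byte_counts = {
    ch: {enc: len(ch.encode(enc)) for enc in encodings}
    for ch in samples
}
for ch, counts in byte_counts.items():
    print(f"U+{ord(ch):04X}", counts)
# U+0041 {'utf-8': 1, 'utf-16-le': 2, 'utf-32-le': 4}
# U+00E4 {'utf-8': 2, 'utf-16-le': 2, 'utf-32-le': 4}
# U+20AC {'utf-8': 3, 'utf-16-le': 2, 'utf-32-le': 4}
# U+1F600 {'utf-8': 4, 'utf-16-le': 4, 'utf-32-le': 4}

# Every format can represent every character, so a round trip is lossless.
text = "".join(samples)
round_trip_ok = all(text.encode(enc).decode(enc) == text for enc in encodings)
print(round_trip_ok)  # True
```

UTF-8 stores ASCII characters in a single byte, which is one reason it is well suited to web pages, while UTF-32 always uses four bytes per character.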

