A character set is the complete set of characters used to represent information. Characters include the letters of an alphabet and digits, but also other symbols such as special characters, pictograms and control characters. In electronic data processing (EDP), the number of characters in a character set is limited by the number of bits available per character: n bits can distinguish at most 2^n characters.
Computers and digital circuits can only store and process the symbols 0 and 1 (binary digits, or bits). Each character is therefore stored as a sequence of bits, known as a bit code. There are approximately 100 important characters (digits, letters, umlauts, punctuation marks, symbols, special characters, control characters and formula characters), for which 7 bits, giving 128 possible codes, are sufficient. The character set determines which character corresponds to which bit code. Because of the internationalisation of the Internet, character codes must be standardised to ensure smooth data exchange regardless of language.
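The mapping from character to bit code can be sketched in a few lines of Python. This is only an illustration: the helper name bit_code is hypothetical, and the code numbers come from Python's built-in ord, which for these characters matches the ASCII assignments.

```python
def bit_code(char: str, width: int = 7) -> str:
    """Return the code number of a character as a fixed-width bit string."""
    code = ord(char)
    if code >= 2 ** width:
        # The character's code number needs more bits than are available.
        raise ValueError(f"{char!r} does not fit into {width} bits")
    return format(code, f"0{width}b")


for ch in ("A", "a", "7", "?"):
    print(ch, bit_code(ch))

# Number of distinct bit codes that 7 bits can represent.
print(2 ** 7)
```

A character such as "ü" has a code number above 127 in this mapping, so bit_code("ü") raises an error unless a larger width is requested.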
Development of the character set
The idea of giving meaning to signals evolved early on. With the development of electrical telegraphy in 1837, electrical pulses were used for the first time to transmit characters. To make the transmitted message readable, the signals first had to be converted back into characters. For this purpose, pointer telegraphs and, around 1900, teleprinters were developed that converted signals into legible text. Coding was revolutionised by the French engineer Émile Baudot, who represented text as sequences of five binary digits. The 32 possible signals, formed by combinations of 5 keys, had to be entered by the sender by hand: the birth of the first 5-bit character set. Since computers need a larger repertoire of characters, the 7-bit character set ASCII was developed in 1963 and remained the standard character set in IT for a long time. EBCDIC, an early 8-bit character set, was created at about the same time as ASCII and has long been used on mainframe computers. Its 8 bits allow 256 different codes to be assigned. To make it possible to represent all the languages of the world in a single character set, a universal character set was developed at the end of the 1980s: Unicode.
ASCII, ISO and Unicode…
A PC character set includes not only the individual elements of a character set, but also their rules for encoding. The best-known character encodings are ASCII, the ISO/IEC 8859 family and the internationally standardised Unicode. In addition, some computer company character sets and specific national variants exist.
ASCII character set
ASCII stands for ‘American Standard Code for Information Interchange’ and is one of the first character sets based on 7-bit codes. The character set includes the Latin alphabet in upper and lower case, the ten Arabic numerals, some punctuation marks and control characters. Over the years, various 8-bit extensions of ASCII added umlauts and box-drawing (frame) characters. However, there is no uniform standard for these extensions, which is why characters can be displayed incorrectly when files containing them are exchanged.
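Whether a text stays within the 7-bit ASCII range can be checked directly. The following is a minimal sketch; the function names is_ascii and non_ascii_characters are made up for this example.

```python
def is_ascii(text: str) -> bool:
    """True if every character has a code below 128, i.e. fits into 7 bits."""
    return all(ord(ch) < 128 for ch in text)


def non_ascii_characters(text: str) -> list:
    """Return the characters of a text that lie outside the ASCII range."""
    return [ch for ch in text if ord(ch) >= 128]


print(is_ascii("Hello, world!"))
print(non_ascii_characters("Grüße"))

# Python's built-in "ascii" codec refuses characters outside the 7-bit range.
try:
    "Grüße".encode("ascii")
except UnicodeEncodeError as error:
    print("cannot encode:", error.reason)
```

Characters reported by non_ascii_characters are exactly the ones that need one of the extended or Unicode-based encodings.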
ISO 8859 character set
The ISO 8859 family comprises 15 different 8-bit character sets. Each is based on the ASCII code, whose upper half has been extended for a particular language area; together they cover the European languages as well as Arabic, Hebrew, Thai and Turkish. However, because the limit of 256 characters per set is not sufficient to represent all the characters in international use, ISO 8859 is no longer being developed and has largely been superseded by Unicode.
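The effect of having several 8-bit character sets can be illustrated by decoding the same byte with different members of the family. This sketch uses Python's built-in codecs iso8859_1 (Western European), iso8859_5 (Cyrillic) and iso8859_7 (Greek); the helper name decode_in_parts is hypothetical.

```python
def decode_in_parts(byte_value: int,
                    parts=("iso8859_1", "iso8859_5", "iso8859_7")) -> dict:
    """Decode one byte with several ISO 8859 character sets."""
    raw = bytes([byte_value])
    return {part: raw.decode(part) for part in parts}


# A byte in the ASCII range means the same thing in every part.
print(decode_in_parts(0x41))

# A byte in the upper half depends on which part is assumed.
print(decode_in_parts(0xE4))
```

The second call shows why a file must state which character set it uses: the byte itself does not reveal whether it stands for a Latin, Cyrillic or Greek letter.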
Unicode character set
The most important character set in computing is Unicode. It is an international standard and contains characters and elements of all known writing systems and symbol systems. The aim is to eliminate incompatible encodings in different countries. Each Unicode character has a stable code and fixed properties such as its character type or its upper and lower case forms. Unicode also defines rules for sorting characters (collation). Unicode is constantly being extended and is not bound by the old 8-bit limit of ISO 8859. Once codes have been introduced, they are never removed, because this is the only way to ensure the longevity of digital data. Character encodings for Unicode are UTF-8, UTF-16 and UTF-32 (also known as UCS-4).
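The stable codes and fixed properties of Unicode characters can be looked up with Python's standard unicodedata module. The following is a small sketch; the function name describe and the choice of properties shown are assumptions made for illustration.

```python
import unicodedata


def describe(char: str) -> dict:
    """Collect a few Unicode properties of a single character."""
    return {
        "code_point": f"U+{ord(char):04X}",        # stable code of the character
        "name": unicodedata.name(char),            # official character name
        "category": unicodedata.category(char),    # character type, e.g. Ll = lowercase letter
        "upper": char.upper(),
        "lower": char.lower(),
    }


for ch in ("ä", "A", "€"):
    print(describe(ch))
```

The category values are two-letter codes defined by Unicode, for example Lu for an uppercase letter and Sc for a currency symbol.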
Unicode coding
In addition to the actual characters, Unicode also defines a number of encodings, the Unicode Transformation Formats (UTF). These make it possible to use the entire Unicode character set in a website, for example. Modern web pages declare UTF-8 as their character set in the header, which gives them access to all characters. UTF-16 is also used internally by numerous operating systems.
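How the transformation formats differ can be seen by encoding the same character in each of them. This is a minimal sketch using Python's built-in codecs; the helper name encoded_lengths is hypothetical, and the big-endian variants are chosen only so that no byte order mark is added to the output.

```python
def encoded_lengths(text: str) -> dict:
    """Return the number of bytes a text occupies in each UTF encoding."""
    encodings = ("utf-8", "utf-16-be", "utf-32-be")
    return {enc: len(text.encode(enc)) for enc in encodings}


# An ASCII letter, the euro sign and an emoji outside the 16-bit range.
for sample in ("A", "\u20ac", "\U0001F600"):
    print(repr(sample), encoded_lengths(sample), sample.encode("utf-8").hex(" "))
```

UTF-8 and UTF-16 use a variable number of bytes per character, while UTF-32 always uses four, which is the trade-off between compactness and fixed-width access.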
FAQ: More questions about character sets
What are the character encodings?
The three main character encodings for Unicode are UTF-8, UTF-16 and UTF-32.
What is Unicode?
Unicode is the international standard for encoding characters or text elements. The system enables the storage and processing of texts in digital systems.
How many characters does UTF-8 have?
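Without the Unicode restriction, as many as 4,398,046,511,104 (2^42) character mappings would theoretically be possible with UTF-8. Because Unicode limits UTF-8 sequences to 4 bytes, the effective number is 2^21, which corresponds to 2,097,152 values. Of these, Unicode itself only uses the code points U+0000 to U+10FFFF, that is, 1,114,112 code points.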
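The figures in this answer can be reproduced with a short calculation. The sketch below assumes the standard UTF-8 layout for sequences of 1 to 4 bytes (a 1-byte sequence carries 7 bits; in longer sequences the lead byte carries 7 minus n bits and each continuation byte carries 6); the function name payload_bits is made up for the example.

```python
import sys


def payload_bits(n_bytes: int) -> int:
    """Number of usable bits in a UTF-8 sequence of 1 to 4 bytes."""
    if not 1 <= n_bytes <= 4:
        raise ValueError("Unicode limits UTF-8 sequences to 1-4 bytes")
    if n_bytes == 1:
        return 7
    # Lead byte contributes (7 - n) bits, each continuation byte 6 bits.
    return (7 - n_bytes) + 6 * (n_bytes - 1)


for n in range(1, 5):
    print(n, "byte(s):", payload_bits(n), "bits")

print(2 ** payload_bits(4))   # values representable in 4 bytes
print(2 ** 42)                # the theoretical figure without the restriction
print(sys.maxunicode + 1)     # code points Unicode actually defines as its range
```

The last line uses sys.maxunicode, which in Python 3 is the highest Unicode code point, U+10FFFF.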
How do you enter characters that are not on the keyboard?
There are numerous special characters that can be inserted via key combinations. You can look them up at https://tools.oratory.com/altcodes.html, for example.
What are special characters?
Special characters are all characters other than the basic letters of the Latin alphabet and the digits. These include punctuation marks ( ? ! . , ; : – ), symbols (§ / # $ %), ligatures of two letters (ß æ œ) and letters with so-called diacritical marks (ü á ô è ñ).
What is a character set?
A character set is the set of all characters used to represent information. The character set depends on the display system.
What is character encoding?
In computing, character encoding (also character coding) refers to the unambiguous assignment of each character in a character set to a number or bit sequence, so that text can be stored and transmitted digitally.
What is the term for a character set in printing?
In printing, a character set is called a font.