Linguistics in Tech: Unicode

It’s the same governing body to standardize JS in the 1990’s, making the official name ECMAScript.

QuoraEnter UnicodeUnicode is an 8, 16, or 32 bit encoded library/database of characters.

It was the dream of a group of people in the late 1980's from Xerox, Apple, Microsoft, NeXT, Metaphor, RLG, and Sun Microsystems.

And that dream was to have a “unique, unified, universal” encoding system, which, of course, is where the name comes from.

By 1991, the Unicode Consortium, created their first volume and a second volume with Han ideographs quickly after in 1992.

Even with lofty goals, Unicode’s modus operandi was to assume other general encoding protocols and so it started out with virtually copying ASCII and some of its more useful extensions right into the first volume.

This allowed for backwards compatibility and better integration and adoption.

Nineteen-nintey-six was an important year for Unicode.

They realized that they needed even more code points and expanded memory for each character.

Surrogate character pairs were introduced, allowing for two different code points to point to a third code point (effectively doubling the amount of space encoded for those characters).

Fun fact: The addition of space allowed for more fun languages to be included (Egyptian hieroglyphs), but also for the inclusion of many characters previously thought to be inconsequential (but now realized to be important) by the unfortunately narrow-minded westerners who decided on the first few volumes.

Code space expansionUnicode now has over a million code points.

Characters can be composed of single or multiple code points.

Code points are expressed with U+ and a hexadecimal value.

They also come with a short description.

For example, U+004D is the Unicode Latin Capital Letter M.

The description is important in distinguishing between characters that look similar but span different language families and are not always related.

We need a way to speak about specific characters uniquely, and Unicode offers these description of sorts as a way to refer to characters in English as an alternative to the code point.

This phenomenon and others similar have resulted in a huge amount of ‘duplicated’ characters.

Many of these characters differ slightly or not at all from other entries, but are part of collection for specific uses.

The M above is part of the Latin characters family, but there are other latin letters used in Chinese/Japanese/Korean expressions which may look similar, but have different historical and current applications.

Some duplicates also arose from Unicodes willingness to consume other standards into itself, creating duplicates in the name of compatibility.

The 1,114,112 code points are split into 17 planes.

These are just abstractions created to group different alphabets and related symbols together.

The first plane is called the Basic Multilingual Plane, which is where most of the characters we see everyday live.

It uses four digit hexadecimal code points and the planes above the first use five-six digit hexadecimals.

Depending on the amount of memory you choose to allocate to it, you get a more or less complete set of characters.

Most websites in the world use 8-bit encoding and would be able to get the first plane of basic characters.

Shelter/the letter h, an Egyptian hieroglyph is at U+13254 on the second plane, the Supplementary Multilingual Plane.

It’s not interestingly named, but it is descriptive.

You need to use a 16-bit encoder to access these characters.

The highest planes are for private use characters and special purpose characters.

WikipediaControversyMany languages have nuances that make for complicated encoding challenges.

Cyrillic characters have different forms in cursive for different regions/countries.

Unicode has previously used the same character to represent all of these forms as long as their non-cursive form is similar/the same.

Cursive may seem trivial to latin and germanic based language users, but some languages actually require the use of cursive in certain situations.

For example, Arabic uses cursive in almost every situation where you are writing words.

So the cursive form of letters can be the difference between correct representation and a complete butchering of the language.

The Thai alphabet has a different order of writing letters and speaking letters.

The order of encoding has suffered from legacy incorporations, but it nonetheless creates a paradox for how best to convey the order of letters.

While Unicode incorporated many more Han ideographs since its first iterations, Han unification has been a bone of contention.

Han unification is the grouping of similar characters in East Asian languages which have roots in older character forms.

Many of these characters have been encoded to one point, excluding modern nuance and regional differences.

Many historical characters and historical versions of characters have not been included at all and there are many people working to encode this expansive amount of characters specifically from the East Asian and Indic language groups.

WikipediaElements of LanguageGlyphs are the graphical representations of characters and graphemes are the smallest unit of meaning in a language.

In English, a is a grapheme.

It is a single character (with many meanings and sounds, but still connected to the one character).

However, the letter a actually has a few different glyphs as seen below.

As an English speaker, we recognize both of these glyphs are referring to one grapheme.

How should Unicode deal with such variations?.This question is both a linguistic question and also a technical question.

Another linguistic element is called a ligature.

It is the graphical combination of two graphemes into on glyph.

One example of ligature is the ash symbol(æ).

In some instances the meaning of the double letter is the same as the unconnected version and other times it is an entirely new grapheme.

There are a few situations where there is controversy about whether certain ligatures should be separately encoded and used in tandem to create a combination glyph or encoded as a new character.

There are arguments for both sides which favor memory usage, legacy compatibility, future flexibility, etc.

TodayThere have been two new versions of Unicode released just this year.

They are still adding characters!.And in the future they should continue adding new characters as more emojis and symbols are created.

WikipediaSourcesWhat is Unicode?This page used to feature a series of translations in many different languages and scripts, in part to highlight the…unicode.

orgUnicodeU+004D is the unicode hex value of the character Latin Capital Letter M.

Char U+004D, Encodings, HTML…www.

compart.

comSurrogate Characters | iOS Internationalization: Characters and Encoding | InformITLanguages use different characters with accent marks and pronunciation marks to accentuate or provide meaning.

In this…www.

informit.

comWhat is glyph?.- Definition from WhatIs.

comIn information technology, a glyph (pronounced GLIHF ; from a Greek word meaning carving) is a graphic symbol that…whatis.

techtarget.

comChart provided by Unicode to describe the Egyptian hieroglyph section.

. More details

Leave a Reply