Understanding Swift’s CharacterSet

This is precisely why NSCharacterSet.characterIsMember(UTF8 or UTF16 or UTF32) internally calls longCharacterIsMember(UTF32), which only accepts a UTF32 character.³

CharacterSet and UTF32

The best way to check whether a character is a member of a CharacterSet is to get the UTF32 code point for that character and pass it to NSCharacterSet's longCharacterIsMember().

What’s Inside CharacterSet.decimalDigits?

Let’s go back to our original goal: to see what is inside of CharacterSet.decimalDigits. Here is an example of why it's so important to understand what's inside of it. Given a check that relies on CharacterSet.decimalDigits, do you think “᧐᪂᧐” will be considered numerical?

[Image, for readers on Chrome: how ᧐᪂᧐ renders in Firefox or Safari.]

What? I did not expect that ᧐᪂᧐ would evaluate as numerical characters. Let’s explore further to understand why.

Printing the Contents of CharacterSet

Armed with how Unicode works for CharacterSet, let's write some code to print the contents of CharacterSet.⁴ Because NSCharacterSet only provides a way to check whether a UTF32 character is a member of the set, we will have to loop through every possible UTF32 character and check its membership in the set. If you run such a loop in Xcode Swift Playgrounds (or here on repl.it), you will see every character in the set.⁵

So, What About ᧐᪂᧐?

It turns out that those characters are numerical digits in other writing systems. The ᧐ is the character for “zero” in the New Tai Lue alphabet. The ᪂ is the character for “two” used in the Tai Tham script.

I encourage you to find other CharacterSets to explore!

Thanks to David Solberg for the inspiration.

Keehun loves to take things apart before putting them back together at Livefront.

Footnotes

¹ Apple does provide exact code points of characters in certain sets such as .whitespacesAndNewlines and .newlines, but for most other sets, the documentation only describes the Unicode General Category represented by the set, which may not be particularly helpful (e.g., Apple states that the .alphanumerics set contains “Unicode General Categories L*, M*, and N*”). This website (among others) does provide a sorted list of Unicode General Categories and their characters, which I find slow to use.

² Here’s how to translate the character’s code point value to UTF8 binary.

Knowing the Standardized UTF8 Bit Structure

The UTF8 variable-width structure:

1 byte:  0xxxxxxx
2 bytes: 110xxxxx 10xxxxxx
3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

Fill in all the xs in the table above with the binary value for the character.

To figure out how many bytes are needed, consider the length of the character’s code point in binary. A 1-byte UTF8 sequence can only accommodate 7 bits (only 7 xs in the table). A 2-byte sequence can accommodate 11 bits, a 3-byte sequence 16 bits, and a 4-byte sequence 21 bits.

For the "€" character (U+20AC, binary 10 0000 1010 1100), we need at least 14 bits, meaning it will need the 3-byte structure, which can accommodate between 12 and 16 bits. The binary digits filled into the UTF8 structure look like this: 11100010 10000010 10101100. The code point bits are the ones that follow the 1110 and 10 prefixes: 0010, 000010, and 101100. Notice that if you take away the prefix bits, you get back the original binary for the "€" character (padded with two leading zeros)! The first byte in a 3-byte-long UTF8 character always begins with 1110.
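The fill-in-the-xs procedure from footnote 2 can be sketched in Swift. This is a minimal sketch rather than code from the original post: utf8Bytes(of:) is a hypothetical helper name, and it assumes its input is already a valid Unicode.Scalar. It picks the 1-, 2-, 3-, or 4-byte structure from the size of the code point, then shifts and masks the bits into the x positions.

```swift
// Hypothetical helper (not from the original post): encodes one Unicode
// scalar as UTF8 by filling the x positions of the matching bit structure.
func utf8Bytes(of scalar: Unicode.Scalar) -> [UInt8] {
    let v = scalar.value  // the code point as a UInt32
    switch v {
    case 0..<0x80:
        // Fits in 7 bits: 0xxxxxxx
        return [UInt8(v)]
    case 0x80..<0x800:
        // Fits in 11 bits: 110xxxxx 10xxxxxx
        return [0b1100_0000 | UInt8(v >> 6),
                0b1000_0000 | UInt8(v & 0b11_1111)]
    case 0x800..<0x1_0000:
        // Fits in 16 bits: 1110xxxx 10xxxxxx 10xxxxxx
        return [0b1110_0000 | UInt8(v >> 12),
                0b1000_0000 | UInt8((v >> 6) & 0b11_1111),
                0b1000_0000 | UInt8(v & 0b11_1111)]
    default:
        // Fits in 21 bits: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return [0b1111_0000 | UInt8(v >> 18),
                0b1000_0000 | UInt8((v >> 12) & 0b11_1111),
                0b1000_0000 | UInt8((v >> 6) & 0b11_1111),
                0b1000_0000 | UInt8(v & 0b11_1111)]
    }
}

// The "€" character is U+20AC, so it takes the 3-byte structure.
let euroBytes = utf8Bytes(of: "€")
let euroBinary = euroBytes.map { String($0, radix: 2) }.joined(separator: " ")
print(euroBinary)  // prints "11100010 10000010 10101100"
```

For "€" this reproduces the three bytes from the worked example, and the result should agree with what Swift's own String.utf8 view gives for the same character.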

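The loop described under "Printing the Contents of CharacterSet" can be sketched roughly like this. It is an approximation under stated assumptions, not the post's original listing: members(of:) is a hypothetical helper name, and it uses the Swift-native CharacterSet.contains(_:) in place of NSCharacterSet's longCharacterIsMember(), so the check is made on a Unicode.Scalar rather than a raw UTF32 value.

```swift
import Foundation

// Hypothetical helper (not from the original post): collects every Unicode
// scalar that the given set reports as a member.
func members(of set: CharacterSet) -> [Unicode.Scalar] {
    var result: [Unicode.Scalar] = []
    // Unicode code points run from U+0000 through U+10FFFF.
    for value in UInt32(0)...UInt32(0x10FFFF) {
        // The initializer returns nil for surrogate code points, which are
        // not valid scalars, so those values are skipped.
        guard let scalar = Unicode.Scalar(value) else { continue }
        if set.contains(scalar) {
            result.append(scalar)
        }
    }
    return result
}

let digits = members(of: .decimalDigits)

// Print the members separated by spaces so each digit is easy to tell apart.
print(digits.map { String(Character($0)) }.joined(separator: " "))
```

Checking the result for "\u{19D0}" (᧐) and "\u{1A82}" (᪂) is a quick way to confirm why “᧐᪂᧐” counts as numerical.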