ꓧ𐐬𝗆𐐬𝗀ⅼУрႹ ⅰѕ 𝗌е𝗍 𝗈ſ ဝո𝖾 𝗈г ꝳо𝗋е ɡ𝗋аρႹ𝖾ⅿе𝗌 𝗍Ⴙа𝗍 Ⴙ𝖺ѕ 𝗂ꝱ𝖾ꝴ𝗍𝗂𐐽а𝗅 о𝗋 ѵ𝖾г𝗒 𝗌Ꭵⅿі𝗅аꝵ ⅼꝏ𝗄 𝗍ᴏ 𝗌იო𝖾 о𝗍ꜧ𝖾𝗋 𐑈е𝗍 ဝſ ɡꝵ𝖺рႹеოеѕ. Just like the previous sentence, which does not use a single ASCII letter:
ꓧ - LISU LETTER XA
𐐬 - DESERET SMALL LETTER LONG O
𝗆 - MATHEMATICAL SANS-SERIF SMALL M
𐐬 - DESERET SMALL LETTER LONG O
𝗀 - MATHEMATICAL SANS-SERIF SMALL G
ⅼ - SMALL ROMAN NUMERAL FIFTY
У - CYRILLIC CAPITAL LETTER U
р - CYRILLIC SMALL LETTER ER
Ⴙ - GEORGIAN CAPITAL LETTER CHIN
...
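A listing like the one above does not have to be typed by hand. Here is a minimal Python sketch, assuming only the standard `unicodedata` module; the helper name `describe` is made up for illustration.

```python
import unicodedata

def describe(text):
    """Return (character, Unicode name) pairs for every non-space character."""
    return [
        (ch, unicodedata.name(ch, "<unnamed>"))
        for ch in text
        if not ch.isspace()
    ]

if __name__ == "__main__":
    # Three characters from the listing above, written as escapes
    # so that this source file itself stays pure ASCII.
    for ch, name in describe("\u217c\u0423\u0440"):
        print(f"{ch} - {name}")
```

Names come from the Unicode database bundled with the Python build, so very recent characters may show up as `<unnamed>` on older interpreters.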
Homoglyphs are not specific to Unicode, but it was the ability to write in many scripts using a single UTF encoding that made them popular.
Similarity is conditional
It is font dependent. Two sets of graphemes that look very similar (or even identical) in one font may not look that similar in another. For example, т - CYRILLIC SMALL LETTER TE looks like ASCII T, but in cursive fonts (those that resemble connected handwriting) it looks like m.
Similarity is subjective
For many people unfamiliar with the given alphabets, Ǧ and Ğ may look exactly the same. But someone who uses those letters on a daily basis will notice immediately that the first one has a CARON and the other has a BREVE on top.
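When the eye cannot tell such letters apart, the Unicode database can. A small Python sketch, assuming the standard `unicodedata` module (the helper name `marks` is hypothetical): canonical decomposition (NFD) splits each letter into a base character plus its combining mark, and the mark's name gives the difference away.

```python
import unicodedata

def marks(ch):
    """Names of the combining marks that a character decomposes into under NFD."""
    return [
        unicodedata.name(c)
        for c in unicodedata.normalize("NFD", ch)
        if unicodedata.combining(c)  # non-zero combining class means a mark
    ]

if __name__ == "__main__":
    print(marks("\u01e6"))  # LATIN CAPITAL LETTER G WITH CARON
    print(marks("\u011e"))  # LATIN CAPITAL LETTER G WITH BREVE
```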
They are not limited to single grapheme
For example, ထ - MYANMAR LETTER THA looks like two ASCII o letters. And the other way around: ASCII rn looks like the single ASCII letter m.
Applications?
- Fun. 𐐑ǃkǝ pɹoducǃng weird looking bᴝt ɹeadɐble ʇext.
- Trolling. A programmer's classic is to replace ; in someone's code with ; (U+037E GREEK QUESTION MARK) and watch some funny debugging attempts. A more advanced version is to modify a keybinding. For example, on macOS create ~/Library/KeyBindings/DefaultKeyBinding.dict with the following content:
{
    /* The inserted character is U+037E GREEK QUESTION MARK, not an ASCII semicolon. */
    ";" = ("insertText:", ";");
}
And observe how Python suddenly becomes someone's favorite language :P
Just promise you won't troll a stressed-out junior dev before the end of a sprint.
- Phishing. This is the "Fun with UTF-8" sub-series, but unfortunately this application is anything but fun. Homoglyphs are used on a massive scale to spoof company names, bypass anti-spam filters and create fake domains. For example, can you spot the difference between Paypal and ꓑayраl? A common way to detect those is to check the Script Unicode property (more on those in this post). A single word that uses more than one script should be considered suspicious:
$ raku -e '"Paypal".comb.classify( *.uniprop("Script") ).say'
{Latin => [P a y p a l]} # real
$ raku -e '"ꓑayраl".comb.classify( *.uniprop("Script") ).say'
{Cyrillic => [р а], Latin => [a y l], Lisu => [ꓑ]} # fake
Raku note: The comb method called without a parameter extracts a list of characters. Those characters are then grouped by the classify method. The classification key is the output of the uniprop method for a given character.
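For readers who do not have Raku at hand, the same check can be roughly approximated in Python. This is only a sketch under a loud assumption: the standard `unicodedata` module does not expose the Script property, so the first word of each character's name ("LATIN", "CYRILLIC", "LISU") is used as a crude stand-in. The function names are made up for illustration.

```python
import unicodedata
from collections import defaultdict

def classify_by_name_prefix(word):
    """Group characters by the first word of their Unicode name.

    ASSUMPTION: the name prefix is only a rough proxy for the Script
    property. It is wrong for many characters, for example digits,
    punctuation and mathematical letters.
    """
    groups = defaultdict(list)
    for ch in word:
        name = unicodedata.name(ch, "UNKNOWN")
        groups[name.split()[0]].append(ch)
    return dict(groups)

def looks_mixed(word):
    """True when the characters of a word fall into more than one group."""
    return len(classify_by_name_prefix(word)) > 1

if __name__ == "__main__":
    real = "Paypal"
    # Lisu letter, Latin "ay", Cyrillic er and a, Latin "l"
    fake = "\ua4d1ay\u0440\u0430l"
    print(classify_by_name_prefix(real), looks_mixed(real))
    print(classify_by_name_prefix(fake), looks_mixed(fake))
```

Because of the proxy, a word such as "abc1" would also be flagged, so real filtering should rely on the actual Script property, as the Raku one-liner does.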
Tools
I maintain the HomoGlypher library/package, which handles common homoglyph operations:
- Unwind. From ASCII text, create a list of all possible homoglyphied text variants. This is useful, for example, for checking whether some domain is spoofed.
- Collapse. From homoglyphied text, recover all possible ASCII text variants. Useful for normalizing text before passing it to content filters.
- Randomize. From ASCII text, create a single homoglyphied text with a given replacement probability.
- Tokenize. Create a regular expression token that will match homoglyphied text equivalent to the given ASCII text. I think this may be the only homoglyph-related library in existence that has this feature :)
A huge list of mappings is provided, so you won't have to dig through Unicode blocks on your own to find possible similarities between graphemes.
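To make the operations listed above more concrete, here is a language-neutral illustration in Python. It is not HomoGlypher's API: the three-entry mapping and the function names `unwind`, `collapse` and `tokenize` are assumptions that merely borrow the operation names, and Randomize is left out for brevity.

```python
import itertools
import re

# ASSUMPTION: a tiny hand-picked mapping, for illustration only.
HOMOGLYPHS = {
    "a": ["\u0430"],  # CYRILLIC SMALL LETTER A
    "p": ["\u0440"],  # CYRILLIC SMALL LETTER ER
    "o": ["\u043e"],  # CYRILLIC SMALL LETTER O
}

def unwind(text):
    """Every variant of text with each letter either kept or replaced.

    The unchanged ASCII text is included as the first variant.
    """
    options = [[ch] + HOMOGLYPHS.get(ch, []) for ch in text]
    return ["".join(combo) for combo in itertools.product(*options)]

def collapse(text):
    """Map known homoglyphs back to ASCII.

    Simplification: the sample mapping is one-to-one, so a single
    result is returned instead of a list of candidates.
    """
    reverse = {g: a for a, glyphs in HOMOGLYPHS.items() for g in glyphs}
    return "".join(reverse.get(ch, ch) for ch in text)

def tokenize(text):
    """Compile a regex matching text or any of its homoglyph variants."""
    parts = []
    for ch in text:
        alternatives = [ch] + HOMOGLYPHS.get(ch, [])
        parts.append("[" + "".join(re.escape(a) for a in alternatives) + "]")
    return re.compile("".join(parts))

if __name__ == "__main__":
    variants = unwind("pa")
    print(variants)
    print([collapse(v) for v in variants])
    print(bool(tokenize("pal").fullmatch("\u0440\u0430l")))
```

The number of unwound variants grows exponentially with text length, which is one reason a regular expression token can be a more practical way to match spoofed text than enumerating every variant.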
Give it a try. And if you know other homoglyph libraries please leave a note in the comments for future readers.