Tim Gadanidis - Ribenyu mo liuchang ni huasemasu!

Here’s a funny and embarrassing story about “AI” and text-to-speech. TikTok recently introduced “Digital Avatars”, which they are selling as a way to “scale creative strategies” and “localize global campaigns”. One of the ways they are supposed to do that is by allowing influencers to reach audiences who speak other languages: a model trained on the influencer’s voice can output a machine-translated version of the original marketing text, making it appear as though they can really speak that language and helping them Connect More Authentically With Global Markets.

Forget about how horrifyingly dystopian this all is for a minute and watch this video of TikTok head of Global Ops, Adrienne Lahens, demoing her own avatar:

The relevant bit about language is from [00:21–00:39]. Here’s a transcription of that section:

We believe this technology has the potential to fuel the creator economy and offer new avenues for creators and marketers to scale their content globally. In fact, I can now speak over 30 different languages, from español [Spanish] to français [French], ich kann Deutsch sprechen [I can speak German], ribenyu mo liuchang ni huasemasu [gibberish, but transcribed as “I can even speak Japanese fluently”].

What happened at the end there? Anyone with a rudimentary knowledge of Japanese could tell you that that’s not Japanese.

It looks like the machine translation of “I can even speak Japanese fluently” was 日本語も流暢に話せます. Here’s the transliteration and gloss:

日本語	も	流暢	に	話せます
nihongo	mo	ryuuchou	ni	hanasemasu
Japanese	even	fluent	-ly	can-speak

Anyone can see that ribenyu and nihongo (for example) are completely different. Why couldn’t the text-to-speech model output the correct pronunciation? The reason lies in the Japanese writing system and the way it is encoded in modern computing.

The Japanese writing system includes kana (仮名), which are exclusive to Japanese, and kanji (漢字), which originate from Chinese. Characters in this latter set are commonly known as Chinese characters or “Han characters”. Aside from Japanese, they are mainly used in writing Chinese languages such as Mandarin and Cantonese, and historically were also used in writing Vietnamese and Korean. While some characters or character forms are exclusive to a particular writing system (e.g., 込 is only used in Japanese, and simplified Chinese characters are mainly used in China and Singapore), many characters are similar or identical across languages. For example, 大 means “big” and is the same in every single writing system that uses Han characters. And 直 (Japanese) and 直 (Chinese) are variants of the same character. (These should look slightly different, but may look the same if your browser is not properly configured to display Japanese and Chinese character sets differently.)

Today, most text used in computing is encoded with Unicode, a character encoding standard that aims to represent text in every human writing system. Because so many Han characters overlap, the Unicode Consortium didn’t want to encode characters like 大 separately for each writing system that used them, which would have been a waste of space. Instead, they decided to encode them once, and have the same code point used in multiple writing systems. They also went a step further and encoded variants of the same character, like 直 (Japanese) and 直 (Chinese), using the same code point, which turned out to be a bit controversial, since some people feel quite strongly that their version of a given character should be recognized as distinct. The process of mapping different writing systems to the same unified character set was called Han unification.

Because of Han unification, a computer has no way to tell which language a given Han character is being used for unless the text is marked somehow, just like the letter “a” is the same character whether it is used in English, French, German, Esperanto, or some other language that uses it. One way to indicate which language is being used is the HTML lang attribute, which is what I am using to display the different variants on this page, but that only works when you are writing in HTML.
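To see what Han unification means in practice, here is a minimal Python sketch using only the standard library's unicodedata module. The helper name describe is made up for this illustration. It prints the code point and Unicode name of two Han characters and one kana. The Han characters get a generic "CJK UNIFIED IDEOGRAPH" name with no language attached, while the kana is named as hiragana.

```python
import unicodedata

def describe(ch: str) -> str:
    """Return the code point and Unicode name of a single character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"

# Two Han characters and one kana from the examples in this post.
for ch in "大直に":
    print(ch, describe(ch))
```

Nothing in the string itself says whether 直 should be drawn or read the Japanese way or the Chinese way. That information has to come from outside the text, or be inferred from context.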

Of course, a human being or a sophisticated computer program can rely on context to determine what is going on. For example, if we see 流暢に, the kana に is a clear clue that we are in Japanese, because kana are only used in Japanese. TikTok’s fancy “AI”-powered digital avatar software stack is evidently not sophisticated enough to do this, though. Instead, it reads all the Han characters using the Mandarin pronunciation, and reads all the kana using the Japanese pronunciation. This kind of primitive text-to-speech engine definitely isn’t what I would call “intelligent”!

Text:	日本語	も	流暢	に	話	せます
Mandarin:	rìběnyǔ		liúchàng		huà
Japanese:		mo		ni		semasu
End result:	rìběnyǔ	mo	liúchàng	ni	huà	semasu
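The table above can be reproduced with a toy Python sketch. To be clear, this is not TikTok's actual pipeline, which is not public. It only illustrates the behaviour inferred from the video. The reading tables are hypothetical and cover just this one sentence, and the function names are invented for the example. The second half shows the kind of context check described earlier: if the text contains any kana, treat the whole thing as Japanese.

```python
import re

# Toy reading tables covering only the example sentence (illustrative only).
MANDARIN_READINGS = {"日本語": "rìběnyǔ", "流暢": "liúchàng", "話": "huà"}
KANA_READINGS = {"も": "mo", "に": "ni", "せます": "semasu"}
JAPANESE_READINGS = {"日本語も流暢に話せます": "nihongo mo ryuuchou ni hanasemasu"}

HAN_RUN = re.compile(r"[\u4e00-\u9fff]+")    # CJK Unified Ideographs block
KANA_RUN = re.compile(r"[\u3040-\u30ff]+")   # Hiragana and Katakana blocks
TOKEN = re.compile(r"[\u4e00-\u9fff]+|[\u3040-\u30ff]+")

def naive_reading(text: str) -> str:
    """Read each run of Han characters as Mandarin and each kana run as Japanese."""
    parts = []
    for run in TOKEN.findall(text):
        if HAN_RUN.fullmatch(run):
            parts.append(MANDARIN_READINGS[run])
        else:
            parts.append(KANA_READINGS[run])
    return " ".join(parts)

def looks_japanese(text: str) -> bool:
    """Any kana at all is a strong hint that the text is Japanese."""
    return KANA_RUN.search(text) is not None

def context_aware_reading(text: str) -> str:
    """Use the Japanese reading when kana is present, else fall back to the naive one."""
    if looks_japanese(text):
        return JAPANESE_READINGS[text]
    return naive_reading(text)

sentence = "日本語も流暢に話せます"
print(naive_reading(sentence))          # rìběnyǔ mo liúchàng ni huà semasu
print(context_aware_reading(sentence))  # nihongo mo ryuuchou ni hanasemasu
```

The kana check is only a heuristic. A short Japanese string written entirely in kanji, such as 日本語 on its own, contains no kana and would slip through, so a real system would need explicit language tagging or a proper language identifier.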

Here are some lessons we can take away from this story:

Writing systems and text encoding are really interesting!
Most companies using “AI” to “innovate” are actually just sloppily gluing together existing technology and hoping people won’t know any better.
If you’re using this software for translation, you should probably get someone who actually speaks the language to check it afterwards to avoid embarrassing yourself. (But actually, creating and using a fake avatar of yourself to make your marketing videos more “authentic” might be embarrassing enough already.)