[ale] language(locale) detection in gmail
Michael B. Trausch
fd0man at gmail.com
Fri Oct 6 10:28:36 EDT 2006
Jerry Yu wrote:
> Reading an email in Chinese, I noticed that all sponsored links served
> by gmail are in Japanese. With my limited Japanese training (one year in
> college), I can tell the links, albeit in wrong language, are actually
> pertinent to the content of the email.
> This comes to a question, anybody know how google,or anybody for the
> matter, detect the locale (charset encoding?), given a chunk of text?
>
My guess -- and mind you, this is only a guess -- is that it would be
something to do with a combination of headers, characters and words.
For example, the e-mails that I compose are in UTF-8, even though I
mostly use ASCII characters, which cover the basic Latin alphabet.
The messages that I send go out with the following header in the
plain-text portion of the e-mail:
Content-Type: text/plain; charset=UTF-8
This tells whatever software is reading it to expect plain text encoded
as UTF-8 in that section. From there, all it needs to do is identify the
language. If I were to put a bunch of Chinese/Japanese characters in my
e-mail, it would likely identify that, because those characters live
only in a particular subset of the characters that make up Unicode, just
as these ASCII letters that I am typing do.
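
To make that guess concrete, here is a rough Python sketch of the two
steps: read the declared charset from the header, then count which
scripts the characters belong to. This is only an illustration of the
idea, not how Gmail does it. The names RAW, declared_charset and
script_counts are made up for the example, and taking the first word of
a character's Unicode name as its "script" is a crude shortcut that I am
assuming is good enough for a demonstration.

```python
from email import message_from_string
import unicodedata

# A tiny hypothetical message using the header shown above.
RAW = (
    "Content-Type: text/plain; charset=UTF-8\n"
    "\n"
    "Hello 你好\n"
)


def declared_charset(raw_message):
    """Return the charset named in the Content-Type header, or None."""
    msg = message_from_string(raw_message)
    # get_content_charset() returns the charset parameter in lowercase.
    return msg.get_content_charset()


def script_counts(text):
    """Count letters per script, using the first word of each
    character's Unicode name (LATIN, CJK, HIRAGANA, ...) as a rough
    stand-in for its script."""
    counts = {}
    for ch in text:
        if not ch.isalpha():
            continue  # skip spaces, digits and punctuation
        name = unicodedata.name(ch, "")
        script = name.split(" ")[0] if name else "UNKNOWN"
        counts[script] = counts.get(script, 0) + 1
    return counts


if __name__ == "__main__":
    print(declared_charset(RAW))
    body = message_from_string(RAW).get_payload()
    print(script_counts(body))
```

Note that the charset only says how the bytes are encoded. It is the
per-character counts that hint at which script, and so which group of
languages, the text is written in.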
As I understand it, Chinese and Japanese share many of the same written
characters, and if that is the case, it is possible that it is detecting
the script based on the characters that are used, and pulling
advertisements from their files that are made up of the same subset of
Unicode characters, even though they are in a different language.
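
Building on the guess above, here is a small Python sketch of why a
detector that only looks at the shared characters cannot tell the two
languages apart, and one simple way it could do better. Japanese text
normally mixes in hiragana and katakana, which have their own Unicode
blocks, so their presence is a strong hint. The function name
guess_cjk_language and its return labels are invented for this example,
and the heuristic is an assumption on my part, not a description of any
real system.

```python
def guess_cjk_language(text):
    """Very rough heuristic: kana implies Japanese; Han characters
    alone could be either Chinese or Japanese."""
    has_han = False
    for ch in text:
        cp = ord(ch)
        # Hiragana letters (U+3041..U+3096) and
        # katakana letters (U+30A1..U+30FA).
        if 0x3041 <= cp <= 0x3096 or 0x30A1 <= cp <= 0x30FA:
            return "ja"
        # CJK Unified Ideographs, shared by Chinese and Japanese.
        if 0x4E00 <= cp <= 0x9FFF:
            has_han = True
    return "zh-or-ja" if has_han else "unknown"


if __name__ == "__main__":
    for sample in ("これは日本語です", "这是中文", "hello"):
        print(sample, "->", guess_cjk_language(sample))
```

A Chinese message contains no kana, so this heuristic can only answer
"zh-or-ja" for it. A system that stopped at that point and guessed
Japanese would behave the way Jerry describes.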
-- Mike
--
Michael B. Trausch <fd0man at gmail.com> - Jabber: fd0man at livejournal.com
Demand freedom: Use open and free protocols, standards, and software.