The points made may be salient, but the tone is so condescending as to be nearly unreadable without getting angry. A large number of his points around languages that are radically different from English could be summed up as: "Your language sucks to deal with in computing because your native country didn't develop computing machinery first."
Firstly, the problems he raises would be hard even if the native speakers of a given language were the first to develop computers. In fact I read a fascinating argument (from a link on HN? I forget) about Japanese computers developing into "appliances" (a gaming machine and a separate, more expensive word-processing machine for professionals) because early computers were too under-powered to support the Japanese character set. And it was because they lacked these problems that Americans developed "general" computing first, with the Japanese stuck in the "appliance" mindset until the late nineties.
Secondly, I'm not a native English speaker and I always had this sentiment: "why isn't all computing in English", "why do we have i18n", etc. For one, I don't think everyone studying a standard foreign language is bad, and while English is probably not the best option (Spanish is reportedly easier), it's there, it's as close to being a lingua franca as any other language or closer, and it's certainly easier than the language with the largest number of native speakers (Chinese). And then even someone with bad English might benefit from at least having File and Edit menus instead of the often crazy translations of these things. A little bit of English and a little bit of standardization can't hurt.
Of course ultimately computers ought to display text in all those languages - if not in menus, etc. then certainly in the content itself - so this sentiment of mine isn't very helpful at all. But I wasn't offended by the article in the slightest.
In the context of some other language being the origin of computing, and since the OP mentions Russian as fairly easy to work with... What if it had been Russian everywhere instead of English? If the terminology, POSIX, C and other language keywords had all been Russian, with the rest of the world (including English speakers) instead learning Russian as the lingua franca? (Yes, this deviates substantially from history.) Would the outcome be roughly the same? Would it avoid those pitfalls of Japanese? Would it be mildly more or less convenient?
Yes, the tone is a far cry from the cool-headed, neutral, sterile aesthetic one would expect from technical or academic writing. However, the article is very informative, and should probably be bookmarked for later reference --- in case one ever has to do this kind of work.
The article may have been written in anger, and thus it provokes anger. Good! Programmers who don't get angry while solving hard, inconvenient, uncomfortable problems are, in my experience, unlikely to truly solve those problems --- sweeping issues under the rug, or passing the problem to someone else is more likely. Anger is a very powerful motivational force if honed correctly, whereas contentment often leads to docility, complacency, and quitting.
I find that problems that make you angry or enraged are simultaneously important and instructive. Most performance problems (or even multi-threading problems), for example, fall into this category.
> The article may have been written in anger, and thus it provokes anger. Good! Programmers who don't get angry while solving hard, inconvenient, uncomfortable problems are, in my experience, unlikely to truly solve those problems --- sweeping issues under the rug, or passing the problem to someone else is more likely.
That's a really interesting point and it makes me think of a developer that I've worked with (in a company of non-developers) who cared more about 'inconveniences' than I did.
It occurs to me that he did a much better job of engaging with these 'inconveniences' and fixing them.
I can only speak with confidence about the section on the German ss/ß, and it is dead wrong. (This has been explained a few times in sibling-of-a-sibling comments.) So it's not informative unless you fact-check every single paragraph of it.
I also don't remember anger being part of any successful learning experience.
> "Your language sucks to deal with in computing because your native country didn't develop computing machinery first."
What nonsense, the article says the complete opposite of that. All of the article deals with examples of why other writing systems (misnamed "languages") have specific properties that make them hard to deal with on computers. The Latin and Cyrillic alphabets just so happen to be easier because of their regular properties, not because they were the first to be implemented. Lucky us.
Summarizing the article as being rooted in political incorrectness is disingenuous. Or do you really want to suggest the Chinese would have had an easy time implementing their writing system on computers if only they had invented them first?
How would those issues with the languages be dealt with if, say, one of the worst-offending Asian languages (according to the author) had developed computing machinery first?
English+ASCII is simple because the rules are simple, not because the US invented the standard first.
The people with the worst-offending languages had a big head start in developing printing presses (at large scale) and advanced mathematics (calculus, algebra, abstract geometry, set theory, etc.) first. Yet, for some reason, they didn't.
Nobody studying sociology has come out and claimed they couldn't do it only because of their language, but hardly anybody questions that the language was an important factor, and these were very diverse groups with little else in common.
Japan and China have had some kind of computing machinery since the 1980s at least, as well as digital typesetting, so you cannot say "They never had typewriters, that's why their encoding is so messed up".
Hangeul has very simple rules, and is generally simple, but if you want Hangul Jamo (i.e., archaic letters that don't compose neatly into "standardized" syllables), you'll be in a similar place as if you want to produce mediaeval ligatures or thorn letters. Þis is hardly ever neſceſʒary in modern text.
Up until, IIRC, the 1950s Japan actually officially planned to cease use of Chinese characters and switch to kana entirely. In the end this was not done, for reasons of tradition, but it's what the Koreans did.
On one hand, I realize it's totally meant tongue-in-cheek. So I don't want to pick on the article, because it's useful and funny.
On the other hand, I've been seeing this attitude in regards to other topics lately. "Why can't humans be perfect... like this machine?" Well, guess what, bub. We came first, and this stupid computer had better get used to it. Me, I like my clumsy, imperfect analog world.
"That's nice. I'll deal with it when you make it economically worthwhile for me to deal with it. Until then, enjoy your fixed-format fields that truncate your name."
Yeah, I've seen unironic suggestions on HN that languages should change to better accommodate computer processing. Besides the obvious absurdity, that completely ignores the problem of handling the huge amount of existing text.
Who cares about tone? Why be sensitive in this context? It's a language. Why would anyone get offended when someone says "Your language sucks"? It's not like they're saying your hair sucks or your tongue is incapable of tasting sweets or your ear is shaped like a tulip.
It's a language. No one can help what language they learned when they were a child. It wasn't up to them. Of course some of them are going to suck overall more than others, and of course there are different reasons for the suckage, and of course all languages suck in different ways.
It seems like if someone is getting offended about someone else saying their language sucks, they're probably going to get offended about pretty much anything.
At first I thought you were being sarcastic, but then I realized you are actually arguing that only overly-sensitive people are offended when their language is insulted.
Since language is a core component of human culture, many people become offended when their language is insulted or threatened. Countries have gone to war over language. Provinces secede over language. Language is a very, very sensitive topic in many places, such as Quebec, Catalonia, the nations of former Yugoslavia, and so on and so forth.
You may think that people shouldn't be sensitive if their language is denigrated or insulted, but that is demonstrably not the case now. Some of us may not care, but many people -- maybe even most people -- would become angry or annoyed if their native language is openly ridiculed.
He's not arguing about language though, but writing systems specifically. If someone said the English writing system sucks, and gave specific reasons why, I would probably agree with them. Spellings are absurd, we have unnecessary letters, case makes no sense, etc.
But I can't understand being offended even when the language itself is criticized. English certainly has a lot of strange and arbitrary rules. What is with "be", "are", "am", for example? Why should languages be above criticism?
That's very interesting. Thank you. Which cultures have gone to war over language? I'd like to educate myself on their history, and try to understand how something like language was one of the primary reasons for something as serious as a war.
Nicholas Ostler's Empires of the Word: A Language History of the World is a journey through human history as seen through a lens of expanding, contracting and competing language spheres. Well worth a read.
>In the 20th century, Serbo-Croatian served as the official language of the Kingdom of Yugoslavia (when it was called "Serbo-Croato-Slovenian"),[9] and later as one of the official languages of the Socialist Federal Republic of Yugoslavia. The dissolution of Yugoslavia affected language attitudes, so that social conceptions of the language separated on ethnic and political lines. Since the breakup of Yugoslavia, Bosnian has likewise been established as an official standard in Bosnia and Herzegovina, and there is an ongoing movement to codify a separate Montenegrin standard. Serbo-Croatian thus generally goes by the ethnic names Serbian, Croatian, Bosnian, and sometimes Montenegrin.[10]
> Who cares about tone? Why be sensitive in this context? It's a language. Why would anyone get offended when someone says "Your language sucks"? It's not like they're saying your hair sucks or your tongue is incapable of tasting sweets or your ear is shaped like a tulip.
Yet Article 1 of the United Nations Charter states that one of the four purposes of the UN is "To achieve international co-operation in solving international problems of an economic, social, cultural, or humanitarian character, and in promoting and encouraging respect for human rights and for fundamental freedoms for all without distinction as to race, sex, language, or religion".
Language politics are in fact really complicated. I myself am from a country that less than a decade ago went without a federal government for 20 months because the different language groups couldn't get along.
> It's a language. No one can help what language they learned when they were a child.
And while we're at it, why would anyone be sensitive about skin color or nationality?
Leaving aside the idea that you are the one to decide what others should be sensitive about: because it makes the signal-to-noise ratio of this blog post _crap_. I'm not sure if particular things are a serious grievance or just a bewildered look at new things. I'm not sure of the intention. Should it educate or just rant? It's unclear and wastes a lot of time on things that don't add anything.
Interesting thought: it's possible that computers would inevitably be invented in the US/UK/Russia or some other country with a relatively simple character set, because doing something like Chinese first would pose a hindrance.
If you need 32 bits just to store the characters for Chinese, then your first step into computing is going to be harder, compared to an American who can get by with 8 bits.
8 bits is too much for what was used historically. Morse code had 1-5 "bits" (effectively having the compression built-in). Baudot code was 5 bits, invented by a French engineer in France in 1870, inspired by other non-English-speaking Europeans.
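To make the size asymmetry concrete with a modern encoding (a quick Python 3 sketch; the byte counts are for UTF-8, not the historical codes discussed above):

```python
# Per-character storage cost in UTF-8 varies by script:
for text in ["hello", "привет", "中文字", "한글"]:
    encoded = text.encode("utf-8")
    print(f"{text}: {len(text)} chars -> {len(encoded)} bytes")

# hello:  5 chars -> 5 bytes   (1 byte per char, ASCII)
# привет: 6 chars -> 12 bytes  (2 bytes per char, Cyrillic)
# 中文字:  3 chars -> 9 bytes   (3 bytes per char, CJK)
# 한글:    2 chars -> 6 bytes   (3 bytes per char, Hangul)
```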
He says they suck to deal w/ in computing because they're hugely more complicated than English and Russian which have very small, simple alphabets that are very easy to process.
'Au contraire, mon ami, it is you and your programs that suck' retorted some languages. Though in fairness I think there was as much comic hyperbole as grating condescension.
Clearly tongue-in-cheek, and informative. So nice somehow.
But still: the fact that English phonology and orthography are only lightly related, and hardly anyone is willing to make the effort to fix this, should give the author second thoughts about complaining about others' artifacts while not complaining about the ones in English.
Many languages have pure or nearly-pure phonetic orthographies, which are a pleasure to use. English could have one. And as the de-facto international language, it would help make everyone's life better.
There's a lot of half-knowledge and misdirection in this blog post, though.
* Languages aren't writing systems, and vice-versa.
* There are often decent reasons and upsides to the things he calls out as sucking. His rant on Hangul, for example - for human users, the morpho-syllabic grouping of letters into blocks in Korean writing tends to be beneficial, because it makes the orthography of morphemes in different contexts more regular and morpheme reuse smoother. I also tend to consider the way Hangul is implemented in Unicode actually quite clever - you can use a formula to map from the code points of the individual letters to the code points of the pre-composed blocks (see the first sketch after this list). (I've contrasted Chinese and Hangul a little in this LWN comment: http://lwn.net/Articles/608260/)
* The German ß isn't a simple ligature for 'ss'. It has a different history, an orthographic dimension, and also serves as a useful pronunciation hint.
* Point 5, "The width of a character depends on context", is full of half-knowledge. First off, it's not "Jamo vowel" - 자모 (ja-mo) is the Korean word for "letter", so it's "vowel jamo" if anything. The alphabet is called 한글 (han-gul). Second, combining characters aren't in any way unique to Asian scripts in Unicode. Unicode contains separate code points for the individual Hangul letters and the blocks they can be combined into. Unicode also contains separate code points for Latin letters and the various diacritics and accents they combine with in Latin-derived Western scripts, along with code points for those combinations. For human users, this sort of consistent combinatorics in a script is a boon, and that Unicode chooses to represent both primitive elements and combined versions is a general trait of Unicode (the second sketch after this list demonstrates it with normalization).
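A minimal Python sketch of that composition arithmetic, with the base and count constants as the Unicode standard defines them for precomposed Hangul syllables:

```python
# Composing a precomposed Hangul syllable from its letters (jamo)
# using the arithmetic defined in the Unicode standard.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
V_COUNT, T_COUNT = 21, 28

def compose(lead: str, vowel: str, tail: str = None) -> str:
    l = ord(lead) - L_BASE                    # leading consonant index
    v = ord(vowel) - V_BASE                   # vowel index
    t = (ord(tail) - T_BASE) if tail else 0   # optional trailing consonant
    return chr(S_BASE + (l * V_COUNT + v) * T_COUNT + t)

# U+1112 (ᄒ) + U+1161 (ᅡ) + U+11AB (ᆫ) -> U+D55C (한)
print(compose("\u1112", "\u1161", "\u11AB"))  # 한
```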
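And the combined/primitive duality is observable straight from the standard library; a small sketch using Unicode normalization:

```python
import unicodedata

# The same visible text, in composed and decomposed form.
composed = "caf\u00E9"                                # 'é' as one code point
decomposed = unicodedata.normalize("NFD", composed)   # 'e' + U+0301

print(len(composed), len(decomposed))                 # 4 5
print(composed == decomposed)                         # False!
print(unicodedata.normalize("NFC", decomposed) == composed)  # True

# Exactly the same applies to Hangul: NFD splits a syllable block
# into its jamo, NFC puts it back together.
print([hex(ord(c)) for c in unicodedata.normalize("NFD", "한")])
# ['0x1112', '0x1161', '0x11ab']
```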
In general, the thrust here is "your writing system sucks for me while doing my job", which is subject to this guy's performance and qualifications. It's not the same as "your writing system sucks for its users". I find the specter of stuff like this catching on as memes among linguistically and culturally ignorant hackers (because hey, we write English, so we're in the doing-it-right camp, right?) frightening, because programmer convenience is only one factor among many, and most likely not the most important one.
I think the author makes it pretty obvious that he's only speaking about "your writing system sucks for me while doing my job". Sure, the title is hyperbolic - but with a title like "Your language sucks" nobody should expect it to be anything else. That's like saying the title "Dependency Injection will be the death of me" is flawed because it's highly unlikely the author will actually die because of it.
Sure, but I think it's fair to be just as unforgiving in a critique then. My concern here is that we hackers love it when (seemingly) low-hanging fruit are pointed out and react with "well this sucks - let's drop it" very easily, and if this happens to someone as a result of reading this blog, it's bad. That said, I don't think eye-for-an-eye discourse is nice, either, and would certainly prefer it if this was a more rigorously fact-checked article.
I think your points are correct. However, it really is difficult and requires lots of specialized domain knowledge to properly support a big group of the world's languages. I'd almost go so far as to hire a specialist developer just to do international character handling if that was a requirement of the software.
It is difficult and does require some learning, yeah. In fact, it's more difficult than the article hints at, because while it falsely speaks of languages it's actually only concerned with writing systems, and a lot of the more interesting localization problems happen on the actual language/grammar level (for example, plain gettext isn't good enough to handle Korean post-positions, so we try to do better[1] in KDE).
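To give a flavor of why plain gettext falls short here: the correct particle depends on whether the previous syllable ends in a consonant, which is decidable arithmetically from the code point. A toy Python sketch (not KDE's actual mechanism, which is more general; the function name is made up for illustration):

```python
def append_topic_particle(word: str) -> str:
    """Append the Korean topic particle: 은 after a final consonant,
    는 after a vowel. Only handles precomposed Hangul syllables."""
    last = ord(word[-1])
    if not 0xAC00 <= last <= 0xD7A3:
        raise ValueError("not a Hangul syllable")
    has_final_consonant = (last - 0xAC00) % 28 != 0
    return word + ("은" if has_final_consonant else "는")

print(append_topic_particle("책"))    # 책은  ('book' ends in a consonant)
print(append_topic_particle("사과"))  # 사과는 ('apple' ends in a vowel)
```

A fixed gettext format string can't make this choice for a substituted argument, which is why the message system itself has to know the rule.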
But we can and have built reusable tools to shoulder much of the burden, and because these problems are so fundamental and culture-enabling, it can actually be a really rewarding and satisfying area to work in. In fact, if you caught the author on a better day, I wonder if he wouldn't agree.
i think the blocks-as-syllables concept in hangul makes lots of sense. that, and the straightforward mapping of letters to the english alphabet, made it fairly easy to learn how to read korean. understanding what you're reading is a completely different matter however.
The funky part being that the blocks aren't just syllables, the boundaries between blocks often correspond to the boundaries between morphemes, because of how the phonotactics of the language have evolved and Hangul was designed to fit them. To use English as an example, the word "cats" contains two morphemes, the stem "cat" and the plural suffix "s". In Korean+Hangul, this sort of combination tends to split neatly into blocks, and when there are multiple options for how the letters can be distributed over them, the orthography usually prefers the variant that keeps the morpheme spelled consistently everywhere. This gives the writing a neat sort of lego blocks feel.
There's plenty of other nifty traits to Hangul, such as the way many letters derive from each other graphically and thus lend themselves to two-step entry and reduced-number-of-keys inputs like keypads. And the design of the letters is featural, not arbitrary, that is they are often visualizations of tongue position or mouth shape when forming the respective sound.
The theme here is: There are many other concerns when it comes to the quality of a writing system. What the author is concerned with here isn't entirely unimportant, but a very, very narrow view.
This is partly because while Korean and Chinese are distinct languages, a large part of the Korean vocabulary consists of imported Chinese stems, and Chinese is heavily monosyllabic. As such, a lot of Hangul blocks are close homophones to the sound values of Han characters.
As a Finnish speaker, I must disappoint you: both Russian and English do have broken alphabets (although this might not affect their character encoding). English makes extensive use of silent letters and letters whose pronunciation depends on the context (word) they appear in (take the e in "context" and "appear" as an off-the-top-of-my-head example). I haven't studied much Russian, but I know that the letter ъ in their alphabet is never pronounced, and is instead used to modify the pronunciation of the preceding letter.
English pronunciation leads to algorithms like Soundex [1], which just wouldn't seem necessary if words were written purely phonetically. Hangul gets closer, in a way. It seems hard to make rhymes with words like 'reprise' and avoid 'despise'. Not all languages need to be so ambiguous.
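Soundex itself is tiny; here's a sketch of the classic American variant in Python:

```python
# American Soundex: keep the first letter, map the remaining letters
# to digit classes, skip vowels; 'h'/'w' are transparent to runs.
CODES = {c: str(d) for d, group in enumerate(
    ("bfpv", "cgjkqsxz", "dt", "l", "mn", "r"), start=1) for c in group}

def soundex(word: str) -> str:
    word = word.lower()
    result, prev = word[0].upper(), CODES.get(word[0])
    for ch in word[1:]:
        code = CODES.get(ch)
        if code and code != prev:
            result += code
        if ch not in "hw":      # h/w don't reset the previous code
            prev = code
    return (result + "000")[:4]

print(soundex("Robert"), soundex("Rupert"))    # R163 R163
print(soundex("reprise"), soundex("despise"))  # R162 D212
```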
You really don't want a writing system which is too tied to pronunciation. If you go too far in that direction you'll end up with people with different dialects of the same language not being able to communicate clearly, even in writing! That would be a very undesirable state of affairs.
Yet you still want some correlation between spelling and pronunciation, otherwise the writing system essentially becomes completely arbitrary and thus impossible to learn (since we humans rely so much on pattern recognition for learning).
I am from a country with a strongly phonetic language and you are just plainly wrong. There are minor dialects, but they are regarded more as slang compared to the language taught in schools and used to communicate with officials or a broader audience.
(On the other side, it is much more difficult for us to learn English and get used to that disconnection between the alphabet and the spoken words. And we don't have spelling competitions :).)
I am not wrong. My native country did this experiment about 200-300 years ago -- it failed badly, and that was with a very small population which was (geographically) widely distributed.
Trying to read (as schoolchildren) a few samples of text written in different dialects during that time dispelled any notion that it was a good idea.
Also: China (as another poster pointed out).
EDIT: Btw, if you have a good grasp of English accents, it's pretty easy to see what kind of chaos would occur in a country as small as England. (Thinking of the "north south divide" thing.)
So how does Finnish stack up? Do you not have silent letters?
In Dutch we have some useless things like combined letters: "c" and "h" are both used (like in "cake" or "hotel"), but when combined into "ch" they sound like a "g". There are even duplicate ones: "au" and "ou" are both the same sound, and "ij" and "ei" too. You can't mix them though: "ijs" is ice and "eis" is a demand, even though they sound 100% the same. I'm not sure about silent letters though; the "h" is audible even if only slightly.
Interesting to see Dutch mentioned. From what little I know about Dutch, its orthography does seem more similar to English than that of any other non-English-derived language I've seen.
I seem to remember that some (clearly not all!) of the horrible mess of English orthography originates from Dutch in fact, since some of the first people to run printing presses in England (or to print books which were distributed in England) were Dutch.
This may be true from a pronunciation standpoint, but he's mostly writing from a text processing standpoint. From a text processing standpoint, Russian and English are pretty easy to deal with.
On the other hand, it's a bit funny to point those out as easy to deal with, because they are bicameral alphabets, with notions of upper and lower case, case mappings, title-casing rules, and the like. That's a complex system that, if you hadn't grown up with it and already come to terms with it, you would probably include in a rant like this.
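A couple of quick Python examples of bicameral weirdness that Unicode has to model (the behavior follows Unicode's case-mapping tables, which Python's str methods implement):

```python
# Case mapping isn't even length-preserving or reversible:
print("Straße".upper())      # STRASSE -- ß uppercases to two letters
print("STRASSE".lower())     # strasse -- and doesn't round-trip back

# Turkish İ lowercases to 'i' plus a combining dot (two code points):
print(len("İ".lower()))      # 2
```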
Yup, both ъ and ь either modify the preceding consonant (тварь) or insert a sort of separation between letters (подъезд).
There are also some non-phonetic aspects to the language, which I didn't really think about until taking Russian in college (sort of native speaker, took the class to make some friends.) For instance, "o" is frequently pronounced as "ah" (хорошо - horosho - is usually pronounced as 'harasho'.)
All I hear is: "I wanted to model the real world and it didn't fit my mental model."
Sadly, there is some information in all that cruft that is interesting to people coming from an English context when dealing with international character handling, but well, nothing you couldn't get somewhere else.
Turns out nerds are actually not necessarily as godly intelligent as they base their identities around being, and are actually pretty bigoted a lot of the time
Well yes, because you're distinguishing between Masse and Maß (Mass). They're not precisely equivalent if you're attempting to determine meaning, but for many purposes they are.
For example, searching for "ss" in Chrome will hit ß. That's good behavior.
Actually, it's more messed up than this. In Germany, the DIN 5007 defines two variants:
1) ä and a are equivalent (normal order, e.g. in an encyclopedia)
2) ä and ae are equivalent (to be used in "name lists", e.g. telephone books)
But Germany is not the only place where German is one of the official languages.
E.g. there is an "Austrian order", where a is followed by ä, except for name lists where often DIN 5007 variant 2 is used as well (ä equivalent to ae).
Then again, Swedish also has the letter "ä", but there ä is (usually) sorted after z (well, after å, which comes after z).
Now, imagine you run an English-language service but need to sort surnames such as "Ährlich". What sort order would be the right one? DIN 5007/1, 5007/2, Austrian, Swedish, something else entirely?
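If you have ICU available, each of those orders is just a locale identifier away. A sketch using the PyICU bindings (assuming they are installed; the locale IDs follow ICU's conventions):

```python
from icu import Collator, Locale

names = ["Ähre", "Ahorn", "Zeder"]

for locale_id in ("de_DE",                      # DIN 5007/1: ä ~ a
                  "de_DE@collation=phonebook",  # DIN 5007/2: ä ~ ae
                  "sv_SE"):                     # Swedish: ä after z
    collator = Collator.createInstance(Locale(locale_id))
    print(locale_id, sorted(names, key=collator.getSortKey))

# Both German variants should keep 'Ähre' up among the a's;
# the Swedish collator sorts it after 'Zeder'.
```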
Imagine you run a German-language service that is frequented by Germans and Austrians...
Whether you agree with its tone/conclusion or not, this is a really informative compilation of problems that can arise in processing international text. Very nice.
This article is interesting in that, while it's not the intention of the writer, it seems to point out some technical underpinnings for how language might impact culture. If a computer's handling of a language can vary so drastically, imagine how the human brain must deal with expressing or receiving complex thoughts through these.
I recall reading some discussion of this concept in the field of International Relations' Constructivist theory, but any salient links escape me now.
> If a computer's handling of a language can vary so drastically, imagine how the human brain must deal with expressing or receiving complex thoughts through these.
Orthography and language are not the same thing. Having a complex orthography has no bearing on processing complexity. If a language has a complex orthography, it may be hard to read and write, which are learned skills, but it won't be hard to speak and listen (which are naturally acquired skills given exposure to a language in the critical period).
That assumes that each of these properties exist in a vacuum, which is surely not the case. There's an obvious interaction between written and spoken language, and even without that, hundreds or thousands of years of differences in the written interactions and data transferred through reading within a culture could have immense impact on collective experiences and ways of thinking.
OP should be thankful to the Greek, Roman and Arab ancient civilizations for sane-to-compute-with languages. Basically all other languages are so fucked up and mind-fucking that it's no wonder modern mathematics, science, engineering and then computer science developed so fast in particular regions of the world... not to be a cultural bigot, but Sapir-Whorf ftw :)
Sapir-Whorf is largely discredited in modern linguistics (especially so for the strong form that your comment seems to suggest).
The general consensus is that human languages are more or less equivalent in their expressive power. (Note that doesn't mean certain writing systems aren't easier to learn/use than others. It's important to distinguish reading/writing from language).
Depends, actually. In Switzerland, you do write "Buße" as "Busse", despite the different pronunciation indicated by ß. The German written in Switzerland just doesn't have the ß character and uses "ss".
Nonetheless, you're right if you're just looking at German as written and used in Germany (And maybe Austria? Not sure how they handle ß there).
Well, this may be the first time anyone, anywhere called the Georgian language "straight-forward". Georgian is a very cool language with a fascinating alphabet, but it's almost mind-bendingly complex.
Also, Russian, the other language listed in addition to English and Georgian as "not-broken", has many irregular noun declensions ("unstable" cases in the writer's unusual terminology), which was one of his indicators for "broken" languages. In fact, 99% of all languages with a noun case system have at least some irregular cases. I'd say 100%, but then I'm sure someone will point to some language spoken by 10 people in a remote village in the Caucasus mountains that features strictly regular noun declensions.
Anecdote time:
I once worked on a software project where all of the sorting was done using an internal library made by a Large Trustworthy International Corporation. We discovered halfway through that the transitive property was not being maintained during sorts mixing half-width and full-width numerals. (In other words, 1 < 2 < １ > 1, with １ being a full-width numeral.) Switching to ICU left me ultra-impressed at its thoroughness.
"ICU provides a data-driven, flexible, and run-time-customizable mechanism called "tailoring". Tailoring overrides the default order of code points and the values of the ICU Collation Service attributes"
Also, you sometimes need context to properly sort strings. Examples:
Are you sorting phone book entries or items in a dictionary? In some languages, that does make a difference.
Are you sorting Swiss German or German German?
Given two 'obviously' Italian words, should you apply Italian collation rules? You probably/maybe shouldn't when both are words in an English-language dictionary.
You can truly recognize a broken writing system from two hints:
A) It takes a native child several years to learn how to write.
B) Or there are spelling competitions among native children.
I think pretty much everyone here would agree. It's common convention that we need to change.
I'm Dutch and trying to push for YYYY-MM-DD because it's the only format that makes sense when sorting. Classmates have a lot of trouble with it and still put "Notes 18-10.docx" in a directory full of "Notes 2014-10-18.odt"-like files.
Though I would be pretty fine with DD-MM-YYYY, at least be consistent... The only format that I really object to is the American MM/DD/YYYY format.
So the format comes from how it is read? That sounds plausible. "October 19th, 2014". I'd pronounce a date like "day, month, year" but always format it yyyy-mm-dd, it never struck me that a date format (abbreviation) should be possible to pronounce without first parsing it completely. It does make sense, I just never thought about it.
A reasonable (i.e. sortable) date format is 2014/10/19 but that would make sense in spoken form only as "2014, October 19th" which doesn't sound quite right.
A sidenote: in northern Europe (Germany, Sweden, etc.) we do it wrong on addresses: [Street, Number, PostalCode, City/Region, Country]. Going from smallest to largest context is actually done correctly in English [Number, Street, PostalCode, City/Region, Country]. You can sort addresses lexically!
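The sortability argument in one line: with big-endian dates, plain string order is chronological order. A quick Python check:

```python
iso = ["2014-10-19", "2013-12-31", "2014-01-02"]
print(sorted(iso))
# ['2013-12-31', '2014-01-02', '2014-10-19'] -- chronological for free

ddmm = ["19-10-2014", "31-12-2013", "02-01-2014"]
print(sorted(ddmm))
# ['02-01-2014', '19-10-2014', '31-12-2013'] -- the day field dominates,
# so the order is not chronological
```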
So, this is actually more about "your writing system, and legacy encodings of it, suck."
While we do still have to support legacy encodings, increasingly we should just be using a universal encoding that doesn't have all of these problems he mentions. UTF-8 is such an encoding; it is ASCII-compatible, contains no nulls, has the usual ASCII control characters encoded the same, and has no shift sequences; if you jump to an arbitrary byte in the middle of a text, there is a maximum number of bytes you need to go forward or backward to find a character boundary, and you can do so with no ambiguity, etc.
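That boundary-finding property falls out of the bit patterns: UTF-8 continuation bytes always look like 10xxxxxx, so from any byte offset you can resynchronize by scanning back at most three bytes. A small Python sketch:

```python
def char_start(data: bytes, i: int) -> int:
    """Back up from an arbitrary byte offset to the first byte of
    the UTF-8-encoded character that contains it."""
    while i > 0 and (data[i] & 0xC0) == 0x80:  # 10xxxxxx = continuation
        i -= 1
    return i

data = "naïve".encode("utf-8")   # 'ï' occupies bytes 2-3
print(char_start(data, 3))       # 2 -- backed up to the start of 'ï'
print(data[char_start(data, 3):].decode("utf-8"))  # ïve
```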
And a large portion of the writing system suck rant has to do with Chinese (hanzi/kanji/hanja) characters, because there are just so many of them and standardizing them is a fairly difficult process. And it's true; that writing system does suck. It may have some good points, but it's amazing that it's still in use, it would be enormously economically beneficial to move to a simpler, easier to learn writing system.
Of course, if you did that and people no longer learned the older, much more cumbersome and complicated system, they would lose access to a large number of older cultural works which do use it, so it's a pretty tough sell.
Just to note, Korea actually does use a much simpler, more efficient writing system for the bulk of their text, but they do have a few words that they still use Chinese characters for. One of the reforms in North Korea was to abolish the use of Chinese characters entirely, using only the much simpler hangul writing system. His complaint about Korean is more due to how hangul is encoded in Unicode, which could be considered complex or clever depending on how you look at it; in any other sense, though, hangul is a great, very efficient and easy to deal with writing system.
These are human languages created and used by humans, they weren't planned, they just happened. They get used independently with loosely applied rules. I'm not really sure what the point of this article was. Was it to state that human language != programming language? Thanks for the revelation. I will surely stay posted to receive further brain-burstingly deep insights.
Yep, languages are not made to be easily encoded in binary form.
Also, symbol systems are just tiny bits of what a human language really is. In regards to grammar/semantics and phonetics for example, English is painfully loose to deal with in a programmatic way. That's just the way things are.
That said, I believe Unicode's approach to inclusion of new characters in the set is wrong. It shouldn't be so dynamic. Unicode needs a smaller base, a standard extension system and standard extensions (in a standard extension format, reviewed in long but predictable timeframes, lets say every 10 or 20 years). The current system is a pain.
I think the point was just to show all the complications that can appear representing human language for a computer to understand.
I don't think the author actually thinks that languages suck. Though I can't speak for him, he seems to be too knowledgeable about languages not to actually appreciate them.
> Russian has a very simple alphabet, strictly phonetic.
I don't speak Russian but I know enough to say it's not "strictly" phonetic in that there is not always a 1 to 1 mapping of letters to phonemes. For example as an outsider looking at the language I get confused at things like the "г" in "его" being pronounced as a "v" and not "g".
In terms of phonetic writing, I personally really like Spanish. Its writing seems the most phonetic amongst languages I have exposure to. (Admittedly limited). With the accent marks you can know the stress of every word even if it's one you haven't seen. You can adjust for regional accents based on spelling ("s" and "z", "ll" and "y" kept distinct even if a lot of speakers say them identically.) There are quirky rules about how a consonant sounds if some vowels come after it but at least it's consistent.
I speak Russian and Spanish, and was about to write the same thing. Russian is indeed not strictly phonetic, though it has a phonemic orthography to a degree. Aside from the 'v'/'g' sound, there are bigger problems with vowel reduction, where a non-stressed 'o' most of the time sounds like 'a', with omitting 'ё', and so on. After struggling with English and Russian, Spanish is indeed a delight to read/write. Though the obligatory accents make writing a bit cumbersome.
The o/a thing is not totally intuitive to my foreign ears but I'm willing to forgive it as a phonetic feature of the language and less to do with spelling. (I'm reminded of how older people in my [American] family used to say words like "potato", or certain northeastern US pronunciations of words like "Florida" - I guess I have little trouble seeing how o/a can be a flexible concept for some.) Similarly I'm sometimes not sure when "e" becomes "je", but after listening to people speak I think I get something of an intuition...
The problem with o/a is that in some dialects they indeed pronounce 'o' instead of 'a'. Pronouncing 'a', or 'akanye', as it's done in Moscow, became the right way and was imposed through the education system throughout the Soviet Union. Hence, not doing so may be perceived as a sign of a low education level/social status.
This makes me wonder what effect printing had on English. English used to be chock full of ligatures and special forms for letters (s and ſ, for example). I can only imagine that as a typesetter and type maker this was all very annoying and people simply started to drop them. However, WP claims [1] ligatures made typesetting easier and it was the introduction of sans-serif fonts in the 20th century that finally drove them away. I'd be interested in a more definitive history.
In either case, by the early 20th century, English (and most Latin-based writing systems) were in pretty good shape for computerization, and the limited-capacity computing systems were in turn kind of built around English orthography.
Nevertheless, most of these complaints would be true of English as well if we had carried through with our traditional system of writing. One can almost think of Hangul as a 100% ligature-based writing system... there are fairly complex rules about which letters can be typed next to which other letters to produce a valid character.
In Asia, it took longer for general-purpose computing to catch on because of both memory and high-resolution display requirements. 7-bit characters displayed in an 8x8 on-screen block weren't going to cut it. It took several generations of advancement before computers that could handle general-purpose Japanese properly went on sale. For Japan, a culture that prizes tradition in many ways, the idea of just throwing away 2 of the 3 written systems in order to cram writing into barely functional computing systems, and then adopting that convention culture-wide, would be unthinkable.
Japan instead went full-tilt on a paper-based system, and computerization was focused on scanning and transmission: fax machines, scanners, high-speed telecommunications, etc. Even today, the typical Japanese office is a paper-filled minefield full of paper-handling devices the typical American office hasn't seen since the early 1980s.
More importantly, there are reasons the Japanese use three (really four, if you count romaji) scripts. Something written in katakana gives it an element of "foreignness", a bit of semantic hinting about its origin and how it should be considered. Imagine you're 11 years old, about to play the role of a Western knight in a video game; the game asks you to input your name but only offers you katakana - this feels more immersive to you. We have ways of trying to do the same in English, but they're far more limited.
Before computing took off, handling Asian languages was always a chore. When telegraphs were common, Chinese was encoded using a complex telegraph code [2]. This code was so useful that it's still used today in some places. During the cultural revolution, China had a moment where they could have introduced a simpler orthography, Zhuyin Fuhao, that's easy to learn to read and write (only slightly more complicated than Korean) and has full tonal representation. But it does a poor job disambiguating the tremendous number of homophones in the highly monosyllabic dialects of Chinese. It also doesn't provide a unified representation for words across dialects (the names "Wu" and "Ng" might be the same Chinese characters, but would be written differently in Zhuyin Fuhao). This would reverse thousands of years of cultural unification via written scripts previously enjoyed by China's many ethnicities and dialect speakers. (Fun fact, Christianity was first spread to Korea through Korean diplomatic missions to China, where Korean scholars communicated with Chinese converts through writing only as they didn't speak each other's language). These tradeoffs were simply unacceptable and instead China simply tried to reduce the number of strokes and characters in their writing reform.
So does all this suck for computing? Yeah, sure. But there's lots of weight in the semantics providing the why behind things being the way they are. China can't just adopt a simple-to-compute writing system any more easily than English speakers could adopt a complex logography. Or than English speakers could adopt a sane spelling system (the reasons why our spelling is so horrible are similar to many of the historic reasons Asian languages have hung on to their overly complex orthographies).
If he's complaining about these things and pointing to English as an island of relative sanity, he should be glad he doesn't have to support older versions of the English language and orthography!
HWÆT, WE GAR-DEna in geardagum,
þeodcyninga þrym gefrunon,
hu ða æþelingas ellen fremedon!
Ligatures and letterforms like the "long S" (ſ) actually arose as a result of printing, interestingly. If you look at 13th century manuscripts created in Norman England, they're extremely legible and regularized. The very earliest printed texts ("incunabula") produced in England tried to mimic this medieval blackletter script, and are actually quite easy to read if you get used to them. By the Elizabethan era, many typesetters were switching to Italian-derived forms that they thought looked more distinct and elegant in a letterpress context (serif fonts like Garamond). German printers kept working with the medieval style, however, giving rise to the Fraktur script that we now tend to associate with WW2-era Germany.
This is a great post, I'm really enjoying reading it.
I remember first seeing the long S as a child on a field trip to a local colonial village and being truly confused by it. It seemed so weird and unnecessary. It wasn't until many years later I even learned the rule for when it was used (middle of a word).
My field trip or his much more excellent post? I submitted his post just in case that's what you meant.
My field trip was a pretty boring, normal school affair. If you grow up in one of the original 13 U.S. colonies you'll probably visit some former colonial stuff sometime during school. IIRC I first noticed the long 's' at Colonial Williamsburg, probably in some reproduction early newspaper or something.
Mechanizing Korean: The Evolution of Korean Typewriters is interesting thought-fodder along those lines. This LWN comment thread touched a little bit on the effect of printing on scripts, as well how different means of input, such as cellphone keyboards, interact with different scripts: http://lwn.net/Articles/607599/
I'd be really interested in reading that, but my google-fu can't seem to turn it up (I can find the author, Dr. Kim, but not the book). Any ideas where I can find it?
This was originally a presentation in English at a conference in the Netherlands, but it was also published in a Korean academic journal, and there used to be a PDF of the English version floating about ... but I can't turn it up anymore, either :/. I've fired off a mail to the author asking for help; if I get a response I'll edit/reply.
>But the winner (loser?) here is Serbian, which in some places uses both Latin and Cyrillic characters interchangeably!
When I was in Sarajevo, once in a while you'd see writing with Latin characters, once in a while with Cyrillic characters, and I also saw writing in Arabic. All on the same street. Also, Bosnian used to have a third alphabet - Arebica, a variant of the Perso-Arabic script.
Also the license plates in Bosnia and Herzegovina only contain the letters that are in both the Latin and Cyrillic alphabet - A, E, O, J, K, M, T.
This is a non-technical point, but English is broken on another level. As a Finnish person I find it ridiculous when people have long discussions about how to pronounce something. Shouldn't it be immediately clear to everyone how something is written or pronounced? Those things should be 100% directly exchangeable. If I just make up a word and say it, Finnish people can write or read the word without any problems. Also, all kinds of exceptions to generic simple rules should be dropped, just to simplify the grammar.
This article could be summed up with "My tools are built with (primarily) English in mind and that's a problem for me when dealing with languages other than English".
That guy is in for a bad surprise about how English 'has simple sorting rules' when someone expects names such as 'Saint Thomas' to sort in the same order as 'St. Thomas', and 'St.' may also stand for 'street' depending on the order, so that 'Thomas St.' should compare equal to 'Thomas Street', while 'Thomas, St.' is probably referring to 'Saint Thomas'...
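To see how fast this gets hopeless, here's a naive, entirely hypothetical normalizer for exactly those examples; it already mishandles 'Thomas, St.':

```python
def expand_for_sort(name: str) -> str:
    """Naive heuristic: a leading 'St.' means 'Saint', a trailing one
    means 'Street'. Real names defeat this almost immediately."""
    tokens = name.replace(",", "").split()
    expanded = ["Saint" if (tok in ("St.", "St") and i == 0)
                else "Street" if tok in ("St.", "St")
                else tok
                for i, tok in enumerate(tokens)]
    return " ".join(expanded).lower()

names = ["St. Thomas", "Thomas St.", "Saint Thomas", "Thomas Street"]
print(sorted(names, key=expand_for_sort))
# But 'Thomas, St.' (the saint) would wrongly expand to 'thomas street'.
```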
Your colonial approach to understanding diverse forms of human to human communication sucks. Like seriously, ASCII is supposed to be global? Lazy at best.
We have lots of shitty representations. Try delving into DateTime sometime. Korean (Hangul) does not suck, btw.
The languages have been around hundreds of years, how long has your machine representation been around? Give it time. It just shows the software is still immature relative to the problem.
Even if it's tongue-in-cheek, I take issue with the general sentiment. If supporting languages spoken in the real world sucks, then we need better approaches to dealing with text. Not more whining about how unicode is hard.
> 16. Hyphenation in your language is context dependent.
English, your rules for deciding how to hyphenate words are JUST HORRIBLY BROKEN! Hyphenation should at most depend on the word to be hyphenated, not on the semantics of the whole sentence (English splits "proj-ect" the noun differently from "pro-ject" the verb, so a hyphenator has to parse the sentence).
English still has a specific ligature that's treated specially if things are done by the book: "Mc". All Rolodex/card catalog systems included a special position for it and they were sorted separately.
Regarding point 7, that the collation order is context-dependent: does a title that starts with T come before or after a word that starts with U? What if it starts with "The"?
The worst part is that he's just talking about things like collation, that are relevant to the libc API. Add displaying that text to the equation and the list explodes.
The article does mention RTL languages (Arabic, Hebrew, Farsi?) They are kind of annoying to deal with especially when you have multilingual text. (Google "Unicode bidi") And UIs have to be mirrored.
So, if someone has to develop a text (English) --> voice, or voice --> text (English) application, how many special cases would the developer have to handle?
This fairly reeks of tone-deafness.