Proposal for a standard plain text format for iOS documents

Since we last visited the matter of working with documents on iOS, I have read with great interest more write-ups by people describing how they work on the iPad. Of course it is always good to hear about people being able to do more and more things on the iPad, but I also read them because (as John Gruber so astutely noted) Dropbox almost always seems to be involved. I don’t feel that Dropbox solves the external infrastructure problem I raised in my first post on the matter: I consider Dropbox external infrastructure as well, if only because it requires you to be connected to the Internet merely to transfer documents locally on your iPad (and that’s not a knock on Dropbox, mind you; this is entirely the doing of restrictions Apple imposes).

I am going to concede one advantage to the current de facto iOS model of documents in a per-app sandbox, with an explicit container for document interchange next to it: it forces apps to actually consider supporting interchange document formats. With the Grand Unified Model, and whatever we call the model Mac OS X has used since Snow Leopard, applications would at first only concern themselves with creating documents to save the state of the user’s work so that it could be picked up later, without concern for other applications. When the authors of those applications later came to consider standard formats, or at the very least an interchange format free of data that is of no interest to another app (e.g. which tool was selected at the time the image was saved) or that amounts to implementation details, they would realize that other applications had managed to slog through their undocumented document format to open it; as a result the authors did not feel much pressure to support writing another document format. The outcome is that the onus of information interchange falls only on the readers, which need to keep adding support for anything the application that writes the document feels like adding to the format, in whatever way it feels like doing so.

However, with the de facto model used by iOS, apps may start out the same way, but when they want to claim Dropbox support, they had damn well better write documents there in a standard or documented interchange format, or their claims of Dropbox support become pretty much meaningless. I am not sure the tradeoff is worth the loss of being able to get at the original document directly as a last resort (in case, for instance, the document exchanged on Dropbox is missing information compared to the document kept in the sandbox), but it is indeed an advantage to consider. An issue, though, is that as things currently stand there is no one to even provide recommendations as to which standard formats to use for exchanging documents on iOS: Dropbox the company is not in a position to do so, and as far as Apple is concerned document interchange between iOS apps does not exist.

So when I read that the format often used in Dropbox to exchange data between apps is plain text, while this is better than proprietary formats, this saddens me to no end. Why? Because plain text is a lie. There is no such thing as plain text. Plain text is a myth created by Unix guys to control us. Plain text is a tall tale parents tell their children. Plain text is what you find in the pot at the end of the rainbow. Plain text is involved in Hercules’ labors and Ulysses’ odyssey. Perpetual motion machines run on plain text. The Ultimate Question of Life, the Universe and Everything is written in plain text.

I sense you’re skeptical, so let me explain. Plain text is pretty well defined, right? ASCII, right? Well, let me ask you: what is a tab character supposed to do? Bring you over to the next tab stop every 4 spaces? Except that on the Mac tab stops are considered to occur every 8 spaces instead (and even on Unix not everyone agrees). And since we are dealing with so-called plain text, the benefit of being able to align in a proportional context does not apply: if you were to rely on that, then switch to another editor that uses a different font, or switch the editing font in your editor, then your carefully aligned document would become all out of whack. Finally, any memory saving brought by the tab character has become insignificant given today’s RAM and storage capacities.
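To make the tab problem concrete, here is a small sketch in Python (my choice of language for illustration, nothing in the proposal depends on it). It renders the same two tab-separated lines with tab stops every 4 and every 8 columns, using the standard `str.expandtabs` method; the sample rows are made up.

```python
# The same bytes, displayed under two different tab stop conventions.
rows = ["id\t1", "name\t2"]

for tab_width in (4, 8):
    print(f"tab stops every {tab_width} columns:")
    for row in rows:
        # expandtabs pads each tab with spaces up to the next tab stop
        print("  " + repr(row.expandtabs(tab_width)))
```

With stops every 8 columns the two values line up; with stops every 4 they do not, even though the file has not changed.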

Next are newlines. Turns out, hey, no one agrees here either: you’ve got carriage return, line feed, the two together (and both ways). More subtle is wrapping… What’s this, you say? Editors always word wrap? Except pico/nano, for instance, doesn’t by default. And Emacs, in fact, does a character wrap: by default it will cut in the middle of a word. It seems inconsequential, but it causes users of non-wrapping editors to complain that others send them documents with overly long lines, while these others complain that the first guys write lines with an arbitrary limit, causing for instance unsightly double-wrapping when viewed in a window narrower than that arbitrary width.

And of course, you saw it coming, there is the character encoding: we left the 7-bit ASCII world eons ago. Everything is Unicode-capable by now, but some idiosyncrasies remain: for instance, as far as I can tell, TextEdit in Mac OS X still opens text files as MacRoman by default out of the box.
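As an illustration of what an encoding mismatch looks like in practice, the following sketch (in Python, with a made-up sample word) writes a string as UTF-8 and then reads the bytes back as MacRoman, using Python's built-in `mac_roman` codec.

```python
original = "café"
raw = original.encode("utf-8")     # the bytes a UTF-8 editor writes to disk

misread = raw.decode("mac_roman")  # what a reader assuming MacRoman displays
print(misread)                     # prints: caf√©

# Decoding with the right encoding gives the text back unchanged.
assert raw.decode("utf-8") == original
```

The two bytes that encode "é" in UTF-8 each map to a separate MacRoman character, which is why the accented letter turns into two unrelated symbols.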

This is, simply, a mess. There is not one, but many plain text formats. So what can we do?

The proposal

Goals, scope and rationale (non-normative)

The most important, defining characteristic of the proposal for Standard Plain Text is that it is meant to store prose (or poetry). Period. People might sometimes happen to use it for, e.g., source code, but these use cases shall not be taken into consideration for the format. If you want to edit makefiles or a tab-separated values file, use a specialized tool. However, we do want to make sure that more specialized humane markup/prose-like formats can be built on top of the proposal. For instance, Markdown and Textile over Standard Plain Text ought to be trivially definable as being, well, Markdown and Textile over Standard Plain Text.

Then, we want to be able to recover the data on any current computer system in case of disaster. This means compatibility with existing operating systems, or at least being capable of recovering the data using only programs built into these operating systems.

And we want the format to be defined thoroughly enough to limit as much as possible disagreements and misunderstandings, while keeping it simple to limit risks of mistakes in implementations.

Requirements

Standard Plain Text files shall use the Unicode character set, and be encoded in UTF-8. Any other character set or encoding is explicitly forbidden.

This seems obvious, until you realize this causes Asian text, among others, to take up 50% more storage than it would using UTF-16, so there is in fact a tradeoff here, and compatibility was favored; I apologize to all our Japanese, Chinese, Korean, Indian, etc. friends.
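The 50% figure comes from characters that need three bytes in UTF-8 but only two in UTF-16, which is the case for most CJK and Indic characters. A quick check in Python, with a sample string I picked for illustration:

```python
# Compare the encoded size of a short Japanese sample in UTF-8 and UTF-16.
sample = "日本語のテキスト"  # 8 characters, all in the Basic Multilingual Plane

utf8_size = len(sample.encode("utf-8"))       # 3 bytes per character here
utf16_size = len(sample.encode("utf-16-le"))  # 2 bytes per character, no BOM

overhead = (utf8_size - utf16_size) / utf16_size
print(utf8_size, utf16_size, f"{overhead:.0%}")  # prints: 24 16 50%
```

Text that mixes in ASCII (spaces, digits, Latin punctuation) narrows the gap, since those characters take one byte in UTF-8 and two in UTF-16.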

Standard Plain Text files shall not contain any character in the U+0000 – U+001F range, inclusive (ASCII control characters), except for U+000A (LINE FEED). As a result, tabulation characters are forbidden and the line ending shall be a single LINE FEED. Standard Plain Text files shall not contain any character in the U+007F – U+009F range, inclusive (DELETE and C1 range). Standard Plain Text files shall not contain any U+FEFF character (ZERO WIDTH NO-BREAK SPACE aka byte order mark), either at the start or anywhere else. All other code points between U+0020 and U+10FFFF, inclusive, that are allowed in Unicode are allowed, including as-yet unassigned ones.
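A conformance check for these rules could look like the sketch below. It is only an illustration, in Python: the function names are mine, and it assumes the DELETE and C1 range is U+007F through U+009F, which is where those characters sit in Unicode. Strict UTF-8 decoding takes care of rejecting ill-formed byte sequences and encoded surrogates.

```python
BOM = 0xFEFF
LINE_FEED = 0x0A


def is_forbidden(code_point: int) -> bool:
    """Return True if the code point may not appear in a Standard Plain Text file."""
    if code_point == LINE_FEED:
        return False
    if code_point <= 0x1F:          # C0 controls, including TAB and CARRIAGE RETURN
        return True
    if 0x7F <= code_point <= 0x9F:  # DELETE and the C1 controls
        return True
    return code_point == BOM


def find_violations(data: bytes) -> list[tuple[int, str]]:
    """Return (character offset, 'U+XXXX') pairs for each forbidden character.

    Raises UnicodeDecodeError if the data is not well-formed UTF-8.
    """
    text = data.decode("utf-8", errors="strict")
    return [
        (offset, f"U+{ord(ch):04X}")
        for offset, ch in enumerate(text)
        if is_forbidden(ord(ch))
    ]


# A file saved with a tab and Windows line endings fails the check.
print(find_violations(b"a\tb\r\n"))  # prints: [(1, 'U+0009'), (3, 'U+000D')]
```

An empty list means the bytes conform; anything else tells an editor or a test suite exactly which character to complain about.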

Standard Plain Text editors shall word wrap, and shall support arbitrarily long stretches of characters and bytes between two consecutive LINE FEEDs. They may support proportional text, but they shall support at least one monospace font.

These requirements shall be enforced at multiple levels in Standard Plain Text editors, at least both at the user input stage and when writing to disk: pasting in text containing forbidden characters shall not result in them being written as part of a Standard Plain Text file. Editors may handle tabulation given as input any way they see fit (e.g. inserting N spaces, inserting enough spaces to reach the next multiple of N columns, etc.) as long as it does not result in a tab character being written as part of a Standard Plain Text file in any circumstance.
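One possible way for an editor to clean up pasted input is sketched below, again in Python and again only as an illustration. The `sanitize` function and its `tab_width` parameter are hypothetical names; the sketch picks the "next multiple of N columns" option for tabs, folds the various line ending conventions into a single LINE FEED, and silently drops whatever else is forbidden (taking DELETE and the C1 controls to be U+007F through U+009F). A real editor might prefer to warn the user instead.

```python
def _is_forbidden(code_point: int) -> bool:
    """True for C0 controls other than LINE FEED, DELETE, C1 controls, and U+FEFF."""
    if code_point == 0x0A:
        return False
    return code_point <= 0x1F or 0x7F <= code_point <= 0x9F or code_point == 0xFEFF


def sanitize(text: str, tab_width: int = 4) -> str:
    """Turn arbitrary pasted text into text that is safe to write out."""
    # Fold CR LF, LF CR, and lone CR into a single LINE FEED.
    text = text.replace("\r\n", "\n").replace("\n\r", "\n").replace("\r", "\n")
    # Drop forbidden characters, keeping tabs for now so they can be expanded.
    text = "".join(ch for ch in text if ch == "\t" or not _is_forbidden(ord(ch)))
    # Pad each tab with spaces up to the next multiple of tab_width columns.
    return text.expandtabs(tab_width)


print(repr(sanitize("a\tb\r\nc\rd\x00\ufeff")))  # prints: 'a   b\nc\nd'
```

Note that `str.expandtabs` counts code points rather than display columns, so the alignment is only approximate for wide or combining characters; since the result contains no tab either way, the requirement is still met.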

Standard Plain Text files should have the .txt extension for compatibility. No MacOS 4-char type code is specified. No MIME type is specified for the time being. If a Uniform Type Identifier is desired, net.wanderingcoder.projects.standard-plain-text (conforming to public.utf8-plain-text) can be used as a temporary solution.

Clearly this is the part that still needs work. Dropbox supports file metadata, but I have not fully investigated the supported metadata, in particular whether there is a space for a UTI.

Appendix A (non-normative): recovery methods

On a modern Unix/Linux system: make sure the locale is a UTF-8 variant, then open with the text editor of your preference.

On a Mac OS X system: in the open dialog of TextEdit, make sure the encoding is set to Unicode (UTF-8), then open the file. Mac OS X being a modern Unix, the previous method can also be applied.

On a modern Windows system (Windows XP and later): in the open dialog of Wordpad, make sure the format to open is set to Text Document (.txt) (not Unicode Text Document (.txt)), then open the file. Append a newline then delete it, then save the file. In the open dialog of Notepad, make sure the encoding is set to UTF-8, then open the latter file.

The reason for this roundabout method is that Wordpad does not support UTF-8 (Unicode in its opening options in fact means UTF-16) but supports linefeed line endings, while Notepad does not support linefeed line endings. Tested on Windows XP.