Power Of Plain Text

Original assertion: storing resources as plain text makes them easier to view, maintain, increases their longevity, and/or makes it easier to share information with multiple languages and systems. Popularized by the UnixWay.


Tentative pattern form

(Except the WikiName of this page isn't a good pattern name. Anyone care to suggest something better?) (I like it very much.)

As this title is also from ThePragmaticProgrammer, I submit the name of the tip, KeepKnowledgeInPlainText?.

Context: a program needs to store resources in persistent form. Common choices for this are plain-text formats and binary formats in the filesystem, or some sort of database.

Problem: the choice of a storage medium and format for persistent resources usually has far-reaching consequences on both the design and the ergonomy of an application. If we have formally expressed requirements that favor one medium or format over all others, we will let those dictate our choice; however there are situations in which the functionality we are implementing demands some form of persistent storage, but no particular choice of medium or format seems to be particularly appropriate.

Forces: Consider also: Solution: unless you have definite indications that a different format is more appropriate, consider storing resources as plain text. This is particularly useful when storing configuration information, but will also apply to application "data" in some instances.

Resulting context: during development, it is much easier to see what is happening to your stored resources. Plain-text format can be parsed or deserialized "by eye"; spotting data corruption due to your application is easy. Plain-text resources can be placed under version control, where a quick "diff" will usually narrow down why something broke.

Resulting problems: it might be harder to provide a user interface for modifying resources than if they were stored in some sort of binary format (e.g. serialized object graphs). If you have large volumes of plain-text data, and parsing the entire resource for each use is not appropriate, you will need some scheme for random access. (However, plain-text makes it easy to provide modularity in the form of include/import directives.)

If the application is used internationally, formats for numbers and character encodings may vary. Likewise the different end-of-line conventions used for text files stored on different platforms.

A plain-text format which is trying to do too much will show a tendency to degrade into a complex vocabulary, soon becoming impossible to understand for an average user or developer. Consider AlternateHardAndSoftLayers.

Examples: Advocates of plain text are divided into the "DoItRightTheFirstTime" school, which tends to produce verbose, generic, complicated, extensible plaintext formats (e.g. ExtensibleMarkupLanguage), and the "WorseIsBetter" school, which tends to produce custom, neat, concise, simple plaintext formats (e.g. WikiMarkupLanguage).


Discussion

A more comprehensive listing of pros and cons:

Benefits Disadvantages Practical tools: This is a principle advocated in ThePragmaticProgrammer.


If you like your pretty document formats, and you don't understand the dinosaurs who seem stuck in dusty old plain-ASCII, please read T. V. Raman's article "Welcome To The Universe Of Fancy Colored Paper!" [http://www.cs.cornell.edu/Info/People/raman/publications/colored-paper.html], "an attempt to encourage the use of interchangeable document encodings."

(Or is this more about TheRealStrengthOfXml?)


Good reasons to use Plain Text Good reasons to use binary codes Bad reasons to use Plain Text Bad reasons to use binary codes
I don't agree with this. As a consultant in a more technical field of work I find the amount of work I have to perform to do certain text editing things in functional or technical specifications is a lot more in MicrosoftWord than vi. The power of the VimTextEditor is the ability to do certain repetitive tasks with the speed of lightning. Word's mouse approach takes a lot more time.

binary codes are always smaller than plain text?

Not really. Far too many people design special, custom binary formats that are larger than compressed text files.

Yes, a .DWG ("efficient binary") file is shorter than a .DXF (more text oriented) file representing the same image. But DXF is best because .DXF.ZIP files are shorter than either one (even shorter than .DWG.ZIP).

Yes, a .pdf ("efficient binary") file is shorter than a .ps (plain text) file representing the same document. But .ps is best because a .ps.gz file is shorter than either one (even shorter than .pdf.gz). See http://www.complang.tuwien.ac.at/~anton/why-not-pdf.html and PdfSucks.

{The above is an unfair characterization. Not all forms of "efficiency" are aimed at spatial optimizations, and not all spatial-optimizations are aimed at on-disk space. Encode time, Decode time, Memory Overhead, Indexing, Partial Edit Performance, etc. can all be forms of 'efficiency' that may reasonably apply. If you look at .ps.gz just in terms of space-on-disk, it might be better. How does it compare regarding other performance features?}

I think this illustrates PrematureOptimization -- the misplaced focus on making individual parts of the binary file small interferes with the real goal of making the entire file small. If you're only allowed to look at one part at a time, the best you can do is a Huffman-style compression. But if you're able to look at even 2 or 3 parts at a time, you can often get much better compression -- LZ, LZW, and similar kinds of compression.

This is worth highlighting; if space is your concern, compression algorithms are going to do a better job of "designing" your binary format then you could ever hope to. My experience is that "compressed text with XML-type formatting" is typically still much smaller then "the plain text alone", and at best, your binary format can only aspire to such 0-byte overhead. Letting a compression algorithm do the work can result in your shiny new "binary" format having negative overhead.

Compressed text is almost never as fast or space efficient as the compressed form of a good binary format. XML is particularly inefficient. Consider the test results using a binary form of XML, versus a compressed text form: http://www.w3.org/TR/2009/WD-exi-evaluation-20090407/ . No one knows how many coal fired power plants have been built to supply the power for the extra processing required by inefficient formats such as XML.

{Are you, by chance, defining "good binary format" as one that is designed for high-performance interchange (as is EXI)? Most "good" binary formats I've written or used tend to be aimed at high-performance linear indexing, memory-mapping, partial writes (i.e. modifying specific values), and sometimes even garbage-collection, allocation, and durable transactions (e.g. binary database files). Many of them will not compress well because they are not designed for that purpose.}


Random access doesn't necessarily require rereading the entire text file. Consider the Adobe Structured PostScript conventions; the preamble contains the fonts and operator definitions, while individual pages are set off by special comments. It's easy to view a particular page in GhostScript.

Consider a well-written LaTex file; commands are defined in the preamble, and it's easy to render a particular chapter or section.

You still have to serially read through the data, parsing it as you go, to find the special comments. On the other hand, random access in a file with fixed-size records can be implemented by seeking by byte index.

Not entirely true, you can also lseek() over the sequential file and read until you see a page-marking comment, then do next lseek() based on that information. This way, you can implement a binary search.

Then use a subdirectory instead of a single file. You can randomly access every file in the subdirectory.

That depends on the file system. Many Unix file systems do a linear search through a directory for each directory look-up. This becomes a problem when you have thousands of files in a directory. This is why databases are implemented as files containing fixed-size records, or bypass the file system and access the disk directly.

-- Then don't have thousands of files in a single directory. Have ten subdirectories with a hundred files in each, or ten subdirectories with ten sub-subdirectories in each and ten files in each sub-subdirectory. Write your program's file search system so that the subdirectories are transparent (no text file when naming another needs to be aware of the subdirectory system in use). On second thought, maybe this is too complicated.

Ummm... It would seem to me that the power of binary files to allow random access comes entirely from forcing fixed length records. The same can be achieved using plain text as well, it's merely a matter of providing padding. Further, for the case of variant-size records, databases don't handle those too well, while plain text can. Above and beyond those arguments though is the fact that if random access is an important requirement, it can be designed into plain text file formats as easily as binary file formats. I could choose to divide up my file into fixed length blocks and create a plain text b-tree representation. In fact, if speed of access was really an issue, I'd do the same thing people do with databases, i.e. have index files optimized for speed of search and access. Indexes, by virtue of being a controlled subset with a specific purpose, can be designed to be plain text with fixed length fields. The only difference would be that current binary index formats probably encode file offsets into the database as a 4-byte binary integer representation, and the proposed alternative would use a 10-character text integer representation. -- AnonymousDonor


random access of text files: email

Some email software keeps data in 2 completely different kinds of files, gaining the benefits of both Plain Text and random-access data. All the real data is in the "mailbox" files. If the program ever suspects that something smells fishy (say, the file modification date of the "mailbox" file is later than the file modification date of the corresponding index file), then the program deletes the index file. If I'm moving/archiving the mailboxes, I can delete the index files.

Whenever the program needs the index file I just deleted, (for example, when I ask it to search for something), the email program regenerates it from the corresponding mailbox file (this takes a tiny bit longer than just searching the mailbox file directly, but then the *next* time I do a search, the index file will make that search run much faster).

EditHint: Surely there's a less verbose way to say this. What was my point again? -- DavidCary

For more details on handling megabytes of email blazingly fast, and the importance of keeping *all* the data in plain-text mbox files (although we might *cache* some of that data in binary summary files), see "mail summary files" by Jamie Zawinski http://www.jwz.org/doc/mailsum.html


If you apply the AlternateHardAndSoftLayers methodology (and you know you should), then a suitable PlainText format might simply be your soft language. Think of Emacs, where the configuration file simply consists of elisp code. This has several additional advantages: LuaLanguage was also designed with this in mind - the compiler has been optimized for reading large files of textual data. -- ScottVokes


LessIsMore -- MatthewTheobalds


However,

We should also be open to the possibility that the traditional 7-bit AsciiCode might be preventing us from moving forward. For example, I used to use Nisus (a Macintosh word processor that stored text in one fork, formatting commands in the other) as a source code editor, and found the ability to annotate source code with bold, italic, even colors, did wonders for source code. Which got me to imagining other, more automated, uses for the extra bits. For example, imagine a programming language in which recent changes 'automatically' changed color to glow red hot, and 'automatically' "cooled" to black over time.

Also, experience with getting X-windows, Apache, and JServ configuration/properties files to do the right thing make me wonder if such files aren't the problem instead of the solution. I feel far more confident changing real code in a programming language like Java or Perl than I do changing properties files, if only because languages tend to do a better job with error messages than property file readers do. Maybe wider adoption of XML will help, but I tend to doubt it.... Tomcat's XML configuration process is about as fragile as JServ's property files in my experience.

Finally, I find the command completion stuff in the VisualAge for Java editor extremely powerful compared to programming in an ASCII editor like vi or emacs. -- BradCox

I was about to comment that the NeXT/OPENSTEP toolchain compiled .rtf files, then I saw the signature and figured you already knew that :)

I believe Tomcat's configuration fragility is due to poor error detection and handling.

Emacs' command and text completion facilities are powerful, fairly consistent and pervade the UI. Unfortunately they're not obvious when you're accustomed to typing things in full. -- MatthewAstley

Brad, Emacs can already do something a little like your cooling suggestion, but it's a general text feature, called highlight-tail, which colorizes the most recent text, and as time passes, slowly shifts the color towards the regular text color. I got my copy of it from http://groups.google.com/group/gnu.emacs.sources/msg/b50d53424e7ca0cd?output=gplain and it's worked well for me. --maru dubshinki

[Regarding confidence in "getting X-windows, Apache, and JServ files to do the right thing": the solution here (unless it's dead simple of course, maybe even then) is to write config files in a RealProgramingLanguage?. EmacsEditor and AwesomeTheWindowManager do this, for example (with Lisp (EmacsLisp) and LuaLanguage, respectively).]


discussion of changing preferences using "preference editor" vs. "text editor" moved to PreferenceEditorPattern


I'd say that PlainText is even more important for ProtocolDesign? than for storing data.

Mistake #1: Newbies at protocol design frequently jump straight into binary representations thinking that bandwidth must be conserved at all cost, whereas if bandwidth is a concern the protocol will be far better off if it is designed to use plain text but compressed before going on the wire.

Mistake #2: Those who have designed programming language APIs frequently design a protocol the same way, and then send '0' or '1' instead of 'true' and 'false'. Now, a protocol and an API are very similar constructs, but the differences are enough to lead designers easily astray. Protocols must usually be supportable by many platforms.

Reasons for designing protocols in PlainText and using expressive language:

-- LisaDusseault

This depends on the protocol, and how its used. If you're dealing with something like IP, UDP, TCP, ICMP, DNS, ARP, ... that are frequently manipulated in hardware, then its important to use a good representation from which hardware can easily extract information. When dealing with fixed size frames other details are important. For example, protocols that are framed by IP will place important identification information in the first 64 bits (e.g. port numbers, sequence ID) because ICMP error notifications include these first 64 bits in their packet.

However, in any protocol design, the abstract sequence of events and state machines can, and should, be verified independently of the representation. -- DaveWhipp

You better believe that bandwidth is an extremely important part of communication protocol design and you will never find any ASCII text compression technique that will come within an order of magnitude of the level of efficiency in using binary codes. Aside from dedicated office LANs, most communications media suffer the problem that as the bandwidth needed for a channel increases, the number of channels available must decrease. -- WayneMack

Can we please differentiate between WireProtocols? and ApplicationProtocols?? All of the protocols mentioned by Dave are WireProtocols?. The PowerOfPlainText for application protocols is amply demonstrated by the RequestForComments. Having written a couple of application protocols for toy applications, I can testify to this. To appeal to authority, I recommend people have a look at BXXP, the new meta-protocol(application) designed by MarshallRose?. -- AndraeMuys

you will never find any ASCII text compression technique that will come within an order of magnitude of the level of efficiency in using binary codes.

I found one over at ItsTheLatencyStupid.

In the example given, sending the ASCII text "YesNoYesNo?" takes 102.4ms. Sending a single binary bit (the fastest possible binary code) takes 100 ms.

Yes, bandwidth is important, but latency is even more important.


Consider the WindowsRegistry? for an instance of moving away from KeepKnowledgeInPlainText?, in a misguided attempt to solve a problem with the latter pattern. Worked great, eh?

There is nothing wrong with the idea of the registry - it does what it should (provide fast access to configuration variables). Where Microsoft got it wrong was in keeping the native form of the registry in something other than plain text. Microsoft could have kept the on-disk representation of the registry as plain text, and kept the in-memory representation in whatever form facilitated fast access. This would then require rewriting the registry when it changed.

[NOTE: this would be pretty slow, too. But that's just Our Good Buddy Windows for you, eh?]

It should be noted that Microsoft does provide a means to export and import the registry in a textual form. And there are third-party tools that do this as well. Microsoft also supports updates to the registry that are encoded in a textual form. Here's an example of a ".reg" file's format:

 REGEDIT4

[HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Java VM\Security] "EditCustomPermissions"=hex:01,00,00,00

Yup, that's perfectly BassAckwards.

[It would be kind of bad to have the registry floating around as something readable on NT or 2000. Not very secure. Not like the registry was the best idea re: security to begin with.]

But the registry is already readable through either tools like regedit or a simple program using documented API calls. The binary format is only trivially more secure than a text version would be.

Eh? Who said anything about security? Binary provides absolutely no security improvements over PlainText; in fact, the reverse is often the case, as revision control and versioning are much harder problems for binary formats then text. Security for configuration files is a function of a) FileLevelSecurity? in the file system, and b) cryptographic protection.

I think the distinction here is that in NT, the registry file is maintained by the operating system, and security can be set on different parts of it given that non-disaster-recovery access is exclusively through the registry API. Therefore it really doesn't matter what it is behind the scenes... -- MattBehrens

[The problem with the registry is that it's a big black hole just like the windows/System directory. One of the main problems is readability and accessibility. Half of the things in the registry we don't know what are for unless you sit there and study it for long periods of time. Sometimes the registry is used to hide things like security keys because of this fact of it being a black hole. People take advantage of the fact that it is a black hole. It's not a debate of whether or not it is a black hole since people have proven it useful to hide things in. Unfortunately, it's disadvantages are used as an advantage by commercial developers.]


The problems with the Windows Registry are not because of it's in-system representation. If it were all text, it would still be a huge pain. The problems are multiple: PowerOfText? wouldn't be necessary if there was a solid universal scripting system. The argument for text is always "I can script it" - well, if I could pop open a console and say "help(myapp)", review the api, and then say "foreach doctype in myapp.settings.doctype doctype.showicon := false" I would be a damn sight happier than I would be with a text editor. Windows tried to do this with the Windows Scripting Host, but we all know what a fiasco that was. I guess I've been spoiled by Python, where the lines between a module, data, program, and documentation are pleasantly fuzzy.


I strongly oppose the use of binary configuration files, even ones (suggested above) created by serializing/picking objects in Python or Perl, or via some gory C++ solution.

Things I love about text files that were mentioned above, such as editing in an editor, I will not rehash, but I didn't see anyone point out that verification of the correctness of the contents of a text file can be done more easily than the verification (visually, in an editor or some other tool) that the contents of the file have not GoneToHellInAHandbasket.

Who wants to tell us campfire stories about RegistryCorruption?, HackingDbfFilesManually? or any other binary recovery nightmares?

I use CommaSeparatedValues files for flat tables which are small enough to fit in memory all at once.

I keep SQL syntax formatted dumps of all my SQL database tables, that can be executed as scripts, to repopulate my databases if they are corrupted.

When using Python, I use pprint (pretty printer) to dump out my data dictionaries, which can be saved to text files, and parsed (with a small overhead), using the eval() builtin, except where security/trust issues prevent me.

When using results of queries on databases locally, I keep the query results locally in CSV or XML format, in memory inside components constructed for the purpose, or on disk. If I am ever worried about system state, I can dump everything to a a temp directory, zip it up, and I have a record of my entire application state, including lookup tables, etc.

Text is wonderful. Long live text. WPostma.


Proposition:

PlainText : Power Unix User :: MicrosoftExcel : GUI-minded office worker

A plain-text file (CommaSeparatedValues, Space-Separated Variable, or TabDelimitedTables) can be edited using emacs, Notepad, or Excel.


(comments on preference editors moved to PreferenceEditorPattern)
See WorseIsBetter, YamlAintMarkupLanguage, PragmaticProgrammer, UnixWay, TheRealStrengthOfXml


Good points, but with regard to the portability issue of RDBMS, it is often quoted but never in my experience actually an issue. Have any of you ever heard of a non-trivial database being switched from one RDBMS to another? Once a company has invested in an RDBMS, they are exceedingly unlikely to change. -- JamesRadvan?
I've noticed that the iPod stores contact, calendar and note files as plain text. Because the iPod mounts as a drive, I can just "build" the set of text files I want to take with me. I used OutPod? to convert my outlook contacts folder into a set of (plaintext) vCard files, and my calendars into a set of iCal files. (Obviously I deleted the contacts and calendar folders in outlook because of OnceAndOnlyOnce.)

I like this because: I got tired of all the intricate, proprietary addressbook and calendaring systems out there for small gadgets, so it's really nice that the iPod uses iCal, vCard and plain text notes.


I've been considering switching from CVS to Subversion, and the fact that Subversion does not use plain text is the one thing which I really weigh against the benefits.

I've been using Subversion for about a year now, and haven't lost any data. And you can always get a plaintext dump of your repository.


Notice how almost all program source code is PlainText and usually edited with an unrestricted text editor. (Exception: MicroSoft VisualBasicForApplications presents a modal error box when syntax is violated.) I could imagine a structured editor for C++. Would it be useful? pleasant to work with?


The primary problem with plain text is that there really is no such thing anymore (if there ever was). For starters, there are encoding issues, but let's assume that you use one encoding (e.g. UTF-8) everywhere.

Problems start at a very fundamental level if you want to keep things simple enough for most people reading/editing your files to accurately mentally model the format and not get bitten by stupid things (such as invisible characters being significant). Invisible characters are e.g. white space at the end of a line or white space not recognized as white space by the software in the middle of white space that is.

Unicode 4.0 has 26 white space characters, some of which are produced simply by pressing spacebar in some input systems. (CJK, their characters are of fixed size and take more screen space than Latin characters, so they have their own space character) ALT+space produces \xa0 (non-breakable space) on many systems, and it looks just like a space. Most software doesn't consider it as white space, which results in confusion. When some software does and some doesn't, there's even more confusion (similar to broken HTML "just working" until you try to look at the page using your mobile phone's browser). Not surprisingly, many popular formats don't get white space quite right.

There are many end of line conventions, and XML 1.1's got patched because 1.0 apparently didn't get them right enough. Python has significant white space, and it's commonly accepted that mixing spaces and tabs when indenting is bad due to different tab widths in different editors. Mixing them is allowed by Python and is considered to be a misfeature. Makefiles allow tabs only, which is considered brain-damaged (and annoying) by many. (see TabsVersusSpaces).

When people see white space, they often expect it not to be significant, as it's sometimes impossible to visually distinguish from lack of characters (at end of line or end of file). They also often expect different white space characters to have the same meaning because they are hard to distinguish from each other. Programs, however, have problems in determining what exactly is white space, and of what kind, and it can be expected that many programs' idea of white space is " \t\r\n", and assuming anything else will cause compatibility problems. Even this set of white space characters isn't safe, though. (carriage return and line feed handling is broken in programs on various platforms, make barfs on spaces, YAML parsers barf on tabs)

When people see plain text, they expect different things depending on their experiences. Programs that process plain text expect wildly different things, and as many of them are being patched to work with multiple encodings their behavior becomes even more unpredictable (and dependent on the locale) due to white space, sorting order, etc. sometimes depending on the locale. In contrast, when you have a binary format, nobody usually expects anything, so you get to make its specification clean and concise.

No wonder many projects just use XML. All the more or less arbitrary decisions are already made for you by a more or less arbitrary decision process, but at least the format is in common use and well defined. Yet white space haunts us even here: non-validating parsers can't tell the difference between all significant and insignificant white space, so they just give them all to you. The problem is much simplified, though, and you can always do a validating parse. But is all the complexity of the XML really necessary for most things? And how is e.g. YAML any simpler, the spec is rather long? How do you know any random YAML parser is correct, given that YAML is used much less than XML and the parser implementations are moderately complex stuff?

OK, so that was kinda long :-) But this page depicted too rosy a view of plain text formats so I had to complain a bit. I myself like plain text, but would like people to standardize on something simple, unambiguous and not obviously broken (backwards compatibility be damned, having a zillion different end of line markers is just drain-bamaged (BrainDamage)).


Type-free and to a lesser extent dynamic typing sort of borrows many of the ideas of PowerOfPlainText. It is easier to transfer data between systems if the source and destination are not as picky about types. Fiddling with type conversions between databases often consumes a lot of tedious busy-work time. For example, validation must happen before the load process if you use required types. However, type--free allows one to load stuff first, and then validate it after loading. This can simplify a lot of validation algorithms. (Related: DynamicRelational)

(Somebody has been screwing with my handle, perhaps out of anger for use of the word "anal", so I removed it and the handle to avoid further edit-wars. Next time, please ask.)

What The World Needs, Next:

It seems to me, that Computer Kings need to mark their Carriage Returns on all Computer Keyboards, better, in order to reduce the confusion to Printers, who seem to guess if the typist wants a Single Spaced Carriage Return, or a Double Spaced Carriage Return, which often results in a Random Spaced Carriage Return in most my perfect e-mailed letters.

Actually, some computers THINK that my Carriage Returns are Paragraph Markers, which causes a lot of headaches to many, like me, who type with only two fingers, and who don’t need “Default’s” help, and find it extremely difficult to type an Upper Case Carriage Return in order to get any Carriage to Return, on most Computers.

So, I hope someone creates Dictation Software, soon, which types as fast as me, with no more Carriage Return errors!

James H. Armistead (age 478), or Shakespirit@gmail.com 1200 W. 4th Street, Apt. 203 Centralia, IL. 62801 (618) 292-0129

FacePalm


See Also: UnixWay, CrossToolTypeAndObjectSharing

EditText of this page (last edited November 26, 2013) or FindPage with title or text search