
On delimiters

Escaping from backslash hell

OK, so you want to use simple comma-separated format to store your data.

one,two,three,four
five,six,seven,eight

Good! Simple and clean, human-readable too. You separate entries with , and rows with newlines (\n). But then you need to store a comma inside one of the values. So you decide to escape commas with \, like all decent people do.

one\,two,thr\,ee,four
five,six,seven,eight

Blech! Your parser can no longer simply split lines on ",". It must check whether each comma is preceded by \, and if it is, strip the \ instead of splitting there. But that still works.
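A rough sketch of what that parser now has to do, in Python. The name `split_unescaped` is made up for illustration, and this version assumes the only escape is \, (the backslash itself is not escaped yet). A negative lookbehind splits only on commas that are not preceded by a backslash, then the escapes are stripped.

```python
import re


def split_unescaped(line):
    """Split on commas not preceded by a backslash, then strip the escapes."""
    # (?<!\\), matches a comma only when the previous character is not "\".
    raw_fields = re.split(r"(?<!\\),", line)
    # Turn every "\," back into a plain comma.
    return [field.replace("\\,", ",") for field in raw_fields]


line = r"one\,two,thr\,ee,four"
print(line.split(","))        # naive split: five pieces, two with a stray "\"
print(split_unescaped(line))  # three fields
```

Note how fragile this is: the lookbehind only ever inspects one character, so it has no idea whether that backslash is itself escaped.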

Now you realize that \ itself can appear in the data too, maybe even directly before a comma. So, let's escape it with itself.

one\\\,two,thr\,ee,four
five,six,seven,eight

Three backslashes! Isn't that fancy? Not much has changed for your parser: you just tell it to turn every \\ into a single backslash.
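In practice "not much changed" means the parser has to walk the line character by character. Here is a minimal sketch in Python, assuming exactly two escapes (\\ and \,) and treating anything else after a backslash as an error. The function name `split_escaped` is hypothetical.

```python
def split_escaped(line):
    """Split a line on commas, honouring the escapes "\\\\" and "\\,"."""
    fields = []
    current = []
    chars = iter(line)
    for ch in chars:
        if ch == "\\":
            # A backslash consumes the next character, which must be "\" or ",".
            nxt = next(chars, None)
            if nxt not in ("\\", ","):
                raise ValueError(f"bad escape: backslash followed by {nxt!r}")
            current.append(nxt)
        elif ch == ",":
            # An unescaped comma ends the current field.
            fields.append("".join(current))
            current = []
        else:
            current.append(ch)
    fields.append("".join(current))
    return fields


print(split_escaped(r"one\\\,two,thr\,ee,four"))
```

The first field comes out as one, a backslash, a comma, then two: the three backslashes collapse into one literal backslash plus one literal comma.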

Now, come to think of it, newlines can appear inside entries too. So, let's write them as \n (and \r for those obscene OSes).

one\\\,two,thr\,ee,four
five,\\nsix,se\nven,eight

Senven? What the fuck.
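For the record, here is a sketch of the writer that produces such a line, in Python. The helper names `escape_field` and `encode_row` are mine, and the set of escapes is just the four introduced so far. The only subtle part is the order: the backslash has to be escaped first, or the backslashes added for the other characters get doubled too.

```python
def escape_field(value):
    """Escape backslash, comma, LF and CR in a single field."""
    # Backslash first, so the backslashes introduced below stay single.
    return (
        value.replace("\\", "\\\\")
        .replace(",", "\\,")
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def encode_row(values):
    """Join escaped fields with commas."""
    return ",".join(escape_field(v) for v in values)


# Second field: a literal backslash followed by "nsix".
# Third field: "se", a real newline, "ven".
print(encode_row(["five", "\\nsix", "se\nven", "eight"]))
```

So \\n in the file means "a backslash, then the letter n", while \n means "a newline", and the reader has to tell them apart by counting backslashes.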

Well, if you have done something similar so far, congrats, you are a decent man. If not, you might have used quotes.

one,"two three",four
"fi
ve",six,"seven,eight"

Quotes are nice, but you may need to store them in your data too! To avoid employing yet another character, you may want to represent a quote inside quotes by two quotes.

one,"two"" three",four
"fi
ve",six,"seve""n,eight"

Ugly and irregular.
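This is the convention Python's standard `csv` module reads by default, so the sample can be checked directly. A small sketch; the wrapper name `parse_quoted_csv` is made up, and `newline=""` is passed so the newline inside the quoted field is handed to the reader untouched.

```python
import csv
import io


def parse_quoted_csv(text):
    """Parse quote-delimited CSV where a doubled quote means one literal quote."""
    return list(csv.reader(io.StringIO(text, newline="")))


sample = 'one,"two"" three",four\n"fi\nve",six,"seve""n,eight"\n'
for row in parse_quoted_csv(sample):
    print(row)
```

Two physical lines start with a quote here, yet the file holds two records of three fields each. You cannot even find the record boundaries without parsing every quote.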

Now, back to backslashes. Imagine you want to pack all this inside another CSV entry. You get something like this:

one\\\\\\\,\"two thr\\\,ee\"\,four\nfive\,\\\\nsix\,se\\nven,e\\"ig\\"ht

Well, I made up this example, but try coding in shell (which involves exactly this kind of nesting), and you'll understand all this.
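To put a number on it, here is a small Python sketch that escapes one value repeatedly, as if it were packed into an entry, which is packed into another entry, and so on. `escape` and `nest` are hypothetical helpers using only the \\ and \, rules from above.

```python
def escape(value):
    """One level of escaping: backslash first, then comma."""
    return value.replace("\\", "\\\\").replace(",", "\\,")


def nest(value, depth):
    """Apply escape() depth times, as if nesting the value depth levels deep."""
    for _ in range(depth):
        value = escape(value)
    return value


for depth in range(1, 5):
    packed = nest("a,b", depth)
    print(depth, packed.count("\\"), packed)
```

A single comma drags along 1, 3, 7, 15 backslashes: every level doubles what is already there and adds one more.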

I Will Never Encounter This Set Of Bytes

Let's make up a bizarre, totally random string. It will never ever appear in our data, I'm assuring you. We'll start our entry with it and end with it.

%%%%%%%%%%DATA BOUNDARY srfg345632rfefh56t34freg56y43rffgmy/dev/urandomsays hello#$^#$%TR%%%%%%%%%%%
SRgwerg24yg!#RG@2365u246jh4fgb345ik54y245g56u234rgfw43r8ty2348we9fuhg309ekxc09w3fu8tu32598jf03928qrg2938rhy093rjg293riyjg92384fj8934rjhg28975y	10wejmwodkvnn32w9048hjfq 3984hf9q38hf 398rh 93q8r hg98q2hr 9g813h9rthg9 3rhf98h219hgf1923gh9	qhf91jhgh1
%%%%%%%%%%DATA END srfg345632rfefh56t34freg56y43rffgmy/dev/urandomsays hello#$^#$%TR%%%%%%%%%%%

Know what? IT FUCKING WILL APPEAR. And if you want your system not to fail miserably, you have to scan through all this data and make sure it's not there. Not worth it. And anyway, scanning for this string is rather complicated.
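Here is roughly what "making sure it's not there" costs, sketched in Python. The function name `frame_with_boundary` and the boundary layout are made up for illustration. It picks a random boundary, joins the payloads, and then has to read the whole stream back to confirm that splitting on the boundary returns exactly what went in.

```python
import secrets


def frame_with_boundary(payloads):
    """Join payloads with a random boundary that provably does not collide."""
    payloads = list(payloads)
    if not payloads:
        raise ValueError("need at least one payload")
    while True:
        boundary = b"%%%BOUNDARY " + secrets.token_hex(16).encode("ascii") + b"%%%"
        stream = boundary.join(payloads)
        # Full scan: only accept the boundary if splitting round-trips exactly.
        if stream.split(boundary) == payloads:
            return boundary, stream


boundary, stream = frame_with_boundary([b"first chunk", b"", b"\x00\xff binary"])
print(stream.split(boundary))
```

Every byte gets touched once to write it and once more to prove the "never appearing" string really did not appear, and the reader still has to search for a long pattern instead of just counting.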

Another good example (besides the HTTP multipart boundary mocked above) is CDATA, whose content must never contain ]]>.

Taboo delimiter

People will never ever need ASCII 0 in their strings! I assure you! Let's use it as the delimiter. No other options.

Bytesize delimiting

Ok, we have four bytes for that. We won't ever need more.

or...

Ok, we have a four-byte length field here. Why would anybody ever want to delimit more than 4294967295 bytes? And if a human looks at our format, we can tell him to go fuck himself.
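The ceiling is easy to see with Python's `struct` module. A sketch, assuming a big-endian unsigned 32-bit prefix; the names `frame`, `unframe` and `MAX_LEN` are mine.

```python
import struct

MAX_LEN = 2**32 - 1  # largest value an unsigned 32-bit field can hold


def frame(data):
    """Prefix data with its length as 4 big-endian bytes."""
    return struct.pack(">I", len(data)) + data


def unframe(buf):
    """Read the 4-byte length, then return exactly that many bytes."""
    (length,) = struct.unpack_from(">I", buf, 0)
    return buf[4:4 + length]


print(frame(b"hello"))           # the length 5 becomes four unreadable bytes
print(struct.pack(">I", MAX_LEN))
try:
    struct.pack(">I", MAX_LEN + 1)
except struct.error as exc:
    print("too long for the field:", exc)
```

One byte past the limit and the format simply has no way to say it. The prefix is also invisible in a text editor, which is the "human" complaint.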

Tolerable delimiting

Is implemented in JSON. It uses backslashes plus a very limited set of characters that can follow them. The format is quite readable and writable by humans, and parser-friendly. Its page also has nice graphics; I'd like to be able to make such things myself.
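A quick look at those rules using Python's standard `json` module. In a JSON string a backslash may only be followed by ", \, /, b, f, n, r, t, or u plus four hex digits. The wrapper name `encode_string` is made up.

```python
import json


def encode_string(value):
    """Serialize one Python string as a JSON string literal."""
    return json.dumps(value)


raw = 'say "hi", then C:\\temp\nnext line'
encoded = encode_string(raw)
print(encoded)
print(json.loads(encoded) == raw)
```

The quote, the backslash and the newline each get exactly one escape, the comma is left alone, and the result always fits on one line.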

(in search of) Perfect delimiting

If you need simple strings that will never contain one particular character, you can delimit with that character. But for god's sake, do not try to allow strings to contain this character in escaped form.

very long value

There is a \n at the end.
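A minimal Python sketch of this taboo-style delimiting, assuming \n is the forbidden character. The names `write_lines` and `read_lines` are hypothetical. The writer refuses a value containing the delimiter instead of escaping it, and the reader is a plain split.

```python
def write_lines(values):
    """Terminate every value with a newline; refuse values that contain one."""
    for value in values:
        if "\n" in value:
            raise ValueError("value contains the taboo character")
    return "".join(value + "\n" for value in values)


def read_lines(text):
    """Split on newlines; every value, including the last, must be terminated."""
    if text and not text.endswith("\n"):
        raise ValueError("last value is not terminated")
    return text.split("\n")[:-1]


text = write_lines(["very long value", "another, with a comma and a \\"])
print(read_lines(text))
```

Commas and backslashes pass through untouched because they mean nothing here. The price is that a value with a newline is rejected outright.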

If you need byte strings that can contain any byte, specify length before data.

64 �d��W��uu&f(�69��須��?K4{u�
�@�����Ӌ*�yT��O;��|ÑZT}����Kn�
52 �d��W��uu&�d��
��uu&�d��W��uu&�d��W��uu&
�d��W��uu&}

Lines start with \n, there is a single space after each number, and numbers consist only of the digits 0-9.
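Below is a sketch of a reader and writer for these rules, in Python. It follows one reading of them, which is my assumption: each record is a \n, then the length in ASCII decimal digits, then one space, then exactly that many raw bytes. The names `encode` and `decode` are made up, and this is not the code of any existing implementation.

```python
def encode(records):
    """Write each byte string as: newline, decimal length, one space, raw bytes."""
    out = bytearray()
    for data in records:
        out += b"\n" + str(len(data)).encode("ascii") + b" " + data
    return bytes(out)


def decode(stream):
    """Read records back strictly; any deviation from the layout is an error."""
    records = []
    pos = 0
    while pos < len(stream):
        if stream[pos:pos + 1] != b"\n":
            raise ValueError(f"expected newline at offset {pos}")
        space = stream.find(b" ", pos + 1)
        if space == -1:
            raise ValueError("missing space after length")
        digits = stream[pos + 1:space]
        # bytes.isdigit() is true only for a non-empty run of ASCII 0-9.
        if not digits.isdigit():
            raise ValueError(f"bad length field {digits!r}")
        start = space + 1
        end = start + int(digits)
        if end > len(stream):
            raise ValueError("record is truncated")
        records.append(stream[start:end])
        pos = end
    return records


blob = encode([b"a\nb c", b"", bytes(range(256))])
print(decode(blob) == [b"a\nb c", b"", bytes(range(256))])
```

The payload is never inspected, so newlines, spaces and zero bytes inside it are harmless, and the decimal length has no built-in upper bound.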

So, to summarize it: very strict format, NO escaping, taboo OR bytesize delimiting with no fixed lengths.

This kind of delimiting is implemented in my serialization format called transfer.