|
My book discusses the advantages of TLV over text-based encodings. The latter typically require more space and processing time. Readability is touted as an advantage of text, but it's often as you say.
Text's advantage is in interoperability between big and little endian systems. If that's a requirement, TLV is a non-starter unless all the fields are the same length. A protocol standard has to consider this, but a proprietary system can standardize on one endianism and use TLV more freely, although it still has to maintain protocol backward compatibility unless it's OK to shut down the entire network during an upgrade.
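To make the "standardize on one endianism" point concrete, here is a minimal sketch in Python. The header layout (2-byte tag, 2-byte length, both big-endian) and the tag numbers are assumptions made up for illustration, not taken from any particular protocol. It also shows why the length field helps with backward compatibility: an older reader can step over a tag it does not know.

```python
import struct

# Hypothetical header: 2-byte tag, 2-byte length, byte order fixed as big-endian.
HEADER = struct.Struct(">HH")

def encode_tlv(tag: int, value: bytes) -> bytes:
    """Pack one record as tag, length, then the raw value bytes."""
    return HEADER.pack(tag, len(value)) + value

def decode_tlv(buf: bytes):
    """Yield (tag, value) pairs from a buffer of concatenated records."""
    pos = 0
    while pos < len(buf):
        tag, length = HEADER.unpack_from(buf, pos)
        pos += HEADER.size
        yield tag, buf[pos:pos + length]
        pos += length

KNOWN_TAGS = {1: "name", 2: "age"}  # tag 99 below is deliberately unknown

stream = encode_tlv(1, b"Ada") + encode_tlv(99, b"\x00\x01") + encode_tlv(2, b"\x24")

# The length tells the reader how far to jump, so unknown tags are simply skipped.
known = [(KNOWN_TAGS[t], v) for t, v in decode_tlv(stream) if t in KNOWN_TAGS]
print(known)
```

Because the byte order is written into the format string, the same bytes come out on any host, whatever its native endianness.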
|
|
|
|
|
Text representation does not completely evade endianism - at least not with UTF-16!
If you consider UTF-8 an alternative: The multibyte encoding of a code point is just a compression method for an integer. You can use that for the Tag and Length fields as well - that could save you a few bytes when tags are few and lengths short, and it solves endianness equally well for 32-bit tags and lengths as it does for text files. I have been considering this solution, but there hasn't been any need for it yet.
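A rough sketch of that idea in Python, assuming the original six-byte UTF-8 scheme (which covers values up to 31 bits, so slightly less than a full 32-bit range). The function names are made up for the example, and the decoder does no validation of malformed input.

```python
# (upper limit, lead-byte marker) for sequences of 1 to 6 bytes,
# following the original UTF-8 bit layout.
_LIMITS = [(0x7F, 0x00), (0x7FF, 0xC0), (0xFFFF, 0xE0),
           (0x1FFFFF, 0xF0), (0x3FFFFFF, 0xF8), (0x7FFFFFFF, 0xFC)]

def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer below 2**31 the way UTF-8 encodes a code point."""
    if not 0 <= n <= 0x7FFFFFFF:
        raise ValueError("value out of range for this scheme")
    for extra, (limit, marker) in enumerate(_LIMITS):
        if n <= limit:
            break
    tail = []
    for _ in range(extra):              # low 6 bits go into continuation bytes
        tail.append(0x80 | (n & 0x3F))
        n >>= 6
    return bytes([marker | n] + tail[::-1])

def decode_varint(buf: bytes, pos: int = 0):
    """Return (value, position after the encoded integer). No error checking."""
    lead = buf[pos]
    if lead < 0x80:
        return lead, pos + 1
    extra, mask = 0, 0x40
    while lead & mask:                  # count the leading 1 bits after the first
        extra += 1
        mask >>= 1
    value = lead & (mask - 1)
    for b in buf[pos + 1:pos + 1 + extra]:
        value = (value << 6) | (b & 0x3F)
    return value, pos + 1 + extra

print(encode_varint(5).hex(), encode_varint(300).hex(), encode_varint(70000).hex())
```

Small tags and lengths take one byte, and since the encoding is defined byte by byte, there is no byte order to agree on.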
You obviously still have an endianness issue if the value field contains any binary numeric value at all (including UTF-16 characters). A large group of application/data formats are mainly targeted at user environments where CPUs of one given endianness are dominant. Defining that as The byte order for your format, and clearly indicating to readers / writers of the opposite endianness that they have to flip bytes (some CPUs have special instructions for that!), is, in my opinion, a far better solution than converting everything to text.
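As a small illustration (the four sample bytes are arbitrary), this is what "define the byte order and let the other side flip" amounts to in Python:

```python
import struct
import sys

raw = bytes([0x78, 0x56, 0x34, 0x12])    # four bytes as they sit in the file

as_little = struct.unpack("<I", raw)[0]  # the format is declared little-endian
as_big = struct.unpack(">I", raw)[0]     # what a reader assuming big-endian would get

# Flipping the bytes is all the opposite-endian side has to do.
flipped = int.from_bytes(raw[::-1], "big")

print(hex(as_little), hex(as_big), hex(flipped), "host is", sys.byteorder)
```

With the explicit `<` in the format string, the result does not depend on the host CPU at all; the flip happens (or not) inside the library.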
Text doesn't solve all format problems either, unless you define one of many alternate formats as The format (analogous to defining the endianness of the format). How do you represent dates? 05/19 is unambiguous (but must be converted to e.g. ISO standard before presenting to a Norwegian user). A week ago, 05/12, is ambiguous unless the representation is explicitly defined. Time: AM/PM is virtually unknown in many languages/cultures. Numerics: Is 1,500 one and a half, or fifteen hundred?
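The ambiguity is easy to demonstrate. In this sketch the year 2019 is an assumption added only so the strings parse; the helper name is made up.

```python
from datetime import date, datetime

def parses_as(text: str, fmt: str):
    """Return the parsed date, or None if the text does not fit the format."""
    try:
        return datetime.strptime(text, fmt).date()
    except ValueError:
        return None

month_first = "%m/%d/%Y"
day_first = "%d/%m/%Y"

# 05/19 only works one way: there is no month 19.
print(parses_as("05/19/2019", month_first), parses_as("05/19/2019", day_first))

# 05/12 parses both ways, giving two different dates.
print(parses_as("05/12/2019", month_first), parses_as("05/12/2019", day_first))

# The ISO form has exactly one reading.
print(date.fromisoformat("2019-05-12"))

# Same problem for numbers: is the comma a thousands or a decimal separator?
print(float("1,500".replace(",", "")), float("1,500".replace(",", ".")))
```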
Text: How do you represent characters beyond ASCII? 8859-1? 8859-x, with x specified in metadata? UTF-16? UTF-8? Maybe you will stick to ASCII and use QP, or Base64? HTML character entities? (named, # or either?) Backslash escapes? (hex, decimal, octal, or any of those?) URL percent-encoding? Which characters do not need to be escaped? How are newline and end of string represented - is NUL accepted as a fill byte, in accordance with ISO standards?
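To give a feel for how many answers there are, here is one single character pushed through a handful of those mechanisms, using only the Python standard library. The choice of "é" is arbitrary.

```python
import base64
import quopri
from html.entities import codepoint2name
from urllib.parse import quote

ch = "\u00e9"  # é, one character just outside ASCII

forms = {
    "ISO 8859-1": ch.encode("latin-1"),
    "UTF-8": ch.encode("utf-8"),
    "UTF-16LE": ch.encode("utf-16-le"),
    "UTF-16BE": ch.encode("utf-16-be"),
    "quoted-printable of UTF-8": quopri.encodestring(ch.encode("utf-8")),
    "Base64 of UTF-8": base64.b64encode(ch.encode("utf-8")),
    "HTML named entity": f"&{codepoint2name[ord(ch)]};".encode("ascii"),
    "HTML numeric entity": f"&#{ord(ch)};".encode("ascii"),
    "backslash escape": ch.encode("unicode_escape"),
    "URL percent-encoding": quote(ch).encode("ascii"),
}

for name, encoded in forms.items():
    print(f"{name:27} {encoded!r}")
```

Ten different byte sequences for one letter, and a reader has to be told which one applies.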
And so on and so on. Text representation certainly doesn't solve all problems. (I'd say that binary encoding solves more!)
In the days when I was working with ASN.1 and BER, a BER string had to be inspected using a BER reader (which should have access to the ASN.1 to provide symbolic names). The readability was a lot better than with XML! When I went from BER to XML, I was considering making a similar XML reader to make it readable; I never got around to doing that.
Today, most systems for displaying plain text have some facilities for improving readability, starting with collapsing inner structures, then highlighting of tags, and so on. You could say that such functions illustrate that the plain text format is not good enough. If I need a display tool that parses and transforms XML or whatever into something readable, it might as well transform some TLV format into something readable.
There is one issue that still remains, though: How self-describing the file should be. TLV tags are usually opaque, just some integer number. When you see an XML "p" tag, you know that it may have to do with a person, a product, a paragraph or something associated with the "p" (usually as the initial letter). At one presentation on handling of arbitrary XML documents, I had a Sami colleague give me Northern Sami terms for chapter, section, picture and so on, for me to use in the examples: The tags were just for illustration (something like Lorem ipsum), but it was difficult for the audience to relate to this as a document.
I made one TLV format a few years ago: The file contained zero or more tag name tables, providing symbolic tags for presentation purposes; each table was headed by a language code. For simplicity, in that format, tags were unique. If partial structures could have had "locally defined" tags (as allowed e.g. in ASN.1), a more complex scheme would be required, easily growing into a complete scheme representation. In this case, that would be overkill; global tags were far easier and fully acceptable.
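Roughly, the idea looks like this in Python. This is only an in-memory sketch, not the actual file layout of that format; the tag numbers, the language codes and the fallback label are all made up for illustration.

```python
# One name table per language code; tags are globally unique integers.
TAG_NAMES = {
    "en": {1: "chapter", 2: "picture"},
    "nb": {1: "kapittel", 2: "bilde"},
}

def label(tag: int, lang: str) -> str:
    """Symbolic name for presentation; fall back to the raw number if none is known."""
    return TAG_NAMES.get(lang, {}).get(tag, f"tag#{tag}")

def render(records, lang: str):
    """Turn (tag, value) records into display lines in the requested language."""
    return [f"{label(tag, lang)}: {value.decode('ascii')}" for tag, value in records]

records = [(1, b"Introduction"), (2, b"fig1.png"), (7, b"?")]

print(render(records, "en"))
print(render(records, "nb"))
```

The data itself never changes; only the table used for presentation does.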
Such issues do not arise at all with textual tags; they are at least at some level self-describing. But they raise issues of e.g. case significance, allowed character set, and a bunch of other issues that a numeric tag evades.
When ASN.1/BER was at war with other alternatives, the lack of symbolic tag names in BER, mandating the receiver to have access to the ASN.1 scheme for interpretation, was one of the strongest criticisms of BER (/DER/CER). Later, we got XML and JSON encoding rules, encoding symbolic names from the ASN.1 scheme into the stream, but this was only a half-way solution: Matching (and keeping in synchronization) an ASN.1 scheme to an XML scheme is, for all practical purposes, impossible, certainly over time. So it mostly served as a poor man's BER reader.
I see a lot of areas where computer guys are rather unwilling to seriously assess the commonly used solutions, asking critically if they really are the best. Textual encoding is one of those. We use it because that's the way we do it. Because textual encoding is there, not because it came out with the highest evaluation score. Sure, it is there, we have to accept it when exchanging data with others. But in "local" contexts (such as private files for an application), I tend to use other alternatives.
|
|
|
|
|
Member 7989122 wrote: I see a lot of areas where computer guys are rather unwilling to seriously assess the commonly used solutions, asking critically if they really are the best. Textual encoding is one of those. We use it because that's the way we do it. Because textual encoding is there, not because it came out with the highest evaluation score. The .NET framework works by default with UTF16. We need not think about limiting to ASCII, because we're no longer limited by the space on a floppy. Not much to gain there, and hardly worth the money for the time spent on it.
And no, you don't go back questioning the design of the screws if you're building a car. You take the industry standard, take a brief glance at other screws, and try to realize there's a reason why it is the current standard.
Bastard Programmer from Hell
If you can't read my code, try converting it here[^]
"If you just follow the bacon Eddy, wherever it leads you, then you won't have to think about politics." -- Some Bell.
|
|
|
|
|
Eddy Vluggen wrote: And no, you don't go back questioning the design of the screws if you're building a car. You take the industry standard, take a brief glance at other screws, and try realize there's a reason why it is the current standard. That is certainly true. Sometimes there are reasons for that component design that you do not realize, and if you try to "improve" it, you may be doing the opposite. When a partial solution is given, it is given.
Textual encoding may be that way, in particular when you are exchanging data with others.
But when you are not bound to one specific solution, e.g. you are defining a storage format for the private data of an application, or you have several alternatives to choose from, e.g. 8-bit text is given but you need to select an escape mechanism either for extended characters or characters with special semantics, then you should know the pluses and minuses of the alternatives.
"Because we used it in that other product" is not an assessment. Yet, I often have the feeling that we are arguing like that. We should spend some of our efforts on learning why these other alternatives were developed at all. There must be some reason why someone preferred it another way! Maybe those reasons will pop up in some future situation; then you should not select an inferior solution because "that is what we always do".
What I (optimistically) expect from my colleagues is that they are prepared to relate to the advantages and disadvantages of text and binary encoding. If they are network guys: That they know enough to explain the greatness of IP routing vs. virtual circuit routing, the advantage of layer-3 routing over layer-1 switching. Application developers should relate to explicit heap management vs. automatic garbage collection, use of threads vs. processes, semaphores vs. critical regions. And so on.
Surprisingly often, developers know well the solution they have chosen - but that is the only alternative they know well. They cannot give any (well) qualified explanation why other alternatives were rejected. I think it is as important (in any field, both engineering ones and others) to be capable of defending the rejection of other alternatives as it is to defend the selected one. If you cannot, then I get the impression that you have not really considered the alternatives, just ignored them. And that is what worries me.
For UTF-16: yes, that is given, as an internal working format. Yet you should consider what you will be using as an external format: UTF-8 is far more widespread for interchange of text info. When is it more appropriate? If you go for UTF-16, will you be prepared to read both big- and little-endian variants, or assume that you will exchange files only with other .NET-based applications? Will you be prepared to handle characters outside the Basic Multilingual Plane, i.e. with code points >64Ki?
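Both questions can be seen in a few lines of Python. The sample string is arbitrary: one BMP character followed by one code point above 64Ki.

```python
import codecs

text = "A\U0001F600"  # 'A' plus one character outside the Basic Multilingual Plane

le = text.encode("utf-16-le")
be = text.encode("utf-16-be")
with_bom = text.encode("utf-16")   # Python prepends a byte order mark here

# A reader that honours the BOM gets the text back, whichever order was written.
assert with_bom.decode("utf-16") == text
print(with_bom[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))

# A reader that assumes the wrong order silently produces different characters.
print(be.decode("utf-16-le") == text)

# The non-BMP character takes two 16-bit code units (a surrogate pair).
print(len(le) // 2, "code units for", len(text), "characters")

# UTF-8 has no byte order to get wrong.
print(text.encode("utf-8").hex())
```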
Even if your response is: We will assume little-endian, we will assume that we never need to handle non-BMP-characters, we will assume that 640K is enough for everyone, these should be deliberate decisions, not made by defaulting.
When Bill Gates was confronted with the 640k-quote, he didn't positively confirm it, but certainly didn't deny it: He might very well have made that remark in the discussion of how to split the available 1 Mbyte among the OS and user processes. Given that 1 MB limit, giving 384 kB to the OS and 640 kB to application code should be a big enough share for the applications, otherwise the OS will be cramped in too little space. 640k is enough for everyone. - In such a context, where the reasoning is explained, the quote suddenly makes a lot more sense. Actually, it is quite reasonable!
That is how I like it. Knowing why you make the decisions you do, when there is a decision to make. Part of this includes awareness of when there is a decision to make - do not ignore that you actually do have a choice between your default alternative and something else.
|
|
|
|
|
Member 7989122 wrote: We should spend some of our efforts on learning why these other alternatives were developed at all. 8 bit is not developed as an alternative. ASCII is not an alternative for UTF16.
Member 7989122 wrote: If you cannot, then I get the impression that you have not really considered the alternatives, just ignored them. And that is what worries me. What worries me is that you see improvements of the wheel (with a documented history) as alternatives for more modern standards.
Member 7989122 wrote: or assume that you will exchange files only with other .net-based applications? No, you don't assume; you define an exchange-protocol in specific text-encoding. Should be part of the specs.
Member 7989122 wrote: the quote suddenly makes a lot more sense. The quote that's not his, you mean?
Member 7989122 wrote: That is how I like it. Knowing why you make the decisions you do, when there is a decision to make. Aw, can't argue with that. I assume all your databases are in BCNF?
|
|
|
|
|
Member 7989122 wrote: Isn't the whole bunch of them wheel reinventions? No, they're refinements of said wheel.
Member 7989122 wrote: yet I think that what humans should not mess up, should not be made available for messing up Reading and writing aren't the same thing; making human validation impossible does not help with ensuring a correct write - after all, your application might have a bug and write the wrong stuff. The only thing that making it unreadable does, is prevent a human validation.
|
|
|
|
|
A binary format certainly does not mean that the information and its structure cannot be inspected at all! You do have a tool for inspecting e.g. a binary ASN.1/BER format that lets you navigate in the structure, detect format errors (and the reader should support you in that!) etc.
As I mentioned in another post: I made an XML document example using tags in Northern Sami, making no sense to the audience (nor to me - I got the Sami terms from a colleague). Then, there is very little value in the "textual" format, when all you know is that "something" is nested within "something else". I also used an example with a "p" tag, where "p" represented a person p (in one part of the scheme), ordering a product p (in another part), and in the payment information, p indicated a paragraph in the text. Understanding the XML record properly suffers from the use of seemingly readable, but highly ambiguous tag names.
You may limit your application or data format to English format, just to ensure that you as an English speaker can make sense of it. But please state that explicitly as a limitation, then! "This data specification format should not be used in any non-English context". That could be valid for software development tools used by IT professionals only, but certainly not in a general document context. Administration, business. Home use. Educational material... Be prepared for Chinese macro names. Russian XML tags. ÆØÅ in variable names. Dates in ISO format and 24 hour clock. Those are more or less absolute requirements as soon as you move your application out of the computer lab.
For multi-lingual applications, binary formats give a lot of flexibility compared to text formats. Of course you can translate on-the-fly, but using a plain integer as an index into a language table is a lot easier than word-to-word translation. And you may supply extra info in that language table, e.g. indicating plural forms, gender etc., giving a much better translation.
|
|
|
|
|
Member 7989122 wrote: Be prepared for Chinese macro names. Russian XML tags. ÆØÅ in variable names. We are, since we're no longer limited to ASCII.
Member 7989122 wrote: Dates in ISO format and 24 hour clock. Date-formats are another topic; you should save in ISO, but display nicely in the format that the user has set as his preference in Windows. That's not a suggestion, nor is there a discussion.
Member 7989122 wrote: For multi-lingual applications, binary formats give a lot of flexibility compared to text formats. Ehr.. no. You could have ASCII in binary, with a completely useless date format.
Member 7989122 wrote: Of course you can translate on-the-fly, but using a plain integer as an index into a language table is a lot easier than word-to-word translation. And you may supply extra info in that language table, e.g. indicating plural forms, gender etc., giving a much better translation. We use keys, not integers, and resource-files.
You started with a wheel, now you're also including a dashboard and brakes. I have no idea what you are trying to say.
|
|
|
|
|
Eddy Vluggen wrote: We are, since we're no longer limited to ASCII. I was primarily thinking of readability and comprehension, not representation. If you are receiving a support request or error report, and all supporting documentation uses characters that make no sense to you, you may have great difficulties in interpreting the bug report or error request.
And: The alternative to UTF-16 (which is hardly used at all in files) is UTF-8, not ASCII. In the Windows world, you may still see some 8859-x (x given by the language version of the 16-bit Windows), but to see 7-bit ASCII, you must go to legacy *nix applications. Some old *nix-based software and old compilers may still be limited to ASCII - I have had .ini files that did not even allow 8859-1 in comments! But you must of course be prepared for 8859 when you read plain text files from an arbitrary source (and ASCII is the lower half of 8859).
you should save in ISO, but display nicely in the format that the user has set as his preference in Windows Then we are talking about not reading a text representation as a text file, but using an interpreter program to present the information. Just as you would do with a binary format file.
Ehr.. no. You could have ASCII in binary, with a completely useless date format. I am not getting this "ASCII in binary". Lots of *nix files with binary data use Unix epoch to store date and time. If your data is primarily intended for the Windows market, you might choose to store it as 100 ns ticks since 1601-01-01T00:00:00Z - then you can use standard Windows functions to present it in any format. Conversion to Unix epoch is one subtraction, one division. If you insist on ISO 8601 character format, you may store it in any encoding you want, all the way down to 5-bit Baudot code.
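For the record, the "one subtraction, one division" can be sketched like this in Python. The function names are made up; the offset between the two epochs is computed rather than hard-coded, and the conversion keeps whole seconds only.

```python
from datetime import datetime, timezone

TICKS_PER_SECOND = 10_000_000  # one tick is 100 ns

EPOCH_1601 = datetime(1601, 1, 1, tzinfo=timezone.utc)
EPOCH_1970 = datetime(1970, 1, 1, tzinfo=timezone.utc)
OFFSET_SECONDS = int((EPOCH_1970 - EPOCH_1601).total_seconds())

def ticks_to_unix(ticks: int) -> int:
    """100 ns ticks since 1601 -> whole seconds since 1970 (fraction is dropped)."""
    return ticks // TICKS_PER_SECOND - OFFSET_SECONDS

def unix_to_ticks(seconds: int) -> int:
    """Whole seconds since 1970 -> 100 ns ticks since 1601."""
    return (seconds + OFFSET_SECONDS) * TICKS_PER_SECOND

print(OFFSET_SECONDS, unix_to_ticks(0))
```

Either way the stored value is a plain integer; turning it into "19 May" or "19.05." is a presentation matter.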
You started with a wheel, now you're also including a dashboard and brakes. Did you ever roll snowballs to make a snowman when you were a kid?
I have no idea what you are trying to say One major point is that binary data file formats, as opposed to character representations, are underestimated; most programmers are stuck in the *nix style of representing all sorts of data in a character format, where a binary format would be more suitable. (The same goes for network protocols!) I am surprised that you haven't discovered that point.
|
|
|
|
|
Member 7989122 wrote: I was primarily thinking of readability and comprehension, not representation. Readability can't be without representation.
Member 7989122 wrote: If you are receiving a support request or error report, and all supporting documentation uses characters that make no sense to you, you may have great difficulties in interpreting the bug report or error request. No, I mail the provider of said and burn them for not documenting.
Member 7989122 wrote: And: The alternative to UTF-16 (which is hardly used at all in files) is UTF-8, not ASCII. That's not an alternative. One is a more limited version of the wheel than the other.
Member 7989122 wrote: But you must of course be prepared for 8859 No, in general I'm not; the specs specify what I should support, and outdated isn't supported.
Member 7989122 wrote: Then we are talking about not reading a text representation as a text file, but using an interpreter program to present the information. Just as you would do with a binary format file. Bin nor text need an interpreter.
Member 7989122 wrote: I am not getting this "ASCII in binary". Lots of *nix files with binary data use Unix epoch to store date and time. ASCII is a text-representation that is stored as bits. Unix epoch has nothing to do with any discussion of text-formats.
Member 7989122 wrote: Did you ever roll snowballs to make a snowman when you were a kid? No. What's the use of that?
Member 7989122 wrote: One major point is that binary data file formats, as opposed to a character representation, is underestimated A representation is not a format. They're all stored as bytes. Google for an ASCII-table, it shows what bytes are used for the character.
Member 7989122 wrote: I am surprised that you haven't discovered that point. I deduce you're not asking a question, but trying to make a point. Mixing text-encodings and date-encodings, trying to prove that not human readable binary is somehow superior.
You fail to give a simple example to prove so, and your explanation isn't helping me.
|
|
|
|
|
While the binary format you describe is interesting, it's not what I asked about.
I'll try creating one in the future nevertheless.
|
|
|
|
|
XML is very verbose and JSON doesn't have extendable types.
|
|
|
|
|
XML existed before JSON.
And data interchange formats benefit from being verbose. Due to readability; it's not a binary format.
Come to the point please.
|
|
|
|
|
How does "XML existed before JSON" relate to either "XML is very verbose" or "JSON doesn't have extendable types"?
In which ways do "data interchange formats benefit from being verbose"?
Most users today do not read the raw data interchange format directly, as-is - they process it by software that e.g. highlights labels, closing tags etc., and allows collapsing of substructures. When you pass it through software anyway, what impact on readability does the format of the input to this display processor have? With semantically identical information, but binary coded, as input to the display processor, why would the readability be better with a character encoding of the information rather than with a binary encoding?
|
|
|
|
|
Semantical bullshit, aka wordsmithing. I've been on that train before.
You're trying to act as if binary is the solution to formats; it's not. Anything, text or date, is stored as bits, and is thus in binary. ASCII is a representation of that, UTF is a better form of ASCII. Dates are stored as floats.
I don't care what university. You can either learn or be ridiculed. And damn right I will, at every opportunity.
And yes, being "kind"
|
|
|
|
|
If you really want me to explain to you the difference between storing an integer, say, as a 32 bit binary number vs. storing it as a series of digit characters, because "ASCII is bits, hence digital", then I give up. Sorry.
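For anyone else following the thread, the difference fits in a few lines of Python. The value and the little-endian choice are arbitrary, picked only for the illustration.

```python
import struct

n = 1_000_000

as_binary = struct.pack("<i", n)   # 32-bit two's complement: always 4 bytes
as_text = str(n).encode("ascii")   # one byte per decimal digit: 7 bytes here

print(as_binary.hex(), len(as_binary))
print(as_text, len(as_text))

# Both round-trip, but by different routes: a fixed-size unpack vs. digit parsing.
assert struct.unpack("<i", as_binary)[0] == n
assert int(as_text.decode("ascii")) == n
```

Both end up as bits on disk, of course; the point is which bits, how many of them, and what the reader has to do to get the number back.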
|
|
|
|
|
Member 7989122 wrote: If you really want me to explain to you the difference between storing an integer, say, as a 32 bit binary number vs. storing it as a series of digit characters I didn't say that; and not going to explain either. I've no need to, nor any desire.
Member 7989122 wrote: then I give up. Sorry.
Good timing. And please do.
|
|
|
|
|
They are not good enough so I won't use them.
|
|
|
|
|
They might not be efficient to you; but lots of us use them, both, where appropriate.
Try to explain why XML isn't good enough, and to how many floppy-discs you're limited to that you need that optimization.
Do elaborate, please.
|
|
|
|
|
You have several times in this thread more or less insisted on relating to (7-bit) ASCII and floppy disks. No one else here cares about either of those. If they are your frame of reference, then refer your experience to them. I don't care to. And I don't think the effort to explain why not would be justified.
I am not (and I guess a few others agree) demanding that you critically assess your choice of data formats and other solutions. You may go on as you please, with the formats that please you, with or without any critical evaluation. You are welcome.
|
|
|
|
|
Not with or without critical evaluation, but an education.
One expects that a developer knows the different text-formats (and encodings, which is the same to you), data-formats, and date-formats. One who mixes those in a semantical bullshit argument gets called out.
So damn right I will. Either play your cards or fold.
|
|
|
|
|
I don't mean that I won't use XML/JSON. I think they are not good enough so I still want to create my own data notation. It's just me saying that this is off topic (I used Stack Exchange sites before) and I just don't want to discuss it any further (as it doesn't bring anything to my first question).
|
|
|
|
|
What does that have to do with anything? I merely pointed out that there are two existing, well tried and widely supported systems for data interchange. You can use them or not as you choose.
|
|
|
|
|
Well, pointing to XML/JSON was off topic as well.
|
|
|
|
|
nedzadarek wrote: I want to create data notation (like JSON is used). So your mention of JSON in your original question was off topic?
|
|
|
|
|