Discussion:
[Yaml-core] Next YAML: drop equality definition
Osamu TAKEUCHI
2016-03-04 13:50:13 UTC
Hi there,

There have been long arguments about how we should treat equality of nodes in YAML.

https://sourceforge.net/p/yaml/mailman/search/?q=%22%5BYAML-core%5D+equality%22&mail_list=all&sort=posted_date%20desc

Currently, equality of nodes is used for two purposes. One is to
reject a mapping with duplicate keys: the YAML 1.2 spec says a
mapping with duplicate keys should be rejected by a YAML parser.
The other is to allow a library to represent equal scalar nodes
by a single identical node to save memory.

https://sourceforge.net/p/yaml/mailman/message/23572250/

It sounds straightforward at first, but in reality it is not.
Equality of nodes becomes problematic when anchors/aliases and
implicit tag resolution are involved.

To solve the problems, Oren proposed the following.

https://sourceforge.net/p/yaml/mailman/message/24061658/
Do not specify YAML equality rules. Eliminate most of the discussion of
equality, canonical formats etc. and replace it by stating that
implementations "may" reject mappings that have "equal" keys, according to
their own *implementation-specific* definition of equality. Constrain this
to say that nodes with equal tags and equal content are always equal and
hence "must" be rejected as duplicates. The problem here is that { 1: "int",
"1" : "string" } would work in Python and not in Javascript. Arguably,
anyone defining a cross-platform schema would be able to "easily" avoid such
issues (e.g., by requiring all keys of the mapping to have the same tag,
which is pretty trivial). But there's no longer a universal cross-platform
validity guarantee.
But the statement
nodes with equal tags and equal content are always equal and hence
"must" be rejected as duplicates
was still controversial.
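
The Python/JavaScript difference in Oren's example can be seen in a minimal Python sketch. The helper js_like_keys is hypothetical: it only mimics the way a plain JavaScript object converts property keys to strings, and it is not taken from any YAML library.

```python
# Python dicts compare keys by type-aware equality, so the integer 1 and
# the string "1" are two distinct keys.
python_map = {1: "int", "1": "string"}


def js_like_keys(pairs):
    """Hypothetical helper: store pairs the way a plain JavaScript object
    would, converting every key to a string first (later keys overwrite)."""
    result = {}
    for key, value in pairs:
        result[str(key)] = value
    return result


js_like_map = js_like_keys([(1, "int"), ("1", "string")])

print(len(python_map))   # 2
print(len(js_like_map))  # 1
```

The same two keys survive on one platform and collapse on the other, which is why a cross-platform validity guarantee needs more than "equal according to the native type".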


I propose to completely stop defining equality in the YAML spec and
to leave the choice of accepting or rejecting a mapping with possibly
duplicate keys to the specific application.



[[Point 1]] Equality of complex nodes with anchors and aliases

It is really difficult to strictly judge equality of complex nodes
with anchors and aliases in accordance with the YAML 1.2 specification.

To evaluate equality strictly, a parser must implement a complex
graph comparison algorithm. For example, a parser must distinguish
the two objects in the following.

%YAML 1.2
---
&A { *A }
---
&A { { *A } }
...

They are not equal to each other under the YAML 1.2 specification
because their node graph topologies differ.

https://sourceforge.net/p/yaml/mailman/message/23572250/
https://sourceforge.net/p/yaml/mailman/message/23576035/

Similarly, the next document should be accepted,

- &A [ *A ]
- &B [ *A ]
- { *A : 1, *B : 2 }

while the next should be rejected unless some specific schema
is applied. (Implicit tag resolution allows a parser to give
different tags to *A and *B according to the path of the node
from the root.)

- &A [ *A ]
- &B [ *B ]
- { *A : 1, *B : 2 }

So, almost no library has implemented YAML 1.2's equality strictly.
In other words, the equality definition of mappings and sequences
in the YAML spec has almost always been neglected so far.
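
To give a feeling for what such a graph comparison involves, here is a minimal Python sketch. It is only one possible reading of "same topology", not the algorithm of the spec or of any library: it assumes sequences are Python lists (used here instead of the mappings above to keep the code short), everything else is a scalar, and two graphs are equal only if their nodes can be paired one-to-one while walking both in parallel. The name same_topology is hypothetical.

```python
def same_topology(a, b, fwd=None, bwd=None):
    """Compare two possibly cyclic node graphs built from Python lists."""
    fwd = {} if fwd is None else fwd   # id of node in a -> id of node in b
    bwd = {} if bwd is None else bwd   # id of node in b -> id of node in a
    a_is_seq, b_is_seq = isinstance(a, list), isinstance(b, list)
    if not a_is_seq and not b_is_seq:
        return a == b                  # two scalars: plain value comparison
    if a_is_seq != b_is_seq:
        return False
    if id(a) in fwd or id(b) in bwd:
        # At least one node was met before: the pairing must be consistent.
        return fwd.get(id(a)) == id(b) and bwd.get(id(b)) == id(a)
    fwd[id(a)] = id(b)
    bwd[id(b)] = id(a)
    return len(a) == len(b) and all(
        same_topology(x, y, fwd, bwd) for x, y in zip(a, b))


# &A [ *A ]
a = []
a.append(a)
# &B [ *A ]
b_points_to_a = [a]
# &B [ *B ]
b_points_to_b = []
b_points_to_b.append(b_points_to_b)
# &A [ [ *A ] ]
nested = []
nested.append([nested])

print(same_topology(a, b_points_to_a))  # False: keys differ, mapping accepted
print(same_topology(a, b_points_to_b))  # True: keys equal, mapping rejected
print(same_topology(a, nested))         # False: different topology
```

Even this simplified version needs bookkeeping that a plain native == does not do, and it still ignores mappings, tags and schemas.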

I implemented it myself a long time ago, but I don't think it was
beneficial to anyone, because no one needs a strict comparison
of complex graph topology for all mappings in YAML documents.

The underlying issue is that the equality definitions of native
hashes and arrays in different languages are all slightly different
from each other, and also from that of mappings and sequences in
YAML, as discussed. If YAML specifies equality strictly, the YAML
data model will not fit the native data model of almost any
language or library.

https://sourceforge.net/p/yaml/mailman/message/23591366/



[[Point 2]] Implicit tag resolution

Oren strongly pushed for rejection of duplicate keys in the YAML spec
in order to make generic "schema-blind" YAML tools possible.
However, that is impossible anyway.

https://sourceforge.net/p/yaml/mailman/message/24073088/
https://sourceforge.net/p/yaml/mailman/message/24075274/

In the next example, a schema-blind parser can never judge
whether *A and *a are equal.

- !People
  - &A { name: Mike }
- !Cats
  - &a { name: Mike }
- !Favorites
  *A: beef steak
  *a: canned tuna

In this case, I expect *A to be resolved to !Person while *a is
resolved to !Cat. Note that it is allowed to resolve unspecified tags
from the path of the node from the root. Since the tags are different,
*A and *a are different regardless of their values. The mapping
should not be rejected, but no schema-blind parser can judge that
properly.

The point is, equality of nodes should be determined by the schema,
not by the YAML spec.
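
A minimal Python sketch of the difference, under a made-up schema: PATH_RULES, resolve_tag and the two comparison functions are hypothetical names, and the rule "a child of !People is a !Person" is only my assumption about how such a schema could be written.

```python
# Hypothetical schema: the tag of an untagged mapping is decided by the tag
# of the collection that contains it.
PATH_RULES = {"!People": "!Person", "!Cats": "!Cat"}


def resolve_tag(parent_tag):
    """Resolve the implicit tag of a child node from its parent's tag.
    Without a schema rule, fall back to the generic mapping tag."""
    return PATH_RULES.get(parent_tag, "tag:yaml.org,2002:map")


def schema_blind_equal(content_a, content_b):
    # A schema-blind tool can only look at the content.
    return content_a == content_b


def schema_aware_equal(parent_a, content_a, parent_b, content_b):
    # A schema-aware tool compares resolved tags first, then content.
    return (resolve_tag(parent_a) == resolve_tag(parent_b)
            and content_a == content_b)


mike_person = {"name": "Mike"}   # &A, inside the !People sequence
mike_cat = {"name": "Mike"}      # &a, inside the !Cats sequence

print(schema_blind_equal(mike_person, mike_cat))                      # True
print(schema_aware_equal("!People", mike_person, "!Cats", mike_cat))  # False
```

The schema-blind comparison would report a duplicate key in the !Favorites mapping, while the schema-aware one would accept it.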

Kirill gave a good example where a scalar node could be equal to
a collection node under a realistic schema.

https://sourceforge.net/p/yaml/mailman/message/24073575/

x: 1.0i
y: { re: 0.0, im: 1.0 }
z: { rho: 1.0, phi: 1.5707963267948966 }

The three nodes 'x', 'y', and 'z' are probably equal. You don't know
unless you know the schema. Another example is from his own application,
where nodes

db1: postgres://localhost:5432/mydb
db2: mydb
db3: { engine: postgres, host: localhost, port: 5432, database: mydb }
db4: { database: mydb }

will generate the same object.

Again, the equality should be determined by the schema, not by the
YAML spec.
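
Kirill's complex-number example can be sketched in Python as follows. The function construct_complex is a hypothetical constructor for such a schema; the accepted spellings ("1.0i", {re, im}, {rho, phi}) are taken from the example above, everything else is an assumption.

```python
import cmath


def construct_complex(node):
    """Hypothetical constructor for a complex-number schema. Accepts a
    scalar such as "1.0i", a {re, im} mapping or a {rho, phi} mapping."""
    if isinstance(node, str):
        # Python spells the imaginary unit "j", so translate a trailing "i".
        return complex(node[:-1] + "j") if node.endswith("i") else complex(node)
    if "re" in node:
        return complex(node["re"], node["im"])
    return cmath.rect(node["rho"], node["phi"])


x = construct_complex("1.0i")
y = construct_complex({"re": 0.0, "im": 1.0})
z = construct_complex({"rho": 1.0, "phi": 1.5707963267948966})

print(x == y)                              # True
print(cmath.isclose(x, z, abs_tol=1e-12))  # True
```

Note that the polar form is compared with a tolerance here, because converting it to rectangular form may introduce a tiny rounding error. Whether such nodes count as "equal" keys is again a decision of the data type, not of the serialization format.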



[[Point 3]] A mapping with clearly duplicate keys.

According to my understanding, the next mappings definitely have
duplicate keys under any schema in YAML 1.2, because the spec does
not allow tag resolution based on the order in which keys appear
in a mapping.

- { a: true, a: false }
- { !SomeTag b: true, !SomeTag b: false }
- { [1,2]: true, [1,2]: false }
- { {a:1,[a,b]:2}: true, {[a,b]:2,a:1}: false }

However, implementing the rejection is even tougher with arbitrary
schemas in mind: in addition to the complex graph comparison, the
parser has to check that two complex nodes with anchors, aliases and
unspecified tags will never be different under any schema.

I don't think any library will implement such an equality evaluation,
regardless of what is written in the spec.
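
For comparison, the easy part of the job looks like this in Python. This is a minimal sketch under strong assumptions: a node is a (kind, content) pair, there are no anchors or aliases, and all tags are taken to be already resolved and identical for nodes of the same kind. The names freeze, has_duplicate_keys and s are hypothetical.

```python
def freeze(node):
    """Turn a node into a hashable value so that equal nodes collide."""
    kind, content = node
    if kind == "map":
        # Key order carries no meaning, so use an unordered frozenset.
        return (kind, frozenset((freeze(k), freeze(v)) for k, v in content))
    if kind == "seq":
        return (kind, tuple(freeze(item) for item in content))
    return (kind, content)


def has_duplicate_keys(pairs):
    """Check the (key, value) pairs of one mapping for equal keys."""
    frozen = [freeze(key) for key, _value in pairs]
    return len(set(frozen)) < len(frozen)


def s(text):
    """Shorthand for a scalar node."""
    return ("scalar", text)


ab = ("seq", [s("a"), s("b")])
key1 = ("map", [(s("a"), s("1")), (ab, s("2"))])   # {a:1,[a,b]:2}
key2 = ("map", [(ab, s("2")), (s("a"), s("1"))])   # {[a,b]:2,a:1}

print(has_duplicate_keys([(key1, s("true")), (key2, s("false"))]))  # True
print(has_duplicate_keys([(s("a"), s("1")), (s("b"), s("2"))]))     # False
```

Everything that makes the real problem hard (cyclic graphs, implicit tags, schema-specific equality) is exactly what this sketch leaves out.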



[[Point 4]] Simplicity.

If we just state
Implementations "may" reject mappings that have "equal" keys,
according to their own *implementation-specific* definition
of equality.
we can drop every description of equality from the YAML spec.
That will make the spec simpler.



[[Point 5]] Value-based platform vs Identity-based platform

If the YAML spec stops defining equality, some applications might
generate YAML documents that are not easily manipulated
on a *value-based* platform.

For example,

https://sourceforge.net/p/yaml/mailman/message/24075274/

USA:
  Presidents: !Presidents
    - &PR1
      name: George Washington
    - &PR2
      name: John Adams
    (snip)
    - &PR41
      name: George Bush
    - &PR42
      name: William Clinton
    - &PR43
      name: George Bush
    - &PR44
      name: Barack Obama
    (snip)
  Parties: !Parties
    - &PA1
      name: Republican
    - &PA2
      name: Democratic
    (snip)
  PresidentToParty:
    (snip)
    *PR41: *PA1
    *PR42: *PA2
    *PR43: *PA1
    *PR44: *PA2
    (snip)

*PR41 and *PR43 have exactly the same properties, probably with
the same tag !President. So a YAML 1.2 parser will judge them
equal to each other and reject the mapping at the end of
the document. But if the YAML parser does not reject it, the
document is meaningful on an *identity-based* platform.

Oren pointed out that, on a value-based platform, this type of
document cannot be manipulated easily.

In my opinion, it can still be manipulated on a value-based
platform if the YAML library internally adds an identity field to
each node. Otherwise, a YAML library on a value-based platform
will in any case fail to handle valid YAML documents like the next.

- &A [ *A ]
- &B [ *A ]
- { *A: 1, *B: 2 }
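
The idea of an internal identity field can be sketched in Python like this. ByIdentity is a hypothetical wrapper, not part of any YAML library; it simply makes a node hash and compare by object identity, so that two presidents with the same properties remain two different keys.

```python
class ByIdentity:
    """Hypothetical wrapper that gives a node an identity-based hash and
    equality, so equal-looking nodes can still be distinct mapping keys."""

    def __init__(self, node):
        self.node = node

    def __hash__(self):
        return id(self.node)

    def __eq__(self, other):
        return isinstance(other, ByIdentity) and self.node is other.node


pr41 = {"name": "George Bush"}
pr43 = {"name": "George Bush"}
republican = {"name": "Republican"}

print(pr41 == pr43)   # True: equal by value

president_to_party = {ByIdentity(pr41): republican,
                      ByIdentity(pr43): republican}
print(len(president_to_party))                       # 2
print(president_to_party[ByIdentity(pr43)]["name"])  # Republican
```

The wrapper has to keep the node alive for as long as it is used as a key, which it does here by holding a reference to it.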



[[Point 6]] JSON does not forbid duplicate keys in a mapping



[[Conclusion]]

So, I propose to completely stop defining node equality in the YAML spec.

It should just state
Implementations "may" reject mappings that have "equal" keys,
according to their own *implementation-specific* definition
of equality.
possibly with some warnings like

https://sourceforge.net/p/yaml/mailman/message/24075274/
Note that some languages have unique definitions for equality.
For example, !!int 1 and !!str "1" are equal mapping keys in
JavaScript.



[[Another Possible Solution]] Only defining equality of scalar nodes

Another possible solution is to define equality for scalar nodes
only. Leaving mappings and sequences aside, judging equality
is straightforward, except when an alias used as a mapping key
points to a scalar node subject to implicit tag resolution.

This would not make the spec too complicated to read or to
implement, if it turns out to be really beneficial in some cases.


Thank you for reading the long email.

Best,
Osamu Takeuchi
Zenaan Harkness
2016-03-04 14:44:04 UTC
Hi Osamu, excellent post with lots of links and all! Thank you. One
question below.
Post by Osamu TAKEUCHI
There were long arguments how we should treat equality of nodes in YAML.
https://sourceforge.net/p/yaml/mailman/search/?q=%22%5BYAML-core%5D+equality%22&mail_list=all&sort=posted_date%20desc
Currently, equality of nodes is used in two purposes. One is to
reject a mapping with duplicate keys in the YAML 1.2 spec.
The spec says a mapping with duplicate keys should be rejected
by a YAML parser. The other is for allowing a library to represent
some equal scalar nodes by an single identical node to save memory
consumption.
https://sourceforge.net/p/yaml/mailman/message/23572250/
It sounds straightforward at first but not in the reality.
The equality of nodes involves issues when anchor/alias and
implicit tag resolution are involved.
To solve the problems, Oren proposed the following.
https://sourceforge.net/p/yaml/mailman/message/24061658/
Do not specify YAML equality rules. Eliminate most of the discussion
of equality, canonical formats etc. and replace it by a stating that
implementations "may" reject mappings that have "equal" keys,
according to their own *implementation-specific* definition of equality.
Constrain this
to say that nodes with equal tags and equal content are always
equal and hence "must" be rejected as duplicates.
Could both of these, the "may reject" option and the "constrain ...
and must reject" option, be part of the schema, or be specified by
a command line option or side-channel specification at the
appropriate layer?

"Implementation dependent" doesn't feel like enough control.

Thanks again,
Zenaan
Oren Ben-Kiki
2016-03-04 15:40:09 UTC
Ah, key equality all over again...

I think the spec is actually OK here.

It reasonably asks for a canonical form for scalars - this is equivalent to
the application providing operator== which is a basic requirement for keys
and in general (but see below).

It allows but does not require the implementation to do very limited
"early detection" of key equality. For example, two scalar keys of the same
mapping which have exactly the same content must resolve to "equal" keys,
regardless of the tag resolution rules and the application. So this *can*
be flagged early. But there's no such requirement for such early detection
in the spec. In practice, many implementations will enforce this due to
using simple hash tables to implement partial representations.

It allows a YAML processor to "intern" scalars (e.g. strings), or use
shared_ptr to them, etc. This may improve performance. But it does not
require such a mechanism, allowing for simpler implementations.

It forbids a YAML processor from "intern"-ing/sharing complex nodes
(sequences, mappings). Which seems like a good conservative approach.

I'm not averse to tweaking the wording on a YAML1.3/2.0/whatever. For
example, relaxing the equality detection requirement to happen only during
construction, so a complete representation may still include equal keys.
That is, we can apply a schema to assign tags to the nodes, but still not
have any clue what these tags _mean_. That seems like a cleaner cut-off
point between a complete representation and the native data. What the spec
calls now a complete representation would be a canonical representation -
something which is useful, but not required in many cases.

But that issue doesn't seem to be the one raised here; the problem seems to
be with the "early duplicates detection". Which, again, is explicitly not
required by the spec. So I'm not certain what the actual problem is with
the current rules. Is this a case of the perceived rules being different
from the actual rules in the spec?

Oren.
Osamu TAKEUCHI
2016-03-04 17:01:47 UTC
Oren,
Post by Oren Ben-Kiki
I'm not averse to tweaking the wording on a YAML1.3/2.0/whatever.
For example, relaxing the equality detection requirement to happen
only during construction, so a complete representation may still
include equal keys. That is, we can apply a schema to assign tags
to the nodes, but still not have any clue what these tags _mean_.
That seems like a cleaner cut-off point between a complete
representation and the native data. What the spec calls now a
complete representation would be a canonical representation -
something which is useful, but not required in many cases.
I totally concur.

Let me confirm one point. Does this allow identity-based
comparison of mapping nodes with some specific tags?
Post by Oren Ben-Kiki
But that issue doesn't seem to be the one raised here;
the problem seems to be with the "early duplicates detection".
Which, again, is explicitly not required by the spec.
So I'm not certain what the actual problem is with the current
rules. Is this a case of the perceived rules being different
from the actual rules in the spec?
If you don't see any problem with relaxing the equality detection
as above, neither do I.

Thank you for your consideration!

Best,
Osamu Takeuchi


Oren Ben-Kiki
2016-03-04 17:04:57 UTC
Post by Osamu TAKEUCHI
Let me confirm one point. Does this allow identity-based
comparison of mapping nodes with some specific tags?
Not sure what you mean. In YAML each mapping node has its own identity
which is different from all other nodes. The only way for two mapping nodes
to have the same identity is via an alias (anchor and reference).
Post by Osamu TAKEUCHI
Post by Oren Ben-Kiki
But that issue doesn't seem to be the one raised here;
the problem seems to be with the "early duplicates detection".
Which, again, is explicitly not required by the spec.
So I'm not certain what the actual problem is with the current
rules. Is this a case of the perceived rules being different
from the actual rules in the spec?
If you don't see any problem to relax the equality detection
as above, neither do I.
There's no need to "relax" it, it is "relaxed" already.

Oren.
Osamu TAKEUCHI
2016-03-04 17:49:11 UTC
Oren,
Post by Osamu TAKEUCHI
Let me confirm one point. Does this allow identity-based
comparison of mapping nodes with some specific tags?
Not sure what you mean. In YAML each mapping node has its own
identity which is different from all other nodes. The only way
for two mapping nodes to have the same identity is via an alias
(anchor and reference).
My question was: Can we evaluate mapping nodes' equality with
their identity instead of their property values?

This question arose from your old post:
https://sourceforge.net/p/yaml/mailman/message/23576035/
Post by Osamu TAKEUCHI
At first, I want to confirm that, a description
"two nodes are equal only when some condition is met" means
"two nodes are not equal when some condition is not met."
Yes, it is an if-and-only-if, at least as far as the spec is concerned.
According to this rule, the description in the current spec
does not allow identity-based comparison.
Post by Osamu TAKEUCHI
Two mappings are equal only when they have the same tag and
an equal set of keys, and each key in this set is associated
with equal values in both mappings.
It says that equality of mapping nodes must be evaluated by
property values, regardless of the equality operator of the
native object. It does not allow us to compare mapping nodes
by identity even if that is the native object's equality.

Best,
Osamu Takeuchi


Oren Ben-Kiki
2016-03-04 18:12:19 UTC
Post by Osamu TAKEUCHI
My question was: Can we evaluate mapping nodes' equality with
their identity instead of their property values?
If they are identical, they are equal. If they are not identical, they
_may_ be equal.
Post by Osamu TAKEUCHI
https://sourceforge.net/p/yaml/mailman/message/23576035/
From this old post: "... it is Ok for a library to punt and pass the
burden to the native data type implementation (which most YAML libraries
do)...". The amount of effort a YAML processor puts into earlier detection
of duplicates is entirely up to the implementation. It is considered
well-behaved to make at least a cursory effort - in particular, complain
about character-for-character identical scalar keys.

Post by Osamu TAKEUCHI
Two mappings are equal only when they have the same tag and
an equal set of keys, and each key in this set is associated
with equal values in both mappings.
It tells that the equality of mapping nodes must be done by
property values regardless of the equality operator of the
native object. It does not allow us to compare mapping nodes
by identity even if it is the native object's equality.
Ok, I finally get what you are getting at (sorry for being slow). The
question is, what happens if operator== of the native data types is defined
to compare _only_ the identity of the objects. In this case, the YAML
processor would be wrong to do any form of early duplication detection.

Good point.

Note, this only applies to complex keys. The spec clearly says identity is
not preserved for scalars. So the YAML processor _would_ be allowed to do
early detection of _scalar_ keys. In fact it _should_ do so: if two scalar
keys have the same content and same tag, they are equal. This is pretty
simple and covers the cases we really care about (duplicate string keys).

As for complex keys... On the one hand, using this sort of data type as a
key - shudder. But, on the other hand, it is technically possible.... If we
want to be able to serialize "everything", then you may have a good reason
here that YAML processors must not compare complex keys for equality at
all. Ugh.

Nice catch... I wonder what Ingy and Clark think, but I think you may have
killed early detection of equality for complex keys. That
should make implementations (and the spec) simpler.

Oren.
Osamu TAKEUCHI
2016-03-04 19:51:42 UTC
Oren,

Thank you for your understanding of my poor description.
Post by Oren Ben-Kiki
Note, this only applies to complex keys.
The spec clearly says identity is not preserved for scalars.
The exact statement is
Post by Oren Ben-Kiki
A YAML processor may treat equal scalars as if they were
identical.
So, a YAML processor may also preserve scalars' identity.
Is this true?

Then, this may apply to scalars, too.


Actually, as Kirill pointed out, a scalar node could be
equal to a collection node under a realistic schema as
below.

x: 1.0i
y: { re: 0.0, im: 1.0 }
z: { rho: 1.0, phi: 1.5707963267948966 }

Similarly, a class object that has a string representation
can also be stored as a scalar node, and it may have its own
unusual comparator with a customized hash generator. It may be
an identity-based comparison. It may ignore some fields for
comparison. If a YAML library ignores the native comparator
for such scalar nodes, users will be surprised.

This is from my experience implementing a C# library.
Many C# classes have a so-called TypeConverter that converts
the native object to/from its string representation for
serialization purposes. In such cases, it is natural to store
such an object in a scalar node in a YAML document, with
an explicit/implicit tag to represent the data type.
So, a scalar node with an explicit/implicit tag may
no longer be a simple scalar in its native object form.

It is like:

Greeting:
- "Hello world!"
- !System.Drawing.Point 100,100
- !System.Drawing.Font Times New Roman, 14pt

With such a use case in mind, distinguishing scalar nodes
from collection nodes in the preservation of identity is
not natural to me. I prefer to preserve scalars'
identity in my library and to apply the native comparison
method to both scalar nodes and collection nodes,
if the YAML spec allows it.
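
A minimal Python sketch of the situation, to make the surprise concrete. The Font class below is a hypothetical stand-in, not the real System.Drawing.Font: I simply assume a type that is serialized as a scalar and whose native comparison ignores one field.

```python
class Font:
    """Hypothetical stand-in for a native type that is serialized as a
    scalar such as "Times New Roman, 14pt" but compares by family only."""

    def __init__(self, family, size):
        self.family = family
        self.size = size

    @classmethod
    def from_scalar(cls, text):
        # Split at the last comma: family on the left, size on the right.
        family, size = text.rsplit(",", 1)
        return cls(family.strip(), size.strip())

    def __eq__(self, other):
        # Deliberately ignores the size field.
        return isinstance(other, Font) and self.family == other.family

    def __hash__(self):
        return hash(self.family)


scalar_a = "Times New Roman, 14pt"
scalar_b = "Times New Roman, 10pt"

print(scalar_a == scalar_b)                                      # False
print(Font.from_scalar(scalar_a) == Font.from_scalar(scalar_b))  # True
```

The two scalars differ character by character, yet the constructed objects are equal keys for the application. A comparison of scalar content alone would give the opposite answer.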

I know it is very rare that such objects are used as keys
of a mapping and that the definition of equality causes
any problem. At the same time, I don't see any benefit
in defining objects' equality in the YAML spec differently
from the native one.

In my opinion, the definition of equality should belong
to the domain-specific data type, not to the serialization
language.

Best,
Osamu Takeuchi


Oren Ben-Kiki
2016-03-05 05:23:44 UTC
There is no requirement that early detection catch 100% of the duplicate
keys. There is a requirement that early detection will never incorrectly
flag different keys as being duplicate.

So: A YAML processor _need not_ preserve the identity of scalars. Therefore
it is allowed to (correctly) flag equal scalar keys. Because an application
_must not_ rely on scalar identity to compare scalar keys.

The fact that "1+2i" might be equal to { r: 1, i: 2 } just means that early
detection isn't perfect. Which is OK since the final definitive equality
check is done by the application anyway.

What we really want is to _allow_ YAML processors to flag { a: 1, a : 2 }
as a duplicate key, because this is the most common type of error. The
rules _allow_ early flagging of these errors, without waiting for the
application, and a well-behaved processor _should_ flag this as an error,
as early as the entry to the compose stage.
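
A conservative early check of this kind could be sketched in Python as below. The function early_duplicate_keys and the (tag, content) representation of scalar keys are assumptions made for this sketch, not an API of any YAML processor. It never flags complex keys and only flags scalar keys whose tag and content are both identical.

```python
def early_duplicate_keys(pairs):
    """Return scalar keys that are certainly duplicated.

    Assumption for this sketch: each key is a (tag, content) pair for a
    scalar, or None for a complex key, which is never flagged here.
    """
    seen, duplicates = set(), []
    for key, _value in pairs:
        if key is None:
            continue            # complex key: leave it to the application
        if key in seen:
            duplicates.append(key)
        seen.add(key)
    return duplicates


STR = "tag:yaml.org,2002:str"
INT = "tag:yaml.org,2002:int"

# { a: 1, a: 2 }
print(early_duplicate_keys([((STR, "a"), 1), ((STR, "a"), 2)]))
# { 1: x, "1": y }: same content, different tags, so not flagged
print(early_duplicate_keys([((INT, "1"), "x"), ((STR, "1"), "y")]))
```

The check can miss duplicates that only the application knows about, but under these assumptions it does not report keys as duplicates unless tag and content match exactly.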

Oren.
Osamu TAKEUCHI
2016-03-06 17:00:45 UTC
Oren,

I expect we share the same thinking that the
definition of equality belongs to the domain
specific data type, not to the serialization
language. So, unless it makes the serialized
documents much more readable or portable, a
serialization language should not determine
its own equality or identity definition.
Incoherence in data semantics between
the native data and the serialized document
will surprise users and reduce readability.

I believe nobody is willing to define node
equality in the JSON or XML specifications to
improve readability or portability.
Nobody compliments YAML on being more
readable and portable than JSON and XML
because it gives its own definition of node
equality and provides non-preservation of
identity for scalar nodes.


In reality, the equality defined in the
YAML spec is almost always neglected, and
equality evaluation is done by the native
equality evaluators in the existing YAML
libraries and applications. I believe the
situation will not change in the future,
because people do not see much benefit in
comparing nodes under YAML's standard instead
of the data's own standard.

I agree duplicate keys should be detected by
YAML processors because we do not want users
to use duplicate keys for overwriting the
values of predefined keys. The key order in
a YAML mapping should not have meaning.
This part of the specification is valuable for
better readability and portability of a
YAML document. If YAML libraries detect such
misuse and raise errors or warnings, it
can be effectively avoided.

But this can be done without defining equality
in the YAML spec. A YAML processor can use the
native equality evaluator of the data at its
construction stage, and it should do so.
If the layered structure of the YAML processor
does not allow it, the layered structure itself
should be revised. I don't see how the layered
structure is related to the current topic,
though.


Similarly, I do not want to forbid PHP users
to store PHP's native key-order-aware hash
in a key-order-unaware YAML mapping unless
the specific application really places
importance on the key order. If the key order
actually matters, they should store it in a
!!omap as the YAML spec advises. But otherwise,
they are allowed to store it in a !!map. Then,
the document will express the meaning of the
data more correctly. The data semantics belongs
to the data itself, not to the programming
language nor to the serialization language.
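
The distinction between an order-unaware and an order-aware container can be illustrated with Python's own types. This is only an analogy I am assuming for illustration: a plain dict plays the role of !!map and collections.OrderedDict the role of !!omap.

```python
from collections import OrderedDict

# Plain dict: equality ignores key order, like a YAML !!map.
plain_equal = {"a": 1, "b": 2} == {"b": 2, "a": 1}
print(plain_equal)  # True

# OrderedDict: equality between two OrderedDicts is order-sensitive,
# which is closer to a YAML !!omap.
first = OrderedDict([("a", 1), ("b", 2)])
second = OrderedDict([("b", 2), ("a", 1)])
ordered_equal = first == second
print(ordered_equal)  # False
```

Which of the two behaviours is right depends on what the data means to the application, not on the container the language happens to provide.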


Meaningfulness of the data identity should
also belong to the specific data types. As
shown by the previous examples, the difference
between the semantics of a scalar node and that
of a complex node is not always clear. I imagine
that it is much less clear than the current
YAML spec assumes. If we allow
equality evaluation by identity for collection
nodes, it is more coherent to allow it also for
scalar nodes. Actually, unless possible
non-preservation of identity for scalars is
declared, nobody would think that data with an
identity-based equality evaluation must be
stored as a collection node and must not be
stored as a scalar node. It will surprise users.

Such a restriction will improve YAML's readability
and portability very little, if at all. Actually, I
believe the restriction is currently not widely
known and very few libraries and applications
have ever utilized it. I don't think many
existing YAML documents will lose their meaning
if we drop the restriction. I don't see any kind
of problem being caused in *realistic* use cases
by dropping it. I do not expect it to be widely
utilized for building libraries, storing data
and understanding documents in the future,
regardless of what is written in the spec.


So, let's make the spec simpler by dropping the
definition of YAML's own equality and identity
preservation.

What I want to say in the spec is:
A well-behaved processor _should_ detect a
duplicate key and flag it as an error if it
can correctly evaluate equality of nodes.
It _must_ be aware that data with some specific
tags may have custom comparison algorithms,
including ones based on the data identity.
Namely, two YAML nodes with the same values and
the same tags can be evaluated as unequal by an
identity-based evaluator, while two YAML nodes
with different values and even different tags can
be evaluated as equal by some specific
evaluators. Note that JavaScript does not
natively distinguish the integer 0x01 from the
sequence [1] as mapping keys.
var test = { 0x01: "Find me!" };  // key is stored as "1"
if (test[[1]] === "Find me!")     // [1] also becomes "1"
  alert("Surprised?");
Be also warned that tags of nodes can be
implicitly specified by the path of the node
from the root. So, a schema-blind YAML processor
can never know how to resolve the tag of a
tag-unspecified node. A well-behaved YAML
processor _must_ be schema-aware, that is, it
refers to the schema to resolve the tag correctly
and to evaluate equality correctly from the
resolved tag.

Best,
Osamu Takeuchi
Oren Ben-Kiki
2016-03-06 18:27:56 UTC
Permalink
Post by Osamu TAKEUCHI
Oren,
I expect we share the same thinking that the
definition of equality belongs to the domain
specific data type, not to the serialization
language.
Pretty much.
Post by Osamu TAKEUCHI
So, unless it makes the serialized
documents much more readable or portable, a
serialization language should not determine
its own equality or identity definition.
That's a big "unless".
Post by Osamu TAKEUCHI
I agree duplicate key should be detected by
YAML processors because we do not want users
to use duplicate keys for overwriting the
values of predefined keys.
Yes.
Post by Osamu TAKEUCHI
The key order in
a YAML mapping should not have meaning.
Very strong yes.
Post by Osamu TAKEUCHI
But it can be done without defining equality
in YAML spec. YAML processor can use native
equality evaluator of the data at its
construction stage and it should do so.
No. You assume there _is_ a construction stage. There need not be one.
Post by Osamu TAKEUCHI
If the layered structure of the YAML processor
does not allow it, the layered structure itself
should be revised. I don't see how the layered
structure is related to the current topic,
though.
It is crucial to the discussion. We all agree that key duplication
detection _must_ be done at the application layer, but the point is that a
_limited_ form of key duplication detection _may_ and _should_ be done,
especially in YAML processors that do not even _have_ an application layer.
This is because, as you put it, "it makes the serialized documents much
more readable or portable".
Post by Osamu TAKEUCHI
Similarly, I do not want to forbid PHP users
to store a PHP's native key-order-aware hash
into a key-order-unaware YAML mapping
The problem is, how can you tell whether this is/not safe to do? When
dumping such a hash table to YAML, the application needs to provide some
hint to the YAML processor whether this is actually safe. By default, it is
_not_ safe, so without an explicit hint, the YAML processor _should_ do the
safe thing and emit it as an !!omap.
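A minimal sketch of that default, assuming a hypothetical `dump_mapping` helper (not part of any YAML library) that handles only plain scalar keys and values and does no quoting or escaping. Unless the caller passes an explicit hint that key order is irrelevant, the ordered pairs are emitted as an !!omap:

```python
def dump_mapping(pairs, order_matters=True):
    """Emit (key, value) pairs as YAML text.

    Hypothetical helper for illustration only: it assumes plain scalars
    that need no quoting. Without an explicit hint (order_matters=False),
    it takes the safe route and emits an ordered !!omap.
    """
    if order_matters:
        # !!omap is a sequence of single-pair mappings, so order survives.
        entries = [f"- {key}: {value}" for key, value in pairs]
        return "\n".join(["!!omap"] + entries)
    # The caller stated that order carries no meaning: a plain mapping is fine.
    return "\n".join(f"{key}: {value}" for key, value in pairs)


php_like_hash = [("b", 1), ("a", 2)]   # insertion-ordered, like a PHP array
print(dump_mapping(php_like_hash))
print(dump_mapping(php_like_hash, order_matters=False))
```

The design choice is that the hint has to come from the application, because only the application knows whether the order carries meaning.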
Post by Osamu TAKEUCHI
Meaningfulness of the data identity should
also belong to the specific data types. As
shown by the previous examples, the difference
in the semantics of a scalar node and that of
a complex node is not always clear.
Looks pretty clear to me. Scalars are "values". They have _no_ identity,
they have _only_ content. Complex nodes have identity, and as you pointed
out, this means their actual content may be irrelevant (for comparison).
The current spec gets that last point wrong.

I think the core issue here is identity of scalars. You seem to assume that
a YAML processor _must_ preserve the identity of scalars. That is, it _must
not_, for example, use interned strings for keys. The current spec says the
opposite. A YAML processor _need not_ preserve scalar identity and it _may_
use interned strings and other similar tricks. It is definitely not
required to keep the identity of, say, integer scalars!

Post by Osamu TAKEUCHI
... without declaring
possible non-preservation of identity for
scalars, nobody will think a data with an
identity-based equality evaluation must be
stored as a collection node and must not as a
scalar node. It brings some surprise to users.
Really? People would be very surprised to hear that { a: 1, a: 2 } is
actually OK because some application somewhere _may_ decide it wants scalar
string keys to use "identity-based equality".

So *No*.

For increased readability and portability, the above _must_ be allowed to
be flagged as a duplicate by a YAML processor regardless of what the
application is. And _should_ be flagged so by "well behaved" processors.
Even if they do _not_ have an application layer.
Post by Osamu TAKEUCHI
Such restriction will improve YAML's readability
and portability very little if any.
We'll have to agree to disagree, I'm afraid...
Post by Osamu TAKEUCHI
Actually, I
believe the restriction is currently not known
widely and very few libraries and applications
have ever utilized it.


A pity.
Post by Osamu TAKEUCHI
I don't think many
existing YAML documents lose their meaning if we
drop the restriction.
No valid YAML documents will, that's for sure ;-) But that's beside the
point.
Post by Osamu TAKEUCHI
So, let's make the spec simpler by dropping the
definition of YAML's own equality and identity
preservation.
There's no such thing as not addressing the issue of identity and equality
in the spec. Either you _require_ a YAML processor to preserve the identity
of scalars (including, horribly, simple integers), or you do not. Either
way it needs to be stated in the spec.

We chose to say a YAML processor _need not_ preserve the identity of
scalars. Given this, then an application _must not_ use scalar identity for
equality comparisons. Given this, then _regardless_ of the application's
definition of equality, we can predict with 100% certainty that { a: 1, a:
2 } contains a duplicate key.
Post by Osamu TAKEUCHI
A well-behaved processor _should_ detect a
duplicate key and flag it as an error if it
can correctly evaluate equality of nodes.
So far so good.
Post by Osamu TAKEUCHI
It _must_ be aware that data with some specific
tags may have some custom comparison algorithms,
including the one based on the data identity.
Yes, the current spec gets the identity point wrong.
Post by Osamu TAKEUCHI
Namely, two YAML nodes of same values and same
tags can be evaluated to be unequal by an
identity-based evaluator,
_Only_ if these are complex nodes.
Post by Osamu TAKEUCHI
while two YAML nodes
of different values and even different tags can
be evaluated to be equal by some specific
evaluators. Note that JavaScript does not
natively distinguish an integer 0x01 from a
sequence [1] as mapping keys.
You keep conflating false positives with false negatives. False negatives
are _fine_. It is OK for the processor to miss some cases of key
duplication. In fact it is expected. The application is the final arbiter
of key equality. You can keep on piling as many examples of "the processor
can't detect keys in <some example> as duplicated" as you want. OF COURSE
there are such cases.

But this does not mean in any way shape or form that we allow false
positives. A YAML processor must not ever complain about key duplication
when such a duplication does not exist. Now, in JavaScript, PHP, Perl,
Ruby, Python, C++, and any other valid YAML system, { a: 1, a: 2 } _does_
have a duplicate key. So a processor _is_ allowed and _should_ complain
about this, _regardless_ of the application-defined equality operator.

Post by Osamu TAKEUCHI
It is also warned that tags of nodes can be
implicitly specified by the path of the node
from the root. So, a schema-blind YAML processor
can never know how to resolve a tag for any
tag-unspecified node.
The path to both the "a" keys in { a: 1, a: 2 } is, by definition, the same
(the path to all keys in the same mapping is, by definition, identical). So
whatever tag is assigned to one of them, by definition, the same tag must
be assigned to the other as well. The application _can't_ use different
tags to distinguish between them. It _can't_ use their identity to
distinguish between them because the YAML processor need not give them
different identities. It _can't_ use their content to distinguish between
them because they have the same content. So, the application _must_
consider them equal - there's just no other possible choice.

So we _can_ complain about them being equal at an earlier processing stage.
We do not _require_ a YAML processor to do so, but we _allow_ and
_encourage_ it to do so.
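The conservative check described above could be sketched as follows. This is only an illustration under stated assumptions: `ScalarKey` and `certain_duplicates` are hypothetical names, not part of any YAML library, and a key is modelled as just its tag (an explicit tag, or "?" / "!" for the two non-specific tags) plus its content. Two keys are flagged only when both fields match, so the check can miss duplicates (false negatives) but, as long as scalar identity is not significant, does not report a false positive:

```python
from typing import List, NamedTuple


class ScalarKey(NamedTuple):
    tag: str      # explicit tag, or "?" (plain) / "!" (quoted) when untagged
    content: str  # scalar content as it appears in the document


def certain_duplicates(keys: List[ScalarKey]) -> List[ScalarKey]:
    """Return keys that repeat an earlier key with the same tag and content."""
    seen = set()
    duplicates = []
    for key in keys:
        if key in seen:
            duplicates.append(key)
        seen.add(key)
    return duplicates


# { a: 1, a: 2 }: same position, same (unresolved) tag, same content.
print(certain_duplicates([ScalarKey("?", "a"), ScalarKey("?", "a")]))
# { 1: x, "1": y }: the tags differ, so a schema-blind check stays silent.
print(certain_duplicates([ScalarKey("?", "1"), ScalarKey("!", "1")]))
```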
Post by Osamu TAKEUCHI
A well-behaved YAML
processor _must_ be schema aware,
Now this is just plain wrong. YamlReference is a YAML processor. It
implements the parsing stage. It has no clue whatsoever what schema is
used. Schema-blind YAML processing is, for me, an important use case.

And if every possible schema in the universe _must_ decree that two keys
are equal, then we don't need to know the _specific_ schema, because
whatever it is, it will also _have_ to declare them equal.

Oren.
Osamu TAKEUCHI
2016-03-06 22:14:37 UTC
Permalink
Oren,

Thank you for your thought-provoking reply.
Post by Oren Ben-Kiki
You seem to assume that a YAML processor _must_ preserve
the identity of scalars. That is, it _must not_, for example,
use interned strings for keys.
Yes. But IMO, even with the current spec, a YAML processor
cannot use a simple interned string to represent a
tag-unresolved scalar node. Note that a YAML processor
must internally represent the scalar node by an instance of
some class that at least contains the information that the
node appeared without an explicit tag. Otherwise, the
processor cannot distinguish it from a node with an explicit
!!str tag. This is necessary for the tag resolution at a
later stage. Once the processor somehow resolves the tag to
!!str, it can safely use an interned string for the scalar
because it knows how to evaluate equality of !!str.
Post by Oren Ben-Kiki
It is crucial to the discussion. We all agree that key
duplication detection _must_ be done at the application
layer, but the point is that a _limited_ form of key
duplication detection _may_ and _should_ be done,
especially in YAML processors that do not even _have_
an application layer. This is because, as you put it,
"it makes the serialized documents much more readable
or portable".
I will not say "must." Instead, I say "should be done as
far as it is possible." I know a processor without a
construction stage can do very little in this sense if
my proposal is accepted. But I do not take that so seriously.
The inevitably imperfect key-duplication detection is, by
nature, just for educational purposes. It is not _directly_
beneficial to users, as opposed to my proposal.
Post by Osamu TAKEUCHI
... without declaring
possible non-preservation of identity for
scalars, nobody will think a data with an
identity-based equality evaluation must be
stored as a collection node and must not as a
scalar node. It brings some surprise to users.
Post by Oren Ben-Kiki
Really? People would be very surprised to hear that
{ a: 1, a: 2 } is actually OK because some application
somewhere _may_ decide it wants scalar string keys to
use "identity-based equality".
I failed to state the point clearly.
See the next example:

? !element '<a href="index.html">Top</a>'
: insert it before #the-box in red
? !element '<a href="index.html">Top</a>'
: insert it after #the-box in blue

I think the document above is readable enough unless
a schema-blind YAML processor tries to judge the data
equality and identity for unknown data type. At least
for me, it is as readable as the next.

? !element {from_html: '<a href="index.html">Top</a>'}
: insert it before #the-box in red
? !element {from_html: '<a href="index.html">Top</a>'}
: insert it after #the-box in blue

As expected, the constructed object will not have
a from_html property. The fragment of html is given to
the builder method of !element to construct a DOM
element.

Then, will it be so bad to allow construction of a
class object from a scalar node?

It does not seem strange to me to allow the former,
since we allow the latter.

There are many use cases similar to the above. If a
class has a builder function that accepts a string,
a number or whatever that can be stored in a YAML's
scalar node, it is natural to represent the class
object by a scalar value in a YAML document with
appropriate tag on it either explicitly or implicitly.
Post by Osamu TAKEUCHI
So, let's make the spec simpler by dropping the
definition of YAML's own equality and identity
preservation.
Post by Oren Ben-Kiki
There's no such thing as not addressing the issue of
identity and equality in the spec. Either you _require_
a YAML processor to preserve the identity of scalars
(including, horribly, simple integers), or you do not.
Either way it needs to be stated in the spec.
A schema-blind YAML processor can not judge if a
tag-unspecified scalar node indeed expresses a
simple integer until tag resolution is completed.
It is allowed to resolve

{ 1: 2, 1: 3 }

as

{ !not_integer 1 : 2, !not_integer 1 : 3 }

Although the current spec does not allow (!not_integer 1)
to be different from (!not_integer 1), I do not see
a strong reason to forbid such an evaluation, with the
above example in mind.
Post by Osamu TAKEUCHI
while two YAML nodes
of different values and even different tags can
be evaluated to be equal by some specific
evaluators. Note that javascript do not
natively distinguish an integer 0x01 with a
sequence [1] as mapping keys.
Post by Oren Ben-Kiki
You keep conflating false positives with false negatives.
False negatives are _fine_. It is OK for the processor
to miss some cases of key duplication. In fact it is
expected. The application is the final arbiter of key
equality. You can keep on piling as many examples of
"the processor can't detect keys in <some example> as
duplicated" as you want. OF COURSE there are such cases.
This part was just a generic reminder for readers that
equality evaluation depends strongly on the
specific application. So, it is not theoretically
needed, but it will hopefully help readers' understanding.
Post by Oren Ben-Kiki
Now, in JavaScript, PHP, Perl, Ruby, Python, C++, and
any other valid YAML system, { a: 1, a: 2 } _does_ have
a duplicate key. So a processor _is_ allowed and _should_
complain about this, _regardless_ of the application-defined
equality operator.
Again, as you explained in detail, a schema-blind YAML
system can not determine that the keys are of !!str,
even in this simple case. It might be resolved as

{ !not_string a : 1, !not_string a : 2}.

So, the equality is not defined by the programming languages
but only by the YAML spec.
Post by Osamu TAKEUCHI
A well-behaved YAML
processor _must_ be schema aware,
Post by Oren Ben-Kiki
Now this is just plain wrong. YamlReference is a
YAML processor. It implements the parsing stage.
It has no clue whatsoever what schema is used.
Schema-blind YAML processing is, for me, an
important use case.
I should have written this as

A well-behaved YAML processor that tries to detect key
duplication _must_ be schema aware.

Then, YamlReference is ok because it does not detect key
duplication at all, although it should do so if it is a
well-behaved processor. ;)


BTW, ambiguity of data semantics such as the one between
{ a: 1 } and { !not_string a: 1 } exists only for
schema-blind processors. For the readers of the document
and for the real tools that manipulate it, the explicit or
implicit schema resolves the ambiguity.

So, even if

!element '<a href="index.html">Top</a>'

constructs a class object, it will not surprise a user
who is really interested in the document.

Best,
Osamu Takeuchi


------------------------------------------------------------------------------
Osamu TAKEUCHI
2016-03-07 13:24:58 UTC
Permalink
Oren,
The answer is that we have learned from XML that these
lead to madness and _must_ be avoided.
After all, we are back at the starting point.
But now, I see you really think it makes YAML better than XML.

I understand you want every valid YAML document to be
processable by all YAML libraries on all platforms
when the proper schema is given. To do so, the use of YAML nodes
must be restricted to some degree. I understand your dream
to some degree.


If we go that way, as you wrote, we must forbid the use
of collection nodes as mapping keys when they differ
only in identity.


I still do not see why each tag must apply to exactly one
kind. It forbids the use case of !complex in the previous
example. Does it provide any more portability?


A JavaScript library will not handle a valid YAML mapping
correctly if it maps !!map to JavaScript's native object.
The library therefore _must_ provide its own data structure that
can distinguish !!str "1" from !!int 1, because the YAML specification
allows users to assume that a !!map node distinguishes them.
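To illustrate the difference in native key handling, here is a small Python sketch. Python's dict keeps the integer 1 and the string "1" apart. The `stringified` dict only imitates, as an assumption for illustration, a native structure that converts every key to a string, which is how plain JavaScript objects treat property keys:

```python
# Python's native dict distinguishes an integer key from a string key.
native = {1: "int", "1": "string"}
print(len(native))        # two distinct entries

# Imitation of a string-keyed native object: every key is converted with
# str(), so the two keys collide and the later value overwrites the earlier.
stringified = {}
for key, value in native.items():
    stringified[str(key)] = value
print(stringified)
```

A library built on the second kind of structure would silently lose one of the two entries unless it supplies its own mapping type.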


I still think portability is not reduced very much by
relaxing the equality definition in the spec, but I need some time
to work out the argument. I may come back to this point later.

Best,
Osamu Takeuchi
Osamu TAKEUCHI
2016-03-08 04:25:55 UTC
Permalink
Oren,
Post by Osamu TAKEUCHI
I missed to state the point clearly.
? !element '<a href="index.html">Top</a>'
: insert it before #the-box in red
? !element '<a href="index.html">Top</a>'
: insert it after #the-box in blue
If we allow this thing then we need to require all processors
to keep track of it and provide some API to distinguish between
these two entries. We explicitly and intentionally do _not_
require YAML processors to do this. Among other reasons, this gives
implementation greater freedom to use simpler APIs and more efficient
internal data structures, such as using actual native hash tables -
even for the intermediate partial representations.
Therefore, this document is not safe. Passing it through some tools
will cause data to disappear, or processing to fail. So it is _not_
valid YAML.
I still think they should.

I give another example.
How do you read the next document?

- !div
- !textNode "Hello! "
- !textNode "Hello! "

I expect two text nodes are inserted in a div.
Post by Osamu TAKEUCHI
var d = document.createElement('div');
var t1 = document.createTextNode('Hello! ');
var t2 = document.createTextNode('Hello! ');
d.appendChild(t1);
d.appendChild(t2);
document.body.appendChild(d);
The current spec allows a YAML processor to generate only one !textNode
from the file and insert it twice into the div. Namely, the spec
allows a processor to treat the above file as if the next file were
given.

- !div
- &T !textNode "Hello! "
- *T
Post by Osamu TAKEUCHI
var d = document.createElement('div');
var t = document.createTextNode('Hello! ');
d.appendChild(t);
d.appendChild(t);
document.body.appendChild(d);
On the contrary, the next works fine under the current spec
because the identity of collection nodes is always preserved.

- !div
- !textNode text: "Hello! "
- !textNode text: "Hello! "

I believe this surprises users.

YAML processors should only be allowed to discard the identities
of scalars _with standard tags_ but not those _with custom tags_.
How equality and identity are evaluated for specific data
must be determined by its tags.

Actually, I don't see how a YAML library can _unintentionally_
discard identities of scalar nodes with custom tags. Could you
show an example?

Osamu Takeuchi
Osamu TAKEUCHI
2016-03-08 11:07:48 UTC
Permalink
Oops,
So, we should try to solve all unsafeties of such use
cases.
should have been
So, we should _not_ try to solve all unsafeties of such use
cases.
In addition, "!float" should have been "!!float".

Sorry for the mistypes.

Best
Osamu Takeuchi
Oren,
Post by Osamu TAKEUCHI
Actually, I don't see how a YAML library can _unintentionally_
discard identities of scalar nodes with custom tags. Could you
show an example?
I found an example by myself.
A YAML document,
- &A 3.141592653589793238462
- *A
is usually loaded as
[!float 3.141592653589793238462, !float 3.141592653589793238462]
in many YAML systems. Here, the identity of &A is lost.
The problem is, this document can mean
- &A !not_float 3.141592653589793238462
- *A
by an implicit tag resolution with some customized schema.
If a schema-blind tool loads the document and saves it as
[3.141592653589793238462, 3.141592653589793238462]
and !not_float requires identity preservation, the document
is broken.
I see why it is unsafe to store some data that needs identity
preservation into a YAML scalar.
As long as we only have this example, the remaining issue
is how much we expect from schema-blind tools.
If a schema-blind tool loads and saves the document like
above, the user will not be happy anyway, even if the
node is indeed of !float. If the document is written
with anchors and aliases, the user will not want any
schema-blind third-party YAML tools to destroy the
identity anyway.
Another point is, if a schema-blind tool loads the value
as !float and save it as
- 3.14159265359
- 3.14159265359
the document is also broken. Note that the trailing digits
are rounded due to the limited precision of native float
type.
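The rounding is easy to reproduce. A sketch in Python, whose built-in float is an IEEE 754 double (the exact digits a tool writes back depend on its formatting routine, so the shorter output quoted above would come from a tool that prints fewer digits):

```python
original = "3.141592653589793238462"   # scalar content as written in the document

value = float(original)                 # load into the native float type
written_back = repr(value)              # shortest text that round-trips the double

print(written_back)
print(written_back == original)         # False: the trailing digits are gone
```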
If a schema-blind tool loads and saves a YAML document,
there are so many unsafeties besides scalar's identity.
So, we should try to solve all unsafeties of such use
cases.
I appreciate your comments.
Osamu Takeuchi
Oren Ben-Kiki
2016-03-08 18:40:21 UTC
Permalink
I think you are missing the point of what is _allowed_ and what is
_required_.

Take your div example:

div:
- !textDomElement foo
- !textDomElement foo

As opposed to:

div:
- &A !textDomElement foo
- *A

The YAML processor is free in level 2 to use the same abstract graph node
in both cases.

The application is free to construct two different-identity DOM elements -
again, in both cases.

There is no conflict.

That is, the application should not _rely_ on being given a different
abstract node graph in the two cases above. This does _not_ force it to use
the identical DOM element.
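A sketch of that separation in Python, with hypothetical names throughout (`TextDomElement` stands in for the application's native class, and a node is modelled as a (tag, content) tuple). The representation graph holds a single shared node, yet the constructor still produces two objects with distinct identities:

```python
class TextDomElement:
    """Stand-in for the application's native type (hypothetical)."""

    def __init__(self, text):
        self.text = text


def construct(node):
    """Build a fresh native object for every occurrence of a node."""
    tag, content = node
    return TextDomElement(content)


# The processor may represent both entries by one shared graph node.
shared_node = ("!textDomElement", "foo")
sequence = [shared_node, shared_node]

elements = [construct(node) for node in sequence]
print(sequence[0] is sequence[1])    # True: one node in the graph
print(elements[0] is elements[1])    # False: two native objects
```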

You are taking a restriction of level 2 and projecting it to layer 3 - like
I said, most of this thread is just applying a rule of one layer to another
layer where it does not apply.

Hope this helps,

Oren.
_______________________________________________
Yaml-core mailing list
https://lists.sourceforge.net/lists/listinfo/yaml-core
Osamu TAKEUCHI
2016-03-09 05:55:39 UTC
Permalink
Post by Oren Ben-Kiki
That is, the application should not _rely_ on being given a different
abstract nodes graph in the two case above. This does _not_ force it
to use the identical DOM element.
I know the layer issues. I know the current spec does not allow the
application to rely on it. My concern is what the restriction is for.
Post by Oren Ben-Kiki
You are taking a restriction of level 2 and projecting it to layer 3 -
like I said, most of this thread is just applying a rule of one layer
to another layer where it does not apply.
I am not taking the restriction of level 2 and projecting it to layer 3.
I'm talking about the possibility of changing restrictions in the future.


I want to know the reason why non-preservation of scalar identity is
allowed in the spec. As I wrote, I do not see any practical use cases
in which applications or libraries can make use of the part of the spec
for some valuable purpose. You said it allows a schema-blind tool to
load tag-unresolved scalars into native scalar variables and put them
into native collection variables. Will it be really allowed? AFAIU,
such a tool can never write back the data tree into a YAML document
safely, as I wrote in the previous post. Then, what can be the task of
the tools? Do we really have to put much importance on such use cases?
These are my questions. Note that I of course allow libraries to discard
identities of real scalars such as !!str and !!int
_after complete tag resolution_.

Thank you for your patience.
Osamu Takeuchi
Oren Ben-Kiki
2016-03-09 07:41:50 UTC
Permalink
Post by Osamu TAKEUCHI
I want to know the reason why non-preservation of scalar identity is
allowed in the spec.
The design goals for YAML are, in decreasing priority:

1. YAML is easily readable by humans.
Having scalars-with-identity is not what humans expect. Humans _tend_ to
perceive scalars as "just values" and collections as "something with
identity".

2. YAML data is portable between programming languages.

Different applications can apply a different schema to the same document.
If these schemas differ on assigning identity then you cause all sorts of
sticky issues.

3. YAML matches the native data structures
<http://www.yaml.org/spec/1.2/spec.html#native data structure//> of agile
languages.

Which tend to make scalars immutable. Even Ruby is starting to see the
error of its ways here so it is moving to make at least some strings be
immutable (literals, "frozen", etc.).

4. YAML has a consistent model to support generic tools.

I guess the question is, how strong is that consistent model? What does it
_allow_ generic tools to do? The rules about scalar identity give generic
tools the ability to do more than they could do if scalar identity had to
be preserved. Note that a human with a text editor can also be a generic,
schema-blind tool.

5. YAML supports one-pass processing.

That's irrelevant to the topic at hand. I think.

6. YAML is expressive and extensible.

You could argue that not preserving scalar identity requires a more
cumbersome expression of some native data (e.g. wrapping a scalar in a
collection "just because" you want to ensure its identity is preserved -
similarly to having to use an !!omap instead of the cleaner map syntax for
PHP dictionaries). "Everything" still _can_ be expressed, though.

7. YAML is easy to implement and use.

I don't see that it applies here. At any rate, being the last goal, it tends to
lose out - which is the main issue people have with YAML.

So... it seems to be a reasonable decision given our goals.

One of the reasons we explicitly listed the goals, _in order_, was to break
ties when different goals pushed us in different directions. Order the
goals differently, and you'll get a different spec. I think you would end
up with JSON if you ordered them in a different way. Or even, god help us,
XML ;-)

Oren.
Osamu TAKEUCHI
2016-03-10 07:50:32 UTC
Permalink
Thanks, Oren,
Post by Osamu TAKEUCHI
I want to know the reason why non-preservation of
scalar identity is allowed in the spec.
Post by Oren Ben-Kiki
1. YAML is easily readable by humans.
Having scalars-with-identity is not what humans expect.
Humans _tend_ to perceive scalars as "just values" and
collections as "something with identity".
I see your point and I agree with it to some degree.

But, at the same time, I also expect that an implicit/explicit
schema should greatly help people read a document.
Namely, if some scalar nodes really need identities,
the meaning of those nodes is almost always obvious
from the content and purpose of the document, at least
for its users. In contrast, it may not be obvious for
machines.

Another point is about the purpose of each document.
If a document is full of implicit custom tags,
it will not be readable or editable by everybody, and
it is not meant to be. It is probably expected
to be written and read only by a specific
application. But it is still much more readable in
YAML than in XML. I don't think people need hyper
readability in a YAML document with many custom tags.

On the other hand, if we removed the tag feature
from YAML entirely, any YAML document would be easy to read
and edit for everyone, but we do not want to do that.

We have to optimize the balance between readability
and functionality, and maybe simplicity of the spec.
Post by Osamu TAKEUCHI
2. YAML data is portable between programming languages.
Different applications can apply a different schema to
the same document. If these schemas differ on assigning
identity then you cause all sort of sticky issues.
What do you mean by "can"? Could you give us any example
where different applications _can_ apply a different
schema to the same document? It sounds really strange
to me.

IMO, no document can stand without a valid schema.
A schema-blind application must not apply any
possibly-wrong schema to any document. Namely, it can
merely build a partial representation graph and can do
nothing more. If it goes any further, it will very easily
break documents even without the identity issue, as partially
shown by my previous example with !!float tags.
Post by Osamu TAKEUCHI
3. YAML matches the native data structures
<http://www.yaml.org/spec/1.2/spec.html#native data structure//>
of agile languages.
I agree that YAML nodes with standard tags match the
native data structures of agile languages pretty well.
But what about nodes with custom tags? How do they
match the native data structures, and how can we make
use of the matching? Could you show us some example use cases?
Post by Osamu TAKEUCHI
Which tend to make scalars immutable. Even Ruby is
starting to see the error of its ways here so it is
moving to make at least some strings be immutable
(literals, "frozen", etc.).
I repeat, I do not want to keep identities of !!str
nodes. If you are confident that the data indeed do not
need identity preservation, you can discard identities.

My question is what we will gain by being allowed to
discard identities of nodes _with possibly unknown tags_.
Post by Osamu TAKEUCHI
4. YAML has a consistent model to support generic tools.
that you
I guess the question is, how strong is that consistent
model? What does it _allow_ generic tools to do? The
rules about scalar identity give generic tools the
ability to do more than they could do if scalar identity
had to be preserved. Note that a human with a text
editor can also be a generic, schema-blind tool.
Ok, this seems a good example. So, what is additionally
allowed to the guy with a text editor by being allowed not
to preserve scalar identity? I don't see much more than
aliasing and unaliasing the seemingly-equal nodes. Do you
see any more? If they are the only things it allows, I
don't see much importance on it.

In addition, I don't want any guy who is unaware of the
schema to edit my YAML documents. When people want to read
or edit YAML documents, they are almost always aware of
the proper schema of the target document. When and why do
you read and edit a YAML document without the knowledge of
its schema?
Post by Osamu TAKEUCHI
5. YAML supports one-pass processing.
That's irrelevant to the topic at hand. I think.
I agree.
Post by Osamu TAKEUCHI
6. YAML is expressive and extensible.
You could argue that not preserving scalar identity
requires a more cumbersome expression of some native
data (e.g. wrapping a scalar in a collection "just
because" you want to ensure its identity is preserved -
similarly to having to use an !!omap instead of the
cleaner map syntax for PHP dictionaries).
"Everything" still _can_ be expressed, though.
If the benefit is larger than the labor, I will accept
the restriction. On the !!omap issue, I am about neutral.
I see it can prevent people from unintentionally breaking a
YAML document by swapping the key order. I judge that this
happens much more easily than someone aliasing or unaliasing
unexpected nodes.

On the other hand, if we see much more useful use cases
by allowing users to store key-order-aware hash or
key-duplication-unaware objects in mapping nodes, I
would be convinced to discard that restriction as Zenaan
is trying to.
Post by Osamu TAKEUCHI
7. YAML is easy to implement and use.
I don't see it applies here. At any rate, being the
last goal, it tends to lose out - which is the main
issue people have with YAML.
I agree.
Post by Osamu TAKEUCHI
So... it seems to be a reasonable decision given our
goals.
One of the reasons we explicitly listed the goals,
_in order_, was to break ties when different goals
pushed us in different directions. Order the goals
differently, and you'll get a different spec. I think
you would end up with JSON if you ordered them in a
different way. Or even, god help us, XML ;-)
I strongly agree with this statement and seemingly
the order itself is nice, too. But as I commented on
the first item, the order can not be super strict.
We always have to optimize the balance of conflicting
ways to the goals.

Another point is, I feel that most of the YAML documents
in your mind are those that can be stored in JSON, but
those in my mind are not always. YAML documents
full of custom tags are not easily stored in JSON.
If we give up YAML, the next choice will indeed be XML.
Actually, my library is built to replace the XML serializer
that is provided by C#. Since we want YAML to be able
to serialize both kinds of documents, we have to find
the best balance.

Best,
Osamu Takeuchi
Zenaan Harkness
2016-03-10 13:01:07 UTC
Permalink
Post by Osamu TAKEUCHI
Post by Oren Ben-Kiki
2. YAML data is portable between programming languages.
Different applications can apply a different schema to
the same document. If these schemas differ on assigning
identity then you cause all sort of sticky issues.
What do you mean by "can"? Could you give us any example
where different applications _can_ apply a different
schema to the same document?
Such as application/ layer 4 "application" with application specific
schema, and schema-blind middleware, and in this case, "If these
schemas differ on assigning identity then you cause all sort of sticky
issues."
Post by Osamu TAKEUCHI
It sounds really strange for me.
Exactly because it is likely to cause problems.

Either
- use default YAML schema (previous emails) and then a "schema blind"
"YAML middleware" processor might make sense, but who knows - this is
up to developers of YAML middleware and or 'end user' applications,
- or use application specific schema, and you are very unlikely to be
able to mix in "schema blind" YAML middleware.

I think we all agree on these basics.
Post by Osamu TAKEUCHI
IMO, any document can not stand without a valid schema.
If we say "schema less document" is actually YAML document relying on
"default schema" (with scalars working as users expect them etc), then
I agree with you.

I like the "default schema" where "schema less" YAML documents work as
I expect them to. I use YAML mainly with human (me) generated YAML
files, and if the default schema were to change to require "scalars
have identity" then the first thing I would require of any YAML tool I
use, is that it support the "old style" schema.
Post by Osamu TAKEUCHI
A schema-blind application must not apply any
possibly-wrong schema to any document. Namely, it can
I think part of the problem is that we sort of lack a real world
example of a practical, regular-use "schema blind" tool. I think
the examples of scalars needing identity have so far been a bit
contrived - if '!not_float 3.14159' must have separate identity / be
not equal to '!not_float 3.14159', that to my eyes looks like a
decidedly application specific schema. I am still struggling to
comprehend how that could be useful, even though I can certainly
accept that it's possible it might be useful in some situation to
someone.
Post by Osamu TAKEUCHI
merely build partial representation graph and can not do
any more. If it go any further, it will very easily break
documents even without the identity issue, as partially
shown by my previous example with !!float tags.
If an application requires such behaviour ("schema"), then it, of
course, could not rely upon the default schema. This sounds sensible,
yes?
Post by Osamu TAKEUCHI
Post by Oren Ben-Kiki
3. YAML matches the native data structures
<http://www.yaml.org/spec/1.2/spec.html#native data structure//>
of agile languages.
I agree that YAML nodes with standard tags match the
native data structures of agile languages pretty well.
But what about nodes with custom tags? How do they
match the native data structures, and how can we make
use of the matching? Could you show us some example use cases?
How is that relevant to the discussion?

Some languages will have layer 3/4 YAML constructors to support
"arbitrary" custom tags/ schemas - other languages might be a bit
limited somehow, although if such languages exist, then I think that
language would be not very popular - we have relatively high "minimum
standards" these days... I don't think BrainFuck is going to be at the
top of the list for YAML implementations, although apparently BF -is-
Turing complete...
Post by Osamu TAKEUCHI
Post by Oren Ben-Kiki
Which tend to make scalars immutable. Even Ruby is
starting to see the error of its ways here so it is
moving to make at least some strings be immutable
(literals, "frozen", etc.).
I repeat, I do not want to keep identities of !!str
nodes. If you are confident that the data indeed do not
need identity preservation, you can discard identities.
My question is what we will gain by being allowed to
discard identities of nodes _with possibly unknown tags_.
I am not a YAML library implementer/ programmer. That said, I think
your question is too theoretical - if an application requires that any
YAML middleware preserve scalar identity, then that application is
naturally going to have to be pretty specific about which "YAML
middleware" is allowed to be used in its processing chains.

In a decade or more, I have not seen the doors of this mailing list
being broken down with questions such as "why is my YAML middleware
not correctly processing my YAML communication channel between my YAML
endpoints?"
(Hint: we've never, ever, seen such a question.)

Even a well-grounded "real world hypothetical" would move the
discussion onwards at this point. But we don't even have a "real world
hypothetical", let alone a real world problem.
Post by Osamu TAKEUCHI
Post by Oren Ben-Kiki
4. YAML has a consistent model to support generic tools.
that you
I guess the question is, how strong is that consistent
model? What does it _allow_ generic tools to do? The
rules about scalar identity give generic tools the
ability to do more than they could do if scalar identity
had to be preserved. Note that a human with a text
editor can also be a generic, schema-blind tool.
Ok, this seems a good example. So, what is additionally
allowed to the guy with a text editor by being allowed not
to preserve scalar identity?
Common sense and "default expectation" - Linus Torvalds has a similar
saying about kernel to user land interfaces, where POSIX has
occasionally been allowed to be violated by linux, or a bug has been
solved in one out of a number of ways, because of historical precedent
and or user expectation.

Scalars having identity breaks the expectations of the man with the
editor. And if my editing of a particular YAML document does depend on
scalar identity, I expect that that would require awareness of the
schema for this document (which presumably I have, since I am manually
editing the document).

This is actually an example which favours the default expectation of
users, which is, that scalar identity is not preserved, strings can be
interned, etc.
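
As an aside, the interning of strings mentioned above can be sketched in a few lines of Python. This is only an illustration of the general idea, using the standard library's sys.intern; it is not taken from any YAML library, and the variable names are made up for the example.

```python
import sys

# Build two equal strings at run time, as a loader would when it
# reads the same scalar content from two places in a document.
first = "".join(["Test", "Bean"])
second = "".join(["Test", "Bean"])

# Interning maps equal strings onto one shared object, so whatever
# separate identity the two scalars had is discarded.
interned_first = sys.intern(first)
interned_second = sys.intern(second)

print(first == second)                    # equal by value
print(interned_first is interned_second)  # one object after interning
```

A tool that interns scalars this way cannot tell afterwards whether the document held one aliased scalar or two separate ones.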

If a particular YAML "middleware" that I, the man with the editor, am
using, let's say a YAML pretty printer, unexpectedly preserves scalar
identity, then I'll tell the dang pretty printer where to go.

If I am programming some transactional multi layered software, and
some middleware layer needs to work in a particular way (preserving
identity, or not preserving identity), I'm simply going to make sure the
YAML tool I use supports the mode of operation that I require.

I think you might be trying to solve a non existent problem?
Post by Osamu TAKEUCHI
Post by Oren Ben-Kiki
6. YAML is expressive and extensible.
You could argue that not preserving scalar identity
requires a more cumbersome expression of some native
data (e.g. wrapping a scalar in a collection "just
because" you want to ensure its identity is preserved -
similarly to having to use an !!omap instead of the
cleaner map syntax for PHP dictionaries).
"Everything" still _can_ be expressed, though.
This sounds intuitive to me.
Post by Osamu TAKEUCHI
If the benefit is larger than the labor, I will accept
the restriction.
For starters, your !!not_float example seems artificial to me - like
trying to solve a non existing problem. Changing an "intuitive to
humans" aspect of the current YAML spec (/ default schema), in order
to solve a non real world problem, would be a step backwards for YAML.
Post by Osamu TAKEUCHI
For !!omap issue, I am about neutral.
I see it can prevent people unintentionally breaking a
YAML document by swapping the key order. I evaluate this
happens much more easily than someone aliases or unaliases
unexpected nodes.
Map and OrderedMap are well defined (mathematically) concepts. YAML
supports both, map by default schema, any other type of map by
alternate schema/ tags. Any middleware will only ever be employed in a
processing pipeline where it makes sense to use that. There is no real
world problem that has knocked on the door.
Post by Osamu TAKEUCHI
On the other hand, if we see much more useful use cases
by allowing users to store key-order-aware hash or
key-duplication-unaware objects in mapping nodes, I
would be convinced to discard that restriction as Zenaan
is trying to.
My goal has been trying to understand YAML on a deeper/ more precise
level, so thank you for entertaining my slow understanding. My
examples were put to help me understand your question, so I could
understand YAML better - I am not trying to discard a restriction of
YAML.

Perhaps the question is: should the 'default schema' "hash map
concept" be !omap or !map? As long as I know which the default schema
is, I personally don't have attachment either way.

In my initial tests of beginning to rewrite my little "learning YAML
with Java beans" project without tags, all strings are "cookies" and
all numbers are "values" - i.e. they have no identity, and if they
did, that would be a problem for me - I would be immediately asking
around for a library supporting "old/ original YAML" schema.

For me, where say a string has to have identity, that would be because
it represents a class name (Java bean name) - and of course, every
other appearance of that exact sequence of characters (e.g. "T e s t B
e a n" (without the spaces)), represents exactly the same entity/
thing/ field/ class/ bean! A sane hypothetical where the opposite
needs to be true, completely eludes me.
Post by Osamu TAKEUCHI
Post by Oren Ben-Kiki
One of the reasons we explicitly listed the goals,
_in order_, was to break ties when different goals
pushed us in different directions. Order the goals
differently, and you'll get a different spec. I think
you would end up with JSON if you ordered them in a
different way. Or even, god help us, XML ;-)
I strongly agree with this statement and seemingly
the order itself is nice, too. But as I commented on
the first item, the order can not be super strict.
Au contraire! I support a "super strict" order, so YAML design
decisions are consistent over the years. This has been the case - one
of the really nice things about YAML. Now we have "-layers-, onion
boy" (with apologies to Shrek), so I'm even happier. Design
consistency in YAML is truly awesome. And the order of design
priorities is very appealing to me personally. Did I mention I like
YAML?
Post by Osamu TAKEUCHI
We always have to optimize the balance of conflicting
ways to the goals.
Of course, when a convincing argument can be put that a chosen order
of priorities, or to pick a random example, rules applying in a
particular logical layer of the system, ought be changed, then that
makes for a great discussion.
Post by Osamu TAKEUCHI
Another point is, I feel that most of the YAML documents
in your mind are those that can be stored in JSON, but
those in my mind are not always. YAML documents
full of custom tags are not easily stored in JSON.
I disagree. Every tag can be transformed into a two element list,
where the first element is the tag, and the second element is the node
content that was tagged. Or instead of a two element list, think a two
element map, e.g.:
-
  tag: omap
  content:
    blah
    blah
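
The tag/content rewrite described above can be sketched in Python as follows. This is a minimal illustration, not part of any YAML library: nodes are modelled as hypothetical (tag, content) tuples, mapping keys are assumed to be plain strings (JSON allows nothing else), and anchors, aliases and node identity are not represented at all.

```python
import json

def to_plain(node):
    """Rewrite a (tag, content) node into JSON-compatible data."""
    tag, content = node
    if isinstance(content, list):
        content = [to_plain(child) for child in content]
    elif isinstance(content, dict):
        content = {key: to_plain(child) for key, child in content.items()}
    return {"tag": tag, "content": content}

def from_plain(data):
    """Inverse of to_plain: rebuild the (tag, content) tuples."""
    content = data["content"]
    if isinstance(content, list):
        content = [from_plain(child) for child in content]
    elif isinstance(content, dict):
        content = {key: from_plain(child) for key, child in content.items()}
    return (data["tag"], content)

# A small tagged tree using tag names that appear in this thread.
doc = ("!!seq", [
    ("!MyClass", {"a": ("!!bool", "true"), "b": ("!!str", "true")}),
    ("!Complex", "3-2i"),
])

encoded = json.dumps(to_plain(doc))
decoded = from_plain(json.loads(encoded))
print(decoded == doc)
```

The round trip keeps every tag, but the JSON text is noticeably more verbose than the tagged YAML, and two nodes that were one aliased node in YAML become two independent copies.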
Post by Osamu TAKEUCHI
If we give up YAML, the next choice will indeed be XML.
Nope. The next choice will be YAML 2. Then the next choice will be
JSON 1. Then the next choice will be JSON 2. Then the next choice will
be native serialization in your language of choice. Then custom binary
serialization. Then a continually permutating algorithmic mixing
stream, just for laughs. There are infinite alternatives to XML and if
worse comes to worst, I suggest hard transcoding your data into COBOL
statements and serializing those in Base63.

Nowhere would one willingly choose XML.

Hell, I'd choose HTML if I was ordered to use XML on a project, just
to make sure the next guy who touches that code knows that the
serialization format has to be changed.

Manager: "Z, did you finish that, what's it called again? Yeah, the
XML serialization?"

Me: "Sure did! And it's Netscape 3.1 compatible too." <snigger>HTML<snigger>

Manager: <used to work at Best Buy, does not understand>"Oh, cool!
That's just great! I knew I'd finish on time, and I'll tell marketing
our new name too, 'NewScope compatible' - has a great ring to it.
Great ring! Good job Z, you'll go places you know. Go places in this
world!"
Post by Osamu TAKEUCHI
Actually, my library is built to replace the XML serializer
that is provided by C#. Since we want YAML to be able
to serialize both kinds of documents, we have to find
the best balance.
Should be easy. Anything's more enjoyable than XML... whoops, there I
go again...
:)
Osamu TAKEUCHI
2016-03-11 02:06:14 UTC
Permalink
Zenaan,

Thank you for your comment.
It made me realize that I should have shown how I use custom
tags and custom tag resolutions in my apps.

Indeed, the C# library I developed uses the class
definitions as the custom schema.

Namely, if a YAML document

!MyClass { a: true, b: true }

is given to the library and MyClass is defined as

class MyClass
{
    public bool a { get; set; }    // resolved as a boolean
    public string b { get; set; }  // resolved as a string
}

the library reads the first true as a boolean and the second
true as a string, and constructs a MyClass object from the
document correctly. It looks up the class definitions
using C#'s reflection facilities.

So, even a document

!<!MyClass[]>
- { a: true, b: true }
- { a: false, b: truf }
- { a: true, b: trug }

can be processed correctly without needing !MyClass
tags explicitly on all the items of the sequence
because the library can see !MyClass[] is an array
of !MyClass. (The surrounding !< > is only needed to
escape the special characters '[' and ']' in the
tag name.)

Moreover, if you let the library know the
expected tag for the root node, the only tag in
the document can also be omitted:

- { a: true, b: true }
- { a: false, b: truf }
- { a: true, b: trug }

This document can be read safely and correctly as
an array of MyClass with the given _schema_. In this
sense, my YAML library is fully _schema-aware_.
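
The kind of schema-driven resolution described above can be sketched in Python instead of C#. This is only an illustration under stated assumptions: it is not the library's actual code, the helper names (resolve_scalar, construct, construct_list) are hypothetical, and the parsed document is represented as plain dicts of unresolved scalar strings.

```python
from dataclasses import dataclass
from typing import get_type_hints

@dataclass
class MyClass:
    a: bool
    b: str

def resolve_scalar(text, expected):
    """Interpret an untagged plain scalar according to the expected type."""
    if expected is bool:
        if text not in ("true", "false"):
            raise ValueError(f"not a boolean: {text!r}")
        return text == "true"
    if expected is str:
        return text
    raise TypeError(f"unsupported type: {expected!r}")

def construct(cls, mapping):
    """Build a cls instance, resolving each value by the declared field type."""
    hints = get_type_hints(cls)  # reflection over the class definition
    return cls(**{key: resolve_scalar(value, hints[key])
                  for key, value in mapping.items()})

def construct_list(cls, sequence):
    """The expected root type says every item is a cls, so no tags are needed."""
    return [construct(cls, item) for item in sequence]

# The untagged document from above, as unresolved scalar strings.
raw = [
    {"a": "true", "b": "true"},
    {"a": "false", "b": "truf"},
    {"a": "true", "b": "trug"},
]
items = construct_list(MyClass, raw)
print(items[0])
```

The same scalar text "true" ends up as a boolean under key a and as a string under key b, purely because of the class definition. A tool without that definition has no way to see the difference.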


This type of tag resolution is allowed in the
current YAML spec. With this library, users can
handle YAML documents with their own custom schema
very easily.

In particular, it works nicely to serialize/
deserialize a complex object graph that consists of
a variety of native/custom classes into a YAML document.
YAML's tag system fits the purpose perfectly.

I admit the document is not readable for
schema-blind people or tools. They will not see that
the properties a and b are of different type. But
it is not evil. When one wants to read or edit any
YAML document, they must be familiar with the document.
They must know the purpose of the document and types
of all nodes referring to the exact schema with which
the document is written.

This is how the schema works in my apps.

Do you think my library is exceptional? Such a library
can be built in any programming language that supports
reflection. If such libraries are provided
for Ruby, JavaScript, Scala, etc., YAML will be
more popular as a serialization language.

With such use cases in my mind, I give comments below.
Post by Zenaan Harkness
Post by Osamu TAKEUCHI
Post by Oren Ben-Kiki
2. YAML data is portable between programming languages.
Different applications can apply a different schema to
the same document. If these schemas differ on assigning
identity then you cause all sort of sticky issues.
What do you mean by "can"? Could you give us any example
where different applications _can_ apply a different
schema to the same document?
Such as application/ layer 4 "application" with application specific
schema, and schema-blind middleware, and in this case, "If these
schemas differ on assigning identity then you cause all sort of sticky
issues."
Post by Osamu TAKEUCHI
It sounds really strange for me.
Exactly because it is likely to cause problem.
As you know, schema-blind middleware _must not_ apply
a wrong schema to a document. If it does, that is a
serious failure in itself.
Post by Zenaan Harkness
Either
- use default YAML schema (previous emails) and then a "schema blind"
"YAML middleware" processor might make sense, but who knows - this is
up to developers of YAML middleware and or 'end user' applications,
- or use application specific schema, and you are very unlikely to be
able to mix in "schema blind" YAML middleware.
I think we all agree on these basics.
I partially agree. A middleware that can only work with
the default schema is almost useless for handling documents
with custom tags and custom tag resolutions regardless of
the identity issue.

But if we have a _schema-aware middleware_ and the end
user provides a correct schema to the middleware, everything
works fine.

Or, alternatively, schema-blind middleware can simply
provide an API that lets end users directly access
the "partial representation graph of the document". (This
term is defined in the YAML spec.) With such APIs, end
users or applications can manipulate the document.
Although it will not be very easy for users, it is still
possible. My library also provides such APIs to let users
manipulate a YAML document that does not correspond to any
known class objects.
Post by Zenaan Harkness
Post by Osamu TAKEUCHI
IMO, any document can not stand without a valid schema.
If we say "schema less document" is actually YAML document relying on
"default schema" (with scalars working as users expect them etc), then
I agree with you.
Yes, that is exactly what I meant.
Post by Zenaan Harkness
I like the "default schema" where "schema less" YAML documents work as
I expect them to. I use YAML mainly with human (me) generated YAML
files, and if the default schema were to change to require "scalars
have identity" then the first thing I would require of any YAML tool I
use, is that it support the "old style" schema.
Makes sense.
Post by Zenaan Harkness
Post by Osamu TAKEUCHI
A schema-blind application must not apply any
possibly-wrong schema to any document. Namely, it can
I think part of the problem is that we sort of lack a real world
example of a practical for regular use "schema blind" tool. I think
the examples of scalars needing identity have so far been a bit
contrived - if '!not_float 3.14159' must have separate identity / be
not equal to '!not_float 3.14159', that to my eyes looks like a
decidedly application specific schema. I am still struggling to
comprehend how that could be useful, even though I can certainly
accept that it's possible it might be useful in some situation to
someone.
Actually, I cannot think of any practical example of regular
use of a schema-blind tool on documents with a custom schema.

If the spec allows, I can serialize a class object that is
provided with a builder function accepting a single string
argument into a scalar node, as the users of my library expect.
Actually, in C#, many classes are provided with a so-called
TypeConverter that converts native class objects
from/to strings. Then, the YAML documents look something like:

- !Complex 3-2i
- !ElementSize 14em,300pt
- !MyClass magical spell to generate the class object

I do not say my users always put importance on the identities
of such objects. But I cannot say they never will. It is
safer to keep the identity of the scalar nodes in this use
case.
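
Why identity can matter for such objects is easiest to see with a mutable class. The following Python sketch is hypothetical: the ElementSize class is invented for illustration, its field names are assumptions, and the "nodes" are simulated by ordinary variables rather than produced by a YAML loader.

```python
class ElementSize:
    """Hypothetical mutable class built from a string such as '14em,300pt'."""

    def __init__(self, text):
        self.width, self.height = text.split(",")

    def __eq__(self, other):
        return (isinstance(other, ElementSize)
                and (self.width, self.height) == (other.width, other.height))

# Two separate scalar nodes with equal tag and content: two objects.
first = ElementSize("14em,300pt")
second = ElementSize("14em,300pt")

# An anchored node plus an alias to it: one node, hence one shared object.
anchored = ElementSize("14em,300pt")
alias = anchored

second.width = "20em"  # changes only the second object
alias.width = "20em"   # also changes the anchored object

print(first.width, anchored.width)
```

If a tool merged the first two nodes into one aliased node, or split the aliased pair into two nodes, the later mutations would behave differently even though the scalar text never changed.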

I agree I can just forbid my users from storing objects with
identities in this manner. But I think it helps nobody in
reality. It just makes the YAML document formally valid.
Post by Zenaan Harkness
Post by Osamu TAKEUCHI
merely build partial representation graph and can not do
any more. If it go any further, it will very easily break
documents even without the identity issue, as partially
shown by my previous example with !!float tags.
If an application requires such behaviour ("schema"), then it, of
course, could not rely upon the default schema. This sounds sensible,
yes?
Exactly.
Post by Zenaan Harkness
Post by Osamu TAKEUCHI
Post by Oren Ben-Kiki
3. YAML matches the native data structures
<http://www.yaml.org/spec/1.2/spec.html#native data structure//>
of agile languages.
I agree that YAML nodes with standard tags match the
native data structures of agile languages pretty well.
But what about nodes with custom tags? How do they
match the native data structures, and how can we make
use of the matching? Could you show us some example use cases?
How is that relevant to the discussion?
Some languages will have layer 3/4 YAML constructors to support
"arbitrary" custom tags/ schemas - other languages might be a bit
limited somehow, although if such languages exist, then I think that
language would be not very popular - we have relatively high "minimum
standards" these days... I don't think BrainFuck is going to be at the
top of the list for YAML implementations, although apparently BF -is-
Turing complete...
I hope I have shown a nice use case for custom tags:
storing _user-defined_ class objects in YAML. They are
native objects but not _native_ data structures,
are they? Perhaps only in JavaScript they may be.
Post by Zenaan Harkness
Post by Osamu TAKEUCHI
Post by Oren Ben-Kiki
Which tend to make scalars immutable. Even Ruby is
starting to see the error of its ways here so it is
moving to make at least some strings be immutable
(literals, "frozen", etc.).
I repeat, I do not want to keep identities of !!str
nodes. If you are confident that the data indeed do not
need identity preservation, you can discard identities.
My question is what we will gain by being allowed to
discard identities of nodes _with possibly unknown tags_.
I am not a YAML library implementer/ programmer. That said, I think
your question is too theoretical - if an application requires that any
YAML middleware preserve scalar identity, then that application is
naturally going to have to be pretty specific about which "YAML
middleware" is allowed to be used in its processing chains.
Regardless of the identity issue, you need schema-aware
middleware to manipulate a document with tons of custom
tags and custom tag resolutions. If you call schema-aware
middleware "pretty specific", I agree with you.
But anyway, I do not think such a document is evil for
that reason alone.
Post by Zenaan Harkness
In a decade or more, I have not seen the doors of this mailing list
being broken down with questions such as "why is my YAML middleware
not correctly processing my YAML communication channel between my YAML
endpoints?"
(Hint: we've never, ever, seen such a question.)
Even a well-grounded "real world hypothetical" would move the
discussion onwards at this point. But we don't even have a "real world
hypothetical", let alone a real world problem.
Do you say I have invented a _new way_ of using custom tags?
I think the YAML spec writers intentionally designed YAML tag
system to allow such usage. Anyway, my library is already in
the real world and working pretty well. I will not be sad if
no schema-blind YAML tool can handle the YAML
documents that my library generates. If the users of my
application can read and write the generated documents with their
own eyes and hands, that is ok for me. IMO, the document is a nicely
written YAML document except for the identity issue.
Post by Zenaan Harkness
Post by Osamu TAKEUCHI
Post by Oren Ben-Kiki
4. YAML has a consistent model to support generic tools.
that you
I guess the question is, how strong is that consistent
model? What does it _allow_ generic tools to do? The
rules about scalar identity give generic tools the
ability to do more than they could do if scalar identity
had to be preserved. Note that a human with a text
editor can also be a generic, schema-blind tool.
Ok, this seems a good example. So, what is additionally
allowed to the guy with a text editor by being allowed not
to preserve scalar identity?
Common sense and "default expectation" - Linus Torvalds has a similar
saying about kernel to user land interfaces, where POSIX has
occasionally been allowed to be violated by linux, or a bug has been
solved in one out of a number of ways, because of historical precedent
and or user expectation.
Scalars having identity breaks the expectations of the man with the
editor. And if my editing of a particular YAML document does depend on
scalar identity, I expect that that would require awareness of the
schema for this document (which presumably I have, since I am manually
editing the document).
This is actually an example which favours the default expectation of
users, which is, that scalar identity is not preserved, strings can be
interned, etc.
If a particular YAML "middleware" that I, the man with the editor, am
using, let's say a YAML pretty printer, unexpectedly preserves scalar
identity, then I'll tell the dang pretty printer where to go.
If I am programming some transactional multi layered software, and
some middleware layer needs to work in a particular way (preserving
identity, or not preserving identity), I'm simply going to make sure the
YAML tool I use supports the mode of operation that I require.
I think you might be trying to solve a non existent problem?
Besides the identity issue, a YAML document with custom tags
and custom tag resolutions is already surprising to those
who have limited knowledge of the schema. Identity preservation
of scalar nodes does not make it very much worse.

And I repeat, even if you are mistaken about the need for identity
preservation, it will not cause big trouble in reading/editing a YAML
document with a text editor unless you intentionally alias or unalias
the nodes. If you know the meaning of the document well, you will
never do it. If you do not know the meaning of the document well,
you will not want to do it either, hopefully.
Post by Zenaan Harkness
Post by Osamu TAKEUCHI
Post by Oren Ben-Kiki
6. YAML is expressive and extensible.
You could argue that not preserving scalar identity
requires a more cumbersome expression of some native
data (e.g. wrapping a scalar in a collection "just
because" you want to ensure its identity is preserved -
similarly to having to use an !!omap instead of the
cleaner map syntax for PHP dictionaries).
"Everything" still _can_ be expressed, though.
This sounds intuitive to me.
Post by Osamu TAKEUCHI
If the benefit is larger than the labor, I will accept
the restriction.
For starters, your !!not_float example seems artificial to me - like
trying to solve a non existing problem. Changing an "intuitive to
humans" aspect of the current YAML spec (/ default schema), in order
to solve a non real world problem, would be a step backwards for YAML.
I hope the example above explains the problem.
Post by Zenaan Harkness
Post by Osamu TAKEUCHI
For !!omap issue, I am about neutral.
I see it can prevent people unintentionally breaking a
YAML document by swapping the key order. I judge that this
happens much more easily than someone aliasing or unaliasing
unexpected nodes.
Map and OrderedMap are well defined (mathematically) concepts. YAML
supports both, map by default schema, any other type of map by
alternate schema/ tags. Any middleware will only ever be employed in a
processing pipeline where it makes sense to use that. There is no real
world problem that has knocked on the door.
I'm afraid I do not catch your point correctly.
Do you say people never accidentally swap key orders of
!!omap if they are familiar with the meaning of the document?
I almost agree with you then. In that sense I am about neutral.
Post by Zenaan Harkness
Post by Osamu TAKEUCHI
On the other hand, if we see much more useful use cases
by allowing users to store key-order-aware hash or
key-duplication-unaware objects in mapping nodes, I
would be convinced to discard that restriction as Zenaan
is trying to.
My goal has been trying to understand YAML on a deeper/ more precise
level, so thank you for entertaining my slow understanding. My
examples were put to help me understand your question, so I could
understand YAML better - I am not trying to discard a restriction of
YAML.
Perhaps the question is: should the 'default schema' "hash map
concept" be !omap or !map? As long as I know which the default schema
is, I personally don't have attachment either way.
In my initial tests of beginning to rewrite my little "learning YAML
with Java beans" project without tags, all strings are "cookies" and
all numbers are "values" - i.e. they have no identity, and if they
did, that would be a problem for me - I would be immediately asking
around for a library supporting "old/ original YAML" schema.
For me, where say a string has to have identity, that would be because
it represents a class name (Java bean name) - and of course, every
other appearance of that exact sequence of characters (e.g. "T e s t B
e a n" (without the spaces)), represents exactly the same entity/
thing/ field/ class/ bean! A sane hypothetical where the opposite
needs to be true, completely eludes me.
Preservation of identity of nodes with custom tags is not
likely to hurt you in reality. As I wrote, I'm not saying to
preserve identity of nodes with standard tags. Unless it is
required by tags, the middleware will not always preserve
identities. I only say "let the tags determine discarding or
preserving identities." Unless the middleware has good
support for the custom tags, it is pointless to discuss such
things because we have no real way to do it anyway. Without
good middleware, we will not be able to handle such documents
regardless of the identity issue. So, it only matters when a
middleware provides an API for a custom schema to require
identity preservation for some tags. If this does not happen
around you, the change to the spec will have
nothing to do with you.

The implementer of a YAML library has to think of identity
preservation only when it can accept custom schema.

Technically, it is easy to preserve the identity of scalar
nodes in their representation graph unless the library
intentionally discards the identity. Note that the scalar
nodes are not represented by native scalars in the graph.
They are almost always represented by some class objects
that naturally preserve their own identities.
Post by Zenaan Harkness
Post by Osamu TAKEUCHI
Post by Oren Ben-Kiki
One of the reasons we explicitly listed the goals,
_in order_, was to break ties when different goals
pushed us in different directions. Order the goals
differently, and you'll get a different spec. I think
you would end up with JSON if you ordered them in a
different way. Or even, god help us, XML ;-)
I strongly agree with this statement and seemingly
the order itself is nice, too. But as I commented on
the first item, the order can not be super strict.
Au contraire! I support a "super strict" order, so YAML design
decisions are consistent over the years. This has been the case - one
of the really nice things about YAML. Now we have "-layers-, onion
boy" (with apologies to Shrek), so I'm even happier. Design
consistency in YAML is truly awesome. And the order of design
priorities is very appealing to me personally. Did I mention I like
YAML?
Ok then, drop custom tags and custom tag resolutions
from the spec completely. Use only standard tags.
Use only standard schema. This will make YAML documents
more readable. Does it reduce the functionality of YAML?
Don't care. Functionality is given a lower priority.
If readability matters, forget about anything else.

If you say GO SUPER STRICT!, nobody can oppose this
decision.

I'm afraid you may indeed decide restricting the use
of custom tags and custom tag resolution. I understand
it can be one way to go, but...
Post by Zenaan Harkness
Post by Osamu TAKEUCHI
We always have to optimize the balance of conflicting
ways to the goals.
Of course, when a convincing argument can be put that a chosen order
of priorities, or to pick a random example, rules applying in a
particular logical layer of the system, ought be changed, then that
makes for a great discussion.
Post by Osamu TAKEUCHI
Another point is, I feel that most of the YAML documents
in your mind are those that can be stored in JSON, but
those in my mind are not always. YAML documents
full of custom tags are not easily stored in JSON.
I disagree. Every tag can be transformed into a two element list,
where the first element is the tag, and the second element is the node
content that was tagged. Or instead of a two element list, think a two
-
tag: omap
blah
blah
It may depend on personal standards.
I do not see this as easy.
Post by Zenaan Harkness
Post by Osamu TAKEUCHI
If we give up YAML, the next choice will indeed be XML.
Nope. The next choice will be YAML 2. Then the next choice will be
JSON 1. Then the next choice will be JSON 2. Then the next choice will
be native serialization in your language of choice. Then custom binary
serialization. Then a continually permutating algorithmic mixing
stream, just for laughs. There are infinite alternatives to XML and if
worse comes to worst, I suggest hard transcoding your data into COBOL
statements and serializing those in Base63.
Nowhere would one willingly choose XML.
Hell, I'd choose HTML if I was ordered to use XML on a project, just
to make sure the next guy who touches that code knows that the
serialization format has to be changed.
Manager: "Z, did you finish that, what's it called again? Yeah, the
XML serialization?"
Me: "Sure did! And it's Netscape 3.1 compatible too." <snigger>HTML<snigger>
Manager: <used to work at Best Buy, does not understand>"Oh, cool!
That's just great! I knew I'd finish on time, and I'll tell marketing
our new name too, 'NewScope compatible' - has a great ring to it.
Great ring! Good job Z, you'll go places you know. Go places in this
world!"
I agree. The next choice for me will indeed be
_my own language_ that differs from YAML only at some
small differences: equality, identity and what else? ;)


I hope I have convinced you guys that at least some of
the benefits of allowing non-preservation of scalars with
custom tags are largely imaginary. Letting tags
choose preservation of node identity will hurt you very
little, if at all. The same goes for letting tags choose the
way of evaluating equality. Especially if you have no middleware that
understands custom schemas around you, they have almost
nothing to do with you anyway.


Best,
Osamu
Zenaan Harkness
2016-03-11 05:20:53 UTC
Permalink
Post by Osamu TAKEUCHI
Zenaan,
Thank you for your comment.
It let me know that I should have shown how I use custom
tags and custom tag resolutions in my apps.
Indeed, the C# library I developed uses the class
definitions as the custom schema.
Namely, if a YAML document
!MyClass { a: true, b: true }
is given to the library and MyClass is defined as
class MyClass
{
    public bool a { get; set; }
    public string b { get; set; }
}
...
Post by Osamu TAKEUCHI
This is how the schema works in my apps.
Do you think my library is exceptional?
Sounds like a normal layer 4 library to me. Same as others. I've only ever
used Java YAML, but all 3 worked similarly.
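
To make the "class definition as schema" idea concrete, here is a minimal Python sketch. It is not Osamu's C# library; the helper names `resolve_scalar` and `construct` are hypothetical, and the sketch assumes the behaviour implied by the example, namely that the declared field type decides how the plain scalar `true` is resolved.

```python
from dataclasses import dataclass, fields


@dataclass
class MyClass:
    a: bool
    b: str


def resolve_scalar(text, target_type):
    """Resolve raw scalar text using the declared field type (hypothetical helper)."""
    if target_type is bool:
        if text in ("true", "True", "TRUE"):
            return True
        if text in ("false", "False", "FALSE"):
            return False
        raise ValueError(f"not a boolean: {text!r}")
    if target_type is str:
        return text  # the same text stays a string when the field is a string
    raise TypeError(f"unsupported field type: {target_type!r}")


def construct(cls, raw_mapping):
    """Build an instance of cls from a mapping of field name to raw scalar text."""
    kwargs = {f.name: resolve_scalar(raw_mapping[f.name], f.type) for f in fields(cls)}
    return cls(**kwargs)


# Stand-in for the parsed content of: !MyClass { a: true, b: true }
obj = construct(MyClass, {"a": "true", "b": "true"})
```

Under these assumptions `obj.a` is the boolean `True` while `obj.b` is the string `"true"`, which is why a schema-blind tool cannot know the resolved tag of either scalar.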

...
Post by Osamu TAKEUCHI
Post by Zenaan Harkness
If a particular YAML "middleware" that I, the man with the editor, am
using, let's say a YAML pretty printer, unexpectedly preserves scalar
identity, then I'll tell the dang pretty printer where to go.
If I am programming some transactional multi layered software, and
some middleware layer needs to work in a particular way (preserving
identity, or not preserving identity), I'm simply going to make sure the
YAML tool I use supports the mode of operation that I require.
I think you might be trying to solve a non existent problem?
Besides the identity issue, a YAML document with custom tags
and custom tag resolutions is already surprising to those
who have limited knowledge of the schema. Identity preservation
of scalar nodes does not make it very much worse.
It's just too theoretical. A "surprised YAML document editor" better
check out the schema for the document s/he is editing!
Post by Osamu TAKEUCHI
And I repeat, even if you misjudge the need for identity preservation,
It's about "default presumption" - if two nodes, which look the same,
are spelled the same, have the same capitalization and the same
spacing, have different identity without any special schema, that
sounds not intuitive. It's not about the need, it's about
expectations. That's why it is so good that the order of design
priorities for YAML has been set, along with clear definition of
layers of processing. "What is likely user expectation at -this-
layer?" for example.
Post by Osamu TAKEUCHI
Post by Zenaan Harkness
Post by Osamu TAKEUCHI
Post by Oren Ben-Kiki
6. YAML is expressive and extensible.
You could argue that not preserving scalar identity
requires a more cumbersome expression of some native
data (e.g. wrapping a scalar in a collection "just
because" you want to ensure its identity is preserved -
similarly to having to use an !!omap instead of the
cleaner map syntax for PHP dictionaries).
"Everything" still _can_ be expressed, though.
This sounds intuitive to me.
Post by Osamu TAKEUCHI
If the benefit is larger than the labor, I will accept
the restriction.
For starters, your !!not_float example seems artificial to me - like
trying to solve a non existing problem. Changing an "intuitive to
humans" aspect of the current YAML spec (/ default schema), in order
to solve a non real world problem, would be a step backwards for YAML.
I hope the example above explains the problem.
I still don't see a problem. You use tags, at least for the document
root, which means your users are specifying a schema - that's just
fine. YAML supports your usage AFAICT.

Sorry, I really have been known to be slow, but I'm not seeing your
usage of YAML violating the specification. Use any "schema" tags that
are suitable for your application - seems you are doing this
successfully.
Post by Osamu TAKEUCHI
Post by Zenaan Harkness
Post by Osamu TAKEUCHI
For !!omap issue, I am about neutral.
I see it can prevent people unintentionally breaking a
YAML document by swapping the key order. I judge that this
happens much more easily than someone aliasing or unaliasing
unexpected nodes.
Map and OrderedMap are well defined (mathematically) concepts. YAML
supports both, map by default schema, any other type of map by
alternate schema/ tags. Any middleware will only ever be employed in a
processing pipeline where it makes sense to use that. There is no real
world problem that has knocked on the door.
I'm afraid I do not catch your point correctly.
Do you say people never accidentally swap key orders of
!!omap if they are familiar with the meaning of the document?
I almost agree with you then. In that sense I am about neutral.
I remember using the wrong Java map class when "deserializing" a YAML
document, and realising I needed an ordered map type. The "schema" was
implicit/ in my mind (since I wrote the YAML file by hand), so I
changed the type and bingo, all was fine. There is no problem I see
with the YAML spec - once I realised my misunderstanding I said to
myself "oh, that makes sense" and the solution to my need was
immediately self evident (now my Java bean elements (fields, methods
etc) are laid out in the same sequence in the bean/ class file, as in
the YAML files I type up - it wasn't even a functional issue, just a
'visual expectation' issue).

...
Post by Osamu TAKEUCHI
Post by Zenaan Harkness
Post by Osamu TAKEUCHI
Post by Oren Ben-Kiki
One of the reasons we explicitly listed the goals,
_in order_, was to break ties when different goals
pushed us in different directions. Order the goals
differently, and you'll get a different spec. I think
you would end up with JSON if you ordered them in a
different way. Or even, god help us, XML ;-)
I strongly agree with this statement and seemingly
the order itself is nice, too. But as I commented on
the first item, the order can not be super strict.
Au contraire! I support a "super strict" order, so YAML design
decisions are consistent over the years. This has been the case - one
of the really nice things about YAML. Now we have "-layers-, onion
boy" (with apologies to Shrek), so I'm even happier. Design
consistency in YAML is truly awesome. And the order of design
priorities is very appealing to me personally. Did I mention I like
YAML?
Ok then, drop custom tags and custom tag resolutions
from the spec completely. Use only standard tags.
Use only standard schema. This will make YAML documents
more readable. Does it reduce the functionality of YAML?
Don't care. Functionality is given a lower priority.
If readability matters, forget about anything else.
If you say GO SUPER STRICT!, nobody can oppose this
decision.
No, that's not what I say - it's about "choose priorities, only
override them if a convincing argument is put" - I thought I said that
somewhere.
Post by Osamu TAKEUCHI
I'm afraid you may indeed decide restricting the use
of custom tags and custom tag resolution. I understand
it can be one way to go, but...
Of course not. As I also said, it's currently about the only way I use
YAML (when I have time to tinker with programming), although my most
recent experiment has been how to do what I wanted to do, without
tags.

Layers. The answer is clear. Schema-dependent YAML end points, cannot
use a schema blind YAML middleware - this is not a problem anyone has
ever said they have, and also is not a problem at all that I can see
in my mind. I grant I may not see things very clearly though...
Post by Osamu TAKEUCHI
Post by Zenaan Harkness
Post by Osamu TAKEUCHI
Another point is, I feel that most of the YAML documents
in your mind are those that can be stored in JSON, but
those in my mind are not always. YAML documents
full of custom tags are not easily stored in JSON.
I disagree. Every tag can be transformed into a two element list,
where the first element is the tag, and the second element is the node
content that was tagged. Or instead of a two element list, think a two
-
tag: omap
blah
blah
It may depend on personal standards.
I do not see this as easy.
If your library supports lists and maps, but not tags, then this is
your option. I'm imagining a limited language and someone trying to
implement a YAML library under severe restrictions. "It is easy"
referred to how to get around a language limitation, and that YAML
syntax supports more than one work around, trivially. Frankly, a map,
is simpler than a map and tags. It just means that construction must
be completed by the end user application - but the whole point of this
example was to demonstrate a simple workaround for a theoretical
problem (how to represent tags in JSON), so I think I am way off topic
at this point.
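
The workaround described above can be sketched in a few lines of Python. This is only an illustration: the `Tagged` wrapper and the `"tag"` / `"content"` key names are assumptions chosen for the example, not part of any YAML or JSON library.

```python
import json
from dataclasses import dataclass
from typing import Any


@dataclass
class Tagged:
    """Hypothetical in-memory stand-in for a YAML node that carries a tag."""
    tag: str
    content: Any


def to_json_ready(node):
    """Recursively replace every tagged node with a two-entry map."""
    if isinstance(node, Tagged):
        return {"tag": node.tag, "content": to_json_ready(node.content)}
    if isinstance(node, dict):
        return {key: to_json_ready(value) for key, value in node.items()}
    if isinstance(node, list):
        return [to_json_ready(item) for item in node]
    return node  # plain scalars pass through unchanged


doc = Tagged("omap", [{"first": 1}, {"second": 2}])
encoded = json.dumps(to_json_ready(doc))
```

As noted in the reply, construction then has to be completed by the end user application: a JSON reader sees only an ordinary map, and it cannot tell this wrapper apart from a genuine map that happens to have `tag` and `content` keys.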

...
Post by Osamu TAKEUCHI
I hope I have convinced you guys that at least some of
the benefits of allowing non-preservation of scalars with
custom tags are largely imaginary. Letting tags
choose preservation of node identity will hurt you very
little, if at all. The same goes for letting tags choose the
way of evaluating equality. Especially if you have no middleware that
understands custom schemas around you, they have almost
nothing to do with you anyway.
This is a bit beyond my ability to comment properly sorry - back over
to you library implementer people :)

Regards,
Zenaan

Zenaan Harkness
2016-03-07 09:45:10 UTC
Permalink
Post by Oren Ben-Kiki
Post by Osamu TAKEUCHI
If the layered structure of the YAML processor
does not allow it, the layered structure itself
should be revised. I don't see how the layered
structure is related to the current topic,
though.
It is crucial to the discussion. We all agree that key duplication
detection _must_ be done at the application layer, but the point is that a
_limited_ form of key duplication detection _may_ and _should_ be done,
especially in YAML processors that do not even _have_ an application layer.
Where the YAML processor (perhaps some streaming processor?), -is- the
application.
Post by Oren Ben-Kiki
This is because, as you put it, "it makes the serialized documents much
more readable or portable".
I guess I don't understand Osamu's "it makes the serialized documents
much more readable or portable". As long as the spec does not prevent a
streaming processor from either flagging an error or not, depending
on what the application (user) wants, that's fine AFAICT.
Post by Oren Ben-Kiki
Post by Osamu TAKEUCHI
Similarly, I do not want to forbid PHP users
to store a PHP's native key-order-aware hash
into a key-order-unaware YAML mapping
The problem is, how can you tell whether this is/not safe to do? When
dumping such a hash table to YAML, the application needs to provide some
hint to the YAML processor whether this is actually safe. By default, it is
_not_ safe, so without an explicit hint, the YAML processor _should_ do the
safe thing and emit it as an !!omap.
ACK!

I imagine the problem would arise if an application requires key
ordering, uses one library for saving, then loads with a library that
does not preserve key ordering. Or two applications, and one saves
without specifying that key order is required, where it is a different
application doing the loading.
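
The "safe by default" rule Oren describes can be sketched as a tiny dumper. This is a minimal Python illustration, not any real library's API: the function name `dump_mapping` and the `order_matters` hint are hypothetical, and keys and values are assumed to be plain scalars that need no quoting.

```python
def dump_mapping(pairs, order_matters=True):
    """Emit (key, value) pairs as block-style YAML text.

    Without an explicit hint that order is irrelevant, the pairs are written
    as an !!omap (a sequence of single-pair mappings) so that a loader which
    does not keep key order cannot silently lose information.
    """
    if order_matters:
        lines = ["!!omap"] + [f"- {key}: {value}" for key, value in pairs]
    else:
        lines = [f"{key}: {value}" for key, value in pairs]
    return "\n".join(lines) + "\n"


ordered_text = dump_mapping([("b", 2), ("a", 1)])
plain_text = dump_mapping([("b", 2), ("a", 1)], order_matters=False)
```

The design choice is that the application has to opt out of ordering, so the two-library and two-application scenarios above fail loudly (an unsupported `!!omap`) rather than quietly reordering keys.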

...
Post by Oren Ben-Kiki
Really? People would be very surprised to hear that { a: 1, a: 2 } is
actually OK because some application somewhere _may_ decide it wants scalar
string keys to use "identity-based equality".
So *No*.
Ack.
Post by Oren Ben-Kiki
So we _can_ complain about them being equal at an earlier processing stage.
We do not _require_ a YAML processor to do so, but we _allow_ and
_encourage_ it to do so.
Sounds fine. As long as it's not required, a processor can be
configured to complain or not.
Ingy dot Net
2016-03-07 15:27:20 UTC
Permalink
Oren, Osamu,

My concern with this thread is that you seem to be limiting your meaning of
"YAML processors" to Loader/Dumper processors (processors that carry data
from text to native programming in-memory objects, and back), i.e. where the
"Application" level uses native objects.

While that is certainly the normal case, YAML conjecture should also be
weighed against processors that stop after the parser (at the event
stream), i.e. where the application is a streaming filter/mapper and no graph
construction or native objects ever happen.

YAML conjecture should also be weighed against non-terminating (or
extremely huge) YAML nodes (documents, mappings, sequences).

Oren, if you are suggesting that {a: 1, a: 2} *must* be detected at the
parser level, then I'd have to disagree from the streaming standpoint,
since the 2 keys may be lightyears away from each other.

FWIW, libyaml doesn't reason on keys and shouldn't. libyaml makes a great
streaming parser/emitter.

Key order can't really even be discussed at this level. Key order *can't* be
changed by a parser, so of course you can make it meaningful if you decide
to. You just have to be aware that your meaningfulness expires should your
YAML be used in a graphing context.
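
To illustrate the streaming case, here is a minimal Python sketch of a filter that works purely on an event stream. The event names (`map_start`, `scalar`, and so on) are made up for this example and are not libyaml's API. The filter keeps one small frame per open collection, never builds a graph, and therefore has no way to notice that two keys of the same mapping are equal.

```python
def rename_keys(events, rename):
    """Yield events unchanged, except that scalar mapping keys found in
    `rename` are replaced. Memory use is proportional to nesting depth only."""
    stack = []  # one frame per open collection: [is_mapping, expecting_key]

    def node_done():
        # After a complete node inside a mapping, flip between key and value.
        if stack and stack[-1][0]:
            stack[-1][1] = not stack[-1][1]

    for kind, value in events:
        if kind == "scalar":
            is_key = bool(stack) and stack[-1][0] and stack[-1][1]
            if is_key:
                value = rename.get(value, value)
            node_done()
        elif kind in ("map_start", "seq_start"):
            stack.append([kind == "map_start", True])
        elif kind in ("map_end", "seq_end"):
            stack.pop()
            node_done()
        yield kind, value


# Event stream for { a: 1, a: 2 }
events = [
    ("map_start", None),
    ("scalar", "a"), ("scalar", "1"),
    ("scalar", "a"), ("scalar", "2"),
    ("map_end", None),
]
result = list(rename_keys(events, {"a": "b"}))
```

Both `a` keys are renamed and passed through. Detecting the duplicate would require remembering every key of every open mapping, which is exactly what a streaming processor handling huge or non-terminating nodes cannot afford.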
Post by Oren Ben-Kiki
Post by Osamu TAKEUCHI
Oren,
I expect we share the same thinking that the
definition of equality belongs to the domain
specific data type, not to the serialization
language.
Pretty much.
Post by Osamu TAKEUCHI
So, unless it makes the serialized
documents much more readable or portable, a
serialization language should not determine
its own equality or identity definition.
That's a big "unless".
Post by Osamu TAKEUCHI
I agree duplicate key should be detected by
YAML processors because we do not want users
to use duplicate keys for overwriting the
values of predefined keys.
There is no concept of overwriting in a streaming processor. You need a
graph to do that.
Post by Oren Ben-Kiki
Yes.
Post by Osamu TAKEUCHI
The key order in
a YAML mapping should not have meaning.
Very strong yes.
Depends on the layer.
Post by Oren Ben-Kiki
Post by Osamu TAKEUCHI
But it can be done without defining equality
in YAML spec. YAML processor can use native
equality evaluator of the data at its
construction stage and it should do so.
No. You assume there _is_ a construction stage. There need not be one.
Oren. I agree, although I think you mean that construction is skipped on
the way to native, where I'm saying that processing might never get to that
level.
Post by Oren Ben-Kiki
Post by Osamu TAKEUCHI
If the layered structure of the YAML processor
does not allow it, the layered structure itself
should be revised. I don't see how the layered
structure is related to the current topic,
though.
It is crucial to the discussion. We all agree that key duplication
detection _must_ be done at the application layer, but the point is that a
_limited_ form of key duplication detection _may_ and _should_ be done,
especially in YAML processors that do not even _have_ an application layer.
This is because, as you put it, "it makes the serialized documents much
more readable or portable".
If the application is not a graph then duplicate detection mustn't and likely can't
be done.
Post by Oren Ben-Kiki
Post by Osamu TAKEUCHI
Similarly, I do not want to forbid PHP users
to store a PHP's native key-order-aware hash
into a key-order-unaware YAML mapping
The problem is, how can you tell whether this is/not safe to do? When
dumping such a hash table to YAML, the application needs to provide some
hint to the YAML processor whether this is actually safe. By default, it is
_not_ safe, so without an explicit hint, the YAML processor _should_ do the
safe thing and emit it as an !!omap.
PHP is not the outlier. Most JavaScript implementations have predictable
key order. I use JS all the time to preserve key order when doing things
like converting JSON to (block formatted) YAML.
Post by Oren Ben-Kiki
Post by Osamu TAKEUCHI
Meaningfulness of the data identity should
also belong to the specific data types. As
shown by the previous examples, the difference
in the semantics of a scalar node and that of
a complex node is not always clear.
Looks pretty clear to me. Scalars are "values". They have _no_ identity,
they have _only_ content. Complex nodes have identity, and as you pointed
out, this means their actual content may be irrelevant (for comparison).
The current spec gets that last point wrong.
I think the core issue here is identity of scalars. You seem to assume
that a YAML processor _must_ preserve the identity of scalars. That is, it
_must not_, for example, use interned strings for keys. The current spec
says the opposite. A YAML processor _need not_ preserve scalar identity and
it _may_ use interned strings and other similar tricks. It is definitely
not required to keep the identity of, say, integer scalars!
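
The interning point can be illustrated in Python, used here only as an example host language. Two strings built separately have equal content, while whether they are the same object is an implementation detail; after `sys.intern` they are guaranteed to be one object. An application that compared such keys by identity would therefore get answers that depend on what the processor happened to do.

```python
import sys

# Two keys with the same content, built at run time from different pieces.
a = "".join(["ke", "y"])
b = "".join(["k", "ey"])

# Content-based equality does not depend on how the strings are stored.
same_content = (a == b)

# A processor is free to intern scalars, which collapses equal strings
# into a single object and erases any per-occurrence identity.
interned_same = sys.intern(a) is sys.intern(b)
```

The sketch deliberately does not evaluate `a is b` before interning, since that result is not something a program should rely on.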
Post by Osamu TAKEUCHI
... without declaring
possible non-preservation of identity for
scalars, nobody will think a data with an
identity-based equality evaluation must be
stored as a collection node and must not as a
scalar node. It brings some surprise to users.
Really? People would be very surprised to hear that { a: 1, a: 2 } is
actually OK because some application somewhere _may_ decide it wants scalar
string keys to use "identity-based equality".
So *No*.
For increased readability and portability, the above _must_ be allowed to
be flagged as a duplicate by a YAML processor regardless of what the
application is. And _should_ be flagged so by "well behaved" processors.
Even if they do _not_ have an application layer.
Whoops, I misread your sentence. "_must_ be allowed to be flagged". Yes I
completely agree with that statement. *If* a parser wants to detect
duplicate content/tag scalar keys it must be *allowed* to. I first read
this to say that parsers must. Since you and I have considered streaming
from the start, I was a bit surprised that you would say that. :)

My other points stand.

...
Post by Oren Ben-Kiki
Post by Osamu TAKEUCHI
Such restriction will improve YAML's readability
and portability very little if any.
We'll have to agree to disagree, I'm afraid...
Post by Osamu TAKEUCHI
Actually, I
believe the restriction is currently not known
widely and very few libraries and applications
have ever utilized it.
A pity.
Post by Osamu TAKEUCHI
I don't think many
existing YAML documents lose their meaning if we
drop the restriction.
No valid YAML documents will, that's for sure ;-) But that's beside the
point.
Post by Osamu TAKEUCHI
So, let's make the spec simpler by dropping the
definition of YAML's own equality and identity
preservation.
There's no such thing as not addressing the issue of identity and equality
in the spec. Either you _require_ a YAML processor to preserve the identity
of scalars (including, horribly, simple integers), or you do not. Either
way it needs to be stated in the spec.
We chose to say a YAML processor _need not_ preserve the identity of
scalars. Given this, then an application _must not_ use scalar identity for
equality comparisons. Given this, then _regardless_ of the application's
[definition of equality], { a: 1, a: 2 } contains a duplicate key.
Post by Osamu TAKEUCHI
A well-behaved processor _should_ detect a
duplicate key and flag it as an error if it
can correctly evaluate equality of nodes.
So far so good.
Post by Osamu TAKEUCHI
It _must_ be aware that data with some specific
tags may have some custom comparison algorithms,
including the one based on the data identity.
Yes, the current spec gets the identity point wrong.
Post by Osamu TAKEUCHI
Namely, two YAML nodes of same values and same
tags can be evaluated to be unequal by an
identity-based evaluator,
_Only_ if these are complex nodes.
Post by Osamu TAKEUCHI
while two YAML nodes
of different values and even different tags can
be evaluated to be equal by some specific
evaluators. Note that JavaScript does not
natively distinguish an integer 0x01 from a
sequence [1] as mapping keys.
You keep conflating false positives with false negatives. False negatives
are _fine_. It is OK for the processor to miss some cases of key
duplication. In fact it is expected. The application is the final arbiter
of key equality. You can keep on piling as many examples of "the processor
can't detect keys in <some example> as duplicated" as you want. OF COURSE
there are such cases.
But this does not mean in any way shape or form that we allow false
positives. A YAML processor must not ever complain about key duplication
when such a duplication does not exist. Now, in JavaScript, PHP, Perl,
Ruby, Python, C++, and any other valid YAML system, { a: 1, a: 2 } _does_
have a duplicate key. So a processor _is_ allowed and _should_ complain
about this, _regardless_ of the application-defined equality operator.
Post by Osamu TAKEUCHI
It is also warned that tags of nodes can be
implicitly specified by the path of the node
from the root. So, a schema-blind YAML processor
can never know how to resolve a tag for any
tag-unspecified node.
The path to both the "a" keys in { a: 1, a: 2 } is, by definition, the
same (the path to all keys in the same mapping is, by definition,
identical). So whatever tag is assigned to one of them, by definition, the
same tag must be assigned to the other as well. The application _can't_ use
different tags to distinguish between them. It _can't_ use their identity
to distinguish between them because the YAML processor need not give them
different identities. It _can't_ use their content to distinguish between
them because they have the same content. So, the application _must_
consider them equal - there's just no other possible choice.
So we _can_ complain about them being equal at an earlier processing
stage. We do not _require_ a YAML processor to do so, but we _allow_ and
_encourage_ it to do so.
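
Oren's "limited form" of detection can be sketched as a check that reports only the cases every schema must agree on. This is a minimal Python illustration with a hypothetical function name. Each scalar key is assumed to arrive as a (tag, content) pair, where the tag is either an explicit tag or one of the non-specific tags `?` (plain scalar) and `!` (quoted scalar).

```python
def certain_duplicates(keys):
    """Return scalar keys of one mapping that are duplicates under any schema.

    Only keys with an identical tag and identical content are reported.
    Keys that some schema might also consider equal (for example 1 and
    0x01) are deliberately missed: false negatives are acceptable here,
    false positives are not.
    """
    seen = set()
    duplicates = []
    for tag, content in keys:
        if (tag, content) in seen:
            duplicates.append((tag, content))
        else:
            seen.add((tag, content))
    return duplicates


# { a: 1, a: 2 }: same path, same non-specific tag, same content.
flagged = certain_duplicates([("?", "a"), ("?", "a")])

# { 1: x, "1": y } and { 1: x, 0x01: y }: left for the application to decide.
quoted_vs_plain = certain_duplicates([("?", "1"), ("!", "1")])
different_spelling = certain_duplicates([("?", "1"), ("?", "0x01")])
```

Because the check needs no schema, a schema-blind processor can run it, provided it holds the keys of the current mapping in memory.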
Post by Osamu TAKEUCHI
A well-behaved YAML
processor _must_ be schema aware,
Now this is just plain wrong. YamlReference is a YAML processor. It
implements the parsing stage. It has no clue whatsoever what schema is
used. Schema-blind YAML processing is, for me, an important use case.
And if every possible schema in the universe _must_ decree that two keys
are equal, then we don't need to know the _specific_ schema, because
whatever it is, it will also _have_ to declare them equal.
Oren.
------------------------------------------------------------------------------
_______________________________________________
Yaml-core mailing list
https://lists.sourceforge.net/lists/listinfo/yaml-core
Andrey Somov
2016-03-07 15:42:16 UTC
Permalink
Post by Ingy dot Net
Oren, if you are suggesting that {a: 1, a: 2} *must* be detected at the
parser level, then I'd have to disagree from the streaming standpoint,
since the 2 keys may be lightyears away from each other.

I completely support that {a: 1, a: 2} must NOT be detected at the parser
level.

I must admit that I do not follow the discussion here because it is too
much, but I agree with Osamu.

Cheers,
Andrey
Post by Ingy dot Net
Oren, Osamu,
My concern with this thread is that you seem to be limiting your meaning
of "YAML processors" to Loader/Dumper processors (processors that carry
data from text to native programming in-memory objects, and back), i.e. where
the "Application" level uses native objects.
While that is certainly the normal case, YAML conjecture should also be
weighed against processors that stop after the parser (at the event
stream), i.e. where the application is a streaming filter/mapper and no graph
construction or native objects ever happen.
YAML conjecture should also be weighed against non-terminating (or
extremely huge) YAML nodes (documents, mappings, sequences).
Oren, if you are suggesting that {a: 1, a: 2} *must* be detected at the
parser level, then I'd have to disagree from the streaming standpoint,
since the 2 keys may be lightyears away from each other.
FWIW, libyaml doesn't reason on keys and shouldn't. libyaml makes a great
streaming parser/emitter.
Key order can't really even be discussed at this level. Key order *can't* be
changed by a parser, so of course you can make it meaningful if you decide
to. You just have to be aware that your meaningfulness expires should your
YAML be used in a graphing context.
Post by Oren Ben-Kiki
Post by Osamu TAKEUCHI
Oren,
I expect we share the same thinking that the
definition of equality belongs to the domain
specific data type, not to the serialization
language.
Pretty much.
Post by Osamu TAKEUCHI
So, unless it makes the serialized
documents much more readable or portable, a
serialization language should not determine
its own equality or identity definition.
That's a big "unless".
Post by Osamu TAKEUCHI
I agree duplicate key should be detected by
YAML processors because we do not want users
to use duplicate keys for overwriting the
values of predefined keys.
There is no concept of overwriting in a streaming processor. You need a
graph to do that.
Post by Oren Ben-Kiki
Yes.
Post by Osamu TAKEUCHI
The key order in
a YAML mapping should not have meaning.
Very strong yes.
Depends on the layer.
Post by Oren Ben-Kiki
Post by Osamu TAKEUCHI
But it can be done without defining equality
in YAML spec. YAML processor can use native
equality evaluator of the data at its
construction stage and it should do so.
No. You assume there _is_ a construction stage. There need not be one.
Oren. I agree, although I think you mean that construction is skipped on
the way to native, where I'm saying that processing might never get to that
level.
Post by Oren Ben-Kiki
Post by Osamu TAKEUCHI
If the layered structure of the YAML processor
does not allow it, the layered structure itself
should be revised. I don't see how the layered
structure is related to the current topic,
though.
It is crucial to the discussion. We all agree that key duplication
detection _must_ be done at the application layer, but the point is that a
_limited_ form of key duplication detection _may_ and _should_ be done,
especially in YAML processors that do not even _have_ an application layer.
This is because, as you put it, "it makes the serialized documents much
more readable or portable".
If the application is not a graph then duplicate detection mustn't and
likely can't be done.
Post by Oren Ben-Kiki
Post by Osamu TAKEUCHI
Similarly, I do not want to forbid PHP users
to store a PHP's native key-order-aware hash
into a key-order-unaware YAML mapping
The problem is, how can you tell whether or not this is safe to do? When
dumping such a hash table to YAML, the application needs to provide some
hint to the YAML processor whether this is actually safe. By default, it is
_not_ safe, so without an explicit hint, the YAML processor _should_ do the
safe thing and emit it as an !!omap.
PHP is not the outlier. Most JavaScript implementations have predictable
key order. I use JS all the time to preserve key order when doing things
like converting JSON to (block formatted) YAML.
Post by Oren Ben-Kiki
Post by Osamu TAKEUCHI
Meaningfulness of the data identity should
also belong to the specific data types. As
shown by the previous examples, the difference
in the semantics of a scalar node and that of
a complex node is not always clear.
Looks pretty clear to me. Scalars are "values". They have _no_ identity,
they have _only_ content. Complex nodes have identity, and as you pointed
out, this means their actual content may be irrelevant (for comparison).
The current spec gets that last point wrong.
I think the core issue here is identity of scalars. You seem to assume
that a YAML processor _must_ preserve the identity of scalars. That is, it
_must not_, for example, use interned strings for keys. The current spec
says the opposite. A YAML processor _need not_ preserve scalar identity and
it _may_ use interned strings and other similar tricks. It is definitely
not required to keep the identity of, say, integer scalars!
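The difference between scalar content and scalar identity can be sketched in Python. This is an illustration only, using Python strings as stand-ins for scalar nodes; whether two equal strings are the same object is an implementation detail, which is exactly why an application should not depend on it.

```python
import sys

# Build two equal strings at run time so they start out as separate objects
# in typical implementations (this is not guaranteed by the language).
first = "".join(["k", "e", "y"])
second = "".join(["k", "e", "y"])

# Content equality always holds.
same_content = first == second

# A loader is free to hand back one shared object for both scalars,
# for example by interning them. After interning, identity coincides.
shared_first = sys.intern(first)
shared_second = sys.intern(second)
same_object_after_interning = shared_first is shared_second

print(same_content, same_object_after_interning)
```

Both values are true; what `first is second` yields before interning is deliberately left unstated, since it may differ between implementations.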
Post by Osamu TAKEUCHI
... without declaring
possible non-preservation of identity for
scalars, nobody will think a data with an
identity-based equality evaluation must be
stored as a collection node and must not as a
scalar node. It brings some surprise to users.
Really? People would be very surprised to hear that { a: 1, a: 2 } is
actually OK because some application somewhere _may_ decide it wants scalar
string keys to use "identity-based equality".
So *No*.
For increased readability and portability, the above _must_ be allowed to
be flagged as a duplicate by a YAML processor regardless of what the
application is. And _should_ be flagged so by "well behaved" processors.
Even if they do _not_ have an application layer.
Whoops, I misread your sentence. "_must_ be allowed to be flagged". Yes I
completely agree with that statement. *If* a parser wants to detect
duplicate content/tag scalar keys it must be *allowed* to. I first read
this to say that parsers must. Since you and I have considered streaming
from the start, I was a bit surprised that you would say that. :)
My other points stand.
...
Post by Oren Ben-Kiki
Post by Osamu TAKEUCHI
Such restriction will improve YAML's readability
and portability very little if any.
We'll have to agree to disagree, I'm afraid...
Post by Osamu TAKEUCHI
Actually, I
believe the restriction is currently not known
widely and very few libraries and applications
have ever utilized it.
A pity.
Post by Osamu TAKEUCHI
I don't think many
existing YAML documents lose their meaning if we
drop the restriction.
No valid YAML documents will, that's for sure ;-) But that's beside the
point.
Post by Osamu TAKEUCHI
So, let's make the spec simpler by dropping the
definition of YAML's own equality and identity
preservation.
There's no such thing as not addressing the issue of identity and
equality in the spec. Either you _require_ a YAML processor to preserve the
identity of scalars (including, horribly, simple integers), or you do not.
Either way it needs to be stated in the spec.
We chose to say a YAML processor _need not_ preserve the identity of
scalars. Given this, then an application _must not_ use scalar identity for
equality comparisons. Given this, then _regardless_ of the application's
equality operator, { a: 1, a: 2 } contains a duplicate key.
Post by Osamu TAKEUCHI
A well-behaved processor _should_ detect a
duplicate key and flag it as an error if it
can correctly evaluate equality of nodes.
So far so good.
Post by Osamu TAKEUCHI
It _must_ be aware that a data with some specific
tags may have some custom comparison algorithms,
including the one based on the data identity.
Yes, the current spec gets the identity point wrong.
Post by Osamu TAKEUCHI
Namely, two YAML nodes of same values and same
tags can be evaluated to be unequal by an
identity-based evaluator,
_Only_ if these are complex nodes.
Post by Osamu TAKEUCHI
while two YAML nodes
of different values and even different tags can
be evaluated to be equal by some specific
evaluators. Note that JavaScript does not
natively distinguish an integer 0x01 from a
sequence [1] as mapping keys.
You keep conflating false positives with false negatives. False negatives
are _fine_. It is OK for the processor to miss some cases of key
duplication. In fact it is expected. The application is the final arbiter
of key equality. You can keep on piling as many examples of "the processor
can't detect keys in <some example> as duplicated" as you want. OF COURSE
there are such cases.
But this does not mean in any way shape or form that we allow false
positives. A YAML processor must not ever complain about key duplication
when such a duplication does not exist. Now, in JavaScript, PHP, Perl,
Ruby, Python, C++, and any other valid YAML system, { a: 1, a: 2 } _does_
have a duplicate key. So a processor _is_ allowed and _should_ complain
about this, _regardless_ of the application-defined equality operator.
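The "no false positives" rule can be sketched as a deliberately conservative check. The `Scalar` type and `certain_duplicates` function below are hypothetical names for illustration, not part of any YAML library. The check flags a key only when an earlier scalar key in the same mapping has the same tag and the same content, and it stays silent in every other case, leaving those to the application.

```python
from typing import List, NamedTuple

STR = "tag:yaml.org,2002:str"
INT = "tag:yaml.org,2002:int"


class Scalar(NamedTuple):
    """Hypothetical scalar node: a resolved tag plus its content."""
    tag: str
    content: str


def certain_duplicates(keys: List[object]) -> List[Scalar]:
    """Return keys that every application would have to treat as duplicates.

    Only scalars with identical tag and identical content are flagged.
    Anything else (collections, differently formatted integers, different
    tags) is skipped: a false negative is acceptable, a false positive is not.
    """
    seen = set()
    flagged: List[Scalar] = []
    for key in keys:
        if not isinstance(key, Scalar):
            continue  # collection keys have identity; leave them alone
        if key in seen:
            flagged.append(key)
        seen.add(key)
    return flagged


# { a: 1, a: 2 }: same tag, same content, so it is flagged.
print(certain_duplicates([Scalar(STR, "a"), Scalar(STR, "a")]))

# { 1: "int", "1": "string" }: different tags, so it is not flagged here.
print(certain_duplicates([Scalar(INT, "1"), Scalar(STR, "1")]))

# { 1: x, 0x01: y }: an application may call these equal, but this check
# cannot be sure, so it says nothing (a tolerated false negative).
print(certain_duplicates([Scalar(INT, "1"), Scalar(INT, "0x01")]))
```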
Post by Osamu TAKEUCHI
It is also warned that tags of nodes can be
implicitly specified by the path of the node
from the root. So, a schema-blind YAML processor
can never know how to resolve a tag for any
tag-unspecified node.
The path to both the "a" keys in { a: 1, a: 2 } is, by definition, the
same (the path to all keys in the same mapping is, by definition,
identical). So whatever tag is assigned to one of them, by definition, the
same tag must be assigned to the other as well. The application _can't_ use
different tags to distinguish between them. It _can't_ use their identity
to distinguish between them because the YAML processor need not give them
different identities. It _can't_ use their content to distinguish between
them because they have the same content. So, the application _must_
consider them equal - there's just no other possible choice.
So we _can_ complain about them being equal at an earlier processing
stage. We do not _require_ a YAML processor to do so, but we _allow_ and
_encourage_ it to do so.
Post by Osamu TAKEUCHI
A well-behaved YAML
processor _must_ be schema aware,
Now this is just plain wrong. YamlReference is a YAML processor. It
implements the parsing stage. It has no clue whatsoever what schema is
used. Schema-blind YAML processing is, for me, an important use case.
And if every possible schema in the universe _must_ decree that two keys
are equal, then we don't need to know the _specific_ schema, because
whatever it is, it will also _have_ to declare them equal.
Oren.
------------------------------------------------------------------------------
_______________________________________________
Yaml-core mailing list
https://lists.sourceforge.net/lists/listinfo/yaml-core
Ingy dot Net
2016-03-07 16:04:22 UTC
Permalink
Post by Andrey Somov
Post by Ingy dot Net
Oren, if you are suggesting that {a: 1, a: 2} *must* be detected at the
parser level, then I'd have to disagree from the streaming standpoint,
since the 2 keys may be lightyears away from each other.
I completely support that {a: 1, a: 2} must NOT be detected at the parser
level.
That's not the right wording. Oren said that the detection must be *allowed*,
and I agree.

I wrote that before I got to the part below where I reread what Oren said:

Really? People would be very surprised to hear that { a: 1, a: 2 } is
actually OK because some application somewhere _may_ decide it wants scalar
string keys to use "identity-based equality".
So *No*.
For increased readability and portability, the above _must_ be allowed to
be flagged as a duplicate by a YAML processor regardless of what the
application is. And _should_ be flagged so by "well behaved" processors.
Even if they do _not_ have an application layer.
Whoops, I misread your sentence. "_must_ be allowed to be flagged". Yes I
completely agree with that statement. *If* a parser wants to detect
duplicate content/tag scalar keys it must be *allowed* to. I first read
this to say that parsers must. Since you and I have considered streaming
from the start, I was a bit surprised that you would say that. :)
Post by Andrey Somov
I must admit that I do not follow the discussion here because it is too
much,
but I agree with Osamu.
I don't follow you. You agree with Osamu on what?
Post by Andrey Somov
Cheers,
Andrey
Andrey Somov
2016-03-07 16:08:55 UTC
Permalink
Post by Ingy dot Net
I don't follow you. You agree with Osamu on what?
Next YAML: drop equality definition

(I think the whole conversation is about that)

Andrey
Oren Ben-Kiki
2016-03-07 18:54:56 UTC
Permalink
This has been a long thread...

The way I see it: The spec is what it is :-)

Using the layered approach to describe YAML processing, then:

Layer 0 is the text file (Unicode encoding etc.). It doesn't do anything.

Layer 1 is the parser/scanner/grammar/etc. It doesn't do duplicate key
detection. It does provide key order. It provides indentation levels, and
the way strings are wrapped in lines, and how characters are quoted, and
lots of other "presentation" details. It can be streaming. YamlReference is
an example.

Layer 2 is building an abstract node graph (with either some or all of the
nodes having resolved tags). It _need not_ preserve key order. It _need
not_ preserve identity of scalars. It _may_ do duplicate key detection, but
if it does it _must_ do so in a way that ensures "no false positives", that
is, it may _only_ flag keys as duplicate if the application _must_ consider
them equal, given the YAML data model (thanks to Osamu for finding a gap in
the definition of this data model).

Layer 3 is the application native data structures. They _must not_ depend
on scalar identity. They _must not_ depend on key order in a mapping. They
_must not_ depend on duplicate keys in a mapping. You can have key order
and duplicate keys in an !!omap or something like that.

A concrete system can work at level 0 only (e.g., re-encode a YAML file
from utf-8 to utf-32be). Or level 0/1 only (e.g., a YAML pretty-printer).
Or level 0/1/2 only (e.g., a ypath tool for extracting specific fragments
from a document). Or level 0/1/2/3 (a full application). All are valid.
Each one uses a different (related) data model.

Much of the confusion in this very long thread is due to people applying
the restrictions or data model of a level X to another level Y where they
do not apply, and to confusing what is _allowed_ with what is
_required_ at each level.

The above rules allow _control_ over the interoperability of YAML data
between systems. Note - "control", not "guarantee". Nothing can "guarantee"
interoperability between the native data types of completely unrelated
platforms.

The rules also try to minimize the "surprise" people may feel when learning
how the application actually interprets the data.

Sometimes this means that less-common data needs to be serialized with a
bit more syntax (e.g., !!omap notation vs. !!map notation). Note the fact
we _require_ the use of a less-streamlined syntax (1) does not _prevent_ us
from being able to serialize "anything at all" into YAML and (2) does not
necessarily reduce legibility - in fact, it arguably increases it.
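As a rough illustration of the !!map versus !!omap trade-off, here is a sketch using plain Python values as stand-ins for the constructed data. The YAML strings are shown only for comparison and are not parsed, and the dict and list-of-pairs representations are an assumption about how a loader might construct each kind.

```python
# YAML text for comparison only; nothing here parses it.
AS_MAP = "{ b: 2, a: 1 }"
AS_OMAP = "!!omap [ b: 2, a: 1 ]"

# A mapping: key order carries no meaning, so reordering changes nothing.
plain_mapping = {"b": 2, "a": 1}
reordered_mapping = {"a": 1, "b": 2}
mappings_equal = plain_mapping == reordered_mapping

# An ordered mapping, modelled as a list of pairs: order is part of the data.
ordered_pairs = [("b", 2), ("a", 1)]
reordered_pairs = [("a", 1), ("b", 2)]
pairs_equal = ordered_pairs == reordered_pairs

print(mappings_equal)  # True: order is not significant in a mapping
print(pairs_equal)     # False: order is significant in the pair list
```

The extra syntax of the second form is the price of saying explicitly that order matters.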

Now, you can break any of the above rules, with the understanding that by
doing so you are stepping outside of what YAML provides. In this case your
system may produce unexpected results if valid YAML processors are applied
to the data, and will "surprise" people who expect YAML behavior. So, while
we can't prevent people from doing whatever they want - this is a free
universe - we can require that people doing such things will not say
"this is a valid YAML system".

BTW: The reason that tags can't apply to different kinds of nodes is due to
the identity issue. Collection identity is guaranteed by YAML. Scalar
identity is not. Applying the same tag to both raises some sticky issues.
That said, we may be able to relax this by using careful wording
(especially given the gap Osamu has found). I'm not certain it is a problem
in practice, though.

So... "it is what it is". I hope the above helps explain why.

Oren.
Ingy dot Net
2016-03-07 19:33:11 UTC
Permalink
Thanks Oren,

I'll try to make all of this clear in the YAML Developers Guide. I can
start by making a document that declares the layers and what properties
apply and such. Then we can have more precise and targeted arguments. :)

My main addition is that the reader/writer transforms between your layer 0
and 1 are responsible for transforming between a Unicode encoding and a
stream of Unicode code point integers. Do you agree?

Ingy
Oren Ben-Kiki
2016-03-07 19:44:06 UTC
Permalink
Yes, I see layer 0 as being the "physical layer" - bytes in the file, which
need to be converted to code points so that layer 1, the "presentation
layer" can parse/tokenize/scan it. This is consistent with section 3 of the
spec.
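The layer 0 to layer 1 hand-off described here can be sketched with Python's standard codecs. The function names are hypothetical and this is not any YAML library's reader: bytes are decoded into code points for the parser, and a layer-0-only tool can re-encode a file without understanding its YAML structure at all.

```python
from typing import List

# Layer 0: the bytes as they would sit in a UTF-8 file.
text_bytes = "key: caf\u00e9\n".encode("utf-8")


def read_code_points(data: bytes, encoding: str = "utf-8") -> List[int]:
    """Reader: turn encoded bytes into a stream of Unicode code points."""
    return [ord(ch) for ch in data.decode(encoding)]


def reencode(data: bytes, source: str, target: str) -> bytes:
    """A layer-0-only tool: change the encoding, never look at the YAML."""
    return data.decode(source).encode(target)


code_points = read_code_points(text_bytes)
utf32 = reencode(text_bytes, "utf-8", "utf-32-be")

# The same code points come out regardless of the encoding on disk.
print(code_points == read_code_points(utf32, "utf-32-be"))
```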
Post by Ingy dot Net
Thanks Oren,
I'll try to make all of this clear in the YAML Developers Guide. I can
start by making a document that declares the layers and what properties
apply and such. Then we can have more precise and targeted arguments. :)
My main addition is that the reader/writer transforms between your layer 0
and 1 are responsible for transforming between a unicode encoding and a
stream of unicode code point integers. Do you agree?
Ingy
Post by Oren Ben-Kiki
This has been a long thread...
The way I see it: The spec is what it is :-)
Layer 0 is the text file (Unicode encoding etc.). It doesn't do anything.
Layer 1 is the parser/scanner/grammer/etc. It doesn't do duplicate key
detection. It does provide key order. It provides indentation levels, and
the way strings are wrapped in lines, and how characters are quoted, and
lots of other "presentation" details. It can be streaming. YamlReference is
an example.
Layer 2 is building an abstract node graph (with either some or all of
the nodes having resolved tags). It _need not_ preserve key order. It _need
not_ preserve identity of scalars. It _may_ do duplicate key detection, but
if it does it _must_ do so in a way that ensures "no false positives", that
is, it may _only_ flag keys as duplicate if the application _must_ consider
them equal, given the YAML data model (thanks to Osamu for finding a gap in
the definition of this data model).
Layer 3 is the application native data structures. They _must not_ depend
on scalar identity. They _must not_ depend on key order in a mapping. They
_must not_ depend on duplicate keys in a mapping. You can have key order
and duplicate keys in an !!omap or something like that.
A concrete system can work at level 0 only (e.g., re-encode a YAML file
from utf-8 to utf-32be). Or level 0/1 only (e.g., a YAML pretty-printer).
Or level 0/1/2 only (e.g., a ypath tool for extracting specific fragments
from a document). Or level /0/1/2/3 (a full application). All are valid.
Each one uses a different (related) data model.
Much of the confusion in this very long thread is due to people applying
the restrictions or data model of a level X to another level Y where they
do not apply. Also, confusing between what is _allowed_ and what is
_required_ in each level.
The above rules allow _control_ over the interoperability of YAML data
between systems. Note - "control", not "guarantee". Nothing can "guarantee"
interoperability between the native data types of completely unrelated
platforms.
The rules also try to minimize the "surprise" people may feel when
learning on how the application actually interprets the data.
Sometimes this means that less-common data needs to be serialized with a
bit more syntax (e.g., !!omap notation vs. !!map notation). Note the fact
we _require_ the use of a less-streamlined syntax (1) does not _prevent_ us
from being able to serialize "anything at all" into YAML and (2) does not
necessarily reduce legibility - in fact, it arguably increases.
Now, you can break any of the above rules, with the understanding that by
doing so you are stepping outside of what YAML provides. In this case your
system may produce unexpected results if valid YAML processors are applied
to the data, and will "surprise" people who expect YAML behavior. So, while
we can't prevent people from doing whatever they want - this is a free
universe - we can require that people doing such things will not say
"this is a valid YAML system".
BTW: The reason that tags can't apply to different kinds of nodes is due
to the identity issue. Collection identity is guaranteed by YAML. Scalar
identity is not. Applying the same tag to both raises some sticky issues.
That said, we may be able to relax this by using careful wording
(especially given the gap Osamu has found). I'm not certain it is a problem
in practice, though.
So... "it is what it is". I hope the above helps explain why.
Oren.
_______________________________________________
Yaml-core mailing list
https://lists.sourceforge.net/lists/listinfo/yaml-core
Osamu TAKEUCHI
2016-03-08 00:53:32 UTC
Permalink
Ingy, Oren,

Thank you very much for the comments.
But sorry, I am not interested in the layer structure
very much at least in this thread.

It only matters when you want to detect key duplication
at as early a stage as you can. The only problem for me
was the misuse of the term "YAML parser." From the
beginning, I was trying to say that the detection should
be done after or during construction using native
methods, as it is currently done in the real world.
At least the YAML spec should allow it. If a YAML processor
has no construction stage, it must not do it because
it cannot do it correctly. This will not hurt users
too much, as YamlReference does not do it either.

Actually, very few people put importance on the
layered structure. I consider it very well designed:
consistent, safe and efficient. But users are
interested almost only in the application level.
Most library maintainers do not care about the layers either.

I ask: How many YAML documents are checked per day for key
duplication by tools that have no construction stage but
are aware of application-defined tags? How many YAML
conjectures are working with a non-terminating YAML stream?
How many YAML systems have actually implemented canonical-form
based comparison of nodes with application-defined
tags? I'm afraid they are almost all imaginary.

In contrast, if we have a YAML library that can serialize/
deserialize a complex native object tree containing many
different classes into/from a YAML file, it will be
widely used. YAML's tag system is very well designed
for building such a library. The equality and identity
definitions of the YAML spec forbid some of these use cases
and distort data semantics in others for little
benefit.

I am more interested in discussing what kind of portability
we need and what kind of readability we need.

We should put more importance on realistic use cases.

Osamu Takeuchi
--
武内 修
Zenaan Harkness
2016-03-08 08:19:21 UTC
Permalink
... lots of Acks...
Post by Oren Ben-Kiki
Layer 3 is the application native data structures. They _must not_ depend
on scalar identity. They _must not_ depend on key order in a mapping. They
_must not_ depend on duplicate keys in a mapping. You can have key order
and duplicate keys in an !!omap or something like that.
Ack. This Layer 3 makes perfect sense. Default "application native
YAML map" ("default" schema) does not allow duplicates and must not
depend on key order.

I think of application as perhaps a level above even layer 3? Is this
right? Because at the highest level (e.g. statistics graphing
application), it can save/load as per its own needs, with its own
"default schema", yes?

Or would it still "officially be violating YAML spec" to redefine !map
to be !omap in its "default schema for save files"?

My assumption:
Layer 4 - the user of YAML layer - that code which makes use of a YAML
library, which library provides anything up to and including "YAML
Layer 3".

Layer model is the best thing for clarity of YAML in recent times.
Thank you all!

...
Post by Oren Ben-Kiki
The above rules allow _control_ over the interoperability of YAML data
between systems. Note - "control", not "guarantee". Nothing can "guarantee"
interoperability between the native data types of completely unrelated
platforms.
Massively important distinction. This should be in a spec :D
Post by Oren Ben-Kiki
The rules also try to minimize the "surprise" people may feel when learning
on how the application actually interprets the data.
Sometimes this means that less-common data needs to be serialized with a
bit more syntax (e.g., !!omap notation vs. !!map notation). Note the fact
we _require_ the use of a less-streamlined syntax (1) does not _prevent_ us
from being able to serialize "anything at all" into YAML and (2) does not
necessarily reduce legibility - in fact, it arguably increases.
I think this might answer my question above, but I'm not sure.
Post by Oren Ben-Kiki
Now, you can break any of the above rules, with the understanding that by
doing so you are stepping outside of what YAML provides. In this case your
"Provides by default schema" vs suggests, vs recommends vs allows etc.

With the layer spec, YAML feels to me "mature" and "flexible".
Which = powerful. Great stuff.
Post by Oren Ben-Kiki
system may produce unexpected results if valid YAML processors are applied
to the data, and will "surprise" people who expect YAML behavior. So, while
we can't prevent people from doing whatever they want - this is a free
universe - we can require that people doing such things will not say
"this is a valid YAML system".
And a streaming schema validating YAML parser could do certain checks
to notify re well formedness etc.
Post by Oren Ben-Kiki
BTW: The reason that tags can't apply to different kinds of nodes is due to
the identity issue. Collection identity is guaranteed by YAML. Scalar
identity is not. Applying the same tag to both raises some sticky issues.
That said, we may be able to relax this by using careful wording
(especially given the gap Osamu has found). I'm not certain it is a problem
in practice, though.
So... "it is what it is". I hope the above helps explain why.
Thank you.
Zenaan Harkness
2016-03-07 09:44:15 UTC
Permalink
Post by Osamu TAKEUCHI
I agree duplicate key should be detected by
YAML processors because we do not want users
to use duplicate keys for overwriting the
values of predefined keys.
By "predefined key" do you mean "a key previously defined in this map"?

Example 1: a statistics application wants to store a 'bunch' of
mappings, of data point name, to value;
in Java, if the YAML library uses a LinkedHashMap (as e.g. SnakeYaml
and I think at least one other Java library does) then the application
can simply serialize all these "histogram" pairs into a YAML map.

A list of maps could be used instead of a map. Knowing Java, this
feels more heavyweight than needed. Additional structure implies
storage and processing cost.


Example 2:
Let's say we have a config file that is intended to be user-edited.
The user enters duplicate key:value pairs in a YAML map.
The application:
- might be tolerant of this in some way,
- might want to transparently remove the duplicate,
- might want to log that this has happened,
- might want to be noisy about this e.g. "WARNING: Your config file
contains duplicate key/value pairs, namely "...", this violates the
config file schema - correct the error and re-start the application."
- might want to do other things ?

Transparently ignoring or flagging this 'error' takes away options
from the application - is it good to mandate this in the YAML spec, or
is it better to leave this to YAML library implementers?
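
To make the options in Example 2 concrete, here is a minimal Python sketch in which the application, not the library, picks the duplicate-key policy. It assumes the library hands over the mapping as an ordered list of key/value pairs. The function `build_mapping` and the policy names are hypothetical, not part of any existing YAML library.

```python
import warnings
from typing import Any, Dict, Hashable, Iterable, Tuple

POLICIES = ("error", "warn", "first", "last")


def build_mapping(
    pairs: Iterable[Tuple[Hashable, Any]], on_duplicate: str = "error"
) -> Dict[Hashable, Any]:
    """Build a dict from ordered pairs, handling duplicates per policy."""
    if on_duplicate not in POLICIES:
        raise ValueError(f"unknown policy: {on_duplicate!r}")
    result: Dict[Hashable, Any] = {}
    for key, value in pairs:
        if key in result:
            if on_duplicate == "error":
                raise ValueError(f"duplicate key: {key!r}")
            if on_duplicate == "warn":
                warnings.warn(f"duplicate key {key!r}; keeping the first value")
            if on_duplicate in ("warn", "first"):
                continue  # keep the value seen first
        result[key] = value  # new key, or "last" policy overwriting
    return result


pairs = [("host", "a.example"), ("port", 80), ("host", "b.example")]
print(build_mapping(pairs, on_duplicate="last"))
```

With the default policy the same input raises an error, which matches a strict reading of the current spec. The other policies are the kind of choice that mandatory rejection would take away from the application.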
Post by Osamu TAKEUCHI
The key order in
a YAML mapping should not have meaning.
Ack.

Preservation of key order by the YAML library can be a desirable
attribute for an application, and possibly desirable to be an option
in certain cases (max performance deserialization). (I am not (yet) a
YAML library implementer.)
Osamu TAKEUCHI
2016-03-07 13:25:11 UTC
Permalink
Hi,
Post by Zenaan Harkness
Post by Osamu TAKEUCHI
I agree duplicate key should be detected by
YAML processors because we do not want users
to use duplicate keys for overwriting the
values of predefined keys.
By "predefined key" do you mean "a key previously defined in this map"?
Yes.

I see what you mean.
I myself am actually very close to your side,
while I understand the way YAML currently goes, too.


Let's imagine how we can go your way. Then, we should stop
defining the data models for the three kinds of data nodes:
scalar, mapping, and sequence in the core spec completely.
The spec will just define the syntax of a YAML document and
the data model is solely determined by tags. The data models
for standard tags are given in the schema section so that,
for example, a mapping node with the !!map tag will keep
flagging errors for duplicate keys and ignoring
the key order, while nodes with user-defined tags can adopt
any data model you want.

It allows you to build a LinkedHashMap from a mapping node with
duplicate keys if the tag is resolved to !LinkedHashMap.
You can also build a Hash from a mapping node with duplicate
keys, overwriting previously defined keys, if the tag is
resolved to !HashOverwriting.


To do so, the YAML library provides an API that allows you to build
a native object from the content of the node, preserving
the key order in mapping nodes and the identities of all nodes.

To build an object from a scalar node, your builder function
is called back with the formatted content itself as the
argument.

To build an object from a sequence node, your builder function
is called back with a list of already built native objects as
the argument.

To build an object from a mapping node, your builder function
is called back with the list of key-value pairs, as native
objects in the right order, as the argument.

You are not restricted to building a static value, array or hash
from a scalar, sequence or mapping. You can do whatever you want
using the information given by the library. For example, a
sequence node with !sum_of_int may generate an int value by
calculating the sum of the child nodes and discarding the child
nodes after the calculation.
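
The builder-callback API described above could look roughly like the following Python sketch. The node classes, the `construct` function and the `builders` table are all hypothetical, chosen only to illustrate the idea. The tags !sum_of_int and !HashOverwriting are the ones named in this message.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple


@dataclass
class ScalarNode:
    tag: str
    text: str


@dataclass
class SequenceNode:
    tag: str
    items: List[Any]


@dataclass
class MappingNode:
    tag: str
    pairs: List[Tuple[Any, Any]]  # document order, duplicates kept


def construct(node: Any, builders: Dict[str, Callable[[Any], Any]]) -> Any:
    """Build children first, then call the builder registered for the tag."""
    build = builders[node.tag]
    if isinstance(node, ScalarNode):
        return build(node.text)
    if isinstance(node, SequenceNode):
        return build([construct(item, builders) for item in node.items])
    return build(
        [(construct(k, builders), construct(v, builders)) for k, v in node.pairs]
    )


builders = {
    "!!int": int,
    "!!str": str,
    "!sum_of_int": sum,        # sequence of ints collapses to one int
    "!HashOverwriting": dict,  # later duplicate keys overwrite earlier ones
}

doc = MappingNode("!HashOverwriting", [
    (ScalarNode("!!str", "total"),
     SequenceNode("!sum_of_int", [ScalarNode("!!int", n) for n in ("1", "2", "3")])),
    (ScalarNode("!!str", "x"), ScalarNode("!!int", "1")),
    (ScalarNode("!!str", "x"), ScalarNode("!!int", "2")),
])
print(construct(doc, builders))
```

In this sketch the data model of each node comes only from the builder registered for its tag. The core function knows nothing about duplicate keys or key order.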

This gives much more freedom to us.
It makes the core spec simpler.
It may make YAML more popular.


Actually, this is almost what I did when I wrote a YAML
library in C#. To allow an application to build customized
objects from customized tags, the library must have such an API.
The difference is, I myself tried not to build any native
object that is too far from YAML's data model. If the spec
does not define the data model, people can do whatever they
want.


Technically speaking, keeping the key order and identity of
nodes in the representation graph is not difficult in any
practical language. The key-value pairs for a mapping node
can be stored in an array-like structure in the representation
graph, instead of in a hash-like structure. Unless the library
tries to detect key duplication, it will not need quick key
search for the collection. Scalar nodes will be represented
by some class object in the graph. If we create one instance
for one scalar node, the identity will be preserved automatically.
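
A minimal Python sketch of such a representation graph follows. The class names `GraphScalar` and `GraphMapping` are hypothetical. The only points it tries to show are that a plain list of pairs keeps order and duplicates, and that one instance per scalar node gives identity for free.

```python
class GraphScalar:
    """One instance per scalar node; default equality is object identity."""

    def __init__(self, tag: str, text: str) -> None:
        self.tag = tag
        self.text = text


class GraphMapping:
    """Mapping node that stores its pairs in a list instead of a hash."""

    def __init__(self, tag: str) -> None:
        self.tag = tag
        self.pairs = []  # keeps document order; no hashing or key lookup

    def add(self, key, value) -> None:
        self.pairs.append((key, value))


a = GraphScalar("!!str", "x")
b = GraphScalar("!!str", "x")  # same tag and text, but a separate node

m = GraphMapping("!!map")
m.add(a, GraphScalar("!!int", "1"))
m.add(b, GraphScalar("!!int", "2"))
m.add(a, GraphScalar("!!int", "3"))  # the same node again, as an alias would give

print([key.text for key, _ in m.pairs])
print(a is b, m.pairs[0][0] is m.pairs[2][0])
```

Whether any of these pairs counts as a duplicate is left to whatever later builds a native object from `m.pairs`.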


The remaining concern is that people with less knowledge of the
schema can hardly understand the semantics of the data. But
this is natural. If one does not know the schema, one cannot
even know the data type of a scalar node. The information
lost from the document due to the freedom should be gained
from the schema.


It will be said that such a document will not be portable
enough. IMO, portability does not matter unless one tries to
construct native objects from nodes with tags that cannot
be properly handled on the platform. All well-designed
YAML libraries must be able to build the representation
graph from such a YAML document with the proper schema given.
Then, this representation graph itself can be manipulated
by the application to deal with the data. In this sense,
the document is still portable. Anyway, no more portability
can be expected unless the whole document model defined in the
schema is natively implemented on the platform. This is also
the case for the current spec if a document is full of custom
tags.


Note that the portability of a YAML document without any custom
tags is not affected at all, because the data models of the
standard tags are fully specified in the recommended schema.


It seems consistent to me for now.
I think we can go this way if we decide.

Best,
Osamu Takeuchi
Zenaan Harkness
2016-03-08 07:58:22 UTC
Permalink
Post by Osamu TAKEUCHI
Hi,
Post by Zenaan Harkness
Post by Osamu TAKEUCHI
I agree duplicate key should be detected by
YAML processors because we do not want users
to use duplicate keys for overwriting the
values of predefined keys.
By "predefined key" do you mean "a key previously defined in this map"?
Yes.
I see what you mean.
I myself am actually very close to your side,
while I understand the way YAML currently goes, too.
Let's imagine how we can go your way. Then, we should stop
defining the data models for the three kinds of data nodes:
scalar, mapping, and sequence in the core spec completely.
The spec will just define the syntax of a YAML document and
the data model is solely determined by tags. The data models
for standard tags are given in the schema section so that,
for example, a mapping node with !!map tag will keep
flagging errors for duplicate keys and not being aware of
the key order, while nodes with user defined tags can adopt
any data models as you want.
It allows you to build LinkedHashMap from a mapping node with
duplicate keys if the tag is resolved to !LinkedHashMap.
You can also build a Hash from a mapping node with duplicated
keys with overwriting previously defined keys if the tag is
resolved to !HashOverwriting.
You understood precisely what was in my mind - thank you so much for
putting it into words that YAML people can understand :)
Post by Osamu TAKEUCHI
To do so, the YAML library provides an API that allows you to build
a native object from the content of the node, preserving
the key order in mapping nodes and the identities of all nodes.
To build an object from a scalar node, your builder function
is called back with the formatted content itself as the
argument.
To build an object from a sequence node, your builder function
is called back with a list of already built native objects as
the argument.
To build an object from a mapping node, your builder function
is called back with the list of key-value pairs, as native
objects in the right order, as the argument.
You are not restricted to building a static value, array or hash
from a scalar, sequence or mapping. You can do whatever you want
using the information given by the library. For example, a
sequence node with !sum_of_int may generate an int value by
calculating the sum of the child nodes and discarding the child
nodes after the calculation.
Another good example of the way I was thinking.
Post by Osamu TAKEUCHI
This gives much more freedom to us.
It makes the core spec simpler.
It may make YAML more popular.
Possibly - if the spec is simplified, this may have positive but also
negative consequences - must think through such things.
Post by Osamu TAKEUCHI
Actually, this is almost what I did when I wrote a YAML
library with C#. To allow an application to build customized
object from customized tags, the library must have such API.
YAML spec must still be layered, and support lower layer processors though...
Post by Osamu TAKEUCHI
The difference is, I myself tried not to build any native
object that is too far from YAML's data model. If the spec
does not define the data model, people can do whatever they
want.
Technically speaking, keeping the key order and identity of
nodes in the representation graph is not difficult with any
practical languages. The key-value pairs for a mapping node
can be stored in an array-like structure in the representation
graph, instead of in a hash like structure.
Or have both - like Java's LinkedHashMap or LinkedHashSet - if
Javascript cannot do such a thing "natively", then perhaps Javascript
needs a LinkedHashMap implementation? Or just return a Pair of
(Array,HashMap) - but only in those situations where an application
layer requires this of course.
Post by Osamu TAKEUCHI
Unless the library
tries to detect key duplication, it will not need quick key
search for the collection. Scalar nodes will be represented
by some class object in the graph. If we create one instance
for one scalar node, the identity will be preserved automatically.
The remaining concern is, people with less knowledge on the
schema can hardly understand the semantics of the data. But
this is natural. If one does not know the schema, one cannot
even know the data type of a scalar node. The information
lost from the document due to the freedom should be gained
from the schema.
This makes intuitive sense to me. Even when there's no schema, there
is (in my mind) an implicit "YAML data model schema" anyway...
Post by Osamu TAKEUCHI
It will be said that such a document will not be portable
enough.
This assertion has never properly made sense to me - "contextless
document" means "schema-less document" which means a YAML document
which no application really knows about. This just does not exist in
the real world, except theoretical.

For example, if we consider certain YAML spec test documents to be
"contextless", there is always some implicit schema, and the context
is the YAML spec itself.

So although there may be a default or implicit schema, "schema less
document" just does not make sense to me, and --therefore--:

talking of "such a document will not be portable enough",
also just does not even make sense

Every real document, has an actual context. That actual context, at
the very least -implies- some schema.
Post by Osamu TAKEUCHI
IMO, portability does not matter unless one tries to
construct native objects from nodes with tags that can not
be properly handled in the platform.
"Native object" say from Java, and exporting to YAML, and importing
into Javascript, for example?

Is this merely an implementers problem, not a spec problem? If a
particular programming language is so deficient that it is not
possible to have one set of values accessible by, alternately, both
map and by array or linked list, then that programming language has
some serious limitations...
Post by Osamu TAKEUCHI
All the well-designed
YAML libraries must be able to build the representation
graph from such a YAML document with the proper schema given.
Ack.
Post by Osamu TAKEUCHI
Then, this representation graph itself can be manipulated
by the application to deal with the data. In this sense,
the document is still portable.
Ack.
Post by Osamu TAKEUCHI
Anyway, no more portability
can be expected unless all the document model defined in the
schema is natively implemented in the platform.
"Just a platform issue."
Post by Osamu TAKEUCHI
This is also the case for the current spec if a
document is full of custom tags.
Ack.
Post by Osamu TAKEUCHI
Note that the portability of a YAML document without any custom
tags is not affected at all, because the data models of the
standard tags are fully specified in the recommended schema.
Is "recommended schema" same as "default" schema?
Post by Osamu TAKEUCHI
It seems consistent for me for now.
I think we can go this way if we decide.
Interesting discussion. Thanks all for persisting...
Zenaan
Osamu TAKEUCHI
2016-03-04 16:21:34 UTC
Permalink
Hi Zenaan,
Post by Zenaan Harkness
Could both of these, the "may reject" option, and the "constrain ...
and must reject" option, be part of the schema, or specified by
command line option/ side-channel specification of at the appropriate
layer?
"Implementation dependent" doesn't feel like enough control.
It should be stated that the YAML spec does not specify node equality
at all, and neither do YAML libraries. Equality should be evaluated
by the native object's == operator. In some systems, the equality
of class instances, which are usually stored as mapping
nodes in YAML, is evaluated by identity (pointer address).
In other systems, it is by the value of properties. In some
systems, the equality operator can be overridden to suit the
specific purpose of the object type. In any case, how to
evaluate equality belongs to the definition of the data type.
It should not be determined by the serialization language.
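
To illustrate how native equality varies with the data type, here is a small Python sketch. The classes `ByIdentity` and `ByValue` and the helper `has_duplicate_keys` are hypothetical. The helper simply applies the language's own == operator to already constructed keys.

```python
class ByIdentity:
    """Default Python equality: equal only if it is the same object."""

    def __init__(self, x, y):
        self.x, self.y = x, y


class ByValue:
    """Equality overridden to compare properties instead of identity."""

    def __init__(self, x, y):
        self.x, self.y = x, y

    def __eq__(self, other):
        return isinstance(other, ByValue) and (self.x, self.y) == (other.x, other.y)

    def __hash__(self):
        return hash((self.x, self.y))


def has_duplicate_keys(keys):
    """Report duplicates using only the native == of the constructed keys."""
    seen = []
    for key in keys:
        if any(key == earlier for earlier in seen):
            return True
        seen.append(key)
    return False


print(has_duplicate_keys([ByIdentity(1, 2), ByIdentity(1, 2)]))
print(has_duplicate_keys([ByValue(1, 2), ByValue(1, 2)]))
print(has_duplicate_keys([1, "1"]))
print(has_duplicate_keys([1, 1.0]))
```

The same check gives different answers depending on how each type defines equality. In Python, 1 and "1" are distinct keys while 1 and 1.0 compare equal, so the outcome also depends on the platform.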

It seems that the statement should not have been
"Implementation dependent" but something like "schema
dependent" or "data type dependent". I need some better
words...

Best,
Osamu
Post by Zenaan Harkness
Hi Osamu, excellent post with lots of links and all! Thank you. One
question below.
Post by Osamu TAKEUCHI
There were long arguments how we should treat equality of nodes in YAML.
https://sourceforge.net/p/yaml/mailman/search/?q=%22%5BYAML-core%5D+equality%22&mail_list=all&sort=posted_date%20desc
Currently, equality of nodes is used for two purposes. One is to
reject a mapping with duplicate keys in the YAML 1.2 spec.
The spec says a mapping with duplicate keys should be rejected
by a YAML parser. The other is for allowing a library to represent
some equal scalar nodes by a single identical node to save memory
consumption.
https://sourceforge.net/p/yaml/mailman/message/23572250/
It sounds straightforward at first but not in the reality.
The equality of nodes involves issues when anchor/alias and
implicit tag resolution are involved.
To solve the problems, Oren proposed the following.
https://sourceforge.net/p/yaml/mailman/message/24061658/
Do not specify YAML equality rules. Eliminate most of the discussion
of equality, canonical formats etc. and replace it by a stating that
implementations "may" reject mappings that have "equal" keys,
according to their own *implementation-specific* definition of equality.
Constrain this
to say that nodes with equal tags and equal content are always
equal and hence "must" be rejected as duplicates.
Could both of these, the "may reject" option, and the "constrain ...
and must reject" option, be part of the schema, or specified by
command line option/ side-channel specification of at the appropriate
layer?
"Implementation dependent" doesn't feel like enough control.
Thanks again,
Zenaan