Osamu TAKEUCHI
2016-03-04 13:50:13 UTC
Hi there,
There were long arguments about how we should treat equality of nodes in YAML.
https://sourceforge.net/p/yaml/mailman/search/?q=%22%5BYAML-core%5D+equality%22&mail_list=all&sort=posted_date%20desc
Currently, equality of nodes is used for two purposes. One is to
reject a mapping with duplicate keys in the YAML 1.2 spec.
The spec says a mapping with duplicate keys should be rejected
by a YAML parser. The other is to allow a library to represent
several equal scalar nodes by a single identical node to save
memory.
https://sourceforge.net/p/yaml/mailman/message/23572250/
It sounds straightforward at first, but it is not in reality.
Node equality becomes problematic when anchors/aliases and
implicit tag resolution are involved.
To solve the problems, Oren proposed the following.
https://sourceforge.net/p/yaml/mailman/message/24061658/
Do not specify YAML equality rules. Eliminate most of the discussion of
equality, canonical formats etc. and replace it by a stating that
implementations "may" reject mappings that have "equal" keys, according to
their own *implementation-specific* definition of equality. Constrain this
to say that nodes with equal tags and equal content are always equal and
hence "must" be rejected as duplicates. The problem here is that { 1: "int",
"1" : "string" } would work in Python and not in Javascript. Arguably,
anyone defining a cross-platform schema would be able to "easily" avoid such
issues (e.g., by requiring all keys of the mapping to have the same tag,
which is pretty trivial). But there's no longer a universal cross-platform
validity guarantee.
But the statement
nodes with equal tags and equal content are always equal and hence
"must" be rejected as duplicates
was still controversial.
I propose to completely stop defining equality in YAML spec and leave
the choice of accepting or rejecting a mapping with possibly duplicate
keys to the specific applications.
[[Point 1]] Equality of complex nodes with anchors and aliases
It is really difficult to strictly judge equality of complex nodes
with anchors and aliases in accordance with YAML 1.2 specification.
To strictly evaluate equality, a parser must implement a complex
graph comparison algorithm. For example, a parser must distinguish
two objects in the following.
%YAML 1.2
---
&A { *A }
---
&A { { *A } }
...
They are not equal to each other under the YAML 1.2 specification
because their node graph topologies differ.
https://sourceforge.net/p/yaml/mailman/message/23572250/
https://sourceforge.net/p/yaml/mailman/message/23576035/
Similarly, the next document should be accepted,
- &A [ *A ]
- &B [ *A ]
- { *A : 1, *B : 2 }
while the next should be rejected unless some specific schema
is applied. (Implicit tag resolution allows a parser to give
different tags to *A and *B according to the path of the node
from the root.)
- &A [ *A ]
- &B [ *B ]
- { *A : 1, *B : 2 }
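As a rough sketch of what such a topology comparison involves, the following Python compares two node graphs built only from sequences and scalars. The `Seq` class and the `same_topology` function are hypothetical names made up for this illustration. Mappings (which need an order-insensitive match) and implicit tag resolution are deliberately left out, so this is one simplified reading of the comparison, not the spec's algorithm.

```python
class Seq:
    """A sequence node. Aliases are modelled as references to the same object."""
    def __init__(self, tag="!!seq"):
        self.tag = tag
        self.items = []


def same_topology(a, b, fwd=None, bwd=None):
    """Walk both graphs in parallel, keeping a one-to-one pairing of nodes."""
    fwd = {} if fwd is None else fwd   # id of node in a's graph -> id in b's graph
    bwd = {} if bwd is None else bwd   # the reverse pairing
    if isinstance(a, Seq) != isinstance(b, Seq):
        return False
    if not isinstance(a, Seq):
        return a == b                  # scalars: plain value comparison (assumption)
    if id(a) in fwd or id(b) in bwd:
        # Already visited: the earlier pairing must be exactly this pair.
        return fwd.get(id(a)) == id(b) and bwd.get(id(b)) == id(a)
    if a.tag != b.tag or len(a.items) != len(b.items):
        return False
    fwd[id(a)] = id(b)
    bwd[id(b)] = id(a)
    return all(same_topology(x, y, fwd, bwd) for x, y in zip(a.items, b.items))


A = Seq(); A.items.append(A)   # &A [ *A ]
B = Seq(); B.items.append(A)   # &B [ *A ]
C = Seq(); C.items.append(C)   # &B [ *B ]

print(same_topology(A, B))     # False: B contains A, which is a different node
print(same_topology(A, C))     # True: both are one node that contains itself
```

Even this restricted version needs bookkeeping for cycles. Supporting mappings would add a matching problem on top of it.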
So, almost no library has implemented YAML 1.2's equality strictly.
In other words, the equality definition of mappings and sequences
in the YAML spec has almost always been neglected so far.
I myself implemented it a long time ago, but I don't think it was
beneficial to anyone, because no one needs strict comparison
of complex graph topology for all mappings in YAML documents.
The underlying issue is that the equality definitions of native
hashes and arrays in different languages are all slightly different
from each other, and also from that of the mapping and sequence in
YAML as discussed. If YAML specifies equality strictly, the
data model of YAML will not fit the native data model of almost
any language or library.
https://sourceforge.net/p/yaml/mailman/message/23591366/
[[Point 2]] Implicit tag resolution
Oren strongly pushed for rejection of duplicate keys in the YAML spec
so that generic "schema-blind" YAML tools can be built.
However, that is impossible anyway.
https://sourceforge.net/p/yaml/mailman/message/24073088/
https://sourceforge.net/p/yaml/mailman/message/24075274/
In the next example, a schema-blind parser can never judge
if *A and *a are equal or not.
- !People
  - &A { name: Mike }
- !Cats
  - &a { name: Mike }
- !Favorites
  *A : beef steak
  *a : canned tuna
In this case, I expect *A should be resolved to !Person while *a
should be !Cat. Note that it is allowed to resolve unspecified tags
from the path of the node from the root. Since the tags are different,
*A and *a are different regardless of their values. The mapping
should not be rejected but no schema-blind parser can judge it
properly.
The point is, equality of nodes should be determined by the schema,
not by the YAML spec.
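To make the point concrete, here is a minimal Python sketch of path-dependent tag resolution. The `CHILD_TAG` table and the `resolve` function are hypothetical; they stand in for whatever rule an application schema might use to tag the children of `!People` and `!Cats`.

```python
# Hypothetical schema rule: the tag of an untagged child depends on its parent's tag.
CHILD_TAG = {"!People": "!Person", "!Cats": "!Cat"}


def resolve(parent_tag, content):
    """Return (tag, content) as a hashable stand-in for a resolved node."""
    tag = CHILD_TAG.get(parent_tag, "?")   # "?" marks a non-specific tag
    return (tag, tuple(sorted(content.items())))


# Schema-aware view: the two "Mike" nodes get different tags.
person = resolve("!People", {"name": "Mike"})
cat = resolve("!Cats", {"name": "Mike"})

# Schema-blind view: no rule applies, so both nodes look identical.
blind_a = resolve(None, {"name": "Mike"})
blind_b = resolve(None, {"name": "Mike"})

print(person == cat)        # False
print(blind_a == blind_b)   # True
```

A tool without the schema sees only the second comparison and would wrongly report a duplicate key.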
Kirill gave a good example where a scalar node could be equal to
a collection node under a realistic schema.
https://sourceforge.net/p/yaml/mailman/message/24073575/
x: 1.0i
y: { re: 0.0, im: 1.0 }
z: { rho: 1.0, phi: 1.5707963267948966 }
The three nodes 'x', 'y', and 'z' are probably equal. You don't know
unless you know the schema. Another example is from his own application,
where nodes
db1: postgres://localhost:5432/mydb
db2: mydb
db3: { engine: postgres, host: localhost, port: 5432, database: mydb }
db4: { database: mydb }
will generate the same object.
Again, the equality should be determined by the schema, not by the
YAML spec.
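A small Python sketch of the complex-number case follows. The `to_complex` constructor is a hypothetical schema rule written only for the three forms shown above; it is not taken from Kirill's application.

```python
import cmath


def to_complex(node):
    """Construct a complex number from the three forms in the example (assumed schema)."""
    if isinstance(node, str) and node.endswith("i"):
        return complex(0.0, float(node[:-1]))        # "1.0i": purely imaginary scalar
    if "re" in node:
        return complex(node["re"], node["im"])       # rectangular mapping
    return cmath.rect(node["rho"], node["phi"])      # polar mapping


x = to_complex("1.0i")
y = to_complex({"re": 0.0, "im": 1.0})
z = to_complex({"rho": 1.0, "phi": 1.5707963267948966})

# The polar form carries a tiny rounding error, so compare with a tolerance.
print(cmath.isclose(x, y), cmath.isclose(x, z))
```

Whether a tolerance is acceptable for key equality is itself a schema decision, which is one more thing the YAML spec cannot settle in general.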
[[Point 3]] A mapping with clearly duplicate keys.
According to my understanding, the next mappings definitely have
duplicate keys under any schema in YAML 1.2, because the spec does
not allow tag resolution to depend on the order in which keys appear
in a mapping.
- { a: true, a: false }
- { !SomeTag b: true, !SomeTag b: false }
- { [1,2]: true, [1,2]: false }
- { {a: 1, [a, b]: 2}: true, {[a, b]: 2, a: 1}: false }
However, implementing the rejection feature is even tougher with arbitrary
schemas in mind, because the parser has to check, in addition to the complex
graph comparison, that two complex nodes with anchors, aliases, and
non-specific tags can never differ under any schema.
I don't think any library will implement such equality evaluation
regardless of what is written in the spec.
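For the easy cases listed above (no anchors, no aliases, every scalar treated as a plain string), duplicate detection can be sketched in Python as below. The `Map` class and the `canon` and `duplicate_keys` functions are hypothetical names for this illustration, and the sketch says nothing about the hard part, which is proving equality under every possible schema.

```python
class Map:
    """A mapping node stored as (key, value) pairs, so that keys may be collections."""
    def __init__(self, *pairs):
        self.pairs = pairs


def canon(node):
    """Turn a node into a hashable form that ignores the order of mapping entries."""
    if isinstance(node, Map):
        return ("map", frozenset((canon(k), canon(v)) for k, v in node.pairs))
    if isinstance(node, list):
        return ("seq", tuple(canon(x) for x in node))
    return ("scalar", type(node).__name__, node)


def duplicate_keys(mapping):
    """Return the keys whose canonical form was already seen earlier in the mapping."""
    seen, dups = set(), []
    for key, _ in mapping.pairs:
        c = canon(key)
        if c in seen:
            dups.append(key)
        seen.add(c)
    return dups


# { {a: 1, [a, b]: 2}: true, {[a, b]: 2, a: 1}: false }
doc = Map(
    (Map(("a", "1"), (["a", "b"], "2")), "true"),
    (Map((["a", "b"], "2"), ("a", "1")), "false"),
)
print(len(duplicate_keys(doc)))   # 1
```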
[[Point 4]] Simplicity.
If we just state
Implementations "may" reject mappings that have "equal" keys,
according to their own *implementation-specific* definition
of equality.
we can drop all description about equality from YAML spec.
It will make the spec simpler.
[[Point 5]] Value-based platform vs Identity-based platform
If the YAML spec stops defining equality, some applications might
generate YAML documents that are not easily manipulated
on *value-based* platforms.
For example,
https://sourceforge.net/p/yaml/mailman/message/24075274/
USA:
  Presidents: !Presidents
    - &PR1
      name: George Washington
    - &PR2
      name: John Adams
    (snip)
    - &PR41
      name: George Bush
    - &PR42
      name: William Clinton
    - &PR43
      name: George Bush
    - &PR44
      name: Barack Obama
    (snip)
  Parties: !Parties
    - &PA1
      name: Republican
    - &PA2
      name: Democratic
    (snip)
  PresidentToParty:
    (snip)
    *PR41 : *PA1
    *PR42 : *PA2
    *PR43 : *PA1
    *PR44 : *PA2
    (snip)
*PR41 and *PR43 have exactly the same properties, probably with
the same tag !President. So, a YAML 1.2 parser will judge that they
are equal to each other and reject the mapping at the end of
the document. But if the YAML parser does not reject it, the
document will be meaningful on an *identity-based* platform.
Oren pointed out that, on a value-based platform, this type of
document cannot be manipulated easily.
In my opinion, it can still be manipulated on a value-based
platform if the YAML library adds some identity field to each
node internally. Otherwise, a YAML library on
a value-based platform will fail anyway to handle valid YAML documents
like the next.
- &A [ *A ]
- &B [ *A ]
- { *A : 1, *B : 2 }
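The identity-field idea can be sketched in Python with value-compared records. The `Node` class, its `node_id` field, and the `make_node` helper are hypothetical; the only point is that an internal identifier keeps two otherwise equal nodes apart when both are used as mapping keys.

```python
from dataclasses import dataclass
from itertools import count

_next_id = count(1)   # source of internal node identifiers (assumption of this sketch)


@dataclass(frozen=True)
class Node:
    """A value-compared node: equal fields mean equal nodes."""
    node_id: int
    name: str


def make_node(name):
    """Create a node with a fresh internal identity."""
    return Node(next(_next_id), name)


pr41 = make_node("George Bush")
pr43 = make_node("George Bush")

# Both presidents survive as separate keys despite having the same properties.
party = {pr41: "Republican", pr43: "Republican"}
print(len(party))   # 2
```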
[[Point 6]] JSON does not forbid duplicate keys in a mapping
[[Conclusion]]
So, I propose to completely quit defining node equality in YAML spec.
It should just state
Implementations "may" reject mappings that have "equal" keys,
according to their own *implementation-specific* definition
of equality.
possibly with some warnings like
https://sourceforge.net/p/yaml/mailman/message/24075274/
Note that some languages have unique definitions for equality.
For example, !!int 1 and !!str "1" are equal mapping keys in
JavaScript.
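What "implementation-specific" equality could look like is sketched below in Python. The two policy functions are hypothetical stand-ins: `tag_and_content` keeps keys with different tags apart, while `string_coercing` imitates a platform that turns every key into a string.

```python
def duplicate_keys(keys, equality_key):
    """Return the keys that collide with an earlier key under the given policy."""
    seen, dups = set(), []
    for tag, text in keys:
        k = equality_key(tag, text)
        if k in seen:
            dups.append((tag, text))
        seen.add(k)
    return dups


def tag_and_content(tag, text):
    """Policy A: keys are equal only if both tag and content match."""
    return (tag, text)


def string_coercing(tag, text):
    """Policy B: every key is coerced to its string content, so the tag is ignored."""
    return text


keys = [("!!int", "1"), ("!!str", "1")]
print(duplicate_keys(keys, tag_and_content))   # []
print(duplicate_keys(keys, string_coercing))   # [('!!str', '1')]
```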
[[Another Possible Solution]] Only defining equality of scalar nodes
Another possible solution is to define the equality of scalar nodes
only. Without mappings and sequences to think about, judging equality
is straightforward, except for the case where an alias used as
a mapping key points to a scalar node with implicit tag resolution.
This will not make the spec too complicated to read or to
implement, and may be worthwhile if it is very beneficial in some cases.
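A scalar-only rule could be as small as the following Python sketch, which compares already-resolved scalars by tag and canonical value. The function names are hypothetical, only the `!!int` notations of the YAML 1.2 core schema (decimal, `0o` octal, `0x` hexadecimal) are handled, and the input is not validated.

```python
def canonical_int(text):
    """Parse a core-schema integer: decimal, 0o octal, or 0x hexadecimal."""
    if text.startswith("0x"):
        return int(text[2:], 16)
    if text.startswith("0o"):
        return int(text[2:], 8)
    return int(text, 10)


def scalar_key(tag, text):
    """Return a hashable (tag, canonical value) pair for an already-resolved scalar."""
    if tag == "!!int":
        return (tag, canonical_int(text))
    return (tag, text)   # other tags: compare the content as written (assumption)


print(scalar_key("!!int", "0x10") == scalar_key("!!int", "16"))   # True
print(scalar_key("!!int", "16") == scalar_key("!!str", "16"))     # False
```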
Thank you for reading the long email.
Best,
Osamu Takeuchi