COOPY Guide, version 0.6.5
This page describes the TDIFF diff format, the main format used by the COOPY toolbox to represent differences between tables. It is based on a draft specification developed with Joe Panico; see the version history.
TDIFF documents use the UTF-8 character encoding.
TDIFF documents comprise any number of comment blocks, control blocks, and diff hunks, interleaved in any order. Each diff hunk describes a related set of differences between two tables. Each hunk could stand on its own as an independent TDIFF document. When there is a choice in how to decompose differences between two tables as a sequence of hunks, generators are encouraged to choose a decomposition that minimizes ordering effects between hunks.
Example:
# tdiff version 0.3
/*
 * fix some goofs
 */
* |bridge=Brooklyn|designer:'J.A.Roebling'|length:1595|
= |bridge=Williamsburg|designer:'D.D.Duck'->'L.L.Buck'|length:1600|
* |bridge=Queensborough|designer:'Palmer & Hornbostel'|length:1182|
/*
 * remove spam and add a missing bridge
 */
- |bridge=Spamspan|designer:'S.Spamington'|length:10000|
+ |bridge=Manhattan|designer:'G.Lindenthal'|length:1470|
/*
 * we are done!
 */
Comment blocks are delimited using:
/* */
(C style). Any content can occur within a comment block. Examples:
/* This is an example single-line comment */
/*
   This is an example multi-line comment
*/
Control blocks begin with "# " and end at the next line break (carriage return, linefeed, or both). Control blocks may hold meta information about diffs, or environmental information that might be useful to an interpreter. Apart from the special header control block, they lie outside the scope of this specification.
A TDIFF document should begin with a special control block called the header. The header begins with the characters "# tdiff". It is there to aid in rapid identification of tdiff documents. Example:
# tdiff version 0.3
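As a rough illustration of how the header can be used for quick identification, here is a minimal Python sketch. The function name `read_header` is hypothetical and not part of COOPY; the sketch assumes the version, when present, follows the word "version" as in the example above.

```python
import re

# Hypothetical helper, not part of COOPY. It checks for the "# tdiff" prefix
# and, as an assumption, reads a version number written as "version X".
HEADER_RE = re.compile(r"^# tdiff(?:\s+version\s+(\S+))?")

def read_header(text):
    """Return (is_tdiff, version); version is None when it is absent."""
    lines = text.splitlines()
    first_line = lines[0] if lines else ""
    match = HEADER_RE.match(first_line)
    if match is None:
        return (False, None)
    return (True, match.group(1))
```

For the example header above, `read_header` returns `(True, "0.3")`; a document that starts with anything else yields `(False, None)`.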
A hunk is a series of one or more adjacent diff lines, optionally preceded by a column line, where each diff line represents the differences between the source tables for a single row. The lines within a hunk should be separated by only the newline characters that terminate each diff line, so that they all appear as adjacent lines within a text editor. Within a TDIFF document, hunks are delimited from each other via intermediate filler or comment blocks.
Example hunk:
- |bridge=Spamspan|designer:'S.Spamington'|length:10000|
= |bridge=Williamsburg|designer:'D.D.Duck'->'L.L.Buck'|length:1600|
+ |bridge=Manhattan|designer:'G.Lindenthal'|length:1470|
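The hunk rule (adjacent lines, separated from other hunks by filler or comment blocks) can be sketched in Python as follows. This is an illustrative sketch, not the COOPY implementation: `split_hunks` is a hypothetical name, the `*` line type is accepted only because the examples on this page use it for context rows, and comment delimiters inside quoted values are not handled.

```python
import re

# Assumption: comment blocks never appear inside quoted values.
COMMENT_RE = re.compile(r"/\*.*?\*/", re.DOTALL)
# '@' starts a column line; '*' is included because the examples use it.
LINE_TYPES = "@+-=*"

def split_hunks(text):
    """Group adjacent column/diff lines into hunks (lists of lines)."""
    # Replace each comment block with blank filler so it separates hunks.
    text = COMMENT_RE.sub("\n\n", text)
    hunks, current = [], []
    for line in text.splitlines():
        stripped = line.strip()
        if stripped and stripped[0] in LINE_TYPES:
            current.append(stripped)
        elif current:
            # Blank lines and control blocks ("# ...") end the current hunk.
            hunks.append(current)
            current = []
    if current:
        hunks.append(current)
    return hunks

DOC = """# tdiff version 0.3
/*
 * fix some goofs
 */
* |bridge=Brooklyn|designer:'J.A.Roebling'|length:1595|
= |bridge=Williamsburg|designer:'D.D.Duck'->'L.L.Buck'|length:1600|
* |bridge=Queensborough|designer:'Palmer & Hornbostel'|length:1182|
/*
 * remove spam and add a missing bridge
 */
- |bridge=Spamspan|designer:'S.Spamington'|length:10000|
+ |bridge=Manhattan|designer:'G.Lindenthal'|length:1470|
"""

HUNKS = split_hunks(DOC)
```

Applied to the first example on this page, the sketch yields two hunks: one with three lines and one with two.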
A diff line describes differences in a single row of the two tables that were compared. One table is designated the left or local table (called L) and the other table is designated the right or remote table (called R).
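There are three types of diff lines, one for each way a row can differ between the tables:
    a row present in L but not in R (a removed row)
    a row present in R but not in L (an added row)
    a row present in both L and R, with differing column values (a column diff)
Each diff line occupies its own line in the document, and begins with one of three characters. These three characters are called "line type" characters:
    '-' for a row present in L but not in R
    '+' for a row present in R but not in L
    '=' for a row present in both, with differing column values
The examples in this document also use '*' to mark unchanged context rows.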
The line type character can be left or right padded with any amount of whitespace, for readability. The line type character is followed by any number of name-value pairs, where the names are column names and the values are the values of those columns in that particular row. The name is separated from the value by an equals sign ('=') for identifying columns (usually part of the primary key, but see Keys versus identity) or a colon (':') for all other columns. The line type character and the name-value pairs are separated from one another by a pipe ('|') character.
Example diff line:
= |bridge=Williamsburg|designer:'D.D.Duck'->'L.L.Buck'|length:1600|
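To make the structure of a diff line concrete, here is a small Python sketch that splits one into its line type and name-value pairs. The names `split_outside_quotes` and `parse_diff_line` are hypothetical, and the quoting handling is an assumption: the sketch treats anything between single quotes as opaque and does not attempt escape sequences.

```python
def split_outside_quotes(text, sep):
    """Split text on the single character sep, ignoring sep inside '...'."""
    parts, current, in_quote = [], [], False
    for ch in text:
        if ch == "'":
            in_quote = not in_quote
        if ch == sep and not in_quote:
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    parts.append("".join(current))
    return parts

def find_name_separator(cell):
    """Index of the first '=' or ':' outside quotes, or -1 if none."""
    in_quote = False
    for i, ch in enumerate(cell):
        if ch == "'":
            in_quote = not in_quote
        elif not in_quote and ch in "=:":
            return i
    return -1

def parse_diff_line(line):
    """Return (line_type, [(name, is_identifying, raw_value), ...])."""
    line = line.strip()
    line_type, rest = line[0], line[1:]
    cells = [c.strip() for c in split_outside_quotes(rest, "|")]
    terms = []
    for cell in cells:
        if not cell:
            continue  # empty pieces around leading/trailing dividers
        i = find_name_separator(cell)
        if i < 0:
            terms.append((None, False, cell))  # value with no name
        else:
            terms.append((cell[:i], cell[i] == "=", cell[i + 1:]))
    return (line_type, terms)
```

For the example diff line above, this returns the line type `=` and three terms, with `bridge` flagged as identifying and the `designer` value kept in its raw `'D.D.Duck'->'L.L.Buck'` form.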
Optionally, key names can be "factored out" of diff lines and placed in a special column line. A column line lists column names, followed by "=" for identifying columns. New columns that were not present in the input should have "->" appended, to flag that cells in such columns have no prior values.
Here's a column line example:
@ |bridge=|designer|length|
This establishes bridge as an identifying column that appears first, followed by designer and length columns. We can now rewrite this:
= |bridge=Williamsburg|designer:'D.D.Duck'->'L.L.Buck'|length:1600|
as this:
= |Williamsburg|'D.D.Duck'->'L.L.Buck'|1600|
The effect of column lines should be limited to within a single hunk.
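The rewriting described above can be reversed mechanically. The following Python sketch expands a factored-out diff line back into name-value form using its column line. Both function names are hypothetical, and the sketch makes two simplifying assumptions: no pipe characters occur inside quoted values, and no cell is empty.

```python
def parse_column_line(line):
    """Return [(name, separator), ...] from a column line such as
    '@ |bridge=|designer|length|'."""
    columns = []
    for cell in line.strip()[1:].split("|"):
        cell = cell.strip()
        if not cell:
            continue
        if cell.endswith("->"):
            columns.append((cell[:-2], ":"))  # new column, no prior values
        elif cell.endswith("="):
            columns.append((cell[:-1], "="))  # identifying column
        else:
            columns.append((cell, ":"))
    return columns

def expand_diff_line(columns, line):
    """Re-attach column names to a diff line that omits them."""
    line = line.strip()
    # Assumption: empty cells do not occur, so they can be filtered out.
    values = [c.strip() for c in line[1:].split("|") if c.strip()]
    terms = [name + sep + value for (name, sep), value in zip(columns, values)]
    return line[0] + " |" + "|".join(terms) + "|"
```

With the column line `@ |bridge=|designer|length|`, expanding `= |Williamsburg|'D.D.Duck'->'L.L.Buck'|1600|` gives back the fully named form of the diff line.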
In the case of column diffs, for each cell that was different between L and R, both the old and new values are displayed. The old value must come first, followed by '->' (dash greater than), followed by the new value. For all three diff line types, the generator may include L name-value pairs that are not strictly needed, but may help with row identification.
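A reader of a column diff needs to separate the old value from the new one. A minimal Python sketch, assuming that a `->` inside single quotes belongs to the value and is not a change marker (`split_change` is a hypothetical name):

```python
def split_change(value):
    """Return (old, new) for 'old->new', or (value, None) if unchanged."""
    in_quote = False
    for i, ch in enumerate(value):
        if ch == "'":
            in_quote = not in_quote
        elif not in_quote and value.startswith("->", i):
            # Old value comes first, new value follows the arrow.
            return (value[:i], value[i + 2:])
    return (value, None)
```

For instance, `'D.D.Duck'->'L.L.Buck'` splits into the old value `'D.D.Duck'` and the new value `'L.L.Buck'`, while a plain `1600` is returned unchanged with `None` as the new value.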
Determining whether a row is present in L and R requires a judgment about row identity. This judgment may be simple. For example, the identity of a row may simply be the value of its primary key. However, it is possible that the identity of a row is distinct from its primary key. Consider for example a table with an auto-incrementing integer primary key, rather than something derived from the row data. Comparison of that key between separately maintained copies of that table will be meaningless. For meaningful comparison, an alternate row identity would need to be constructed.
This issue lies outside the TDIFF specification, but it is important that implementors be aware that columns used for identification may or may not be part of the primary key.
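The following Python sketch illustrates why the choice of identifying columns matters. The data and function names are made up for illustration: two copies of a table assign auto-incremented `id` values in a different order, so matching rows by `id` reports spurious changes, while matching by `bridge` reports none.

```python
def index_by_identity(rows, identity_columns):
    """Map each row's identity (a tuple of column values) to the row."""
    return {tuple(row[c] for c in identity_columns): row for row in rows}

def count_changed(local, remote, identity_columns, compare_columns):
    """Count rows present in both tables whose compared columns differ."""
    l_index = index_by_identity(local, identity_columns)
    r_index = index_by_identity(remote, identity_columns)
    shared = set(l_index) & set(r_index)
    return sum(
        1 for key in shared
        if any(l_index[key][c] != r_index[key][c] for c in compare_columns)
    )

# Hypothetical tables: same content, ids assigned in a different order.
LOCAL = [
    {"id": 1, "bridge": "Brooklyn", "length": 1595},
    {"id": 2, "bridge": "Manhattan", "length": 1470},
]
REMOTE = [
    {"id": 1, "bridge": "Manhattan", "length": 1470},
    {"id": 2, "bridge": "Brooklyn", "length": 1595},
]

BY_ID = count_changed(LOCAL, REMOTE, ["id"], ["bridge", "length"])
BY_NAME = count_changed(LOCAL, REMOTE, ["bridge"], ["length"])
```

Here `BY_ID` is 2 (every row looks modified) and `BY_NAME` is 0 (the tables are recognized as equivalent).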
Names or values may be quoted in a TDIFF document. Quoting is done as follows:
It is always safe to single-quote a name or value. Names or values must be quoted under any of the following conditions:
document ::= header ((space)? block)*
block ::= hunk | control | comment | filler
hunk ::= (hunk_header)? ((space)? diff_line)+
hunk_header ::= '@' (space)+ divider (column divider)* break
diff_line ::= ('-' | '+' | '=') (space)+ divider (term divider)* break
term ::= (name ('='|':'))? (value '->')? value
column ::= name ('='|'->')?
break ::= divider? linebreak
control ::= "#" (divider value)* break
comment ::= ("/*" comment_body "*/")
filler ::= (linebreak | divider)+
header ::= "# tdiff" ([^\r\n])* break
divider ::= '|'
linebreak ::= ('\r\n' | '\r' | '\n')
The linebreak non-terminal needs to be handled more carefully than the grammar suggests, since the number of linebreaks is significant in the grammar. The comment_body non-terminal is as for the C language.
In example one, both tables are in an RDBMS, both tables have the same column names, and the rows are identified using column1.
Example 1:
L:                                R:
column1,column2,column3,column4   column1,column2,column3,column4
1, 0000, x, aaaa                  ----------------------------
----------------------------      2, 1111, x, aaaa
3, 2222, x, aaaa                  3, 2222, y, aaaa
4, 3333, x, aaaa                  4, 0000, z, bbbb
5, 4444, x, aaaa                  5, 4444, z, bbbb
6, 5555, x, aaaa                  6, 5555, u, aaaa
----------------------------      7, 0000, v, aaaa
----------------------------      8, 1111, x, aaaa
Example 1 diff, variant 1:
# tdiff version 0.3
/*
 * this is the tDiff document for example 1, using 1 hunk only and no context.
 * Note the "|" usage varies from previous examples in this document.
 * "|" plays the same role as spaces and tabs in the spec, so varying
 * styles are possible.
 */
- | column1=1
+ | column1=2| column2:1111| column3:x| column4:aaaa
= | column1=3| column3:x->y
= | column1=4| column2:3333->0000| column3:x->z| column4:aaaa->bbbb
= | column1=5| column3:x->z| column4:aaaa->bbbb
= | column1=6| column3:x->u
+ | column1=7| column2:0000| column3:v| column4:aaaa
+ | column1=8| column2:1111| column3:x| column4:aaaa
/*
 * end of tDiff document
 */
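For readers who want to see how such a document could be produced, here is a minimal Python sketch that generates the diff lines of variant 1 from the two tables of Example 1. It is an illustration under simplifying assumptions, not the COOPY generator: rows are identified by a single column, rows are emitted in sorted key order, and no quoting is applied. The names `make_rows` and `diff_lines` are hypothetical.

```python
COLUMNS = ["column1", "column2", "column3", "column4"]

def make_rows(text):
    """Turn comma-separated lines into a list of {column: value} dicts."""
    return [
        dict(zip(COLUMNS, (v.strip() for v in line.split(","))))
        for line in text.strip().splitlines()
    ]

def diff_lines(local, remote, key):
    """Emit '-', '+' and '=' diff lines, identifying rows by one column."""
    l_index = {row[key]: row for row in local}
    r_index = {row[key]: row for row in remote}
    others = [c for c in COLUMNS if c != key]
    lines = []
    for k in sorted(set(l_index) | set(r_index)):
        terms = [f"{key}={k}"]
        if k not in r_index:
            kind = "-"  # present in L only
        elif k not in l_index:
            kind = "+"  # present in R only: list all of its values
            terms += [f"{c}:{r_index[k][c]}" for c in others]
        else:
            kind = "="  # present in both: list only the changed cells
            changes = [
                f"{c}:{l_index[k][c]}->{r_index[k][c]}"
                for c in others if l_index[k][c] != r_index[k][c]
            ]
            if not changes:
                continue  # identical rows produce no diff line
            terms += changes
        lines.append(f"{kind} | " + "| ".join(terms))
    return lines

L_ROWS = make_rows("""
1, 0000, x, aaaa
3, 2222, x, aaaa
4, 3333, x, aaaa
5, 4444, x, aaaa
6, 5555, x, aaaa
""")
R_ROWS = make_rows("""
2, 1111, x, aaaa
3, 2222, y, aaaa
4, 0000, z, bbbb
5, 4444, z, bbbb
6, 5555, u, aaaa
7, 0000, v, aaaa
8, 1111, x, aaaa
""")

RESULT = diff_lines(L_ROWS, R_ROWS, "column1")
```

`RESULT` holds the eight diff lines of variant 1, in the same order and with the same spacing.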
Example 1 diff, variant 2:
# tdiff version 0.2
/*
 * here is a tDiff document that is equivalent to the document above, except
 * that it uses 8 hunks, more comments, and adds in some context
 */

/*
 * hunk 1: notice that columns 2,3,4 are context-- not strictly necessary
 * to specify a remove
 */
- | column1=1| column2:0000| column3:x| column4:aaaa

/*
 * hunk 2: notice that the hunks are separated by standalone newline
 */
+ | column1=2| column2:1111| column3:x| column4:aaaa

/*
 * hunk 3: notice that column2 and column4 are merely context
 */
= | column1=3| column2:2222| column3:x->y| column4:aaaa

/*
 * hunk 4: notice that the column diff line is surrounded by context rows, and
 * that the context rows describe the values on the RHS.
 */
* | column1=3| column2:2222| column3:x| column4:aaaa
= | column1=4| column2:3333->0000| column3:x->z| column4:aaaa->bbbb
* | column1=5| column2:4444| column3:x| column4:aaaa

/*
 * hunk 5
 */
= | column1=5| column3:x->z| column4:aaaa->bbbb

/*
 * hunk 6
 */
= | column1=6| column3:x->u

/*
 * hunk 7
 */
+ | column1=7| column2:0000| column3:v| column4:aaaa

/*
 * hunk 8
 */
+ | column1=8| column2:1111| column3:x| column4:aaaa
TDIFF version 0.2 was co-developed by COOPY author Paul Fitzpatrick and diffkit author Joe Panico (TDIFF 0.2 draft, diffkit-user thread). Version 0.3 contains extensions to deal with schema changes and the like, and hasn't been sanity checked by Joe.
Differences between version 0.2 and 0.3: