This document defines the tDiff format; a text encoding for describing the differences between two tables of data.
tDiff output is intended for two audiences: end-users who are looking for an easily readable report that describes the differences between two tables; and tools that will interpret tDiff output in order to perform further processing or analysis (e.g. "patch" tools).
tDiff should be able to describe the differences between any types of data tables. That includes: tables that have an identifiable unique key (e.g. RDBMS tables with primary key) and tables that don’t (RDBMS tables without primary key, or heaps); tables where row ordering is meaningful, whether explicit or implied, such as spreadsheets or CSV files; and tables where row ordering has no inherent meaning (RDBMS tables). Further, tDiff should be able to represent the differences between two different types of table, e.g. an RDBMS table versus a spreadsheet table.
tDiff is designed to described the differences in content between two tables-- it does not have any facilities for describing the differences in structure between two tables. A future revision will expand the specification to include structure or schema information.
tDiff is designed to described text (character) content. It has no facilities to describe differences in content that does not have a natural default character representation. A future revision may expand the specification to include arbitrary binary data.
tDiff can represent the differences between exactly two tables. It cannot represent an n-way compare where n > 2. It also cannot represent the differences between collections of tables, such as multiple-sheet spreadsheets.
tDiff documents use the UTF-8 character encoding.
tDiff documents comprise any number of comment blocks, control blocks, and diff hunks, interleaved in any order. Each diff hunk describes a related set of differences between two tables. Each hunk could stand on its own as an independent tDiff document. When there is a choice in how to decompose differences between two tables as a sequence of hunks, generators are encouraged to choose a decomposition that minimizes ordering effects between hunks.
# tdiff version 0.2 /* * this comment could describe the whole document */ /* * this comment could describe the following hunk */ * |bridge=Brooklyn|designer:'J.A.Roebling'|length:1595| = |bridge=Williamsburg|designer:'D.D.Duck'->'L.L.Buck'|length:1600| * |bridge=Queensborough|designer:'Palmer & Hornbostel'|length:1182| /* * this comment could describe the following hunk, even * though it's not immediately adjacent to the hunk (there is * an intervening newline). */ - |bridge=Spamspan|designer:'S.Spamington'|length:10000| + |bridge=Manhattan|designer:'G.Lindenthal'|length:1470| /* * final, wrap up, comment */
Comment blocks are delimited using: /* */ (C style). Any content can occur within a comment block.
/* This is an example single-line comment */ /* This is an example multi-line comment */
Control blocks begin with "# ", and are delimited by a newline or linefeed. Control blocks may hold meta information about diffs, or environmental information that might be useful to an interpreter. Apart from the special header control block, they lie outside of the scope of this specification.
A tdiff document should begin with a special control block called the header. The header begins with the characters "# tdiff". It is there to aid in rapid identification of tdiff documents. Documents without this block should be accepted by interpreters, but this block should always be included by generators.
# tdiff version 0.2
A hunk is a series of one or more adjacent diff lines, optionally preceeded by a column line, where each diff line represents the differences between the source tables for a single row. The lines within a hunk should be separated by only the newline characters that terminate each diff line, so that they all appear as adjacent lines within a text editor. Within a tDiff document, hunks are delimited from each other via intermediate filler or comment blocks.
- |bridge=Spamspan|designer:'S.Spamington'|length:10000| = |bridge=Williamsburg|designer:'D.D.Duck'->'L.L.Buck'|length:1600| + |bridge=Manhattan|designer:'G.Lindenthal'|length:1470|
A diff line describes differences in a single row of the two tables that were compared. One table is designated the left or local table (called L) and the other table is designated the right or remote table (called R).
There are three types of diff lines:
MISSING line: describes a row that is present in L but absent in R.
EXTRA line: describes a row that is absent in L but present in R.
CHANGE line: describes a row that is present in both tables, but differs in some specific column values.
Each diff line occupies its own line in the document, and begins with one of three characters. These three characters are called "line type" characters:
MISSING lines begin with plus: '+'. In order to make R look like L we would have to add the missing row to R.
EXTRA lines begin with minus: '-'. In order to make R look like L we would remove the extra row from R.
CHANGE lines begin with equals: '='. In order to make R look like L we would update some of its column values.
The line type character can be left or right padded with any amount of whitespace, for readability. The line type character is followed by any number of name-value pairs, where the names represent column names, and the values are the values for the corresponding column name in that particular row. The name is separated from the value by an equals (=) sign for identifying columns (usually part of the primary key, but see Keys versus identity) or a colon (:) sign for all other columns. The name-value pairs, as well as the line type character, are delimited with whitespace and/or a pipe '|' character.
Optionally, key names can be removed from diff lines. To do this, a special column line may be given within a hunk. This lists column names, followed by "=" for identifying columns.
This establishes bridge as an identifying column that appears first, followed by designer and length columns. We can now rewrite this:
The effect of column lines is limited to within a single hunk.
ROW Pseudo Column
Some sources of table data have a meaningful notion of row or line number (e.g. spreadsheets, CSV files) but others do not (RDBMS tables). The representation of the row number may be external to the table data itself. In order to accommodate these sources, document generators can make use of the ROW pseudo column.
+ |ROW=3|name3:value3|name4:value4| - |ROW=4| = |ROW=5|name3:old-value9->new-value9|name4:old-value10->new-value10 |
Note that a diff that uses ROW numbers may be less useful for applying to RDBMS tables, and alternate row identity choices may be considered by the generator (or specified to the generator by the user). See Keys versus identity.
Some sources of table data do not use named columns-- for instance, CSV files that have not designated a column header row. In those cases, column names are replaced with column number designators. The designators follow this pattern:
1, 2, …
N.B. The ordinal numbers embedded within column number designators are 1’s based, not 2’s based. In other words, counting starts at 1, not at zero.
If column names are omitted, and no column line is given, then ordinal numbers are assumed. All columns are assumed to be identifying, unless they are modified. The previous example can therefore be equivalently expressed as:
Though not always strictly required to fully and accurately described differences between two tables, generators are allowed to include extra rows and extra column values as contextual information. These can help human readers oriented themselves within the data, and might help machine interpreters to resolve otherwise ambiguities in applying a tDiff document.
In the case of missing or extra row lines, extra column data can be included within the row as name-value pairs, just like any other name-value pairs. In the case of a change line, extra name-value pairs can be included within the line. These contextual name-value pairs should also include only the old value. In other words, contextual column values should not include "→new-value". In addition to contextual column information, a hunk may contain any number of contextual rows. These rows appear with the line type character '*'. Streaks of contextual rows can prefix and suffix the block of actual (+, -, =) real diff lines, but cannot be interspersed amongst them. In other words, a hunk is:
<streak of context lines>
<streak of real (+ or - or =) diff lines>
<streak of context lines>
* |bridge=Brooklyn|designer:'J.A.Roebling'|length:1595| = |bridge=Williamsburg|designer:'D.D.Duck'->'L.L.Buck'|length:1600| * |bridge=Queensborough|designer:'Palmer & Hornbostel'|length:1182|
Some tables are ordered, some unordered. The context features of tDiff make sense for ordered tables. When a diff between ordered tables is applied as a patch to an unordered table, context features may be ignored. When a diff between unordered tables is applied as a patch to an ordered table, the resulting order may be underspecified (but presumably does not matter).
Keys versus identity
Determining whether a row is present in L and R requires a judgment about row identity. This judgment may be simple. For example, the identity of a row may simply be the value of its primary key. However, it is possible that the identity of a row is distinct from its primary key. Consider for example a table with an auto-incrementing integer primary key, rather than something derived from the row data. Comparison of that key between separately maintained copies of that table will be meaningless. For meaningful comparison, an alternate row identity would need to be constructed.
This issue lies outside the tDiff specification, but it is important that implementors be aware that columns used for identification may in some cases not be part of the primary key.
Names or values may be quoted in a tDiff document. Quoting is done as follows:
All instances of the single-quote character are duplicated into pairs.
All control characters and the backslash character are escaped as for C literals.
The name or value is wrapped in single-quotes
It is always safe to single-quote a name or value. Names or values must by quoted in any of the following conditions:
A name or value conflicts with a reserved word: NULL, ROW.
A name or value contains any character in the 7-bit ASCII range that is not in the following set: [A-Za-z0-9+.]
A name begins with any of the characters [0-9+.].
In the case of column diffs, for each cell that was different between L and R, both the old and new values are displayed. The old value must come first, followed by → (dash greater than), followed by the new value. For all three diff line types, the generator may include L name-value pairs that are not strictly needed, but may help with row identification.
document ::= header (block)* block ::= hunk | control | comment | filler hunk ::= (hunk_header)? (diff_line)+ hunk_header ::= '@' (divider name)* break diff_line ::= ('-' | '+' | '=') (divider term)* break term ::= (name ('='|':'))? (value '->')? value break ::= divider? linebreak control ::= "#" (divider value)* break comment ::= | ("/*" comment_body "*/") filler ::= (linebreak | divider)+ header ::= '#' (divider term)* break divider ::= (' ' | '\t')* ('|')? (' ' | '\t')* linebreak ::= ('\r\n' | '\r' | '\n') # unspecified in grammar: name, value, comment_body
The linebreak non-terminal needs to be parsed greedily to work with all common end-of-line formats, since the number of linebreaks is significant in the grammar. The comment_body non-terminal is as for C.
Appendix: Complete examples
In example one, both tables are in an RDBMS, both tables have the same column names, and the rows are identified using column1.
L: R: column1,column2,column3,column4 column1,column2,column3,column4 1, 0000, x, aaaa ---------------------------- ---------------------------- 2, 1111, x, aaaa 3, 2222, x, aaaa 3, 2222, y, aaaa 4, 3333, x, aaaa 4, 0000, z, bbbb 5, 4444, x, aaaa 5, 4444, z, bbbb 6, 5555, x, aaaa 6, 5555, u, aaaa ---------------------------- 7, 0000, v, aaaa ---------------------------- 8, 1111, x, aaaa
# tdiff version 0.2 /* * this is the tDiff document for example 1, using 1 hunk only and no context. * Note the "|" usage varies from previous examples in this document. * "|" plays the same role as spaces and tabs in the spec, so varying * styles are possible. */ - | column1=1 + | column1=2| column2:1111| column3:x| column4:aaaa = | column1=3| column3:x->y = | column1=4| column2:3333->0000| column3:x->z| column4:aaaa->bbbb = | column1=5| column3:x->z| column4:aaaa->bbbb = | column1=6| column3:x->u + | column1=7| column2:0000| column3:v| column4:aaaa + | column1=8| column2:1111| column3:x| column4:aaaa /* * end of tDiff document */
# tdiff version 0.2 /* * here is a tDiff document that is equivalent to the document above, except * that it uses 8 hunks, more comments, and adds in some context */ /* * hunk 1: notice that columns 2,3,4 are context-- not strictly necessary * to specify a remove */ - | column1=1| column2:0000| column3:x| column4:aaaa /* * hunk 2: notice that the hunks are separated by standalone newline */ + | column1=2| column2:1111| column3:x| column4:aaaa /* * hunk 3: notice that column2 and column4 are merely context */ = | column1=3| column2:2222| column3:x->y| column4:aaaa /* * hunk 4: notice that the column diff line is surrounded by context rows, and * that the context rows describe the values on the RHS. */ * | column1=3| column2:2222| column3:x| column4:aaaa = | column1=4| column2:3333->0000| column3:x->z| column4:aaaa->bbbb * | column1=5| column2:4444| column3:x| column4:aaaa /* * hunk 5 */ = | column1=5| column3:x->z| column4:aaaa->bbbb /* * hunk 6 */ = | column1=6| column3:x->u /* * hunk 7 */ + | column1=7| column2:0000| column3:v| column4:aaaa /* * hunk 8 */ + | column1=8| column2:1111| column3:x| column4:aaaa