The COOPY Toolbox: The COOPY Toolbox: CSV DTBL diff format

Version:: 0.5

This page describes the diff format (called DTBL) generated by "ssdiff --format-csv". This is the main format used in the COOPY toolbox for representing differences between tables. The diff format is itself tabular, so that diff files can be loaded by spreadsheet and database tools without writing a special parser. Within this document, we assume the diff is represented as a CSV file. Some simplification is possible if the diff can be stored directly in a format that allows NULLs and other types, see Representing NULLs and other types.

General structure

A diff file consists of a series of lines in CSV format, following RFC4180 (see CSV details). The first two cells should evaluate to the strings "dtbl" and "csv" (see First row).

Here's an example:

dtbl,csv,version,0.5
column,name,number,digits
column,move,digits,number

In words, this would mean:

Start with a table whose columns are called "number" and "digits".
Reorder the columns, so the "number" column comes after "digits".

The width of the diff file in cells has no significance. Extra empty columns can be added to the end of rows without affecting the interpretation of the diff.

First row

The first row contains at least four non-empty cells.

The first cell of the first row should have the value "dtbl" (a four-character string). The representation of that value may be quoted or unquoted.
The second cell of the first row should have the value "csv" (a three-character string). The representation of that value may be quoted or unquoted.
The third cell of the first row should have the value "version" (a seven-character string). The representation of that value may be quoted or unquoted.
The fourth cell of the first row may have any non-empty value. The representation of that value may be quoted or unquoted.

This means that a CSV format diff file will begin with one of the following seqence of bytes (ascii/utf8 encoding):

dtbl,csv,...
"dtbl","csv",...
"dtbl",csv,...
dtbl,"csv"...

This set is effectively the format's "magic number".

Here's an example of a first row:

dtbl,csv,version,0.5,,,

Config rows

The first row may be followed by a series of rows, each of which has the value "config" in its first cell. The second and third cells on such rows are interpreted as key-value pairs. The values of keys may affect the interpretation of the rest of the diff. For example:

config,param1,val1,,

There are currently no specified configuration parameters. This functionality is reserved for possible future use.

Operation rows

The remaining rows describe properties and transformations of a table. This rows are called "operations". Operations assume that the rows of the diff file are considered in order from the beginning of the file to the end.

The values of the first and second cells may be:

table name

Specifies the table/sheet name. Useful for working with multi-table differences and patches.

table,name,"Outcome Measurement",,,

column name

This gives arbitrary labels for the columns in the table. The labels are those used to refer to the columns within the patch file. Trailing blank cells are ignored. For example:

column,name,number,digits,,,

This specifies that two columns are expected, and those columns are labeled "number" and "digits".

Note that for some tables, such as those from CSV files, column names may be arbitrary or "guessed". This should be borne in mind in tools that apply diffs as patches to tables.

A "column name" row is mandatory if there are operations on columns to be performed ("column move", "column insert", or "column delete"). Otherwise, it is optional.

column move

This operation specifies the movement of a single column. The move is specified by giving the resultant order of columns after the move. For example:

column,name,[0],[1],,
column,move,[1],[0],,

All operations of the form "column *" are specified by giving the resultant order of columns after that operation.

column insert

This operation specifies the insertion of a single column. It has the same format as a "column move" row. The point of insertion is specified by giving the resulting column order, with a new column label inserted appropriately. Example:

column,name,[0],[1],,
column,insert,[0],{4},[1],

All operations of the form "column *" are specified by giving the resultant order of columns after that operation.

column delete

This operation specifies the removal of a single column. It has the same format as the "column move" row. The point of insertion is specified by giving the resulting column order, with the unwanted column removed. Example:

column,name,{3},[1],[0],{4}
column,delete,{3},[0],{4},

All operations of the form "column *" are specified by giving the resultant order of columns after that operation.

column rename

This operation specifies that a single column should be renamed. The specification is given by repeating the column names, with one name changed. For example:

column,name,number,digits
column,rename,number,digit

All operations of the form "column *" are specified by giving the resultant order of columns after that operation.

link name

To allow tidy descriptions of operations on rows in tabular form (see the "row *" operations), the DTBL format allows for flexibility in the roles assigned to diff columns. That role can be controlled using link name and link act. The role of a diff column is:

The data column it is associated with, if any. Data columns are columns in the table being transformed, whereas diff columns are columns in the diff file itself.
The activity the diff column is associated with (such as matching or assigning data).

The first two diff columns are never associated with data columns; they instead contain operation tags such as "link name". The remaining diff columns may be associated with data columns as follows.

After any "column *" operation, the association of patch columns is set to match one-to-one where they appear in that row. For example, after:

column,move,digits,number

the third patch column is associated with the data column called "digits", and the fourth patch column is associated with the data column called "number".

It is possible to change the association using the "link name" command. For example, after:

link,name,number,digits,digits

the third patch column is associated with the data column called "number", and the fourth and fifth patch columns are associated with the data column called "digits". It is acceptable for multiple patch columns to be associated with a single data column, since the those patch columns may have distinct actions (see link act). It is also acceptable to have patch columns associated with non-existent/virtual data columns. The interpretation of such columns must be resolved with a link act operation.

link act

This operation controls the action associated with patch columns. It assumes that associations between patch columns and data columns have already been set up (see link name). For example:

link,name,number,digits,digits
link,act,*,*,=

This associates the action "match" (or "*") with the third and fourth column, and the action "assign" (or "=") with the fifth column. Here's another example:

link,name,number,digits
link,act,*,*=

This associates the action "match" (or "*") with the third column, and the actions "match" and "assign" (or "*=") with the fourth column.

See row update and row select to understand why this kind of flexibility is useful.

Possible actions are:

"": none - no action, ignore this column.
"*": match - match against the data in this column.
"=": assign - assign data in this column.
"*=": match and assign.
"ROW": row - match a row identifier against the data in this column. The data column association is ignored in this case.
"#": context - no action, ignore this column. However, the column will contain data just as for "*" for the benefit of a reader.

row update

Modifies the values of a row. If not preceeded by a row select, it will select a row to modify using the same rules as row select. It then makes changes according to the assignment actions in effect (see link act). Here's an example which would change a row whose first column contains "Paul" such that the second column becomes "Fitzpatrick":

column,name,first,last,age
link,act,*,=,
row,update,Paul,Fitzpatrick,

Here's an alternate example using row select.

column,name,first,last,age
link,act,*,=,
row,select,Paul,,
row,update,,Fitzpatrick,

Here's an example that would select a row whose first column is "Paul" and change "Paul" to be "Peter".

column,name,first,last,age
link,name,first,first,
link,act,*,=,
row,update,Paul,Peter,

Here's an alternate example to achieve the same result:

column,name,first,last,age
link,act,*=,,
row,select,Paul,,
row,assign,Peter,,

row select

Selects a row matching the given values. Matches are made based on the actions linked with diff columns (see link act). This example selects a row whose first column is "Paul":

column,name,first,last,age
link,act,*,,
row,select,Paul,,

The selection is stored until the next row update, row insert, or row delete.

row insert

Inserts a row with the given values. Some table formats allow control over row order. In this case, if an insertion is preceeded by a row select, the insertion is done before the selected row, otherwise it is done after the last existing row. Values are specified as for row update.

row delete

Deletes a row matching the given values. Values are specified as for row select.

CSV details

Since the CSV format has some variability, here are some notes on the variant expected. In summary, keep to RFC4180 as closely as you can.

If a cell begins with a double quote, the extent of the cell is determined by quoting rules. It is important to note that a cell may extend across line boundaries. The cell will extend to the next unpaired double quote in the patch. A pair of double quotes ("") found by scanning forwards through the file is interpreted as the double quote character, and not used to delimit the cell's extent. Once the extent of a cell is determined, if a double quote is present at the beginnning of that cell, it is stripped, along with the matching quote at the end.
Otherwise, a cell extends until a comma, CR, or LF character, or the end of the file.
An unquoted CRLF sequence is treated as a single delimiter.
Cells should be quoted if they contain any whitespace or the comma character.

Representing NULLs and other types

The DTBL format, as represented in CSV, needs a way to describe all cell types as text. So far, only the issue of representing "NULL" has been addressed. The rule adopted is as follows:

the four-letter string "NULL" is interpreted as a cell whose value is NULL (in the database sense).
the five-letter string "_NULL" is interpreted as a cell whose value is the four-letter string "NULL".
the six-letter string "__NULL" is interpreted as a cell whose value is the five-letter string "_NULL".
...

There are other possibilities, such as distinguishing between a completely blank cell and a cell containing a pair of quotes, but such possibilities are unlikely to survive a round-trip through a graphical CSV editor and so could cause more problems then they solve.

Pending issues

Control of row order (when possible) is not fully resolved.
Typing of columns or cells is not resolved, especially for newly inserted columns. For existing columns/cells, it is not so big a problem, if the diff is interpreted in a type-preserving way.
Comparing e.g. date/times is currently prone to differences in representation causing false mismatches.
This list of pending issues has not been fully fleshed out.