COOPY » Guide  version 0.6.5
CSV DTBL diff format
Version:
0.5

This page describes the diff format (called DTBL) generated by "ssdiff --format-csv". This is the main format used in the COOPY toolbox for representing differences between tables. The diff format is itself tabular, so that diff files can be loaded by spreadsheet and database tools without writing a special parser. Within this document, we assume the diff is represented as a CSV file. Some simplification is possible if the diff can be stored directly in a format that allows NULLs and other types, see Representing NULLs and other types.

General structure

A diff file consists of a series of lines in CSV format, following RFC4180 (see CSV details). The first two cells should evaluate to the strings "dtbl" and "csv" (see First row).

Here's an example:

dtbl,csv,version,0.5
column,name,number,digits
column,move,digits,number

In words, this would mean:

The width of the diff file in cells has no significance. Extra empty columns can be added to the end of rows without affecting the interpretation of the diff.

First row

The first row contains at least four non-empty cells.

This means that a CSV format diff file will begin with one of the following seqence of bytes (ascii/utf8 encoding):

This set is effectively the format's "magic number".

Here's an example of a first row:

dtbl,csv,version,0.5,,,

Config rows

The first row may be followed by a series of rows, each of which has the value "config" in its first cell. The second and third cells on such rows are interpreted as key-value pairs. The values of keys may affect the interpretation of the rest of the diff. For example:

config,param1,val1,,

There are currently no specified configuration parameters. This functionality is reserved for possible future use.

Operation rows

The remaining rows describe properties and transformations of a table. This rows are called "operations". Operations assume that the rows of the diff file are considered in order from the beginning of the file to the end.

The values of the first and second cells may be:

table name

Specifies the table/sheet name. Useful for working with multi-table differences and patches.

table,name,"Outcome Measurement",,,

column name

This gives arbitrary labels for the columns in the table. The labels are those used to refer to the columns within the patch file. Trailing blank cells are ignored. For example:

column,name,number,digits,,,

This specifies that two columns are expected, and those columns are labeled "number" and "digits".

Note that for some tables, such as those from CSV files, column names may be arbitrary or "guessed". This should be borne in mind in tools that apply diffs as patches to tables.

A "column name" row is mandatory if there are operations on columns to be performed ("column move", "column insert", or "column delete"). Otherwise, it is optional.

column move

This operation specifies the movement of a single column. The move is specified by giving the resultant order of columns after the move. For example:

column,name,[0],[1],,
column,move,[1],[0],,

All operations of the form "column *" are specified by giving the resultant order of columns after that operation.

column insert

This operation specifies the insertion of a single column. It has the same format as a "column move" row. The point of insertion is specified by giving the resulting column order, with a new column label inserted appropriately. Example:

column,name,[0],[1],,
column,insert,[0],{4},[1],

All operations of the form "column *" are specified by giving the resultant order of columns after that operation.

column delete

This operation specifies the removal of a single column. It has the same format as the "column move" row. The point of insertion is specified by giving the resulting column order, with the unwanted column removed. Example:

column,name,{3},[1],[0],{4}
column,delete,{3},[0],{4},

All operations of the form "column *" are specified by giving the resultant order of columns after that operation.

column rename

This operation specifies that a single column should be renamed. The specification is given by repeating the column names, with one name changed. For example:

column,name,number,digits
column,rename,number,digit

All operations of the form "column *" are specified by giving the resultant order of columns after that operation.

link name

To allow tidy descriptions of operations on rows in tabular form (see the "row *" operations), the DTBL format allows for flexibility in the roles assigned to diff columns. That role can be controlled using link name and link act. The role of a diff column is:

The first two diff columns are never associated with data columns; they instead contain operation tags such as "link name". The remaining diff columns may be associated with data columns as follows.

After any "column *" operation, the association of patch columns is set to match one-to-one where they appear in that row. For example, after:

column,move,digits,number

the third patch column is associated with the data column called "digits", and the fourth patch column is associated with the data column called "number".

It is possible to change the association using the "link name" command. For example, after:

link,name,number,digits,digits

the third patch column is associated with the data column called "number", and the fourth and fifth patch columns are associated with the data column called "digits". It is acceptable for multiple patch columns to be associated with a single data column, since the those patch columns may have distinct actions (see link act). It is also acceptable to have patch columns associated with non-existent/virtual data columns. The interpretation of such columns must be resolved with a link act operation.

link act

This operation controls the action associated with patch columns. It assumes that associations between patch columns and data columns have already been set up (see link name). For example:

link,name,number,digits,digits
link,act,*,*,=

This associates the action "match" (or "*") with the third and fourth column, and the action "assign" (or "=") with the fifth column. Here's another example:

link,name,number,digits
link,act,*,*=

This associates the action "match" (or "*") with the third column, and the actions "match" and "assign" (or "*=") with the fourth column.

See row update and row select to understand why this kind of flexibility is useful.

Possible actions are:

row update

Modifies the values of a row. If not preceeded by a row select, it will select a row to modify using the same rules as row select. It then makes changes according to the assignment actions in effect (see link act). Here's an example which would change a row whose first column contains "Paul" such that the second column becomes "Fitzpatrick":

column,name,first,last,age
link,act,*,=,
row,update,Paul,Fitzpatrick,

Here's an alternate example using row select.

column,name,first,last,age
link,act,*,=,
row,select,Paul,,
row,update,,Fitzpatrick,

Here's an example that would select a row whose first column is "Paul" and change "Paul" to be "Peter".

column,name,first,last,age
link,name,first,first,
link,act,*,=,
row,update,Paul,Peter,

Here's an alternate example to achieve the same result:

column,name,first,last,age
link,act,*=,,
row,select,Paul,,
row,assign,Peter,,

row select

Selects a row matching the given values. Matches are made based on the actions linked with diff columns (see link act). This example selects a row whose first column is "Paul":

column,name,first,last,age
link,act,*,,
row,select,Paul,,

The selection is stored until the next row update, row insert, or row delete.

row insert

Inserts a row with the given values. Some table formats allow control over row order. In this case, if an insertion is preceeded by a row select, the insertion is done before the selected row, otherwise it is done after the last existing row. Values are specified as for row update.

row delete

Deletes a row matching the given values. Values are specified as for row select.

CSV details

Since the CSV format has some variability, here are some notes on the variant expected. In summary, keep to RFC4180 as closely as you can.

Representing NULLs and other types

The DTBL format, as represented in CSV, needs a way to describe all cell types as text. So far, only the issue of representing "NULL" has been addressed. The rule adopted is as follows:

There are other possibilities, such as distinguishing between a completely blank cell and a cell containing a pair of quotes, but such possibilities are unlikely to survive a round-trip through a graphical CSV editor and so could cause more problems then they solve.

Pending issues

 All Data Structures Namespaces Files Functions Variables Typedefs Enumerations Enumerator Defines