COOPY » Guide
version 0.6.5
|
This page describes the diff format (called DTBL) generated by "ssdiff --format-csv". This is the main format used in the COOPY toolbox for representing differences between tables. The diff format is itself tabular, so that diff files can be loaded by spreadsheet and database tools without writing a special parser. Within this document, we assume the diff is represented as a CSV file. Some simplification is possible if the diff can be stored directly in a format that allows NULLs and other types, see Representing NULLs and other types.
A diff file consists of a series of lines in CSV format, following RFC4180 (see CSV details). The first two cells should evaluate to the strings "dtbl" and "csv" (see First row).
Here's an example:
dtbl,csv,version,0.5 column,name,number,digits column,move,digits,number
In words, this would mean:
The width of the diff file in cells has no significance. Extra empty columns can be added to the end of rows without affecting the interpretation of the diff.
The first row contains at least four non-empty cells.
This means that a CSV format diff file will begin with one of the following seqence of bytes (ascii/utf8 encoding):
This set is effectively the format's "magic number".
Here's an example of a first row:
dtbl,csv,version,0.5,,,
The first row may be followed by a series of rows, each of which has the value "config" in its first cell. The second and third cells on such rows are interpreted as key-value pairs. The values of keys may affect the interpretation of the rest of the diff. For example:
config,param1,val1,,
There are currently no specified configuration parameters. This functionality is reserved for possible future use.
The remaining rows describe properties and transformations of a table. This rows are called "operations". Operations assume that the rows of the diff file are considered in order from the beginning of the file to the end.
The values of the first and second cells may be:
Specifies the table/sheet name. Useful for working with multi-table differences and patches.
table,name,"Outcome Measurement",,,
This gives arbitrary labels for the columns in the table. The labels are those used to refer to the columns within the patch file. Trailing blank cells are ignored. For example:
column,name,number,digits,,,
This specifies that two columns are expected, and those columns are labeled "number" and "digits".
Note that for some tables, such as those from CSV files, column names may be arbitrary or "guessed". This should be borne in mind in tools that apply diffs as patches to tables.
A "column name" row is mandatory if there are operations on columns to be performed ("column move", "column insert", or "column delete"). Otherwise, it is optional.
This operation specifies the movement of a single column. The move is specified by giving the resultant order of columns after the move. For example:
column,name,[0],[1],, column,move,[1],[0],,
All operations of the form "column *" are specified by giving the resultant order of columns after that operation.
This operation specifies the insertion of a single column. It has the same format as a "column move" row. The point of insertion is specified by giving the resulting column order, with a new column label inserted appropriately. Example:
column,name,[0],[1],, column,insert,[0],{4},[1],
All operations of the form "column *" are specified by giving the resultant order of columns after that operation.
This operation specifies the removal of a single column. It has the same format as the "column move" row. The point of insertion is specified by giving the resulting column order, with the unwanted column removed. Example:
column,name,{3},[1],[0],{4} column,delete,{3},[0],{4},
All operations of the form "column *" are specified by giving the resultant order of columns after that operation.
This operation specifies that a single column should be renamed. The specification is given by repeating the column names, with one name changed. For example:
column,name,number,digits column,rename,number,digit
All operations of the form "column *" are specified by giving the resultant order of columns after that operation.
To allow tidy descriptions of operations on rows in tabular form (see the "row *" operations), the DTBL format allows for flexibility in the roles assigned to diff columns. That role can be controlled using link name and link act. The role of a diff column is:
The first two diff columns are never associated with data columns; they instead contain operation tags such as "link name". The remaining diff columns may be associated with data columns as follows.
After any "column *" operation, the association of patch columns is set to match one-to-one where they appear in that row. For example, after:
column,move,digits,number
the third patch column is associated with the data column called "digits", and the fourth patch column is associated with the data column called "number".
It is possible to change the association using the "link name" command. For example, after:
link,name,number,digits,digits
the third patch column is associated with the data column called "number", and the fourth and fifth patch columns are associated with the data column called "digits". It is acceptable for multiple patch columns to be associated with a single data column, since the those patch columns may have distinct actions (see link act). It is also acceptable to have patch columns associated with non-existent/virtual data columns. The interpretation of such columns must be resolved with a link act operation.
This operation controls the action associated with patch columns. It assumes that associations between patch columns and data columns have already been set up (see link name). For example:
link,name,number,digits,digits link,act,*,*,=
This associates the action "match" (or "*") with the third and fourth column, and the action "assign" (or "=") with the fifth column. Here's another example:
link,name,number,digits link,act,*,*=
This associates the action "match" (or "*") with the third column, and the actions "match" and "assign" (or "*=") with the fourth column.
See row update and row select to understand why this kind of flexibility is useful.
Possible actions are:
Modifies the values of a row. If not preceeded by a row select, it will select a row to modify using the same rules as row select. It then makes changes according to the assignment actions in effect (see link act). Here's an example which would change a row whose first column contains "Paul" such that the second column becomes "Fitzpatrick":
column,name,first,last,age link,act,*,=, row,update,Paul,Fitzpatrick,
Here's an alternate example using row select.
column,name,first,last,age link,act,*,=, row,select,Paul,, row,update,,Fitzpatrick,
Here's an example that would select a row whose first column is "Paul" and change "Paul" to be "Peter".
column,name,first,last,age link,name,first,first, link,act,*,=, row,update,Paul,Peter,
Here's an alternate example to achieve the same result:
column,name,first,last,age link,act,*=,, row,select,Paul,, row,assign,Peter,,
Selects a row matching the given values. Matches are made based on the actions linked with diff columns (see link act). This example selects a row whose first column is "Paul":
column,name,first,last,age link,act,*,, row,select,Paul,,
The selection is stored until the next row update, row insert, or row delete.
Inserts a row with the given values. Some table formats allow control over row order. In this case, if an insertion is preceeded by a row select, the insertion is done before the selected row, otherwise it is done after the last existing row. Values are specified as for row update.
Deletes a row matching the given values. Values are specified as for row select.
Since the CSV format has some variability, here are some notes on the variant expected. In summary, keep to RFC4180 as closely as you can.
The DTBL format, as represented in CSV, needs a way to describe all cell types as text. So far, only the issue of representing "NULL" has been addressed. The rule adopted is as follows:
There are other possibilities, such as distinguishing between a completely blank cell and a cell containing a pair of quotes, but such possibilities are unlikely to survive a round-trip through a graphical CSV editor and so could cause more problems then they solve.