COOPY » Guide
version 0.6.5
|
The highlighter diff format is a format for expressing the difference between two spreadsheets.
A highlighter diff is designed to be represented in a spreadsheet format, using "common denominator" features of Excel, OpenOffice/LibreOffice, Gnumeric, etc. A highlighter diff can however be represented in any tabular format with ordered columns and rows. It is a simple format, easy to parse, yet expressive enough to capture an important set of possible differences. It is intended for spreadsheets containing data, rather then spreadsheets with formulae, charts, and the like. Highlighter diffs deal with cell content, not formatting.
Assume we are comparing two versions of a sheet/table, L and R (for Left/Local and Right/Remote). L is the "reference" sheet, and we describe the difference between it and R in terms of operations needed to change L into R. The difference is described in a "diff sheet" that looks like this:
@@ | bridge | designer | length |
---|---|---|---|
Brooklyn | J. A. Roebling | 1595 | |
+++ | Manhattan | G. Lindenthal | 1470 |
-> | Williamsburg | D. Duck->L. L. Buck | 1600 |
Queensborough | Palmer & Hornbostel | 1182 | |
Triborough | O. H. Ammann | 1380,383 | |
... | ... | ... | ... |
Throgs Neck | O. H. Ammann | 1800 | |
George Washington | O. H. Ammann | 3500 | |
--- | Spamspan | S. Spamington | 10000 |
A diff sheet consists of a single action column and a block of data columns, as follows.
The diff sheet has a column called the "action" column. The action column is the first column that does not contain integers; generally it will be exactly the first column. Here are cell values that can appear in this column:
@@ | the header tag marks a header row, if one exists. |
---|---|
+++ | the insert tag marks an added row (present in R, not present in L). |
--- | the delete tag marks a removed row (present in L, not present in R). |
-> | the update tag marks a row in which at least one cell changed. -->, --->, ----> etc. have the same meaning. |
Blank | A blank string or NULL marks a row common to L and R. |
... | the skip tag marks that rows common to L and R are being skipped. |
! | The schema tag marks a change in the sheet structure. |
The columns after the action column are called the data columns. Each column in L and each column in R map onto a data column. If the diff contains no schema row ("!"), then L and R have the same columns, and the data columns map straightforwardly onto columns in L and R. Here is what the data columns contain, for each action tag:
@@ | cells contain column names. |
---|---|
+++ | cells shown are from R, and not present in L. |
--- | cells shown are from L, and not present in R. |
-> | cells shown are common to both L and R. When L and R cells differ, a compound "Vl->Vr" cell is shown. |
Blank | cells shown are common to both L and R. |
... | skipping rows common to both L and R. |
In added rows, the data columns are copied from R, and in deleted rows, the data columns are copied from R. For updates, data columns that are common to L and R are filled with that common value. Data columns that differ are filled with a "compound value".
When a cell in L and R differs, we need to show both values (call them Vl and Vr) in a single cell without introducing ambiguity. We keep this a simple as possible, as follows. For data where the substring "->" will not occur, and there are no ambiguities in the string representation of values, compounds are simple:
compound(Vl,Vr) = string(Vl) + "->" + string(Vr)
So we can get (the string representation of) Vl and Vr by splitting the compound value at "->". But in general, "->" might occur in data. We optimize ease of interpretation of the diff (rather than ease of producing it), by requiring that the update action tag be chosen to avoid collision with the data (either per row or over the entire sheet, at the discretion of the implementor). So the rule becomes:
compound(Vl,Vr) = string(Vl) + action_tag + string(Vr)
So as before we can get (the string representation of) Vl and Vr by splitting the compound value at occurance of the action_tag.
The string representations of Vl and Vr in compound values require some care, in the case of data that contains NULL values and for which NULLs and blank strings are distinct. If the diff contains any string ending in "NULL" then "NULL-quoting" is in effect, and any such string should be interpreted as follows:
STRING | VALUE |
---|---|
"NULL" | This string represents a NULL value |
"_NULL" | This represents the string "NULL" |
"__NULL" | This represents the string "_NULL" |
... | ... |
When possible, NULL values may be represented in the diff directly through empty cells, rather than as strings.
Easy! Just the same as we've done so far, repeated, with appropriate sheet names. For example:
@@ | id | street | city |
---|---|---|---|
1 | 305 Memorial Drive | Cambridge | |
2 | Big Crater | Moon | |
+++ | 3 | 10 Ten Street | Denver |
@@ | org_id | loc_id |
---|---|---|
1 | 2 | |
2 | 1 | |
3 | 1 | |
3 | 2 | |
+++ | 5 | 3 |
@@ | id | name |
---|---|---|
1 | Dracula Inc | |
2 | The Firm | |
3 | Omni Cooperative | |
4 | Nonexistence Unlimited | |
+++ | 5 | Alice and Company |
Highlighter diffs can represent changes to the sheet's structure, by adding an initial "schema" row (identified by having an exclamation mark in the first column). Columns to be added are identified with a "+++". For example:
! | +++ | |||
@@ | bridge | designer | quark | length |
---|---|---|---|---|
+ | Brooklyn | J. A. Roebling | strange | 1595 |
+++ | Manhattan | G. Lindenthal | charm | 1470 |
-> | Williamsburg | D. Duck->L. L. Buck | up | 1600 |
+ | Queensborough | Palmer & Hornbostel | down | 1182 |
+ | Triborough | O. H. Ammann | strange | 1380,383 |
+ | Bronx Whitestone | O. H. Ammann | charm | 2300 |
+ | Throgs Neck | O. H. Ammann | hairy | 1800 |
+ | George Washington | O. H. Ammann | moody | 3500 |
--- | Spamspan | S. Spamington | 10000 |
A new action called "+" has been added, to mark rows that have no change other than added cells.
Columns to be removed are identified with a "---" in the schema row. For example:
! | --- | |||
@@ | bridge | designer | quark | length |
---|---|---|---|---|
Brooklyn | J. A. Roebling | strange | 1595 | |
--- | Manhattan | G. Lindenthal | charm | 1470 |
-> | Williamsburg | L. L. Buck->D. Duck | up | 1600 |
Queensborough | Palmer & Hornbostel | down | 1182 | |
Triborough | O. H. Ammann | strange | 1380,383 | |
... | ... | ... | ... | ... |
Throgs Neck | O. H. Ammann | hairy | 1800 | |
George Washington | O. H. Ammann | moody | 3500 | |
+++ | Spamspan | S. Spamington | 10000 |
This spec doesn't deal with order changes. Suggestions welcome!
The COOPY toolbox can generate highlighter diffs, apply them as patches, and convert them to other patch formats (such as SQL and tdiff). See: Applying highlighter diffs using sspatch. Using COOPY, highlighter diffs can be read in C++, Ruby (via wrappers), and Python (via wrappers). It would be great to see native implementations for reading and writing highlighter diffs in various languages and (crucially) spreadsheet plugins.