The COOPY Toolbox: The COOPY Toolbox: Specification of the highlighter diff format

The highlighter diff format is a format for expressing the difference between two spreadsheets.

A highlighter diff is designed to be represented in a spreadsheet format, using "common denominator" features of Excel, OpenOffice/LibreOffice, Gnumeric, etc. A highlighter diff can however be represented in any tabular format with ordered columns and rows. It is a simple format, easy to parse, yet expressive enough to capture an important set of possible differences. It is intended for spreadsheets containing data, rather then spreadsheets with formulae, charts, and the like. Highlighter diffs deal with cell content, not formatting.

Single sheet case, with fixed columns

Assume we are comparing two versions of a sheet/table, L and R (for Left/Local and Right/Remote). L is the "reference" sheet, and we describe the difference between it and R in terms of operations needed to change L into R. The difference is described in a "diff sheet" that looks like this:

@@	bridge	designer	length
	Brooklyn	J. A. Roebling	1595
+++	Manhattan	G. Lindenthal	1470
->	Williamsburg	D. Duck->L. L. Buck	1600
	Queensborough	Palmer & Hornbostel	1182
	Triborough	O. H. Ammann	1380,383
...	...	...	...
	Throgs Neck	O. H. Ammann	1800
	George Washington	O. H. Ammann	3500
---	Spamspan	S. Spamington	10000

A diff sheet consists of a single action column and a block of data columns, as follows.

Action column

The diff sheet has a column called the "action" column. The action column is the first column that does not contain integers; generally it will be exactly the first column. Here are cell values that can appear in this column:

@@	the header tag marks a header row, if one exists.
+++	the insert tag marks an added row (present in R, not present in L).
---	the delete tag marks a removed row (present in L, not present in R).
->	the update tag marks a row in which at least one cell changed. -->, --->, ----> etc. have the same meaning.
Blank	A blank string or NULL marks a row common to L and R.
...	the skip tag marks that rows common to L and R are being skipped.
!	The schema tag marks a change in the sheet structure.

Data columns

The columns after the action column are called the data columns. Each column in L and each column in R map onto a data column. If the diff contains no schema row ("!"), then L and R have the same columns, and the data columns map straightforwardly onto columns in L and R. Here is what the data columns contain, for each action tag:

@@	cells contain column names.
+++	cells shown are from R, and not present in L.
---	cells shown are from L, and not present in R.
->	cells shown are common to both L and R. When L and R cells differ, a compound "Vl->Vr" cell is shown.
Blank	cells shown are common to both L and R.
...	skipping rows common to both L and R.

In added rows, the data columns are copied from R, and in deleted rows, the data columns are copied from R. For updates, data columns that are common to L and R are filled with that common value. Data columns that differ are filled with a "compound value".

Compound values

When a cell in L and R differs, we need to show both values (call them Vl and Vr) in a single cell without introducing ambiguity. We keep this a simple as possible, as follows. For data where the substring "->" will not occur, and there are no ambiguities in the string representation of values, compounds are simple:

compound(Vl,Vr) = string(Vl) + "->" + string(Vr)

So we can get (the string representation of) Vl and Vr by splitting the compound value at "->". But in general, "->" might occur in data. We optimize ease of interpretation of the diff (rather than ease of producing it), by requiring that the update action tag be chosen to avoid collision with the data (either per row or over the entire sheet, at the discretion of the implementor). So the rule becomes:

compound(Vl,Vr) = string(Vl) + action_tag + string(Vr)

So as before we can get (the string representation of) Vl and Vr by splitting the compound value at occurance of the action_tag.

String representation of values

The string representations of Vl and Vr in compound values require some care, in the case of data that contains NULL values and for which NULLs and blank strings are distinct. If the diff contains any string ending in "NULL" then "NULL-quoting" is in effect, and any such string should be interpreted as follows:

STRING	VALUE
"NULL"	This string represents a NULL value
"_NULL"	This represents the string "NULL"
"__NULL"	This represents the string "_NULL"
...	...

When possible, NULL values may be represented in the diff directly through empty cells, rather than as strings.

Multiple sheet case, with fixed columns

Easy! Just the same as we've done so far, repeated, with appropriate sheet names. For example:

locations
@@	id	street	city
	1	305 Memorial Drive	Cambridge
	2	Big Crater	Moon
+++	3	10 Ten Street	Denver

org2loc
@@	org_id	loc_id
	1	2
	2	1
	3	1
	3	2
+++	5	3

organizations
@@	id	name
	1	Dracula Inc
	2	The Firm
	3	Omni Cooperative
	4	Nonexistence Unlimited
+++	5	Alice and Company

Adding columns

Highlighter diffs can represent changes to the sheet's structure, by adding an initial "schema" row (identified by having an exclamation mark in the first column). Columns to be added are identified with a "+++". For example:

@@	bridge	designer	quark	length
!			+++
+	Brooklyn	J. A. Roebling	strange	1595
+++	Manhattan	G. Lindenthal	charm	1470
->	Williamsburg	D. Duck->L. L. Buck	up	1600
+	Queensborough	Palmer & Hornbostel	down	1182
+	Triborough	O. H. Ammann	strange	1380,383
+	Bronx Whitestone	O. H. Ammann	charm	2300
+	Throgs Neck	O. H. Ammann	hairy	1800
+	George Washington	O. H. Ammann	moody	3500
---	Spamspan	S. Spamington		10000

A new action called "+" has been added, to mark rows that have no change other than added cells.

Removing columns

Columns to be removed are identified with a "---" in the schema row. For example:

@@	bridge	designer	quark	length
!			---
	Brooklyn	J. A. Roebling	strange	1595
---	Manhattan	G. Lindenthal	charm	1470
->	Williamsburg	L. L. Buck->D. Duck	up	1600
	Queensborough	Palmer & Hornbostel	down	1182
	Triborough	O. H. Ammann	strange	1380,383
...	...	...	...	...
	Throgs Neck	O. H. Ammann	hairy	1800
	George Washington	O. H. Ammann	moody	3500
+++	Spamspan	S. Spamington		10000

Row and column order?

This spec doesn't deal with order changes. Suggestions welcome!

Implementations

The COOPY toolbox can generate highlighter diffs, apply them as patches, and convert them to other patch formats (such as SQL and tdiff). See: Applying highlighter diffs using sspatch. Using COOPY, highlighter diffs can be read in C++, Ruby (via wrappers), and Python (via wrappers). It would be great to see native implementations for reading and writing highlighter diffs in various languages and (crucially) spreadsheet plugins.