COOPY » Guide  version 0.6.5
Specification of the highlighter diff format

The highlighter diff format is a format for expressing the difference between two spreadsheets.

A highlighter diff is designed to be represented in a spreadsheet format, using "common denominator" features of Excel, OpenOffice/LibreOffice, Gnumeric, etc. A highlighter diff can however be represented in any tabular format with ordered columns and rows. It is a simple format, easy to parse, yet expressive enough to capture an important set of possible differences. It is intended for spreadsheets containing data, rather then spreadsheets with formulae, charts, and the like. Highlighter diffs deal with cell content, not formatting.

Single sheet case, with fixed columns

Assume we are comparing two versions of a sheet/table, L and R (for Left/Local and Right/Remote). L is the "reference" sheet, and we describe the difference between it and R in terms of operations needed to change L into R. The difference is described in a "diff sheet" that looks like this:

@@bridgedesignerlength
BrooklynJ. A. Roebling1595
+++ManhattanG. Lindenthal1470
->WilliamsburgD. Duck->L. L. Buck1600
QueensboroughPalmer & Hornbostel1182
TriboroughO. H. Ammann1380,383
............
Throgs NeckO. H. Ammann1800
George WashingtonO. H. Ammann3500
---SpamspanS. Spamington10000

A diff sheet consists of a single action column and a block of data columns, as follows.

Action column

The diff sheet has a column called the "action" column. The action column is the first column that does not contain integers; generally it will be exactly the first column. Here are cell values that can appear in this column:

@@the header tag marks a header row, if one exists.
+++the insert tag marks an added row (present in R, not present in L).
---the delete tag marks a removed row (present in L, not present in R).
->the update tag marks a row in which at least one cell changed.
-->, --->, ----> etc. have the same meaning.
BlankA blank string or NULL marks a row common to L and R.
...the skip tag marks that rows common to L and R are being skipped.
!The schema tag marks a change in the sheet structure.

Data columns

The columns after the action column are called the data columns. Each column in L and each column in R map onto a data column. If the diff contains no schema row ("!"), then L and R have the same columns, and the data columns map straightforwardly onto columns in L and R. Here is what the data columns contain, for each action tag:

@@cells contain column names.
+++cells shown are from R, and not present in L.
---cells shown are from L, and not present in R.
->cells shown are common to both L and R.
When L and R cells differ, a compound "Vl->Vr" cell is shown.
Blankcells shown are common to both L and R.
...skipping rows common to both L and R.

In added rows, the data columns are copied from R, and in deleted rows, the data columns are copied from R. For updates, data columns that are common to L and R are filled with that common value. Data columns that differ are filled with a "compound value".

Compound values

When a cell in L and R differs, we need to show both values (call them Vl and Vr) in a single cell without introducing ambiguity. We keep this a simple as possible, as follows. For data where the substring "->" will not occur, and there are no ambiguities in the string representation of values, compounds are simple:

compound(Vl,Vr) = string(Vl) + "->" + string(Vr)

So we can get (the string representation of) Vl and Vr by splitting the compound value at "->". But in general, "->" might occur in data. We optimize ease of interpretation of the diff (rather than ease of producing it), by requiring that the update action tag be chosen to avoid collision with the data (either per row or over the entire sheet, at the discretion of the implementor). So the rule becomes:

compound(Vl,Vr) = string(Vl) + action_tag + string(Vr)

So as before we can get (the string representation of) Vl and Vr by splitting the compound value at occurance of the action_tag.

String representation of values

The string representations of Vl and Vr in compound values require some care, in the case of data that contains NULL values and for which NULLs and blank strings are distinct. If the diff contains any string ending in "NULL" then "NULL-quoting" is in effect, and any such string should be interpreted as follows:

STRINGVALUE
"NULL"This string represents a NULL value
"_NULL"This represents the string "NULL"
"__NULL"This represents the string "_NULL"
......

When possible, NULL values may be represented in the diff directly through empty cells, rather than as strings.

Multiple sheet case, with fixed columns

Easy! Just the same as we've done so far, repeated, with appropriate sheet names. For example:

locations
@@idstreetcity
1305 Memorial DriveCambridge
2Big CraterMoon
+++310 Ten StreetDenver
org2loc
@@org_idloc_id
12
21
31
32
+++53
organizations
@@idname
1Dracula Inc
2The Firm
3Omni Cooperative
4Nonexistence Unlimited
+++5Alice and Company

Adding columns

Highlighter diffs can represent changes to the sheet's structure, by adding an initial "schema" row (identified by having an exclamation mark in the first column). Columns to be added are identified with a "+++". For example:

!+++
@@bridgedesignerquarklength
+BrooklynJ. A. Roeblingstrange1595
+++ManhattanG. Lindenthalcharm1470
->WilliamsburgD. Duck->L. L. Buckup1600
+QueensboroughPalmer & Hornbosteldown1182
+TriboroughO. H. Ammannstrange1380,383
+Bronx WhitestoneO. H. Ammanncharm2300
+Throgs NeckO. H. Ammannhairy1800
+George WashingtonO. H. Ammannmoody3500
---SpamspanS. Spamington10000

A new action called "+" has been added, to mark rows that have no change other than added cells.

Removing columns

Columns to be removed are identified with a "---" in the schema row. For example:

!---
@@bridgedesignerquarklength
BrooklynJ. A. Roeblingstrange1595
---ManhattanG. Lindenthalcharm1470
->WilliamsburgL. L. Buck->D. Duckup1600
QueensboroughPalmer & Hornbosteldown1182
TriboroughO. H. Ammannstrange1380,383
...............
Throgs NeckO. H. Ammannhairy1800
George WashingtonO. H. Ammannmoody3500
+++SpamspanS. Spamington10000

Row and column order?

This spec doesn't deal with order changes. Suggestions welcome!

Implementations

The COOPY toolbox can generate highlighter diffs, apply them as patches, and convert them to other patch formats (such as SQL and tdiff). See: Applying highlighter diffs using sspatch. Using COOPY, highlighter diffs can be read in C++, Ruby (via wrappers), and Python (via wrappers). It would be great to see native implementations for reading and writing highlighter diffs in various languages and (crucially) spreadsheet plugins.

 All Data Structures Namespaces Files Functions Variables Typedefs Enumerations Enumerator Defines