We have recently added a great new feature to LawFlow – similar document detection. This uses the extracted text of a document to instantly match “similar” documents in the project. If any similar documents are detected, they are listed on the “Related” tab of the document being viewed.

Similar document detection can provide a number of benefits, including:

  • Detecting email chains;
  • Finding draft or revised versions of documents;
  • Quickly setting a group of similar documents as not discoverable.

How it works

As with other search tools, similar document detection relies on the text that is extracted from documents. If the documents are in a compatible native format (e.g. standard Office formats and emails), most if not all of the text can be extracted. If the document is a scanned PDF, the text can be extracted via OCR (although, as always with OCR, results depend on the quality of the scan and other factors).

Each document’s text is then cleaned and broken up into sentences or fragments. These fragments are then used to find other documents that contain the same fragments.
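As a rough illustration of the fragment step described above, here is a minimal Python sketch. It is not LawFlow's actual implementation: the function names (split_fragments, build_index), the cleaning rules (lower-casing and collapsing whitespace) and the sentence-splitting pattern are all assumptions made for the example.

```python
import re
from collections import defaultdict


def split_fragments(text):
    """Clean text and break it into sentence-like fragments.

    Assumed rules (hypothetical, for illustration only): split on
    sentence-ending punctuation or line breaks, lower-case each piece
    and collapse runs of whitespace.
    """
    parts = re.split(r"[.!?]+|\n+", text)
    fragments = set()
    for part in parts:
        cleaned = " ".join(part.lower().split())
        if cleaned:  # skip empty pieces left over from the split
            fragments.add(cleaned)
    return fragments


def build_index(documents):
    """Map each fragment to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for fragment in split_fragments(text):
            index[fragment].add(doc_id)
    return index


# Example: two short documents that share one fragment.
docs = {
    "a": "Hello there. Please see attached.",
    "b": "Please see  ATTACHED! Thanks.",
}
index = build_index(docs)
shared = index["please see attached"]  # ids of documents containing it
```

A real splitter would need to be more careful than this one (for example, the pattern above would break "e.g." into two pieces), but the idea is the same: once every fragment points to the documents that contain it, finding candidates for a match is a simple lookup.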

What is “similar”?

A “similar” document is one that has a specified number of matching fragments. LawFlow lets you configure the minimum number of matches for a document to be “similar”. LawFlow also lets you flag certain fragments as “junk” fragments which will be ignored for the purposes of determining similarity – for example, common email footers can be flagged as junk to reduce the number of matches.
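To make the threshold and junk settings concrete, here is a small Python sketch. It is only an illustration under stated assumptions, not LawFlow's code: find_similar, min_matches and junk are hypothetical names, and each document is assumed to be already reduced to a set of fragments.

```python
def find_similar(doc_fragments, target_id, min_matches=3, junk=frozenset()):
    """Return {doc_id: match_count} for documents similar to target_id.

    doc_fragments: dict mapping document id -> set of fragments.
    min_matches:   minimum number of shared fragments to count as similar.
    junk:          fragments to ignore (e.g. common email footers).
    All names here are hypothetical and chosen for the example.
    """
    target = doc_fragments[target_id] - junk
    results = {}
    for doc_id, fragments in doc_fragments.items():
        if doc_id == target_id:
            continue  # a document is not compared with itself
        matches = len(target & (fragments - junk))
        if matches >= min_matches:
            results[doc_id] = matches
    return results


doc_fragments = {
    "email1": {"hi bob", "see the draft", "kind regards", "confidential notice"},
    "email2": {"hi bob", "see the draft", "thanks", "confidential notice"},
    "memo": {"quarterly results", "confidential notice"},
}

# With a low threshold, the shared footer alone makes the memo a match.
loose = find_similar(doc_fragments, "email1", min_matches=1)

# Flagging the footer as junk removes that spurious match.
strict = find_similar(
    doc_fragments, "email1", min_matches=1, junk={"confidential notice"}
)
```

In this example the loose search returns both email2 and the memo, while the search with the footer flagged as junk returns only email2, which is the effect the junk setting is meant to have.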


Like any search function or automated tool, similar document detection depends on a large number of factors for its effectiveness. It should be used judiciously, as an aid to a review of documents and not as a substitute for one. For example, documents with a significant amount of unusual or non-textual content (e.g. documents containing financial data), or with unusual formatting, may not produce matches.

As always, we welcome all feedback and look forward to making this a useful addition to New Zealand’s e-discovery solution.

