A feature of our September update of LawFlow that we are particularly excited about is our new OCR system. While LawFlow has always provided OCR capability, the September update implements a new custom system that we have been developing for some time, incorporating leading OCR technology, and tailored specifically for e-discovery.
Key improvements of the new OCR system include:
- Significantly reduced lead time before uploaded documents are OCR-processed.
- More robust processing due to improved identification and handling of corrupt or malformed PDFs.
- The ability to OCR more DRM-protected PDF files (some DRM restrictions may still prevent specific PDFs being processed).
- The ability to perform OCR on detected image-based pages within an otherwise text-based PDF. This can occur where text images are inserted into a natively-generated PDF, or where text-based and image-based PDFs are merged into one.
- Improved detection of document number stamps (frequently applied in e-discovery) that otherwise prevent a PDF (or certain pages of it) from being a candidate for OCR.
- Confidence scoring of OCR-processed documents.
- Detection of specific pages with low confidence scores.
- Separate processing of longer documents in order to reduce delays in processing smaller, faster-processable documents.
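To illustrate how per-page confidence scores might be used, here is a minimal Python sketch. It is not LawFlow's implementation: the `PageResult` type, the 0.0 to 1.0 confidence scale, the 0.80 threshold, and the use of a simple mean for the document score are all assumptions made for the example.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class PageResult:
    page_number: int
    confidence: float  # assumed scale: 0.0 (no confidence) to 1.0 (full confidence)


def document_confidence(pages: list[PageResult]) -> float:
    """Return one score for the whole document (assumed here: mean of page scores)."""
    if not pages:
        return 0.0
    return mean(p.confidence for p in pages)


def low_confidence_pages(pages: list[PageResult], threshold: float = 0.80) -> list[int]:
    """Return the page numbers whose score falls below the (assumed) threshold."""
    return [p.page_number for p in pages if p.confidence < threshold]


pages = [PageResult(1, 0.97), PageResult(2, 0.62), PageResult(3, 0.91)]
flagged = low_confidence_pages(pages)
```

In this example only page 2 is flagged, even though the document-level score looks acceptable, which is why page-level detection is useful alongside a document score.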
As with our previous OCR system, the new system is not cloud-based but is fully hosted on our own hardware right here in New Zealand. This means we do not send project data to a third party or overseas for OCR processing.
OCR accuracy
As always with OCR, accuracy depends heavily on the quality and characteristics of the input. In general terms, cleanly scanned black-and-white block text in standard fonts and font sizes is likely to produce a relatively accurate OCR result. Conversely, lower-quality scans, non-standard fonts, stylised or coloured layouts, marks on the image, etc. will likely result in lower accuracy.
However, even with high-quality input, there can still be inaccuracies – a “good” OCR accuracy rate is considered to be around 95-99%. There can also be complications and inaccuracies in reconstructing the OCR text into sentences or paragraphs. This should be taken into consideration when searching or otherwise using OCR-generated text.
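To give a feel for what these rates can mean for searching, here is a small Python sketch. It rests on two assumptions that are not stated above: that the accuracy rate is measured per character, and that errors occur independently of one another. Real OCR errors tend to cluster, so treat the figures as rough illustrations only.

```python
def word_intact_probability(char_accuracy: float, word_length: int) -> float:
    """Chance that every character of a word is recognised correctly,
    assuming independent per-character errors."""
    return char_accuracy ** word_length


def expected_char_errors(char_accuracy: float, total_chars: int) -> float:
    """Expected number of misrecognised characters in a block of text."""
    return (1 - char_accuracy) * total_chars


for accuracy in (0.95, 0.99):
    intact = word_intact_probability(accuracy, 8)
    errors = expected_char_errors(accuracy, 2000)
    print(f"{accuracy:.0%}: 8-letter word intact {intact:.0%}, "
          f"about {errors:.0f} errors per 2,000 characters")
```

Under these assumptions, an eight-letter search term would be recognised fully intact roughly 66% of the time at 95% accuracy and roughly 92% of the time at 99% accuracy, so an exact-match search over OCR text can miss genuine occurrences.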
OCR processing
The outline of the new OCR system’s basic processing stages for each document in a project (which remains similar to the previous system) is as follows:
- Determine whether the document is of a type suitable for OCR (PDF or supported image files). If not, do not attempt OCR.
- For PDF documents, if every page of the PDF already contains detectable text above a de minimis level (after attempting to exclude any detected document number stamps), do not attempt OCR.
- Run the OCR process on the document (for PDFs, only on pages that do not already contain detectable text above the de minimis level).
- If the OCR process detected any text, convert the document to a searchable PDF with the OCR text applied.
- Index the OCR-detected text (for use in searching).
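The stages above can be sketched in Python as follows. This is an illustrative outline under stated assumptions, not LawFlow's actual code: the supported file types, the 20-character de minimis threshold, the stamp pattern, the `Document` type, and the injected `ocr_page` function are all hypothetical, and the "index" is simplified to a dictionary keyed by page.

```python
import re
from dataclasses import dataclass, field
from typing import Callable

SUPPORTED_TYPES = {"pdf", "png", "jpg", "tiff"}  # assumed list of OCR-suitable types
DE_MINIMIS_CHARS = 20                            # assumed threshold
# Assumed stamp format such as "ABC.001.0042"; real stamps vary by project.
STAMP_PATTERN = re.compile(r"\b[A-Z]{2,5}[.\-]\d{3,}(?:[.\-]\d+)*\b")


@dataclass
class Document:
    file_type: str
    page_text: list[str]  # text already embedded in each page ("" for image pages)
    ocr_text: dict[int, str] = field(default_factory=dict)
    searchable_pdf: bool = False


def needs_ocr(embedded_text: str) -> bool:
    """True if a page has less than the de minimis amount of text once stamps are excluded."""
    without_stamps = STAMP_PATTERN.sub("", embedded_text)
    return len(without_stamps.strip()) < DE_MINIMIS_CHARS


def process(doc: Document, ocr_page: Callable[[int], str], index: dict[int, str]) -> str:
    # Stage 1: only attempt OCR on suitable file types.
    if doc.file_type not in SUPPORTED_TYPES:
        return "skipped: unsupported type"
    # Stage 2: for PDFs, skip entirely if every page already has enough text.
    if doc.file_type == "pdf":
        targets = [i for i, text in enumerate(doc.page_text) if needs_ocr(text)]
        if not targets:
            return "skipped: already has text"
    else:
        targets = list(range(len(doc.page_text)))
    # Stage 3: run OCR only on the pages that need it.
    for i in targets:
        text = ocr_page(i)
        if text.strip():
            doc.ocr_text[i] = text
    # Stage 4: if any text was detected, mark the document as a searchable PDF.
    if not doc.ocr_text:
        return "no text detected"
    doc.searchable_pdf = True
    # Stage 5: index the OCR-detected text for searching.
    index.update(doc.ocr_text)
    return "ocr applied"


index: dict[int, str] = {}
doc = Document("pdf", ["This page already has plenty of embedded text.", "ABC.001.0042"])
status = process(doc, lambda i: f"ocr text for page {i}", index)
```

In this example the second page contains only a document number stamp, so it is treated as having no meaningful text and is the only page sent for OCR; the first page is left untouched.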
If you have any questions about our new OCR system or how to handle OCR text in your discovery project, get in touch with us and we’ll be happy to help!