A feature of our September update of LawFlow that we are particularly excited about is our new OCR system. While LawFlow has always provided OCR capability, the September update implements a new custom system that we have been developing for some time, incorporating leading OCR technology, and tailored specifically for e-discovery.
Key improvements of the new OCR system include:
- Significantly reduced lead time for uploaded documents to be OCR’d.
- More robust processing due to improved identification and handling of corrupt or malformed PDFs.
- The ability to OCR more DRM-protected PDF files (some DRM restrictions may still prevent specific PDFs being processed).
- The ability to perform OCR on detected image-based pages within an otherwise text-based PDF. This can occur where text images are inserted into a natively-generated PDF, or where text-based and image-based PDFs are merged into one.
- Improved detection of document number stamps (frequently applied in e-discovery) that otherwise prevent a PDF (or certain pages of it) from being a candidate for OCR.
- Confidence scoring of OCR-processed documents.
- Detection of specific pages with low confidence scores.
- Separate processing of longer documents in order to reduce delays in processing smaller, faster-processable documents.
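As a rough illustration of the confidence scoring and low-confidence page detection listed above, the sketch below takes per-page scores and reports a document-level score plus the pages that fall below a cut-off. The 0-100 scale, the use of a simple mean, the threshold of 80 and the function name are all assumptions made for the example; they do not describe how LawFlow calculates its scores.

```python
from statistics import mean

# Assumed cut-off for illustration only; not LawFlow's actual value.
LOW_CONFIDENCE_THRESHOLD = 80.0


def summarise_confidence(page_scores, threshold=LOW_CONFIDENCE_THRESHOLD):
    """Return (document_score, low_pages) for per-page scores on a 0-100 scale.

    document_score is the plain average of the page scores (an assumption),
    and low_pages lists the 1-based page numbers scoring below the threshold.
    """
    if not page_scores:
        return None, []
    document_score = mean(page_scores)
    low_pages = [
        page_number
        for page_number, score in enumerate(page_scores, start=1)
        if score < threshold
    ]
    return document_score, low_pages


document_score, low_pages = summarise_confidence([95.0, 60.0, 90.0, 75.0])
```

A reviewer could use a list like `low_pages` to decide which pages of a document deserve a manual check before relying on text searches.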
As with our previous OCR system, the new system is not cloud-based but is fully hosted on our hardware right here in New Zealand. This means we do not send project data to a third party or overseas for OCR processing.
As always with OCR, accuracy depends heavily on the quality and characteristics of the input. In general terms, well-scanned clean black-and-white block text with standard fonts & font sizing is likely to produce a relatively accurate OCR result. Conversely, lower-quality scans, non-standard fonts, stylised/coloured layout, marks on the image, etc. will likely result in lower accuracy.
However, even with high quality input, there can still be inaccuracies – a “good” OCR accuracy rate is considered to be around 95-99%. There can also be complications and inaccuracies in reconstructing the OCR text into sentences or paragraphs. This should be taken into consideration when searching or otherwise using OCR-generated text.
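To put those accuracy figures in context, here is a small back-of-the-envelope calculation. It assumes the accuracy rate is measured per character and that errors are independent of each other, which is a simplification (real OCR errors tend to cluster on poor-quality areas of a page). The function names are made up for the example.

```python
def expected_errors(character_count, accuracy):
    """Expected number of wrong characters at a given per-character accuracy."""
    return character_count * (1 - accuracy)


def word_intact_probability(word_length, accuracy):
    """Chance that every character of a word is recognised correctly,
    assuming independent per-character errors."""
    return accuracy ** word_length


# A page of about 2,000 characters at the top and bottom of the "good" range.
errors_at_99 = expected_errors(2000, 0.99)
errors_at_95 = expected_errors(2000, 0.95)

# An 8-letter search term at the same two accuracy levels.
hit_rate_at_99 = word_intact_probability(8, 0.99)
hit_rate_at_95 = word_intact_probability(8, 0.95)
```

Under these assumptions a 2,000-character page would contain roughly 20 errors at 99% accuracy and roughly 100 at 95%, and an 8-letter keyword would be recognised intact about 92% of the time at 99% accuracy but only about 66% of the time at 95%. This is why an exact keyword search can miss OCR-generated documents.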
The outline of the new OCR system’s basic processing stages for each document in a project (which remains similar to the previous system) is as follows:
- Determine whether the document is of a type suitable for OCR (PDF or supported image files). If not, do not attempt OCR.
- For PDF documents, if every page of this PDF file already contains detectable text above a de minimis level (after attempting to exclude any detected document number stamps) then do not attempt OCR.
- Run the OCR process on the document (for PDFs, only on pages that do not already contain detectable text above the de minimis level).
- If the OCR process detected any text, convert the document to a searchable PDF with the OCR text applied.
- Index the OCR detected text (for use in searching).
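The first three stages above boil down to deciding which pages, if any, should go to OCR. The sketch below shows that decision only. The list of image types, the 20-character de minimis threshold, the `Document` class and the function name are assumptions for illustration, and the page text is assumed to have had any detected number stamps removed already.

```python
from dataclasses import dataclass, field

# Both values are assumptions for this sketch, not LawFlow's actual settings.
OCR_IMAGE_TYPES = {"png", "jpg", "jpeg", "tif", "tiff"}
DE_MINIMIS_CHARS = 20


@dataclass
class Document:
    file_type: str
    # Existing text per PDF page, with detected number stamps already excluded.
    page_texts: list = field(default_factory=list)


def pages_needing_ocr(doc):
    """Return the 1-based page numbers that should be sent to OCR."""
    file_type = doc.file_type.lower()
    if file_type in OCR_IMAGE_TYPES:
        return [1]  # treat an image file as a single page
    if file_type != "pdf":
        return []  # not a type suitable for OCR
    # For PDFs, skip pages that already have text above the de minimis level.
    return [
        page_number
        for page_number, text in enumerate(doc.page_texts, start=1)
        if len(text.strip()) <= DE_MINIMIS_CHARS
    ]
```

An empty result means OCR is not attempted at all, which covers both unsuitable file types and PDFs where every page already has text.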
If you have any questions about our new OCR system or how to handle OCR text in your discovery project, get in touch with us and we’ll be happy to help!
We are continuing to work on a lot of new features & improvements to our LawFlow e-discovery system, and will continue rolling them out progressively. Here are highlights of the latest update.
- New OCR system to improve the performance, robustness and usefulness of the OCR process.
- Hide emails by address tool (similar to the “hide emails by domain sender tool” in the previous update). This allows hiding (or deleting) of all emails sent from selected email addresses.
- Ability to exclude a saved search from within another search. This makes it easier to create searches that exclude documents meeting specific criteria. So if you want to do a search such as “All documents in criteria A, but excluding any in criteria B”, you can create a search for criteria B, save it as, say, “Criteria B”, and then create another search with criteria A that also excludes the “Criteria B” saved search.
- Option to toggle additional columns (author, recipient, etc) in the left-hand slide-out pane in Details view.
- Ability to link multiple documents to chronology events at the same time via the tray.
- Usability improvements to “link email addresses to parties” tool.
- Quick link for adding users now available on the home page.
- Improved detection of specific watermark text on PDFs.
- Improved no-content detection for vector-based PDFs.
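Conceptually, excluding a saved search from another search (as described in the list above) is a set difference: find everything matching criteria A, then remove anything the saved “Criteria B” search would match. The sketch below shows the idea with a made-up in-memory document store. The function names and document fields are hypothetical and are not LawFlow's API.

```python
def run_search(documents, predicate):
    """Return the set of document IDs whose document satisfies the predicate."""
    return {doc_id for doc_id, doc in documents.items() if predicate(doc)}


def search_excluding(documents, predicate, saved_searches, excluded_name):
    """Run a search, then drop anything matched by the named saved search."""
    included = run_search(documents, predicate)
    excluded = run_search(documents, saved_searches[excluded_name])
    return included - excluded


# Hypothetical project data: criteria A is "emails", criteria B is "privileged".
documents = {
    1: {"type": "email", "privileged": False},
    2: {"type": "email", "privileged": True},
    3: {"type": "memo", "privileged": False},
}
saved_searches = {"Criteria B": lambda doc: doc["privileged"]}

result = search_excluding(
    documents, lambda doc: doc["type"] == "email", saved_searches, "Criteria B"
)
```

Here `result` contains only document 1: document 2 is an email but is removed by the saved search, and document 3 never matched criteria A.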
Thanks as always to our great customers for your support and feedback.
This update includes two significant new features: redundant email detection, and duplicate image detection. We’ll post more on these features later. In the meantime, here is a summary of the new features and improvements in this update.
- Redundant email detection – automatic detection of whether the text of an email in a thread (chain) is incorporated in a later reply in the thread.
- Apparent duplicate image detection – for finding substantively duplicate images (in native form and embedded in PDFs).
- Added a search option for finding documents with embedded documents.
- Improved email thread detection for native emails (Outlook MSG files and extracted Outlook PSTs).
- Improved text duplicates detection: the system now attempts to detect discovery number stamps applied on other parties’ documents, and ignore them when checking for text duplicates. This can significantly improve text duplicate detection with documents received from other parties.
- Improved similar document detection algorithm.
- Better organisation of per-repository sub-categories in main Documents page.
- ‘Open in new tab’ link added to related documents.
- Improved search box for selecting authors & recipients on Discovery tab.
- Improved detection of hidden content in Excel files.
- Option to limit the length of the recipients column when browsing documents. This prevents a very long list of recipients from causing layout issues (e.g. pushing other content off the screen).
- Additions and removals of documents on custom lists are now logged in document history.
- More information on folder hide/delete pages.
- Improved performance of apparent duplicate email detection.
- Improved scaling of native images when converting to PDF format.
- Improved performance of “go to document” function.
- Project export bundle now includes a hyperlinked index.
- Improved ability to extract text from some email files with non-standard/malformed HTML content.
- Various performance improvements & fixes.
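To show why ignoring discovery number stamps helps text duplicate detection (see the list above), the sketch below strips stamp-like strings before fingerprinting the text. The stamp format used here (for example `ABC.001.0001`), the regular expression and the function name are assumptions for illustration. Real stamps vary between parties, and this is not LawFlow's detection method.

```python
import hashlib
import re

# Assumed stamp format: 2-5 capital letters, 3 digits, 4 digits, dot-separated.
STAMP_PATTERN = re.compile(r"\b[A-Z]{2,5}\.\d{3}\.\d{4}\b")


def text_fingerprint(text):
    """Hash the document text after removing stamps and normalising whitespace."""
    without_stamps = STAMP_PATTERN.sub(" ", text)
    normalised = " ".join(without_stamps.lower().split())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


# The same letter as held by two parties, each with its own number stamp.
our_copy = "ABC.001.0001 Dear Sir,\nplease find attached"
their_copy = "Dear Sir, please find attached XYZ.002.0417"
same_text = text_fingerprint(our_copy) == text_fingerprint(their_copy)
```

Without the stamp removal step the two copies would hash differently and would not be treated as text duplicates.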
We hope everyone is doing well during the COVID-19 lockdown. Highlights of this update:
- Saved searches: you can now save a search configuration to re-use in multiple searches.
- Support for Outlook for Mac (OLM) files (the Mac equivalent of a PST file).
- Support for 7-zip (7z) compressed archive files. Uploaded 7z files will extract in the same manner as Zip files.
- Ability to set custom “badge” text for issues. This is useful for showing a shorter badge (e.g. an abbreviation or acronym) for issues with longer names.
- Option to tag all search results directly from the search results page (without having to use the document tray function).
- Improved detection of email dates from non-native emails.
- Improved performance of the Analysis and Parties pages.
- Easier merging of document types.
- Easier editing of issues.
- Add documents to tray function now shows more detail about matched documents.
- Improved performance when searching multiple lists.
The LawFlow team will be maintaining full support and services during the Level 4 COVID-19 restrictions. We have full work-from-home capabilities to continue on a “business as usual” basis.
Please continue to send support queries to email@example.com and contact us as usual during this time.
We wish our customers all the best during these challenging times.
The LawFlow Team
We are still working very hard on exciting new features. In the meantime, here is a summary of new features and improvements since the last update:
- Addition of a new list export mode: “Native format with sensitive document exceptions”. This mode exports list documents in native format unless the native document has one or more “sensitive” attachments (being a visible or hidden attachment that is set as privileged, confidential, or redacted, or unreviewed for privilege or confidentiality and therefore potentially sensitive). Those documents are exported in PDF format instead.
- List bundles: option to skip the generation of a placeholder PDF for native file exceptions when generating a PDF bundle.
- Option to limit by repository when adding documents to the tray by production number. This makes it easier when the same production number has been used in multiple repositories.
- Option to ignore mismatched documents when adding documents to the tray by ID.
- Print option added to document preview pane.
- Barcode scanning: you can now use a special separator sheet to indicate that the next document is an attachment of the previous top-level document. This allows a bulk-scanned PDF to be automatically split into separate top-level documents and attachments.
- Category for browsing privileged documents that have no privilege category.
- Scanned PDF detection: ability to detect in some cases whether a PDF is a scanned document, as opposed to a text-based PDF.
- Vector-based PDF detection: ability to detect in some cases whether a PDF includes vector content, as opposed to text or image-based content.
- Faster extraction of PST email archives.
- Better performance of apparent duplicate detection.
- Better performance of zipping document bundles.
- Some improvements to descriptions in statistics reports.
- Warning before displaying very large text and XML files.
- Better handling of extremely large history exports to Excel (now spans across multiple tabs).
- Better detection of poor-quality barcode separator pages in bulk scans.
- Better performance for bulk-linking a large number of documents to an issue.
- Various general performance improvements.
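The rule behind the “Native format with sensitive document exceptions” export mode in the list above can be written as a short decision function. The `Attachment` fields and function names below are hypothetical and only mirror the wording of the description. They are not LawFlow's data model.

```python
from dataclasses import dataclass


@dataclass
class Attachment:
    privileged: bool = False
    confidential: bool = False
    redacted: bool = False
    privilege_reviewed: bool = True
    confidentiality_reviewed: bool = True


def is_sensitive(attachment):
    """Sensitive = privileged, confidential or redacted, or not yet reviewed
    for privilege or confidentiality (and so potentially sensitive)."""
    return (
        attachment.privileged
        or attachment.confidential
        or attachment.redacted
        or not attachment.privilege_reviewed
        or not attachment.confidentiality_reviewed
    )


def export_format(attachments):
    """Export natively unless any visible or hidden attachment is sensitive."""
    if any(is_sensitive(attachment) for attachment in attachments):
        return "pdf"
    return "native"
```

The conservative part of the rule is the treatment of unreviewed attachments: they fall back to PDF in the same way as attachments already marked privileged, confidential or redacted.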
Thanks to all of our great users for their support and feedback.
This update includes some useful new functionality in the search tool, and significant performance improvements particularly for large projects.
- New search options:
  - You can now search documents by party role (i.e. author and/or recipient), specify multiple authors/recipients/both on an “all” or “any” basis, and specify separate “must include” and “must exclude” criteria. This makes it easier to carry out searches such as “all documents authored by A or B, and received by X and Y”.
  - Added the ability to search for multiple tags on an “all” or “any” basis.
  - There are now separate options for searching documents explicitly marked as “undated”, or documents that have not had any date information set.
  - Added the ability to search documents by discovery workflow status – partially reviewed documents and fully reviewed documents.
- Support for digitally signed & encrypted emails (P7M/P7S S/MIME).
- Automatic extraction of Outlook OST cache files (not to be confused with regular Outlook PST email archives, which LawFlow has always been able to automatically extract). LawFlow could previously extract Outlook OST files with helpdesk support. This update includes automatic processing of uploaded OST files.
- Ability to view workflow status information for tray documents.
- Significant performance improvements for very large projects, in particular browsing & viewing documents and generating long discovery lists.
- Added ability to easily view hidden attachments from a parent document.
- Improved layout of search options.
- More search tips.
- More robust process for expiring (i.e. refreshing) browser-cached versions of live-preview list document PDFs.
- Date range searching now supports partial dates (e.g. 0/11/2019 instead of a specific date in November 2019).
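One way to think about a partial date such as 0/11/2019 is as shorthand for a range covering the whole month. The sketch below expands a day/month/year string into start and end dates on that basis. Treating a zero month (e.g. 0/0/2019) as the whole year is an extra assumption that goes beyond the example given above, and the function name is made up. How LawFlow itself interprets partial dates in a range may differ.

```python
import calendar
from datetime import date


def partial_date_bounds(text):
    """Expand a d/m/yyyy string, where 0 means "unknown", into (start, end)."""
    day, month, year = (int(part) for part in text.split("/"))
    if month == 0:
        # Assumption: an unknown month means any date in that year.
        return date(year, 1, 1), date(year, 12, 31)
    if day == 0:
        last_day = calendar.monthrange(year, month)[1]
        return date(year, month, 1), date(year, month, last_day)
    exact = date(year, month, day)
    return exact, exact


november_2019 = partial_date_bounds("0/11/2019")
```

Here `november_2019` runs from 1 November 2019 to 30 November 2019, so a document dated only “November 2019” can still be compared against a search date range.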
We are continuing to roll out lots of new features and improvements to LawFlow, New Zealand’s e-discovery solution. Highlights from recent updates include:
- Statistics module. This enables you to generate statistical analyses, in raw data and visual form, of your project documents. Available reports include:
  - Email volumes;
  - Email sender/recipient & subject analysis;
  - File type & document type analysis;
  - Discovery metrics.
- Word lists for searching. This allows you to create & save “word lists” for carrying out searches on a list of words or phrases. You can create multiple word lists, and combine them in various ways to carry out powerful searches.
- Ability to search for multiple email addresses (or email domains, or display names) at a time, and the addition of a second criterion to allow searches such as “emails from A or B, and sent to C or D”.
- Improved indexing of documents for more consistent searching.
- More helpful error messages on invalid search queries.
- Searching on email subject field now allows wildcards and boolean search operators.
- Improved identification and extraction of email calendar item information.
- Extraction of embedded attachments in Rich Text Format (RTF) files, where possible.
- Detection of hidden content in Word documents now includes hidden images.
- Detection of hidden content in PowerPoint documents now includes hidden slides.
- Improved duplicate detection of Word documents containing only images.
- Performance improvement for indexing documents, which means documents can be searched sooner after upload.
- Improved performance for adding family documents to a large number of tray documents.
- Automatic correction of some invalid email address encoding in native email files.
- Improved detection of inline images in native email files with body text in RTF format.
- Option to remove selected images from the “junk image removal” tool (i.e. to designate selected images as “not junk”).
- Improved detection of the LawFlow separator page in bulk scanned PDFs using separator pages.
- Party codes (abbreviations) of parent organisations referenced in lists are now included in the abbreviations table.
- Automatic processing of Outlook OST files (“offline storage” files, similar to but different from regular PST files, which LawFlow also supports).
- Lots of performance improvements.
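The two-criteria address search mentioned in the list above (“emails from A or B, and sent to C or D”) combines an “any of” test on the sender with an “any of” test on the recipients. A minimal sketch follows, using a made-up email dictionary and example.com addresses. The field names and function name are hypothetical.

```python
def matches_address_search(email, senders, recipients):
    """True if the email is from any listed sender AND to any listed recipient.

    senders and recipients are sets of lower-case addresses.
    """
    sent_by_match = email["from"].lower() in senders
    sent_to_match = any(address.lower() in recipients for address in email["to"])
    return sent_by_match and sent_to_match


senders = {"a@example.com", "b@example.com"}
recipients = {"c@example.com", "d@example.com"}

email = {"from": "A@example.com", "to": ["x@example.com", "d@example.com"]}
is_match = matches_address_search(email, senders, recipients)
```

The same shape works for email domains or display names by comparing a different field of each address.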
We are working hard on lots of exciting features to help users analyse and cut through ever-growing volumes of electronic documents. Features coming soon include:
- Advanced text analytic tools to help analyse large volumes of electronic discovery documents. This will include the ability to find common words & phrases, assess text quality, and view content-related statistics.
- Email chain culling – the ability to safely identify and remove “redundant” emails in a chain, where the content of the email is replicated in a later reply.
- Automatic detection of email disclaimers and other ‘boilerplate’, to improve the effectiveness of similarity & duplicate detection.
- Advanced email deduplication tools.
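The idea behind email chain culling can be sketched very simply: an email is a candidate for “redundant” if its text appears in full inside a later email in the same chain. The example below is a deliberately naive version that only normalises case and whitespace. It ignores quoting prefixes, signatures, disclaimers and attachments, all of which a real implementation would need to handle, and the function names are made up.

```python
def normalise(text):
    """Lower-case the text and collapse all runs of whitespace."""
    return " ".join(text.lower().split())


def redundant_emails(thread):
    """Return IDs of emails whose body is fully contained in a later email.

    thread is a list of (email_id, body) tuples ordered oldest first.
    """
    bodies = [(email_id, normalise(body)) for email_id, body in thread]
    redundant = []
    for position, (email_id, body) in enumerate(bodies):
        later_bodies = [later for _, later in bodies[position + 1:]]
        if body and any(body in later for later in later_bodies):
            redundant.append(email_id)
    return redundant


thread = [
    ("e1", "Can we meet Tuesday?"),
    ("e2", "Yes, 2pm works.\n\nCan we meet   Tuesday?"),
    ("e3", "Unrelated note"),
]
culled = redundant_emails(thread)
```

In this example only `e1` is flagged, because its text is repeated inside the reply `e2`.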
Thank you as always to our fantastic customers for your feedback and support. If you have any requested features or ideas on how we can make New Zealand’s e-discovery solution even better, please get in touch with us.
Thanks as always to our awesome customers, for making us the leading e-discovery solution in New Zealand. We’ve got loads of new features coming up in the next few months, including new tools for handling duplicates, analytics functions, and the new electronic casebook module. In the meantime, here are highlights of the latest update.
- “Look-through” folders: the ability to browse folders on a “look-through” basis. This shows documents in the selected folder as well as those in any subfolders. This is useful if you want to browse documents in a tree of folders without having to click through into each subfolder.
- Ability to specify a type of confidentiality when setting a document as confidential (e.g. commercial sensitivity, personal information).
- Automatic address & domain detection for some MS Exchange-style (X.500) addresses on native emails.
- Automatic detection and removal of blank image attachments on native emails, such as “spacer” images.
- Automatic detection and removal of some system properties attachments from PDFs (such as those added by Acrobat Distiller).
- Added category for browsing privileged documents across all repositories.
- Ability to use custom fields in automatic index templates for electronic casebooks.
- Improved performance when adding large numbers of documents to the tray.
- Improved performance when looking up documents by production number or filename.
- Better handling of non-standard and corrupted attachments when attempting to extract attachments from a PDF.
- Better parsing of non-standard email addresses on native emails.
- Better handling of surplus/redundant barcode separator pages for bulk-scanned PDFs.
- Ability to create PDF versions of some non-standard Outlook appointment requests with email body text.
- Improved performance for hiding empty subfolders.
- Performance improvements to email archive (PST/MBOX) extraction.
- Improved layout of long email addresses when linking email addresses to parties.
- Better validation of the end date of a date range on the discovery tab.
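The “look-through” folder browsing described in the list above is, in effect, a recursive walk: collect the documents in the selected folder, then do the same for each subfolder. The sketch below uses a made-up nested dictionary to represent folders. The structure and function name are hypothetical.

```python
def look_through(folder):
    """Return documents in this folder followed by those in all subfolders."""
    documents = list(folder.get("documents", []))
    for subfolder in folder.get("subfolders", []):
        documents.extend(look_through(subfolder))
    return documents


# Hypothetical folder tree for illustration.
project = {
    "documents": ["contract.pdf"],
    "subfolders": [
        {"documents": ["email-1.msg", "email-2.msg"], "subfolders": []},
        {
            "documents": [],
            "subfolders": [{"documents": ["notes.docx"], "subfolders": []}],
        },
    ],
}
all_documents = look_through(project)
```

`all_documents` lists all four files, including `notes.docx` two levels down, without the user having to open each subfolder.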