The aim of data deduplication is to eliminate redundant data from a repository so that each unique item is stored only once. The benefits include reducing the amount of storage required, speeding up the search for relevant data, and lowering the cost of discovery by not sifting through the same data twice (or more). The process involves collecting and storing hash values, which act as digital fingerprints for files: the same hash value is produced each time a file with the same content is encountered. These hash values are then compared against one another to identify multiple instances of the same data, delivering potentially significant reductions in the amount of data that needs to be stored, managed and reviewed.
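As a rough illustration of hash-based deduplication, the sketch below fingerprints file content with SHA-256 and groups files that share a fingerprint. The choice of SHA-256, the function names `content_hash` and `find_duplicates`, and the sample file names are assumptions made for this example; an actual eDiscovery platform may use a different hash algorithm and will typically read content from disk rather than from an in-memory dictionary.

```python
import hashlib
from collections import defaultdict


def content_hash(data: bytes) -> str:
    """Return the SHA-256 hex digest of the content (its 'digital fingerprint')."""
    return hashlib.sha256(data).hexdigest()


def find_duplicates(files: dict[str, bytes]) -> dict[str, list[str]]:
    """Map each hash to the names of files sharing that content.

    Only hashes seen for two or more files are returned.
    """
    by_hash = defaultdict(list)
    for name, data in files.items():
        by_hash[content_hash(data)].append(name)
    return {h: names for h, names in by_hash.items() if len(names) > 1}


# Hypothetical sample data: two files with identical content, one distinct file.
files = {
    "contract_final.txt": b"The parties agree to the terms below.",
    "copy_of_contract.txt": b"The parties agree to the terms below.",
    "memo.txt": b"Lunch is at noon.",
}
duplicates = find_duplicates(files)
```

Here `duplicates` holds a single group containing the two contract files, so only one of them would need to be stored or reviewed. Because the hash is computed from content alone, the differing file names do not prevent the match.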

Documents and file content are not always exact duplicates of one another, but near-duplicate data may, at times, also be excluded from the review process. The most common example is a contract that underwent revisions before the final document was reached. If the final document is not relevant to the matter at hand, it is likely that each earlier version whose content is substantially similar to the final version is also not relevant.
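One way to put a number on how similar two versions of a document are is to compare their overlapping word sequences. The sketch below uses Jaccard similarity over three-word "shingles"; this is only one of several possible measures, and the function names, the shingle size, and the sample sentences are assumptions chosen for illustration rather than a description of any particular product.

```python
def shingles(text: str, k: int = 3) -> set[tuple[str, ...]]:
    """Return the set of k-word sequences (shingles) found in the text."""
    words = text.lower().split()
    if len(words) < k:
        # Text shorter than k words: treat the whole text as one shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard_similarity(a: str, b: str, k: int = 3) -> float:
    """Return shared shingles divided by total distinct shingles (0.0 to 1.0)."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa and not sb:
        return 1.0  # Two empty texts are considered identical.
    return len(sa & sb) / len(sa | sb)


# Hypothetical draft and final wording of one contract clause.
draft = "the supplier shall deliver the goods within thirty days of the order date"
final = "the supplier shall deliver the goods within sixty days of the order date"
similarity = jaccard_similarity(draft, final)
```

A single changed word alters every shingle that contains it, so short passages score lower than their word-for-word overlap might suggest. On full-length documents, a small revision affects a much smaller share of the shingles and the score stays close to 1.0.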

The deduplication process examines file content for exact matches, allowing duplicate documents to be suppressed prior to review. The workflow for identifying near-duplicate file content is usually executed differently. Returning to the contract example, even if the final version does not contain relevant content, a prior version might, and therefore should not be suppressed automatically. Near-duplicate identification is therefore usually used to group similar documents together so that, at the time of review, decisions about how to handle them can be made quickly. If there are ten (10) drafts of a document that is not relevant to the matter at hand, then after reviewing one document the other nine (9) can be ignored, saving time and money. Given that data review is commonly the most expensive portion of any eDiscovery workflow, these savings add up considerably.
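The grouping step might look something like the following sketch, which places each document in the first group whose representative it resembles closely enough, and otherwise starts a new group. It uses the word-level ratio from Python's standard `difflib.SequenceMatcher`; the 0.8 threshold, the greedy comparison against each group's first member, the function names, and the sample documents are all assumptions for illustration. Nothing is suppressed here: the output is simply a set of groups for a reviewer to act on.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a 0.0-1.0 similarity ratio computed over the words of each text."""
    return SequenceMatcher(None, a.lower().split(), b.lower().split()).ratio()


def group_near_duplicates(docs: dict[str, str], threshold: float = 0.8) -> list[list[str]]:
    """Greedily group document names whose text resembles a group's first member."""
    groups: list[list[str]] = []
    for name, text in docs.items():
        for group in groups:
            representative = docs[group[0]]
            if similarity(text, representative) >= threshold:
                group.append(name)
                break
        else:
            # No existing group was similar enough, so start a new one.
            groups.append([name])
    return groups


# Hypothetical documents: three drafts of one clause and an unrelated memo.
docs = {
    "draft_1.txt": "payment is due within 30 days of invoice",
    "draft_2.txt": "payment is due within 45 days of invoice",
    "draft_3.txt": "payment is due within 60 days of invoice",
    "memo.txt": "please join us for lunch on friday",
}
groups = group_near_duplicates(docs)
```

With this sample data the three drafts land in one group and the memo in another, so a reviewer can open one draft and then decide how to treat the rest of its group. Comparing every document against every group representative becomes slow on large collections, which is why this should be read as a conceptual sketch rather than a scalable design.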
