Deduplicator



Version 0.4.0 released / Future plans - 15/07/2008

Version 0.4.0 includes numerous tweaks and patches introduced since 0.2.0.

Notable changes:

  • Support for changed crawl.log format that Heritrix introduced in 1.12.0.
  • Improved memory usage for large indexes.
  • Can now exclude duplicate URIs from new index.
  • Various bug fixes.
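
As a rough illustration of what excluding duplicate URIs from a new index can mean, here is a minimal Python sketch. It is not the DeDuplicator's own code (which is a Java add-on for Heritrix) and it does not parse the real crawl.log format. The CrawlRecord type, its fields, and the build_index and is_unchanged functions are all hypothetical names introduced for this example, and it assumes each log entry has already been reduced to a URI, a content digest, and a flag saying whether the crawl treated the fetch as a duplicate.

```python
from typing import Dict, Iterable, NamedTuple


class CrawlRecord(NamedTuple):
    """Hypothetical, simplified stand-in for one crawl log entry."""
    uri: str
    digest: str          # content hash reported by the crawler
    is_duplicate: bool   # True if the crawl flagged this fetch as a duplicate


def build_index(records: Iterable[CrawlRecord],
                skip_duplicates: bool = True) -> Dict[str, str]:
    """Map each URI to the digest of the last record seen for it.

    When skip_duplicates is True, records flagged as duplicates are
    left out, which keeps the new index smaller.
    """
    index: Dict[str, str] = {}
    for record in records:
        if skip_duplicates and record.is_duplicate:
            continue
        index[record.uri] = record.digest
    return index


def is_unchanged(index: Dict[str, str], uri: str, digest: str) -> bool:
    """Return True if the index holds the same digest for this URI."""
    return index.get(uri) == digest


if __name__ == "__main__":
    records = [
        CrawlRecord("http://example.org/a", "sha1:AAA", False),
        CrawlRecord("http://example.org/b", "sha1:BBB", True),
    ]
    index = build_index(records)
    print(sorted(index))
```

Whether skipping flagged entries is appropriate depends on how the index will be used in later crawls, so the sketch keeps it as a switch rather than fixed behaviour.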

This will be the last version of the DeDuplicator built against Heritrix 1.10.0. Building against that version has kept the DeDuplicator compatible with almost all 1.x versions of Heritrix. Note, though, that 0.4.0 is built with Java 1.5, whereas 0.2.0 was built with Java 1.4.2.

In version 1.12.0, Heritrix added some useful features that the DeDuplicator should make use of, most notably marking content as 'not novel' (i.e. duplicate). In addition, 1.14.0 has rudimentary WARC support, and the aim is for the DeDuplicator to support writing to WARC files. Therefore, any future versions will be built against Heritrix 1.14.0.

Support for Heritrix 2.0 is planned, but there is no set timeframe for it. Supporting it requires considerable changes to the DeDuplicator, and it will likely not be implemented until Heritrix 2.x is mature enough to be used routinely instead of 1.x for large-scale production crawls.