Zygo Blaxell [Fri, 14 May 2010 18:37:58 +0000 (14:37 -0400)]
dm6: drop inode and digest suffixes
Digests are much longer than inode numbers can possibly be (*), so no
suffix is needed to distinguish between them.
(*) Yes, you could have a hypothetical filesystem with 256-bit inode
numbers, but you'd have to adjust Perl and the size of its 'Q' pack
format first. Either way, you'd be changing this code anyway.
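A minimal sketch of the idea in Perl (the key layout and digest
algorithm here are assumptions, not taken from dm6 itself):

    use Digest::SHA qw(sha1_hex);

    # Inode keys are packed 64-bit integers: always 8 bytes.
    my $inode_key  = pack('Q', 123456);

    # Digest keys are hex strings: 40 bytes for SHA-1 (32 even for MD5).
    my $digest_key = sha1_hex('file contents');

    # No inode key can be as long as a digest, so length alone
    # distinguishes the two -- no suffix required.
    sub is_digest_key { length($_[0]) > 8 }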
Zygo Blaxell [Sat, 9 Jan 2010 04:09:27 +0000 (23:09 -0500)]
Allow --skip-hash and --skip-compare to be specified together
Since --skip-hash is now a threshold parameter which applies to some
files and not others, it makes sense to allow --skip-compare to be
specified at the same time.
--skip-compare is respected for files smaller than the --skip-hash
threshold, since such files will always be hashed. --skip-compare is
ignored for files larger than or equal to the --skip-hash threshold,
since such files will never be hashed.
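The resulting per-file decision looks roughly like this (a sketch;
the variable and sub names are illustrative, not from the script):

    my $skip_hash_threshold = 1024 * 1024;   # --skip-hash default: 1MB
    my $skip_compare        = 1;             # --skip-compare given

    sub will_hash { my ($size) = @_; return $size < $skip_hash_threshold }

    sub will_compare {
        my ($size) = @_;
        # Small files are always hashed, so --skip-compare is honoured;
        # large files are never hashed, so they must be compared
        # byte-for-byte and the flag is ignored.
        return will_hash($size) ? !$skip_compare : 1;
    }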
Zygo Blaxell [Sat, 9 Jan 2010 04:01:23 +0000 (23:01 -0500)]
Add a threshold to skip-hash
Hashing usually performs well except in special cases where many
large files have the same size but very different contents near
the beginning of the file. In these cases, it is usually faster
to execute the O(N^2) comparisons between all the files, thereby
avoiding reading most of their data.
In cases where there are many small files, the opposite setting of
the skip-hash option is usually better, so skip-hash is now a
threshold parameter with a default of 1MB.
Unfortunately it is not trivial to detect this condition without
doing sufficient work to negate the benefit; hence, we require
the operator to specify a preference.
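The early-exit behaviour that makes this worthwhile can be seen with
core Perl's File::Compare, which stops reading at the first differing
block (illustrative; the script's own comparison routine may differ):

    use File::Compare qw(compare);

    # Returns 0 if equal, 1 if different, -1 on error. Files that
    # diverge near the beginning cost almost nothing to compare,
    # whereas hashing would read both files in full.
    my $different = compare($file_a, $file_b);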
Zygo Blaxell [Sat, 9 Jan 2010 02:33:20 +0000 (21:33 -0500)]
merge_files: add extra sanity check that input files are in fact files
Now that there is no '-f _' filter in the input loop, merge_files may
encounter a candidate or incumbent filename which resolves to a non-file
object on the filesystem. If --skip-hash is not used, this sanity check
will be applied during file hashing; however, if --skip-hash is used,
the sanity check will not be applied until the files are compared.
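A sketch of the kind of guard this adds (sub and variable names are
illustrative):

    sub merge_files {
        my ($incumbent, $candidate) = @_;
        # With no '-f _' filter upstream, either name may now resolve
        # to a directory, symlink target, or other non-file object.
        for my $name ($incumbent, $candidate) {
            return unless -f $name;
        }
        # ... hash, compare, and link as before ...
    }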
Zygo Blaxell [Sat, 9 Jan 2010 02:30:09 +0000 (21:30 -0500)]
Update *low* range of copyright years: 2002-2010
While consolidating the histories of faster-dupemerge across the
many SCM repositories in which it has lived, I found evidence of
faster-dupemerge existing as far back as 2002.
6038ff7 dupemerge: have find tell us the device too
fd1d958 dupemerge: maybe improve seek performance by sorting perl hashes
9bf1c63 dupemerge: merge with state-of-the-art on Serenity
a27ea1f dupemerge: update copyright year to 2010
593804d dupemerge: sort incumbent inodes too
1df8571 dupemerge: make inode sort order strictly numeric
24fecbb dupemerge: make sure 'sort -znr' still considers dev/inode numeric
c92b00c dupemerge: inodes are now non-numeric
Zygo Blaxell [Sat, 9 Jan 2010 01:08:45 +0000 (20:08 -0500)]
Work around new findutils output
findutils now appends a redundant ".000000000" to the %T@ output.
I've apparently missed the window to get findutils to fix this, so
I've worked around it.
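The workaround amounts to stripping the fractional part before the
timestamp is used, roughly (variable name assumed):

    # find's %T@ now prints e.g. "1263000525.000000000"; drop the
    # redundant fractional seconds.
    $timestamp =~ s/\.\d+$//;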
root [Sun, 26 Nov 2006 22:05:51 +0000 (22:05 +0000)]
Properly handle cases where multiple files have the same hash
(e.g. because --skip-hash is used). This version now generates all N^2
combinations of comparisons.
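In outline, every pair within a bucket of same-hash files is compared
(a sketch; merge_if_identical is a hypothetical helper):

    # @files all share one hash value (or a placeholder value when
    # --skip-hash is in effect); compare every pair of them.
    for my $i (0 .. $#files) {
        for my $j ($i + 1 .. $#files) {
            merge_if_identical($files[$i], $files[$j]);
        }
    }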
Zygo Blaxell [Wed, 6 Jan 2010 16:10:04 +0000 (11:10 -0500)]
dupemerge: maybe improve seek performance by sorting perl hashes
Thanks to Johannes Niess <Linux@johannes-niess.de> for this idea.
To improve seek performance, choose inodes for linking in a fixed order.
This will mean that two directories with multiple identical files will
end up with links to the copies with lower inode numbers. This is an
improvement over the previous result, which was that both directories
would end up with randomly chosen files from both directories.
The sort order isn't strictly numeric; however, it's hopefully close
enough.
As a crude heuristic, we assume that inode numbers approximate file
position on disk, and file names approximate typical usage patterns.
Previously we used Perl's hash ordering, which is mostly random
and might change depending on the number of files considered.
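The change is essentially to walk the hash in sorted key order rather
than Perl's arbitrary order (a sketch; link_into_place is hypothetical,
and the real keys carry more than a bare inode number, which is why
the order is not strictly numeric):

    # Before: keys %by_inode come back in an arbitrary, run-dependent
    # order. After: a fixed lexicographic order that roughly tracks
    # inode number, and so, we hope, on-disk position.
    for my $key (sort keys %by_inode) {
        link_into_place($by_inode{$key});
    }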
Zygo Blaxell [Wed, 6 Jan 2010 16:07:24 +0000 (11:07 -0500)]
dupemerge: have find tell us the device too
faster-dupemerge cannot be used to link files on multiple filesystems
because the hardlinks will fail; however, if this is attempted anyway,
then files with identical weak keys (size+timestamp+permissions) and
identical inode numbers might be considered identical for hashing
and comparison purposes when they are not. That would be bad.
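Roughly the shape of the change (the script's actual -printf format
is an assumption here):

    # find ... -printf '%D %i %s %T@ %m %p\0' reports the device (%D)
    # alongside the inode (%i); the pair, not the inode alone,
    # identifies a file, so equal inode numbers on different
    # filesystems stay distinct.
    my ($dev, $inode, $size, $mtime, $mode, $path) = split ' ', $record, 6;
    my $file_id = "$dev:$inode";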
cvs [Mon, 5 May 2003 04:20:14 +0000 (04:20 +0000)]
digest: Fix incorrect statistics when hashes fail
An order-of-operations bug can lead to files being counted as hashed
when they are not (e.g. due to I/O error or the file disappearing).
Calculate the digest, then increment the statistics.
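The fix is a reordering, in the spirit of (digest algorithm and
counter name assumed):

    use Digest::SHA;

    # Compute the digest first; count the file as hashed only if
    # that succeeds (it can fail on I/O error or if the file has
    # disappeared).
    my $digest = eval { Digest::SHA->new(1)->addfile($path)->hexdigest };
    $hashed_count++ if defined $digest;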