Zygo Blaxell [Fri, 14 May 2010 18:37:58 +0000 (14:37 -0400)]
dm6: drop inode and digest suffixes
Digests are much longer than inode numbers can possibly be (*), so no
suffix is needed to distinguish between them.
(*) Yes, you could have a hypothetical filesystem with 256-bit inode
numbers, but you'd have to adjust Perl and the size of its 'Q' pack
format first. Either way, you'd be changing this code anyway.
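A minimal sketch of the idea in Perl (the key layout and digest
algorithm here are assumptions, not taken from dm6 itself):

    use Digest::SHA qw(sha1_hex);

    # Inode keys are packed 64-bit integers: always 8 bytes.
    my $inode_key  = pack('Q', 123456);

    # Digest keys are hex strings: 40 bytes for SHA-1 (32 even for MD5).
    my $digest_key = sha1_hex('file contents');

    # No inode key can be as long as a digest, so length alone
    # distinguishes the two -- no suffix required.
    sub is_digest_key { length($_[0]) > 8 }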
Zygo Blaxell [Sat, 9 Jan 2010 04:09:27 +0000 (23:09 -0500)]
Allow --skip-hash and --skip-compare to be specified together
Since --skip-hash is now a threshold parameter which applies to some
files and not others, it makes sense to allow --skip-compare to be
specified at the same time.
--skip-compare is respected for files smaller than the --skip-hash
threshold, since such files will always be hashed. --skip-compare is
ignored for files larger than or equal to the --skip-hash threshold,
since such files will never be hashed.
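The resulting per-file decision looks roughly like this (a sketch;
the variable and sub names are illustrative, not from the script):

    my $skip_hash_threshold = 1024 * 1024;   # --skip-hash default: 1MB
    my $skip_compare        = 1;             # --skip-compare given

    sub will_hash { my ($size) = @_; return $size < $skip_hash_threshold }

    sub will_compare {
        my ($size) = @_;
        # Small files are always hashed, so --skip-compare is honoured;
        # large files are never hashed, so they must be compared
        # byte-for-byte and the flag is ignored.
        return will_hash($size) ? !$skip_compare : 1;
    }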
Zygo Blaxell [Sat, 9 Jan 2010 04:01:23 +0000 (23:01 -0500)]
Add a threshold to skip-hash
Hashing usually performs well except in special cases where many
large files have the same size but very different contents near
the beginning of the file. In these cases, it is usually faster
to execute the O(N^2) comparisons between all the files, thereby
avoiding reading most of their data.
In cases where there are many small files, the opposite setting of
the skip-hash option is usually better, so skip-hash is now a
threshold parameter with a default of 1MB.
Unfortunately it is not trivial to detect this condition without
doing sufficient work to negate the benefit; hence, we require
the operator to specify a preference.
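The early-exit behaviour that makes this worthwhile can be seen with
core Perl's File::Compare, which stops reading at the first differing
block (illustrative; the script's own comparison routine may differ):

    use File::Compare qw(compare);

    # Returns 0 if equal, 1 if different, -1 on error. Files that
    # diverge near the beginning cost almost nothing to compare,
    # whereas hashing would read both files in full.
    my $different = compare($file_a, $file_b);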
Zygo Blaxell [Sat, 9 Jan 2010 02:33:20 +0000 (21:33 -0500)]
merge_files: add extra sanity check that input files are in fact files
Now that there is no '-f _' filter in the input loop, merge_files may
encounter a candidate or incumbent filename which resolves to a non-file
object on the filesystem. If --skip-hash is not used, this sanity check
will be applied during file hashing; however, if --skip-hash is used,
the sanity check will not be applied until the files are compared.
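A sketch of the kind of guard this adds (sub and variable names are
illustrative):

    sub merge_files {
        my ($incumbent, $candidate) = @_;
        # With no '-f _' filter upstream, either name may now resolve
        # to a directory, symlink target, or other non-file object.
        for my $name ($incumbent, $candidate) {
            return unless -f $name;
        }
        # ... hash, compare, and link as before ...
    }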
Zygo Blaxell [Sat, 9 Jan 2010 02:30:09 +0000 (21:30 -0500)]
Update *low* range of copyright years: 2002-2010
While consolidating the histories of faster-dupemerge across the
many SCM repositories in which it has lived, I found evidence of
faster-dupemerge existing as far back as 2002.
6038ff7 dupemerge: have find tell us the device too
fd1d958 dupemerge: maybe improve seek performance by sorting perl hashes
9bf1c63 dupemerge: merge with state-of-the-art on Serenity
a27ea1f dupemerge: update copyright year to 2010
593804d dupemerge: sort incumbent inodes too
1df8571 dupemerge: make inode sort order strictly numeric
24fecbb dupemerge: make sure 'sort -znr' still considers dev/inode numeric
c92b00c dupemerge: inodes are now non-numeric
Zygo Blaxell [Sat, 9 Jan 2010 01:08:45 +0000 (20:08 -0500)]
Work around new findutils output
findutils now appends a redundant ".000000000" to the %T@ output.
I've apparently missed the window to get findutils to fix this, so
I've worked around it.
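The workaround amounts to stripping the fractional part before the
timestamp is used, roughly (variable name assumed):

    # find's %T@ now prints e.g. "1263000525.000000000"; drop the
    # redundant fractional seconds.
    $timestamp =~ s/\.\d+$//;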
root [Sun, 26 Nov 2006 22:05:51 +0000 (22:05 +0000)]
Properly handle cases where multiple files have the same hash
(e.g. because --skip-hash is used). This version now generates all N^2
combinations of comparisons.
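In outline, every pair within a bucket of same-hash files is compared
(a sketch; merge_if_identical is a hypothetical helper):

    # @files all share one hash value (or a placeholder value when
    # --skip-hash is in effect); compare every pair of them.
    for my $i (0 .. $#files) {
        for my $j ($i + 1 .. $#files) {
            merge_if_identical($files[$i], $files[$j]);
        }
    }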
Zygo Blaxell [Wed, 6 Jan 2010 16:10:04 +0000 (11:10 -0500)]
dupemerge: maybe improve seek performance by sorting perl hashes
Thanks to Johannes Niess <Linux@johannes-niess.de> for this idea.
To improve seek performance, choose inodes for linking in a fixed order.
This will mean that two directories with multiple identical files will
end up with links to the copies with lower inode numbers. This is an
improvement over the previous result, which was that both directories
would end up with randomly chosen files from both directories.
The sort order isn't strictly numeric; however, it's hopefully close
enough.
As a crude heuristic, we assume that inode numbers approximate file
position on disk, and file names approximate typical usage patterns.
Previously we used Perl's hash ordering, which is mostly random
and might change depending on the number of files considered.
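The change is essentially to walk the hash in sorted key order rather
than Perl's arbitrary order (a sketch; link_into_place is hypothetical,
and the real keys carry more than a bare inode number, which is why
the order is not strictly numeric):

    # Before: keys %by_inode come back in an arbitrary, run-dependent
    # order. After: a fixed lexicographic order that roughly tracks
    # inode number, and so, we hope, on-disk position.
    for my $key (sort keys %by_inode) {
        link_into_place($by_inode{$key});
    }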
Zygo Blaxell [Wed, 6 Jan 2010 16:07:24 +0000 (11:07 -0500)]
dupemerge: have find tell us the device too
faster-dupemerge cannot be used to link files on multiple filesystems
because the hardlinks will fail; however, if this is attempted anyway,
then files with identical weak keys (size+timestamp+permissions) and
identical inode numbers might be considered identical for hashing
and comparison purposes when they are not. That would be bad.
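Roughly the shape of the change (the script's actual -printf format
is an assumption here):

    # find ... -printf '%D %i %s %T@ %m %p\0' reports the device (%D)
    # alongside the inode (%i); the pair, not the inode alone,
    # identifies a file, so equal inode numbers on different
    # filesystems stay distinct.
    my ($dev, $inode, $size, $mtime, $mode, $path) = split ' ', $record, 6;
    my $file_id = "$dev:$inode";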
cvs [Mon, 5 May 2003 04:20:14 +0000 (04:20 +0000)]
digest: Fix incorrect statistics when hashes fail
An order-of-operations bug can lead to files being counted as hashed
when they are not (e.g. due to I/O error or the file disappearing).
Calculate the digest, then increment the statistics.
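The fix is a reordering, in the spirit of (digest algorithm and
counter name assumed):

    use Digest::SHA;

    # Compute the digest first; count the file as hashed only if
    # that succeeds (it can fail on I/O error or if the file has
    # disappeared).
    my $digest = eval { Digest::SHA->new(1)->addfile($path)->hexdigest };
    $hashed_count++ if defined $digest;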