[ale] Stumped-

David A. De Graaf dad at datix.us
Thu Mar 11 23:00:29 EST 2010


On Thu, Mar 11, 2010 at 09:53:22AM -0500, Dennis Ruzeski wrote:
> I have a directory structure-
> 
> |
> |
> \client1\
> |           \images
> |           \movies
> \client2\
> |           \images
> |           \movies
> |etc..
> To around 10000 client directories.
> 
> I need a bash script (I started to write it in python and they want
> bash) to go through this
> tree and find, for each client, files with the same name but different sizes.
> 
> Pretty simple with perl or python but I'm struggling to get a shell
> script to do the same thing.
> 
> Anyone have any ideas?

First, I believe there's an ambiguity in your problem statement.
As an elementary example, say there are three files with the same
basename:
  Path         Size    Basename
  pathname1    0       file1
  pathname2    0       file1
  pathname3    1       file1

Do you wish to report
  all three - because at least one size differs;
  only the third - because its (size, name) pair is unique;
  or some combination of the first and second, along with the third?

My script below provides choices one and two;  I can't see any
sensible way to present the third "some combination" choice.  
  /tmp/dups4 - lists all pathnames that share a basename
  /tmp/dups5 - lists unique (size, basename) combinations, i.e., pathname3 only

I suspect you really want /tmp/dups4.
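
Applied to the three-file example above, the two lists would be:

  /tmp/dups4:
    pathname1    0    file1
    pathname2    0    file1
    pathname3    1    file1

  /tmp/dups5:
    pathname3    1    file1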

Here's a script, dups.
It uses find to enumerate all the files (skipping anything that is
not a regular file).  File names with embedded spaces, or worse,
trailing spaces, are really pernicious; the sed/read pair below
exists to survive them.  For each file it prints the full pathname,
the size, and the basename, separated by tab characters.

Then it uses sort and uniq repeatedly to get the desired lists.
Obviously the separate commands could be strung together in one big
pipeline, but debugging is easier with each stage kept separate; a
one-pipeline version is sketched below.
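
Something like this (a sketch only; it produces the /tmp/dups4
result directly, but you give up the intermediate files):

  find "$@" -type f |
      sed -e 's/ /\\ /g' |
      while read fn; do
          echo -e "$fn\t$(stat -c %s "$fn")\t$(basename "$fn")"
      done |
      sort -k3,3 |
      uniq --all-repeated -f 2 |
      sort -k3,3 -k2,2n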

I ran this on my home directory.
Analyzing 27707 files took several minutes, most of it spent
generating the list of file sizes.
I found an amazing number of files with the same name but different
sizes, e.g., 235 files named "%gconf.xml".   Ugh!
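
Incidentally, most of that time is likely the stat and basename
forked once per file.  GNU find can print all three fields itself,
which should cut the enumeration time sharply - a sketch, assuming
GNU findutils:

  #  %p = full pathname, %s = size in bytes, %f = basename
  find "$@" -type f -printf '%p\t%s\t%f\n' > /tmp/dups1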


***   dups   ***

#!/bin/bash
#  List files under the given directories with the same basename but
#  different sizes

#  Clear leftovers from any previous run; -f ignores missing files
rm -f /tmp/dups?

#  List all files in the stated dirs; escape those damnable blanks so
#  the backslash-interpreting read (no -r) below preserves them.
#  For each file, print full pathname, size, and basename,
#  separated by tabs.

find "$@" -type f |
    sed -e 's/ /\\ /g' |
    while read fn; do
        echo -e "$fn\t$(stat -c %s "$fn")\t$(basename "$fn")"
    done > /tmp/dups1

#  sort on field 3 (the basename), grouping lines with the same basename
sort -k3,3 /tmp/dups1 > /tmp/dups2

#  keep every line whose basename occurs more than once, dropping
#  singletons; -f 2 skips the first two fields when comparing
uniq --all-repeated -f 2 /tmp/dups2 > /tmp/dups3
    
#  sort on the basename, and within each basename group, numerically
#  on the size
sort -k3,3 -k2,2n /tmp/dups3 > /tmp/dups4

#  select lines whose (size, basename) pair is unique; -f1 skips the
#  pathname field when comparing
uniq -u -f1 /tmp/dups4 > /tmp/dups5

#  show the unique-combination list; /tmp/dups4 holds the full report
cat /tmp/dups5
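
A hypothetical invocation, assuming the script is saved as dups and
the client tree lives under /srv/clients (substitute your real path);
run it once per client directory so names are only compared within
one client:

  chmod +x dups
  for d in /srv/clients/client*/; do
      echo "== $d =="
      ./dups "$d"
  done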


-- 
	David A. De Graaf    DATIX, Inc.    Hendersonville, NC
	dad at datix.us         www.datix.us

