FreeBSD Manual Pages


       samefile - find identical files

       samearchive - find identical files, while keeping archives intact

       samefile [-a | -A | -At | -L | -Z | -Zt] [-g size] [-l | -r] [-m size]
       [-S sep] [-0HiqVvx]

       samearchive [-a | -A | -At | -L | -Z | -Zt] [-g size]  [-l  |  -r]  [-m
       size] [-S sep] [-0HiqVv]	dir1 dir2 [...]

       These programs read a list of filenames (one filename per line) from
       stdin and write the identical files to stdout.  samearchive is written
       for the special case where each directory acts as a backup archive.
       Its output only contains filename pairs that have the same relative
       path from the archive base.  Therefore the output of samearchive is a
       subset of the output of samefile.

       The output consists of six fields: the size in bytes, two filenames
       (with identical contents), the character = if the two files are on the
       same device, X otherwise, and the link counts of the two files.  The
       output is sorted in reverse order by size as the primary key, with a
       secondary key that depends on the user input.
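
       The six fields described above can be split apart by a consumer along
       the following lines.  This is a minimal sketch, not part of samefile:
       the names Match and parse_line are hypothetical, and it assumes the
       default tab separator and filenames that do not contain the separator.

```python
from typing import NamedTuple


class Match(NamedTuple):
    size: int          # file size in bytes
    first: str         # first filename
    second: str        # second filename (same contents as the first)
    same_device: bool  # True for "=", False for "X"
    links_first: int   # link count of the first file
    links_second: int  # link count of the second file


def parse_line(line: str, sep: str = "\t") -> Match:
    """Split one six-field output line into a Match record."""
    fields = line.rstrip("\n").split(sep)
    if len(fields) != 6:
        raise ValueError(f"expected 6 fields, got {len(fields)}: {line!r}")
    size, first, second, device, n1, n2 = fields
    if device not in ("=", "X"):
        raise ValueError(f"unexpected device marker: {device!r}")
    return Match(int(size), first, second, device == "=", int(n1), int(n2))


print(parse_line("100\tfile1\tfile4\t=\t3\t2"))
```

       If another separator was chosen with -S, pass the same string as sep.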

       -0     Indicates	that the input list of file names is  NUL  terminated,
	      for example as generated by implementations of find(1) that sup-
	      port the -print0 option.	Without	this option,  the  file	 names
	      are assumed to be	newline	terminated.

       -A     Sort filenames alphabetically. (default)

       -At    Sort filenames chronologically using the modification date
              (oldest first).  This option is not available when you've
              compiled the application with the low memory profile.

       -a     Do not sort files	with same size alphabetically.

       -g size
	      Compare  only files with size greater than size bytes.  (Default
	      is 0.)

       -H     Print human-friendly statistics when at verbose level 2.

       -i     Allow files with the same	device/i-node pair to be added to  the
	      binary  tree.   This  might be useful if output will be fed into
	      some other program.

       -L     Sort filenames in reverse natural order by the number of times
              the file was hard linked.

       -l     Do not report whether identical files are	hard linked.  This op-
	      tion reverses the	effects	of the -r option.

       -m size
              Compare only files with size less than or equal to size bytes.
              Default is 0, which indicates there is no limit.

       -q     Keep the information you receive during processing to a
              minimum.  (Verbose level 0)

       -r     Report whether identical files are hard linked.	The  separator
	      string  followed	by  the	 [bracketed] link count	is appended to
	      each name	pair if	they are hard links created with ln(1).	  This
	      option  is incompatible with the -l option.  Note	that this kind
	      of output	has only four fields and will appear  unsorted	before
	      the actual output	of samefile.

       -S sep Use string sep as the output field separator.  The default is a
              tab character.  Useful if filenames contain tab characters and
              output must be processed by another program, say awk(1).

       -V     Print the	version	information and	exit.

       -v     Increase the amount of information you receive while running
              samefile.  At level 0 you will just see the error messages.  At
              level 1 you will see warning messages indicating that samefile
              couldn't do something.  At level 2 you will receive information
              about the stages that samefile enters and some statistics when
              samefile finishes.  Defaults to verbose level 1.

       -x     By default the program prints just 1 x n lines for each set of
              matches, but with this option it prints m x n lines for each
              set of matches.  (For example, when using the option -i, if two
              files match and one is hard linked twice and the other is hard
              linked three times, you will get 6 lines instead of just 2 or
              3.)

       -Z     Sort filenames in	reversed alphabetical order.

       -Zt    Sort filenames in reverse chronological order using the
              modification date (youngest first).  This option is not
              available when you've compiled the application with the low
              memory profile.
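
       For the -0 option, a NUL-terminated input list can also be built
       without find(1).  The sketch below is only an illustration: the
       function name nul_terminated_names is hypothetical, and leaving out
       symbolic links is an assumption made here to keep the list to plain
       files.

```python
import os


def nul_terminated_names(root: str) -> bytes:
    """Return the plain files below root as one NUL-terminated byte string."""
    names = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Keep plain files only; symbolic links are left out.
            if os.path.isfile(path) and not os.path.islink(path):
                names.append(os.fsencode(path))
    # Every name, including the last one, is followed by a NUL byte, so
    # names containing newlines stay unambiguous.
    return b"".join(name + b"\0" for name in sorted(names))
```

       The returned bytes are meant to be written to the standard input of a
       program that expects NUL-terminated names, such as samefile -0.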

       These programs use two stages to give optimum performance.

       In the first stage, all non-plain files are skipped (directories,
       devices, FIFOs, sockets, symbolic links), as well as files for which
       stat(2) fails and files whose size is less than or equal to the -g
       size or greater than the -m size.

       When the memory is full, samefile will try to store part of the
       filenames temporarily in /tmp/samefile/_pid_.  When samefile is not
       able to do this, it will raise the minimum size and remove paths from
       memory.

       In the second stage, the filenames that are hard linked are reported
       first, assuming option -r was passed to the program.  After this, the
       files are compared and identical files are reported.
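
       The two stages can be pictured with the following sketch.  It is not
       the program's actual implementation: the function find_identical, its
       parameters and the use of lstat are assumptions made for
       illustration, and it ignores the memory handling and the -r report.

```python
import filecmp
import os
import stat
from collections import defaultdict


def find_identical(paths, min_size=0, max_size=0):
    """Return (size, first, other) tuples for files with identical contents."""
    # Stage 1: keep plain files within the size limits, one name per i-node.
    by_size = defaultdict(dict)  # size -> {(device, i-node): first path seen}
    for path in paths:
        try:
            st = os.lstat(path)
        except OSError:
            continue  # stat failed: skip the file
        if not stat.S_ISREG(st.st_mode):
            continue  # directories, devices, FIFOs, sockets, symbolic links
        if st.st_size <= min_size or (max_size and st.st_size > max_size):
            continue  # outside the limits given by min_size and max_size
        by_size[st.st_size].setdefault((st.st_dev, st.st_ino), path)

    # Stage 2: compare contents only among files of the same size.
    matches = []
    for size in sorted(by_size, reverse=True):
        groups = []  # each group holds names with identical contents
        for name in sorted(by_size[size].values()):
            for group in groups:
                if filecmp.cmp(group[0], name, shallow=False):
                    group.append(name)
                    break
            else:
                groups.append([name])
        # Report the first name of a group against every other name in it.
        for group in groups:
            matches.extend((size, group[0], other) for other in group[1:])
    return matches
```

       Grouping by size first means file contents are only read for
       candidates that could possibly match.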

       For any i-node only one filename	will  be  added	 (unless  -i  was  re-

       For each two i-nodes that match, n lines will be printed that show the
       first filename of the first i-node matched against all the filenames
       of the second i-node.  Note, however, that because only the first
       filename per i-node gets into the second stage, the output for a group
       of identical files with different i-node numbers is also minimized.

       Suppose	you  have  six	identical files	of size	100 in an i-node group
       consisting of the three i-nodes with numbers 10,	20 and	30  (the  term
       file  systems  -	 it merely refers to a set of i-nodes addressing files
       with identical contents):

       % ls -i
	  10 file1     20 file4	    30 file6
	  10 file2     20 file5
	  10 file3
       % ls | samefile
       100     file1   file4   =       3       2
       100     file1   file6   =       3       1

       The sum of the sizes in the first column is the amount of disk space
       you could gain by making all 6 files links to only one file or by
       removing all but one of the files.  To be precise, disk space is
       allocated in blocks, so you will probably gain two blocks here, rather
       than 200 bytes.  Note that it is not enough to just remove file4 and
       file6 (you would gain only 100 bytes because file5 still exists).  The
       proper way is to use the -i option.  The output will look like:

       100     file1   file4   =       3       2
       100     file1   file5   =       3       2
       100     file1   file6   =       3       1
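
       The sum of the first column mentioned for the listing produced
       without -i can be computed as sketched here.  The function name
       reclaimable_bytes is hypothetical and the default tab separator is
       assumed; as noted above, the real gain depends on block allocation.

```python
def reclaimable_bytes(output: str, sep: str = "\t") -> int:
    """Add up the size field (first column) of the given output lines."""
    total = 0
    for line in output.splitlines():
        if line.strip():
            total += int(line.split(sep, 1)[0])
    return total


# The two lines printed by "ls | samefile" in the example.
listing = "100\tfile1\tfile4\t=\t3\t2\n100\tfile1\tfile6\t=\t3\t1\n"
print(reclaimable_bytes(listing))  # prints 200
```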

       Removing all files listed in the third field will leave only file1.
       Making all files hard links to file1 is easy.  If the fourth field is
       a ``='', do a forced hard link.  If you need to know about all
       combinations of identical files, use both the -i and -x options.  This
       produces:

       % ls | samefile -ix
       100     file1   file4   =       3       2
       100     file1   file5   =       3       2
       100     file2   file4   =       3       2
       100     file2   file5   =       3       2
       100     file3   file4   =       3       2
       100     file3   file5   =       3       2
       100     file1   file6   =       3       1
       100     file2   file6   =       3       1
       100     file3   file6   =       3       1
       100     file4   file6   =       2       1
       100     file5   file6   =       2       1
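
       The forced hard link mentioned earlier for lines whose fourth field
       is ``='' could be scripted as below.  This is a hedged sketch and not
       a tool shipped with samefile: the function name link_duplicates and
       the temporary suffix .tmp-link are hypothetical, and the default tab
       separator is assumed.  Lines marked X are left alone.

```python
import os


def link_duplicates(output: str, sep: str = "\t") -> int:
    """Turn each third-field file into a hard link to the second-field file."""
    replaced = 0
    for line in output.splitlines():
        fields = line.split(sep)
        if len(fields) != 6 or fields[3] != "=":
            continue  # malformed line, or files on different devices
        keep, duplicate = fields[1], fields[2]
        if os.path.samefile(keep, duplicate):
            continue  # already the same i-node, nothing to do
        # Link under a temporary name first, then rename over the duplicate,
        # so the duplicate name never disappears if a step fails.
        tmp = duplicate + ".tmp-link"  # hypothetical temporary name
        os.link(keep, tmp)
        os.replace(tmp, duplicate)
        replaced += 1
    return replaced
```

       Run it on a copy of the data first; replacing files with hard links
       cannot be undone from the output alone.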


              When the list is too large to fit into memory, samefile and
              samearchive try to temporarily store the paths on disk by
              creating files within the directory /tmp/samefile/_pid_.

       Find all	identical files	in the current working directory:

       % ls | samefile -i

       Find all	identical files	in my HOME directory  and  subdirectories  and
       also tell me if there are hard links:

       % find $HOME -type f -print | samefile -r

       Find all identical files in the /usr directory tree that are bigger
       than 10000 bytes and write the result to /tmp/usr (that one is for the
       sysadmin folks; you may want to put this command in the background
       with the ampersand & because it takes a few minutes):

       % find /usr -type f -print | samefile -g	10000 >	/tmp/usr

       Find all identical files within the system archives that live within
       the current working directory:

       % find /path/to/backup/system-* | samearchive system-*

       inaccessible:  path This	is probably due	to a 'permission denied' error
       on files	or directories within the given	path for  which	 you  have  no
       read permission.

       unreadable: path The file could be opened for reading yet failed while
       being read.  You shouldn't encounter such warnings, but if you do, and
       receive more than a few, this could very well be due to failing hard

       _file.cpp_:_line_ message You can encounter such errors when you've
       compiled the port with debugging information.  Please report such
       messages to the author with some relevant information about how to
       reproduce the bug.

       memory full: written amount path to disk The memory was full and a
       number of paths were temporarily written to disk.

       memory full: changed minimum file size to number The memory was full
       and the program couldn't temporarily write paths to disk, so it raised
       the minimum file size to the given number.  At a later time you could
       rerun the program using the option -m to check the paths that were
       skipped, or are going to be skipped, as a result.

       memory full: aborting... to manny files with the same size There were
       just too many files with the same size to fit into memory from this
       point on.  Try to split the list up and then run the program multiple

       samearchive-lite(1) sameln(1) samesame(1) find(1) ls(1)

       Input  filenames	 must  not have	leading	or trailing white space	unless
       the white space is part of the filename.

       samefile was first written by Jens Schweikhardt in 1996.  It was later
       rewritten by Alex de Kruijff in 2009 in order to improve the
       performance.  In addition, the program became able to handle memory
       allocation problems due to large lists and gained some additional
       options.

       The list is not sorted properly when using the option -x.  This is not
       a bug but a feature.  Proper sorting would consume vast amounts of
       either memory or time.  The sorting options are there just to control
       the output.  (For example, use -Zt if you intend to link with the file
       that was the most recently modified.  You will find that file on the
       left.)

       Alex de Kruijff

				 14 APRIL 2009			   SAMEFILE(1)

