[ale] 117000 files vs 240 missing - amazon

Michael H. Warfield mhw at WittsEnd.com
Mon Nov 25 10:26:45 EST 2013


On Thu, 2013-11-21 at 21:59 +0000, Lightner, Jeff wrote: 
> A vendor put a site on Amazon with some files we need.  We don’t have
> sftp access to this Amazon site but do have ftp access.
> 
> Accordingly we did a wget to download all the files using our ftp
> credentials.  When all done we got over 117,000 files and saw no
> errors in the wget.
> 
> The problem is that the vendor is telling our director there are 240
> more files in their count than we downloaded.  This is roughly a 0.2%
> difference, so I suspect it has something to do with the way they
> count vs. the way we did.  (We used find piped to wc -l.)  Our count
> matches the summary wget output when it finished, so we are sure
> we’re correctly counting what wget did, but of course it’s possible
> wget actually missed something, though it seems unlikely to me.
> 
> The question is: does anyone know what might cause such a difference?
> Alternatively, does anyone know another way we could count the files
> on the Amazon site using our ftp credentials other than going in and
> counting them one by one?
> 
I can think of several reasons why their count might be off, and the
reasons differ depending on whether they were running on Windows, Mac,
or *NIX.  To really know, you need to find out how they counted noses
in a complex directory hierarchy (did they accidentally count . and ..
in the directories, for instance?).  They should have provided you
with a directory tree listing in the root of that tree so you could
compare.  If they can, they should go back and create an "ls -R"
listing in that directory.  Sending something blind like that with no
verification information seems rather incompetent to me.
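
If the vendor does produce an "ls -R" listing, the comparison could be
scripted along these lines.  This is only a sketch: the function names
are made up for illustration, and it assumes plain "ls -R" output
(no -l, no -F) taken from the top of the tree.  Note that plain
"ls -R" skips dotfiles unless -a or -A is given, which is itself one
way two counts can disagree.

```python
import os
import posixpath


def parse_ls_r(listing):
    """Return relative paths of non-directory entries in plain `ls -R` output.

    Assumption: a line ending in ":" that opens the listing or follows a
    blank line is a directory header. A file whose name ends in ":" right
    after a blank line would be misread, so treat the result as a guide.
    """
    dirs = set()
    entries = set()
    current = "."
    expect_header = True
    for line in listing.splitlines():
        if not line.strip():
            expect_header = True
            continue
        if expect_header and line.endswith(":"):
            current = posixpath.normpath(line[:-1])
            dirs.add(current)
            expect_header = False
            continue
        expect_header = False
        entries.add(posixpath.normpath(posixpath.join(current, line)))
    # Subdirectories show up both as entries and as headers; drop them.
    return entries - dirs


def local_files(root):
    """Return relative paths (with "/" separators) of files under root."""
    found = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        rel_dir = os.path.relpath(dirpath, root).replace(os.sep, "/")
        for name in filenames:
            found.add(posixpath.normpath(posixpath.join(rel_dir, name)))
    return found


# Hypothetical usage, with made-up file and directory names:
#   with open("vendor-ls-R.txt") as fh:
#       missing = parse_ls_r(fh.read()) - local_files("download")
#   print(len(missing), "paths in the vendor listing but not on disk")
```

The set difference names the specific paths that are missing, which is
more useful than arguing over two totals.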

That being said, my next step would be to use curl instead of wget.
There are some circumstances, albeit rare ones, where wget does not do
the right thing but curl does.  Most of them involve HTTP redirects,
but there are others.

Curl also has an ftp option (--ftp-method) for fine-grained control
over whether it uses multiple CWD commands, a single CWD command, or no
CWD commands when retrieving a tree.  Depending on the ftp server, this
can make a big difference (note: multicwd is the slowest but the most
formally correct by RFC).
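
One way to see whether the CWD strategy matters for a given server is
to list the same directory with each method and compare the results.
The sketch below just wraps the curl command line; the helper names,
host, and credentials are placeholders, not anything from the vendor's
setup.  Passing "user:password" on the command line exposes it to
other users on the machine, so a .netrc file may be preferable.

```python
import subprocess

# The three values curl accepts for --ftp-method.
FTP_METHODS = ("multicwd", "singlecwd", "nocwd")


def curl_list_command(url, user, method):
    """Build a curl command that lists names in an FTP directory."""
    if method not in FTP_METHODS:
        raise ValueError("unknown ftp method: %r" % (method,))
    return [
        "curl", "--silent", "--show-error",
        "--ftp-method", method,
        "--list-only",       # names only
        "--user", user,      # "name:password"
        url,                 # should end in "/" to denote a directory
    ]


def list_names(url, user, method):
    """Run curl and return the listed names, one per output line."""
    result = subprocess.run(
        curl_list_command(url, user, method),
        capture_output=True, text=True, check=True,
    )
    return result.stdout.splitlines()


# Hypothetical usage (host and credentials are placeholders):
#   for method in FTP_METHODS:
#       names = list_names("ftp://ftp.example.com/data/", "user:secret", method)
#       print(method, len(names))
```

If the three methods report different counts for the same directory,
that points at the server's path handling rather than at the counting.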

I would also use curl's list-only option (-l / --list-only), which
issues NLST, in a shell script to simulate a recursive listing: parse
out the directories, issue a listing command for each one to drill into
the hierarchy, then count the files from the resulting hairball.
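
The same idea can be sketched with Python's ftplib in place of a shell
script around curl (a swapped-in tool, not the method described
above).  NLST returns bare names, so this version decides whether an
entry is a directory by attempting a CWD into it; that costs one extra
round trip per entry and would be slow across 117,000 files.  The
function names are illustrative, and server behaviour varies (some
return full paths from NLST, some answer 550 for an empty directory),
so treat it as a starting point.

```python
import ftplib
import posixpath


def is_directory(ftp, path):
    """Guess whether path is a directory by trying to CWD into it."""
    try:
        ftp.cwd(path)
    except ftplib.error_perm:
        return False
    return True


def walk_ftp(ftp, top="/"):
    """Yield the full path of every non-directory entry under top."""
    pending = [top]
    while pending:
        directory = pending.pop()
        try:
            names = ftp.nlst(directory)
        except ftplib.error_perm:
            names = []  # some servers reply 550 for an empty directory
        for name in names:
            # Servers differ on whether NLST returns bare names or paths.
            base = posixpath.basename(name.rstrip("/"))
            if base in ("", ".", ".."):
                continue
            path = posixpath.join(directory, base)
            if is_directory(ftp, path):
                pending.append(path)
            else:
                yield path


# Hypothetical usage (host and credentials are placeholders):
#   with ftplib.FTP("ftp.example.com") as ftp:
#       ftp.login("user", "secret")
#       print(sum(1 for _ in walk_ftp(ftp, "/")))
```

Skipping "." and ".." explicitly matters here, since counting them is
one of the ways a total can drift.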

You mentioned in another message that you had also done a find for
files and directories and added them up, which matched your total.
What were the specific counts: files, directories, your total, and
their expected total?
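
For the local side, a breakdown like the following would give the
separate numbers.  It is a sketch with an illustrative function name,
and it counts slightly differently from "find -type f": anything that
is not a directory (including symlinks to files) lands in the file
count.

```python
import os


def count_tree(root):
    """Return (files, directories) found below root, excluding root itself."""
    files = 0
    dirs = 0
    for _dirpath, dirnames, filenames in os.walk(root):
        dirs += len(dirnames)
        files += len(filenames)
    return files, dirs


# Hypothetical usage ("download" is a placeholder directory name):
#   files, dirs = count_tree("download")
#   print("files:", files, "directories:", dirs)
#   # A bare `find download | wc -l` would report files + dirs + 1,
#   # because find also prints the starting directory.
```

Comparing the vendor's number against files alone, and against files
plus directories, shows quickly whether directories account for the
gap.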
> 
> We’re trying to find out how the vendor did their count but I was
> hoping someone already knows of some vagary on Amazon sites that would
> cause this kind of discrepancy.

-- 
Michael H. Warfield (AI4NB) | (770) 978-7061 |  mhw at WittsEnd.com
   /\/\|=mhw=|\/\/          | (678) 463-0932 |  http://www.wittsend.com/mhw/
   NIC whois: MHW9          | An optimist believes we live in the best of all
 PGP Key: 0x674627FF        | possible worlds.  A pessimist is sure of it!
