[ale] Bash/Python Question

Brian Pitts brian at polibyte.com
Tue Mar 16 22:49:10 EDT 2010


On 03/16/2010 09:13 PM, Jim Popovitch wrote:
> On Tue, Mar 16, 2010 at 02:47, Omar Chanouha <ofosho at gatech.edu> wrote:
>> Hey all,
>>
>>   I am creating an information gatherer for a school project.
> 
> There are 2 ways to do this.   Figure it out by yourself, or figure it
> out with assistance from others.  Those that learn how to ask for
> assistance and help, will go further in life than those that beat
> their heads against the wall all day.   Since you asked for help....
> Have a look at the attached; I wrote it about a year ago.  It fetches RSS
> feed(s) and downloads items that contain videos.  The one feed
> currently in it is a public shared google reader feed that contains
> various online videos.  I use this script to periodically download the
> videos, queue them, and re-encode them to work on my blackberry.   The
> code should be sort of self-explanatory, essentially there are 2
> queues (DownloadQueue and EncodeQueue) that are subclasses of
> WorkQueue.  Each item added to a Queue is assigned a TaskRunner
> (DownloadTask and EncodeTask).   The main loop just looks for finished
> tasks, pops them, and moves the next item from each Queue's queued
> list into that Queue's active list.
> 
> It's not perfect, it still needs some tweaks.. but it works for what I
> need it to do.
> 
> You will need Python's feedparser lib.   Enjoy

I'm curious Jim, why wget instead of urllib2?

On the subject of useful Python libraries for web data, chardet is
fantastic. When the character set isn't specified in the HTTP headers or
in the content (e.g. a meta tag in HTML), chardet can often look at the
raw bytes and figure it out. I crawled around 4,000,000 pages recently
and ended up with fewer than 10,000 that I couldn't decode.
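
A minimal sketch of that fallback, assuming Python 3 and the chardet
package (chardet.detect() returns a dict with an 'encoding' key). The
function name decode_page and its declared parameter are hypothetical,
made up for illustration, and are not taken from the crawl described
above.

```python
try:
    import chardet  # third-party package: pip install chardet
except ImportError:
    chardet = None  # the sketch still runs, but without detection


def decode_page(raw, declared=None):
    """Return the page as text, or None if no candidate charset works.

    raw      -- the response body as bytes
    declared -- charset named in the HTTP headers or a meta tag, if any
                (hypothetical parameter, supplied by the caller)
    """
    candidates = []
    if declared:
        # Trust the declared charset first.
        candidates.append(declared)
    if chardet is not None:
        # Ask chardet to guess from the bytes; 'encoding' may be None.
        guess = chardet.detect(raw)["encoding"]
        if guess:
            candidates.append(guess)
    else:
        # Assumption: without chardet, UTF-8 is the only fallback tried.
        candidates.append("utf-8")

    for encoding in candidates:
        try:
            return raw.decode(encoding)
        except (LookupError, UnicodeDecodeError):
            # Unknown charset name or bytes that don't fit; try the next.
            continue
    return None
```

Calling decode_page(body, declared) with the bytes of a fetched page
and whatever charset the server claimed gives back a string, or None
for pages that would land in the "couldn't decode" pile.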

-- 
All the best,
Brian Pitts

