[ale] 100 million Facebook pages leaked on torrent site
Michael B. Trausch
mike at trausch.us
Sun Aug 1 11:15:45 EDT 2010
On Fri, 2010-07-30 at 11:55 -0400, Jim Philips wrote:
> I saw a report today that major corporations are already downloading
> the file through BitTorrent. A free goldmine of information for them!
I have already downloaded it myself, just to take a look at what's
actually in the whole thing.
There is a *lot* of data, mostly names, but also URLs to profile pages
for each of those names. It's about 17GB worth of data, enough to burn
to a BD-R for storage. It's not indexed, just plain text, along with
counts for various names, which could be used to determine popularity,
for example.
I can see some of this data taking the place of 1930 Census data as a
reference list of proper names, so that businesses that rely on such
lists to parse free-form documents would benefit.
Here are the ten most listed first names (with frequency of occurrence):
977014 michael
963693 john
924816 david
819879 chris
640957 mike
602088 james
584438 mark
515686 jason
503658 robert
484403 jessica
And the ten most listed last names (also with frequency of occurrence):
913465 smith
571819 johnson
512312 jones
503266 williams
471390 brown
386764 lee
360010 khan
355639 singh
343220 kumar
324972 miller
I guess "Michael Smith" would be the most generic name possible if you
look at those numbers. :-)
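For anyone who wants to poke at tallies like these, here is a minimal
sketch in Python. It assumes the counts are available as "COUNT NAME"
lines, which is only how I've shown them above; the layout of the files
in the torrent may well differ. The function names are made up for
illustration, and only the top three of each list are included.

```python
def parse_counts(lines):
    """Parse 'COUNT NAME' lines into a list of (name, count) tuples."""
    pairs = []
    for line in lines:
        parts = line.split()
        # Skip blank or malformed lines rather than failing on them.
        if len(parts) != 2 or not parts[0].isdigit():
            continue
        pairs.append((parts[1], int(parts[0])))
    return pairs


def most_common(pairs):
    """Return the name with the highest count."""
    return max(pairs, key=lambda pair: pair[1])[0]


# Top three of each list, copied from the tallies in this message.
FIRST_LINES = ["977014 michael", "963693 john", "924816 david"]
LAST_LINES = ["913465 smith", "571819 johnson", "512312 jones"]

first = most_common(parse_counts(FIRST_LINES))
last = most_common(parse_counts(LAST_LINES))
generic = f"{first.title()} {last.title()}"
print(generic)  # prints "Michael Smith"
```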
I'm not sure what there really is in terms of useful data that companies
could use, other than having a large pool of names to be able to pick
from for things like random name generators, or parsers that look for
proper names in freeform documents, or other fairly specific things such
as that. Perhaps it's possible to use it for more than I envision, as
well.
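To make those two uses a bit more concrete, here is a rough sketch in
Python of a frequency-weighted random name generator and a very naive
proper-name spotter. It is an illustration only: the dictionaries hold
just the top five first and last names from the tallies in this
message, the function names are hypothetical, and a real parser would
need far more than a word-pair lookup.

```python
import random
import re

# Counts copied from the tallies in this message (top five of each).
FIRST = {"michael": 977014, "john": 963693, "david": 924816,
         "chris": 819879, "mike": 640957}
LAST = {"smith": 913465, "johnson": 571819, "jones": 512312,
        "williams": 503266, "brown": 471390}


def random_name(rng):
    """Pick a first and last name, weighted by how often each was listed."""
    first = rng.choices(list(FIRST), weights=list(FIRST.values()))[0]
    last = rng.choices(list(LAST), weights=list(LAST.values()))[0]
    return f"{first.title()} {last.title()}"


def find_names(text):
    """Return adjacent word pairs where both words are in the name lists."""
    words = re.findall(r"[A-Za-z]+", text)
    hits = []
    for a, b in zip(words, words[1:]):
        if a.lower() in FIRST and b.lower() in LAST:
            hits.append(f"{a} {b}")
    return hits


rng = random.Random(2010)  # fixed seed so repeated runs give the same names
samples = [random_name(rng) for _ in range(3)]
found = find_names("Invoice sent to John Smith and cc'd to David Brown.")
print(samples)
print(found)  # prints ['John Smith', 'David Brown']
```

The weighting means "Michael" turns up more often than "Mike", roughly
in proportion to the counts. The spotter will happily misfire on
ordinary words that double as names, which is part of why a bigger
list with counts would be worth having.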
It seems (at least from where I sit) that the Web site that is supposed
to have more information about the whole thing is unreachable; I get 17
hops before my packets to the thing enter some form of black hole on the
Internet in Canada. Oops.
Anyway, it's interesting, though of only limited use, I think; I don't
know that it contains enough information (by itself) to be harmful,
though I suppose that if you could combine it with other databases that
have additional data, it could be potentially detrimental.
One thing that I had expected to see based on all the chatter about it
was some form of relationship graph, say, showing who has friended whom
on Facebook. That would be something that I could see companies easily
(ab)using for things like debt collection purposes. However, that sort
of data doesn't seem to be present, which I would consider to be a good
thing.
--- Mike