I wanted to gather some statistics about a few Orkut social communities (Orkut is Google's equivalent of MySpace or Facebook): the mean age, the gender ratio and a few other things for a given community. As you'll see, I actually get a lot more than that. The method I'm showing is very simple and can be reused for other purposes. It's automatic enough to grab profiles 15 at a time and build a good sample of a community, but not automatic enough to grab ALL the profiles. Anyway, I guess Google would kick out crawlers trying to grab every profile.
The challenge in collecting this data is that you need to be logged in to crawl Orkut communities. Only then can you grab information from the HTML pages, and only if you handle the navigation properly. I didn't manage to automate the whole process using the Hpricot HTML parser.
Instead I went for a semi-automatic combination of a Greasemonkey script to grab the profiles and a Rails server to store the data. Now I only need to browse every members page of an Orkut community to get all its member profiles stored in my database. I also have a Rails action that exports the data as CSV so I can open it in a spreadsheet editor.

The first thing is to install the Greasemonkey Firefox extension.
Then install the following user script, named orkut_crawler.user.js:
// ==UserScript==
// @name          Orkut_crawler
// @namespace     http://livetribune.org/
// @description   grab orkut properties
// @include       *
// @exclude       http://diveintogreasemonkey.org/*
// @exclude       http://www.diveintogreasemonkey.org/*
// ==/UserScript==
var scripts = [
'http://localhost:3000/javascripts/orkut_crawler.js'
];
// inject each script into the page's <head>
for (var i = 0; i < scripts.length; i++) {
  var script = document.createElement('script');
  script.src = scripts[i];
  document.getElementsByTagName('head')[0].appendChild(script);
}
What we are doing here is injecting a script served by our Rails app into the page. That script bundles the Prototype JavaScript library, which makes it easier to grab the information using CSS selectors. I tried pasting the Prototype library directly into the user script, but it didn't work for some unknown reason. Loading it from the Rails server does work.
You can now make sure the script gets activated when you visit an Orkut page.

OK, now we need to set up our actual orkut_crawler script as well as the Rails backend.
Build a simple Rails app using the ">rails orkut_crawler" command line.
You can use any database. I used the default SQLite database in my case.
Then create the database with the ">rake db:create" command inside the orkut_crawler directory.
Now, let's make a simple persistent model to store our data:
"> ruby script/generate model item"
Now edit the db/migrate/001_create_items.rb migration file and write:
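# Only the class name survived in the original listing; the body below is a reconstruction.
# The uid and properties columns are the ones the controller further down relies on; their types are assumed.
class CreateItems < ActiveRecord::Migration
  def self.up
    create_table :items do |t|
      t.string :uid
      t.text :properties
      t.timestamps
    end
  end

  def self.down
    drop_table :items
  end
end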
Let's create the table: ">rake db:migrate"
Now we need to create the orkut_crawler JavaScript file that will be loaded by our Greasemonkey script. Make a new file called public/javascripts/orkut_crawler.js.
Inside that file, first paste the contents of the Prototype JavaScript library, which you'll find in public/javascripts. Then write this code at the end of the file:
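/* Crawler */
/* Note: the original listing lost several identifiers; names marked "assumed" are reconstructions. */

// Inject a <script> tag pointing at src_url; this is how data is sent to the Rails app.
function snap(src_url, id) { // function name assumed
  var headID = document.getElementsByTagName("head")[0];
  var newScript = document.createElement('script');
  newScript.id = 'snap'; // property assumed
  newScript.type = 'text/javascript';
  newScript.src = src_url;
  headID.appendChild(newScript);
}

// Read one profile row and append it to the global params string as "&key=value".
function extract(line) { // function name assumed
  try {
    var key = line.childNodes[1].innerHTML; // accessors assumed
    var value = line.childNodes[3].innerHTML;
    key = key.replace(" ", "_");
    key = key.replace("/", "_");
    key = key.replace(":", "");
    value = value.replace("\\", "");
    value = value.replace("\"", "");
    value = value.replace(";", " ");
    value = value.replace(":", " ");
    params += "&" + key + "=" + value;
    console.log(key, value);
  } catch(e) {} // rows that don't match the expected layout are skipped
}

// Swap a member cell for an iframe on that member's profile once the delay has elapsed.
function loadProfile(cell, delay) {
  setTimeout(function () {
    cell.innerHTML = "<" + "iframe src='" + cell.childNodes[1].href + "'/>";
  }, delay);
}

var tab;
if (document.location.href.indexOf('CommMembers.aspx') > 0) { // it's a community page
  tab = $$(''); // the CSS selector for the member cells is missing from the original listing
  for (var i = 0; i < tab.length; i++) {
    loadProfile(tab[i], 100 * i); // stagger the iframes by 100 ms each
  }
} else if (document.location.href.indexOf('Profile') > 0) { // it's a profile
  var params = "";
  tab = $$('.listlight');
  for (var i = 0; i < tab.length; i++) { extract(tab[i]); }
  tab = $$('.listdark');
  for (var i = 0; i < tab.length; i++) { extract(tab[i]); }
  params = document.location.search + params; // assumed: the page's own query string, which carries uid
  snap("http://localhost:3000/data/new" + params); // send the params back to our Rails app!
}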
This code grabs the profile whenever you browse an Orkut profile page. If you are browsing a community members page instead, it opens the profile of every member listed on that page, and thus grabs those profiles too.
Finally, we need to write a Rails controller that will persist the data (the 'new' action), render a global CSV file (the 'index' action) and even tell us how many profiles we have (the 'size' action). So edit app/controllers/data_controller.rb this way:
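To make the request format concrete, here is a minimal Ruby sketch of how such a URL breaks down into the key/value pairs the Rails app receives. The URL, its uid value and the property names (age, relationship_status) are made up for illustration, and parse_crawler_url is a hypothetical helper, not part of the app.

```ruby
require 'uri'

# Hypothetical helper: split a crawler request URL into a Hash of parameters.
def parse_crawler_url(url)
  query = URI.parse(url).query.to_s
  # decode_www_form returns [[key, value], ...]; to_h turns it into a Hash
  URI.decode_www_form(query).to_h
end

# Invented example: the profile's own uid first, then the scraped pairs.
url = 'http://localhost:3000/data/new?uid=12345&age=27&relationship_status=single'
params = parse_crawler_url(url)
puts params.inspect
```

In the Rails app itself this decoding is done by the framework, which is why the controller can simply read params[:uid].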
require 'cgi'
class DataController < ApplicationController
  def new
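    # keep only the scraped properties: drop the routing parameters Rails adds
    params.delete 'action'
    params.delete 'controller'
    # escape every value so it survives being stored and exported
    params.each_key {|key| params[key] = CGI.escape(params[key])}
    puts params.inspect
    # update the profile if we already stored this uid, otherwise create it
    if existing = Item.find_by_uid(params[:uid])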
      item = existing
    else
      item = Item.new
      item.uid = params[:uid]
    end
    item.properties = params.inspect
    if item.save
      render :text => 'UPDATED' and return if existing
      render :text => 'SAVED!'
    else
      render :text => 'ERROR!'
    end
  end
  def size
    # count in the database instead of loading every row
    render :text => Item.count.to_s
  end
  def index
    all_props = []
    @items = Item.find(:all)
    res = ""
    # properties were stored with Hash#inspect, so eval turns them back into hashes
    maps = @items.map { |item| eval item.properties }
    # collect every property name seen on any profile; these become the columns
    maps.each do |map|
      map.each_key do |key|
        all_props << key unless all_props.index key
      end
    end
    # header row
    all_props.each do |key|
      res +=  key + "; "
    end
    res = res[0..-3] + "\n"
    # one row per profile; line breaks and semicolons inside values are neutralized
    maps.each do |map|
      all_props.each do |key|
        res +=  CGI.unescape(map[key].to_s).gsub(/[\r\n]/, " ").gsub(";", " - ") + "; "
      end
      res = res[0..-3] + "\n"
    end
    render :text => res
  end
end
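The controller round-trips each profile through inspect and eval and builds the CSV by hand. That works, but eval on stored data is usually something to avoid. As a hedged alternative, here is a minimal sketch, outside Rails, of the same store-and-export logic using Ruby's standard json and csv libraries. dump_properties, load_properties and export_csv are hypothetical helper names, and the sample profiles are invented.

```ruby
require 'json'
require 'csv'

# Serialize a profile's properties for storage (in place of Hash#inspect).
def dump_properties(props)
  JSON.generate(props)
end

# Read stored properties back without eval.
def load_properties(text)
  JSON.parse(text)
end

# Build a semicolon-separated export from a list of stored JSON strings.
def export_csv(stored)
  rows = stored.map { |text| load_properties(text) }
  # every property name seen on any profile becomes a column
  headers = rows.flat_map(&:keys).uniq
  CSV.generate(col_sep: ';') do |csv|
    csv << headers
    rows.each do |row|
      # missing properties stay nil (empty cell); line breaks become spaces
      csv << headers.map { |key| row[key]&.to_s&.gsub(/[\r\n]+/, ' ') }
    end
  end
end

stored = [
  dump_properties('uid' => '1', 'age' => '27'),
  dump_properties('uid' => '2', 'city' => 'Rio; Brazil')
]
puts export_csv(stored)
```

Because the csv library quotes fields that contain the separator, values no longer need their semicolons replaced before export.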
OK, we are done.
Now make sure our Rails server is running:
"> ruby script/server"
You can also track what is going on:
"tail -f log/development.log"
Now visit some Orkut community members pages, which follow this URL pattern:
http://www.orkut.com/CommMembers.aspx?cmm=10087467&tab=0&na=3&nst=151&nid=0

As you'll see, instead of the usual member icons, we are now loading the members' profiles inside iframes. At the same time, you can check in your terminal that our Rails server is storing all the profiles.
Finally, look at the profiles you collected.
See how many profiles you have collected: http://localhost:3000/data/size
Now let's export the data as CSV (this can take a while depending on how much data you stored).
In your terminal, execute:
">wget http://localhost:3000/data"
Now rename the downloaded data file to data.csv and open it in OpenOffice, for instance.

OK, you are done. Now you can start computing some statistics, but that's for round two!
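As a small head start, here is a rough Ruby sketch of the kind of computation the export allows. It assumes the "; "-separated layout produced by the index action, and the column names age and gender are hypothetical: the real names depend on the labels Orkut shows on each profile. parse_export, mean and shares are illustrative helpers and the sample data is invented.

```ruby
# Parse the "; "-separated export into an array of hashes keyed by header.
def parse_export(text)
  lines = text.lines.map(&:chomp).reject(&:empty?)
  headers = lines.shift.split('; ', -1)
  lines.map { |line| headers.zip(line.split('; ', -1)).to_h }
end

# Mean of a numeric column, ignoring blank or non-numeric cells.
def mean(profiles, column)
  values = profiles.map { |p| p[column].to_s }.grep(/\A\d+\z/).map(&:to_i)
  values.empty? ? nil : values.sum.to_f / values.size
end

# Share of each distinct value in a column, blank cells excluded.
def shares(profiles, column)
  values = profiles.map { |p| p[column].to_s.strip }.reject(&:empty?)
  values.tally.transform_values { |count| count.to_f / values.size }
end

# Invented sample in the same layout as the export.
sample = <<~DATA
  uid; age; gender
  1; 27; male
  2; ; female
  3; 33; female
DATA

profiles = parse_export(sample)
puts mean(profiles, 'age')
puts shares(profiles, 'gender').inspect
```

Profiles with a missing age are simply left out of the mean rather than counted as zero.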