Hi,
I wanted to gather some statistics about Orkut (Google's equivalent of MySpace or Facebook) communities: the mean age, the gender ratio and other things for a given community. And I'm actually getting a lot more, as you'll see. The method I'm showing is very simple and can be reused for other purposes. It's automatic enough to grab profiles 15 at a time and build a good sample of the community, but not enough to grab ALL the profiles. Anyway, I guess Google would kick out crawlers trying to grab all the profiles.
The challenge in collecting this data is that you need to be logged in to crawl Orkut communities. Only then can you grab information from the HTML pages, provided you manage to handle the navigation properly. I didn't manage to automate the whole process using the Hpricot HTML parser.
Instead I went for a semi-automatic combination of a GreaseMonkey script to grab the profiles and a Rails server to store the data. Now I only need to browse every members page of an Orkut community to get all its member profiles stored in my database. I also have a Rails action to export the data as CSV so I can open it in a spreadsheet editor.
The first thing is to install the GreaseMonkey Firefox plugin.
Then install the following user script, named orkut_crawler.user.js:
// ==UserScript==
// @name Orkut_crawler
// @namespace http://livetribune.org/
// @description grab orkut properties
// @include *
// @exclude http://diveintogreasemonkey.org/*
// @exclude http://www.diveintogreasemonkey.org/*
// ==/UserScript==
var scripts = [
  'http://localhost:3000/javascripts/orkut_crawler.js'
];
// Inject each script into the page's <head> so it runs in the page context.
for (var i = 0; i < scripts.length; i++) {
  var script = document.createElement('script');
  script.src = scripts[i];
  document.getElementsByTagName('head')[0].appendChild(script);
}
What we are doing here is injecting our own script, which bundles the Prototype JavaScript library, to make it easier to grab the information using CSS selectors. I tried to paste the Prototype library directly inside the user script but it didn't work for some unknown reason. Anyway, this solution works.
You can now make sure the script gets activated when you visit an Orkut page.
OK, now we need to set up our actual orkut_crawler script as well as the Rails backend.
Build a simple Rails app with the ">rails orkut_crawler" command.
You can use any database. I used the default SQLite database in my case.
Then create the database with the ">rake db:create" command inside the orkut_crawler directory.
Now, let's make a simple persistent model to store our data:
"> ruby script/generate model item"
Now edit the db/migrate/001_create_items.rb migration file and write:
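class CreateItems < ActiveRecord::Migration
  # uid and properties are the two columns the controller below relies on;
  # the column types chosen here are assumed.
  def self.up
    create_table :items do |t|
      t.string :uid
      t.text :properties
      t.timestamps
    end
  end

  def self.down
    drop_table :items
  end
end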
Let's create the table: ">rake db:migrate"
Now we need to create the orkut_crawler JavaScript file that will be loaded by our GreaseMonkey script. Make a new file called public/javascripts/orkut_crawler.js
Inside that file, first paste the contents of the Prototype JavaScript library (prototype.js, which you'll find in public/javascripts). Then write this code at the end of the file:
/* Crawler */
// Load a URL by appending a <script> tag to the page; used to send data to the Rails app.
function appendNewScript(src_url, id) {
  var headID = document.getElementsByTagName("head")[0];
  var newScript = document.createElement('script');
  newScript.id = id || 'snap';
  newScript.type = 'text/javascript';
  newScript.src = src_url;
  headID.appendChild(newScript);
}
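// Read one profile row (label in childNodes[1], value in childNodes[3])
// and append it to the global `params` query string.
function addTuple(line) {
  try {
    var key = line.childNodes[1].innerHTML;
    var value = line.childNodes[3].innerHTML;
    // A string pattern only replaces the first match, so use global regexes.
    key = key.replace(/[ \/]/g, "_").replace(/:/g, "");
    value = value.replace(/[\\"]/g, "").replace(/[;:]/g, " ");
    // Percent-encode both sides so "&", "=" or "#" in a value cannot break the URL.
    params += "&" + encodeURIComponent(key) + "=" + encodeURIComponent(value);
    console.log(key, value);
  } catch (e) {
    // Rows without the expected label/value cells are skipped.
  }
}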
var tab;
if (document.location.href.indexOf('CommMembers.aspx') > 0) { // it's a community page
  tab = $$('.listitem');
  for (var i = 0; i < tab.length; i++) {
    // setTimeout needs a function; the closure keeps each item and its own delay.
    (function (item, delay) {
      setTimeout(function () {
        item.innerHTML = "<" + "iframe src='" + item.childNodes[1].href + "'></iframe>";
      }, delay);
    })(tab[i], 100 * i);
  }
} else if (document.location.href.indexOf('Profile.aspx') > 0) { // it's a profile
  var params = "";
  tab = $$('.listlight');
  for (var i = 0; i < tab.length; i++) { addTuple(tab[i]); }
  tab = $$('.listdark');
  for (var i = 0; i < tab.length; i++) { addTuple(tab[i]); }
  params = document.location.search + params;
  appendNewScript("http://localhost:3000/data/new" + params); // send the params back to our Rails app!
}
This code grabs the profile when you browse an Orkut profile page. If you are browsing a community members page instead, it opens the profiles of all the members listed on that page in iframes, and thus grabs those profiles too.
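To make the profile-grabbing step easier to follow, here is a minimal standalone sketch of how label/value pairs could be turned into the query string sent to the Rails app. The helper names toParamKey and toQueryString and the sample labels are assumptions made for illustration; they are not part of the crawler itself.

```javascript
// Hypothetical helpers, for illustration only.

// Turn a profile label such as "relationship status:" into a parameter name.
function toParamKey(label) {
  return label
    .trim()
    .replace(/:/g, "")          // drop colons
    .replace(/[\s\/]+/g, "_");  // runs of spaces or slashes become one underscore
}

// Build "&key=value" pairs with both sides percent-encoded.
function toQueryString(pairs) {
  return pairs
    .map(function (pair) {
      return "&" + encodeURIComponent(toParamKey(pair[0])) +
             "=" + encodeURIComponent(pair[1]);
    })
    .join("");
}

// Made-up sample labels and values.
var example = toQueryString([
  ["relationship status:", "single"],
  ["languages I speak:", "English & French"]
]);
console.log(example);
// &relationship_status=single&languages_I_speak=English%20%26%20French
```

Percent-encoding the values means a "&" inside a profile field is not mistaken for a parameter separator on the Rails side.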
Finally, we need to write a Rails controller that will persist the data (the 'new' action), render a global CSV file (the 'index' action) and even tell how many profiles we have (the 'size' action). So edit app/controllers/data_controller.rb this way:
require 'cgi'

class DataController < ApplicationController
  # Store (or update) one profile; the crawler calls this action with the
  # profile fields as query parameters.
  def new
    params.delete 'action'
    params.delete 'controller'
    # Escape the values so the inspected hash can be read back later.
    params.each_key { |key| params[key] = CGI.escape(params[key]) }
    puts params.inspect
    if existing = Item.find_by_uid(params[:uid])
      item = existing
    else
      item = Item.new
      item.uid = params[:uid]
    end
    item.properties = params.inspect
    if item.save
      render :text => 'UPDATED' and return if existing
      render :text => 'SAVED!'
    else
      render :text => 'ERROR!'
    end
  end

  # Number of profiles stored so far.
  def size
    render :text => Item.count.to_s
  end

  # Export every profile as "; "-separated text, one profile per line.
  def index
    # properties holds the Hash#inspect output saved by `new`, so eval rebuilds the hash.
    maps = Item.find(:all).map { |item| eval(item.properties) }
    # The header is the union of every property name seen across all profiles.
    all_props = []
    maps.each do |map|
      map.each_key { |key| all_props << key unless all_props.include?(key) }
    end
    res = all_props.join("; ") + "\n"
    maps.each do |map|
      cells = all_props.map do |key|
        CGI.unescape(map[key].to_s).gsub(/[\r\n]/, " ").gsub(";", " - ")
      end
      res += cells.join("; ") + "\n"
    end
    render :text => res
  end
end
OK, we are done.
Now make sure our Rails server is running:
"> ruby script/server"
You can also track what is going on:
"tail -f log/development.log"
Now visit some Orkut community members pages with this URL pattern:
http://www.orkut.com/CommMembers.aspx?cmm=10087467&tab=0&na=3&nst=151&nid=0
As you'll see, instead of the normal member icons, we are now loading the member profiles inside iframes. At the same time, you can check in your terminal that our Rails server is storing all the profiles.
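The iframes are created 100 ms apart rather than all at once. A minimal sketch of that staggering, without any DOM code, could look like the following. The helper names staggeredSchedule and runStaggered and the placeholder URLs are assumptions for illustration.

```javascript
// Hypothetical helper: pair each profile URL with the delay (in ms)
// after which its iframe should be created.
function staggeredSchedule(urls, stepMs) {
  return urls.map(function (url, index) {
    return { url: url, delay: stepMs * index };
  });
}

// Run action(url) for every entry once its delay has elapsed.
// setTimeout must be given a function; passing the result of an
// expression would do the work immediately instead of later.
function runStaggered(urls, stepMs, action) {
  staggeredSchedule(urls, stepMs).forEach(function (entry) {
    setTimeout(function () { action(entry.url); }, entry.delay);
  });
}

// Placeholder URLs, for illustration only.
var plan = staggeredSchedule(
  ["Profile.aspx?uid=1", "Profile.aspx?uid=2", "Profile.aspx?uid=3"],
  100
);
console.log(plan.map(function (entry) { return entry.delay; }));
```

With a step of 100 ms the three placeholder profiles are scheduled at 0, 100 and 200 ms, which spreads out the requests to Orkut and to the local Rails server.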
Finally, look at the profiles you collected.
See how many profiles you have collected: http://localhost:3000/data/size
Now let's export the data as CSV (this can take a while depending on how much data you stored).
In your terminal, execute:
">wget http://localhost:3000/data"
Now rename the data file to data.csv and open it in OpenOffice, for instance.
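If you would rather read the export from a script than from a spreadsheet, here is a minimal sketch that parses it into objects. It assumes the format the 'index' action produces (a header line, then one line per profile, with cells separated by "; "). The helper name parseExport and the sample columns are made up for illustration.

```javascript
// Hypothetical helper: parse the "; "-separated export into an array of objects.
function parseExport(text) {
  var lines = text.split("\n").filter(function (line) {
    return line.length > 0;
  });
  var header = lines[0].split("; ");
  return lines.slice(1).map(function (line) {
    var cells = line.split("; ");
    var row = {};
    header.forEach(function (name, i) {
      // Missing trailing cells are treated as empty strings.
      row[name] = cells[i] || "";
    });
    return row;
  });
}

// Made-up sample in the same shape as the export (column names are assumptions).
var sample = "uid; age; gender\n123; 27; male\n456; ; female\n";
var rows = parseExport(sample);
console.log(rows.length);    // 2
console.log(rows[1].gender); // female
```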
Ok, you are done. Now you can start making some statistics, but that's for round two!