Friday, July 11, 2008

OLAP component: Mondrian+JPivot or Flex OLAPDataGrid?

Recently I've been spending a few hours digging into OLAP cube components (a special kind of table used in Business Intelligence to analyse large, multi-dimensional data). I wanted to produce a nice demo with an OLAP cube plugged into the OpenERP (by far the best open source ERP) database.

I actually saw a guy demoing such an OLAP Flex component along with Openbravo ERP (it looked nice, but few details were provided): http://opensourceerpguru.com/
Anyway, I wanted to achieve the same thing, but this time using OpenERP (I feel much more comfortable with its elegant and efficient architecture, not to mention the features and the business model) and possibly JRuby on Rails, my favourite Swiss Army knife, to pull the data into the OLAP viewer.

So at first I was stunned by the Flex OLAP component (called OLAPDataGrid). It looks much nicer than the old-fashioned Mondrian + JPivot, so I decided to give it a try. It was my first time with Flex, and my first trip back to half-open-source land in a long while. I had to follow the whole Adobe crap flow: create a fucking account, read all their commercial crap, agree to their whatever license, and finally download the stuff and start playing with it. Then I noticed that all my nice FlexBuilder OLAP samples came with a "Flex Data Visualization Trial" watermark. OK, time to remember: FlexBuilder is not open source yet, so let's go with the Flex SDK instead. Back through the Adobe legacy crap, download again, try again (I had heard the SDK was open source, or sort of).

Then I tried to compile my mxml component with the following command line:

flex_sdk_3/bin$ ./mxmlc /home/rvalyi/DEV/olap_test/src/olap.mxml
Loading configuration file /home/rvalyi/Desktop/flex_sdk_3/frameworks/flex-config.xml
/home/rvalyi/DEV/olap_test/src/olap.mxml(162): Error: Could not resolve <mx:OLAPCube> to a component implementation.

    <mx:OLAPCube id="myMXMLCube"

/home/rvalyi/DEV/olap_test/src/olap.mxml(195): Error: Could not resolve <mx:OLAPDataGrid> to a component implementation.

WTF ???

Googled the error message and got the official answer from an Adobe employee on a forum here: http://www.codeverge.net/item.aspx?item=101509

The OLAPDatagrid component is available only in the Flex Builder Professional
version. For the first problem, the one where you are building a Flex + LCDS
2.5.1 project, the answer is that the LCDS 2.5.1 doesn't include the
OLAPDatagrid component and that is way you are getting the errors on runtime.


You f****** b*st*rds, you got me! So that's how I lost a few hours, trapped by Adobe's half-open-source policy. Go to hell!

Mondrian or Flex OLAPDataGrid?
Well, back to Mondrian and JPivot...

OK, Mondrian and JPivot are certainly not optimal; they feel more like a bloated, not-very-HTTP-friendly piece of code, but hey, they're free and they just work. So until Tiny.be releases its awesome open source "TinyBI framework" (for October?), I'll stick with them. It will be a long time before I try Flex and Flash again, I promise.

Finally, I should say that I don't really know how that Flex component would deal with a large database anyway, since it seems to be an in-memory, client-side-only OLAP solution. Even if that were to change, I'm not sure I'd feel comfortable feeding it the right data pieces as the OLAP component requires them, and none of the existing samples come with drill-down, slicer or rotation widgets, so I'm not sure how easily one can interact with the cube.

May this post save you some time.

Sunday, January 20, 2008

Orkut social network community profiling with GreaseMonkey and Rails, round one.

Hi,

I wanted to figure out some statistics about a few Orkut (Google's equivalent of MySpace or Facebook) communities. I wanted to know the mean age, the gender ratio and other things among given communities. And actually I'm getting a lot more, as you'll see. The method I'm showing is very simple and can be reused for other purposes. It's automated enough to grab profiles fifteen at a time and build a good sample of the community, but not enough to grab ALL the profiles. Anyway, I guess Google would kick out crawlers trying to grab all the profiles.

The challenge in collecting this data is that you need to be logged in to crawl Orkut communities. Only then can you grab information from the HTML pages, and only if you manage to handle the navigation properly. I didn't manage to automate the whole process using the Hpricot HTML parser.
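For the record, here is roughly what that abandoned Hpricot approach looks like. This is only a sketch: the '.listitem' CSS class is the same one the GreaseMonkey script below relies on, and the cookie value is a placeholder you would have to copy from a logged-in browser session, which is exactly the part that makes full automation painful:

require 'rubygems'
require 'hpricot'
require 'open-uri'

# Placeholder: paste the cookie header of an already logged-in Orkut session here.
ORKUT_COOKIE = 'orkut_state=...'

url = 'http://www.orkut.com/CommMembers.aspx?cmm=10087467'
doc = Hpricot(open(url, 'Cookie' => ORKUT_COOKIE))

# Each community member is listed in a '.listitem' block,
# exactly like in the GreaseMonkey script further down.
(doc / '.listitem').each do |member|
  profile_link = member.at('a')
  puts profile_link['href'] if profile_link
end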

Instead I went for a semi-automatic combination: a GreaseMonkey script to grab the profiles and a Rails server to store the data. Now I only need to browse the member pages of an Orkut community to get all its member profiles stored in my database. I also have a Rails action to export the data as CSV so I can open it in a spreadsheet editor:


The first thing to do is install the GreaseMonkey Firefox extension.
Then install the following user script, named orkut_crawler.user.js:

// ==UserScript==
// @name Orkut_crawler
// @namespace http://livetribune.org/
// @description grab orkut properties
// @include *
// @exclude http://diveintogreasemonkey.org/*
// @exclude http://www.diveintogreasemonkey.org/*
// ==/UserScript==

// Inject our crawler script (served by the local Rails app, and bundling
// the Prototype library) into every page we visit.
var scripts = [
  'http://localhost:3000/javascripts/orkut_crawler.js'
];
for (var i in scripts) {
  var script = document.createElement('script');
  script.src = scripts[i];
  document.getElementsByTagName('head')[0].appendChild(script);
}


What we are doing here is injecting our own script into every page we visit; that script bundles the Prototype JavaScript library to make it easier to grab information using CSS selectors. I tried to paste the Prototype lib directly inside the user script, but it didn't work for some unknown reason. Loading it from our Rails server through a script tag does work.

You can now make sure the script gets activated when you visit an Orkut page:

OK, now we need to set up our actual orkut_crawler script as well as the Rails backend.
Build a simple Rails app using the ">rails orkut_crawler" command line.
You can use any database; I used the default SQLite DB in my case.

Then create the database with the ">rake db:create" command inside the orkut_crawler directory.

Now, let's make a simple persistent model to store our data:
"> ruby script/generate model item"
Now edit the db/migrate/001_create_items.rb migration file and write:

class CreateItems < ActiveRecord::Migration
  def self.up
    create_table :items do |t|
      # The columns used by the controller below: 'uid' identifies the Orkut
      # profile, 'properties' stores the serialized key/value pairs.
      t.column :uid, :string
      t.column :properties, :text
    end
  end

  def self.down
    drop_table :items
  end
end

Let's create the table: ">rake db:migrate"
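If you want to check that the model works before wiring up the JavaScript side, you can play with it from the Rails console ("> ruby script/console"); the uid and properties values below are just dummy data:

item = Item.create(:uid => '12345', :properties => '{"age"=>"25"}')
Item.find_by_uid('12345')  # should return the record we just created
Item.count                 # => 1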

Now we need to create the orkut_crawler JavaScript file that will be loaded by our GreaseMonkey script. Make a new file called public/javascripts/orkut_crawler.js.

Inside that file, first paste the content of the Prototype JavaScript library (you'll find prototype.js in public/javascripts). Then add this code at the end of the file:


/* Crawler */

// Send data back to our Rails server by appending a <script> tag
// pointing at the given URL (a poor man's cross-domain GET request).
function appendNewScript(src_url) {
  var head = document.getElementsByTagName("head")[0];
  var newScript = document.createElement('script');
  newScript.id = 'snap';
  newScript.type = 'text/javascript';
  newScript.src = src_url;
  head.appendChild(newScript);
}

// Read one "key: value" row of a profile table and append it to the query
// string, stripping characters that would break the URL or the CSV export.
function addTuple(line) {
  try {
    var key = line.childNodes[1].innerHTML;
    var value = line.childNodes[3].innerHTML;
    key = key.replace(/ /g, "_");
    key = key.replace(/\//g, "_");
    key = key.replace(/:/g, "");
    value = value.replace(/\\/g, "");
    value = value.replace(/"/g, "");
    value = value.replace(/;/g, " ");
    value = value.replace(/:/g, " ");
    params += "&" + key + "=" + value;
    console.log(key, value); // Firebug logging; harmless without Firebug thanks to the try/catch
  } catch(e) {}
}

var tab;

if (document.location.href.indexOf('CommMembers.aspx') > 0) { // it's a community members page
  // Replace each member entry with an iframe loading his/her profile page,
  // staggered by 100 ms so we don't fire every request at once.
  tab = $$('.listitem');
  for (var i = 0; i < tab.length; i++) {
    (function(item, delay) {
      setTimeout(function() {
        item.innerHTML = "<" + "iframe src='" + item.childNodes[1].href + "'/>";
      }, delay);
    })(tab[i], 100 * i);
  }
} else if (document.location.href.indexOf('Profile.aspx') > 0) { // it's a profile page
  var params = "";
  tab = $$('.listlight');
  for (var i = 0; i < tab.length; i++) { addTuple(tab[i]); }
  tab = $$('.listdark');
  for (var i = 0; i < tab.length; i++) { addTuple(tab[i]); }

  // Keep the query string of the profile URL (it contains the uid parameter),
  // then append all the collected key/value pairs.
  params = document.location.search + params;

  appendNewScript("http://localhost:3000/data/new" + params); // send the params back to our Rails app!
}



This code will grab the profile when you browse an Orkut profile page. If you are browsing a community members page instead, it will open the profiles of all the members listed on that page inside iframes, and thus grab those profiles too.

Finally, we need to write a Rails controller that will persist the data (the 'new' action), render a global CSV file (the 'index' action) and tell us how many profiles we have (the 'size' action). So edit app/controllers/data_controller.rb this way:

require 'cgi'

class DataController < ApplicationController

  # Called through the injected <script> tag: stores (or updates) one profile.
  def new
    params.delete 'action'
    params.delete 'controller'
    params.each_key { |key| params[key] = CGI.escape(params[key]) }
    puts params.inspect
    if (existing = Item.find_by_uid(params[:uid]))
      item = existing
    else
      item = Item.new
      item.uid = params[:uid]
    end
    item.properties = params.inspect
    if item.save
      render :text => 'UPDATED' and return if existing
      render :text => 'SAVED!'
    else
      render :text => 'ERROR!'
    end
  end

  # How many profiles have we collected so far?
  def size
    render :text => Item.count.to_s
  end

  # Dump every profile as semicolon-separated CSV.
  def index
    all_props = []
    @items = Item.find(:all)
    res = ""

    # First pass: collect every property name to build the CSV header.
    @items.each do |item|
      map = eval item.properties
      map.each_key do |key|
        all_props << key unless all_props.index key
      end
    end

    all_props.each do |key|
      res += key + "; "
    end
    res = res[0..-3] + "\n"

    # Second pass: one CSV line per profile, in the same column order.
    @items.each do |item|
      map = eval item.properties
      all_props.each do |key|
        res += CGI.unescape(map[key].to_s).gsub("\n", " ").gsub("\r", " ").gsub(";", " - ") + "; "
      end
      res = res[0..-3] + "\n"
    end
    render :text => res
  end
end
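Note that no extra routing is needed here, assuming you kept the default config/routes.rb generated by the "rails" command: its catch-all rules are what map /data/new, /data/size and /data to the controller above.

ActionController::Routing::Routes.draw do |map|
  # Default catch-all routes generated by Rails; they map
  # /data/new, /data/size and /data to DataController#new, #size and #index.
  map.connect ':controller/:action/:id'
  map.connect ':controller/:action/:id.:format'
end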



OK, we are done.
Now start the Rails server:
"> ruby script/server"

You can also track what is going on with:
">tail -f log/development.log"

Now visit some Orkut community members pages with this URL pattern:
http://www.orkut.com/CommMembers.aspx?cmm=10087467&tab=0&na=3&nst=151&nid=0


As you'll see, instead of the normal member icons, we are now loading the member profiles inside iframes. At the same time, you can check in your terminal that our Rails server is storing all the profiles.

Finally, look at the profiles you collected.
See how many profiles you have collected: http://localhost:3000/data/size
Now let's export the data as CSV (this can take a while depending on how much data you stored). In your terminal, execute:
">wget http://localhost:3000/data"
Now rename the downloaded file to data.csv and open it in OpenOffice, for instance:



Ok, you are done. Now you can start making some statistics, but that's for round two!
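If you can't wait, here is a minimal sanity-check sketch that computes a mean age from data.csv. The 'age' column name is an assumption on my side: adjust it to whatever headers your export actually contains.

# Quick sanity check on the exported file; the real analysis is for round two.
lines  = File.readlines('data.csv').map { |l| l.chomp.split('; ') }
header = lines.shift
age_index = header.index('age')  # assumed column name, adjust to your export
abort "no 'age' column found" unless age_index

ages = lines.map { |row| row[age_index].to_i }.select { |a| a > 0 }
puts "profiles: #{lines.size}, with a usable age: #{ages.size}"
puts "mean age: #{ages.inject(0) { |sum, a| sum + a } / ages.size.to_f}" unless ages.empty?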