I started the weekend watching through these Hadoop videos over at Cloudera. That led me to research manipulating large matrices with Hadoop. I found a spectacular paper about multiplying matrices with Hadoop. A random statement in that paper about sparse matrices sent me back to the documentation for JAMA (the Java matrix library I used for TagShadow processing) and the competing library, Jampack. Turns out neither of them has algorithms optimized for sparse matrices. I did find some notes about how to optimize matrix multiplication for a sparse matrix, though.
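
For my own reference, here's roughly what those notes boil down to. This is just a sketch with a made-up SparseMatrix class (nothing from JAMA or Jampack): store only the non-zero entries, then multiply by walking the non-zero entries of A and pairing each with the matching non-zero row of B, so the zero terms never get touched.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Minimal sketch of sparse matrix multiplication: keep only the
 * non-zero entries and skip the zero terms entirely when multiplying.
 * This class and its layout are my own illustration, not anything
 * from JAMA or Jampack.
 */
public class SparseMatrix {

    // Row-major map: row index -> (column index -> non-zero value).
    private final Map<Integer, Map<Integer, Double>> rows = new HashMap<>();
    private final int numRows;
    private final int numCols;

    public SparseMatrix(int numRows, int numCols) {
        this.numRows = numRows;
        this.numCols = numCols;
    }

    public void set(int i, int j, double value) {
        if (value != 0.0) {
            rows.computeIfAbsent(i, k -> new HashMap<>()).put(j, value);
        }
    }

    public double get(int i, int j) {
        Map<Integer, Double> row = rows.get(i);
        return (row == null) ? 0.0 : row.getOrDefault(j, 0.0);
    }

    /**
     * C = this * other. Instead of the dense triple loop, iterate only
     * over the non-zero A(i,j) and, for each one, only over the non-zero
     * entries of row j in B. Cost scales with the number of non-zero
     * products rather than n^3.
     */
    public SparseMatrix multiply(SparseMatrix other) {
        SparseMatrix result = new SparseMatrix(this.numRows, other.numCols);
        for (Map.Entry<Integer, Map<Integer, Double>> aRow : rows.entrySet()) {
            int i = aRow.getKey();
            for (Map.Entry<Integer, Double> aEntry : aRow.getValue().entrySet()) {
                int j = aEntry.getKey();
                double aij = aEntry.getValue();
                Map<Integer, Double> bRow = other.rows.get(j);
                if (bRow == null) continue; // entire row j of B is zero
                for (Map.Entry<Integer, Double> bEntry : bRow.entrySet()) {
                    int k = bEntry.getKey();
                    result.set(i, k, result.get(i, k) + aij * bEntry.getValue());
                }
            }
        }
        return result;
    }
}
```

For something like a tag-by-title matrix where most cells are zero, that drops the work from the dense n-cubed loop down to roughly the number of non-zero products.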

I’ve been considering pushing TagShadow to its very limits. Even if I just included the titles that have been added to ISFDB, it looks like that’s pushing 500k entries. Of course I’d like to handle the online magazines that ISFDB doesn’t cover as well… Handling that much data will involve some real-time abstractions to remain functional. It’s these thoughts that have me investigating parallel computing solutions like Hadoop. This is rather stream of consciousness at the moment, but I’m excited.
