We present the backend architecture behind Spinn3r – our scalable web and blog crawler.
Most existing work on scaling MySQL has focused on read-heavy environments such as web applications. In contrast, at Spinn3r we needed to complete thousands of write transactions per second in order to index the blogosphere at full speed.
We have achieved this through the ground-up development of a fault-tolerant distributed database and compute infrastructure, all built on top of cheap commodity hardware.
We’ve built a number of technologies on top of MySQL that help us scale operations easily.
We’ve implemented an open source load-balancing JDBC driver named lbpool (http://code.tailrank.com/lbpool). Lbpool allows us to loosely couple our MySQL slaves, which lets us handle system failures gracefully. It also supports load balancing, reprovisioning, awareness of slave lag, and other advanced features not available in the stock MySQL JDBC driver.
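As a rough illustration of what lag-aware load balancing across loosely coupled slaves can look like, here is a minimal Python sketch. It is not lbpool's code or API: the `Replica` class, the `pick_replica` function, and the 5-second lag threshold are all hypothetical names and values chosen for this example.

```python
import random
from dataclasses import dataclass


@dataclass
class Replica:
    """Hypothetical view of one MySQL slave as seen by the balancer."""
    host: str
    alive: bool = True          # False once a health check has failed
    lag_seconds: float = 0.0    # replication delay behind the master


def pick_replica(replicas, max_lag_seconds=5.0, rng=random):
    """Return a random replica that is alive and not lagging too far.

    Returns None when no replica qualifies, so the caller can decide
    whether to fall back to the master or fail the request.
    """
    candidates = [
        r for r in replicas
        if r.alive and r.lag_seconds <= max_lag_seconds
    ]
    if not candidates:
        return None
    return rng.choice(candidates)
```

Because a failed or lagging slave is simply left out of the candidate list, it can be taken offline or reprovisioned without the application noticing, which is one plausible reading of "loosely coupled" here.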
We’ve also built a sharded database similar to infrastructure built at other companies such as Google (AdWords) and Yahoo (Flickr). Our sharded DB has a number of interesting properties, including ultra-high throughput requirements (we process 52TB per month), distributed sequence generation, and query plan execution.
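The abstract does not say how Spinn3r generates sequences across shards. One common approach, shown here purely as an assumption, is to interleave IDs so that each shard hands out values from its own residue class and never collides with another shard. The `ShardSequence` class and `shard_for` function below are hypothetical names for this sketch.

```python
class ShardSequence:
    """Per-shard ID generator using interleaved offsets (an assumed scheme).

    Shard k of n emits k, k + n, k + 2n, ... so IDs are globally unique
    without any coordination between shards.
    """

    def __init__(self, shard_id, shard_count):
        if not 0 <= shard_id < shard_count:
            raise ValueError("shard_id must be in [0, shard_count)")
        self.shard_id = shard_id
        self.shard_count = shard_count
        self._counter = 0  # number of IDs this shard has issued so far

    def next_id(self):
        value = self._counter * self.shard_count + self.shard_id
        self._counter += 1
        return value


def shard_for(record_id, shard_count):
    """Recover the owning shard from an ID produced by ShardSequence."""
    return record_id % shard_count
```

A side effect of this scheme is that the ID itself encodes its shard, so a lookup by ID can be routed to a single shard without consulting a directory. The trade-off is that changing the shard count requires a migration plan, since the modulus is baked into existing IDs.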
Founder/CEO of Spinn3r, co-inventor of RSS, Apache contributor, and big data geek.
Jonathan has been developing database-driven application frameworks for over ten years. His work has been primarily based in Perl, Python, PHP, and Java. In the last decade he has been lead architect on six different frameworks that encompass applications ranging from social networks to educational simulators.