19th Jul 2008
RarestNews, scalability, 100000 news per day, databases, de-normalization and bugs in Python language
UPDATE: The response to this post by one of the developers of CouchDB.
The RarestNews project is currently being re-tooled, for several reasons. First of all there’s a design flaw: accumulating 100 000 news articles every day in a single MySQL database is a bad idea. The database started to crawl (i.e. became really slow) on the 10th day, and on the 20th day it nearly came to a stop. Actually the site stopped being responsive at all (even at the start it wasn’t really responsive, but that was all due to normalization, which I was taught is good, but it’s not… in some cases). So here is a technical description of the scalability problems I’m facing.
The only thing that wasn’t giving me problems is the new Amazon EC2 High-CPU instances. That’s a terrific thing. Everything else is crap.
Ok, not everything, but when it’s misused the way I was misusing it, it probably is.
MySQL problems
So, to be technical here: I’ve used MyISAM tables (I never really liked InnoDB because of its slow writes, and at 100k new articles a day there is lots of meta-data to write about them, like tags, dates, snippets, word frequencies, etc.), and it seemed like a good decision. The bad part is that on write MyISAM locks the whole table. So 50 bots scouring the Web for news, each writing and locking the whole table, made the site almost unresponsive.
I’m not yet sure how to solve it - with InnoDB, with PostgreSQL or with some kind of new-age databases like CouchDB, StrokeDB, maybe Amazon’s SimpleDB, etc…
CouchDB problems
They seem like a nice idea when you read about them, but… there are flaws. The main problem with CouchDB, for example, is its complete HDD-dependence. Modern memory is hundreds of times faster than a hard drive, so you’re using only about 1% of the available speed if you use an HDD-based database. The second problem is its “Do not overwrite” motto. It doesn’t reuse space that is no longer needed, so if I write a 100KB article to the database (along with some other data) and then rewrite this entry, there’s now 200KB stored on my drive, and each further update eats 100KB more.
How to avoid it? Compact the database, so that it creates a NEW file with only the latest 100KB, and then delete the previous database file. So, even though I didn’t change anything, I’ve had to write the same data 3 times (along with the rest of my database during the compaction process). What that means:
1) It’s AT LEAST 3 times slower than your HDD speed if you want to effectively use ALL of your hard drive, so now we have only 0.3% of the computer’s speed (compared to using memory).
2) You can only use databases HALF the size of your HDD (but in reality more like 33%) to effectively use CouchDB (remember - the compaction process creates a NEW file, so it needs at least the same amount of space as the data it keeps).
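To make the arithmetic above concrete, here is a minimal sketch of a toy append-only store. It is not CouchDB's real code or API; the class and its fields are made up for illustration, and sizes are just numbers in KB. It only counts how much gets written and how much disk is needed at the worst moment for the 100KB article example.

```python
class AppendOnlyStore:
    """Toy model of an append-only database file (hypothetical, sizes in KB)."""

    def __init__(self):
        self.live = {}          # doc_id -> size of the latest revision
        self.file_size = 0      # current size of the database file
        self.bytes_written = 0  # everything ever written to disk
        self.peak_disk = 0      # highest disk usage seen so far

    def write(self, doc_id, size):
        # An update never overwrites: the new revision is appended to the file.
        self.live[doc_id] = size
        self.file_size += size
        self.bytes_written += size
        self.peak_disk = max(self.peak_disk, self.file_size)

    def compact(self):
        # Live revisions are copied into a NEW file; the old and new files
        # exist side by side until the old one is deleted.
        new_size = sum(self.live.values())
        self.bytes_written += new_size
        self.peak_disk = max(self.peak_disk, self.file_size + new_size)
        self.file_size = new_size


store = AppendOnlyStore()
store.write("article", 100)   # first version of the article
store.write("article", 100)   # rewrite: file is now 200KB
store.compact()               # copies the live 100KB into a new file
print(store.file_size, store.bytes_written, store.peak_disk)
```

Under these assumptions 100KB of useful data ends up costing 300KB of writes and 300KB of disk at the peak, which is where the "3 times" and "33%" figures come from.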
Scalability problems
So, the next thing I’ve stumbled upon is that data accumulates from the sources almost infinitely. 100k new articles every day is not a joke. There are a few solutions, like taking it out of the database and storing it as plain files somewhere like Amazon S3, which is what I’m doing now with the archive of RarestNews.
Another solution would be some kind of big filesystem, something like MogileFS, but it also has a bottleneck - it has only a single MySQL master.
The Hadoop filesystem might help.
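The "plain files instead of database rows" idea mentioned above could look roughly like this. It's only a sketch: it writes to a local directory as a stand-in for S3, it uses modern Python 3 rather than the 2.5 I'm running, and the date-based path layout, the function names and the article fields are all assumptions made up for the example.

```python
import gzip
import json
import os
import tempfile
from datetime import date


def archive_path(root, day, article_id):
    # One directory per day keeps any single directory from growing forever.
    return os.path.join(root, day.strftime("%Y/%m/%d"), f"{article_id}.json.gz")


def archive_article(root, day, article_id, article):
    """Write one article as a gzip-compressed JSON file and return its path."""
    path = archive_path(root, day, article_id)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with gzip.open(path, "wt", encoding="utf-8") as f:
        json.dump(article, f)
    return path


def load_article(path):
    """Read an archived article back into a dict."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return json.load(f)


with tempfile.TemporaryDirectory() as root:
    path = archive_article(root, date(2008, 7, 19), 42,
                           {"title": "Example", "tags": ["health"]})
    restored = load_article(path)
    print(restored["title"])
```

The database then only has to keep the small stuff (IDs, dates, tags), and the bulky article text lives wherever the files go.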
And the last problem came just a few days ago. My dear Python (language).
Python 2.5 bug
The damn language has been 10 years in development and still has no way to force a variable out of memory.
I’ve stumbled upon a memory leak, which couldn’t be found by any of the tools available. Heapy shows a constant 15MB used, while the memory usage reported by Linux’s ps rises up to the available 1GB and then the machine starts to swap. Python’s garbage collector shows nothing. gc.collect() doesn’t help. Deleting instances, variables, etc… doesn’t do anything. So the memory is just lost and never returned, nor seen by anything.
Finally I’ve tracked the problem down to my threading implementation, which is basically a “have a limited number of threads and give them jobs” class. It seems that the objects I pass to it never get freed. And here’s the kicker (in pseudo code).
pool of threads=array
take object
threading.Thread(target=do_something, args=(object,)).start()
put thread in pool, check pool for dead threads, .join() them
It Leaks Memory
pool of threads=array
take object
threading.Thread(target=object.do_something, args=()).start()
put thread in pool, check pool for dead threads, .join() them
It Leaks Memory
But…
thread = threading.Thread(target=object.do_something, args=())
thread.start()
thread.join()
Doesn’t leak memory. But it’s linear, not parallel, because I have to wait for the thread to end ( .join() ) before starting a new one.
Doing del object inside of the method, outside of the method, before join, after join, or in any combination of those doesn’t do anything - memory still leaks, unless I run a single thread and wait for completion. gc.collect() doesn’t do anything either, anywhere, with any frequency (from a dozen calls per second to a few times per minute), on any generation of garbage. It’s been a 48-hour coding marathon with no luck at all.
Damnation - 10 years in development and still major bugs in very basic things. Just in case - the object being passed is a Django ORM-mapped instance from a list. I’ve tried passing only the ID and getting a new instance inside of the do_something method - it doesn’t change anything. There are no cyclic references between the objects - they’re all parent-child related (Queued page IDs → Page → Site).
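For reference, the “limited number of threads, give them jobs” idea described above is often written with a fixed set of long-lived workers pulling from a queue, instead of starting a new Thread per object. The sketch below is that common pattern, not my actual class, and I’m not claiming it cures the leak. It uses modern Python 3 (the module is called Queue in 2.5), and run_pool, handler and num_workers are names made up for the example.

```python
import queue
import threading


def run_pool(jobs, handler, num_workers=4):
    """Run handler(job) for every job using a fixed number of worker threads."""
    tasks = queue.Queue()
    results = []
    results_lock = threading.Lock()

    def worker():
        while True:
            job = tasks.get()
            if job is None:          # sentinel: no more work for this worker
                break
            out = handler(job)
            with results_lock:       # protect the shared results list
                results.append(out)

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for job in jobs:
        tasks.put(job)
    for _ in threads:
        tasks.put(None)              # one sentinel per worker
    for t in threads:
        t.join()
    return results


squares = run_pool(range(10), lambda x: x * x, num_workers=3)
print(sorted(squares))
```

The threads are created once and reused, so the number of Thread objects stays constant no matter how many jobs go through the queue. Results come back in whatever order the workers finish, hence the sorted() call.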
De-normalization
Well, the last problem that I’ve stumbled upon is normalization of SQL tables. In a nutshell it’s like this. If you have a table where you have to repeat “test” “test” “test” in a single column, you should move it to another table, giving it an ID, like 1=test, and repeat 1 in the previous table, so that you can change “test” to “best” and it changes everywhere.
The “tags” seemed like a good candidate for normalization. I.e. why would I repeat “health”, “politics” or “breaking news” millions of times in the “news” table, when I could create a “many-to-many” table and use it? Ugh… That didn’t work well. 100 000 news per day means 2M news in 20 days, with 5-10 tags each, and I have a 20M-row M2M (many-to-many) table which is used in a 3-table SQL JOIN. The performance was really crap-tacular!
So I’d be better off storing it as “health,breaking_news,politics” in a column rather than normalizing it and then doing the JOIN each time. Guys from the big guns, like Flickr, confirm my theory that normalization for BIG tables is bad, but it’s too late.
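Here is a small sketch of the two layouts side by side. It uses SQLite from Python’s standard library instead of MySQL so it runs anywhere, and the table and column names (news, tag, news_tag, tags) are assumptions for illustration, not my real schema. The tags column is stored with a leading and trailing comma so that a LIKE pattern matches whole tags only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE news (id INTEGER PRIMARY KEY, title TEXT, tags TEXT);
CREATE TABLE tag (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
CREATE TABLE news_tag (news_id INTEGER, tag_id INTEGER);
""")

articles = [
    (1, "Flu season starts", ["health", "breaking_news"]),
    (2, "Election results", ["politics", "breaking_news"]),
    (3, "New diet study", ["health"]),
]
for news_id, title, tags in articles:
    # De-normalized copy: ",health,breaking_news," right in the news row.
    cur.execute("INSERT INTO news VALUES (?, ?, ?)",
                (news_id, title, "," + ",".join(tags) + ","))
    # Normalized copy: one row per (news, tag) pair in the M2M table.
    for name in tags:
        cur.execute("INSERT OR IGNORE INTO tag (name) VALUES (?)", (name,))
        tag_id = cur.execute("SELECT id FROM tag WHERE name = ?",
                             (name,)).fetchone()[0]
        cur.execute("INSERT INTO news_tag VALUES (?, ?)", (news_id, tag_id))

# Normalized lookup: the 3-table JOIN.
normalized = cur.execute("""
    SELECT news.id FROM news
    JOIN news_tag ON news_tag.news_id = news.id
    JOIN tag ON tag.id = news_tag.tag_id
    WHERE tag.name = ? ORDER BY news.id
""", ("health",)).fetchall()

# De-normalized lookup: one table, no JOIN.
denormalized = cur.execute(
    "SELECT id FROM news WHERE tags LIKE ? ORDER BY id",
    ("%,health,%",)).fetchall()

print(normalized, denormalized)
```

Both queries return the same articles. The trade-off is that a LIKE pattern starting with a wildcard can’t use an ordinary index, so the de-normalized column pays off mainly when the tags are read together with the article rather than searched on.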
Last, but not least.
Database Schema
The reason I’m now using the Django ORM as a standalone app is that my database grew to dozens of columns in a dozen tables, and writing correct joins, updates, etc… for them was becoming exponentially problematic. It’s much easier to write News.objects.select_related(depth=2).get(id=1).source.url than “SELECT source.href FROM queue INNER JOIN news ON queue.news_id=news.id INNER JOIN source ON source.id=news.source_id WHERE news.id=1” each time.
Also my schema is now in models.py, so I don’t have to fire up phpMyAdmin or something like that to find out the names of fields. But the problem is that Python leaks memory with the Django ORM and threading. I know that many of you would tell me Python doesn’t leak memory and it’s my fault as a programmer if it does, but believe me - I’ve tried hundreds of ways to avoid it, and there’s no way. The only two approaches I’ve found to be working are os.fork() and doing stuff in the child (which isn’t a great idea, as each process is 20MB at least, so 50 processes eat all the memory that 500 threads (10 times more!) would, and I need memory for MySQL too!) and linear, one-at-a-time programming (i.e. waiting for the thread to join right after starting it - then the memory is cleared). And linear programming for networks (where you sometimes have to wait for seconds until the web server responds) is just stupidly wasted resources.
Conclusion
Well, that’s pretty much it. Now I’m off to search for alternatives to do what I want to do with reasonable resources, without tossing new hard drives to be eaten by .