

Archive for the 'python' Category

31st Aug 2008

Hallelujah! SANE ORM for threading in Python: Elixir+SQLAlchemy - no memleaks!

In one of my earlier posts I wrote about a problem I hit while refactoring RarestNews’s bot onto Django’s ORM: after the project was migrated and tested, I added parallelism with my threading wrapper, and it started leaking memory at a very fast rate - gigabytes in a matter of minutes.

That led to wild accusations on Reddit that I can’t program. I still don’t know the culprit in that case, be it Django’s ORM or Python, but it led me to look at alternatives. And there was another disappointment!

I’ve found Elixir - as simple as Django’s ORM, but even SIMPLER! Yet it leaked memory again, but this time I’ve found an almost elegant solution.

With Django, if I only want the ORM and not the URL mapper or the templating (neither of which I need in a bot), I still have to write a lot of boilerplate code (the include path to Django’s settings file, lots of imports from lots of files, etc…). With Elixir, it’s “from elixir import *”. (BTW, Elixir is a layer on top of SQLAlchemy’s ORM.) And the declaration is pretty simple:

from elixir import *

metadata.bind = "mysql://root:@127.0.0.1/rarest"

class Movie(Entity):
    title       = Field(Unicode(30))
    year        = Field(Integer)
    description = Field(UnicodeText)

setup_all()   # build the mappers for the declared entities
create_all()  # create any missing tables

movie1 = Movie(title=u"Blade Runner", year=1982)
session.commit()  # required, transactions are forced

Nothing lost here, very similar to Django. But…

This time I was smrrrrter: I ran the parallel test before migrating a lot of code, and WTF! Memleak. Again. If I add 10K objects to the DB, there are 10K more live objects (according to len(gc.get_objects()))…
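
For anyone who wants to repeat that kind of check, here is a minimal sketch of counting live objects per type with the standard gc module. It does not use Elixir or SQLAlchemy at all: the Record class and the leaked list are hypothetical stand-ins for an ORM entity and for whatever registry keeps the objects alive.

```python
import gc
from collections import Counter


def live_object_counts():
    """Return a Counter of type names for all objects the GC currently tracks."""
    gc.collect()  # drop unreachable garbage first so only live objects are counted
    return Counter(type(obj).__name__ for obj in gc.get_objects())


class Record(object):
    """Hypothetical stand-in for an ORM entity."""

    def __init__(self, title):
        self.title = title


leaked = []  # simulates a registry (e.g. a session) that keeps every object alive

before = live_object_counts()
for i in range(10000):
    leaked.append(Record("title %d" % i))
after = live_object_counts()

growth = after - before  # Counter subtraction keeps only the positive differences
print(growth.most_common(3))
```

Sorting the growth by type is usually enough to see which class is piling up; emptying the registry and counting again shows whether the objects actually go away.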

OK, now that’s not funny. Does every ORM have a threading memleak? Forking is not an option (it doesn’t leak, but 20MB forked processes can’t compare to threads of a few MB each, especially if you run 200 of them).

Well, I won’t bore you with the heapy and garbage collector witch-hunt (for memleaks). The leaking part is the sqlalchemy.orm.identity.IdentityManagedState object, and there’s no documentation on how to “tiptoe around it” (I had some friendly fun with SQLAlchemy’s source code). The solution is here:

movie1=Movie(title=u"Blade Runner", year=1982)
movie1.save()
movie1.expunge()
session.commit()  # required, transactions are forced

FINALLY! Okay, it’s a bit more labor to clean up every used object, but IT WORKS (the others just don’t).

Just in case you were going to recommend an easier way, I’ve tried these and they all failed:

clear_all()
sqlalchemy.orm.clear_mappers()
movie.expire()
session.flush()
session.close()
cleanup_entities(entities)
entities.clear()

P.S. There were no memleaks in my threading implementation.

Posted in python

20th Jul 2008

Testing SQL engines/queries with Django (avg.query time)

I love Django for many reasons, and here’s one of them: measuring the average query time, which I did today to compare engines (MySQL vs PostgreSQL), is easy with Django.
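
The general idea of averaging query time can be sketched without Django. The example below is only an illustration under stated assumptions: it uses the standard sqlite3 module instead of MySQL or PostgreSQL, and the average_query_time helper, the movie table and its data are all hypothetical.

```python
import sqlite3
import time


def average_query_time(connection, sql, params=(), runs=100):
    """Run the same query `runs` times and return the mean wall-clock seconds."""
    total = 0.0
    for _ in range(runs):
        start = time.perf_counter()
        connection.execute(sql, params).fetchall()  # fetch rows so the query fully runs
        total += time.perf_counter() - start
    return total / runs


# Hypothetical table and data, just to have something to query.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movie (id INTEGER PRIMARY KEY, title TEXT, year INTEGER)")
conn.executemany(
    "INSERT INTO movie (title, year) VALUES (?, ?)",
    [("Movie %d" % i, 1950 + i % 60) for i in range(5000)],
)
conn.commit()

avg = average_query_time(conn, "SELECT COUNT(*) FROM movie WHERE year = ?", (1982,))
print("average query time: %.6f s" % avg)
```

To compare two engines, the same helper would be pointed at a connection for each one, running identical queries against identical data.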

(more…)

Posted in python, site

22nd May 2008

How to split glued words in domains into parts in PHP/Python

Well, I’ve finally got an idea for an algorithm to split things like belfastjobs.com into Belfast Jobs.com - i.e. to detect words glued together. So, as soon as the algo finishes crunching 70 million domains (which, as you can see, is going to take more than 24 hrs), you’re going to have a title on your site and all the links to related sites will finally become readable.

If you’re going to do that yourself - here’s how I did it:

(more…)
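
For illustration only, and not necessarily the algorithm from the full post: one common generic approach is dictionary-based segmentation with dynamic programming, picking the split that uses the fewest known words. The split_glued function and the toy VOCABULARY below are hypothetical; a real run would need a large word list.

```python
def split_glued(text, vocabulary):
    """Split `text` into the fewest vocabulary words, or return None if impossible."""
    # best[i] holds the shortest list of words that exactly covers text[:i].
    best = [None] * (len(text) + 1)
    best[0] = []
    for end in range(1, len(text) + 1):
        for start in range(end):
            word = text[start:end]
            if best[start] is not None and word in vocabulary:
                candidate = best[start] + [word]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[len(text)]


# Hypothetical toy vocabulary, including a few short words that create ambiguity.
VOCABULARY = {"belfast", "jobs", "job", "fast", "bel", "s"}

name, _, tld = "belfastjobs.com".partition(".")
words = split_glued(name, VOCABULARY)
if words is not None:
    print(" ".join(w.capitalize() for w in words) + "." + tld)
```

Preferring the fewest words is a crude heuristic that keeps "belfast" whole instead of "bel" + "fast"; weighting candidates by word frequency would be the usual refinement.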

Posted in php, python