

Archive for August, 2008

31st Aug 2008

Hallelujah! SANE ORM for threading in Python: Elixir+SQLAlchemy - no memleaks!

In one of my posts I wrote about refactoring RarestNews’s bot onto Django’s ORM and the problem that followed: after the project was migrated and tested, I added parallelism to it with my threading wrapper and it started leaking memory at a very fast rate - gigabytes in a matter of minutes.

That led to wild accusations on Reddit that I can’t program. I still don’t know the culprit in that case, be it Django’s ORM or Python, but it made me look at alternatives. And there was another disappointment!

I found Elixir - as simple as Django’s ORM, or even SIMPLER! Yet it leaked memory too, but this time I found an almost elegant solution.

With Django, if I only want to use the ORM and not the URL mapper or templater (neither of which I need in a bot), I still have to write a lot of boilerplate code (inclusion path to Django’s settings file, lots of imports from lots of files, etc.). With Elixir, it’s “from elixir import *”. (BTW, Elixir is a layer on top of SQLAlchemy’s ORM.) And the declaration is pretty simple:

from elixir import *

metadata.bind = "mysql://root:@127.0.0.1/rarest"

class Movie(Entity):
    title          = Field(Unicode(30))
    year           = Field(Integer)
    description    = Field(UnicodeText)

movie1=Movie(title=u"Blade Runner", year=1982)
session.commit()  # required, transactions are forced

Nothing lost here, very similar to Django. But…

This time I was smrrrrter: I did the parallel test before migrating a lot of code and WTF! Memleak. Again. If I add 10K objects to the DB, there are 10K more objects in memory (according to len(gc.get_objects()))…

Ok, now that’s not funny. Does every ORM have a threading memleak? Forking is not an option (it doesn’t leak, but 20MB forked processes can’t be compared to threads of a few MB each, especially if you run 200 of them).

Well, I won’t bore you with the heapy and garbage collector witch-hunt (for memleaks). The leaking part is the sqlalchemy.orm.identity.IdentityManagedState object, and there’s no documentation on how to “tiptoe around it” (a bit of friendly fun at the expense of SQLAlchemy’s source code). The solution is here:

movie1=Movie(title=u"Blade Runner", year=1982)
movie1.save()
movie1.expunge()
session.commit()  # required, transactions are forced

FINALLY! Okay, it’s a bit more labor to clean up every used object, but IT WORKS (the others just don’t).
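Why explicit cleanup can matter at all is easier to see in a toy model. The sketch below is NOT SQLAlchemy or Elixir code and makes no claim about how they work internally. ToySession is a made-up class that keeps a strong reference to everything saved into it, purely as an assumption for illustration.

```python
import gc
import weakref


class ToySession(object):
    """Made-up session that strongly references every saved object."""

    def __init__(self):
        self._tracked = {}

    def save(self, obj):
        self._tracked[id(obj)] = obj

    def expunge(self, obj):
        self._tracked.pop(id(obj), None)

    def __len__(self):
        return len(self._tracked)


class Movie(object):
    """Plain class for the demo, not an Elixir Entity."""

    def __init__(self, title, year):
        self.title = title
        self.year = year


toy_session = ToySession()

# Saved but never expunged: the toy session keeps it alive.
kept = Movie(u"Blade Runner", 1982)
toy_session.save(kept)
kept_ref = weakref.ref(kept)
del kept
gc.collect()

# Saved and then expunged: nothing references it after "del".
dropped = Movie(u"Alien", 1979)
toy_session.save(dropped)
toy_session.expunge(dropped)
dropped_ref = weakref.ref(dropped)
del dropped
gc.collect()

print("kept still alive: %s" % (kept_ref() is not None))
print("dropped was freed: %s" % (dropped_ref() is None))
```

The weak references are only there to observe whether each object was actually freed once the local name was deleted.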

Just in case you were going to recommend an easier way, I’ve tried those ways and they failed:

clear_all()
sqlalchemy.orm.clear_mappers()
movie.expire()
session.flush()
session.close()
cleanup_entities(entities)
entities.clear()

P.S. There were no memleaks in my threading implementation.

Posted in python

21st Aug 2008

TheCraziestIdeas reborn

TheCraziestIdeas has been expanded. It now has much more potential and an input field :) It’s still a joke project, but with more abilities. The “bikini inspector” query now yields better things, like “blouses exterminator”, “wax adjuster” and “bikini superintendant”.

It can also be used in some serious ways, like when you are searching for a good idea for a domain name or a company name. Let’s say you have a car club, but carclub.com is obviously taken. Spin it off with an exclamation point, like so: “!car club”, and here you go: carcondo.com, carcafe.com, carsmart.com, etc.

For all you domaineers - there’s an export to a list of .com domains, and there are options to export to AdWords lists and plain text too.

Enjoy the all-new TheCraziestIdeas.com.

Posted in site

21st Aug 2008

“Rarest Synonyms” or auto-related words magic

I’ve been quiet for some time now. Actually, I was on vacation in the middle of nowhere, without TV, radio or Internet. Let me tell you something - those who tell you “Russia has two problems - idiots and roads” are wrong about the roads. I drove nearly 1000 miles and the worst problem was extortion by the local road police, not the roads :) Okay, enough chit-chat, let’s get back to business!

Ever thought of words that are close / almost synonymous to “ufos”? :)
Me neither, but still: ufos, alien, aliens, extraterrestrials, sedition, strange, ghosts, foreigners, predator, crop, extra, ufo, robots, colors, strangers, spooks, monsters, animals, weird, dead, … Ok, that wasn’t really tough.

How about synonymous words for “TechCrunch”? Easy! Techcrunch, mashable, techmeme, slashdot, digg, engadget, correct, cnet, com, readwriteweb, gigaom, perezhilton.

Magic? No way, you can do it too now. A new feature has been added to TheRarestWords - just go to a word page, such as http://therarestwords.com/word/apple, and see for yourself.

If you want to know the technical details of how it’s done - take 3-grams (like from the Wikipedia n-grams I released) and search for patterns like “red and blue”, “techcrunch or gigaom”, etc. Words joined by “and” or “or” tend to be related. See? Easy!
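A minimal sketch of that idea, under stated assumptions: the SAMPLE_TRIGRAMS data and its counts are invented for the example, related_words is a hypothetical helper name, and ranking by summed 3-gram count is my simplification, not necessarily what the site does.

```python
from collections import Counter

# Made-up sample of (3-gram, count) pairs. A real list would be far larger.
SAMPLE_TRIGRAMS = [
    ("red and blue", 120),
    ("red or green", 45),
    ("blue and red", 80),
    ("techcrunch or gigaom", 7),
    ("mashable and techcrunch", 5),
    ("red hot chili", 60),  # no connector in the middle, so it is ignored
]

CONNECTORS = {"and", "or"}


def related_words(word, trigrams):
    """Rank words that appear on the other side of 'and' / 'or' from word."""
    word = word.lower()
    scores = Counter()
    for text, count in trigrams:
        parts = text.lower().split()
        if len(parts) != 3 or parts[1] not in CONNECTORS:
            continue
        left, _, right = parts
        if left == word and right != word:
            scores[right] += count
        elif right == word and left != word:
            scores[left] += count
    return [w for w, _ in scores.most_common()]


print(related_words("red", SAMPLE_TRIGRAMS))
print(related_words("TechCrunch", SAMPLE_TRIGRAMS))
```

Summing counts in both directions means “red and blue” and “blue and red” both vote for the same pair.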

No, I didn’t use Wikipedia’s n-grams. I used a much broader sample of pages from all around the Internet, with phrase detection and some more smart stuff, but even if you parse Wikipedia’s n-grams you’d still get pretty good results (although no synonyms for “TechCrunch” and other rare words).

Posted in site

01st Aug 2008

All Wikipedia’s n-grams are REALLY belong to us

A few days ago I thought about Google releasing its n-grams in the past. Damn, that was the second thing I’ve wanted to get after the TLD Zone Access Program. [I did apply to it, but never heard back from them. In our open information age, the most wanted information, even though it seems open, is usually out of reach, as in those 2 cases.]

So, I’ve decided to do the next best thing - build an n-gram frequency list from Wikipedia. It’s not quite as big as Google’s (with a trillion or so n-grams and 5 DVDs), but the licensing terms are much better (“Free” vs Google’s “$180”, “you can use it” vs Google’s “you can’t use it” license).
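For a rough idea of what building such a frequency list involves, here is a minimal sketch. The ngram_counts helper, the simple regex tokenizer and the sample sentence are all assumptions for illustration. The actual Wikipedia processing (markup stripping, tokenization rules, scale) is not described here and is surely more involved.

```python
import re
from collections import Counter


def ngram_counts(text, n):
    """Count every run of n consecutive words in text (lowercased)."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(
        " ".join(words[i:i + n]) for i in range(len(words) - n + 1)
    )


sample = "the cat sat on the mat and the cat sat on the hat"
counts = ngram_counts(sample, 3)

# Print the most frequent 3-grams first, as a frequency list would.
for gram, freq in counts.most_common(3):
    print("%d\t%s" % (freq, gram))
```

For a whole Wikipedia dump the text would have to be streamed and the counts merged in chunks, since a single in-memory Counter would not be enough.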

Read more and download here

Posted in downloads, word frequency