

31st Aug 2008

Hallelujah! SANE ORM for threading in Python: Elixir+SQLAlchemy - no memleaks!

In one of my earlier posts I wrote about a problem I hit while moving RarestNews’s bot to Django’s ORM: after the project was migrated and tested, I added parallelism to it with my threading wrapper and it started leaking memory at a very fast rate - gigabytes in a matter of minutes.

That led to wild accusations on Reddit that I can’t program. I still don’t know the culprit in that case, be it Django’s ORM or Python, but it made me look at alternatives. And there was another disappointment!

I’ve found Elixir - as simple as Django’s ORM, or even SIMPLER! Yet it leaked memory too, but this time I’ve found an almost elegant solution.

With Django, if I only want to use the ORM and not the URL mapper or the templater (neither of which I need in a bot), I still have to write a lot of boilerplate code (the include path to Django’s settings file, lots of imports from lots of files, etc…). With Elixir, it’s “from elixir import *”. (BTW, Elixir is a layer on top of SQLAlchemy’s ORM.) And the declaration is pretty simple:

from elixir import *

metadata.bind = "mysql://root:@127.0.0.1/rarest"

class Movie(Entity):
    title          = Field(Unicode(30))
    year           = Field(Integer)
    description    = Field(UnicodeText)

movie1=Movie(title=u"Blade Runner", year=1982)
session.commit()  # required, transactions are forced

Nothing lost here, very similar to Django. But…

This time I was smrrrrter: I did the parallel test before migrating a lot of code and WTF! Memleak. Again. If I add 10K objects to the DB, there are 10K more objects left in memory (according to len(gc.get_objects()))…
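For the curious, here is a minimal stdlib-only sketch of that kind of check: take a snapshot of gc-tracked objects per type, do the work, take another snapshot and see what grew. The Record class and the helper names are hypothetical stand-ins, not code from the bot, and no ORM is involved.

```python
import gc
from collections import Counter


def live_object_counts():
    """Return a Counter mapping type name -> number of gc-tracked objects."""
    gc.collect()  # drop collectable garbage first so only reachable objects are counted
    return Counter(type(obj).__name__ for obj in gc.get_objects())


def growth_between(before, after, top=5):
    """List the (type name, delta) pairs that grew the most between two snapshots."""
    delta = Counter({name: after[name] - before[name] for name in after})
    return [(name, n) for name, n in delta.most_common(top) if n > 0]


class Record(object):
    """Hypothetical stand-in for an ORM entity."""

    def __init__(self, title):
        self.title = title


before = live_object_counts()
kept = [Record("movie %d" % i) for i in range(10000)]  # simulate objects that never get released
after = live_object_counts()

print(growth_between(before, after))
```

If the type that keeps growing is one you never create yourself, that is usually the one to chase.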

Ok, now that’s not funny. Does every ORM have a threading memleak? Forking is not an option (it doesn’t leak, but 20MB forked processes can’t be compared to threads of a few MB each, especially if you run 200 of them).

Well, I won’t bore you with the heapy and garbage collector witch-hunt (for memleaks). The leaking part is the sqlalchemy.orm.identity.IdentityManagedState object, and there’s no documentation on how to “tiptoe around it” (a friendly jab at SQLAlchemy’s source code). The solution is here:

movie1=Movie(title=u"Blade Runner", year=1982)
movie1.save()
movie1.expunge()
session.commit()  # required, transactions are forced

FINALLY! Okay, it’s a bit more labor to clean up every used object, but IT WORKS (the others just don’t).

Just in case you were going to recommend an easier way: I’ve tried these and they all failed:

clear_all()
sqlalchemy.orm.clear_mappers()
movie.expire()
session.flush()
session.close()
cleanup_entities(entities)
entities.clear()

P.S. There were no memleaks in my threading implementation.

Posted by admin under python | Comments Off

21st Aug 2008

TheCraziestIdeas reborn

TheCraziestIdeas has been expanded. It now has much more potential and an input field :) It’s still a joke project, but with more abilities. The “bikini inspector” query now yields better things, like “blouses exterminator”, “wax adjuster” and “bikini superintendant”.

It can also be used in some serious ways, like when you are searching for a good idea for a domain name or a company name. Let’s say you have a car club, but carclub.com is obviously taken. Spin it off with an exclamation point, like so: “!car club”, and here you go: carcondo.com, carcafe.com, carsmart.com, etc…

For all you domaineers: there’s an export to a list of .com domains, and there are options to export to AdWords lists and plain text too.

Enjoy the all-new TheCraziestIdeas.com.

Posted by admin under site | Comments Off

21st Aug 2008

“Rarest Synonyms” or auto-related words magic

I’ve been quiet for some time now. Actually, I was on vacation in the middle of nowhere, without TV, radio or Internet. Let me tell you something - those who tell you “Russia has two problems - idiots and roads” are wrong about the roads. I drove nearly 1000 miles and the worst problem was extortion by the local road police, not the roads :) Okay, enough chit-chat, let’s get back to business!

Ever thought of words that are close / almost synonymous to “ufos”? :)
Me neither, but still: ufos, alien, aliens, extraterrestrials, sedition, strange, ghosts, foreigners, predator, crop, extra, ufo, robots, colors, strangers, spooks, monsters, animals, weird, dead, … Ok, that wasn’t really tough.

How about synonymous words for “TechCrunch”? Easy! Techcrunch, mashable, techmeme, slashdot, digg, engadget, correct, cnet, com, readwriteweb, gigaom, perezhilton.

Magic? No way, you can do it too now. New feature added to TheRarestWords - just go to the word page, as in http://therarestwords.com/word/apple and see for yourself.

If you want to know the technical details of how it’s done: take 3-grams (like the Wikipedia n-grams I released) and search for patterns like “red and blue”, “techcrunch or gigaom”, etc. Words that keep showing up on either side of “and” / “or” tend to be related. See? Easy!
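Here is a rough sketch of that trick in Python. The 3-gram counts are made-up toy data, and the scoring (just summing frequencies) is my simplification for illustration, not what the site actually runs.

```python
from collections import Counter

# Hypothetical toy 3-gram counts: (word1, word2, word3) -> frequency
TRIGRAMS = {
    ("red", "and", "blue"): 120,
    ("blue", "and", "red"): 40,
    ("red", "or", "green"): 75,
    ("red", "and", "white"): 60,
    ("red", "is", "hot"): 500,  # ignored: the middle word is not a conjunction
    ("techcrunch", "or", "gigaom"): 9,
}

CONJUNCTIONS = {"and", "or"}


def related_words(word, trigrams, top=10):
    """Rank words that sit opposite `word` in 'X and Y' / 'X or Y' 3-grams."""
    scores = Counter()
    for (left, middle, right), count in trigrams.items():
        if middle not in CONJUNCTIONS:
            continue
        if left == word and right != word:
            scores[right] += count
        elif right == word and left != word:
            scores[left] += count
    return [w for w, _ in scores.most_common(top)]


print(related_words("red", TRIGRAMS))
```

With real data you would probably want to normalize by how common each word is overall, otherwise very frequent words float to the top.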

No, I didn’t use Wikipedia’s n-grams. I used a much broader sample from pages all around the Internet, with phrase detection and some more smart stuff, but even if you parse Wikipedia’s n-grams you’d still get pretty good results (although no synonyms for “TechCrunch” and other rare words).

Posted by admin under site | Comments Off

01st Aug 2008

All Wikipedia’s n-grams are REALLY belong to us

A few days ago I was thinking about Google having released its n-grams in the past. Damn, that was the second thing I’ve wanted to get, after the TLD Zone Access Program. [I did apply to it, but never heard back from them. In our open information age, the most wanted information, even though it seems open, is usually out of reach, like in those 2 cases.]

So, I’ve decided to do the next best thing: build an n-gram frequency list from Wikipedia. It’s not quite as big as Google’s (with a trillion or so n-grams and 5 DVDs), but the licensing terms are much better (“Free” vs Google’s “$180”, “you can use it” vs Google’s “you can’t use it” license).
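In case you wonder what building such a list looks like, here is a tiny sketch. It is not the script I ran over the Wikipedia dump; the tokenizer is deliberately naive and it ignores sentence boundaries, so treat it as an illustration only.

```python
import re
from collections import Counter


def ngram_counts(text, n):
    """Count n-grams of lowercase word tokens in `text`."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    # zip over n shifted copies of the word list yields consecutive n-word tuples
    grams = zip(*(words[i:] for i in range(n)))
    return Counter(grams)


sample = "The red and blue flag. The red and white flag."
counts = ngram_counts(sample, 3)
for gram, freq in counts.most_common(3):
    print(" ".join(gram), freq)
```

For a real dump you would feed the text in chunks and merge the counters, since the whole thing will not fit in memory at once.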

Read more and download it here

Posted by admin under downloads, word frequency | Comments Off

30th Jul 2008

Global Web Functions marketplace - a possible machine for making millionaires out of programmers

Well, I’ve been playing with my new toy, which might replace “Craziest Ideas” as it has a little more usefulness in it. It’s actually a kind of “Google Sets”, but using slightly different technology (“Google Sets” particularly looks for <ul><li>red<li>white<li>blue</ul> on the Web).

Okay, so I’ve been playing with the “web framework” phrase when suddenly I’ve got a million dollar phrase: “web functions” :) Well, don’t get too excited - this million is up for grabs but it’s not low-hanging fruit.

Please note that this is only an IDEA, a CONCEPT, not a description of some real framework/library.

The concept is simple. We have a lot of APIs scattered around the Web in RSS, REST, XML, JSON, Atom, etc… Each of them has its own rules, registrations, signing mechanisms, etc., etc., etc… The more popular ones get more attention, so their libraries are available in more languages; others are less popular, so you have to roll your own.

Okay, but why doesn’t someone (Amazon, Google, Yahoo, Facebook, I’m looking in your direction) actually create an open platform for remote calls, so that every API could be called with a simple call into one huge database of APIs? (“Open” as in “we welcome all developers and programmers”, because “open sourcing” wouldn’t really be applicable here, because of the billing involved…) So, if I want Google Images for “mars”, I go to some site, let’s say globalwebfunctions.com (it’s not an actual site), and search for Google Images. I find the google_images call to be what I need (and also google_images2, google_images_with_descriptions or google_images_by_color - each of those is developed by independent developers, some of whom would be doing exactly the same thing, but maybe for a different price), and let’s say I do this in PHP:

$wf=new GlobalWebFunctions('my_login','my_secret_key');
$list_of_images = $wf->call('google_images','q:mars', 'expect:list');

So, now my program connects to the centralized site, finds out which server is responsible for the google_images function, signs my request, deducts let’s say 0.1 cent from my account and returns me the Google images.
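To make that flow a bit more concrete, here is a toy in-memory sketch in Python. Everything in it is hypothetical: the CentralSite class, the HMAC-SHA256 signing scheme and the integer billing units (1 unit = 0.1 cent) are my assumptions for illustration, since none of this exists.

```python
import hashlib
import hmac


class InsufficientFunds(Exception):
    pass


def sign(key, name, args):
    """HMAC-SHA256 over the function name and its sorted arguments."""
    message = name + "?" + "&".join("%s=%s" % kv for kv in sorted(args.items()))
    return hmac.new(key, message.encode(), hashlib.sha256).hexdigest()


class CentralSite(object):
    """Hypothetical in-memory stand-in for the central Global Web Functions site."""

    def __init__(self):
        self.functions = {}  # function name -> (handler, price in 0.1-cent units)
        self.accounts = {}   # login -> {"key": bytes, "balance": int}

    def register_function(self, name, handler, price):
        self.functions[name] = (handler, price)

    def register_account(self, login, secret_key, balance):
        self.accounts[login] = {"key": secret_key.encode(), "balance": balance}

    def call(self, login, signature, name, args):
        account = self.accounts[login]
        if not hmac.compare_digest(signature, sign(account["key"], name, args)):
            raise PermissionError("bad signature")
        handler, price = self.functions[name]  # find out who is responsible
        if account["balance"] < price:
            raise InsufficientFunds(name)
        account["balance"] -= price            # deduct the per-call fee
        return handler(**args)                 # run it and hand back the result


site = CentralSite()
site.register_function(
    "google_images", lambda q: ["%s_%d.jpg" % (q, i) for i in range(3)], price=1
)
site.register_account("my_login", "my_secret_key", balance=1000)

args = {"q": "mars"}
signature = sign(b"my_secret_key", "google_images", args)
list_of_images = site.call("my_login", signature, "google_images", args)
print(list_of_images, site.accounts["my_login"]["balance"])
```

In a real system the handler would live on the developer’s own server and the central site would only route, sign and bill, but the order of steps would be the same.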

Here’s the kicker. Independent programmers could write those simple reusable functions, like:

submit_to_digg('url:myurl')
digg_get_my_recommendations('login:rarestwords')
define_from_urbandictionary('busted')
weather_in('city:Chicago','when:today', 'in:Celsius')
flickr_creative_commons_image('big sale', 'expect:jpeg')
get_page_obey_robotstxt('url:http://therarestwords.com/','ua:TheRarestParser/0.4b')
geolocate('city:Chicago')
resize_image('jpg:'.$myjpeg, 'w:800', 'h:600')
get_wordfrequency('disobedient')
big_distributed_table_set('n:user_1353_name', 'v:Mr V.')
alexa_grep('q:<li>(.*?)</li>')
convert_xml_to_json('q:<book><title>test</title></book>')

or maybe even:

map_reduce('mapper:global-function:resize','reduce:global-function:group_images_by_color')

Now for the interesting part. Some of those functions could be free, some could be paid (to cover the traffic expenses and machine time), so anyone could register on that central site and become either a developer or a programmer:

Developer

Develops Global Web Functions: places the code on his own server, using some kind of Global Web Functions API in his language of choice (Java, C, Perl, PHP, Python, Erlang, you name it…). Earns money for each call or just sets a number of free calls per user per day (per second, etc).

Programmer

Uses those Global Web Functions, pays fractions of a cent for the usage :)
The idea is that this would flatten the learning curve for all those APIs. I’ve never got to the end of most of the APIs. And for the most part the usage patterns are the same. I bet a lot of people use Google Maps only to display their place on Earth, a lot of people rewrite the resize_image function in every possible language, and have you ever tried to read Amazon’s APIs when all you need is an s3_put('bucket','file','key','secret-key','text-text-text') function and a similar s3_read???

Also, two more examples from the smaller world. Some people asked me for an API for my TheRarestWords project, particularly for the current word frequencies. If I developed it, it would overload my server without even a cent of profit. And I bet a lot of you have a lot of information you could sell, or could write a resize_jpeg function in your language, put up a few servers to run it and earn lifetime income :)
There should be a local caching mechanism included in the Global Web Functions API, so that get_all_world_color_names(), for example, could be called just once, not on every load of a furniture store’s order form.
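A minimal sketch of what such a local cache could look like on the client side, in Python. CachingClient, its ttl parameter and fake_remote are all hypothetical names for illustration; the real thing would wrap an actual network call.

```python
import time


class CachingClient(object):
    """Wraps a remote-call function and remembers results for `ttl` seconds."""

    def __init__(self, remote_call, ttl=3600, clock=time.time):
        self.remote_call = remote_call
        self.ttl = ttl
        self.clock = clock
        self._cache = {}  # (name, sorted args) -> (expiry time, result)

    def call(self, name, **args):
        key = (name, tuple(sorted(args.items())))
        now = self.clock()
        hit = self._cache.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]  # still fresh: no remote call, no charge
        result = self.remote_call(name, **args)
        self._cache[key] = (now + self.ttl, result)
        return result


calls = []


def fake_remote(name, **args):
    """Hypothetical stand-in for the paid remote call; records each real hit."""
    calls.append(name)
    return ["red", "white", "blue"]


wf = CachingClient(fake_remote, ttl=60)
for _ in range(100):  # e.g. 100 loads of the order form
    colors = wf.call("get_all_world_color_names")
print(colors, len(calls))
```

The function’s developer should probably be the one to say how long a result may be cached, since only he knows how often the data changes.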

More ideas:

add_comment('id:http://rarestblog.com/2008/07/global-web-functions-how-to-make-web-more-interactive', 'comment:This idea really sucks')
get_comments('id:http://rarestblog.com/2008/07/global-web-functions-how-to-make-web-more-interactive', 'expect:html', '<ul><li>[[comment]]</li></ul>')
$instance_id=provision_virtualized_10_percent_part_of_amazon_ec2('duration:10days');
prolong_ec2('instance:'.$instance_id, 'duration:20days')

and pay 10% of the price for 7% of the resources of the minimum machine instead of paying for a full one (where some entrepreneur buys a full machine, divides it into 10 parts and oversells, even earning a profit of 30% for doing nothing)

write_blog_post('topic:World War 3', 'http_post_result_when_ready:url(http://myserver.com/accept)');

as a function-based interface to GetAFreelancer, where someone would manually take care of finding an author, making him write and then returning the article to you
(think Amazon Mechanical Turk)

and even

amazon_mechanical_turk('task:Write an article','http_post:http:.....')

Some other ideas: programmers might request a particular new function by submitting a prepared unit test. You should probably also pass “prices:27_jun_2008” to the centralized server, so that any call to a function that changed its price after that date would be blocked. Then you have the choice of either agreeing to the new price or switching to another, similar, cheaper function :) Damn, we would have a lot of resize_image_283909230 functions :)
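The price pinning could work roughly like this on the server side. This is only a sketch: PRICE_HISTORY, check_price and the 0.1-cent units are invented for the example.

```python
from datetime import date


class PriceChanged(Exception):
    pass


# Hypothetical price history kept by the central server:
# function name -> list of (effective date, price in 0.1-cent units)
PRICE_HISTORY = {
    "resize_image": [(date(2008, 5, 1), 1), (date(2008, 7, 15), 3)],
    "geolocate": [(date(2008, 4, 1), 2)],
}


def check_price(name, prices_as_of):
    """Return the current price, or raise if it changed after `prices_as_of`."""
    last_change, current_price = sorted(PRICE_HISTORY[name])[-1]
    if last_change > prices_as_of:
        raise PriceChanged("%s changed price on %s" % (name, last_change))
    return current_price


pinned = date(2008, 6, 27)  # the "prices:27_jun_2008" from the text
print(check_price("geolocate", pinned))
try:
    check_price("resize_image", pinned)
except PriceChanged as exc:
    print("blocked:", exc)
```

The caller then bumps the pinned date once he has looked at the new price and agreed to it.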
Oh, and the Global Web Functions API is just a set of protocols that define how to use all those functions from your language. For Python it could be:

wf=GlobalWebFunctions('my_login','my_secret_key')
list_of_images = wf.call('google_images', q='mars', expect='list', price=0.0005)

PHP example is in the beginning (Perl would pretty much be the same). Maybe C:

list_of_images=GlobalWebFunctions.call('my_login', 'my_secret_key', 'google_images', 'q:mars', 'expect:list');

Et cetera…

And it should also define the return format. I’d think REST returning JSON (sometimes over https) would be great, except that it really doesn’t define how to send binary data (like images).
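One common workaround (my assumption here, not something this concept prescribes) is to base64-encode the binary part and ship it as a JSON string. The field names in this sketch are made up.

```python
import base64
import json


def pack_result(data):
    """Wrap raw bytes into a JSON document, base64-encoding the payload."""
    payload = {
        "status": "OK",
        "encoding": "base64",
        "data": base64.b64encode(data).decode("ascii"),
    }
    return json.dumps(payload)


def unpack_result(text):
    """Reverse of pack_result: parse the JSON and decode the bytes."""
    payload = json.loads(text)
    return base64.b64decode(payload["data"])


jpeg_bytes = b"\xff\xd8\xff\xe0fake-jpeg-bytes"  # placeholder, not a real image
wire = pack_result(jpeg_bytes)
print(unpack_result(wire) == jpeg_bytes)
```

The cost is that base64 makes the payload about a third bigger, so for large images a plain binary HTTP response might still be the better deal.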

Well :) I have neither a clear idea how, nor the finances to create this kind of Behemoth :) Also, I have some doubts about the profitability of this for small startups, but maybe for guys like Google/Amazon it could be a big marketplace to expand their we-rule-all-of-the-world-knowledge efforts :) And yet another way for ultra-profitable Google to disburse cash :)
But, WARNING. If you are going to do this:

  1. The initialization should be as SIMPLE as
    wf=GlobalWebFunctions('my_login','my_secret_key')
  2. The call must be as SIMPLE as:
    list_of_images=GlobalWebFunctions.call('my_login', 'my_secret_key', 'google_images', 'q:mars', 'expect:list');
  3. The call must return NATIVE data structures (arrays, arrays-of-arrays, hashes (if any) or tuples, strings and integers). I don’t want any JSON or XML to parse.
  4. In languages that support Exceptions the call should return only the result and throw an Exception where appropriate. In those that don’t, every function should probably return array( array( status, message ), actual_data ), so that I could check ret[0][0]=='OK' before proceeding:
    array( array('OK'), actual_result)
    array( array('ThrottleException','You are requesting too fast'), '' )
    array( array('InputException','Input values are wrong'), '' )
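For a language that does have exceptions, the client library could translate that status array into a native exception. A small Python sketch, with hypothetical class names mirroring the statuses above:

```python
class WebFunctionError(Exception):
    """Base class for errors reported by a web function."""


class ThrottleException(WebFunctionError):
    pass


class InputException(WebFunctionError):
    pass


ERRORS = {
    "ThrottleException": ThrottleException,
    "InputException": InputException,
}


def unwrap(ret):
    """Turn ((status, message), data) into plain data or a raised exception."""
    status, data = ret
    if status[0] == "OK":
        return data
    message = status[1] if len(status) > 1 else ""
    # unknown status names fall back to the generic base error
    raise ERRORS.get(status[0], WebFunctionError)(message)


print(unwrap((("OK",), ["mars_1.jpg", "mars_2.jpg"])))
try:
    unwrap((("ThrottleException", "You are requesting too fast"), ""))
except ThrottleException as exc:
    print("throttled:", exc)
```

That way the wire format stays the same for every language and only the thin client decides whether to throw or to return the status.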

Just in case you want to tell me something, e-mail me at rarestwords@mail.ru.

I don’t think the Open Source community could raise something like this, but I might be wrong. If you believe in it more than I do - well, let’s try. Mail me at rarestwords@mail.ru with ideas or with what resources you could provide. But we would need servers, programmers in different languages, and money to pay for traffic and promotion, without any guarantee it’ll at least pay for itself :)

Posted by admin under ideas | Comments Off

28th Jul 2008

All hail the Cuil, SearchMe, Technorati! New age Internet is ripoff-based and we need to evolve because of this.

Short version: If you are a user - hail Cuil! If you are a developer/designer/any kind of creative person - possibly fear Cuil!

As you might already know, there’s a new sheriff in town. Well, not quite the sheriff, but rather a bunch of ex-Google guys (or so they say) that have built a new (not quite new) search engine - Cuil (at the moment of writing it’s unavailable, I guess because of the load).

Actually I like this engine. Mostly due to the fact that today it matches Google in traffic numbers, i.e. the number of people who came to TheRarestWords from Cuil at the moment is EQUAL to the number who came from Google. And if TheRarestWords were making money, today I would have been enjoying double profits :) I guess this is only temporary, as today everybody is talking about them. Tomorrow we’re going to see much less traffic from them than today.

But with this great opportunity - there’s also a big evil in Cuil.

What worries me is the amount of text they show on the search page. It’s becoming more and more of a nuisance that search companies think it’s okay to massively copy parts of your site and display them. Look at searchme.com, particularly at this page about one of the greatest people in history. Do you even need to visit those pages? No, because you can read it all right in SearchMe. But that means no more advertising profits for the sites they display, and lost profits mean site owners would be less encouraged to create more content for their sites, because now they’re creating content for SearchMe (I’m deliberately avoiding linking to that site). And isn’t this site one big obvious web-scale copyright infringement?

Okay, but they seem to have some law on their side, since nobody has sued them yet. And my sites don’t generate any kind of measurable profit, so even if I lose something due to SearchMe, it’s going to be less than a cent per month, I guess. But some of you lose profits.

Okay, back to Cuil. The example SearchMe is setting for next-gen search engines is really bad. What if every engine started copying all other sites’ content? And Cuil is showing much more contiguous text from a web page; I think in many cases it wouldn’t even be necessary to visit the page to get the info. And that’s a problem.

Some say that Internet advertising is coming to an end, because it has artificially inflated prices (due to the fact that Google and others set a MINIMUM price per keyword, and the fact that you can sometimes buy a word from Google and sell it to another search engine for an even bigger price, which means that it’s even more inflated [it was called AdWords Arbitrage; it’s no longer really possible, because Google RAISED their minimum prices for many keywords even more a year or so back]). Some say copyright laws are going to change because the Internet is a very big copy machine and you can’t really protect the copyright of anything that CAN be copied anymore (webpages being the example); you can only protect scarce (I’m not sure that’s the right word) things, like reputation, integrity, etc… The point is that the last argument is pretty much a “doomsday proclamation” for creators. But I don’t believe in doomsdays. There were too many flopped ones in the past to be afraid of those predictions.

The problem is that the last argument is really becoming more and more of a reality. Torrents are no longer a problem only for the Music Industry or the Film Industry. Rather, we are now witnessing the beginning of an era where more and more of our work is copied every day. And it seems to be legal (nobody has closed SearchMe yet, which does that massively; I can only block their bot, which I do). The problem is that this kind of engine (competing over who shows the bigger snippet of my text) is becoming reality RIGHT NOW. And it’s kind of Torrents for websites.

See Technorati.com - another example of full post copying. Legal? Take this page for example. It’s nearly a full copy of my post. And Technorati enjoys a much bigger PageRank and whateverelserank there is. That means they COULD get my post indexed FASTER than me. Think they have noindex on those pages, so that it’s not massive copyright infringement, but rather a service for users? Think again! 1 000 000 infringed blogs INDEXED BY GOOGLE in the .com domain only. If you blog, you are probably there too. I think WordPress even pings a service which tips off Technorati that there’s new content on your blog.

Do you think that doesn’t affect you? Think again! Do you know how many searches from Google/Yahoo for your text land on Technorati instead, because they had your text indexed BEFORE Google indexed the post on your site? Does Google really know that you are the originator of this “duplicate content”, or will they possibly think Technorati is, since the text was there first (at least for Google, who indexes millions of Technorati pages a day vs. your mom-and-dad blog being indexed once a day or even once a week)?

The problem is that if that’s my new reality, I need to evolve from thinking of my material as copyrighted and that nobody would copy it without at least facing a moral dilemma (I can’t sue US people since I’m in Russia).

So let me be a doomsday crier too. The only way to evolve now is to think of our content as unprotected and somehow use something like the Creative Commons model for our own good. I.e. assume that your content will be copied and used somewhere. But how could you earn at least something as a reward for all the trouble you went through to create something, if we assume it’s going to be copied?

That’s the question each of us needs to think through.

Well, some music groups found business models that work. But I doubt any of you are going to buy a $30 copy of my article in a beautiful box if you could read it for free on the Internet :)
The problem is that blogs/sites rarely have real fans who want to support them with money. I’ve read about one experiment in software where a guy tried everything he could to make people “buy him a beer” in exchange for his freeware. He had 50 000 downloads and barely broke the $50 mark. 50 000! Dammit, that’s the population of the city I live in, and all of them together paid just $50!? That’s not the way to go.

We need to think ahead people. We need to think.

Posted by admin under site | Comments Off

26th Jul 2008

Suggestan released

So the project “Suggestan” is released. As usual, I have no clear idea of what it is or the direction it is going in. Well, it’s a kind of “define a thing” project, where you can find or share knowledge about the subjects/hobbies/professions/ideas that you know, in the form of suggestive questions.

Well, go and see for yourself and we’ll see if that’s going somewhere besides Trash Bin :) Go Suggestan!

Posted by admin under site | Comments Off

26th Jul 2008

Another project coming soon from the land of Suggestan

Probably within 24 hrs I will release my 4th hobby project (in case you’ve just turned on your TV - the first three are The Rarest Words, The Rarest News and The Craziest Ideas). The 4th is called “Suggestan”.

Webster defines “Suggestan” as “1. geo. A little country where everyone is suggesting something.” Ok, I’m just kidding :) The project is going to be yet another joke project which has some meaning. Expect something kind of like “The Rarest Words” but for hobbies, professions and ideas instead of words. Expect something slightly more complex than 1 textbox under a hobby name. :)
This project is going to once again “tap into crowdsourcing”, and maybe even into “semantics”, as it’s going to define some relations between words, hobbies, ideas, places, etc. It’s definitely not something revolutionary, but if you like “The Rarest Words”, you’re going to enjoy Suggestan too.

Posted by admin under ideas | Comments Off

20th Jul 2008

Testing SQL engines/queries with Django (avg.query time)

I love Django for many reasons and here’s one of them: measuring the average query time, which I did today to compare engines (MySQL vs PostgreSQL), is easily done with Django.
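The general shape of such a measurement, as a stdlib-only sketch: here sqlite3 stands in for the real engines and the news table is invented, so this is not the Django-based test from the full entry, just the idea of averaging wall-clock time over repeated runs.

```python
import sqlite3
import time


def average_query_time(conn, sql, params=(), runs=100):
    """Run `sql` `runs` times and return the mean wall-clock seconds per run."""
    start = time.perf_counter()
    for _ in range(runs):
        conn.execute(sql, params).fetchall()
    return (time.perf_counter() - start) / runs


# Hypothetical test table; sqlite3 in memory is only a stand-in engine
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE news (id INTEGER PRIMARY KEY, title TEXT)")
conn.executemany(
    "INSERT INTO news (title) VALUES (?)",
    [("item %d" % i,) for i in range(1000)],
)
conn.commit()

avg = average_query_time(
    conn, "SELECT COUNT(*) FROM news WHERE title LIKE ?", ("item 9%",)
)
print("%.6f s per query" % avg)
```

Run the same queries against each engine with the same data and compare the averages, keeping in mind that caches warm up after the first few runs.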

Read the rest of this entry »

Posted by admin under python, site | No Comments »

20th Jul 2008

I don’t get it - real web application with PostgreSQL vs MySQL MyISAM vs MySQL InnoDB (with Django’s ORM, 2008)

UPDATE: This has been on Reddit. Read the comments. The main thing to understand is that those results are for the default settings of both databases, for my case and my priorities. Yours could (and maybe even should) be different.

Well, this year and last I’ve been hearing everywhere that PostgreSQL is the way to go and that using MySQL in 2008 makes people puke… But without any real arguments (besides “Postgres is the way to go”). Well, I don’t usually buy into fashion-style technology shopping (that’s when someone can’t prove something’s better than what I use) and this time wouldn’t be an exception.

Ok, so scouring the Internet I’ve found some comparative tests. Mostly in the form of “INSERT 10000 items WITH COMMIT AT THE END”. Okay, how many people have actually inserted 10000 items in a real web application (besides dumping-restoring-moving data)? Some people did, but they were both unavailable for comment :) Just kidding.

Ok, so since I’m with Django, moving to Postgres and testing my application (RarestNews) should be a snap, shouldn’t it? Just change the database string in settings.py and install PostgreSQL, right? Wrong! :) But everything in its time, step by step.

Read the rest of this entry »

Posted by admin under site | No Comments »