

Archive for July, 2008

30th Jul 2008

Global Web Functions marketplace - a possible machine for making millionaires out of programmers

Well, I’ve been playing with my new toy, which might replace “Craziest Ideas” as it is a little more useful. It’s actually a kind of “Google Sets”, but using slightly different technology (“Google Sets” particularly looks for <ul><li>red<li>white<li>blue</ul> on the Web).

Okay, so I was playing with the “web framework” phrase when suddenly I got a million-dollar phrase: “web functions” :) Well, don’t get too excited - this million is up for grabs but it’s not low-hanging fruit.

Please note that this is only an IDEA for a CONCEPT, not a description of some real framework/library.

The concept is simple. We have a lot of APIs scattered around the Web in RSS, REST, XML, JSON, Atom, etc… Each of them has its own rules, registrations, signing mechanisms, etc. The more popular ones get more attention, so libraries are available for them in more languages; others are less popular, so you have to roll your own.

Okay, but why doesn’t someone (Amazon, Google, Yahoo, Facebook, I’m looking in your direction) actually create an open platform for remote calls, so that every API could be called with a simple call through one huge database of APIs? (“Open” as in “we welcome all developers and programmers”, because “open sourcing” wouldn’t really be applicable here, given the billing involved…) So, if I want Google Images for “mars”, I go to some site, let’s say globalwebfunctions.com (it’s not an actual site), and search for Google Images. I find the google_images call to be what I need (and also google_images2, google_images_with_descriptions or google_images_by_color - each of those developed by independent developers, some of which would do exactly the same thing, but maybe for a different price), and let’s say I do this in PHP:

$wf=new GlobalWebFunctions('my_login','my_secret_key');
$list_of_images = $wf->call('google_images','q:mars', 'expect:list');

So, now my program connects to the centralized site, which finds out what server is responsible for the google_images function, signs my request, deducts let’s say 0.1 cent from my account and returns the Google images to me.
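To make that flow concrete, here is a minimal in-process sketch in Python. Everything in it is an assumption for illustration: the names Marketplace, Account and sign are made up, HMAC-SHA256 is just one possible way to “sign my request”, and the “remote server” is a plain local function. In this sketch the client signs and the central site verifies, which is only one way to arrange it.

```python
import hashlib
import hmac
import json
from dataclasses import dataclass, field
from typing import Callable, Dict


def sign(secret_key: str, name: str, params: dict) -> str:
    """HMAC-SHA256 over the function name and its parameters (assumed scheme)."""
    payload = json.dumps([name, params], sort_keys=True).encode()
    return hmac.new(secret_key.encode(), payload, hashlib.sha256).hexdigest()


@dataclass
class Account:
    secret_key: str
    balance_cents: float = 0.0


@dataclass
class Marketplace:
    """Hypothetical central site: verifies, bills, then forwards the call."""
    functions: Dict[str, tuple] = field(default_factory=dict)   # name -> (handler, price in cents)
    accounts: Dict[str, Account] = field(default_factory=dict)  # login -> Account

    def register(self, name: str, handler: Callable[..., object], price_cents: float = 0.0) -> None:
        self.functions[name] = (handler, price_cents)

    def call(self, login: str, signature: str, name: str, **params):
        account = self.accounts[login]
        # 1. check that the request was signed with the caller's secret key
        if not hmac.compare_digest(signature, sign(account.secret_key, name, params)):
            raise PermissionError("bad signature")
        # 2. find out who is responsible for this function and what it costs
        handler, price_cents = self.functions[name]
        if account.balance_cents < price_cents:
            raise RuntimeError("insufficient balance")
        # 3. deduct the price, then forward the call and return the result
        account.balance_cents -= price_cents
        return handler(**params)
```

A real version would forward step 3 over the network to the developer’s server; the local handler here only stands in for that hop.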

Here’s the kicker. Independent programmers could write those simple reusable functions, like:

submit_to_digg('url:myurl')
digg_get_my_recommendations('login:rarestwords')
define_from_urbandictionary('busted')
weather_in('city:Chicago','when:today', 'in:Celsius')
flickr_creative_commons_image('big sale', 'expect:jpeg')
get_page_obey_robotstxt('url:http://therarestwords.com/','ua:TheRarestParser/0.4b')
geolocate('city:Chicago')
resize_image('jpg:'.$myjpeg, 'w:800', 'h:600')
get_wordfrequency('disobedient')
big_distributed_table_set('n:user_1353_name', 'v:Mr V.')
alexa_grep('q:<li>(.*?)</li>')
convert_xml_to_json('q:<book><title>test</title></book>')

or maybe even:

map_reduce('mapper:global-function:resize','reduce:global-function:group_images_by_color')

Now for the interesting part. Some of those functions could be free, some could be paid (to cover the traffic expenses and machine time), so anyone could register on that central site and become either a developer or a programmer:

Developer

Develops Global Web Functions - places the code on his own server, written in his language of choice (Java, C, Perl, PHP, Python, Erlang, you name it…) using some kind of Global Web Functions API for that language. Earns money for each call, or just sets a number of free calls per user per day (per second, etc).

Programmer

Uses those Global Web Functions, pays some parts of cents for the usage :)
The idea is that this would flatten the learning curve for all those APIs. I’ve never got to the end of the docs for most APIs. And for the most part the usage patterns are the same. I bet a lot of people use Google Maps only to display their place on Earth, a lot of people rewrite the resize_image function in every possible language, and have you ever tried to read Amazon’s API docs when all you need is an s3_put('bucket','file','key','secret-key','text-text-text') function and a similar s3_read???

Also, two other examples from the smaller world. Some people asked me for an API for my TheRarestWords project, particularly for current word frequencies. If I developed it, it would overload my server without bringing in even a cent of profit. I bet a lot of you have a lot of information you could sell, or could write a resize_jpeg function in your language, put up a few servers to run it and earn lifetime income :)
There should be a local caching mechanism included in the Global Web Functions API, so that get_all_world_color_names(), for example, could be called just once, not on every load of a furniture store order form.
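A local cache like that could be a thin wrapper around the client. The sketch below is only an illustration: CachingClient, the one-hour default lifetime and the .call(name, **params) shape of the wrapped client are all my assumptions, and parameter values have to be hashable for the cache key to work.

```python
import time


class CachingClient:
    """Wraps any object with a .call(name, **params) method and caches results locally."""

    def __init__(self, client, ttl_seconds=3600.0, clock=time.monotonic):
        self._client = client
        self._ttl = ttl_seconds
        self._clock = clock
        self._cache = {}  # (name, sorted params) -> (expiry time, result)

    def call(self, name, **params):
        key = (name, tuple(sorted(params.items())))
        now = self._clock()
        hit = self._cache.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]  # served locally: no remote call, no charge
        result = self._client.call(name, **params)
        self._cache[key] = (now + self._ttl, result)
        return result
```

With this in place, the order form would pay for get_all_world_color_names once per hour instead of once per page load.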

More ideas:

add_comment('id:http://rarestblog.com/2008/07/global-web-functions-how-to-make-web-more-interactive', 'comment:This idea really sucks')
get_comments('id:http://rarestblog.com/2008/07/global-web-functions-how-to-make-web-more-interactive', 'expect:html', '<ul><li>[[comment]]</li></ul>')
$instance_id=provision_virtualized_10_percent_part_of_amazon_ec2('duration:10days');
prolong_ec2('instance:'.$instance_id, 'duration:20days')

and pay 10% of the price for 7% of the resources of the minimum machine instead of paying for a full one (where some entrepreneur buys a full machine, divides it into 10 parts and oversells, even earning a profit of 30% for doing nothing)

write_blog_post('topic:World War 3', 'http_post_result_when_ready:url(http://myserver.com/accept)');

as a function-based interface for GetAFreelancer, where someone would manually take care of finding an author, getting him to write, and then returning the article to you
(think Amazon Mechanical Turk)

and even

amazon_mechanical_turk('task:Write an article','http_post:http:.....')

Some other ideas might include: programmers might request a particular new function by supplying a prepared unit test; you should probably pass “prices:27_jun_2008” to the centralized server so that any call to a function that changed its price after that date would be blocked. Then you have the choice of either agreeing to the new price or switching to another, similar, cheaper function :) Damn, we would have a lot of resize_image_283909230 functions :)
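The price-date check could be as small as the sketch below. It is only an assumption about how such a guard might look: the function and exception names are made up, and the dates are passed as date objects rather than parsed from the “prices:…” string.

```python
from datetime import date


class PriceChangedError(Exception):
    """The function's price changed after the date the caller pinned."""


def check_price_pin(function_name: str, price_changed_on: date, prices_as_of: date) -> None:
    """Block the call if the price was changed after the pinned date."""
    if price_changed_on > prices_as_of:
        raise PriceChangedError(
            "%s changed its price on %s, after the pinned date %s"
            % (function_name, price_changed_on.isoformat(), prices_as_of.isoformat())
        )
```

The central server would run this before billing, so a price hike never reaches a caller who has not agreed to it.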
Oh, and the Global Web Functions API is just a set of protocols that define how to use all those functions from your language. For Python it could be:

wf=GlobalWebFunctions('my_login','my_secret_key')
list_of_images = wf.call('google_images', q='mars', expect='list', price=0.0005)

PHP example is in the beginning (Perl would pretty much be the same). Maybe C:

list_of_images = GlobalWebFunctions.call("my_login", "my_secret_key", "google_images", "q:mars", "expect:list");

Et cetera…

And it should also define the return format. I’d think REST returning JSON (sometimes over https) would be a great idea, except that it doesn’t really define how to send binary data (like images).

Well :) I have neither a clear idea of how, nor the finances, to create this kind of Behemoth :) I also have some doubts about the profitability of this for small startups, but maybe for guys like Google/Amazon it could be a big marketplace to expand their we-rule-all-of-the-world-knowledge efforts :) And yet another way for ultra-profitable Google to disburse cash :)
But, WARNING. If you are going to do this:

  1. The initialization should be as SIMPLE as
    wf=GlobalWebFunctions('my_login','my_secret_key')
  2. The call must be as SIMPLE as:
    list_of_images=GlobalWebFunctions.call('my_login', 'my_secret_key', 'google_images', 'q:mars', 'expect:list');
  3. The call must return NATIVE data structures (arrays, arrays-of-arrays, hashes (if any) or tuples, strings and integers). I don’t want any JSON or XML to parse.
  4. In languages that don’t support exceptions, every function should probably return array( array(status, exception message), actual_data ), so that I could check ret[0][0]=='OK' before proceeding. In languages that do support them - well, only the result, but throwing an exception where appropriate.
    array( array('OK'), actual_result)
    array( array('ThrottleException','You are requesting too fast'), '' )
    array( array('InputException','Input values are wrong'), '' )
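For a language that does have exceptions, the client library could translate that status array into a native exception. A minimal Python sketch, assuming the wire format from item 4 above; the base class name and the unwrap helper are made up for illustration:

```python
class GlobalWebFunctionError(Exception):
    """Base class for errors reported in the status array."""


class ThrottleException(GlobalWebFunctionError):
    pass


class InputException(GlobalWebFunctionError):
    pass


_EXCEPTIONS = {
    "ThrottleException": ThrottleException,
    "InputException": InputException,
}


def unwrap(ret):
    """Turn [[status, message], data] into either the data or a raised exception."""
    status, data = ret
    if status[0] == "OK":
        return data
    # Unknown status names fall back to the generic base class.
    exc_type = _EXCEPTIONS.get(status[0], GlobalWebFunctionError)
    message = status[1] if len(status) > 1 else status[0]
    raise exc_type(message)
```

So unwrap([['OK'], images]) just hands back images, while a throttled reply surfaces as a ThrottleException the caller can catch.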

Just in case you want to tell me something - e-mail me at rarestwords@mail.ru.

I don’t think the Open Source community could raise something like this, but I might be wrong. If you believe in it more than I do - well, let’s try. Mail me at rarestwords@mail.ru with ideas or with the resources you could provide. But we would need servers, programmers in different languages, and money to pay for traffic and promotion, without any guarantee it’ll at least pay for itself :)

Posted in ideas | Comments Off

28th Jul 2008

All hail the Cuil, SearchMe, Technorati! New age Internet is ripoff-based and we need to evolve because of this.

Short version: If you are a user - hail Cuil! If you are a developer/designer/any kind of creative person - possibly fear Cuil!

As you might already know - there’s a new sheriff in town. Well, not quite a sheriff, but rather a bunch of ex-Google guys (or so they say) who have built a new (not quite new) search engine - Cuil (unavailable at the moment of writing, I guess because of the load).

Actually I like this engine. Mostly because today it matches Google in traffic numbers - i.e. the number of people who came to TheRarestWords from Google at the moment is EQUAL to the number from Cuil. And if TheRarestWords were making money, today I would have been enjoying double profits :) I guess this is only temporary, as today everybody is talking about them. Tomorrow we’re going to see much less traffic from them than today.

But with this great opportunity - there’s also a big evil in Cuil.

What worries me is the amount of text they show on the search page. It’s becoming more and more of a nuisance that search companies think it’s okay to massively copy parts of your site and display them. Look at searchme.com, particularly at this page about one of the greatest people in history. Do you even need to visit those pages? No, because you can read it all right in SearchMe. But that means no more advertising profits for the sites they display, and lost profits mean site owners will be less encouraged to create more content for their sites, because now they’re creating content for SearchMe (I’m deliberately avoiding linking to that site). And isn’t this site one big obvious web-scale copyright infringement?

Okay, but they seem to have some law on their side, since nobody has sued them yet. And my sites don’t generate any kind of measurable profit, so even if I lose something due to SearchMe, it’s going to be less than a cent per month, I guess. But some of you lose profits.

Okay, back to Cuil. The example SearchMe is setting for next-gen search engines is really bad. What if every engine started copying all other sites’ content? And Cuil is showing much more contiguous text from each web page; I think in many cases it wouldn’t even be necessary to visit the page to get the info. And that’s a problem.

Some say that Internet advertising is coming to an end, because it has artificially inflated prices (due to the fact that Google and others set a MINIMUM price per keyword, and the fact that you can sometimes buy a word from Google and sell it to another search engine for an even bigger price, which means that it’s even more inflated [it was called AdWords Arbitrage; it’s no longer really possible because Google RAISED their minimum prices for many keywords even more a year or so back]). Some say copyright laws are going to change because the Internet is a very big copy machine, and that you can’t really protect copyright anymore on anything that CAN be copied (webpages being the example), but rather can only protect scarce (I’m not sure that’s the right word) things, like reputation, integrity, etc. The point is that this last argument is pretty much a “doomsday proclamation” for creators… but I don’t believe in doomsdays. There have been too many flopped ones in the past to be afraid of those predictions.

The problem is that this last argument is really becoming more and more of a reality. Torrents are no longer a problem only for the Music Industry or the Film Industry. Rather, we’re now witnessing the beginning of an era where more and more of our work is copied every day. And it seems to be legal (nobody has closed SearchMe yet, which does that massively; I can only block their bot, which I do). The problem is that these kinds of engines (competing over who shows the bigger snippet of my text) are becoming reality RIGHT NOW. And it’s kind of Torrents for websites.

See Technorati.com - another example of full post copying. Legal? Take this page for example. It’s nearly a full copy of my post. And Technorati enjoys much bigger PageRank and whateverelserank there is. That means they COULD get my post indexed FASTER than me. Think they have noindex for those pages, so that it’s not massive copyright infringement, but rather a service for users? Think again! 1 000 000 INDEXED BY GOOGLE infringed blogs in the .com domain alone. If you blog - you are probably there too. I think WordPress even pings a service which tips off Technorati that there’s new content on your blog.

Do you think that doesn’t affect you? Think again! Do you know how many searches from Google/Yahoo for your text land on Technorati instead, because they had your text indexed BEFORE Google indexed the post on your site? Does Google really know that you are the originator of this “duplicate content”, or will it possibly think Technorati is, since the text was there first (at least for Google, which indexes millions of Technorati pages a day vs. your mom-and-dad blog being indexed once a day or even once a week)?

The problem is that if that’s my new reality, I need to evolve from thinking of my material as copyrighted and assuming that nobody would copy it without at least facing a moral dilemma (I can’t sue US people since I’m in Russia).

So let me be a doomsday crier too. The only way to evolve now is to think of our content as unprotected and somehow use something like the Creative Commons model for our own good. I.e. assume that your content will be copied and used somewhere. But how could you earn at least something as a reward for all the trouble you went through to create it, if we assume it’s going to be copied?

That’s the question each of us needs to think through.

Well, some music groups found business models that work. But I doubt any of you are going to buy a $30 copy of my article in a beautiful box if you could read it for free on the Internet :)
The problem is that blogs/sites rarely have real fans who want to support them with money. I’ve read about one experiment in software where a guy tried everything he could to make people “buy him a beer” in exchange for his freeware. He had 50 000 downloads and barely broke the $50 mark. 50 000! Dammit, that’s the population of the city I live in, and all of them together paid just $50!? That’s not the way to go.

We need to think ahead people. We need to think.

Posted in site | Comments Off

26th Jul 2008

Suggestan released

So the project “Suggestan” is released. As usual I have no clear idea of what it is or the direction it is going in. Well, it’s kind of a “define a thing” project, where you can find or share knowledge about the subjects/hobbies/professions/ideas that you know, in the form of suggestive questions.

Well, go and see for yourself and we’ll see if that’s going somewhere besides Trash Bin :) Go Suggestan!

Posted in site | Comments Off

26th Jul 2008

Another project coming soon from the land of Suggestan

Probably within 24 hrs I will release my 4th hobby project (in case you’ve just turned on your TV - the first three are The Rarest Words, The Rarest News and The Craziest Ideas) - the 4th is called “Suggestan”.

The Webster defines “Suggestan” as “1. geo. A little country where everyone is suggesting something.” Ok, I’m just kidding :) The project is going to be yet another joke project which has some meaning. Expect something kind of like “The Rarest Words” but for hobbies, professions and ideas instead of words. Expect something slightly more complex than 1 textbox under a hobby name. :)
This project is going to once again “tap into crowdsourcing”, and maybe even into “semantics”, as it’s going to define some relations between words, hobbies, ideas, places, etc. It’s definitely not something revolutionary, but if you like “The Rarest Words” - you’re going to enjoy Suggestan too.

Posted in ideas | Comments Off

20th Jul 2008

Testing SQL engines/queries with Django (avg.query time)

I love Django for many reasons and here’s one of them. Measuring the average query time, which I did today to compare engines (MySQL vs PostgreSQL), is easily done with Django.
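The general shape of such a measurement is simple enough to sketch. This is not the Django code from the full post: it is a generic stand-in that times any callable, and it uses Python’s built-in sqlite3 with a made-up news table only so that the example runs on its own.

```python
import sqlite3
import time


def average_query_time(run_query, repeats=100):
    """Return the mean wall-clock seconds of run_query() over `repeats` runs."""
    start = time.perf_counter()
    for _ in range(repeats):
        run_query()
    return (time.perf_counter() - start) / repeats


# Stand-in database: an in-memory SQLite table with 1000 rows (illustrative only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE news (id INTEGER PRIMARY KEY, title TEXT)")
conn.executemany(
    "INSERT INTO news (title) VALUES (?)",
    [("item %d" % i,) for i in range(1000)],
)

avg = average_query_time(
    lambda: conn.execute("SELECT COUNT(*) FROM news WHERE title LIKE 'item 9%'").fetchall()
)
print("avg query time: %.6f s" % avg)
```

With Django the callable would wrap a queryset instead, forcing it to run with list(...), and the same numbers could then be collected once per database engine.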

(more…)

Posted in python, site | No Comments »

20th Jul 2008

I don’t get it - real web application with PostgreSQL vs mySQL MyISAM vs mySQL InnoDB (with Django’s ORM, 2008)

UPDATE: This has been on Reddit. Read the comments. The main thing to understand is that these results are for the default settings of both databases, for my case and my priorities. Yours could (and maybe even should) be different.

Well, this year and last I’ve been hearing everywhere that PostgreSQL is the way to go and that using MySQL in 2008 makes people puke… But without any real arguments (besides “Postgres is the way to go”). Well, I don’t usually buy into fashion-style technology shopping (that’s when someone can’t prove something is better than what I use) and this time wouldn’t be an exception.

Ok, so scouring the Internet I’ve found some comparative tests. Mostly in the form of “INSERT 10000 items WITH COMMIT AT THE END”. Okay, how many people have actually inserted 10000 items at once in a real web application (besides dumping-restoring-moving data)? Some people did, but they were both unavailable for comment :) Just kidding.

Ok, so since I’m with Django - moving to Postgres and testing my application (RarestNews) should be a snap, shouldn’t it? Just change the database string in settings.py and install PostgreSQL, right? Wrong! :) But everything in its time, step by step.

(more…)

Posted in site | No Comments »

20th Jul 2008

Django ORM + threading = memleak (workaround)

Well, after trying hundreds of ways to make Python’s garbage collector work with Django’s ORM and threading (see here - scroll to “Python 2.5 bug”) and using many tools (heapy, valgrind) to try to find the leaks (all the tools show 15-30MB used and no leaks, but in reality the program uses all available memory and starts to swap within minutes), I have to conclude that there doesn’t seem to be a fix while staying with threads.

I’ve tried:

  1. passing only integers instead of Django objects;
  2. adding +'' to strings to make copies of strings, not references (copy.copy and copy.deepcopy too);
  3. creating threads inside of threads, hoping that would lose the references somehow;
  4. del Object; del everything;
  5. weakrefs;
  6. moving all Django code into a function and only passing integer to it;
  7. disabling Django and only leaving lxml (parsing library) in thread, and vice versa - still leaks;
  8. something else too, but can’t remember all.

The worst part is that nothing detects where those GIGABYTES are going, every tool I used shows 15-30MB memusage.

The workaround I’ve settled for is running separate child processes that connect to a parent “queuer” process via xmlrpc. It takes a lot of memory (each process is 17MB vs some KB for a thread) and my guess is that XML isn’t the most efficient transport, but at least there are no memleaks even if a child is running an infinite loop.
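The layout could look roughly like the sketch below, written against the modern standard-library xmlrpc modules rather than the code I actually run. The JobQueue class, the -1 “queue is empty” marker and the function names are assumptions for illustration; the point is only that plain integers and strings cross the process boundary, never Django objects.

```python
import threading
from collections import deque
from xmlrpc.client import ServerProxy
from xmlrpc.server import SimpleXMLRPCServer


class JobQueue:
    """Parent-side queue; only plain integers and strings cross the boundary."""

    def __init__(self, job_ids):
        self._jobs = deque(job_ids)
        self._lock = threading.Lock()
        self.done = []

    def next_job(self):
        # Return -1 when empty, because XML-RPC does not send None by default.
        with self._lock:
            return self._jobs.popleft() if self._jobs else -1

    def report(self, job_id, status):
        with self._lock:
            self.done.append((job_id, status))
        return True


def start_queuer(job_ids, host="127.0.0.1", port=0):
    """Start the parent "queuer" in a background thread; port 0 picks a free port."""
    queue = JobQueue(job_ids)
    server = SimpleXMLRPCServer((host, port), logRequests=False)
    server.register_instance(queue)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server, queue


def worker_loop(url, handle):
    """Body of one child process: pull ids until the queue is empty."""
    proxy = ServerProxy(url)
    processed = 0
    while True:
        job_id = proxy.next_job()
        if job_id == -1:
            return processed
        proxy.report(job_id, handle(job_id))
        processed += 1
```

Each child would run worker_loop in its own process (started with multiprocessing or subprocess, for example), so whatever memory it leaks is returned to the OS when the process exits.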

If you have other ideas to try - let me know.

Posted in site | No Comments »

19th Jul 2008

RarestNews, scalability, 100000 news per day, databases, de-normalization and bugs in Python language

UPDATE: The response to this post by one of the developers of CouchDB.

The RarestNews project is currently being re-tooled, for several reasons. First of all there’s a design flaw. Accumulating 100 000 news articles every day in a single MySQL database is a bad idea. The database started to crawl (i.e. became really slow) on the 10th day, and on the 20th day it nearly came to a stop. Actually the site stopped being responsive at all (even at the start it wasn’t really responsive, but that was all due to normalization, which I was taught is good, but it’s not… in some cases). So the technical description of the scalability problems I’m facing follows.

The only thing that wasn’t giving me problems is the new Amazon EC2 High-CPU instances. That’s a terrific thing. Everything else is crap. :) Ok, not everything, but it probably is when misused the way I was misusing it.

MySQL problems

So, to be technical: I’ve used MyISAM tables. I never really liked InnoDB because of its slow writes, and at 100k new articles a day with lots of meta-data to write about them (tags, dates, snippets, word frequencies, etc.), MyISAM seemed like a good decision. The bad part was that on write MyISAM locks the whole table. So 50 bots scouring the Web for news, writing and locking the whole table, made the site almost unresponsive.

I’m not yet sure how to solve it - with InnoDB, with PostgreSQL or with some kind of new-age databases like CouchDB, StrokeDB, maybe Amazon’s SimpleDB, etc…

(more…)

Posted in site | No Comments »