

Archive for the 'site' Category

27th Oct 2008

Semantic Kernels generator

I’m launching yet another mini-project, which is actually a part of the RarestNews project. It can generate a semantic kernel for any query. A semantic kernel is the set of words that are closely related to the query, like “balls” for the query “baseball”, or “wordpress” for the query “blogging”.

The project is still in a pre-alpha stage (it works, but requires some automation), so kernels are processed with a long delay: gathering them requires running an EC2 farm, and it’s just inefficient to start one for each query.

Only the first 40 entries of each kernel are shown, which should be enough for any non-commercial use I can think of; full kernels will probably be available for a small fee (I haven’t decided yet).

So, welcome to “Semantic Kernel Bot“.

Posted in site | Comments Off

21st Aug 2008

TheCraziestIdeas reborn

TheCraziestIdeas has been expanded. It now has much more potential and an input field :) It’s still a joke project, but with more abilities. The “bikini inspector” query now yields better things, like “blouses exterminator”, “wax adjuster” and “bikini superintendant”.

It can also be used in some serious ways, like when you are searching for a good idea for a domain name or a company name. Let’s say you have a car club, but carclub.com is obviously taken. Spin it off with an exclamation point, like so: “!car club”, and here you go: carcondo.com, carcafe.com, carsmart.com, etc…

For all you domaineers - there’s an export to a list of .com domains, and there are options to export to AdWords lists and plain text too.

Enjoy the all-new TheCraziestIdeas.com.

Posted in site | Comments Off

21st Aug 2008

“Rarest Synonyms” or auto-related words magic

I’ve been quiet for some time now. Actually, I was on vacation in the middle of nowhere, without TV, radio or Internet. Let me tell you something - those who tell you “Russia has two problems - idiots and roads” are wrong about the roads. I drove nearly 1000 miles and the worst problem was extortion by the local road police, not the roads :) Okay, enough chit-chat, let’s get back to business!

Ever thought of words that are close / almost synonymous to “ufos“? :)
Me neither, but still: ufos, alien, aliens, extraterrestrials, sedition, strange, ghosts, foreigners, predator, crop, extra, ufo, robots, colors, strangers, spooks, monsters, animals, weird, dead, …. Ok, that wasn’t really tough.

How about synonymous words for “TechCrunch“? Easy! Techcrunch, mashable, techmeme, slashdot, digg, engadget, correct, cnet, com, readwriteweb, gigaom, perezhilton.

Magic? No way, you can do it too now. New feature added to TheRarestWords - just go to the word page, as in http://therarestwords.com/word/apple and see for yourself.

If you want to know the technical details of how it’s done - take 3-grams (like the ones from the Wikipedia n-grams I released) and search for patterns like “red and blue“, “techcrunch or gigaom“, etc. Words that get joined by “and” or “or” are usually closely related. See? Easy!

No, I didn’t use Wikipedia’s n-grams. I used a much broader sample of pages from all around the Internet, with phrase detection and some more smart stuff, but even if you parse Wikipedia’s n-grams you’d still get pretty good results (although no synonyms for “TechCrunch” and other rare words).
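The 3-gram trick can be sketched in a few lines of Python. This is only an illustration, not the code behind TheRarestWords: the `related_words` function, the `(word1, word2, word3, count)` input layout and the toy `SAMPLE` data are all assumptions made up for the example.

```python
from collections import Counter

# Connector words that tend to join closely related terms.
CONNECTORS = {"and", "or"}


def related_words(trigrams, query, top=10):
    """Rank words that appear next to `query` in "X and Y" / "X or Y" 3-grams.

    `trigrams` is an iterable of (word1, word2, word3, count) tuples.
    This layout is an assumption for the sketch, not a real file format.
    """
    query = query.lower()
    scores = Counter()
    for w1, mid, w3, count in trigrams:
        w1, mid, w3 = w1.lower(), mid.lower(), w3.lower()
        if mid not in CONNECTORS:
            continue  # only "X and Y" / "X or Y" patterns are interesting
        if w1 == query and w3 != query:
            scores[w3] += count
        elif w3 == query and w1 != query:
            scores[w1] += count
    return [word for word, _ in scores.most_common(top)]


# Invented toy data, just to show the shape of the input.
SAMPLE = [
    ("red", "and", "blue", 120),
    ("blue", "or", "green", 40),
    ("red", "and", "white", 75),
    ("white", "and", "red", 30),
    ("red", "is", "hot", 500),  # ignored: "is" is not a connector
    ("techcrunch", "or", "gigaom", 12),
]

print(related_words(SAMPLE, "red"))
```

Counting both positions (“red and X” as well as “X and red”) is a design choice of the sketch; a real corpus would also need filtering of junk neighbours such as “com” or “correct”, which show up in the lists above.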

Posted in site | Comments Off

28th Jul 2008

All hail the Cuil, SearchMe, Technorati! New age Internet is ripoff-based and we need to evolve because of this.

Short version: If you are a user - hail Cuil! If you are a developer/designer/any kind of creative person - possibly fear Cuil!

As you might already know - there’s a new sheriff in town. Well, not quite a sheriff, but rather a bunch of ex-Google guys (or so they say) who have built a new (not quite new) search engine - Cuil (unavailable at the moment of writing, I guess because of the load).

Actually I like this engine, mostly because today it matches Google in traffic numbers - i.e. the number of people who came to TheRarestWords from Google is at the moment EQUAL to the number who came from Cuil. And if TheRarestWords were making money, today I would be enjoying double profits :) I guess this is only temporary, since today everybody is talking about them. Tomorrow we’re going to see much less traffic from them than today.

But with this great opportunity - there’s also a big evil in Cuil.

What worries me is the amount of text they show on the search page. It’s becoming more and more of a nuisance that search companies think it’s okay to massively copy parts of your site and display them. Look at searchme.com, particularly at this page about one of the greatest people in history. Do you even need to visit those pages? No, because you can read it all right in SearchMe. But that means no more advertising profits for the sites they display, and lost profits mean site owners will be less encouraged to create more content for their sites, because now they’re creating content for SearchMe (I’m deliberately avoiding linking to that site). And isn’t this site one big, obvious, web-scale copyright infringement?

Okay, but they seem to have the law on their side, since nobody has sued them yet. And my sites don’t generate any kind of measurable profit, so even if I lose something due to SearchMe - it’s going to be less than a cent per month, I guess. But some of you lose profits.

Okay, back to Cuil. The example SearchMe is setting for next-gen search engines is really bad. What if every engine started copying all other sites’ content? And Cuil is showing much more contiguous text from the web page; I think in many cases it wouldn’t even be necessary to visit the page to get the info. And that’s a problem.

Some say that Internet advertising is coming to an end because its prices are artificially inflated. Google and others set a MINIMUM price per keyword, and sometimes you could buy a word from Google and sell it to another search engine for an even bigger price, which means it’s even more inflated. (This was called AdWords Arbitrage; it’s not really possible any longer, because Google RAISED their minimum prices for many keywords even more a year or so back.) Some say copyright laws are going to change because the Internet is a very big copy machine: you can’t really protect the copyright of anything that CAN be copied (web pages being the example) anymore, but only scarce things (I’m not sure that’s the right word), like reputation, integrity, etc. The point is that the last argument is pretty much a “doomsday proclamation” for creators. But I don’t believe in doomsdays. There have been too many flopped ones in the past to be afraid of those predictions.

The problem is that the last argument is really becoming more and more of a reality. It’s no longer just about Torrents being a problem for the Music Industry or the Film Industry. We are now witnessing the beginning of an era where more and more of our work is copied every day. And it seems to be legal (nobody has closed SearchMe yet, which does that massively; I can only block their bot, which I do). The problem is that this kind of engine (competing over who shows the bigger snippet of my text) is becoming a reality RIGHT NOW. It’s a kind of Torrents for websites.

See Technorati.com - another example of full post copying. Legal? Take this page for example. It’s nearly a full copy of my post. And Technorati enjoys a much bigger PageRank and whatever-else-rank there is. That means they COULD get my post indexed FASTER than me. Think they have noindex on those pages, so that it’s not massive copyright infringement, but rather a service for users? Think again! 1 000 000 INDEXED BY GOOGLE infringed blogs in the .com domain only. If you blog - you are probably there too. I think WordPress even pings a service that tips off Technorati that there’s new content on your blog.

Do you think that doesn’t affect you? Think again! Do you know how many searches from Google/Yahoo for your text land on Technorati instead, because they had your text indexed BEFORE Google indexed the post on your site? Does Google really know that you are the originator of this “duplicate content”, or will it possibly think Technorati is, since the text was there first (at least for Google, which indexes millions of Technorati pages a day vs. your mom-and-dad blog being indexed once a day or even once a week)?

The problem is that if that’s my new reality, I need to evolve away from thinking of my material as copyrighted and assuming that nobody would copy it without at least facing a moral dilemma (I can’t sue US people since I’m in Russia).

So let me be a doomsday crier too. The only way to evolve now is to think of our content as unprotected and somehow use something like the Creative Commons model for our own good. I.e. assume that your content will be copied and used somewhere. But how could you earn at least something as a reward for all the trouble you went through to create it, if we assume it’s going to be copied?

That’s the question each of us needs to think through.

Well, some music groups have found business models that work. But I doubt any of you are going to buy a $30 copy of my article in a beautiful box if you could read it for free on the Internet :)
The problem is that blogs/sites rarely have real fans who want to support them with money. I’ve read about one experiment in software where a guy tried everything he could to make people “buy him a beer” in exchange for his freeware. He had 50 000 downloads and barely broke the $50 mark. 50 000! Dammit, that’s the population of the city I live in, and all of them paid just $50!? That’s not the way to go.

We need to think ahead, people. We need to think.

Posted in site | Comments Off

26th Jul 2008

Suggestan released

So the project “Suggestan” is released. As usual, I have no clear idea of what it is or the direction it is going in. Well, it’s kind of a “define a thing” project, where you can find or share knowledge about the subjects/hobbies/professions/ideas that you know, in the form of suggestive questions.

Well, go and see for yourself and we’ll see if that’s going somewhere besides Trash Bin :) Go Suggestan!

Posted in site | Comments Off

20th Jul 2008

Testing SQL engines/queries with Django (avg.query time)

I love Django for many reasons and here’s one of them. Measuring the average query time, which I did today to compare engines (MySQL vs PostgreSQL), is easy with Django.
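The full write-up sits behind the link below, so what follows is only one possible way to do it, not necessarily the one used in the post. The sketch assumes `DEBUG = True`, in which case Django records every executed query in `django.db.connection.queries` as a dict with `'sql'` and `'time'` keys (the time is a string, in seconds). The helper name `average_query_time` is made up for the example.

```python
def average_query_time(queries):
    """Average the 'time' field (seconds, stored as strings) over query records.

    `queries` is expected to look like django.db.connection.queries:
    a list of dicts such as {'sql': 'SELECT ...', 'time': '0.002'}.
    """
    times = [float(q["time"]) for q in queries]
    if not times:
        return 0.0  # nothing was executed, avoid dividing by zero
    return sum(times) / len(times)


# In a Django shell with DEBUG = True this would be used roughly like so:
#   from django.db import connection, reset_queries
#   reset_queries()
#   ... run the ORM code being measured ...
#   print(average_query_time(connection.queries))
```

Because the helper only touches plain dicts, the same numbers can be collected once per database engine and compared directly.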

(more…)

Posted in python, site | No Comments »

20th Jul 2008

I don’t get it - real web application with PostgreSQL vs mySQL MyISAM vs mySQL InnoDB (with Django’s ORM, 2008)

UPDATE: This has been posted on Reddit. Read the comments. The main thing to understand is that these results are for the default settings of both databases, for my case and my priorities. Yours could (and maybe even should) be different.

Well, this year and last I’ve been hearing everywhere that PostgreSQL is the way to go and that using mySQL in 2008 makes people puke… but without any real arguments (besides “Postgres is the way to go”). Well, I don’t usually buy into fashion-style technology shopping (that’s when someone can’t prove that something is better than what I use), and this time won’t be an exception.

Ok, so scouring the Internet I’ve found some comparative tests, mostly in the form of “INSERT 10000 items WITH COMMIT AT THE END”. Okay, how many people have actually inserted 10000 items at once in a real web application (besides dumping/restoring/moving data)? Some people did, but they were both unavailable for comments :) Just kidding.

Ok, so since I’m with Django - moving to Postgres and testing my application (RarestNews) should be a snap, shouldn’t it? Just change the database string in settings.py and install PostgreSQL, right? Wrong! :) But everything in its time, step by step.

(more…)

Posted in site | No Comments »

20th Jul 2008

Django ORM + threading = memleak (workaround)

Well, after trying hundreds of ways to make Python’s garbage collector work with Django’s ORM and threading (see here - scroll to “Python 2.5 bug”) and using many tools (heapy, valgrind) to try to find the leaks (all the tools show 15-30MB used and no leaks, but in reality the program uses all available memory and starts to swap within minutes), I have to conclude that there doesn’t seem to be a workaround that keeps the threads.

I’ve tried:

  1. passing only integers instead of Django objects;
  2. adding + '' to strings to make copies of strings, not references (copy.copy and copy.deepcopy too);
  3. creating threads inside of threads, hoping that would lose references somehow;
  4. del Object; del everything;
  5. weakrefs;
  6. moving all Django code into a function and only passing integer to it;
  7. disabling Django and only leaving lxml (parsing library) in thread, and vice versa - still leaks;
  8. something else too, but can’t remember all.

The worst part is that nothing detects where those GIGABYTES are going, every tool I used shows 15-30MB memusage.

The workaround I’ve settled on: running separate child processes that connect to a parent “queuer” process via XML-RPC. It takes a lot of memory (each process is 17MB vs. some KB for a thread), and my guess is that XML isn’t the most efficient transport, but at least there are no memleaks even if a child runs an infinite loop.
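A minimal sketch of that parent/child layout, written with the Python 3 standard library (`xmlrpc.server` and `xmlrpc.client`; the original was Python 2.5, where the modules had different names). The `Queuer` class, its `next_job`/`report` methods and the "`-1` means the queue is empty" convention are assumptions for illustration, not the code actually used for RarestNews. Only integers cross the process boundary, in the spirit of the list above.

```python
import threading
from xmlrpc.client import ServerProxy
from xmlrpc.server import SimpleXMLRPCServer


class Queuer:
    """Parent-side job queue exposed to child processes over XML-RPC."""

    def __init__(self, jobs):
        self._jobs = list(jobs)
        self._lock = threading.Lock()
        self.results = {}

    def next_job(self):
        # Hand out one integer job id, or -1 when the queue is empty.
        with self._lock:
            return self._jobs.pop(0) if self._jobs else -1

    def report(self, job_id, result):
        # Store the child's result; XML-RPC needs a non-None return value.
        with self._lock:
            self.results[job_id] = result
        return True


def start_queuer(queuer, host="127.0.0.1", port=0):
    """Serve `queuer` in a background thread; port 0 picks a free port."""
    server = SimpleXMLRPCServer((host, port), logRequests=False)
    server.register_instance(queuer)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def worker_loop(url, handle):
    """Child side: pull job ids until the queue is empty, report each result."""
    parent = ServerProxy(url)
    done = 0
    while True:
        job_id = parent.next_job()
        if job_id == -1:
            return done
        parent.report(job_id, handle(job_id))
        done += 1
```

In real use each child would be its own OS process (for example a small script started with `subprocess.Popen` that calls `worker_loop` with the parent’s URL), so whatever memory it leaks goes back to the system when it exits.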

If you have other ideas to try - let me know.

Posted in site | No Comments »

19th Jul 2008

RarestNews, scalability, 100000 news per day, databases, de-normalization and bugs in Python language

UPDATE: The response to this post by one of the developers of CouchDB.

The RarestNews project is currently being re-tooled, for several reasons. First of all, there’s a design flaw: accumulating 100 000 news articles every day in a single MySQL database is a bad idea. The database started to crawl (i.e. became really slow) on the 10th day, and on the 20th day it nearly came to a stop. Actually, the site stopped being responsive at all (even at the start it wasn’t really responsive, but that was all due to normalization, which I was taught is good, but it’s not… in some cases). So here is a technical description of the scalability problems I’m facing.

The only thing that wasn’t giving me problems is the new Amazon EC2 High-CPU instances. That’s a terrific thing. Everything else is crap. :) Ok, not everything, but when misused the way I was misusing it, it probably is.

MySQL problems

So, to be technical: I’ve used MyISAM tables. I never really liked InnoDB because of its slow writes, and at 100k new articles a day with lots of meta-data to write about them (tags, dates, snippets, word frequencies, etc.) MyISAM seemed like a good decision. The bad part was that on write MyISAM locks the whole table. So 50 bots scouring the Web for news, each writing and locking the whole table, made the site almost unresponsive.

I’m not yet sure how to solve it - with InnoDB, with PostgreSQL or with some kind of new-age database like CouchDB, StrokeDB, maybe Amazon’s SimpleDB, etc…

(more…)

Posted in site | No Comments »

27th Jun 2008

TheRarestParser has been upgraded to 0.4b

I’ve upgraded the bot for TheRarestWords (about TheRarestWords) to 0.4b today. The new version has these improvements:

  • Umlauts are now recognized as letters, and actually…. all national letters are recognized, except for Japanese, Chinese, etc. - the words there are actually phrases, and since they don’t use spaces to separate words, I’ve no idea how to split them into words. (Ideas, anyone?)
  • External domain redirects are recognized and ignored (these are usually either misspellings or a SPAM-like technique)
  • Internal domain redirects are recognized (META redirects too)
  • Multiple pages instead of just one (if your main page has fewer than 100 words, the bot goes further, up to 10 pages deep, to find some)
  • Frames are now recognized too
  • Improved HTML support (more tolerant of errors)
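As an illustration of the first point, here is one way to treat umlauts and other national letters as word characters with Python 3’s `re` module. It is a sketch, not the parser’s actual code; `WORD_RE` and `split_words` are names invented for the example. The pattern `[^\W\d_]` means “a word character that is neither a digit nor an underscore”, which for Unicode strings is roughly “any letter”.

```python
import re

# Runs of letters in any script; digits, underscores and punctuation split words.
WORD_RE = re.compile(r"[^\W\d_]+")


def split_words(text):
    """Return the lowercased words of `text`, keeping national letters intact."""
    return [word.lower() for word in WORD_RE.findall(text)]


print(split_words("Grüße über Straße, naïve café 42!"))
print(split_words("Привет, мир"))
```

A run of Chinese or Japanese characters without spaces still comes out as one long “word” here, which is exactly the open problem mentioned in the list.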

Also, the new bot stores a datetime component for words, so trends can now be built after a few walks around the web (one walk takes about 55 days :) since this project still can’t make any money to cover the expenses and is still on a single server).

Posted in site | No Comments »