RarestBlog

Redshift: PostgreSQL-like in the Cloud (Benchmark)

Amazon has opened it’s RedShift today.

image

Naturally I was curious about it’s performance. Basically it’s database with PostgreSQL-compatible protocol in the cloud

image

The full test results are under the cut, just a gist of it:

Insert 1M entries: local 8 seconds, RedShift 49 seconds.
Select by 2 rows (without indices): local 129ms seconds, RedShift ~54ms
Select by 1 row (without indices): local 44ms, RedShift ~448ms

(SELECT 1 was 100ms, so that's the round-trip time, I subtracted it 
from RedShift's values)

The test is VERY SYNTHETIC. Don’t take it as a comprehensive comparison.

Of course this does not prove anything. But it seems that 1M rows is not the use case for RedShift. I did not try JOINs, maybe it’s better with JOINs.

Afterall, they do say it’s for Business Intelligence systems. They also say “petabyte-scale”. So let’s try to scale up…

After some more testing it seems like yes > 5M is where you start to feel the difference:

Items count: 6000000
select by shop_code and article  0.766868
select by shop_code  0.052745
test: redshift
select by shop_code and article  0.277642 (round-trip subtracted)
select by shop_code  0.179468  (round-trip subtracted)

And it’s quite obvious that the complex queries are running faster on RedShift.

Though it continues to grow also:

Items count: 10000000
test:local
select by shop_code and article  1.449393
select by shop_code  0.046846
test: redshift
select by shop_code and article  0.653305
select by shop_code  0.349823

Let’s try a weird self-join, maybe that’ll help:

SELECT * FROM items i1 INNER JOIN items i2 ON i1.shop_code=i2.shop_code LIMIT 10
self-join local  0.001838
self-join remote  2.838698

What a disappointment!

And finally I’ve got something:

SELECT * FROM items i1 
    INNER JOIN items i2 ON i1.shop_code=i2.shop_code 
    WHERE i1.article='578' LIMIT 10000
self-join local   1.394531
self-join remote  3.997136

Let’s try even more stress:

SELECT * FROM items i1 
    INNER JOIN items i2 ON i1.shop_code=i2.shop_code 
    INNER JOIN items i3 ON i2.article=i3.article 
    WHERE i1.article='921' LIMIT 10000
self-join local  1.432942
self-join remote^C^Z

Ok, I waited for a couple minutes and got bored. RedShift did not plan to respond to my non-sense.

SELECT count(*) FROM (SELECT * FROM items LIMIT 100) i1 
  INNER JOIN (SELECT * FROM items LIMIT 100) i2 ON i1.shop_code=i2.shop_code 
  WHERE i2.article='478'

self-join local   0.002615
self-join local   0.000733
self-join local   0.000586
self-join local   0.000573
self-join local   0.000542
self-join remote  25.864774
self-join remote   4.878707
self-join remote   5.463815
self-join remote   1.992176
self-join remote   1.796341

Then it hovered around 1.9 for a while. What if we kick it up a bit?

Getting weird:

SELECT count(*) FROM (SELECT * FROM items LIMIT 10000) i1 
  INNER JOIN (SELECT * FROM items LIMIT 10000) i2 ON i1.shop_code=i2.shop_code 
  WHERE i2.article='880'
self-join local   0.008671
self-join remote   1.448579

(I always do 5 tests and if the numbers are more or less the same - I keep only it)

Finally RedShift is doing something better:

SELECT count(*) FROM items i1 
  INNER JOIN (SELECT * FROM items LIMIT 100000) i2 ON i1.shop_code=i2.shop_code 
  WHERE i2.article='208'
self-join local 15.323511
self-join remote  8.412499

count(*) = 67 Mil

And finally:

Items count: 10000000

SELECT count(*) FROM items i1 
  INNER JOIN (SELECT * FROM items LIMIT 300000) i2 ON i1.shop_code=i2.shop_code 
  WHERE i2.article='575'

self-join local 38.425009
self-join remote 23.762500
self-join remote 17.709055
self-join remote 14.058966
self-join remote 15.046434
self-join remote 15.029501

The count(*) is 250+Mil

So the plot looks like this:

image

EDIT: vertical = seconds to run a query, horizontal = number of millions of rows in count(*), which is proportional to number of rows analyzed, which should be about 1000 times bigger than count(*), since there were 1000 articles in test.

As a final accord I tried to scale the cluster to 4 smaller machines (which is x-large).

image

After scaling the same query went from 15.029501 to 8.644006, but only once. Then the cluster stopped responding:

$ ruby test.redshift.rb 
pg.rb:38:in `initialize': could not connect to server: Connection timed out (PG::Error)
Is the server running on host "test1.****.us-east-1.redshift.amazonaws.com" (xxx.xxx.xxx.xxx) and accepting

In about 5 minutes it returned back to normal and in fact scaled linearly (I started 4 node cluster):

SELECT count(*) FROM items i1 
  INNER JOIN (SELECT * FROM items LIMIT 300000) i2 ON i1.shop_code=i2.shop_code 
  WHERE i2.article='73'

self-join local 38.425009
self-join remote  3.040255
self-join remote  2.763937
self-join remote  3.071994
self-join remote  2.962063
self-join remote  2.567816

It was fun, but was time to say good bye!

image

There is also an interesting feature - snapshots. So basically, you can snapshot your data, destroy the cluster. I assume that then you don’t pay anything… But I might be wrong here. And then you can restore it.

image

The prices

0.850*24*31                = 632$ / smaller instance / month (monthly)
(2500+0.215*24*365)/12     = 365$ / smaller instance / month (prepay 1 year)
(3000+0.114*24*365*3)/12/3 = 166$ / smaller instance / month (prepay 3 years)

EDIT: I miscalculated the “3 year” price.

Nice things

Yes, you can connect with standard psql

image

You can also do many regularly tedious tasks with a push of a button

image

This is also nice:

image

Resume

For me the result is that mostly RedShift is on par with local PostgreSQL, sometimes even winning for <5M rows. With better PostgreSQL tuning you can probably stretch it, but not for as much as RedShift can do for REALLY big data.

My understanding is that you can start thinking of RedShift when you have significantly more than 10M rows to analyze.

Also the big deal was that RedShift scaled linearly. I added more nodes - I’ve got proportionally faster. Using SQL.

The local setup

  • PostgreSQL 9.2
  • Mint 13
  • default conf
  • in VirtualBox in iMac i5 12GB
  • (read: home computer, no tuning)

Source

require 'active_record'
require 'benchmark'

require 'pg'

def test_connection(conn, name)

  puts "test: #{name}"
  begin
    conn.exec("DROP TABLE items")
  rescue 
  end

  conn.exec("CREATE TABLE items (shop_code VARCHAR, article VARCHAR, price DECIMAL)")

  Benchmark.bm do |b|
    [10000, 100000, 1000000].each do |n|
      string = "BEGIN;"
      string += "INSERT INTO items (shop_code, article, price) VALUES ";
      values = []
      n.times {
        values << "('shop#{Random.rand(15)}', '#{Random.rand(1000)}', #{Random.rand(30*1000+995)})"
      }
      string += values.join(','); 
      string += "; commit;"

      b.report "insert #{n} in one call" do
        conn.exec(string)
      end
    end
  end

  Benchmark.bm do |b|
    5.times do
      b.report "select by shop_code and article" do
        conn.exec("SELECT * FROM items WHERE shop_code='shop#{Random.rand(15)}' AND article='#{Random.rand(1000)}'") do |result|
          result.values
        end
      end
    end

    5.times do
      b.report "select by shop_code" do
        conn.exec("SELECT * FROM items WHERE shop_code='shop#{Random.rand(15)}' limit 10000") do |result|
          result.values
        end
      end
    end
  end

end


conn_remote = PG.connect( 
  dbname: 'test1', 
  host: '*****.us-east-1.redshift.amazonaws.com',
  user: "redshift",
  password: '*****',
  port: 5439,
)


conn_local = PG.connect( 
  host: '127.0.0.1',
  dbname: 'redshift',
  user: 'redshift',
  password: '*****'
)

test_connection(conn_local, 'local')
test_connection(conn_remote, 'redshift')