April 30th, 2013
|01:16 am - Importance of intra-query parallelism.|
Oracle Database has a feature which allows it to query millions of rows in parallel while executing a join which has a big fanout.
How important is it that a database server has a lot of intra-query concurrency? Does it still make a lot of sense to run an analytical query in parallel threads, on a single machine?
While at Percona Live, there was a lot of talk about the future of MySQL, and some even mentioned this as being part of the future.
The reason for intra-query parallelism has always been to fill up the pipeline to disk with lots of parallel queries. Indeed, this pipe is thick and long - and if used, it'd better produce a lot of data at once. Efficiency of CPU utilization is sacrificed to achieve efficiency of a rotating disk drive.
Yet in DaaS world this all fails to make sense to me. In a cloud, one execution unit is not one CPU, but one instance, and one database instance equals to a cluster of virtual machines. Map/Reduce was only the first sign of the change - it is stupid, indeed, but network is faster than disk, and if a query needs to inspect a million of rows, they'd better be on thousands of disks, not on a single one.
It's funny how MySQL technology is steadily pulled up-market. I haven't seen a single project use MySQL Stored Procedures, which were created for SAP R/3 integration, in applications they were created for. Perhaps, when parallel query in MySQL is ready, it also will be used for something completely different.
Meanwhile, I think the task of coming up with an efficient join algorithm to run across sharded data is more in line with the way hardware is going to look like in the future. Sharding is done best when not done at all. But so is concurrency.
I don't fully understand your argument against parallel query. If you use EBS it is quite likely the data are in fact spread across multiple underlying devices. Parallelism still makes sense to balance CPU and I/O much as it does in non-virtual environments.
Indeed. Let me try explaining the point differently, taking elastic block store as an example.
The way traditional databases implement parallel query is to run multiple threads each doing I/O and filtering data. So there is a lot of concurrency *in the database*, for block cache, for latches, etc. Whereas the point of map/reduce is to take your filter closer to your data. In a cloud environment, it means that there is a lot of concurrency *in the system*, i.e. there are a lot of instances of virtual machines, each instance being a sole owner of its own piece of memory and disk, running calculcations for a join.
The key is being able to take advantage of both. You have big servers and small servers in the cloud and in the data center. You want to be able to scale up on the big servers and scale out over small servers, or combine both.
The same concepts apply to both, really.
Shard-Query handles both:
On each node partitions can be used for parallelism. Multiple nodes(shards) can be used to store data. There are sharded and unsharded tables. Sharded tables can be freely joined to unsharded tables, as the unsharded tables are duplicated on all shards. There are many other complexities (like subqueries in the from clause) but these can be handled.
Shard-Query pushes down filtering, joining and aggregation to the shards. For example, if you have 100 shards and you run a select count(*) over a 500 billion row table, there will only be 100 rows returned to the "coordinator" node. All aggregation functions are supported but non-distributable aggregate functions like stddev() and group_concat cause push-down aggregation to be disabled (those 500 billion rows would be sent, which is not practical). Avg is pushed down as count and sum, etc.
Let me know if you want more info. It works with all version of MySQL, including MariaDB. With the CONNECT storage engine, cross-shard queries (like self joins of a sharded table) are possible since Shard-Query supports a MySQL proxy and the CONNECT engine could connect to the proxy.
Edited at 2013-04-30 04:47 pm (UTC)
You do indeed benefit from parallel query when joining sharded tables with local tables. Thank you for your comment, it puts things in perspective.
Edited at 2013-05-03 05:32 pm (UTC)