Ideas for improving subquery performance #15

Open
rgranadino opened this issue Nov 20, 2014 · 4 comments

Comments

@rgranadino
Contributor

Just thought I'd open a discussion on ideas to help improve (sub)query performance. The clever mechanism by which this module works is to have a DB collection select from the report SQL as a subquery:

$collection = Mage::getModel('cleansql/reportCollection', $connection);
$collection->getSelect()->from(new Zend_Db_Expr('(' . $this->getData('sql_query') . ')'));

The problem arises when dealing with large result sets. If you're querying the sales_flat_order table for data, the subquery will execute and fetch all the rows, and only then will the pagination for the first 20 rows in the grid apply.

mysql> EXPLAIN EXTENDED
    -> SELECT `t`.* FROM (select * from sales_flat_order) AS `t` LIMIT 20;
+----+-------------+------------------+------+---------------+------+---------+------+-------+----------+-------+
| id | select_type | table            | type | possible_keys | key  | key_len | ref  | rows  | filtered | Extra |
+----+-------------+------------------+------+---------------+------+---------+------+-------+----------+-------+
|  1 | PRIMARY     | <derived2>       | ALL  | NULL          | NULL | NULL    | NULL | 66351 |   100.00 |       |
|  2 | DERIVED     | sales_flat_order | ALL  | NULL          | NULL | NULL    | NULL | 67987 |   100.00 |       |
+----+-------------+------------------+------+---------------+------+---------+------+-------+----------+-------+
2 rows in set, 1 warning (6.07 sec)

Compared to just running the subquery with the LIMIT:

mysql> EXPLAIN EXTENDED
    -> SELECT `t`.* FROM sales_flat_order AS `t` LIMIT 20;
+----+-------------+-------+------+---------------+------+---------+------+-------+----------+-------+
| id | select_type | table | type | possible_keys | key  | key_len | ref  | rows  | filtered | Extra |
+----+-------------+-------+------+---------------+------+---------+------+-------+----------+-------+
|  1 | SIMPLE      | t     | ALL  | NULL          | NULL | NULL    | NULL | 67987 |   100.00 |       |
+----+-------------+-------+------+---------------+------+---------+------+-------+----------+-------+
1 row in set, 1 warning (0.00 sec)

The reason for using a subquery is that it's not (easily) possible to take in arbitrary SQL someone types, add it to the select object, and have pagination (and filtering) from the grid work. Short of writing our own query parser, one hack would be to include placeholders in the query:

SELECT * FROM sales_flat_order WHERE ${WHERE} ${LIMIT}

These would be optional and only replaced if found within the report query. One problem I can see with this approach is that if the existing query already has a WHERE clause, it would need to be written as WHERE column='some value' AND (${WHERE}). And if the grid is filterable but rendered without any filters applied, we would have to cheese it and replace the placeholder with a "dummy" filter, like so: WHERE column='some value' AND (1=1), in order to not break the query.
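A minimal sketch of that substitution (the helper name and the 1=1 fallback are assumptions, not part of the module):

```php
<?php
// Hypothetical helper: splice grid-supplied WHERE/LIMIT fragments into a
// report query that contains optional ${WHERE} and ${LIMIT} placeholders.
function applyPlaceholders($sql, $whereClause, $limitClause)
{
    // Fall back to a no-op filter so "WHERE ... AND (${WHERE})" stays valid
    // when the grid renders without any filters applied.
    $where = $whereClause !== '' ? $whereClause : '1=1';
    $sql = str_replace('${WHERE}', $where, $sql);
    $sql = str_replace('${LIMIT}', $limitClause, $sql);
    return $sql;
}

echo applyPlaceholders(
    'SELECT * FROM sales_flat_order WHERE ${WHERE} ${LIMIT}',
    "status='complete'",
    'LIMIT 20'
);
// prints: SELECT * FROM sales_flat_order WHERE status='complete' LIMIT 20
```

With this, the grid's pagination LIMIT lands inside the query itself rather than on the outer derived table.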

Here's what the above query would look like when analyzed:

mysql> EXPLAIN EXTENDED
    -> SELECT `t`.* FROM (select * from sales_flat_order limit 20) AS `t` LIMIT 20;
+----+-------------+------------------+------+---------------+------+---------+------+-------+----------+-------+
| id | select_type | table            | type | possible_keys | key  | key_len | ref  | rows  | filtered | Extra |
+----+-------------+------------------+------+---------------+------+---------+------+-------+----------+-------+
|  1 | PRIMARY     | <derived2>       | ALL  | NULL          | NULL | NULL    | NULL |    20 |   100.00 |       |
|  2 | DERIVED     | sales_flat_order | ALL  | NULL          | NULL | NULL    | NULL | 67987 |   100.00 |       |
+----+-------------+------------------+------+---------------+------+---------+------+-------+----------+-------+
2 rows in set, 1 warning (0.00 sec)

I haven't gone deep enough to try implementing this and see what other issues I'd run into. I realize this approach is dirty/hacky; hopefully someone can come up with a better one. Open to suggestions/feedback.

@kalenjordan
Owner

What if we just abstracted some kind of JSON format that maps to the select object methods?

So instead of SQL (or perhaps as an alternative to SQL) there would be a JSON format that basically maps to the Zend DB select methods and generates the right query.

That would probably be safer than allowing raw SQL.

The downside is that it's one step removed from SQL, so it's not as easy to write up quickly. But if it's only an alternative for better performance, it could be good.

@rgranadino
Contributor Author

That is definitely an option, but it would take away from the flexibility of using SQL, e.g. various types of joins, aggregate functions, GROUP BY, and (ironically) subqueries. Being able to leverage SQL is what makes this extension great. Having to re-implement that would be a lot of work and error-prone.

Another approach could be to offer a choice between putting in raw SQL to run in the subquery and displaying a single table (or view) in the grid/result set. The idea being that any complex query could be saved as a view, removing the need to run the complex query in a subquery; we could then just call:

public function getReportCollection()
{
    ...
    $collection->getSelect()->from($this->getTableName());

That would remove the need for cheesy placeholders and having to come up with our own fauxQL.
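For example, a complex report could be captured once as a view (illustrative view and column names; the aggregate mirrors the JSON example discussed below):

```sql
CREATE VIEW report_revenue_by_customer AS
SELECT customer_email, SUM(grand_total) AS sum_revenue
FROM sales_flat_order
GROUP BY customer_email;
```

The grid would then select from report_revenue_by_customer directly, so its own LIMIT and WHERE apply without a derived table, though note that MySQL may still materialize a view containing GROUP BY before applying the outer LIMIT.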

@kalenjordan
Owner

But I don't think we'd lose any flexibility. The Zend select object can generate basically any SQL as far as I understand, definitely everything you mentioned: joins, GROUP BYs, etc.

For example:

"select": {
    "from":  "sales_flat_order",
    "columns": {
        "sum_revenue": "SUM(grand_total)",
        "customer_email": "customer_email",
    },
    "group": "customer_email",
    "limit": 10
}

Would map to:

$select->from('sales_flat_order', array())
    ->columns(array(
        'sum_revenue' => 'SUM(grand_total)',
        'customer_email' => 'customer_email',
    ))
    ->group('customer_email')
    ->limit(10);

I haven't quite thought through the implementation details yet. I'm sure it wouldn't be entirely trivial to set up the mapping, but it seems like it would be relatively straightforward.
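The mapping could be a small walker over the decoded JSON that dispatches to the matching select method. A rough sketch (the function name is hypothetical, and $select is assumed to be a Zend_Db_Select obtained from the collection):

```php
<?php
// Hypothetical mapper: apply each recognized key of the decoded JSON spec
// to the corresponding Zend_Db_Select method. Unknown keys are ignored.
function applyJsonToSelect($select, array $spec)
{
    if (isset($spec['from'])) {
        $select->from($spec['from'], array());
    }
    if (isset($spec['columns'])) {
        $select->columns($spec['columns']);
    }
    if (isset($spec['group'])) {
        $select->group($spec['group']);
    }
    if (isset($spec['limit'])) {
        $select->limit((int) $spec['limit']);
    }
    return $select;
}
```

Because every clause goes through a known method, the grid's own pagination and filters can be layered on top without parsing raw SQL.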

@rgranadino
Contributor Author

I think that could work. It will probably be possible to just call $collection->getSelect()->setPart() for each of those.

An edge case: a custom query listing the top 100 customers with the highest order value in the last six months. There would be a LIMIT in there, but the grid would want to paginate that to the first 20 rows, thus nuking the desired result.

An interim solution I had was to do a very simple match for "SELECT * FROM (table)". If the query is just that, we could use a regular expression to get the table name from it. That would be a better approach than having to choose between raw SQL and a drop-down with a list of tables/views ("automagic").
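A sketch of that match (the regex and helper name are assumptions; anything more complex than a bare table select falls back to the subquery path):

```php
<?php
// Hypothetical check: if the saved report query is exactly
// "SELECT * FROM <table>", return the table name so we can select from it
// directly instead of wrapping it in a derived table.
function extractSimpleTable($sql)
{
    if (preg_match('/^\s*SELECT\s+\*\s+FROM\s+`?(\w+)`?\s*;?\s*$/i', $sql, $m)) {
        return $m[1];
    }
    return null; // anything more complex keeps the subquery path
}
```

So extractSimpleTable('SELECT * FROM sales_flat_order') would yield 'sales_flat_order', while a query with a join or WHERE clause yields null and is wrapped as before.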
