Ideas for improving subquery performance #15

Open
rgranadino opened this issue Nov 20, 2014 · 4 comments

Comments

@rgranadino
Contributor

Just thought I'd open a discussion on ideas to help improve (sub)query performance. The clever mechanism by which this module works is to have a DB collection select from the report SQL as a subquery:

$collection = Mage::getModel('cleansql/reportCollection', $connection);
$collection->getSelect()->from(new Zend_Db_Expr('(' . $this->getData('sql_query') . ')'));

The problem arises when dealing with large result sets. If you're querying the sales_flat_order table for data, the subquery will execute and fetch all the rows, and only then will the pagination for the first 20 rows in the grid apply.

mysql> EXPLAIN EXTENDED
    -> SELECT `t`.* FROM (select * from sales_flat_order) AS `t` LIMIT 20;
+----+-------------+------------------+------+---------------+------+---------+------+-------+----------+-------+
| id | select_type | table            | type | possible_keys | key  | key_len | ref  | rows  | filtered | Extra |
+----+-------------+------------------+------+---------------+------+---------+------+-------+----------+-------+
|  1 | PRIMARY     | <derived2>       | ALL  | NULL          | NULL | NULL    | NULL | 66351 |   100.00 |       |
|  2 | DERIVED     | sales_flat_order | ALL  | NULL          | NULL | NULL    | NULL | 67987 |   100.00 |       |
+----+-------------+------------------+------+---------------+------+---------+------+-------+----------+-------+
2 rows in set, 1 warning (6.07 sec)

Compared to just running the subquery with the LIMIT:

mysql> EXPLAIN EXTENDED
    -> SELECT `t`.* FROM sales_flat_order AS `t` LIMIT 20;
+----+-------------+-------+------+---------------+------+---------+------+-------+----------+-------+
| id | select_type | table | type | possible_keys | key  | key_len | ref  | rows  | filtered | Extra |
+----+-------------+-------+------+---------------+------+---------+------+-------+----------+-------+
|  1 | SIMPLE      | t     | ALL  | NULL          | NULL | NULL    | NULL | 67987 |   100.00 |       |
+----+-------------+-------+------+---------------+------+---------+------+-------+----------+-------+
1 row in set, 1 warning (0.00 sec)

The reason for using a subquery is that it's not (easily) possible to take in arbitrary SQL someone types, add it to the select object, and have pagination (and filtering) from the grid work. Short of writing our own query parser, one hack would be to include placeholders in the query:

SELECT * FROM sales_flat_order WHERE ${WHERE} ${LIMIT}

These would be optional and only replaced if found within the report query. One problem I can see with this approach is that if the existing query already has a WHERE clause, it would need to be written as WHERE column='some value' AND (${WHERE}). And if the grid is filterable but rendered without any filters applied, we would have to cheese it and replace the placeholder with a "dummy" filter, like so: WHERE column='some value' AND (1=1), in order to not break the query.
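A minimal sketch of that substitution (the helper name and the 1=1 fallback are assumptions, not part of the module):

```php
<?php
// Hypothetical helper: splice grid-supplied WHERE/LIMIT fragments into a
// report query that contains optional ${WHERE} and ${LIMIT} placeholders.
function applyPlaceholders($sql, $whereClause, $limitClause)
{
    // Fall back to a no-op filter so "WHERE ... AND (${WHERE})" stays valid
    // when the grid renders without any filters applied.
    $where = $whereClause !== '' ? $whereClause : '1=1';
    $sql = str_replace('${WHERE}', $where, $sql);
    $sql = str_replace('${LIMIT}', $limitClause, $sql);
    return $sql;
}

echo applyPlaceholders(
    'SELECT * FROM sales_flat_order WHERE ${WHERE} ${LIMIT}',
    "status='complete'",
    'LIMIT 20'
);
// prints: SELECT * FROM sales_flat_order WHERE status='complete' LIMIT 20
```

With this, the grid's pagination LIMIT lands inside the query itself rather than on the outer derived table.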

Here's what the above query would look like when analyzed:

mysql> EXPLAIN EXTENDED
    -> SELECT `t`.* FROM (select * from sales_flat_order limit 20) AS `t` LIMIT 20;
+----+-------------+------------------+------+---------------+------+---------+------+-------+----------+-------+
| id | select_type | table            | type | possible_keys | key  | key_len | ref  | rows  | filtered | Extra |
+----+-------------+------------------+------+---------------+------+---------+------+-------+----------+-------+
|  1 | PRIMARY     | <derived2>       | ALL  | NULL          | NULL | NULL    | NULL |    20 |   100.00 |       |
|  2 | DERIVED     | sales_flat_order | ALL  | NULL          | NULL | NULL    | NULL | 67987 |   100.00 |       |
+----+-------------+------------------+------+---------------+------+---------+------+-------+----------+-------+
2 rows in set, 1 warning (0.00 sec)

I haven't gone deep enough to try implementing this and see what other issues I'd run into. I realize this approach is dirty/hacky; hopefully someone can come up with a better one. Open to suggestions/feedback.

@kalenjordan
Owner

What if we just abstracted some kind of JSON format that maps to the select object methods?

So instead of SQL (or perhaps as an alternative to SQL) there would be a JSON format that basically maps to the Zend DB select methods and generates the right query.

That would probably be safer than allowing raw SQL.

The downside is that it's one step removed from SQL, so it's not as easy to write up quickly. But if it's only an alternative for better performance, it could be good.

@rgranadino
Contributor Author

That is definitely an option, but it would take away from the flexibility of using SQL, e.g. various types of joins, aggregate functions, GROUP BY, and (ironically) subqueries. Being able to leverage SQL is what makes this extension great. Having to re-implement that would be a lot of work and error-prone.

Another approach could be to offer a choice between putting in raw SQL to run in the subquery and displaying a single table (or view) in the grid/result set. The idea being that any complex query could be saved as a view, removing the need to run the complex query in a subquery; we could then just call:

public function getReportCollection()
{
    ...
    $collection->getSelect()->from($this->getTableName());

That would remove the need for cheesy placeholders and having to come up with our own fauxQL.
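For example, a complex report could be captured once as a view (illustrative view and column names; the aggregate mirrors the JSON example discussed below):

```sql
CREATE VIEW report_revenue_by_customer AS
SELECT customer_email, SUM(grand_total) AS sum_revenue
FROM sales_flat_order
GROUP BY customer_email;
```

The grid would then select from report_revenue_by_customer directly, so its own LIMIT and WHERE apply without a derived table, though note that MySQL may still materialize a view containing GROUP BY before applying the outer LIMIT.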

@kalenjordan
Owner

But I don't think we'd lose any flexibility. The Zend select object can generate basically any SQL as far as I understand, definitely everything you mentioned: joins, GROUP BYs, etc.

For example:

"select": {
    "from":  "sales_flat_order",
    "columns": {
        "sum_revenue": "SUM(grand_total)",
        "customer_email": "customer_email",
    },
    "group": "customer_email",
    "limit": 10
}

Would map to:

$select->from('sales_flat_order', array())
    ->columns(array(
        'sum_revenue' => 'SUM(grand_total)',
        'customer_email' => 'customer_email',
    ))
    ->group('customer_email')
    ->limit(10);

I haven't quite thought through the implementation details yet. I'm sure it wouldn't be entirely trivial to set up the mapping, but it seems like it would be relatively straightforward.
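The mapping could be a small walker over the decoded JSON that dispatches to the matching select method. A rough sketch (the function name is hypothetical, and $select is assumed to be a Zend_Db_Select obtained from the collection):

```php
<?php
// Hypothetical mapper: apply each recognized key of the decoded JSON spec
// to the corresponding Zend_Db_Select method. Unknown keys are ignored.
function applyJsonToSelect($select, array $spec)
{
    if (isset($spec['from'])) {
        $select->from($spec['from'], array());
    }
    if (isset($spec['columns'])) {
        $select->columns($spec['columns']);
    }
    if (isset($spec['group'])) {
        $select->group($spec['group']);
    }
    if (isset($spec['limit'])) {
        $select->limit((int) $spec['limit']);
    }
    return $select;
}
```

Because every clause goes through a known method, the grid's own pagination and filters can be layered on top without parsing raw SQL.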

@rgranadino
Contributor Author

I think that could work. It will probably be possible to just call $collection->getSelect()->setPart() for each of those.

An edge case: a custom query listing the top 100 customers with the highest order value in the last six months. There would be a LIMIT in there, but the grid would want to paginate that to the first 20 rows, thus nuking the desired result.

An interim solution I had was to do a very simple match for "SELECT * FROM (table)". If the query is just that, we could use a regular expression to get the table name from it. That would be a better approach than having to choose between raw SQL and a drop-down with a list of tables/views ("automagic").
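A sketch of that match (the regex and helper name are assumptions; anything more complex than a bare table select falls back to the subquery path):

```php
<?php
// Hypothetical check: if the saved report query is exactly
// "SELECT * FROM <table>", return the table name so we can select from it
// directly instead of wrapping it in a derived table.
function extractSimpleTable($sql)
{
    if (preg_match('/^\s*SELECT\s+\*\s+FROM\s+`?(\w+)`?\s*;?\s*$/i', $sql, $m)) {
        return $m[1];
    }
    return null; // anything more complex keeps the subquery path
}
```

So extractSimpleTable('SELECT * FROM sales_flat_order') would yield 'sales_flat_order', while a query with a join or WHERE clause yields null and is wrapped as before.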
