Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature discussion: Explicit Order #114

Closed
den1k opened this issue May 9, 2022 · 6 comments
Closed

Feature discussion: Explicit Order #114

den1k opened this issue May 9, 2022 · 6 comments

Comments

@den1k
Copy link
Contributor

den1k commented May 9, 2022

@tonsky recently published a post of ideas for a DataScript 2.0. One of the topics is lack of ordering in DataScript which equally affects datalevin. One of the suggestions is to create an order index.

I imagine this would look like EAVO - Entity Attribute Value Order such that order can be set on the relation-level (EAV) such that ref values can reappear in other relations with different order. @ccorcos suggested fractional indexing to set and update order.

@huahaiy while I'm not clear yet what the API might like and what the tradeoffs are I'm curious to know is this something you'd be open to having in datalevin?

@huahaiy
Copy link
Contributor

huahaiy commented May 9, 2022

I do not understand what you mean by "ordering". I will need to see examples. You mean "order-by" in SQL? What's wrong with ordering results afterwards? It's simple enough to add that in the query language of Datalevin. It's not a problem, just adding a :order-by clause. In the end of the day, users care only about the order in the results, isn't it?

But putting order in index is a bad idea, and no database does that. The reason is simple, there are unbounded number of orderings and database systems often do sorting as needed to speed up query, e.g. to allow merge joins. There's no point to store a specific order in the index. I am trying hard to minimize storage. Whatever storage budget I want to spend, I want to spend on speeding up the query.

As to his blog post of Datascript 2.0, most of the concerns are either already being addressed in Datalevin, or plainly wrong from a database point of view. For example, the idea of removing query is just ridiculous. The only reason for many people to be interested in Datomic-like stores is the query language. It is at least the case for me. Without the query language, I wouldn't consider using such a thing. With a fast query engine, most of his concerns can be resolved. For example, if the query is fast, there's no problem adding an additional step to sort the results in whatever order users like. This whole business of pull API is also due to the slow query engine. If it is fast, there's no need for a pull API.

Yes, it is hard to build fast query engines, but it is doable. I am making great progress here. Just finished adding two types of indices to Datalevin. The query is going to be much faster. For example, all the shown queries in the current benchmark requires no joins at all (except q5, which is too slow so we normally comment it out), as they are on the same entity type, so they should be equally fast. My goal is to be competitive to RDBMS. It is doable, just do what RDBMS does, i.e. we build virtual tables behind the scene during transaction.

In any case, I have very different goals from tonsky. He's a UI guy and he built Datascript to experiment with building UI in a new way.

I want to build a database in the traditional sense. I want the database to be flexible and easy to use, so that people don't have to be a relational algebra experts to use a database, i.e. to fully realize the dream of database researchers of having completely fool proof database systems. My ultimate ambition, is to realize the dream of early AI researchers, to have a powerful knowledge graph and reasoning platform, and use that as the basis of new generation of AI (no, deep learning is not going to solve AI, no matter what the hype is saying).

I took Datascript as a starting point, because it is there. The most value for me is the test cases. Without that, I wouldn't have attempted to start the project. I hope this much is clear.

@den1k
Copy link
Contributor Author

den1k commented May 9, 2022

@huahaiy agreeing on your points that minimizing storage and fast queries are crucial. To clarify, what I mean by ordering is an explicit order, e.g. in the EAV sense: Todo List (E) Items (A) todo-items (V) – todo items require an order. Currently order is often added as an attribute on the todo-item itself. This has at least two problems:

  1. since the DB is a graph the same todo-item could have a different order value in another Todo List, e.g. it should show up at the top in one and at the bottom in the other. this means order values need to be tracked for each possible parent
  2. offsets / pagination can become expensive queries since in the current system the whole query needs to be re-run, sorted each time. Additionally, since this requires code, it is not expressible through the query syntax. e.g. of a given todo list, give me the first 10 todos requires query + code.

While the above are UI examples I can imagine explicit ordering to be useful in AI use cases as well. The main question is how does explicit ordering of nodes work in a graph where those nodes can reappear under different parent nodes? Currently, I believe, there's no straightforward solution to this in Clojure's flavor of datalog databases.

@den1k den1k changed the title Feature discussion: Ordering Feature discussion: Explicit Ordering May 9, 2022
@den1k den1k changed the title Feature discussion: Explicit Ordering Feature discussion: Explicit Order May 9, 2022
@huahaiy
Copy link
Contributor

huahaiy commented May 9, 2022

These ordering problems will not be solved by adding order in the index. They can be solved by allowing users to supply an order-by function, where they can do whatever they please.

Pagination is a different problem. The query engine can be enhanced to hold some state to handle that, i.e. to have a transducer friendly API. We are addressing that in KV API (waiting for the PR from @ieugen), the same can be done with query API as well. Just modify the syntax a bit or use a different q, which Datomic does provide, so can we.

@den1k
Copy link
Contributor Author

den1k commented May 9, 2022

Thanks for clarifying!

@zeitstein
Copy link

zeitstein commented May 9, 2022

Hope you don't mind a "UI guy" jumping in here.

I've been using :xform to solve ordering with pull. It would be nicer if one could work with vectors directly, but... Asami has support for arrays. My experience is that it can be challenging to work with. A combination of tx functions and order-by should be good enough.

this means order values need to be tracked for each possible parent

I've been solving this by wrapping the 'todo' in another entity for each parent. Makes sense for my use cases, since order is not the only attribute that is location-dependent.

Wholeheartedly agree with @huahaiy about Tonsky's query comments. However, I will say that pull is very useful for UI and I think it is fairly true that for many UIs you might not need queries at all. I think pull makes a lot of sense for the "give me my data as a denormalized tree" use cases; it is simpler than using queries. Though, since many CLJS UI stores keep data normalised, I find myself planning to write a pull which produces normalised data.

One thing I'd like is ref support in heterogenous tuples. Any thoughts on that?

Finally, @huahaiy, following your work with great interest and excitement. Thanks!

@huahaiy
Copy link
Contributor

huahaiy commented May 10, 2022

We will need to support tuple first, then we can think about handle ref in them. Thanks for the suggestion.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants