[TASK][CHALLENGE] Support Spark Connect Frontend/Backend #5383
Comments
I think this is very challenging, but I want to give it a try. Could you assign it to me and help me with it? |
Sure, thank you @yehere! This is a kind of umbrella issue; we can create sub-issues one by one later. |
This huge task could be divided into tasks of several different levels. Feel free to go ahead ~ all your contributions will be counted eventually :) |
cc @cfmcgrady |
I'm also interested in it, hope to work together. |
Thank you @zhaomin1423, glad to see you are interested. |
How about a co-located mode with Kyuubi's Spark SQL engine? A separate service is good and basic, but it also needs more resources for more Spark instances. |
@minyk they run in different processes, just like the Spark Thrift server and the Connect server. We are going to add a new module and a new server for Kyuubi Connect. We can do it together if you are interested. |
@ulysses-you Hello, I'm interested in this component and hope to work with you. |
@yaooqinn @pan3793 @ulysses-you |
I haven't had a deep look at it; my current thought is,
|
@davidyuan1223 sure, please go ahead. +1 for @pan3793's thought. |
@ulysses-you @pan3793 |
Just to clarify here, the intention is to support the Spark Connect client as another connection type to the engine, so you could still use JDBC or a notebook (via REST) against the same Spark engine and have all those clients connect to the same application? |
Yes, my initial assumption is to create a Spark 3.4-based SparkSession from a remote connection string configuration item and then merge it with the Thrift service to provide the corresponding engine. This configuration must therefore force a check that the Spark version is at least 3.4. The spark-connect-client module already provides a SparkSession implementation, which reduces our development work. What do you think? |
@tgravescs that's a good question, and we did have an offline discussion about it. TL;DR, your assumption will be the ultimate version, but not at the beginning. As you know the current main flow of Kyuubi is:
The engine itself is kind of a regular Spark app that basically only consumes Spark's public API, making it easily compatible with multiple Spark versions. As connect is a new feature and Another important case is Once the PoC is completed, we can consider merging servers and engines to achieve the final vision as you said.
Maybe @yaooqinn can share more information |
@pan3793 @yaooqinn @ulysses-you @tgravescs
As mentioned above, I believe that in the RPC request process of kyuubi based on SparkConnect, we no longer need the involvement of SparkSession, so I have designed the following process:
Based on the RPC client, we don't need to create a SparkSession. What do you think? |
@pan3793 @yaooqinn Hi! Just to clarify - do I understand correctly, that for the first iteration, we need to somehow allow gRPC-based engines to coexist with Thrift ones (all current engines) in order to add the Or it is expected to start directly from rewriting the current internal RPC mechanism from Thrift (HS2) to gRPC and changing the internal API (
|
@tigrulya-exe Exactly! I'm doing some experiments in this way, and it does involve lots of refactoring work to support both Thrift and gRPC and reuse code as much as possible. I can not promise an ETA since I'm not sure how much time I can spend on this task in the next few months. But I will open a draft PR once I make the pipeline work (for example, successfully executing |
@pan3793 great! I would like to participate in the development process, if it's possible :) Do you already have a list of kyuubi parts/classes/modules to refactor, so we can break this big task down into smaller parts to be able to work simultaneously? Btw, I also noticed that there is #6412 PR, related to this issue. @davidyuan1223 Hi! Is it still active? |
@tigrulya-exe I will share with you more details in the next one or two weeks. |
Yeah, it's active; you can see PR #6412. We first need to verify the feasibility of this solution, but the latest spark-connect version, 3.5.1, has some issues, so I was waiting for the 3.5.2 release (which is now out). I will verify spark-connect 3.5.2 this week. |
A quick and dirty version of Kyuubi Connect is available at #6642 |
@pan3793 Hi! I checked your PoC and built it locally. I tried to run some queries using pyspark and they finished successfully, nice work! Now, I suggest creating a list of tasks that are required to complete this solution. These tasks include supporting all gRPC Spark Connect API methods and refactoring the current code to seamlessly integrate the PoC. This will allow us to work simultaneously and add functionality to the master branch more quickly. Could you please share any changes that break the current thrift-based logic and any things that need to be refactored that you noticed during the implementation of this solution, so we can use this information as a starting point? |
Mentor
Skill requirements
Background and Goals
Make the Kyuubi server compatible with the Spark Connect protocol, so that people can use a Spark Connect client to connect to the Kyuubi server.
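To make the goal concrete, here is a minimal sketch of how a client might address such an endpoint. It only builds a Spark Connect connection string of the form `sc://host:port/;key=value`. The host name `kyuubi.example.com`, the helper `build_connect_url`, and the choice of port 15002 (Spark Connect's own default) for a Kyuubi frontend are assumptions for illustration, not something this issue specifies.

```python
from typing import Mapping, Optional


def build_connect_url(host: str, port: int = 15002,
                      params: Optional[Mapping[str, str]] = None) -> str:
    """Build a Spark Connect connection string (hypothetical helper).

    Values are assumed to be plain strings that need no escaping.
    """
    url = f"sc://{host}:{port}"
    if params:
        # Parameters follow the path as ';'-separated key=value pairs.
        url += "/;" + ";".join(f"{k}={v}" for k, v in params.items())
    return url


# Hypothetical Kyuubi endpoint; user_id and use_ssl are standard
# Spark Connect connection-string parameters.
url = build_connect_url("kyuubi.example.com", 15002,
                        {"user_id": "alice", "use_ssl": "true"})
print(url)
```

With PySpark 3.4 or later, a string like this would typically be passed to `SparkSession.builder.remote(url).getOrCreate()`. Whether Kyuubi would accept exactly these parameters is not decided here.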
Implementation steps
1. Add a new Spark Connect frontend
1.1 Add a basic gRPC server as the frontend
1.2 Be compatible with the Spark Connect protocol; see https://github.com/apache/spark/blob/master/connector/connect/common/src/main/protobuf/spark/connect/base.proto
1.3 Support ExecutePlan
1.4 Support AnalyzePlan
1.5 Support Config
1.6 Support AddArtifacts
1.7 Support ArtifactStatus
1.8 Support Interrupt
1.9 Support ReattachExecute
1.10 Support ReleaseExecute
1.11 Serialize the protobuf-based request
2. Add a new Spark Connect backend
2.1 Import spark-connect-server and rewrite SparkConnectService; see https://github.com/apache/spark/blob/master/connector/connect/server/src/main/scala/org/apache/spark/sql/connect/service/SparkConnectServer.scala
2.2 Deserialize the response into protobuf-based messages
3. Add IT
4. Add docs
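Since steps 1.3 to 1.10 can land one RPC at a time, the frontend could be sketched as a dispatch table that rejects methods not yet implemented. The following is only an illustration under stated assumptions: `ConnectFrontend`, `UnsupportedRpc`, and the dict-based requests are hypothetical stand-ins (a real frontend would use the gRPC stubs generated from base.proto), and the RPC names are written as they appear in that proto file.

```python
from typing import Callable, Dict, List

# RPC names from steps 1.3 to 1.10 (the artifact status RPC is named
# ArtifactStatus in base.proto).
CONNECT_RPCS = (
    "ExecutePlan", "AnalyzePlan", "Config", "AddArtifacts",
    "ArtifactStatus", "Interrupt", "ReattachExecute", "ReleaseExecute",
)


class UnsupportedRpc(Exception):
    """Raised when a known RPC has no handler registered yet."""


class ConnectFrontend:
    """Hypothetical dispatcher; plain dicts stand in for protobuf messages."""

    def __init__(self) -> None:
        self._handlers: Dict[str, Callable[[dict], dict]] = {}

    def register(self, rpc: str, handler: Callable[[dict], dict]) -> None:
        if rpc not in CONNECT_RPCS:
            raise ValueError(f"not a Spark Connect RPC: {rpc}")
        self._handlers[rpc] = handler

    def dispatch(self, rpc: str, request: dict) -> dict:
        handler = self._handlers.get(rpc)
        if handler is None:
            raise UnsupportedRpc(rpc)
        return handler(request)

    def missing(self) -> List[str]:
        """RPCs from the task list that still need a handler."""
        return [r for r in CONNECT_RPCS if r not in self._handlers]


frontend = ConnectFrontend()
# Toy Config handler: echoes the session id with an empty key/value map.
frontend.register("Config", lambda req: {"session_id": req["session_id"], "pairs": {}})
print(frontend.dispatch("Config", {"session_id": "s1"}))
print(frontend.missing())
```

A table like this would also make it easy to report which sub-tasks remain open, which fits the plan of splitting the umbrella issue into sub-issues.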
Additional context
Introduction of #6232