Can we breathe life back into this project? #1162
Comments
Update 30th Aug 2023 |
but is there any way to bring it back to life? |
@GoEddie @GeorgeS2019 since the project is open-source and part of the .NET Foundation is this something the community would be interested in contributing to in order to help move it forward? |
@luisquintanilla We need to identify members here who are interested in maintaining the project and merging, e.g., PRs. Members here who are interested, please let @luisquintanilla know. |
@luisquintanilla I'm definitely interested in helping to keep the project moving forward. I stopped raising/reviewing PRs as they were not getting reviewed and merged, but if there is a committer who is available to do that, or if there was an opportunity for community members to become committers, then I would be interested. |
Thanks @GoEddie & @GeorgeS2019. We will definitely look into how we can unblock you and, by extension, the project. |
@luisquintanilla I can try to help. I've encountered some minor bugs that may be low-hanging fruit. One thing that might bog us down the most is not having the means of updating the Microsoft nuget package. Can you please explain (or give us links that explain) how the nuget packaging works for community projects, and whether it is still possible to publish new versions of it (even after Microsoft has abandoned the community)? Will someone other than Microsoft need to start publishing a different nuget? Also I think there are portions of this project that need to be killed as the first order of business (especially if they were done on behalf of stakeholders who have left). For example, I'm pretty eager to kill all the weird cruft related to "Microsoft.Data.Analysis". That is a very minor amount of code that never worked well, and caused a lot of confusion. For example there are critical overloaded class names like "DataFrame" which are part of both namespaces! It was a bad fit for this project. Anyone who still needs to do an integration with "Microsoft.Data.Analysis" can do their own independent work to reintroduce that mess on their own. (That other project isn't even v.1 yet, in any case.) |
@GoEddie I am pretty disgruntled about the Synapse side of this story. In 2022 I had migrated all my projects from Databricks to Synapse where Microsoft was trying to monetize .Net for Spark. After spending several months working on this migration of my .Net projects, I encountered a relatively innocent bug that only affected Synapse (not OSS and not Databricks). I reported the bug right around the time that the engineers were exiting the company. Only at the very end of an eight-month support case did they say that they can no longer support .Net, and they are removing it from future versions of Synapse Analytics. It was a pretty painful experience, as you can probably imagine. (As an aside, the bug turned out to be just some stupid DNS configuration issue in their Bionic Beaver image, and wasn't specific to .Net. It would have affected the other language bindings as well.) It doesn't stop there. After they removed .Net from their 3.3 runtime, this Synapse team started to deliberately spread misinformation to discourage people from using .Net on Spark at all. See https://learn.microsoft.com/en-us/azure/synapse-analytics/spark/apache-spark-33-runtime They say this is a "project under the .NET Foundation that currently requires the .NET 3.1 library, which has reached the out-of-support status". Of course, the claim that Spark requires .Net Core 3.1 is categorically false. I'm able to run the project on .Net 6 without any problems. To add insult to injury they tell people "We recommend that users with existing workloads ... migrate to Python or Scala." I'm guessing that they are worried about the possibility that their .Net customers will just leave Synapse and find a better Spark offering. The PG's documentation doesn't even refer people back to this community, in order to pursue an alternative path forward with .Net. I find their communication to be pretty dishonest. And it almost seems like a deliberate attempt to sabotage this project. 
I keep hoping we will see first-class support for .Net in Databricks, now that they have scalped some of the smart engineers from Synapse. (I do a regular google search, but it hasn't quite panned out yet.) I had very high hopes for Synapse when they advertised their first-class .Net language bindings. There were even promotional community sessions where .Net for Spark was discussed, like here: https://www.youtube.com/watch?v=-VpQheD-vE8 In any case, I am not that worried about the future. I am pretty sure .Net isn't going to die any time soon; and neither is Spark. The marriage of this pair is off to a rocky start. But they can't be kept apart forever! |
@dbeavon Thanks for filling in the missing details! I completely agree about getting rid of some of the other components. Probably the first thing to do is to get the core Microsoft.Spark working with the newer versions of Spark, get rid of things like the extensions etc., and maybe bring them back in the future. |
@GoEddie Thanks for not giving up on this project. The fact that we are having this discussion is somehow sad! Microsoft cannot advocate the latest AI, large language models, and Copilot without empowering .NET users on Spark.NET. It is the key, and perhaps one of the very few available, big data analytics pipelines that lets the .NET community continue its support for Microsoft's latest AI leadership. |
As @luisquintanilla pointed out: The project is part of the .NET Foundation, which is open source, and everybody can join to be a part of the .NET success story 😃: Become a member. Nevertheless, I think the current issue this project has is that it is missing a maintainer who is willing to actively review/merge PRs, handle issues, publish new releases etc. @GoEddie @dbeavon @GeorgeS2019 Can you see yourselves being part of the success story for .NET for Apache Spark? |
@leo-schick yes, but I'm not sure what the process would be; at the moment it seems like no maintainers are active. It would be amazing if Microsoft invested paid developers in the project, but even without that we, as a community, can maintain the project. At the moment, though, we have no access to anything! |
As was pointed out earlier by @GeorgeS2019 in his post, a maintainer is missing. If you want to become one, please contact @luisquintanilla |
Thanks @leo-schick have you got any contact details for yourself or @luisquintanilla? My email is ed.elliott@outlook.com |
I think dotnet-spark can be more liberal about which Apache Spark versions it supports. Instead of erroring out, it can just put out a warning. Apache Spark releases are more frequent, but I have found that existing dotnet-spark code works fine with newer versions of Apache Spark. But I have to update the dotnet-spark code locally to make it work with new Spark versions (as the nuget packages don't work). A more liberal version matching policy should reduce some of the ongoing support effort required. |
So, is there a next step here? I'm also really interested in helping move this project forward @luisquintanilla |
Hey @bolcman we are trying, it will take a bit more time but we will get there one way or another! Let's keep this issue as a way for people to say if they want to contribute, it would be good to get an idea of numbers |
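To make the "warn instead of error" idea above concrete, here is a minimal sketch of what a lenient version check could look like. It is written in Python purely for illustration and is not dotnet-spark's actual code: the `SUPPORTED_VERSIONS` list, the `check_spark_version` function, and the `strict` flag are all hypothetical names invented for this example.

```python
import warnings

# Hypothetical list of Spark versions a given build was tested against.
# These values are made up for illustration.
SUPPORTED_VERSIONS = ["3.2.0", "3.2.1", "3.2.2", "3.2.3"]


def check_spark_version(running, supported=SUPPORTED_VERSIONS, strict=False):
    """Decide whether to proceed on the given Spark version.

    strict=True  -> untested versions raise an error (the current pain point).
    strict=False -> untested versions only produce a warning and we proceed.
    """
    if running in supported:
        return True

    message = (
        f"Spark {running} is not in the tested list {supported}; "
        "behavior on this version is unverified."
    )
    if strict:
        # Strict policy: refuse to run on anything that was not tested.
        raise RuntimeError(message)

    # Lenient policy: tell the user, then carry on.
    warnings.warn(message)
    return True
```

With a default along these lines, a new Spark patch or minor release would not force a new package release just to get past the version gate, and users who want the old behavior could still opt in with something like `strict=True`.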
@GoEddie how are things progressing here? I did a chunk of the .NET 6 PR a while back but sadly it was never released. Now .NET 8 is out and .NET 9 is under development... As well as the publish-to-Nuget problem @dbeavon identified above, I also ran into problems accessing the PR builds on Azure DevOps, which made troubleshooting the tests and other similar issues rather tricky, and I ultimately relied on someone from Microsoft to resolve them (@AFFogarty, if I remember rightly). This will need addressing too. My company has a small handful of dotnet-spark jobs, but these are in our legacy pile, with further investment focussed on Pyspark. dotnet-spark was perfect for us ~3 years ago, but we've now moved on. I think a key part of reviving this project needs to be resurrecting the Spark conversations (perhaps with the assistance of those people who have since moved on to Databricks, if they're willing), to put dotnet-spark on a level with Pyspark in terms of support and documentation. I'm interested in offering my time to this project if it still has some life in it. |
@GoEddie I've been analyzing the .NET market and individual projects for a few years, since .NET has not been doing so well in the China market. I can give you some detailed evidence about this project here. I analyzed the major contributors of this project just now and I notice this project is mainly maintained by MSFT employees. The founder of this project, @imback82, joined Databricks in Apr 2022 according to his Linkedin. He is the most senior developer in the MSFT contributor team, with a principal title. And the second major contributor, @suhsteve, is also looking for a new job according to his Linkedin, or he has left Microsoft. I have no idea if they are from the same team, but it looks like someone in Microsoft obviously made a decision to stop this project. The major reason may be that .NET is not so popular in the big data market, including using .NET to operate Spark. The nuget download count of this project is less than 1M, which is very low. Also ML.NET is not so popular (the Microsoft.ML package only reaches 6M downloads). I checked with a few data scientists around me in the past few years. Some are my ex-colleagues and some are community friends. No one is using .NET at all. The problem of this project is the lack of key community contributors like you. And the donation to the .NET Foundation doesn't really help attract new contributors. I've been on the .NET Foundation project committee for a while. To be honest, I don't think this foundation works as expected compared with the Apache Foundation and CNCF. They did promote projects with social media accounts. But this kind of promotion only helps developers learn about a new project; it is not nearly enough to attract developers to join as contributors. Although I'd like to see .NET boom in the market, we have to face the fact that .NET has totally failed to win the data science category. Python and Java are still the leading languages in it. Nothing has changed in the past 5-7 years. 
And I did notice that MSFT staff are contributing to some new open source projects like Semantic Kernel and Aspire now. I guess these projects are their new focus. They are changing track from a business perspective. |
@GoEddie There is another problem with this project. The project started in July 2019 and maintenance almost stopped at the end of 2022 (or even earlier). It was maintained for just 3 years. Usually, it takes at least 5 years to attract more contributors. And the major contributors should keep contributing the whole time. Otherwise, developers may think that the project is dying, and they are no longer willing to contribute, although some believe that there will be another hero who forks the project and restarts it. I have analyzed a lot of existing .NET open source projects. This kind of fork occasionally happens but it's very rare. |
@tonyqus I agree that this project has been stalled a little. I think it is primarily because everyone was waiting to see if Microsoft would come to their senses and try to re-hire some new engineers, like the ones they lost to Databricks (eg. Terry and Rahul). In any case, Microsoft has a lot of issues these days. Based on what I can tell, the Azure Synapse Analytics platform is falling apart, and it has nothing to do with the merits of C#.Net. Microsoft doesn't seem to have a great sense of direction or purpose in the area of big data. The C# bindings for Spark were a very important innovation. But right now Microsoft seems to be losing creativity and they are just dumping all of their customers into a mediocre swamp of tools called "Fabric", whether we like it or not! This approach is not likely to go very well in the long run. The approach is favored by those SaaS customers who were already heavily invested in Power BI. But it seems like an odd strategy for those of us looking for PaaS/IaaS options.
Going back to C#.Net.... the language is becoming more popular over time. It won the Tiobe language of the year: https://www.tiobe.com/tiobe-index/ I think it is still too early to say if C# will be adopted as a popular language for Spark workloads. One thing that I've learned about Spark is that you can't just reduce it to a "data science" platform. Or to a "data analyst" tool. It is used for a lot of other types of MPP workloads. It can be found under the covers of many cloud platforms. OSS Spark is a fairly cheap commodity, almost as inexpensive as the VM's that it runs on. I generally think of Spark as a general-purpose "container" for hosting data-oriented software algorithms at scale. I'm a big fan of Spark but a bigger fan of C#. There are many reasons. The tooling and the nuget ecosystem are both amazing. But I also like the ability to exchange code between a REST API (hosted in an on-premise IIS environment) and a Spark application hosted on an MPP cluster in the cloud. I can re-use the same underlying logic & data, and interact with it via synchronous requests or via asynchronous batches. We can go back and forth very easily without switching between programming languages. There is no need to find a Python programmer and ask them to create another copy of the application using a different language, simply for the sake of hosting on a cluster and getting the MPP advantages. I've been building applications using C# for 20 years; yet I'm still finding ways to use it more effectively and efficiently. It is very versatile and there are few applications that would be a bad fit for C#. I suspect that the audience who would be using C# for their Spark development would not necessarily be data scientists as you imply. They would be software engineers, who are already building software solutions which are hosted in various other containers (eg. in web servers, kubernetes, and so on). I'm not trying to detract from Python. 
It is a productive language and easy to pick up. But Python is never going to take the place of C#.Net. C# is extremely well suited for high-performing applications that need to evolve over a very long period of time. It performs well, has lots of value-type data structures, and even its heap data is very efficient. On a related note, one of the new performance benefits that I'm very excited about in .Net 8 is the AOT compilation. This should soon be accessible to my Spark jobs as well. What does this mean? It means the UDF's built on C#.net will be so fast that they will probably exceed the performance of the OSS Spark core itself. The only performance implication in selecting C# over Java/Scala is that it will always require Apache Arrow to exchange data between the Spark core and the UDF's. |
@GoEddie Can you help me review #1166? That issue (binary serialization) was a concern that was expressed by @AFFogarty. FYI, I really don't think binary serialization is a significant concern, aside from the fact that Microsoft is deprecating their class. We are simply replacing it with another binary serialization library (presumably with the same "vulnerabilities"). In any case the vulnerabilities are greatly overshadowed by the remote code execution that is a key feature of Spark solutions. Also, can you tell me how people become committers or writers on a project like this? I don't have that much prior experience with community github projects. I would love to help in some way. I think it is unlikely that Microsoft will give this project any TLC for another year or two. In the meantime there are obvious things we can do to keep it alive, like keeping up with versions of Spark 3.x.x and keeping up with .Net 8. |
I have spent some time looking at the Spark Connect gRPC API and have put together a new .NET version of DataFrame API that uses the Spark Connect interface, which is actually working pretty well - it works against a local Spark Server as well as Databricks. If anyone is interested in trying it or contributing, please see: https://github.com/GoEddie/spark-connect-dotnet It is my hope that one day we can get access to this repo and nuget packages but in the meantime this supports Spark 3.4.0+ Repo: Nuget: |
Hey @wudanzy I saw your note on my decimal type or that you had gained write access to the repo which is great news, can you tell us:
Thanks! Ed |
Hi @GoEddie, sorry, I don't know the background and stories behind the project in previous years. I just joined Microsoft early this year and I am from another team. I can share our team's story with you: I am from an internal platform team where we provide big data infrastructure for other Microsoft teams. We had some customers that use Spark.NET and we noticed the OSS Spark.NET was not making progress (for example, we need to upgrade to 3.3), so we negotiated with the owner team and obtained write permission. |
I am personally a fan of OSS projects, so I would like to push this OSS project forward instead of only maintaining a private copy. |
Perhaps MSFT China can take over this project. Let me check with some devs in MSFT. Mainly they would need to volunteer to contribute to this project. Since I left MSFT China many years ago, it's hard to guarantee anything here. But I can try. @wudanzy I added you on Linkedin |
Thanks, it feels like we really need some governance, perhaps by joining the dotnet foundation as a first step. What I'd really like to see is a structure put in place to stop the project being in the position it has been in as users can't rely on it as it stands. |
I'm afraid DNF cannot help. At least, you cannot get any new contributors from DNF. Let's rely on MSFT or the open source community itself. |
I'd like to share my experience as a senior big data architect at Gridsum (NASDAQ: GSUM) during 2019 to 2020. I was the most senior architect (T4) in the Gridsum Shanghai office. Use Case from Gridsum We use Spark.NET to connect to the Spark cluster in our system. Gridsum creates a data visualization system based on .NET Core 3.1. We use Postgresql as the central database and pull data from other sources such as Oracle database, Hadoop cluster, Spark cluster, SQL Server, MySQL, and so on. |
Can someone work on preparing a list of work items and pain points you would like to see resolved in this project? I can work with you all and Dan to get these items prioritized.
Vikas
|
I'd say there are a couple areas that need to be figured out:
Is this possible? |
Hi @GoEddie, to be honest I don't know the answers to the non-technical questions, but I can help on the technical aspects. From a technical perspective we faced the same challenges as you. For example, we wanted to upgrade it to 3.3 and 3.5 (we already did that) so it matches our Scala offerings. We also faced some inefficiencies around the UDFs and made some improvements internally. As Vikas suggested, if you can share a list, we can combine that with ours and move things forward. I don't know the answers to things that are out of my control, but I am eager to see this project get back on the right track. |
As I recall two things happened:
1. Microsoft announced that Synapse wouldn’t support .NET Spark DLLs anymore
2. The project here was abandoned
I think 1 drove 2. We were 90% of the way to delivering a major update based on this and had to abandon 3 months of work. We just assumed that was it. Even with the project being revived, if Synapse or Fabric doesn’t support it I think it’s still DOA for a lot of teams. |
I am very hopeful; with the support of this community, and some of the internal Microsoft needs, we will be able to revive this. There is a lot of existing C# code that developers would like to leverage in dotnet/spark.
Let's focus on pain points and asks so we can collectively prioritize and push this forward.
Regards,
Vikas
|
I’m hopeful but cautious. In the absence of a commitment from the Synapse team I’m very cautious. I’m curious if we’ve heard anything on that end. |
Hi @Macromullet, I noticed that you are at Microsoft. If that's the case, you can connect with our team (@vsabharwal is our PM lead) and see how we can help you. Reviving this project can benefit all developers and users who use C# to run Spark, and this is independent of the service provider. I don't know the story on the Synapse side, but sometimes it is a chicken-and-egg problem: when the community is healthy and there are many Spark.NET users, they may review their decision. |
Hi @tonyqus, thanks for the list! We will review those PRs and help them proceed. Due to limited bandwidth in our team, we have to control the number of PRs under review. If you think that some PRs are more important than others, please move them to the head of the list. Thanks! |
Nice to see more people showing interest here! The .NET 8 PR is ready, while .NET 9 is not LTS and, in my opinion, not critical. Once .NET 8 is merged, I plan to add updates for
It would also be great to address .NET Interactive UDFs, as this is a key obstacle preventing curious users from proceeding with Dotnet.Spark. I'll have a look when there's some time. |
Hi All,
It is pretty obvious that the project has come to a bit of a halt and I wondered if there was anything that we can do to get it up and running again?
I don't know the reason why it stopped in February, maybe if we knew that we could support it in some way?
I do know that I have had multiple orgs who wanted to use .NET and Spark in Databricks, but in all honesty I couldn't recommend using it as it seems the project is now dead.
What would it take to bring it up to date with both .NET 7 support and support up to Apache Spark 3.4.1?
Is Microsoft willing to invest in the project or is the community able to take it on and progress it?
I'm just hoping to start a discussion really as it is a shame so much work went into this and we were so close to being able to use .NET instead of Python or Scala but it feels frustrating that we can't use .NET.
cc: @MikeRys @AFFogarty @suhsteve
@GeorgeS2019