[WIP] [Documentation] UDF Guide #416

Open: wants to merge 5 commits into base: main

elvaliuliuliu
Contributor

This PR adds a user-defined function (UDF) guide that uses Row objects as examples (Row support was implemented via #376). It covers how to define UDFs, how to use UDFs with DataFrames, and so on.

Member

@bamurtaugh bamurtaugh left a comment

Thank you for writing this up! Left some initial comments.

@@ -0,0 +1,43 @@
# User-Defined Functions - C#
This documentation contains user-defined function (UDF) examples. It shows how to define UDFs and how to use UDFs with Row objects as examples.
Member

Suggested change
- This documentation contains user-defined function (UDF) examples. It shows how to define UDFs and how to use UDFs with Row objects as examples.
+ A user-defined function, or UDF, is a routine that can take in parameters, perform some sort of calculation, and then return a result. This document explains how to construct UDFs in C# and includes example functions, such as how to use UDFs with Row objects.

Member

Are UDFs applicable to any C# app, or just .NET for Spark apps? If they're just used in .NET for Spark apps, I'd add a sentence or two explaining how UDFs apply to/are useful in .NET for Spark.

Member

Is there a reason we're focusing on Row object examples? Could we include other examples and then make this intro more general (i.e. "...This document explains how to construct UDFs in C# and includes example functions.")?

Contributor Author

> Are UDFs applicable to any C# app, or just .NET for Spark apps? If they're just used in .NET for Spark apps, I'd add a sentence or two explaining how UDFs apply to/are useful in .NET for Spark.

I think we are talking about UDFs used within .NET for Spark here.

docs/user-defined-functions-c#.md (outdated, resolved)
## Pre-requisites:
Install Microsoft.Spark.Worker. When you want to execute a C# UDF, Spark needs to understand how to launch the .NET CLR to execute this UDF. Microsoft.Spark.Worker provides a collection of classes to Spark that enable this functionality. Please see more details at [how to install Microsoft.Spark.Worker](https://docs.microsoft.com/en-us/dotnet/spark/tutorials/get-started#5-install-net-for-apache-spark) and [how to deploy worker and UDF binaries](https://docs.microsoft.com/en-us/dotnet/spark/how-to-guides/deploy-worker-udf-binaries).

## UDF that takes in Row objects
Member

Why are we only showing examples of UDFs with Row objects? It seems like it'd be valuable to have this document explain how to write any UDF and show examples of all (or at least more types) of UDFs?

Or is the goal of this doc to only show Row-based UDFs (in this case, we should change the title and intro of the doc to reflect that, because right now it seems like it should explain all UDFs)?

Contributor Author

I think the purpose of this doc is to serve as a readme for using UDFs with Row objects. It goes with the recent PR that exposes UDFs returning Row objects. I think we can add more types later. @imback82 what do you think?

Contributor

We can start with UDFs with Row since there are a few gotchas with them, and we can expand this later.

Contributor Author

@elvaliuliuliu elvaliuliuliu Feb 6, 2020

> We can start with UDFs with Row since there are a few gotchas with them, and we can expand this later.

Sounds good!

## UDF that takes in Row objects

Member

Can we add some sentences providing context to this example?

For instance, as a reader, I have the following questions:

  • When would I use a UDF that takes in Row objects (as opposed to other types of UDFs)?
  • Do all UDFs just take in or return Row objects (since that's all that is shown in this doc)?
  • What is the goal of this code? What calculation or filtering is it performing and why?
  • What would be the output of this code?
  • Is this the only way to define UDFs (using Func<> myUdf = Udf<>(...))? What about spark.Udf().Register<>...?
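
For reference on the last question, here is a minimal sketch of both definition styles. It assumes the `Microsoft.Spark` package with its `Functions.Udf<T, TResult>` and `spark.Udf().Register<T, TResult>` overloads. The app name, view name, UDF name, and lambda bodies are hypothetical and are not taken from this PR.

```csharp
using System;
using Microsoft.Spark.Sql;
using static Microsoft.Spark.Sql.Functions;

class UdfStyles
{
    static void Main()
    {
        SparkSession spark = SparkSession.Builder().AppName("udf-styles").GetOrCreate();

        // spark.Range produces a single bigint column named "id".
        DataFrame df = spark.Range(0, 5);

        // Style 1: wrap a C# lambda as a Column -> Column function
        // and call it through the DataFrame API.
        Func<Column, Column> addPrefix = Udf<long, string>(id => $"id-{id}");
        df.Select(addPrefix(df["id"])).Show();

        // Style 2: register the lambda under a name (hypothetical: "add_prefix")
        // so that it can be called from Spark SQL.
        spark.Udf().Register<long, string>("add_prefix", id => $"id-{id}");
        df.CreateOrReplaceTempView("ids");
        spark.Sql("SELECT add_prefix(id) FROM ids").Show();

        spark.Stop();
    }
}
```

The first style suits code that stays in the DataFrame API, while the registered form is the one to reach for when the UDF has to be invoked from a SQL string.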

Contributor Author

Thanks for your suggestion! I was looking at UDF docs here. I am not sure how much detail we want to go into in this intro guide. Should we treat this as a readme for using UDFs with Row objects, or as a full UDF tutorial? This relates to your previous question as well.

```

## UDF that returns Row objects
Please note that `GenericRow` objects need to be used here.
Member

Same questions as above, so I think it'd be great to provide some additional context here. Also, why does GenericRow need to be used here?

df.Select(udf(df["id"])).Show();
```
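
As context for the `GenericRow` question, a self-contained sketch of a UDF that returns a Row might look like the following. It assumes the `Udf` overload that takes the return schema as a `StructType`. The schema, column names, and values are hypothetical and do not reproduce the snippet in the diff.

```csharp
using System;
using Microsoft.Spark.Sql;
using Microsoft.Spark.Sql.Types;
using static Microsoft.Spark.Sql.Functions;

class RowReturningUdf
{
    static void Main()
    {
        SparkSession spark = SparkSession.Builder().AppName("row-returning-udf").GetOrCreate();
        DataFrame df = spark.Range(0, 3);

        // Spark cannot infer the shape of a returned row from the lambda,
        // so the schema of the result is declared up front (hypothetical columns).
        var schema = new StructType(new[]
        {
            new StructField("id", new LongType()),
            new StructField("label", new StringType())
        });

        // The lambda builds a GenericRow whose values line up with the schema fields.
        Func<Column, Column> toRow = Udf<long>(
            id => new GenericRow(new object[] { id, $"item-{id}" }), schema);

        // The selected column is a struct with the fields declared in the schema.
        df.Select(toRow(df["id"])).Show();

        spark.Stop();
    }
}
```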

## Chained UDF with Row objects
Member

As above, it would be great to add some context/explanation. What is a scenario when I'd need to chain UDFs? What does this code do?

```csharp
// Chained UDF using udf1 and udf2 defined above.
df.Select(udf1(udf2(df["id"]))).Show();
```
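
Because `udf1` and `udf2` are not visible in this excerpt, the following sketch shows one way the chained call could be set up, assuming `udf2` returns a Row and `udf1` takes a Row. Both definitions, the schema, and the column names are hypothetical stand-ins rather than the PR's actual code.

```csharp
using System;
using Microsoft.Spark.Sql;
using Microsoft.Spark.Sql.Types;
using static Microsoft.Spark.Sql.Functions;

class ChainedRowUdfs
{
    static void Main()
    {
        SparkSession spark = SparkSession.Builder().AppName("chained-row-udfs").GetOrCreate();
        DataFrame df = spark.Range(0, 3);

        // Hypothetical schema for the Row produced by udf2.
        var schema = new StructType(new[]
        {
            new StructField("id", new LongType()),
            new StructField("label", new StringType())
        });

        // udf2 (assumed): turns each id into a two-field row.
        Func<Column, Column> udf2 = Udf<long>(
            id => new GenericRow(new object[] { id, $"item-{id}" }), schema);

        // udf1 (assumed): takes a Row and reads its second field by position.
        Func<Column, Column> udf1 = Udf<Row, string>(row => row.GetAs<string>(1));

        // Chained call: the output of udf2 feeds directly into udf1.
        df.Select(udf1(udf2(df["id"]))).Show();

        spark.Stop();
    }
}
```

Chaining of this kind is useful when one UDF packages several values into a row and a later UDF consumes that row, without materializing an intermediate column in between.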
Member

Adding a Next Steps, Resources, or Wrap Up section at the end could be really helpful. For example: "If you'd like to see more examples of UDFs in action, check out our XYZ examples in the .NET for Apache Spark GitHub repo."

elvaliuliuliu and others added 2 commits February 5, 2020 14:21
Co-Authored-By: Brigit Murtaugh <brigit.murtaugh@microsoft.com>
@elvaliuliuliu elvaliuliuliu changed the title from "[Documentation] UDF Guide" to "[WIP] [Documentation] UDF Guide" on Feb 26, 2020
Base automatically changed from master to main on March 18, 2021 16:48
3 participants