Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add inner and left joins with a single join condition #19

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

AuPath
Copy link
Collaborator

@AuPath AuPath commented Apr 9, 2024

Implemented functionality in a VTL template to facilitate left and inner joins between two dataframes. Users can now utilize $functions.leftJoin($df1, $df2, "key of df1", "key of df2") or $functions.innerJoin(...) within the template.

For convenience, both functions support a "shortcut" when the joining column names are identical in both dataframes. This shortcut can be invoked as $function.leftJoin($df1, $df2, "key").

In cases where the resulting dataframe from the join operation contains multiple columns with the same name, the system now generates a runtime exception. This exception notifies the user regarding which columns need renaming in the input dataframes and instructs them to use the renameDataframeColumn function.

The renameDataframeColumn function is accessible from the VTL through $functions.renameDataframeColumn($df, "oldColumn", "newColumn").

One thing that is not addressed in this PR is the possibility of specifying multiple join conditions to be evaluated as a chain of ANDs. It should however be relatively straightforward to implement at a later date should we need it.

@AuPath AuPath requested a review from marioscrock April 9, 2024 16:17
@AuPath AuPath self-assigned this Apr 9, 2024
if (s.contains(prefix)) {
return s.replace(prefix, "");
}
if (s != null && prefix != null) if (s.contains(prefix)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move again as separate row

}
else
throw new IllegalArgumentException("FILE: " + fileName + " FOR RDFREADER DOES NOT EXIST");
if (fileName != null) if ((new File(fileName)).exists()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like the consecutive if on the same line, but I assume this is automatically done by some plugin

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work on the test cases!

}
}

public List<Map<String, String>> renameDataframeColumn(List<Map<String, String>> dataFrame, String oldColumn, String newColumn) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would try to use the Java Stream API. Check the last message here: https://chat.openai.com/share/9e8482b2-3c1b-4ddc-b6f4-f131ce39f2f0

return leftJoin(leftTable, rightTable, key, key);
}

public List<Map<String, String>> leftJoin(List<Map<String, String>> leftTable, List<Map<String, String>> rightTable, String leftKey, String rightKey) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have to check the code in details but I would use the Java Stream API, check https://chat.openai.com/share/9e8482b2-3c1b-4ddc-b6f4-f131ce39f2f0

return innerJoin(leftTable, rightTable, key, key);
}

public List<Map<String, String>> innerJoin(List<Map<String, String>> leftTable, List<Map<String, String>> rightTable, String leftKey, String rightKey) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to refactor and reuse portion of the code between the two joins?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Implement a default behaviour if no condition is provided? i.e., cross-join btw the dataframes

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would add a separate crossJoin method. If the user knows that he wants to perform a crossJoin he would have to do so explicitly. Additionally this would make the template easier to reason about. Someone reading the template knows that i am doing a crossJoin and does not have to know that it is the behaviour of an innerJoin when no condition is specified. From what i gathered it is non-standard to do a cross join when no condition is given to a leftJoin. WDYT?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree on having a separate method to do it explicitly but I would also add a method .join(table1,table2,key,key) that behaves like JOIN in SQL, i.e., that (if I got it correctly):

  • performs by default an inner join if the conditions are provided
  • performs by default a cross-join if the conditions are not provided (null/empty keys)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants