[c++] Integrate SOMAColumn: Arrow adapter methods, part 1 #3405

Open
wants to merge 22 commits into
base: main

Conversation

XanthosXanthopoulos
Collaborator

@XanthosXanthopoulos XanthosXanthopoulos commented Dec 6, 2024

This PR reworks the Arrow schema to TileDB schema transformation to use the SOMAColumn create methods.
It also adds a set of data converters from Arrow arrays to std::array to simplify the code.

This migration also enforces the current domain restriction for string dimensions in libtiledbsoma itself; previously the restriction was present only in the R and Python APIs.

@XanthosXanthopoulos XanthosXanthopoulos changed the title Integrate SOMAColumn in Arrow adapter methods [WIP] Integrate SOMAColumn in Arrow adapter methods Part 2 Dec 8, 2024
@XanthosXanthopoulos XanthosXanthopoulos changed the title Integrate SOMAColumn in Arrow adapter methods Part 2 [c++] Integrate SOMAColumn in Arrow adapter methods Part 2 Dec 8, 2024
@XanthosXanthopoulos XanthosXanthopoulos marked this pull request as ready for review December 8, 2024 16:25
@johnkerl johnkerl changed the title [c++] Integrate SOMAColumn in Arrow adapter methods Part 2 [c++] Integrate SOMAColumn in Arrow adapter methods, part 2 Dec 9, 2024
Member

@nguyenv nguyenv left a comment


Can you explain what Skip and Take are for and document them? It looks like Take is the index of the column to retrieve and Skip is relevant only for geometry columns (where it's always 2)?

Also is there a way to use std::variant or a templated type instead of std::any or would that make things too complicated?

Comment on lines 743 to 867
/**
* Return a copy of the data in a specified column of an arrow table.
* Complex column types are supported. The for each sub column are an
* std::array<T, 2> casted as an std::any object.
*/
Member

Suggested change
/**
* Return a copy of the data in a specified column of an arrow table.
* Complex column types are supported. The for each sub column are an
* std::array<T, 2> casted as an std::any object.
*/
/**
* Return a copy of the data in a specified column of an arrow table.
* Complex column types are supported. The type for each sub column is
* an std::array<T, 2> casted as an std::any object.
*/

Collaborator Author

Skip and Take are used in 2 places with 2 specific sets of values (either Skip=3 and Take=2, or Skip=0 and Take=2) and are independent of the geometry column. They extract specific subranges of ArrowArray data, which comes in handy during the ArrowSchema -> TileDBSchema conversion, where the Arrow array provided has 5 values per dimension and we only need the last 2 to set the current domain.
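
To make the Skip/Take description concrete, here is a minimal sketch. It is not the PR's implementation: `take_subrange` is a hypothetical name, and the flat `std::vector` holding the per-dimension values is an assumed layout chosen only for illustration.

```cpp
#include <array>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical helper (not libtiledbsoma API): copy `Take` consecutive
// values, starting `Skip` values into the buffer.
template <typename T, std::size_t Take, std::size_t Skip = 0>
std::array<T, Take> take_subrange(const std::vector<T>& values) {
    if (values.size() < Skip + Take) {
        throw std::out_of_range("buffer holds fewer than Skip + Take values");
    }
    std::array<T, Take> out{};
    for (std::size_t i = 0; i < Take; ++i) {
        out[i] = values[Skip + i];
    }
    return out;
}
```

With five values per dimension, `take_subrange<T, 2, 3>(values)` yields the last two values and `take_subrange<T, 2>(values)` the first two, matching the two Skip/Take combinations mentioned above.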

As for using std::variant, adding more SOMAColumn types would require changing multiple variants. The use of std::any here enables runtime polymorphism and indirectly introduces a runtime type check (via any_cast, make_any) between the templated function and the actual dimension type. std::variant can provide all of the above; it is just a different style, and I am open to discussing it further.
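
The trade-off can be sketched as follows, purely as an illustration. With `std::any` the type check happens at run time inside `std::any_cast`, while a `std::variant` spells out its alternatives in the type itself, so supporting a new type means editing the variant definition. `DomainVariant`, `domain_from_any`, and `domain_from_variant` are hypothetical names, not libtiledbsoma API.

```cpp
#include <any>
#include <array>
#include <cstdint>
#include <string>
#include <variant>

// std::any: the stored type is checked at run time by std::any_cast,
// which throws std::bad_any_cast on a mismatch.
template <typename T>
std::array<T, 2> domain_from_any(const std::any& slot) {
    return std::any_cast<std::array<T, 2>>(slot);
}

// std::variant: the closed set of alternatives is listed up front;
// supporting another type means extending this alias.
using DomainVariant = std::variant<
    std::array<std::int64_t, 2>,
    std::array<double, 2>,
    std::array<std::string, 2>>;

template <typename T>
std::array<T, 2> domain_from_variant(const DomainVariant& slot) {
    // std::get throws std::bad_variant_access if another alternative is held.
    return std::get<std::array<T, 2>>(slot);
}
```

Both versions fail at run time when the requested type does not match the stored one; the difference is whether the set of allowed types is open (`std::any`) or fixed in one place (`std::variant`).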

@XanthosXanthopoulos XanthosXanthopoulos force-pushed the xan/sc-59427/soma-column-arrow-integration branch 2 times, most recently from 5485141 to 0e69ed7 Compare December 13, 2024 16:07
@XanthosXanthopoulos XanthosXanthopoulos changed the base branch from xan/sc-59427/soma-column to xan/sc-59427/soma-geometry-column December 13, 2024 16:08
@XanthosXanthopoulos XanthosXanthopoulos changed the title [c++] Integrate SOMAColumn in Arrow adapter methods, part 2 [c++] Integrate SOMAColumn: Arrow adapter methods, part 1 Dec 13, 2024
@XanthosXanthopoulos XanthosXanthopoulos force-pushed the xan/sc-59427/soma-column-arrow-integration branch from 0e69ed7 to d6d6187 Compare December 13, 2024 16:10
@XanthosXanthopoulos XanthosXanthopoulos force-pushed the xan/sc-59427/soma-geometry-column branch 4 times, most recently from 8daf17e to a426c7a Compare January 8, 2025 19:04
Base automatically changed from xan/sc-59427/soma-geometry-column to main January 8, 2025 19:38
@XanthosXanthopoulos XanthosXanthopoulos marked this pull request as draft January 14, 2025 14:04
@XanthosXanthopoulos XanthosXanthopoulos force-pushed the xan/sc-59427/soma-column-arrow-integration branch from d6d6187 to af1f010 Compare January 14, 2025 14:05

codecov bot commented Jan 14, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 86.27%. Comparing base (7616147) to head (77a01e1).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3405      +/-   ##
==========================================
+ Coverage   86.22%   86.27%   +0.04%     
==========================================
  Files          55       55              
  Lines        6410     6410              
==========================================
+ Hits         5527     5530       +3     
+ Misses        883      880       -3     
Flag Coverage Δ
python 86.27% <ø> (+0.04%) ⬆️

Flags with carried forward coverage won't be shown.

Components Coverage Δ
python_api 86.27% <ø> (+0.04%) ⬆️
libtiledbsoma ∅ <ø> (∅)

@XanthosXanthopoulos XanthosXanthopoulos force-pushed the xan/sc-59427/soma-column-arrow-integration branch from af1f010 to 2401416 Compare January 14, 2025 18:38
@XanthosXanthopoulos XanthosXanthopoulos marked this pull request as ready for review January 14, 2025 18:39
@XanthosXanthopoulos XanthosXanthopoulos force-pushed the xan/sc-59427/soma-column-arrow-integration branch from 2401416 to 77a01e1 Compare January 15, 2025 12:02
@jp-dark jp-dark requested a review from nguyenv January 15, 2025 14:51
Comment on lines +846 to +847
template <typename T, size_t Take, size_t Skip = 0>
static std::vector<std::array<T, Take>> get_table_column_by_name(
Collaborator

This doesn't appear to be used anywhere. Can you pull it out into a separate PR and/or add it to the branch where it is used?

Collaborator

@jp-dark jp-dark left a comment


I didn't review the get_table_column_by_name function, but everything else looks good to me. However, someone with more familiarity with the arrow adapter code should look over this as well.

Member

@johnkerl johnkerl left a comment


Thanks for working on this @XanthosXanthopoulos !

@@ -976,8 +980,8 @@ void ArrowAdapter::_set_current_domain_slot(
LOG_DEBUG(std::format(
"[ArrowAdapter] {} current_domain float {} to {}",
name,
std::to_string(lo),
Member

I do seem to recall these were about crash avoidance on some platform (I don't recall which). I'd rather leave these as-is please.

Comment on lines +1135 to +1139
columns.begin(), columns.end(), [&](auto col) {
return strcmp(
col->name().c_str(),
index_column_schema->children[i]->name) == 0;
});
Member

This could be pulled out into a utility function
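
One possible shape for such a utility, sketched under assumptions: `Column` is a stand-in for the real column class (only a `name()` accessor is assumed, as in the snippet above), and `find_column_by_name` is a hypothetical name rather than an existing libtiledbsoma function.

```cpp
#include <algorithm>
#include <memory>
#include <string>
#include <string_view>
#include <utility>
#include <vector>

// Stand-in for the real column type; only name() is assumed.
class Column {
   public:
    explicit Column(std::string name) : name_(std::move(name)) {}
    const std::string& name() const { return name_; }

   private:
    std::string name_;
};

using ColumnList = std::vector<std::shared_ptr<Column>>;

// Hypothetical helper: iterator to the first column with the given name,
// or columns.end() if there is none. A non-null `const char*` (such as an
// ArrowSchema child name) converts implicitly to std::string_view.
inline ColumnList::const_iterator find_column_by_name(
    const ColumnList& columns, std::string_view name) {
    return std::find_if(
        columns.begin(),
        columns.end(),
        [name](const std::shared_ptr<Column>& col) {
            return col->name() == name;
        });
}
```

Comparing through `std::string_view` also avoids the `c_str()` plus `strcmp` round trip in the lambda.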

if (column == columns.end()) {
throw TileDBSOMAError(std::format(
"[ArrowAdapter][tiledb_schema_from_arrow_schema] Index column "
"{} missing",
Member

Suggested change
"{} missing",
"'{}' missing",

Strings in error messages should always be quoted/bracketed. Even if you think it's impossible for the string to ever be empty. Heaven forbid someday there is some bug somewhere somehow ... and an empty string gets in here ... that needs to be clear to everyone that sees the error message.
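
A small sketch of the point about quoting. The PR builds its message with std::format; this example uses plain string concatenation so it also builds on toolchains without `<format>`, and `missing_column_message` is a hypothetical helper name.

```cpp
#include <string>

// Hypothetical helper illustrating the quoting advice: the column name is
// always wrapped in single quotes, even when it is empty.
inline std::string missing_column_message(const std::string& name) {
    return "[ArrowAdapter][tiledb_schema_from_arrow_schema] Index column '" +
           name + "' missing";
}
```

With an empty name the message ends in `Index column '' missing`, so the empty value is visible, whereas the unquoted form would leave only a doubled space.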

index_column_schema->children[i]->name));
}

if ((*column)->tiledb_dimensions().has_value()) {
Member

This is a bit awkward.

  • column needs to be column_it or some such -- it is not a column, it is an iterator
  • Then, const auto column = *column_it after you check that column_it != columns.end()
  • Then the rest of these (*column)->foo become column->foo as they should be
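
The three points above could look like this in a small, self-contained sketch. `Column`, `require_column`, and the use of std::runtime_error in place of TileDBSOMAError are assumptions made for illustration only.

```cpp
#include <algorithm>
#include <memory>
#include <stdexcept>
#include <string>
#include <vector>

// Stand-in for the real column type.
struct Column {
    std::string name;
};

// Hypothetical helper: look up a column by name or throw.
inline std::shared_ptr<Column> require_column(
    const std::vector<std::shared_ptr<Column>>& columns,
    const std::string& name) {
    // The result of find_if is an iterator, so it is named as one.
    const auto column_it = std::find_if(
        columns.begin(), columns.end(), [&name](const auto& col) {
            return col->name == name;
        });
    if (column_it == columns.end()) {
        throw std::runtime_error("Index column '" + name + "' missing");
    }
    // Dereference once, only after the end() check.
    const auto column = *column_it;
    return column;
}
```

Code after the lookup can then write `column->name` instead of `(*column)->name`.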

if (strcmp(child->name, col_name) != 0) {
continue;
if (column->name() == SOMA_GEOMETRY_COLUMN_NAME) {
std::vector<std::any> dom;
Member

As on previous PRs: we cannot simply say dom or domain, ever.

There are four names it can refer to, in two equivalent pairs:

  • core domain (which is soma maxdomain)
  • soma domain (which is core current domain)

The names are confusing (and it is too late to change them), confusion comes too easily, and developer confusion is high-risk.

Member

Perhaps rename dom to cdslot

const void* buff,
NDRectangle& ndrect,
std::string name);
template <typename T, size_t Take, size_t Skip = 0>
Member

Please state as a compact summary, right here as a code comment, what Take and Skip are for, what they do, and an example usage.

Development

Successfully merging this pull request may close these issues.

[c++] Add an abstraction layer between SOMA columns and TileDB dimensions and attributes
4 participants