Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DRILL-8474: Adding Daffodil to Drill as a contrib. #2909

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

mbeckerle
Copy link
Contributor

@mbeckerle mbeckerle commented Apr 27, 2024

DRILL-8474: Adding Daffodil to Drill as a contrib.

This PR replaces: #2836 which is closed. That was to retain history/comments while squashing numerous debug-related commits together into this PR.

Description

Requires Daffodil version 3.7.0 or higher.

New format-daffodil module created

Still uses absolute paths for the schemaFileURI.
(which is cheating. Wouldn't work in a true distributed drill environment.)

We have yet to work out how to enable Drill to provide access for DFDL schemas in XML form with include/import to be resolved.

The input data stream is, however, being accessed in the proper Drill manner. Gunzip happened automatically. Nice.

Note: Fix boxed Boolean vs. boolean problem. Don't use boxed primitives in Format config objects.

Test show this works for data as complex as having nested repeating sub-records.

These DFDL types are supported:

  • int
  • long
  • short
  • byte
  • boolean
  • double
  • float (does not work. Bug DAFFODIL-2367)
  • hexBinary
  • string

Documentation

TBD: feature is incomplete still. It will require substantial documentation for users.

Testing

See tests under src/test in the new daffodil contrib module.

Comment on lines +68 to +93
<build>
<plugins>
<plugin>
<artifactId>maven-resources-plugin</artifactId>
<executions>
<execution>
<id>copy-java-sources</id>
<phase>process-sources</phase>
<goals>
<goal>copy-resources</goal>
</goals>
<configuration>
<outputDirectory>${basedir}/target/classes/org/apache/drill/exec/store/daffodil
</outputDirectory>
<resources>
<resource>
<directory>src/main/java/org/apache/drill/exec/store/daffodil</directory>
<filtering>true</filtering>
</resource>
</resources>
</configuration>
</execution>
</executions>
</plugin>
</plugins>
</build>
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does this section do? I noticed that the msaccess contrib doesn't have this section in its pom.xml.

@mbeckerle mbeckerle changed the title Adding Daffodil to Drill as a 'contrib' DRILL-2835: Adding Daffodil to Drill as a contrib. Apr 27, 2024
@mbeckerle mbeckerle changed the title DRILL-2835: Adding Daffodil to Drill as a contrib. DRILL-8474: Adding Daffodil to Drill as a contrib. Apr 27, 2024
@shfshihuafeng
Copy link
Contributor

shfshihuafeng commented Apr 28, 2024

This fails its tests due to a maven checkstyle failure. It's complaining about Drill:Exec:Vectors, which my code has no changes to.

Can someone advise on what is wrong here?

/home/runner/work/drill/drill/exec/vector/src/main/java/org/apache/drill/exec/record/metadata/MapBuilder.java:201:5:
you should use '{} for if' construct

 if (Objects.isNull(parent)) {
    throw new IllegalStateException("Call to resume() on MapBuilder with no parent.");
}

@mbeckerle
Copy link
Contributor Author

Tests are now failing due to these two things in TestDaffodilReader.scala

  String schemaURIRoot = "file:///opt/drill/contrib/format-daffodil/src/test/resources/";

That's an absolute URI that is used to obtain access to the schema files in this statement:

  private String selectRow(String schema, String file) {
    return "SELECT * FROM table(dfs.`data/" + file + "` " + " (type => 'daffodil'," + " " +
        "validationMode => 'true', " + " schemaURI => '" + schemaURIRoot + "schema/" + schema +
        ".dfdl.xsd'," + " rootName => 'row'," + " rootNamespace => null " + "))";
  }

This is assembling a select statement, and puts this absolute schemaURI into the schemaURI part of the select.

What should I be doing to arrange for these schema URIs to be found.

The schemas are a large complex set of files, not just a single file. Many files must be found relative to the initial root schema file. (Hundreds of files potentially). As they include/import other schema files using relative paths.

@cgivre
Copy link
Contributor

cgivre commented May 6, 2024 via email

@mbeckerle
Copy link
Contributor Author

Hi Mike, Are you free at all this week? My apologies... We're in the middle of putting an offer on a house and my life is very hectic at the moment. Best, -- C

Lots of availability. I'll send you separate email.

*/
public class DaffodilMessageParser {

private static final Logger logger = LoggerFactory.getLogger(DaffodilMessageParser.class);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This logger needs to be Drill's logger for log messages originating in the drill bit.

rootName);
}
try {
dmp.loadSchema(schemaFileURI);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This right here is depending on the schemaFileURI being class path relative so that the schema, and things referenced from within the schema, are found on the classpath. Daffodil makes extensive use of Java's Service Provider Interface (SPI) to dynamically load its own flavor of UDFs, custom charsets, and "layer" definitions, all of which are compiled java/scala code in jars and those jars can have other dependent jars, and so on.

So somehow the drill user has to be able to specify a path of multiple schema jar files, in order (the order can matter), and that has to end up on the class path of the drill bits so that this loadSchema call can work.

private List<Diagnostic> compileSchema(URI schemaFileURI, String rootName, String rootNS)
throws URISyntaxException, IOException, CompileFailure {
Compiler c = Daffodil.compiler();
ProcessorFactory pf = c.compileSource(schemaFileURI, rootName, rootNS);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Schema source files can be spread over multiple jars (sometimes dozens of files), and they contain relative paths within them that must resolve relative to that same classpath.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: This compilation really wants to happen once on a central node and the resulting binary schema file wants to be cached somewhere that gets copied out to the drill bit locations.

Right now this either loads a binary DFDL schema, which needs to happen on the drill bits, or it needs to compile a DFDL schema source (which needs to happen centrally) to create the binary DFDL schema, which then gets loaded by the drill bits.

This has to get refactored because the drill bits should never be compiling a DFDL schema. They should only be loading up pre-compiled binary schemas.

mbeckerle added 3 commits July 5, 2024 11:27
Requires Daffodil version 3.7.0 or higher.

New format-daffodil module created

Still uses absolute paths for the schemaFileURI.
(which is cheating. Wouldn't work in a true distributed
drill environment.)

We have yet to work out how to enable Drill to provide
access for DFDL schemas in XML form with include/import
to be resolved.

The input data stream is, however, being accessed in the
proper Drill manner. Gunzip happened automatically. Nice.

Note: Fix boxed Boolean vs. boolean problem. Don't use
boxed primitives in Format config objects.

Test show this works for data as complex as having
nested repeating sub-records.

These DFDL types are supported:

- int
- long
- short
- byte
- boolean
- double
- float (does not work. Bug DAFFODIL-2367)
- hexBinary
- string

apache#2835
@mbeckerle mbeckerle force-pushed the daffodil-2835-squashed branch from bf6d52c to 8fed39a Compare July 5, 2024 15:59
@Category(RowSetTest.class)
public class TestDaffodilReader extends ClusterTest {

String schemaURIRoot = "file:///opt/drill/contrib/format-daffodil/src/test/resources/";
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What, exactly, do I change this to, if I want to retrieve files from $DRILL_CONFIG_DIR/lib ?

}

private String selectRow(String schema, String file) {
return "SELECT * FROM table(dfs.`data/" + file + "` " + " (type => 'daffodil'," + " " +
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How exactly would this change in order to retrieve things from $DAFFODIL_CONFIG_DIR/lib ?

public void testSimpleQuery1() throws Exception {

QueryBuilder qb = client.queryBuilder();
QueryBuilder query = qb.sql(selectRow("simple", "data01Int.dat.gz"));
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If queries are to retrieve files from $DAFFODIL_CONFIG_DIR/lib, how do tests arrange for that to contain test schemas, and test data? Is it as simple as just loading a resource from the classpath?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants