-
Notifications
You must be signed in to change notification settings - Fork 981
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DRILL-8474: Adding Daffodil to Drill as a contrib. #2909
base: master
Are you sure you want to change the base?
Conversation
<build> | ||
<plugins> | ||
<plugin> | ||
<artifactId>maven-resources-plugin</artifactId> | ||
<executions> | ||
<execution> | ||
<id>copy-java-sources</id> | ||
<phase>process-sources</phase> | ||
<goals> | ||
<goal>copy-resources</goal> | ||
</goals> | ||
<configuration> | ||
<outputDirectory>${basedir}/target/classes/org/apache/drill/exec/store/daffodil | ||
</outputDirectory> | ||
<resources> | ||
<resource> | ||
<directory>src/main/java/org/apache/drill/exec/store/daffodil</directory> | ||
<filtering>true</filtering> | ||
</resource> | ||
</resources> | ||
</configuration> | ||
</execution> | ||
</executions> | ||
</plugin> | ||
</plugins> | ||
</build> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does this section do? I noticed that the msaccess contrib doesn't have this section in its pom.xml.
/home/runner/work/drill/drill/exec/vector/src/main/java/org/apache/drill/exec/record/metadata/MapBuilder.java:201:5:
|
exec/vector/src/main/java/org/apache/drill/exec/record/metadata/MapBuilder.java
Outdated
Show resolved
Hide resolved
Tests are now failing due to these two things in TestDaffodilReader.scala
That's an absolute URI that is used to obtain access to the schema files in this statement:
This is assembling a select statement, and puts this absolute schemaURI into the schemaURI part of the select. What should I be doing to arrange for these schema URIs to be found. The schemas are a large complex set of files, not just a single file. Many files must be found relative to the initial root schema file. (Hundreds of files potentially). As they include/import other schema files using relative paths. |
Hi Mike,
Are you free at all this week? My apologies... We're in the middle of putting an offer on a house and my life is very hectic at the moment.
Best,
-- C
… On Apr 28, 2024, at 10:11 PM, Mike Beckerle ***@***.***> wrote:
Tests are now failing due to these two things in TestDaffodilReader.scala
String schemaURIRoot = "file:///opt/drill/contrib/format-daffodil/src/test/resources/";
That's an absolute URI that is used to obtain access to the schema files in this statement:
private String selectRow(String schema, String file) {
return "SELECT * FROM table(dfs.`data/" + file + "` " + " (type => 'daffodil'," + " " +
"validationMode => 'true', " + " schemaURI => '" + schemaURIRoot + "schema/" + schema +
".dfdl.xsd'," + " rootName => 'row'," + " rootNamespace => null " + "))";
}
This is assembling a select statement, and puts this absolute schemaURI into the schemaURI part of the select.
What should I be doing to arrange for these schema URIs to be found.
The schemas are a large complex set of files, not just a single file. Many files must be found relative to the initial root schema file. (Hundreds of files potentially). As they include/import other schema files using relative paths.
—
Reply to this email directly, view it on GitHub <#2909 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ABKB7PT327D7FTY34D7Z6ULY7WT5DAVCNFSM6AAAAABG4LUKI6VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDAOBRG44DCNJUGY>.
You are receiving this because you commented.
|
Lots of availability. I'll send you separate email. |
*/ | ||
public class DaffodilMessageParser { | ||
|
||
private static final Logger logger = LoggerFactory.getLogger(DaffodilMessageParser.class); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This logger needs to be Drill's logger for log messages originating in the drill bit.
rootName); | ||
} | ||
try { | ||
dmp.loadSchema(schemaFileURI); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This right here is depending on the schemaFileURI being class path relative so that the schema, and things referenced from within the schema, are found on the classpath. Daffodil makes extensive use of Java's Service Provider Interface (SPI) to dynamically load its own flavor of UDFs, custom charsets, and "layer" definitions, all of which are compiled java/scala code in jars and those jars can have other dependent jars, and so on.
So somehow the drill user has to be able to specify a path of multiple schema jar files, in order (the order can matter), and that has to end up on the class path of the drill bits so that this loadSchema call can work.
private List<Diagnostic> compileSchema(URI schemaFileURI, String rootName, String rootNS) | ||
throws URISyntaxException, IOException, CompileFailure { | ||
Compiler c = Daffodil.compiler(); | ||
ProcessorFactory pf = c.compileSource(schemaFileURI, rootName, rootNS); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Schema source files can be spread over multiple jars (sometimes dozens of files), and they contain relative paths within them that must resolve relative to that same classpath.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note: This compilation really wants to happen once on a central node and the resulting binary schema file wants to be cached somewhere that gets copied out to the drill bit locations.
Right now this either loads a binary DFDL schema, which needs to happen on the drill bits, or it needs to compile a DFDL schema source (which needs to happen centrally) to create the binary DFDL schema, which then gets loaded by the drill bits.
This has to get refactored because the drill bits should never be compiling a DFDL schema. They should only be loading up pre-compiled binary schemas.
Requires Daffodil version 3.7.0 or higher. New format-daffodil module created Still uses absolute paths for the schemaFileURI. (which is cheating. Wouldn't work in a true distributed drill environment.) We have yet to work out how to enable Drill to provide access for DFDL schemas in XML form with include/import to be resolved. The input data stream is, however, being accessed in the proper Drill manner. Gunzip happened automatically. Nice. Note: Fix boxed Boolean vs. boolean problem. Don't use boxed primitives in Format config objects. Test show this works for data as complex as having nested repeating sub-records. These DFDL types are supported: - int - long - short - byte - boolean - double - float (does not work. Bug DAFFODIL-2367) - hexBinary - string apache#2835
bf6d52c
to
8fed39a
Compare
@Category(RowSetTest.class) | ||
public class TestDaffodilReader extends ClusterTest { | ||
|
||
String schemaURIRoot = "file:///opt/drill/contrib/format-daffodil/src/test/resources/"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What, exactly, do I change this to, if I want to retrieve files from $DRILL_CONFIG_DIR/lib ?
} | ||
|
||
private String selectRow(String schema, String file) { | ||
return "SELECT * FROM table(dfs.`data/" + file + "` " + " (type => 'daffodil'," + " " + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How exactly would this change in order to retrieve things from $DAFFODIL_CONFIG_DIR/lib ?
public void testSimpleQuery1() throws Exception { | ||
|
||
QueryBuilder qb = client.queryBuilder(); | ||
QueryBuilder query = qb.sql(selectRow("simple", "data01Int.dat.gz")); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If queries are to retrieve files from $DAFFODIL_CONFIG_DIR/lib, how do tests arrange for that to contain test schemas, and test data? Is it as simple as just loading a resource from the classpath?
DRILL-8474: Adding Daffodil to Drill as a contrib.
This PR replaces: #2836 which is closed. That was to retain history/comments while squashing numerous debug-related commits together into this PR.
Description
Requires Daffodil version 3.7.0 or higher.
New format-daffodil module created
Still uses absolute paths for the schemaFileURI.
(which is cheating. Wouldn't work in a true distributed drill environment.)
We have yet to work out how to enable Drill to provide access for DFDL schemas in XML form with include/import to be resolved.
The input data stream is, however, being accessed in the proper Drill manner. Gunzip happened automatically. Nice.
Note: Fix boxed Boolean vs. boolean problem. Don't use boxed primitives in Format config objects.
Test show this works for data as complex as having nested repeating sub-records.
These DFDL types are supported:
Documentation
TBD: feature is incomplete still. It will require substantial documentation for users.
Testing
See tests under src/test in the new daffodil contrib module.