CLI tools #22

Open
Jefffrey opened this issue Mar 9, 2024 · 4 comments
Labels: enhancement (New feature or request), low (Low priority)

Comments

@Jefffrey
Collaborator

Jefffrey commented Mar 9, 2024

Implement some CLI binaries for working with ORC files such as reading schema, getting stats, etc.

Tools to have:

  • View footer metadata
  • View stripe metadata (with the ability to filter to specific stripes, since a file can have many)
  • View statistics
  • View indexes, bloom filters, etc.

We also need to ensure these are tested as part of CI.

Some references:

@Jefffrey Jefffrey self-assigned this Mar 25, 2024
@Jefffrey Jefffrey added enhancement New feature or request low Low priority labels Apr 1, 2024
@klangner
Contributor

Is this issue taken?
I think it would be nice to have some examples of how to use this library (or the CLI tools), since right now it is not obvious how to use it. Maybe we could create a list of tools/examples here (or as separate issues?) so people could work on them.
For my use case, I'm interested in tools to:

  • Show metadata and schema
  • Show top N records
  • Show last N records

I would also be interested in reading files from S3, but that is probably out of scope for this project?

I'm also willing to help with those implementations.
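
To make the "top N" and "last N" requests above concrete, here is a minimal sketch in Rust. It assumes only that records arrive through some iterator; the function names `head` and `tail` are hypothetical and are not part of this library's API, and nothing here reads an actual ORC file.

```rust
use std::collections::VecDeque;

/// Return the first `n` records from any record iterator.
/// (Hypothetical helper for illustration, not part of the library.)
pub fn head<T>(records: impl IntoIterator<Item = T>, n: usize) -> Vec<T> {
    // `take` stops pulling from the source once `n` items were produced.
    records.into_iter().take(n).collect()
}

/// Return the last `n` records while holding at most `n` of them in memory.
/// (Hypothetical helper for illustration, not part of the library.)
pub fn tail<T>(records: impl IntoIterator<Item = T>, n: usize) -> Vec<T> {
    if n == 0 {
        return Vec::new();
    }
    // Sliding window over the stream: drop the oldest record
    // whenever the window is already full.
    let mut window: VecDeque<T> = VecDeque::with_capacity(n);
    for record in records {
        if window.len() == n {
            window.pop_front();
        }
        window.push_back(record);
    }
    window.into_iter().collect()
}
```

For example, `head(1..=10, 3)` yields the first three items and `tail(1..=10, 3)` yields the last three. This `tail` still scans the whole stream; a real tool could probably avoid that by using per-stripe row counts from the file footer to skip ahead, but that is not shown here.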

@waynexia
Collaborator

A similar tool is parquet-tools; I have used it several times when debugging Parquet files.

@klangner
Contributor

Yes, it looks nice. I can create this CLI tool (I will probably start with a simpler version) if it fits this project and nobody is working on it yet.

@Jefffrey
Collaborator Author

Hey @klangner, I assigned it to myself initially because I did an initial commit as mentioned in the issue, but it isn't one of my priorities right now. Feel free to enhance the existing tool or add a separate one if you have a different use case 👍

WenyXu referenced this issue in datafusion-contrib/datafusion-orc Apr 16, 2024
* #62 cli tool for printing file stats

* Update Cargo.toml

Co-authored-by: Weny Xu <wenymedia@gmail.com>

* fixed formatting

---------

Co-authored-by: Weny Xu <wenymedia@gmail.com>
Jefffrey referenced this issue in datafusion-contrib/datafusion-orc Apr 19, 2024
* #62 Added cli tool to export data in a csv format

* CR fixes
klangner referenced this issue in klangner/datafusion-orc Apr 21, 2024
Jefffrey referenced this issue in datafusion-contrib/datafusion-orc Apr 22, 2024
* #62 added filtering by rows and columns

* CR fixes
waynexia referenced this issue in datafusion-contrib/datafusion-orc Oct 24, 2024
* #62 cli tool for printing file stats

* Update Cargo.toml

Co-authored-by: Weny Xu <wenymedia@gmail.com>

* fixed formatting

---------

Co-authored-by: Weny Xu <wenymedia@gmail.com>
waynexia referenced this issue in datafusion-contrib/datafusion-orc Oct 24, 2024
* #62 Added cli tool to export data in a csv format

* CR fixes
waynexia referenced this issue in datafusion-contrib/datafusion-orc Oct 24, 2024
* #62 added filtering by rows and columns

* CR fixes
@waynexia waynexia transferred this issue from datafusion-contrib/datafusion-orc Oct 30, 2024
3 participants