-
Notifications
You must be signed in to change notification settings - Fork 22
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Sniffer #35
Sniffer #35
Conversation
@metasoarous Finally had some time to fix this up. I squashed the commits. There is now a working function called |
Awesome! Thanks. I'll try to take a look at it soon. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mahinshaw This looks great! Thanks so much for your work on this, and sorry for taking so long to get around to reviewing it. I think there are a few things I'd like to think about a bit more in depth here, but I'd like to merge this soon. Please chime in on the line notes when you get a chance.
Thanks again!
src/semantic_csv/core.clj
Outdated
;; ...}} | ||
;; The cast class defines the class of values (i.e. numeric values or date/time values). | ||
;; The cast type defines the semantic type within that class (i.e. integer, decimal, rational) | ||
(def cast-rules |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mahinshaw I'd suggest default-cast-rules
for this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree
src/semantic_csv/core.clj
Outdated
* `:cast-rules-map` - Map defining rules for sniffing and casting. | ||
* `:cast-with-options` - Options to be passed to `cast-with`." | ||
([rows] (sniff-data {} rows)) | ||
([{:keys [do-cast rows-to-sniff cast-rules-map cast-with-opts] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few thoughts/questions:
- I see what you're doing with
do-cast
here, making it possible to get back just the abstract result of the cast, as apposed to the casted data itself. However, the reason you would do one versus the other seems to be so different, that I wonder if thedo-cast
behavior shouldn't just be a separate function? Perhaps something likeautocast
?sniff-cast
sounds a little silly, but could also be an option. - Should
rows-to-sniff
be somethingn-rows-to-sniff
or even justn-rows
? Not convinced it's necessary, as short is somewhat nice, but a) we should probably stick to convention if it exists, and b) since I'm not sure precedent exists (skimming the code), we should make sure of the pattern we want to set for the future, and I welcome your thoughts on this. - Again,
cast-rules-map
seems like maybe it's something better suited to our secondautocast
(or whatever) function, if we end up doing that. Either way, maybe we just call itcast-rules
? I think that fits better with established pattern wrtcast-fns
(see Would it be too warty to have :cast-with be the process option instead of :cast-fns? #28).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I plead guilty, for all silly naming here. Moreover, it's been a few years, so I had to sit down and read some of this code to refresh myself on what it all does.
I'll reply in order of the above bullets:
do-cast
is a silly name. another optionsapply?
. If memory serves me, I wanted a way for the user to not have to dig into impl to get out values fromsniff-test
. It's handy in the repl. I'm not sure breaking it out into another function adds much value, either. If we did do that, it would manifest as what amounts to an alias tosniff-test
. However, I'm totally game to do that and remove the option. And the more that I think about it, it's 6 in one hand, half dozen in the other. If we break them out into the applied version, I likeautocast
. Thesniff-test
alias could just besniff
.- for
rows-to-sniff
- I liken-rows
for brevity and reusability. :%s/cast-rules-map/cast-rules/g
- agreed.
@@ -52,6 +52,22 @@ | |||
[["this" "is data"]]))))) | |||
|
|||
|
|||
(deftest sniff-data-test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mahinshaw Tests... 🤤 Thank you kindly sir!
So it's been a while since I wrote this. Given a cursory look, I think my intentions were good. I don't know how performant it is, but it's flexible, so that's nice. I wonder though if there is a way we can break this up and leverage it with transducers for pipelining common files. For instance, the user sniffs the first file and reuses the rules to process many files. Basically, this would require some api changes (which go along with what you brought up with Also, thanks for looking at it. I don't get to do clojure much anymore (java pays the bills these days), and it's always fun when I get to match parens. |
rebase on master. |
Hi @mahinshaw. Is there a reason you decided to close this PR? I realize it's been a couple of years without me having gotten around to it (for which I apologize), but I was hoping circle back around on this after working through some big features coming up for Oz. Thanks |
@metasoarous feel free to reopen if your still interested. I'll try and as assist as time allows. |
This is still a Work in progress. I have only had a few lunches to get some things down, but it has been a little while, so I thought I would shoot this your way and let you have a look. I moved away from regexes, because casting is just easier. I also reduced the input data, which i called the
sniff-map
(naming suggestions appreciated), to just contain the overall data type as the key (i.e.:numeric
), and a map containing the:hierarchy
, and all the subtypes (i.e.:integer
). The value for:hierarchy
being a vector, and each subtype has a casting function. This seemed like a simple and intuitive way to order this, but I am happy to alter it if you have suggestions. Also, the return value forsniff-value
is a vector that acts as a path into thesniff-map
(i.e.[:numeric :integer]
). This seemed like an intuitive way to get back to the casting function when the time comes.Some things I want to do:
Overall, I am still not totally satisfied, but I feel like it is going in the right direction.