Sniffer #35

Closed
wants to merge 1 commit into from

mahinshaw
Contributor

This is still a work in progress. I have only had a few lunches to get some things down, but it has been a little while, so I thought I would shoot this your way and let you have a look. I moved away from regexes, because casting is just easier. I also reduced the input data, which I called the sniff-map (naming suggestions appreciated), to just contain the overall data type as the key (e.g. :numeric), and a map containing the :hierarchy and all the subtypes (e.g. :integer). The value for :hierarchy is a vector, and each subtype has a casting function. This seemed like a simple and intuitive way to order this, but I am happy to alter it if you have suggestions. Also, the return value for sniff-value is a vector that acts as a path into the sniff-map (e.g. [:numeric :integer]). This seemed like an intuitive way to get back to the casting function when the time comes.
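
To make the shape described above concrete, here is a minimal sketch of a sniff-map and a sniff-value that returns a path into it. It is an illustration only and not the code in this PR: the map contents, the `castable?` helper, and the two-argument signature of `sniff-value` are all assumptions.

```clojure
(require '[clojure.string :as str])

;; Hypothetical sniff-map: the cast class is the key, and its value holds
;; the :hierarchy vector plus one casting function per subtype.
(def sniff-map
  {:numeric {:hierarchy [:integer :decimal]
             :integer   #(Long/parseLong (str/trim %))
             :decimal   #(Double/parseDouble (str/trim %))}})

(defn castable?
  "True if cast-fn converts s without throwing. Illustrative helper."
  [cast-fn s]
  (try
    (cast-fn s)
    true
    (catch Exception _ false)))

(defn sniff-value
  "Returns a path such as [:numeric :integer] into sniff-map for the first
  subtype (in :hierarchy order) whose casting function accepts s, or nil."
  [sniff-map s]
  (first
    (for [[cast-class {:keys [hierarchy] :as subtypes}] sniff-map
          cast-type hierarchy
          :when (castable? (get subtypes cast-type) s)]
      [cast-class cast-type])))

(sniff-value sniff-map "42")   ;; => [:numeric :integer]
(sniff-value sniff-map "4.2")  ;; => [:numeric :decimal]
(sniff-value sniff-map "abc")  ;; => nil

;; The returned path leads straight back to the casting function.
((get-in sniff-map (sniff-value sniff-map "42")) "42") ;; => 42
```

Because the result is a plain path, `get-in` is all that is needed to recover the casting function later.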

Some things I want to do:

  • Add a function that deals with promoting/demoting types. For example, a column may have a mix of integers and decimals, and at the end we want to make sure that the output is a decimal. Also, there are cases where a string may be in a column that also has what could be numerics, and we need to signal that it should remain a string. There are probably a number of ways to ensure this, but I haven't decided on one yet.
  • Add some tests
  • Add a function that applies the cast to the csv data.
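
One possible shape for the promotion step in the first bullet, sketched under assumptions: `promote` is a hypothetical name, the hierarchy is ordered narrowest to widest, every sniffed type appears in the hierarchy, and a value that could not be sniffed is represented as nil.

```clojure
;; Assumed ordering: narrowest type first, widest type last.
(def hierarchy [:integer :decimal])

(defn promote
  "Given the cast types sniffed for each value in a column, returns the
  widest type according to hierarchy. Falls back to :string when the
  column is empty or any value could not be sniffed (nil)."
  [hierarchy sniffed-types]
  (if (or (empty? sniffed-types) (some nil? sniffed-types))
    :string
    (let [rank (zipmap hierarchy (range))]
      (apply max-key rank sniffed-types))))

(promote hierarchy [:integer :decimal :integer]) ;; => :decimal
(promote hierarchy [:integer :integer])          ;; => :integer
(promote hierarchy [:integer nil :decimal])      ;; => :string
```

Treating any unsniffable value as a signal to keep the whole column as a string is only one of the options mentioned above; a tolerance threshold would be another.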

Overall, I am still not totally satisfied, but I feel like it is going in the right direction.

@mahinshaw
Contributor Author

@metasoarous Finally had some time to fix this up. I squashed the commits. There is now a working function called sniff-data in core. Let me know what you think.

@metasoarous
Owner

Awesome! Thanks. I'll try to take a look at it soon.

Owner

@metasoarous left a comment

@mahinshaw This looks great! Thanks so much for your work on this, and sorry for taking so long to get around to reviewing it. I think there are a few things I'd like to think about a bit more in depth here, but I'd like to merge this soon. Please chime in on the line notes when you get a chance.

Thanks again!

;; ...}}
;; The cast class defines the class of values (i.e. numeric values or date/time values).
;; The cast type defines the semantic type within that class (i.e. integer, decimal, rational)
(def cast-rules
Owner

@mahinshaw I'd suggest default-cast-rules for this.

Contributor Author

Agree

* `:cast-rules-map` - Map defining rules for sniffing and casting.
* `:cast-with-options` - Options to be passed to `cast-with`."
([rows] (sniff-data {} rows))
([{:keys [do-cast rows-to-sniff cast-rules-map cast-with-opts]
Owner

A few thoughts/questions:

  • I see what you're doing with do-cast here, making it possible to get back just the abstract result of the cast, as opposed to the cast data itself. However, the reasons you would do one versus the other seem so different that I wonder if the do-cast behavior shouldn't just be a separate function? Perhaps something like autocast? sniff-cast sounds a little silly, but could also be an option.
  • Should rows-to-sniff be something like n-rows-to-sniff or even just n-rows? Not convinced it's necessary, as short is somewhat nice, but a) we should probably stick to convention if it exists, and b) since I'm not sure precedent exists (skimming the code), we should make sure of the pattern we want to set for the future, and I welcome your thoughts on this.
  • Again, cast-rules-map seems like maybe it's something better suited to our second autocast (or whatever) function, if we end up doing that. Either way, maybe we just call it cast-rules? I think that fits better with the established pattern wrt cast-fns (see #28, "Would it be too warty to have :cast-with be the process option instead of :cast-fns?").

Contributor Author

I plead guilty, for all silly naming here. Moreover, it's been a few years, so I had to sit down and read some of this code to refresh myself on what it all does.
I'll reply in order of the above bullets:

  • do-cast is a silly name. Another option is apply?. If memory serves me, I wanted a way for the user to not have to dig into impl to get values out of sniff-test. It's handy in the repl. I'm not sure breaking it out into another function adds much value, either. If we did do that, it would manifest as what amounts to an alias to sniff-test. However, I'm totally game to do that and remove the option. And the more that I think about it, it's six of one, half a dozen of the other. If we break them out into the applied version, I like autocast. The sniff-test alias could just be sniff.
  • for rows-to-sniff - I like n-rows for brevity and reusability.
  • :%s/cast-rules-map/cast-rules/g - agreed.

@@ -52,6 +52,22 @@
[["this" "is data"]])))))


(deftest sniff-data-test
Owner

@mahinshaw Tests... 🤤 Thank you kindly sir!

@mahinshaw
Contributor Author

So it's been a while since I wrote this. Given a cursory look, I think my intentions were good. I don't know how performant it is, but it's flexible, so that's nice.

I wonder though if there is a way we can break this up and leverage it with transducers for pipelining common files. For instance, the user sniffs the first file and reuses the rules to process many files. Basically, this would require some api changes (which go along with what you brought up with autocast) that would allow for "sniffing", and then using the output of sniffing to apply (or autocast given those rules) in a transducer pipeline. Anyways, it's kinda crazy, but could be interesting. Not sure how strong the use case is though. And as I think about it, it starts to get into the land of full on data pipelining/batch processing. Which may be better served somewhere else. But having an api that supports that might be nice for users.
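
A rough sketch of what that split could look like, assuming sniffing yields one casting function per column. The names `cast-rows-xf` and `sniffed-cast-fns` are made up for illustration and are not part of the current API.

```clojure
(defn cast-rows-xf
  "Returns a transducer that applies one casting function per column
  to every row passing through it."
  [cast-fns]
  (map (fn [row] (mapv (fn [f v] (f v)) cast-fns row))))

;; Pretend these came out of sniffing the first file.
(def sniffed-cast-fns
  [#(Long/parseLong %) #(Double/parseDouble %) identity])

;; Already-parsed rows standing in for two files with the same layout.
(def file-a [["1" "2.5" "x"] ["2" "3.5" "y"]])
(def file-b [["3" "4.5" "z"]])

;; Sniff once, then reuse the same transducer for every file.
(def xf (cast-rows-xf sniffed-cast-fns))

(into [] xf file-a) ;; => [[1 2.5 "x"] [2 3.5 "y"]]
(into [] xf file-b) ;; => [[3 4.5 "z"]]
```

Since the result is an ordinary transducer, it composes with `comp` alongside other row-processing steps and works with `into`, `sequence`, or `transduce`.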

Also, thanks for looking at it. I don't get to do clojure much anymore (java pays the bills these days), and it's always fun when I get to match parens.

@mahinshaw
Contributor Author

rebase on master.

@mahinshaw mahinshaw closed this Oct 31, 2021
@metasoarous
Owner

Hi @mahinshaw. Is there a reason you decided to close this PR? I realize it's been a couple of years without me having gotten around to it (for which I apologize), but I was hoping to circle back around on this after working through some big features coming up for Oz.

Thanks

@mahinshaw
Contributor Author

@metasoarous feel free to reopen if you're still interested. I'll try to assist as time allows.
