Proposal: Messytables 2 #142

Open
pudo opened this issue Aug 24, 2015 · 4 comments

Comments

@pudo
Contributor

pudo commented Aug 24, 2015

messytables just turned 4 years old, and I'm getting the sense that it could use a major overhaul to make sure it doesn't turn into a messy thing itself. While @pwalsh proposed starting a clean library (frictionlessdata/tabulator-py#1), I think we should instead do a breaking update. This should incorporate some lessons learned:

  • There are many file types, so we should be modular and support external packages (e.g. messytables-pdf, refs Externalize PDF support? #137).
  • A common system of descriptors for columns (i.e. JSON Table Schema, JTS) is useful. It should be the output of the type guesser.
  • We don't really need wrapper types for tables; a table should just be an iterator over a set of tuples.
  • Type conversion should be its own library, there's plenty of functionality there.
  • Have excellent support for Python 3 and see if we can't cut out the odd dependency (e.g. xlrd and openpyxl -- do we need both? BeautifulSoup and LXML, really?).
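
To make the "iterator of tuples" and "type guesser outputs a descriptor" points concrete, here is a minimal sketch using only the standard library. The names `iter_rows`, `guess_type` and `guess_schema` are hypothetical and are not part of messytables; the descriptor shape (`fields` with `name` and `type`) follows JSON Table Schema, and the guessing logic is deliberately naive.

```python
import csv
import io


def iter_rows(text_stream):
    """Yield each CSV record as a plain tuple of strings (no wrapper types)."""
    for record in csv.reader(text_stream):
        yield tuple(record)


def guess_type(values):
    """Return the narrowest JTS-style type name that fits every non-empty value."""
    non_empty = [v for v in values if v.strip()]
    if not non_empty:
        return "string"
    # Try the strictest caster first; fall back to "string" if none fits.
    for name, caster in (("integer", int), ("number", float)):
        try:
            for value in non_empty:
                caster(value)
        except ValueError:
            continue
        return name
    return "string"


def guess_schema(rows):
    """Treat the first row as the header and guess one type per column."""
    rows = iter(rows)
    header = next(rows)
    columns = [[] for _ in header]
    for row in rows:
        for column, value in zip(columns, row):
            column.append(value)
    return {
        "fields": [
            {"name": name, "type": guess_type(column)}
            for name, column in zip(header, columns)
        ]
    }


sample = "id,name,price\n1,apple,0.5\n2,pear,1\n"
schema = guess_schema(iter_rows(io.StringIO(sample)))
```

A real guesser would sample rows rather than read the whole table, and would cover dates, booleans and locale-specific numbers; the point here is only that the table is a plain iterator and the guess is a plain descriptor.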
@pwalsh
Member

pwalsh commented Aug 24, 2015

As we discussed over IRC, I'm in agreement, as long as we are very careful to have a clean API for "Data Table Iteration" that is not mixed with any other magic that messytables may or may not do.

So, copying my notes from here, I'd like to see:

  • Works on Python 2/3
  • Whatever the input (any encoding, any supported format), the output will be a csv-like iterable of data encoded as utf-8 text strings
  • Input formats:
    • CSV
    • Excel
    • JSON (using ijson)
    • ODS
    • Google Spreadsheet?
  • Want to work with text data as text streams - following Python 3 API preferences
  • Want to build libs (data processors) that handle such data around a common format: csv-like iterable (arrays of utf-8 encoded text strings) is good for this - this is how GoodTables works internally
    • Probably an option to get rows as arrays or as dicts
  • Takes an argument to cast values on iteration, based on both JSON Table Schema and JSON Schema
    • Obviously some input formats, like JSON, may already be typed. In this case "casting" might do something like raise a MismatchedTypeError
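
The last two points (rows as arrays or dicts, and casting on iteration) could look roughly like the sketch below. Everything here is an assumption for illustration: `iter_cast`, `cast_value` and the `as_dicts` flag are hypothetical names, only three JSON Table Schema types are handled, and `MismatchedTypeError` is simply the name suggested in the list above, not an existing exception.

```python
class MismatchedTypeError(TypeError):
    """An already-typed value disagrees with the schema (hypothetical name)."""


# Casters for text input, keyed by JSON Table Schema type name.
CASTERS = {"string": str, "integer": int, "number": float}
# Python types accepted unchanged when the input is already typed (e.g. JSON).
NATIVE = {"string": str, "integer": int, "number": (int, float)}


def cast_value(value, type_name):
    """Cast text to the schema type; verify values that are already typed."""
    if isinstance(value, str):
        return CASTERS[type_name](value)
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(value, bool) or not isinstance(value, NATIVE[type_name]):
        raise MismatchedTypeError(
            "%r is not of schema type %r" % (value, type_name)
        )
    return value


def iter_cast(rows, schema, as_dicts=False):
    """Yield rows cast according to schema, as lists or as dicts."""
    names = [field["name"] for field in schema["fields"]]
    types = [field["type"] for field in schema["fields"]]
    for row in rows:
        values = [cast_value(v, t) for v, t in zip(row, types)]
        yield (dict(zip(names, values)) if as_dicts else values)


schema = {
    "fields": [
        {"name": "id", "type": "integer"},
        {"name": "name", "type": "string"},
        {"name": "price", "type": "number"},
    ]
}
text_rows = [["1", "apple", "0.5"], ["2", "pear", "1"]]
as_lists = list(iter_cast(text_rows, schema))
as_dicts = list(iter_cast(text_rows, schema, as_dicts=True))
```

Because casting happens lazily inside the generator, a bad value only raises when iteration reaches that row, which keeps the plain "iterate text rows" path free of any casting cost.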

@pudo
Contributor Author

pudo commented Aug 25, 2015

I've started working on a general clean-up branch at https://github.com/okfn/messytables/tree/cleanup-mt2. Since we'll break compatibility on some of the API anyway, it seems like a valid idea to remove some left-overs.

I also want to adopt an approach where we rely on external libraries (like six) rather than build our own.

@turicas

turicas commented Aug 26, 2015

Hello, guys. I've started working on a library that implements these requirements (but the focus is a bit different): turicas/rows. I'm now working on a complete API rewrite so it'll be very simple and easy to use yet powerful (like automatically identifying field types and converting them). We may share some work among the two libraries. ;-)

@jqnatividad

+1. CKAN Datapusher uses messytables and it unfortunately periodically produces literal messytables in the datastore (pardon the pun).

Having the ability to cast based on JTS/JSON Schema would be nice. Perhaps this is a more pragmatic way to implement ckan/ideas#150 and address the Datapusher issues that CKAN implementations encounter, which are rooted in messytables guessing data types incorrectly.

Maybe on the first pass, the guessed schema can be presented to the CKAN user, leveraging the existing ...as_JTS methods, and the user can optionally override the JTS datatype guesses. The proposed MessyTables 2 JTS-driven casting could then insert the dataset as a proper table with the right datatypes into the CKAN datastore.
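
The override step in that workflow might be sketched as follows. This is only an illustration under stated assumptions: `override_types` is a hypothetical helper, the guessed schema is written out by hand rather than produced by messytables, and nothing here touches the CKAN or Datapusher APIs.

```python
import copy


def override_types(guessed_schema, overrides):
    """Return a copy of a JTS-style schema with user-chosen column types."""
    schema = copy.deepcopy(guessed_schema)
    known = {field["name"] for field in schema["fields"]}
    unknown = set(overrides) - known
    if unknown:
        raise KeyError("no such column(s): %s" % ", ".join(sorted(unknown)))
    for field in schema["fields"]:
        if field["name"] in overrides:
            field["type"] = overrides[field["name"]]
    return schema


# A guess that looks plausible but is wrong: postal codes are not integers,
# since casting them would drop leading zeros.
guessed = {
    "fields": [
        {"name": "zip", "type": "integer"},
        {"name": "city", "type": "string"},
    ]
}
final = override_types(guessed, {"zip": "string"})
```

Returning a copy keeps the original guess available, so a UI could show both the guessed and the overridden type side by side before the data is cast and inserted.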
