Matthew J Collins edited this page Feb 13, 2015 · 13 revisions

Notes on design

Proposed method signatures

function idig_view(UUID=FALSE, type="record")
function idig_search(rq="", query=FALSE, sort=FALSE, limit=0, offset=0, fields=c(..), max_items=100000)
function idig_media(mq="", rq="", query=FALSE, sort=FALSE, limit=0, offset=0, fields=c(..), max_items)
function idig_toprecords(rq="", fields=c("scientificname"), count=10)
function idig_topmedia(mq="", rq="", fields=c("scientificname"), count=10)
function idig_count(rq="")
function idig_metafields()

Function names

GET/POST

Just drop GET entirely.

Query Objects

The lightest option would be to skip the object for now and go with named params on the idigSearch method, which would also require no user code changes in the future:

results <- idigSearch(rq='json string')
# if people want lists now
# query <- list(family=list("asteraceae","fagaceae"))
# results <- idigSearch(rq=jsonlite::toJSON(query))
# later let people write this
# results <- idigSearch(query=idigQuery(stuff))

Later, we can add idigSearch(query=...) to take a query object without having people change their existing code.
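
As a rough sketch of how the two calling styles could share one code path, the helper below normalises either a raw JSON string or a nested list into the single "rq" payload. The name build_rq is hypothetical, the query argument here is a plain list rather than a real query object, nothing is sent to the API, and it assumes the jsonlite package is installed.

```r
library(jsonlite)

# Hypothetical helper: normalise either calling style to one JSON string.
# It does not contact the API; it only returns the "rq" payload.
build_rq <- function(rq = "", query = FALSE) {
  if (!identical(query, FALSE)) {
    # auto_unbox = TRUE keeps length-one values as JSON scalars, so
    # list("asteraceae", "fagaceae") becomes a flat JSON array.
    return(as.character(toJSON(query, auto_unbox = TRUE)))
  }
  rq
}

rq_from_string <- build_rq(rq = '{"family":"asteraceae"}')
rq_from_list   <- build_rq(query = list(family = list("asteraceae", "fagaceae")))
```

Without auto_unbox = TRUE, jsonlite wraps every length-one value in its own array, which is one of the ways nested lists can surprise people.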

I wrote the below before changing my mind: I think I want to do a query object. Nested lists are just going to be trouble. The object can have a "fromJson" method that takes JSON text, for those who are chaining APIs or who just want to write JSON. Otherwise, a 1:1 match with https://github.com/iDigBio/idigbio-search-api/blob/master/app/lib/query-shim.js is probably best. We should also update the iDigBio query format documentation to include the query shim methods, so that there is a 1:1:1 correspondence between a query format snippet, a query shim method, and an R query object method.

Results returned

  • Rows named "1", "2", etc
  • Place UUID in row as a column, always present, user can't turn off
  • Allow users to specify column names and pull prefixed dwc:country from the data list and unprefixed country from the indexTerms list. Why? I imagine at some point indexTerms will be cleaned data and data will be original. This will let people choose. Also, we don't have to modify the API returns now with fancy logic to drop some but not all namespace prefixes in the data list. More work on my side to concat indexTerms and data terms though. I assume indexed terms will continue to be un-namespaced.
  • Alter API to take parameter "fields" which is just a flat list of terms either from indexTerms list or data list and return only those fields.
  • No "fields" parameter means return some default set, proposed: occurrence ID, institution code, collection code, catalog number, genus, species, scientific name, date collected, lat, long -> Enough for most people, no raw data, only cleaned, and skips the verbatim and text fields, which are chunky. (Only slightly afraid of people thinking this means that this is all iDigBio has in it...)
  • Support fields=all to return everything known.
  • Try to type & factor (if appropriate) fields that are from indexTerms according to what they are in ES. Data types may not match up but take a look. Never type fields from data. Always force user to type them.
  • If "fields" is given, probably have enough information to try pre-allocating data frame. Try out some options here to see if that is a useful performance improvement AFTER everything else works and someone says it's slow :)
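
The list above can be sketched roughly as follows. The record shape (a uuid plus an indexTerms list and a data list), the sample values, and the names get_field and records_to_df are all assumptions made for illustration. The sketch only covers the always-present uuid column and the rule that prefixed names come from data while unprefixed names come from indexTerms. It returns every field as character and leaves typing for later.

```r
# Hypothetical records, shaped as the notes describe: an un-namespaced
# indexTerms list and a namespaced data list per record.
records <- list(
  list(uuid = "a1",
       indexTerms = list(country = "united states", genus = "quercus"),
       data = list("dwc:country" = "USA", "dwc:genus" = "Quercus")),
  list(uuid = "b2",
       indexTerms = list(country = "canada"),
       data = list("dwc:country" = "Canada", "dwc:genus" = "Acer"))
)

# Look up one field: prefixed names come from data, others from indexTerms.
# Missing fields become NA so every column has one value per record.
get_field <- function(rec, field) {
  src <- if (grepl(":", field, fixed = TRUE)) rec$data else rec$indexTerms
  val <- src[[field]]
  if (is.null(val)) NA_character_ else as.character(val)
}

records_to_df <- function(records, fields = c("country", "genus")) {
  # uuid is always the first column, whatever the caller asks for.
  cols <- list(uuid = vapply(records, function(r) r$uuid, character(1)))
  for (f in setdiff(fields, "uuid")) {
    cols[[f]] <- vapply(records, get_field, character(1), field = f)
  }
  # check.names = FALSE keeps "dwc:country" intact as a column name.
  data.frame(cols, stringsAsFactors = FALSE, check.names = FALSE)
}

df <- records_to_df(records, fields = c("country", "dwc:country", "genus"))
```

Building each column as a whole vector first and calling data.frame once sidesteps row-by-row growth, which may make explicit pre-allocation unnecessary in practice.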

Attributions

Objects can have attributes, cool. We'll roll with Francois' suggestion about setting attribution as an attribute of the returned data frame and having a formatter function return it given a result data frame.
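
A minimal sketch of that idea using base R's attr(). The attribution payload (a name and an itemCount per record set) and the function name format_attribution are hypothetical, since these notes do not specify what the API returns.

```r
# Hypothetical result frame and attribution payload; the real structure of
# the API's attribution block is not specified in these notes.
result <- data.frame(uuid = c("a1", "b2"), stringsAsFactors = FALSE)
attr(result, "attribution") <- list(
  list(name = "Example Herbarium", itemCount = 2L)
)

# Formatter: turn the stored attribute into printable lines, one per source.
format_attribution <- function(df) {
  att <- attr(df, "attribution")
  if (is.null(att)) return(character(0))
  vapply(att,
         function(a) sprintf("%s (%d records)", a$name, a$itemCount),
         character(1))
}

credit <- format_attribution(result)
```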

Mapping

Haven't thought about it.

Bunch of other metadata resources

The other APIs provide access to some other resources: taxa, organizations, field names, etc. We should probably do the same; the decision to be made is whether that's a bare list of everything or something searchable. Taxa is probably the first one to consider, and I assume that can be done off ES and added to the search API easily.

Tests

Look @ Francois's stuff for JSON definitions between Python and R