jonny,
@jonny@neuromatch.social

having uh fun specifying any-shaped arrays as lists of lists with @pydantic for @linkml 's new array syntax... how do you specify a recursive python type that can generate recursive JSON schema and do recursive type checks that can use pydantic's fast rust core validators and not upset the type checker????

https://github.com/linkml/linkml/pull/1887#issuecomment-1936814514
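For context, the recursive JSON Schema such a type has to generate boils down to a definition that references itself — a hand-written sketch with names of my own choosing, not LinkML's actual output:

```python
# Hypothetical sketch of a recursive "any shape" array schema:
# a leaf integer, or an array whose items point back at the same definition.
any_shape_int = {
    "$defs": {
        "AnyShapeArrayInt": {
            "anyOf": [
                {"type": "integer"},
                {"type": "array",
                 "items": {"$ref": "#/$defs/AnyShapeArrayInt"}},
            ]
        }
    },
    "$ref": "#/$defs/AnyShapeArrayInt",
}
```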

jonny,
@jonny@neuromatch.social

Open science discourse sometimes: "whats the big freaking deal let's just come up with one format and put all the data in it"
formats ppl looking up from another hell of the most tedious computer problem of all time: "can we have one (1) funding"

SnoopJ,
@SnoopJ@hachyderm.io

@jonny oh boy

jonny,
@jonny@neuromatch.social

@SnoopJ the things we do to make usable interfaces without magic numbers...

jonny,
@jonny@neuromatch.social

@SnoopJ scientists asking something normal: "i just want to collect some data, the first dimension will be "time" and that will be in "seconds," but then also there can be any number of other dimensions of data too. that shouldn't be too hard"
me, talking computers which are not normal: "uh....."

SnoopJ,
@SnoopJ@hachyderm.io

@jonny the contrarian part of me reacts to the infinite generation with "well that is technically correct"

There's something cathartic about reading a well-written explanation of this kind of journey through hell. I confess that the schema stuff is largely beyond me, I'm already pretty annoyed if I even have to write a validator (because I am spoiled)

jonny,
@jonny@neuromatch.social

@SnoopJ haha as one of the devs of the data format we're targeting correctly pointed out "why do you need to support infinity dimensions, a computer is hopefully going to throw a recursion error long before you get even close" and suggested we just pick some high number and call it a day. Numpy only supports 32 dimensions, for example. We could have also just forbidden unspecified axes, so you can't have an unbounded shape, you at least have to set some maximum, but that's not a limit that should exist in a schema language - aka it should be possible to express unboundedness even if no implementation does unboundedness. That's an extra fun problem here because this is a meta-schema language that translates into a bunch of other formats (this pydantic model is just one), and each of them has their own limitations, so if you let the limitations of each individual format bleed back up into the meta-schema language then you basically get to express nothing.

all this is only really a problem if you want it to work for humans. e.g. I could just make a gigantic Union[] of 1024 types, one for each possible number of dimensions, and then say "don't look at the generated code, just import it." Or I could make some big long hairy schema spec that is impossible to write and impossible to design interfaces for.
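That union-of-fixed-depths workaround can at least be built mechanically rather than written out by hand — a sketch with hypothetical helper names (not the LinkML-generated code), capped at numpy's 32-dimension limit:

```python
from typing import List, Union, get_args

def fixed_depth(leaf, depth):
    """Build a List[List[...List[leaf]...]] type nested `depth` levels deep."""
    t = leaf
    for _ in range(depth):
        t = List[t]
    return t

# the "don't look at the generated code, just import it" approach:
# one union member per possible number of dimensions
UpTo32Dims = Union[tuple(fixed_depth(int, d) for d in range(1, 33))]

assert fixed_depth(int, 2) == List[List[int]]
assert len(get_args(UpTo32Dims)) == 32
```

The generated type checks fine, but any validation error against it reports failure against all 32 members, which is part of why it "sucks for computers" to read.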

Providing something where you can just be like

```yaml
class:
  name: "My Data"
  attributes:
    my_timeseries:
      range: int
      array:
```

to mean "a dataset with an integer timeseries of any shape"

and get

```python
class MyData(BaseModel):
    my_timeseries: AnyShapeArray[int]
```

to be able to use it is really good for humans, but really sucks for computers.
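The lists-of-lists idea underneath is small; a dependency-free sketch (my own naming, not the generated pydantic model) of the recursive type and the check it implies:

```python
from typing import List, Union

# A recursive alias: a leaf int, or an arbitrarily nested list of the same.
# Pydantic resolves this kind of forward reference when the model is built.
AnyShapeArrayInt = Union[int, List["AnyShapeArrayInt"]]

def check_any_shape(value, leaf=int):
    """Accept a leaf value or an arbitrarily nested (even ragged) list of leaves."""
    if isinstance(value, leaf):
        return True
    return isinstance(value, list) and all(check_any_shape(v, leaf) for v in value)

assert check_any_shape([[1, 2], [3, [4]]])   # any depth, ragged is fine
assert not check_any_shape([[1, "two"]])     # wrong leaf type rejected
```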

SnoopJ,
@SnoopJ@hachyderm.io

@jonny HMMM I had not looked into this closely enough to understand at first that this is generating Pydantic, but I will need to take a closer look at this, especially as I'm going on the warpath at work about a standing "it's fine just ship that, our standard is that" problem with datasets. Would be awesome to delete hand-authored Pydantic code and be able to hand a spec (or suitably generated schema, we already use JSONSchema/Protobuf for other stuff) over to the other teams.

jonny, (edited)
@jonny@neuromatch.social

@SnoopJ that is exactly what we are doing with @linkml by being able to specify arrays in the schema. it's a downright authorable schema language, and i'm working on different interfaces now to be able to actually use that. That's historically been one of the major problems with #LinkedData / #RDF array specifications - you might be able to describe it, but what's the point, that description is totally removed from how i actually use my data.

so if instead you could just start with some schema, use that to generate a bunch of models (that don't suck to use, having used other schema -> code generator tools before) that you can use in your analysis/whatever code and then also publish the data in some standardized format, that would be an astronomically better situation than what most scientists have to do now.

This lists of lists version is just the default one if you want to add zero dependencies to whatever you're doing (aside from pydantic, it is the pydantic version of the schema after all). It's a little more clumsy but it works out of the box. I'm also cooking up a tiny lil package (also with minimal deps) with a type that lets you use whatever the heck else array format you want to, right now just got numpy, dask, and hdf5, but going to split those out into plugins and make hooks for any additional formats too ( https://github.com/p2p-ld/numpydantic )

SnoopJ,
@SnoopJ@hachyderm.io

@jonny definitely gotta walk myself through an example and see what it would be like if we generated one of our existing classes this way

jonny,
@jonny@neuromatch.social

@SnoopJ
Hell ya when I get the draft running id love to use your data as a test case

SnoopJ,
@SnoopJ@hachyderm.io

@jonny I took a quick scan of the docs and I'm pretty sure I know the answer, but is there any way to go from not-YAML into LinkML? We have some existing JSONSchema and Protobuf floating around at work, it would be cool to ingest those, but imagining it isn't doable

jonny,
@jonny@neuromatch.social

@SnoopJ
Nah there is a whole ingestion and translation kit that will get you most of the way there. I still think it's worth handwriting the last mile but ya:
https://linkml.io/schema-automator/
