[TransWarp] Basic "storage jar" design

Sun Jun 30 10:43:24 EDT 2002

At 04:11 PM 6/30/02 +0200, Roché Compaan wrote:
>Hi Phillip
>
>I didn't understand enough of your previous post "PEAK persistence based
>on ZODB4, continued" because my brain exploded every second paragraph. I
>wasn't too concerned because it seemed that you and yourself first
>needed to talk through it :)

Yes, I've started using letters to the mailing list as a substitute for 
talking to Ty to work out my ideas, when he's not readily available.  :)

Not too long ago, I found out there's actually a name for the way I do my 
thinking; it's called "Image Streaming".  The idea is that you dump out the 
contents of your brain to another human being with the intent of having 
them understand the ideas you're putting forth, and it frees you from 
having to hold on tightly to any one idea as you go.  It also creates a 
kind of feedback loop that helps you refine and clarify the initially vague 
intuitive concepts that come to mind.  Anyway, I've been doing it for many 
many years without having a name for it.  It's only been in the last month 
or so, however, that I've realized I can do a form of it by writing down 
the ideas in the form of a letter or proposal or whatever to someone else.  :)

>On Sat, 2002-06-29 at 22:34, Phillip J. Eby wrote:
> > Abstract "Storage Jar"
> > ======================
> >
> > This is a basic design for an abstract implementation of the "storage jar"
> > concept for PEAK/ZODB4.  It can be used as the basis for either a primary
> > key-driven object jar, or a query jar, with appropriate method
> > overrides.  "Alternate key" jars won't have much use for this as a base
> > class, since they don't manage object states but just offer a convenient
> > front-end to retrieving an object from its primary key jar (possibly using
> > the preloadState() mechanism described below).
>
>What will a query jar do?  I assume they will remember query results to
>prevent re-querying the underlying database?

They do several things, none of which I really ever explained thoroughly.  :)

Think about a two-way association between objects - say your 
person/department example.  If the person table has a foreign key 
reference [1->1] to department, then department has an implicit [1->n] 
relationship to person.  A query jar could be used to represent this 
inverse relationship, so that when a department object's state is loaded, a 
"ghost" from the query jar (with the department ID as its oid) is placed as 
the "people" attribute of the loaded department object.  Any attempt to 
*use* this people attribute will cause its state to be loaded from the 
query jar - a list of ghosts of person objects, retrieved by a query 
against the persons table.  Of course, since you're querying the persons 
table, you may as well pass that state through to 'preloadState()' on the 
person jar, so the person jar won't reload that data when you access one of 
the ghosts.  (Of course, if the state is loaded they won't be ghosts, but 
anyway...)

Notice that this doesn't mean that all the people are loaded upon loading 
the department's state.  The query jar for the inverse relationship will 
just return a ghost.  It knows what type of object the ghost should be (a 
PersistentList, basically), so it doesn't need to load the actual list 
until you try to *do* something with the "people" attribute.
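
A rough sketch of that lazy behavior in plain Python, with no ZODB involved.  
Everything here is made up for illustration: LazyPeople stands in for a 
ghosted PersistentList, PeopleByDepartmentJar and its queries_run counter 
are hypothetical, and a list of dicts plays the part of the persons table.

```python
class LazyPeople:
    """Stand-in for a ghosted list: the query runs on first real use."""

    def __init__(self, jar, oid):
        self._jar = jar
        self._oid = oid          # here, the department ID
        self._items = None       # None means "still a ghost"

    def _activate(self):
        if self._items is None:
            self._items = self._jar.load(self._oid)
        return self._items

    def __iter__(self):
        return iter(self._activate())

    def __len__(self):
        return len(self._activate())


class PeopleByDepartmentJar:
    """Hypothetical query jar for the department -> people inverse link."""

    def __init__(self, person_rows):
        self.person_rows = person_rows
        self.queries_run = 0     # counter added only to show when loads happen

    def __getitem__(self, dept_id):
        # Hand back a ghost; nothing is queried yet.
        return LazyPeople(self, dept_id)

    def load(self, dept_id):
        self.queries_run += 1
        return [r["name"] for r in self.person_rows if r["dept"] == dept_id]


rows = [
    {"name": "alice", "dept": 1},
    {"name": "bob", "dept": 2},
    {"name": "carol", "dept": 1},
]
jar = PeopleByDepartmentJar(rows)
people = jar[1]          # just a ghost so far, no query has run
names = list(people)     # first real use triggers the query
```

A real version would presumably also pass each row through to the person 
jar's preloadState(); that step is left out here.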

Anyway, so supporting 1->N associations is one function for query jars.  Another 
function is somewhat as you described, to serve as query caching, but I see 
it more as a uniform access method for queries.  (Although, now that you 
mention it, I can see some uses for caching certain kinds of queries within 
or across transactions; I was assuming weak reference caching as the norm 
for query objects.)

Treating queries - or "parameterized collections" if you will - as though 
they were simply persistent objects lends a nice degree of uniformity, and 
minimizes conceptual entities in the system.  Instead of writing query 
methods as such, you create query jars.  The parameters of a query become 
an oid used to retrieve a list.  You could even have a *writable* query, 
where adding an item to the list or removing it would assert that you 
wanted that object changed in such a way that it was or was not a member of 
that set.  That could be a thoroughly non-trivial exercise, depending on 
the nature of the query, but the point is that if you needed it, you could 
do it.  I like the symmetry and elegance of it.
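
To illustrate "the parameters of a query become an oid", here is a small 
sketch in plain Python.  It is only one way the idea might look: the 
PeopleQueryJar class, the (dept, min_age) parameter tuple, and the 
queries_run counter are invented for this example, and the weak reference 
cache uses the standard weakref module rather than anything from ZODB.  
Unlike a real query jar, it loads the result immediately instead of 
returning a ghost.

```python
import weakref


class QueryResult(list):
    """List subclass; plain lists cannot be weakly referenced."""


class PeopleQueryJar:
    """Hypothetical query jar: the oid is the tuple of query parameters."""

    def __init__(self, rows):
        self.rows = rows
        self.queries_run = 0
        # Results stay cached only while something else still holds them.
        self._cache = weakref.WeakValueDictionary()

    def __getitem__(self, oid):
        result = self._cache.get(oid)
        if result is None:
            dept, min_age = oid
            self.queries_run += 1
            result = QueryResult(
                r["name"]
                for r in self.rows
                if r["dept"] == dept and r["age"] >= min_age
            )
            self._cache[oid] = result
        return result


rows = [
    {"name": "alice", "dept": 1, "age": 34},
    {"name": "bob", "dept": 2, "age": 41},
    {"name": "carol", "dept": 1, "age": 25},
]
jar = PeopleQueryJar(rows)
seniors = jar[(1, 30)]   # runs the query
again = jar[(1, 30)]     # same oid, same object, no second query
```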

But mainly, query jars are there because they have to be, to support 
inverse foreign key references as ghostable attributes of objects' 
state.  The rest is a very nice bonus.  The true brilliance of Jim Fulton's 
ZODB design is finally exposed in ZODB4, with its more-orthogonal 
interfaces and less structural coupling than before.

> > * oidFor(ob) -- Called by save() operations of other jars to get foreign
> > key values for objects referenced in their states.  Implementation: if
> > ob._p_jar is self, return ob._p_oid, unless _p_oid is None, in which case
> > save the object using oid = ob._p_oid = self.new(ob), and return the
> > oid.  If the _p_jar is NOT self, return self.thunk(ob) to try to translate
> > the reference or create a stub.
>
>So if I need to save an instance of "Person" which references an
>instance of "Department" I can call "oidFor(ADepartment)" on the
>DepartmentJar to get the department's id.  When will _p_jar not be self?
>Won't all objects returned by the DepartmentJar have their _p_jar set to
>the DepartmentJar?

Yes, *but* it is not necessarily the case that you'll be putting a 
department object from *that* department jar there.  Suppose you were 
working in an RDBMS, but the source of department existence was an LDAP 
directory.  You might set aPerson.department = 
aDepartmentFromAnLDAPJar.  When saving aPerson, you ask the 
SQLDepartmentJar for an oid, and it has to create a thunk or stub reference 
in the SQL database that is referenceable as a department key, but has some 
kind of linkage to the LDAP-based department info.  That's what the thunk() 
method is for.  As I noted, it's not something you'll support often, but Ty 
and I have multiple apps which do this sort of cross-DB referencing for one 
or two object types.
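
Here is a minimal sketch of the oidFor()/thunk() logic, following the 
description quoted above.  It is plain Python, not PEAK code: Obj and 
SketchJar are hypothetical, the "rows" dict stands in for a database 
table, and representing a stub as a row that records the foreign jar's 
label and oid is just one assumption about what a thunk might store.

```python
class Obj:
    """Hypothetical persistent object with ZODB-style _p_jar/_p_oid."""

    def __init__(self, name, jar=None):
        self.name = name
        self._p_jar = jar
        self._p_oid = None


class SketchJar:
    def __init__(self, label):
        self.label = label
        self.rows = {}     # oid -> stored state (stands in for a table)
        self.stubs = {}    # (foreign jar label, foreign oid) -> local oid
        self._next_oid = 1

    def _allocate(self):
        oid = self._next_oid
        self._next_oid += 1
        return oid

    def new(self, ob):
        # Save a new object and return its oid.
        oid = self._allocate()
        self.rows[oid] = {"name": ob.name}
        return oid

    def thunk(self, ob):
        # Create (or reuse) a local stub row pointing at the foreign object.
        # Assumes ob already belongs to some other jar.
        foreign = ob._p_jar
        key = (foreign.label, foreign.oidFor(ob))
        if key not in self.stubs:
            oid = self._allocate()
            self.rows[oid] = {"stub_for": key}
            self.stubs[key] = oid
        return self.stubs[key]

    def oidFor(self, ob):
        if ob._p_jar is self:
            if ob._p_oid is None:
                ob._p_oid = self.new(ob)   # save on demand
            return ob._p_oid
        return self.thunk(ob)


sql_jar = SketchJar("sql")
ldap_jar = SketchJar("ldap")
local_dept = Obj("Accounting", jar=sql_jar)
ldap_dept = Obj("Research", jar=ldap_jar)

local_oid = sql_jar.oidFor(local_dept)   # saved in the SQL jar itself
stub_oid = sql_jar.oidFor(ldap_dept)     # stub row referencing LDAP
```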

> > Abstract Methods and Attributes
> > -------------------------------
> >
> > (to be redefined as needed in concrete subclasses of AbstractJar)
> >
> > * ghost(oid, state=None) -- given an oid and optional state, return a ghost
> > (empty instance) of the correct class.  If 'state' is supplied, load it
> > into the object with ob.__setstate__() before returning it.  Note that if
> > 'state' is needed to determine the correct class, but it isn't supplied,
> > your implementation can always call self.load(oid) first, examine the
> > state, then create the class instance and stick the state in it.  It's not
> > a ghost at that point, but what else can you do if you need the state?  The
> > reason this method *must* accept an optional state, even if it doesn't need
> > it, is so that multi-row queries and alternate key lookups can provide
> > their results to preloadState(), preventing a re-retrieval of the same data
> > from the underlying DB.
>
>So if an object's state is set to "loaded" by __setstate__ you still
>have an empty instance.  The only difference being that its state is
>set.  When does data retrieval happen for this instance, especially
>since its "loaded" state will prevent it.  What am I missing?

If the state is loaded, it's not a ghost, and it has everything it needs.

>I understand "ghost" as a state (in memory but state is not loaded) but
>I don't quite follow what "ghost" as a method gives you.

Perhaps it's a poor method name.  It must return either a ghost or a loaded 
object; returning a ghost is the *minimum* requirement.  The ghost() method 
*can* ignore the state and not load it, returning just a ghost, if it wants 
to.  It'll just be inefficient to do so if the state was provided, since 
the state will have to be re-fetched when the ghost is activated.
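
As a sketch of that contract (plain Python, nothing from ZODB): ghost() 
below returns a bare ghost when no state is given, and a loaded object 
when state is supplied.  The Person class, the _p_loaded flag, and the 
loads counter are hypothetical, added only so the difference is visible.

```python
class Person:
    """Hypothetical persistent class; _p_loaded is an invented flag."""

    _p_oid = None
    _p_jar = None
    _p_loaded = False

    def __setstate__(self, state):
        self.__dict__.update(state)
        self._p_loaded = True


class PersonJar:
    def __init__(self, table):
        self.table = table   # oid -> state dict, standing in for the DB
        self.loads = 0       # counts trips to the "database"

    def load(self, oid):
        self.loads += 1
        return dict(self.table[oid])

    def ghost(self, oid, state=None):
        ob = Person()
        ob._p_oid = oid
        ob._p_jar = self
        if state is not None:
            # State came from elsewhere (say, a multi-row query),
            # so use it and avoid fetching it again later.
            ob.__setstate__(state)
        return ob


jar = PersonJar({1: {"name": "Ty"}})
bare = jar.ghost(1)                      # a ghost: no state yet
full = jar.ghost(1, {"name": "Ty"})      # loaded, no DB access needed
```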

>"__getitem__" returns an object from the cache or a ghost if it's not in
>the cache.

Yes.  preloadState() is similar, except that it *may* return a non-ghost, 
fully loaded object.

>"load" will do the actual data retrieval or provide default states.

Yes; the latter only if you want to treat all oids as "virtually existing" 
whether they are in the underlying DB or not.  Sort of a "sparse" 
algorithm.  If you're not doing that, then load() is strictly for data 
retrieval.

>I can absolutely see the sense in separating direct data retrieval from
>getting objects from the cache, into two separate methods.  I'm having
>trouble understanding how the application or Jar will know when to
>do which.

I think you misunderstand.  load() isn't an API call, it's a private, 
abstract method for subclasses.  When an attribute of a ghost is accessed, 
the ghost calls 'self._p_jar.setstate()' (by way of the C persistence 
machinery).  The setstate method then calls self.load() to load the state.
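
That activation path can be imitated in plain Python.  This is an 
illustration only: the real hook lives in the C persistence machinery, 
while this sketch fakes it with __getattr__.  Ghostable, DictBackedJar, 
the _p_ghost flag, and the loads counter are all hypothetical names, and 
there is no object cache here.

```python
class Ghostable:
    """Hypothetical object that asks its jar for state on first access."""

    def __init__(self, jar, oid):
        self._p_jar = jar
        self._p_oid = oid
        self._p_ghost = True

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails.
        if name.startswith("_p_") or not self._p_ghost:
            raise AttributeError(name)
        self._p_jar.setstate(self)     # ghost -> loaded
        return getattr(self, name)


class DictBackedJar:
    def __init__(self, table):
        self.table = table   # oid -> state dict, standing in for the DB
        self.loads = 0

    def __getitem__(self, oid):
        return Ghostable(self, oid)    # application code only sees this

    def load(self, oid):
        # Private: called by setstate(), never by application code.
        self.loads += 1
        return dict(self.table[oid])

    def setstate(self, ob):
        ob.__dict__.update(self.load(ob._p_oid))
        ob._p_ghost = False


jar = DictBackedJar({7: {"name": "Roche", "dept": 3}})
person = jar[7]                 # a ghost; nothing loaded yet
loads_before = jar.loads
name = person.name              # first access triggers setstate -> load
dept = person.dept              # already loaded, no second load
```

So the application never decides when to call load(); touching an 
attribute of a ghost is what triggers it.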

> > * new(ob) -- save new object 'ob' and return its oid (by generating it or
> > extracting it from state)
>
>What about foreign key constraints in the underlying db?  Not that I
>really use them - I think it is the application's responsibility to
>govern relationships between objects.

I presume you're talking about ensuring that the referenced object exists 
before it's referred to?  That's actually handled by way of 
'oidFor()'.  Think about it.  When you save the state for 'aPerson', it has 
to get the 'oidFor()' of all its foreign key references before it can do an 
SQL "UPDATE" to save them.  If any of them need new ID's, oidFor() will 
cause them to be created and saved *before* the update can point the 
foreign key to them.  Thus, relational integrity is guaranteed by the 
normal operation of the framework, which is just beautiful, IMHO.  :)
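
Here is a small sketch of that ordering, using the standard sqlite3 module 
with foreign key enforcement switched on.  The table layout, the 
Department/Person classes, and both jar classes are assumptions made up 
for this example, and the _p_jar check in oidFor() is omitted to keep it 
short.

```python
import sqlite3


class Department:
    def __init__(self, name):
        self.name = name
        self._p_oid = None


class Person:
    def __init__(self, name, department):
        self.name = name
        self.department = department
        self._p_oid = None


class DepartmentJar:
    def __init__(self, conn):
        self.conn = conn

    def new(self, ob):
        cur = self.conn.execute(
            "INSERT INTO department (name) VALUES (?)", (ob.name,))
        return cur.lastrowid

    def oidFor(self, ob):
        if ob._p_oid is None:
            ob._p_oid = self.new(ob)   # saved before anyone refers to it
        return ob._p_oid


class PersonJar:
    def __init__(self, conn, department_jar):
        self.conn = conn
        self.department_jar = department_jar

    def save(self, person):
        # Resolving the foreign key first forces the department row
        # to exist before the person row points at it.
        dept_id = self.department_jar.oidFor(person.department)
        if person._p_oid is None:
            cur = self.conn.execute(
                "INSERT INTO person (name, dept_id) VALUES (?, ?)",
                (person.name, dept_id))
            person._p_oid = cur.lastrowid
        else:
            self.conn.execute(
                "UPDATE person SET name = ?, dept_id = ? WHERE id = ?",
                (person.name, dept_id, person._p_oid))


conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE department (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    "CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT, "
    "dept_id INTEGER REFERENCES department(id))")

dept = Department("Research")          # brand new, never saved
ty = Person("Ty", dept)
PersonJar(conn, DepartmentJar(conn)).save(ty)
```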

By the way, remember that all of the methods listed under "Abstract 
Methods" are *private* methods, not part of the API.  The API for a jar 
consists solely of the five methods __getitem__(), newItem(), oidFor(), 
preloadState(), and flush().  And of those, only __getitem__() and 
newItem() are for the use of "higher layer" application-level code.  The 
rest are for use by other jars, which are effectively part of the same 
abstraction layer.

>I'm really glad that you put so much thought into avoiding the
>re-loading of state since that can either be the big strength or major
>downfall of a persistence framework.

One of many lessons learned from ZPatterns, I assure you.  :)

>For those who don't know, "Jar" comes straight from your fridge.  When
>you want to preserve food, you pickle it and put it in a Jar.  The same
>goes for objects that you want to persist: you pickle it and put it in a
>Jar.  Sometimes it helps to explain what was obvious once and has since
>been forgotten.

Actually, "storage jars" for me is a reference to a Monty Python 
sketch!  But I did start with the term "jar" since the ZODB persistence 
framework has the _p_jar concept, which does come from "pickle jars" as 
used by Jim Fulton, which came from Python pickle, which I think came from 
some other language's notion of pickling.  The politically correct term for 
a jar is now a "persistent data manager", as expressed by the 
IPersistentDataManager interface and lots of references to "dm's" and "data 
manager" in the C and Python code of ZODB 4.

But I like "storage jars" better, at least as a working term.  I'm not sure 
it really belongs in the businesslike terminology of PEAK, and we might 
actually be better off calling them "Racks", as they are very close in 
concept and function to the Racks in ZPatterns.  The main difference is 
that there were no "alternate key" racks or "query" racks in ZPatterns, at 
least as a promoted concept.  Which isn't to say that nobody ever 
implemented query or alternate key racks; I'm sure they did.  There just 
weren't names for the concepts.

Anyway, a "rack" goes more with the idea of giving something an ID and 
getting back an object; I think of those motorized racks in the dry-cleaner 
shops, where you hand in your ticket, and they spin round to your 
clothes...  And if the clothes are in an opaque bag, they're like a ghost, 
but as long as the clothes are in there when you open the bag...  :)

Anyway, final terminology can wait a bit, since there's no code as yet.