[TransWarp] Basic "storage jar" design
Phillip J. Eby
pje at telecommunity.com
Sun Jun 30 10:43:24 EDT 2002
At 04:11 PM 6/30/02 +0200, Roché Compaan wrote:
>I didn't understand enough of your previous post "PEAK persistence based
>on ZODB4, continued" because my brain exploded every second paragraph. I
>wasn't too concerned because it seemed that you and yourself first
>needed to talk through it :)
Yes, I've started using letters to the mailing list as a substitute for
talking to Ty to work out my ideas, when he's not readily available. :)
Not too long ago, I found out there's actually a name for the way I do my
thinking; it's called "Image Streaming". The idea is that you dump out the
contents of your brain to another human being with the intent of having
them understand the ideas you're putting forth, and it frees you from
having to hold on tightly to any one idea as you go. It also creates a
kind of feedback loop that helps you refine and clarify the initially vague
intuitive concepts that come to mind. Anyway, I've been doing it for many
many years without having a name for it. It's only been in the last month
or so, however, that I've realized I can do a form of it by writing down
the ideas in the form of a letter or proposal or whatever to someone else. :)
>On Sat, 2002-06-29 at 22:34, Phillip J. Eby wrote:
> > Abstract "Storage Jar"
> > ======================
> > This is a basic design for an abstract implementation of the "storage jar"
> > concept for PEAK/ZODB4. It can be used as the basis for either a primary
> > key-driven object jar, or a query jar, with appropriate method
> > overrides. "Alternate key" jars won't have much use for this as a base
> > class, since they don't manage object states but just offer a convenient
> > front-end to retrieving an object from its primary key jar (possibly using
> > the preloadState() mechanism described below).
>What will a query jar do? I assume they will remember query results to
>prevent re-querying the underlying database?
They do several things, none of which I really ever explained thoroughly. :)
Think about a two-way association between objects - say your
person/department example. If the person table has a foreign key
reference [1->1] to department, then department has an implicit [1->n]
relationship to person. A query jar could be used to represent this
inverse relationship, so that when a department object's state is loaded, a
"ghost" from the query jar (with the department ID as its oid) is placed as
the "people" attribute of the loaded department object. Any attempt to
*use* this people attribute will cause its state to be loaded from the
query jar - a list of ghosts of person objects, retrieved by a query
against the persons table. Of course, since you're querying the persons
table, you may as well pass that state through to 'preloadState()' on the
person jar, so the person jar won't reload that data when you access one of
the ghosts. (Of course, if the state is loaded they won't be ghosts, but
Notice that this doesn't mean that all the people are loaded upon loading
the department's state. The query jar for the inverse relationship will
just return a ghost. It knows what type of object the ghost should be (a
PersistentList, basically), so it doesn't need to load the actual list
until you try to *do* something with the "people" attribute.
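A rough sketch of the lazy "people" attribute, in plain Python with no ZODB machinery behind it. The names LazyList, PeopleByDepartmentJar and queries_run are invented for the illustration, the "table" is just a list of dicts, and plain names stand in for the person ghosts a real query jar would return:

```python
class LazyList:
    """Stand-in for a ghosted PersistentList: loads itself on first use."""

    def __init__(self, jar, oid):
        self._p_jar = jar
        self._p_oid = oid
        self._items = None              # None means "still a ghost"

    def _activate(self):
        # Ask the jar for the state only the first time it is needed.
        if self._items is None:
            self._items = self._p_jar.load(self._p_oid)
        return self._items

    def __len__(self):
        return len(self._activate())

    def __iter__(self):
        return iter(self._activate())


class PeopleByDepartmentJar:
    """Hypothetical query jar whose oid is a department ID."""

    def __init__(self, person_rows):
        self.person_rows = person_rows  # stand-in for the persons table
        self.queries_run = 0

    def __getitem__(self, dept_id):
        return LazyList(self, dept_id)  # hands back a ghost, no query yet

    def load(self, dept_id):
        self.queries_run += 1
        return [r["name"] for r in self.person_rows if r["dept"] == dept_id]


rows = [
    {"name": "Roche", "dept": 1},
    {"name": "Ty", "dept": 2},
    {"name": "Phillip", "dept": 2},
]
jar = PeopleByDepartmentJar(rows)
people = jar[2]                         # what a department's "people" would hold
```

Nothing is queried when `jar[2]` is evaluated; the query runs once, the first time the list is iterated or measured.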
Anyway, so 1->N associations are one function for query jars. Another
function is somewhat as you described, to serve as query caching, but I see
it more as a uniform access method for queries. (Although, now that you
mention it, I can see some uses for caching certain kinds of queries within
or across transactions; I was assuming weak reference caching as the norm
for query objects.)
Treating queries - or "parameterized collections" if you will - as though
they were simply persistent objects, lends a nice degree of uniformity, and
minimizes conceptual entities in the system. Instead of writing query
methods as such, you create query jars. The parameters of a query become
an oid used to retrieve a list. You could even have a *writable* query,
where adding an item to the list or removing it would assert that you
wanted that object changed in such a way that it was or was not a member of
that set. That could be a thoroughly non-trivial exercise, depending on
the nature of the query, but the point is that if you needed it, you could
do it. I like the symmetry and elegance of it.
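One way to picture "the parameters of a query become an oid", again as a sketch under assumptions rather than actual PEAK code: the oid is a tuple of parameters, and results are held in a weak-reference cache. QueryJar, ResultList and the (dept, min_age) parameters are made up for the example, and for brevity the jar loads eagerly instead of returning a ghost.

```python
import weakref


class ResultList(list):
    """Plain list subclass, needed because bare lists cannot be weakly referenced."""


class QueryJar:
    """Hypothetical query jar: the oid is the tuple of query parameters."""

    def __init__(self, rows):
        self.rows = rows
        self.cache = weakref.WeakValueDictionary()
        self.queries_run = 0

    def __getitem__(self, oid):
        try:
            return self.cache[oid]
        except KeyError:
            pass
        # Keep a strong reference in 'result' while the cache entry is created.
        result = self.cache[oid] = self.load(oid)
        return result

    def load(self, oid):
        dept, min_age = oid             # unpack the parameters from the oid
        self.queries_run += 1
        return ResultList(
            r["name"] for r in self.rows
            if r["dept"] == dept and r["age"] >= min_age
        )


staff = QueryJar([
    {"name": "Ann", "dept": "sales", "age": 41},
    {"name": "Bob", "dept": "sales", "age": 25},
])
over_30 = staff[("sales", 30)]
```

As long as something still refers to `over_30`, asking for `staff[("sales", 30)]` again returns the same object without re-running the query.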
But mainly, query jars are there because they have to be, to support
inverse foreign key references as ghostable attributes of objects' state.
The rest is a very nice
bonus. The true brilliance of Jim Fulton's ZODB design is finally exposed
in ZODB4, with its more-orthogonal interfaces and less structural coupling
> > * oidFor(ob) -- Called by save() operations of other jars to get foreign
> > key values for objects referenced in their states. Implementation: if
> > ob._p_jar is self, return ob._p_oid, unless _p_oid is None, in which case
> > save the object using oid = ob._p_oid = self.new(ob), and return the
> > oid. If the _p_jar is NOT self, return self.thunk(ob) to try to translate
> > the reference or create a stub.
>So if I need to save an instance of "Person" which references an
>instance of "Department" I can call "oidFor(ADepartment)" on the
>DepartmentJar to get the department's id. When will _p_jar not be self?
>Won't all objects returned by the DepartmentJar have their _p_jar set to
Yes, *but* it is not necessarily the case that you'll be putting a
department object from *that* department jar there. Suppose you were
working in an RDBMS, but the source of department existence was an LDAP
directory. You might set aPerson.department =
aDepartmentFromAnLDAPJar. When saving aPerson, you ask the
SQLDepartmentJar for an oid, and it has to create a thunk or stub reference
in the SQL database that is referenceable as a department key, but has some
kind of linkage to the LDAP-based department info. That's what the thunk()
method is for. As I noted, it's not something you'll support often, but Ty
and I have multiple apps which do this sort of cross-DB referencing for one
or two object types.
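The oidFor() rule quoted above, with the LDAP case added, might look roughly like this. It is only a sketch: the Jar base class, the _allocate() helper, the stubs mapping and the integer ids are all assumptions made for the example, and a real thunk() would write a stub row to the SQL database instead of a dict.

```python
class Persistent:
    """Minimal stand-in for a persistent object."""
    _p_jar = None
    _p_oid = None


class Jar:
    def __init__(self, name):
        self.name = name
        self._next_id = 1

    def _allocate(self):
        oid = self._next_id
        self._next_id += 1
        return oid

    def new(self, ob):
        # A real jar would also write the object's state here.
        return self._allocate()

    def thunk(self, ob):
        raise TypeError("%s cannot reference a foreign object" % self.name)

    def oidFor(self, ob):
        if ob._p_jar is self:
            if ob._p_oid is None:       # ours, but not yet saved
                ob._p_oid = self.new(ob)
            return ob._p_oid
        return self.thunk(ob)           # someone else's object


class SQLDepartmentJar(Jar):
    """Hypothetical jar that can stub out departments living elsewhere."""

    def __init__(self, name):
        super().__init__(name)
        self.stubs = {}                 # (foreign jar name, foreign oid) -> local oid

    def thunk(self, ob):
        key = (ob._p_jar.name, ob._p_oid)
        if key not in self.stubs:
            self.stubs[key] = self._allocate()
        return self.stubs[key]


sql_jar = SQLDepartmentJar("sql")
ldap_jar = Jar("ldap")

ldap_dept = Persistent()
ldap_dept._p_jar = ldap_jar
ldap_dept._p_oid = "ou=engineering"
stub_oid = sql_jar.oidFor(ldap_dept)    # goes through thunk()
```

The LDAP object itself is left alone; only the SQL side gains a key that can be stored in the person row.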
> > Abstract Methods and Attributes
> > -------------------------------
> > (to be redefined as needed in concrete subclasses of AbstractJar)
> > * ghost(oid, state=None) -- given an oid and optional state, return a
> > ghost (empty instance) of the correct class. If 'state' is supplied, load it
> > into the object with ob.__setstate__() before returning it. Note that if
> > 'state' is needed to determine the correct class, but it isn't supplied,
> > your implementation can always call self.load(oid) first, examine the
> > state, then create the class instance and stick the state in it. It's not
> > a ghost at that point, but what else can you do if you need the state?
> > The reason this method *must* accept an optional state, even if it doesn't
> > use it, is so that multi-row queries and alternate key lookups can provide
> > their results to preloadState(), preventing a re-retrieval of the same
> > state from the underlying DB.
>So if an object's state is set to "loaded" by __setstate__ you still
>have an empty instance. The only difference being that its state is
>set. When does data retrieval happen for this instance, especially
>since its "loaded" state will prevent it? What am I missing?
If the state is loaded, it's not a ghost, and it has everything it needs.
>I understand "ghost" as a state (in memory but state is not loaded) but
>I don't quite follow what "ghost" as a method gives you.
Perhaps it's a poor method name. It must return either a ghost or a loaded
object; returning a ghost is the *minimum* requirement. The ghost() method
*can* ignore the state and not load it, returning just a ghost, if it wants
to. It'll just be inefficient to do so if the state was provided, since
the state will have to be re-fetched when the ghost is activated.
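In code, the contract for ghost() could be sketched like this. The Person class and its _p_loaded flag are hypothetical; ZODB tracks ghost-ness in its own machinery, not with a flag like this one.

```python
class Person:
    """Hypothetical persistent class for the sketch."""

    def __init__(self):
        self._p_oid = None
        self._p_loaded = False          # assumption: a simple "has state" flag

    def __setstate__(self, state):
        self.__dict__.update(state)
        self._p_loaded = True


class PersonJar:
    def ghost(self, oid, state=None):
        ob = Person()
        ob._p_oid = oid
        if state is not None:
            # State came for free (e.g. from a multi-row query), so use it
            # now instead of fetching it again when the ghost is activated.
            ob.__setstate__(state)
        return ob


person_jar = PersonJar()
bare = person_jar.ghost(1)                      # a true ghost
full = person_jar.ghost(2, {"name": "Ty"})      # already loaded
```

Either return value satisfies the minimum requirement; the second one just saves a trip to the database later.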
>"__getitem__" returns an object from the cache or a ghost if it's not in
Yes. preloadState() is similar, except that it *may* return a non-ghost,
fully loaded object.
>"load" will do the actual data retrieval or provide default states.
Yes; the latter only if you want to treat all oids as "virtually existing"
whether they are in the underlying DB or not. Sort of a "sparse"
algorithm. If you're not doing that, then load() is strictly for data
>I can absolutely see the sense in separating direct data retrieval from
>getting objects from the cache, into two separate methods. I'm having
>trouble understanding how the application or Jar will know when to
I think you misunderstand. load() isn't an API call, it's a private,
abstract method for subclasses. When an attribute of a ghost is accessed,
the ghost calls 'self._p_jar.setstate()' (by way of the C persistence
machinery). The setstate method then calls self.load() to load the state.
> > * new(ob) -- save new object 'ob' and return its oid (by generating it or
> > extracting it from state)
>What about foreign key constraints in the underlying db? Not that I
>really use them - I think it is the application's responsibility to
>govern relationships between objects.
I presume you're talking about ensuring that the referenced object exists
before it's referred to? That's actually handled by way of
'oidFor()'. Think about it. When you save the state for 'aPerson', it has
to get the 'oidFor()' of all its foreign key references before it can do an
SQL "UPDATE" to save them. If any of them need new ID's, oidFor() will
cause them to be created and saved *before* the update can point the
foreign key to them. Thus, relational integrity is guaranteed by the
normal operation of the framework, which is just beautiful, IMHO. :)
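The ordering argument can be seen in a toy version where the "SQL" is just a log of tuples. Record, TableJar and the log format are assumptions for the example, the sketch writes INSERTs for new objects, and reference cycles are not handled.

```python
class Record:
    """Hypothetical unsaved object: knows its jar, has no oid yet."""

    def __init__(self, jar, **fields):
        self._p_jar = jar
        self._p_oid = None
        self.fields = fields


class TableJar:
    def __init__(self, table, log):
        self.table = table
        self.log = log                  # shared list standing in for the DB
        self._next_id = 1

    def oidFor(self, ob):
        if ob._p_oid is None:
            ob._p_oid = self.new(ob)
        return ob._p_oid

    def new(self, ob):
        oid = self._next_id
        self._next_id += 1
        # Resolving references first forces unsaved targets to be written
        # before the row that points at them.
        row = {
            key: value._p_jar.oidFor(value) if isinstance(value, Record) else value
            for key, value in ob.fields.items()
        }
        self.log.append(("INSERT", self.table, oid, row))
        return oid


log = []
departments = TableJar("department", log)
persons = TableJar("person", log)

dept = Record(departments, name="R&D")
roche = Record(persons, name="Roche", department=dept)
persons.oidFor(roche)                   # saving the person saves the department first
```

The department row lands in the log ahead of the person row that refers to it, without any code that explicitly sorts the writes.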
By the way, remember that all of the methods listed under "Abstract
Methods" are *private* methods, not part of the API. The API for a jar
consists solely of the five methods __getitem__(), newItem(), oidFor(),
preloadState(), and flush(). And of those, only __getitem__() and
newItem() are for the use of "higher layer" application-level code. The
rest are for use by other jars, which are effectively part of the same
>I'm really glad that you put so much thought into avoiding the
>re-loading of state since that can either be the big strength or major
>downfall of a persistence framework.
One of many lessons learned from ZPatterns, I assure you. :)
>For those who don't know, "Jar" comes straight from your fridge. When
>you want to preserve food, you pickle it and put it in a Jar. The same
>goes for objects that you want to persist: you pickle it and put it in a
>Jar. Sometimes it helps to explain what was obvious once and has since
Actually, "storage jars" for me is a reference to a Monty Python
sketch! But I did start with the term "jar" since the ZODB persistence
framework has the _p_jar concept, which does come from "pickle jars" as
used by Jim Fulton, which came from Python pickle, which I think came from
some other language's notion of pickling. The politically correct term for
a jar is now a "persistent data manager", as expressed by the
IPersistentDataManager interface and lots of references to "dm's" and "data
manager" in the C and Python code of ZODB 4.
But I like "storage jars" better, at least as a working term. I'm not sure
it really belongs in the businesslike terminology of PEAK, and we might
actually be better off calling them "Racks", as they are very close in
concept and function to the Racks in ZPatterns. The main difference is
that there were no "alternate key" racks or "query" racks in ZPatterns, at
least as a promoted concept. Which isn't to say that nobody ever
implemented query or alternate key racks; I'm sure they did. There just
weren't names for the concepts.
Anyway, a "rack" goes more with the idea of giving something an ID and
getting back an object; I think of those motorized racks in the dry-cleaner
shops, where you hand in your ticket, and they spin round to your
clothes... And if the clothes are in an opaque bag, they're like a ghost,
but as long as the clothes are in there when you open the bag... :)
Anyway, final terminology can wait a bit, since there's no code as yet.