__getattr__ Considered Harmful
I just refactored a number of internal aspects of Pygr’s object-relational model, to make use of a new pattern I’m calling “subclass binding”, which I’ll try to explain a bit in this post. First I’ll try to explain the problem from the viewpoint of a Python programmer.
Object-relational design makes modularity even more difficult than usual. It illustrates a general problem: when you try to combine two different behaviors (e.g. a local Python object and a back-end database) into one object, all sorts of confusion can ensue.
My painful experience with annotation objects exemplified this problem. At first it seemed obvious that you would want combined access to the sequence object and the annotation attributes. But this quickly turned into a nightmare: since there was no way for users to indicate whether they wanted to treat the object as a sequence or as an annotation, the object constantly had to guess which behavior was desired. This leads to nasty bugs in every case where the guessing policy fails… The solution was to separate the two behaviors absolutely, by providing a sequence attribute on the annotation object.
__setattr__ Considered Harmful
These abstract issues turn painfully concrete when you write __getattr__ and __setattr__ methods. These methods do not “play nice with others”, to put it mildly. Any time you write a __getattr__ you create a modularity problem, because every attribute request that normal lookup cannot satisfy will come to this one method. If you later use multiple inheritance (i.e. “combine two behaviors in one object” as above) you will be obligated to write a new layer of __getattr__ to act as traffic cop for all the layers of __getattr__ below it. This does not scale. Complexity begins to grow in a self-reinforcing way. For example, you’ll immediately start to get infinite loops (e.g. during unpickling, the unpickler will request attributes like __reduce__ and __setstate__ that may not exist, so __getattr__ gets called, it requests some attribute that doesn’t exist because the object hasn’t even been initialized yet, so __getattr__ calls itself…). This leads to what I’ll call “protective nonsense”, i.e. your code gets littered with protective tests to block certain vulnerabilities.
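To make the unpickling loop concrete, here is a minimal sketch (the class names Mirror and GuardedMirror and the helper probe are made up for illustration, not Pygr code). It mimics what an unpickler does, creating the object without calling __init__ and then asking for __setstate__, and shows the kind of “protective” test that ends up in the code as a result.

```python
class Mirror:
    """Forwards unknown attributes to a wrapped object (illustrative only)."""

    def __init__(self, target):
        self._target = target

    def __getattr__(self, name):
        # If _target was never set, this lookup re-enters __getattr__ forever.
        return getattr(self._target, name)


class GuardedMirror(Mirror):
    """Same idea, plus the 'protective nonsense' needed to survive unpickling."""

    def __getattr__(self, name):
        if name.startswith("_"):  # refuse private and special names
            raise AttributeError(name)
        return getattr(self._target, name)


def probe(cls):
    """Mimic an unpickler: build the object without __init__, ask for __setstate__."""
    obj = cls.__new__(cls)
    try:
        getattr(obj, "__setstate__")
    except RecursionError:
        return "infinite recursion"
    except AttributeError:
        return "AttributeError"
    return "found"


unguarded = probe(Mirror)
guarded = probe(GuardedMirror)
```

The guard works, but every class that forwards attributes this way needs its own copy of it, which is exactly the scaling problem described above.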
This whole approach is error-prone, unscalable and unmodular.
__setattr__ makes this problem about a thousand times worse. Whereas __getattr__ is “the method of last resort” (it only gets called if the usual lookup fails), __setattr__ is pre-emptive: the very act of adding it to a class redirects all attribute writes to this one method. So you should consider __setattr__ a form of contamination; it takes over everything that it touches. Note that just the threat of __setattr__ can spread its baleful effect: even if you think a class might someday be subclassed with a __setattr__, you have to rewrite all your attribute writes to evade it (e.g. by calling __setitem__ on the object’s __dict__). Note to self: this is ridiculous!
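A small sketch of that contamination, under made-up names (Base, WriteThrough and friends are hypothetical, and a plain dict stands in for the database): a base class that does an ordinary attribute write has its internal state hijacked the moment a subclass adds __setattr__, and the only defense is to write through __dict__ everywhere.

```python
class Base:
    """Written with no thought of __setattr__: an ordinary attribute write."""

    def __init__(self):
        self.cache = {}  # meant to be purely local, internal data


class WriteThrough(Base):
    """Subclass that sends every attribute write to a backing store."""

    def __init__(self, store):
        self.__dict__["store"] = store  # must dodge our own __setattr__
        super().__init__()

    def __setattr__(self, name, value):
        self.store[name] = value  # pre-empts ALL writes, including Base's


class DefensiveBase:
    """The same base class, rewritten to evade a possible future __setattr__."""

    def __init__(self):
        self.__dict__["cache"] = {}


class DefensiveWriteThrough(DefensiveBase):
    def __init__(self, store):
        self.__dict__["store"] = store
        super().__init__()

    def __setattr__(self, name, value):
        self.store[name] = value


store = {}
obj = WriteThrough(store)             # Base's cache lands in the store
safe_store = {}
safe = DefensiveWriteThrough(safe_store)  # cache stays local
```

In the first case the internal cache ends up in the backing store and is missing from the object itself; in the second the base class had to be rewritten in advance just in case.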
In my experience, the uses that __getattr__ and __setattr__ are applied to are usually very modest, and don’t justify all these disadvantages. The one case that has some justification is when one object strictly mirrors the attributes of another object.
Tentative Conclusions
- When in doubt, keep behavior separate. The benefits of completely merging two things into a single object may actually not be very great, compared with simply having a standard way for accessing each object from the other.
- Use properties (descriptors) instead of __getattr__ and __setattr__: properties act like normal attributes (and thus are as modular as regular attributes), rather than taking over the class like __getattr__ and __setattr__ do. I suspect we are biased against creating lots of separate properties when we could achieve the same effect with one __getattr__() method. But in most cases this is a false economy: the same amount of code is required, but it is handled once at “compile time” (by decorating a class with a number of properties) instead of repeatedly on every attribute request (in __getattr__). The former is better. Perhaps more importantly, conflicting property names could be detected automatically (in theory), whereas sorting out potentially conflicting __getattr__ activities is for experts only. It also seems that properties can be added at any time, even after the shadow class already exists, unlike __getattr__ code, which is in practice fixed prior to module import! It’s also interesting to override a descriptor by simply setting a value in the object dictionary, in which case the descriptor’s get never gets called. (This only works for descriptors that define just __get__; a built-in property, or any descriptor with __set__, always takes precedence over the object dictionary.)
- The shadow class pattern should be done in a totally standard way, specifically via my current get_shadow_class() function.
- I need a better name for this pattern than “shadow classing”, which doesn’t explain anything. “Instance subclassing”? “subclass binding”? “automatic subclassing and binding”? I think a function name like get_bound_subclass() is much better than get_shadow_class(). So maybe “subclass binding” is the best suggestion. The name should communicate a couple of key ideas:
- we create a subclass for each database, in order to bind attributes that are specific to that database.
- this is done automatically
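As a rough illustration of the properties point above (all names here, such as Row, column_property and Lazy, are invented for the example and are not Pygr code): properties can be installed on a class after it exists, each one behaves like an ordinary attribute, and a descriptor that defines only __get__ can be shadowed by a value stored in the object dictionary.

```python
class Row:
    def __init__(self, data):
        self._data = data  # stand-in for a back-end record


def column_property(column):
    """Build a read-only property that looks up one column."""
    def fget(self):
        return self._data[column]
    return property(fget, doc="value of column %r" % column)


# Properties can be attached after the class has been created.
for column in ("name", "length"):
    setattr(Row, column, column_property(column))


class Lazy:
    """Descriptor with only __get__, so the instance dict can override it."""

    def __init__(self, func):
        self.func = func

    def __get__(self, obj, objtype=None):
        if obj is None:
            return self
        value = self.func(obj)
        obj.__dict__[self.func.__name__] = value  # later reads skip __get__
        return value


class Report:
    def __init__(self):
        self.calls = 0

    @Lazy
    def summary(self):
        self.calls += 1
        return "expensive result"


row = Row({"name": "BRCA1", "length": 81189})
report = Report()
first, second = report.summary, report.summary  # computed once, then cached
```

Each column costs one line at class-setup time, and nothing here intercepts lookups for attributes it does not own.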
Case Study: SQLSequence
One object acts both as an interface to a row in an SQL database and as a Pygr sequence object. That means some attributes are mirrored from the database, whereas others are purely local Python data, used for strictly internal purposes. So now we have a problem: how is the code supposed to know which attributes are purely internal, vs. which should be treated as queries to the database? This is obviously crucial for modularity. The SQL row class code should not have any dependency on knowing anything about what other classes it might get mixed with.
Proposal
- Instead of using __getattr__ on the SQL row class, use the shadow class approach and add descriptors (properties) for all the attributes that come from the back-end database. SQLTable already uses the shadow class mechanism, so why not use it to get rid of __getattr__ and __setattr__?
Comment: the current situation is quite a muddle. SQLTableBase.objclass() is automatically adding ColumnDescriptor properties in place of all actual SQL columns, in line with what I just proposed, but it clashes with what SQLRow expects (to store id attribute locally). I need to clean this up!! Actually, objclass()’s usage of the factory pattern is totally unnecessary. Descriptors can be added to a class at any time; that’s how pygr.Data does it.
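Here is one way the proposal could look, as a sketch only. The function make_bound_subclass and the ColumnProperty descriptor are hypothetical stand-ins (they are not Pygr’s get_shadow_class() or ColumnDescriptor), and a dict of dicts plays the role of the database. The point is that database-backed attributes are declared on an automatically created subclass, so purely local attributes need no special handling at all.

```python
class ColumnProperty:
    """Hypothetical descriptor: reads one column from the bound database."""

    def __init__(self, column):
        self.column = column

    def __get__(self, obj, objtype=None):
        if obj is None:
            return self
        return obj.db[obj.id][self.column]

    def __set__(self, obj, value):
        raise AttributeError("%s is read-only" % self.column)


class Row:
    """Knows nothing about which attributes come from a database."""

    def __init__(self, id):
        self.id = id
        self.scratch = None  # purely local, internal data


def make_bound_subclass(klass, db, columns):
    """Create a subclass of klass bound to one database (assumed interface)."""
    attrs = {column: ColumnProperty(column) for column in columns}
    attrs["db"] = db
    return type(klass.__name__ + "_bound", (klass,), attrs)


genes = {7: {"name": "BRCA1", "length": 81189}}  # stand-in for an SQL table
GeneRow = make_bound_subclass(Row, genes, ["name", "length"])

row = GeneRow(7)
row.scratch = "local note"  # ordinary write, never touches the database
```

Note that the id column is deliberately left out of the bound columns here; binding it too would collide with the locally stored id, which is the muddle described in the comment above.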
Generalizing the Persistence Pattern
Another interesting pattern to look at is the SQLTableBase.new() method. It calls the object constructor first, then inserts into the DB. The idea here is to preserve the usual distinction between construction (__init__) vs unpickling (__setstate__). Reading the object from the database is considered to be a form of persistence, so it gets treated using the unpickling route… This makes a lot of sense, to put Python’s built-in support for persistence to work as a systematic solution. The whole __reduce__, __getstate__, __setstate__ system is a much more general solution than just for pickling. The obvious implication is that classes like TupleO and SQLRow should be created using the ClassicUnpickler(klass, state) rather than using the usual constructor pattern klass(**state). The converse implication is that we should use __getstate__ to obtain the data to save into the database.
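A minimal sketch of that distinction, assuming a class that defines both __getstate__ and __setstate__ (the Gene class and the load_from_row helper are invented for this example; this is not how ClassicUnpickler is implemented): construction runs __init__ with its validation, whereas loading from storage bypasses __init__ and hands the saved state to __setstate__.

```python
class Gene:
    def __init__(self, name, start, stop):
        # Construction path: validate input for a brand-new object.
        if stop < start:
            raise ValueError("stop must not precede start")
        self.name, self.start, self.stop = name, start, stop
        self.init_ran = True  # marker so we can see which path was taken

    def __getstate__(self):
        # The data worth persisting (to a pickle or to a database row).
        return {"name": self.name, "start": self.start, "stop": self.stop}

    def __setstate__(self, state):
        # Restoration path: trust the stored state, skip validation.
        self.__dict__.update(state)


def load_from_row(klass, state):
    """Rebuild an object from stored state without calling __init__."""
    obj = klass.__new__(klass)
    obj.__setstate__(state)
    return obj


original = Gene("BRCA1", 100, 250)
saved = original.__getstate__()        # what would be written to the database
restored = load_from_row(Gene, saved)  # what reading it back would do
```

The same two methods serve ordinary pickling unchanged, which is the appeal of reusing them for database persistence.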
Does this make sense or does it clash with normal pickling usage? On the one hand, it does generalize the idea that you can save an object persistently to work equally well in regular pickling and other applications. On the other hand, this is quite different from the pygr.Data concept of “saving a reference to a persistent object”. That’s a separate issue. Two separate layers:
- standard persistence: extract the dictionary of attributes required for persistent storage.
- persistent ID: look for a persistent ID to store as a reference, instead of actually saving the data (because we know the data is stored somewhere else, and will always be retrievable using its persistent ID).
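The second layer can be sketched with the persistent_id and persistent_load hooks that the standard pickle module provides. Everything else below is an assumption for illustration: the Resource class, the registry dict and the ID string are made up, and this is not a description of how pygr.Data itself is implemented.

```python
import io
import pickle


class Resource:
    """Something large that already lives elsewhere under a persistent ID."""

    def __init__(self, pid):
        self.pid = pid


class RefPickler(pickle.Pickler):
    def persistent_id(self, obj):
        # Store only a reference for Resource objects; pickle the rest normally.
        if isinstance(obj, Resource):
            return obj.pid
        return None


class RefUnpickler(pickle.Unpickler):
    def __init__(self, file, registry):
        super().__init__(file)
        self.registry = registry

    def persistent_load(self, pid):
        # Resolve the reference back to the live object.
        return self.registry[pid]


genome = Resource("example.genome.id")  # hypothetical ID
registry = {genome.pid: genome}

buffer = io.BytesIO()
RefPickler(buffer).dump({"label": "my analysis", "genome": genome})
buffer.seek(0)
loaded = RefUnpickler(buffer, registry).load()
```

Only the ID string is written to the pickle; the Resource data itself is never serialized, and loading returns the very same object from the registry.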
I’ve never had to think much about all these things, because almost all my usage has been read-only, accessing existing databases (that were created through other mechanisms).
Alternative Models
I think I’m also confused about whether there’s a good way to make the object logic completely modular, separate from the storage implementation. There are several patterns to distinguish:
- What about “persistence proxy” methods, where data is not actually “unpickled” into a local object, but instead the object simply acts as a proxy that relays data requests to the persistent store… SQLRow is an example of this pattern. This doesn’t fit the standard __getstate__ / __setstate__ model. The persistence issue gets pushed down to the level of each individual attribute, which really means assuming that attributes are “natively” persistent, i.e. types that the transport mechanism can automatically convert like int, str, float.
- “fine-grained persistence”: Writable classes like TupleRW and wildfire Row write each attribute back to the database immediately; there is no operation to “save the complete persistent state”.
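Both patterns can be sketched with one descriptor against an in-memory SQLite table. The names SQLColumn and GeneRow and the table layout are assumptions made for this example, not Pygr’s SQLRow or TupleRW; reads are relayed to the database on every access, and writes go back immediately.

```python
import sqlite3


class SQLColumn:
    """Hypothetical descriptor relaying reads and writes to one SQL column."""

    def __init__(self, column):
        self.column = column  # trusted identifier; values are parameterized

    def __get__(self, obj, objtype=None):
        if obj is None:
            return self
        sql = "SELECT %s FROM %s WHERE id = ?" % (self.column, obj.table)
        return obj.conn.execute(sql, (obj.id,)).fetchone()[0]

    def __set__(self, obj, value):
        sql = "UPDATE %s SET %s = ? WHERE id = ?" % (obj.table, self.column)
        obj.conn.execute(sql, (value, obj.id))
        obj.conn.commit()  # written back right away; no "save" step


class GeneRow:
    table = "genes"
    name = SQLColumn("name")
    length = SQLColumn("length")

    def __init__(self, conn, id):
        self.conn = conn
        self.id = id


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE genes (id INTEGER PRIMARY KEY, name TEXT, length INTEGER)")
conn.execute("INSERT INTO genes VALUES (1, 'BRCA1', 100)")

row = GeneRow(conn, 1)
row.length = 250  # goes straight to the database, not into row.__dict__
stored = conn.execute("SELECT length FROM genes WHERE id = 1").fetchone()[0]
```

The object holds no copy of the column data at all, so there is no complete persistent state for __getstate__ to return.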
We often prefer to use these patterns over a “standard pickling” model for object-relational interfaces, for several reasons:
- memory & speed: for really large datasets the standard Python attribute model (a per-object __dict__) uses up too much memory. A tuple or __slots__ can use five-fold less memory, and also be much faster.
- write behavior: if the database is “writable”, then changing an attribute should be propagated to the database (and probably right away). That implies a descriptor. On the other hand, if the database is read-only, then we again need a descriptor to raise an AttributeError on any attempt to set the attribute.
- granularity: we want “fine-grained” interaction between the local object representation and a persistent store (which is typically what you want for an object-relational model).
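The memory and read-only points can be combined in one short sketch (TupleField and ExonRow are invented names, not Pygr’s TupleO): the row keeps its data in a single tuple, __slots__ suppresses the per-object __dict__, and a descriptor both exposes each field by name and rejects writes.

```python
class TupleField:
    """Hypothetical read-only descriptor exposing one position of a tuple."""

    def __init__(self, index):
        self.index = index

    def __get__(self, obj, objtype=None):
        if obj is None:
            return self
        return obj._row[self.index]

    def __set__(self, obj, value):
        raise AttributeError("this attribute is read-only")


class ExonRow:
    __slots__ = ("_row",)  # no per-instance __dict__

    start = TupleField(0)
    stop = TupleField(1)

    def __init__(self, row):
        self._row = row


exon = ExonRow((10, 20))
try:
    exon.start = 99
    write_error = None
except AttributeError as exc:
    write_error = str(exc)
```

How much memory this saves depends on the Python version and the number of fields, so the sketch makes no claim about the exact ratio.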
Whole Object Persistence vs. Attribute-level Persistence
This all seems to boil down to just two basic patterns:
- conventional unpickling / pickling takes the whole object as the unit of persistence: take data out from database, use it until done, then save back to database. This works OK for a read-only pattern (but doesn’t raise write errors as it should).
- attribute-level persistence: the persistence problem is pushed down to the individual attributes, typically with a descriptor handling all the requests. Any kind of proxy or fine-grained persistence uses this pattern.
Descriptor Categories
There are several kinds of descriptor usages that come up again and again:
- proxy: forwards query to someone else, typically a remote server
- computed: computes the requested attribute either by external calls or based on other attributes of the object
- read-only: raises appropriate exception if user tries to write the attribute. May also provide attribute interface to a more efficient internal storage mechanism (such as __slots__ or tuple)
- write-consequences: writing to this attribute triggers an action, such as value checking, saving to an external server, or other “consequences”
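As a brief sketch of the “computed” and “write-consequences” categories (the Checked descriptor and Interval class are hypothetical names for this example): a property computes length from other attributes, while writes to start and stop are validated and recorded in a change log that stands in for saving to an external server.

```python
class Checked:
    """Hypothetical descriptor: validates each write and records it."""

    def __init__(self, check):
        self.check = check

    def __set_name__(self, owner, name):
        self.name = name

    def __get__(self, obj, objtype=None):
        if obj is None:
            return self
        try:
            return obj.__dict__[self.name]
        except KeyError:
            raise AttributeError(self.name) from None

    def __set__(self, obj, value):
        if not self.check(value):
            raise ValueError("bad value for %s: %r" % (self.name, value))
        obj.__dict__[self.name] = value
        obj.changes.append((self.name, value))  # the "consequence"


class Interval:
    start = Checked(lambda value: value >= 0)
    stop = Checked(lambda value: value >= 0)
    # computed: derived from other attributes on every read
    length = property(lambda self: self.stop - self.start)

    def __init__(self, start, stop):
        self.changes = []
        self.start = start
        self.stop = stop


interval = Interval(5, 12)
length_before = interval.length
interval.stop = 20
length_after = interval.length
```

Because length is a property without a setter, it also falls into the read-only category: assigning to it raises AttributeError.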