Distributed Garbage Collection

Distributed Garbage Collection (DGC) is GarbageCollection for a DistributedSystem.

Whether an object should be destroyed is provably undecidable in the general case (e.g. given partial failures), so it must be assumed that any DGC protocol will make errors. Keeping the error rate below an acceptable threshold with minimal overhead is the goal of a good DGC protocol.


Feasible Approaches:

Leverage Self-Healing: If a distributed programming system is designed to heal after node-level failures, then the garbage collector can leverage this by destroying a distributed object heuristically if it hasn't been used in a while. Upon later discovering that the destruction was in error, use the self-healing mechanisms to restore it. Protocol and service-layer self-healing may also be leveraged in higher layers.
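A minimal sketch of that heuristic, assuming the system supplies a self-healing rebuild callback (all names here are illustrative, not from any real system):

```python
import time

class HealingCollector:
    """Heuristically destroys idle distributed objects; if the destruction
    turns out to be in error, restores the object on demand via a
    self-healing rebuild callback (hypothetical interface)."""

    def __init__(self, rebuild, idle_limit=3600.0):
        self.rebuild = rebuild          # e.g. re-fetch state from a replica
        self.idle_limit = idle_limit    # seconds of disuse before collection
        self.objects = {}               # object id -> (object, last_used)

    def touch(self, oid, obj=None):
        obj = obj if obj is not None else self.objects[oid][0]
        self.objects[oid] = (obj, time.monotonic())
        return obj

    def sweep(self):
        now = time.monotonic()
        for oid, (_, last) in list(self.objects.items()):
            if now - last > self.idle_limit:
                del self.objects[oid]   # heuristic destruction; may be wrong

    def get(self, oid):
        if oid not in self.objects:     # destruction was in error:
            self.touch(oid, self.rebuild(oid))  # lean on self-healing
        return self.touch(oid)
```

The point is only that the collector never needs to be right, merely cheap to correct.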

Partition the Identifiers: From locality of reference, we know that most memory objects (>95%, easily) are never directly referenced from external machines, and so may be garbage-collected locally. Leverage this: use 'intermediate-scale' identifiers rather than jumping straight from local IDs to global IDs. For example, pair-wise IDs or domain-local IDs could be used. IDs with limited distribution allow more localized coordination to determine reachability, are likely to cover most GC cases in practice, and also reduce transport and security costs within the domain. Depending on communications patterns (especially the common use of request/reply patterns, future transport, and client-server distribution patterns), pair-wise IDs could easily save another 95% or more, simplifying DGC by limiting distribution or tracking it more explicitly.

Interest Tracking: By nature, any form of DGC must be cooperative, so make it even more so! If Alice wants to keep a particular globally identified object around on Charlie's machine, she says: "Hey, Charlie, would you hold onto that object for me?" That way Charlie doesn't need to guess - especially useful when Charlie doesn't know Alice (i.e. if Alice obtained the reference through Bob). A minimal cost would be a simple timestamp per object (a 'do not expire before' date) such that Alice could speak up again if an object is about to expire. More complex approaches may keep more information for a number of reasons (low-latency collection, process accounting, system administration), but even the overhead for full reverse-links or explicit interest-groups would be quite acceptable for objects of low-to-medium distribution. Usefully, support for subscriptions and other constructs may serve well as implicit programmer interest.
An overlay network that can perform intra-node caching or subscription multi-cast may also reduce overhead for interest tracking (quashing unnecessary requests to extend the lifetime before they reach the primary hosts).
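The minimal 'do not expire before' scheme above can be sketched in a few lines; the interface names are illustrative, and a real host would persist the leases and run the sweep alongside its local collector:

```python
import time

class InterestRegistry:
    """One timestamp per object: the latest 'do not expire before' date
    declared by any interested party (hypothetical interface)."""

    def __init__(self):
        self.leases = {}  # object id -> latest expiry timestamp

    def extend(self, oid, ttl):
        """Alice says: 'hold onto oid for at least ttl more seconds'.
        A later, shorter request never shortens an existing hold."""
        expiry = time.time() + ttl
        self.leases[oid] = max(self.leases.get(oid, 0.0), expiry)

    def collectible(self, oid, now=None):
        """True once every declared interest has lapsed; Alice is expected
        to speak up again before her object reaches this point."""
        now = time.time() if now is None else now
        return self.leases.get(oid, 0.0) <= now
```

Charlie never guesses: an object with a live lease is kept, and anything else falls back to the local heuristics.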

Combine with Explicit Destruction: Allow programmers to explicitly destroy certain objects. Allow programmers to mark survival of some objects as contingent upon the survival of others. This can be a tool for process control, capability revocation, and more. This would further reduce the need for DistributedGarbageCollection; fewer collections would mean fewer heuristic failures and improved robustness.
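A sketch of the survival-contingency idea, under assumed (not real) interfaces: destroying an object cascades to everything marked contingent upon it, which is how capability revocation or process shutdown could ride on explicit destruction:

```python
class Registry:
    """Explicit destruction with survival contingencies: destroying an
    anchor object also destroys objects whose survival was declared
    contingent upon it (illustrative sketch)."""

    def __init__(self):
        self.live = set()
        self.dependents = {}  # anchor id -> set of contingent object ids

    def create(self, oid, contingent_on=None):
        self.live.add(oid)
        if contingent_on is not None:
            self.dependents.setdefault(contingent_on, set()).add(oid)

    def destroy(self, oid):
        """Explicit destruction; cascades through contingencies, e.g.
        revoking a capability revokes every sub-capability minted from it."""
        self.live.discard(oid)
        for dep in self.dependents.pop(oid, ()):
            self.destroy(dep)
```

Every object destroyed this way is one the distributed collector never has to reason about.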

DGC is Not. Worth. It. Once a reference to an object is communicated outwards beyond a domain over which you possess full control of communications, it is theoretically impossible to track all places it might reach... and you probably lack the authority to even ask or search. DGC beyond your domain violates security principles. However, even if you possessed the necessary authority, a network search is not scalable, especially with any additional requirements for disruption tolerant networking. IIRC, it has been proven that, with certain assumptions about the network (involving disruption and partitioning), DGC is totally undecidable... you can't -ever- decide whether something ought to be deleted.

Information or objects should disappear when people with the authority to make them disappear want them to disappear (for whatever reason), and even that cannot be performed reliably due to the casual ability to replicate information. If nobody wants the information gone, it can hang around forever and be cached away to more permanent but slower storage as it falls into disuse. Available storage, after all, increases at a greater rate than original content to store, even along the time dimension. Local garbage collection is a different matter... and should be performed normally because, unlike a network (to which one can always add storage), individual devices suffer limited storage.

Better would be a system where you simply demand that those remote users interested in keeping an object be sure to tag it to indicate their interest (similar to a subscription, possibly possessing an expiry clause, but not necessarily involving outwards communications when the object changes or is otherwise processed... still, essentially a reversed link). When it comes time to clean up space (if you must) simply remove some of the old stuff that nobody has tagged as interesting and that has fallen into disuse... much like removing dead threads in a forum. Anybody who cares will have a powerful social-engineering pressure to make that concern known OR to replicate the data for their own use (as security dictates). Since any network protocol must ultimately depend upon social engineering of the agents involved, might as well make it explicit.

DistributedGarbageCollection does not mean "universal uncontrolled collection of garbage". It is here mentioned in the context of NetworkAsComputer.

I never said anything about uncontrolled, but either way it doesn't matter. It still comes down to an issue of domain and social engineering of agents. If references are legally allowed to leave a domain, you simply can't collect those with a DGC... not without violating the very purpose of garbage collection in general (to recycle those resources that are no longer of potential use to any legal user of the resource). If you do have control of the full network across which you are performing collection, then you can do some collection, but even in that case, DGC can be undecidable.

I assume what was meant is: Distributed in the sense of connected by network with failure modes and delays that have to be taken into account. Not more, but also not less. In this setting the object never leaves your domain. Maybe the distributed system (might simply be a large distributed VM like MozartOzLanguage) runs on a server farm owned and controlled by you. If mobile or other remote devices are part of the system the untrusted/unowned connections have to be secured of course.

A server farm (any sort of cluster or grid) isn't generally considered 'distributed', though it certainly is concurrent. It's an informal definition, but I don't consider a system truly 'distributed' until it becomes impractical for any one agent to know the name of or talk directly to every other agent (a quality that is, perhaps, corollary to LeslieLamport?'s statement: "A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable."). Handling garbage collection in a concurrent environment is quite difficult enough. Anyhow, yes, if you do have authority to hunt down every reference to an object (because you own the domain), AND you can handle the concurrency problem (communications delays, references to an object added or deleted from nodes during collection), AND you can handle node and link communications disruption problems (nodes become inaccessible for indeterminate periods of time), THEN you could probably do some garbage collection in a distributed environment.

The name 'DistributedGarbageCollection' would additionally imply that the service itself is distributed such that a failure or partitioning of a subset of nodes performing DGC doesn't (entirely) halt the collection process, doesn't cause deadlock, doesn't cause livelock, and doesn't leave the system in an undefined state. However, I think this topic here was in reference only to collection of resources in a distributed computation environment.

Considering how difficult it is to perform simple distributed transactions with rollback in a distributed environment, I remain greatly skeptical that DGC will ever surpass those solutions that involve local agent decisions on whether to keep an object around or not.

See MozartOzLanguage (esp. http://www.mozart-oz.org/download/mozart-ftp/store/1.3.2-2006-06-15-print/other/DistributionSubsystem.ps.gz). Mozart uses local garbage collection and weighted object references (where weight is based on how -much- of the object you have; it essentially tracks whether remote references exist). It does not collect objects with weighted references, but might hand the weight off to another agent carrying a reference to that object.
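The weight-passing idea is in the spirit of classic weighted reference counting, which can be sketched as follows (a simplification, not Mozart's actual implementation): the owner tracks outstanding weight, copying a reference splits its weight with no message to the owner, and dropping a reference mails its weight home.

```python
class WeightedRef:
    """A remote reference carrying part of the object's total weight."""

    def __init__(self, owner, weight):
        self.owner = owner
        self.weight = weight

    def copy(self):
        """Duplicate the reference by splitting weight - no message to the
        owner is needed. (Real schemes need indirection once weight hits 1.)"""
        half = self.weight // 2
        self.weight -= half
        return WeightedRef(self.owner, half)

    def drop(self):
        """Hand the weight back to the owner; returns True if the object
        thereby became collectible."""
        w, self.weight = self.weight, 0
        return self.owner.return_weight(w)

class Owner:
    TOTAL = 1 << 16  # weight issued per exported reference

    def __init__(self):
        self.outstanding = 0

    def export_ref(self):
        self.outstanding += self.TOTAL
        return WeightedRef(self, self.TOTAL)

    def return_weight(self, w):
        self.outstanding -= w
        return self.outstanding == 0  # no remote references remain
```

As the text notes, the owner never collects while any weight remains outstanding, so this tracks whether remote references exist rather than how many.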

RE: Partitioned Identifiers (page_anchor: opaque and pairwise)

I agree that [partitioning the identifiers] optimizes lots of cases. But the programmer should not have to care about it unless he wants to. I mean IDs should be of one opaque type; the system would choose (and switch between) a suitable implementation, and the programmer might at best use some API to read out or influence this.

Agreed. For simplicity, remote identifiers should be of one opaque type from the programmer's perspective. In the general case the programmer doesn't work with 'identifiers' at all; instead the protocol module emits 'proxy' objects that are (to the type-system and most programmer interactions) indistinguishable from local objects. EeLanguage does distinguish 'far refs' from local refs to make reliability concerns more explicit. My favored design treats all objects as remote and concurrent, which supports explicit destruction (mentioned above), process accounting (e.g. shut down an activity with well-defined semantics for failed outgoing messages), and automatic distribution and replication.

I don't know what you mean by pair-wise IDs. I imagine something like one ID for both sides of a reference (my object here has ID x, which is referenced as ID y over there). Could you tell me more?

Assume Charlie needs to share an identifier to object O with Alice. As far as costs go, a 'global' ID needs to avoid naming collisions and ideally is unguessable (ObjectCapabilityModel-secure). In practice, this isn't difficult to achieve in about 64 characters per ID: a near-random 160-bit object identifier in base-32, plus another 160 bits to identify and fingerprint a host (a fingerprint is an alternative to certificates that protects against man-in-the-middle attacks but hurts human-readability). The costs to create a new ID are tiny compared to the costs of distributing it or GarbageCollection. Multi-homed identifiers to support reliability and performance can grow somewhat more expensive.
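A rough sketch of such an ID, using a SHA-1 digest as a stand-in for the host fingerprint (the separator and encoding details are illustrative; 2 x 32 base-32 characters plus a dot gives 65 characters, close to the ~64 above):

```python
import base64
import hashlib
import secrets

def global_id(host_pubkey: bytes) -> str:
    """Construct a global ID as sketched above: a near-random 160-bit
    object identifier plus a 160-bit host fingerprint, both base-32."""
    oid = secrets.token_bytes(20)             # 160 unguessable bits
    fp = hashlib.sha1(host_pubkey).digest()   # 160-bit host fingerprint

    def enc(b: bytes) -> str:
        # 20 bytes encode to exactly 32 base-32 characters, no padding
        return base64.b32encode(b).decode().lower()

    return enc(oid) + "." + enc(fp)
```

Two IDs minted by the same host share the fingerprint half but differ in the object half, which is what makes minting cheap relative to distribution.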

The pairwise IDs, meanwhile, don't need to be unguessable, since there is only one recipient who has full rights to access all of the objects given to it. Presumably, they also don't need multi-homing. Thus, one can get by with 48 to 60 bit integers for pairwise IDs, and a simple increment. This may actually result in very considerable performance savings if a large portion of IDs never escape pairwise reference. However, transport outside that pair is much more expensive, requiring round-trips or certificates or distributed upgrade. There is also a total server-space cost proportional to the number of hosts with a reference to the ID (a condition that doesn't scale).
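The allocation side is correspondingly trivial; a sketch with assumed names, one incrementing counter per peer and an export table for resolving incoming requests:

```python
import itertools

class PairwiseIds:
    """Per-peer pairwise IDs: simple incrementing integers kept under
    2**48, as suggested above. Guessability is fine because only the
    one peer on this channel may use them (illustrative sketch)."""

    def __init__(self):
        self.counters = {}   # peer -> itertools.count
        self.exported = {}   # (peer, pairwise id) -> local object

    def export(self, peer, obj):
        ctr = self.counters.setdefault(peer, itertools.count(1))
        pid = next(ctr)
        assert pid < 1 << 48  # stay within the small-integer budget
        self.exported[(peer, pid)] = obj
        return pid

    def resolve(self, peer, pid):
        """Look up what this peer's pairwise ID refers to locally."""
        return self.exported[(peer, pid)]
```

Note that the same integer means different objects on different channels; escaping the pair requires the upgrade step discussed above.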

The overall life-cycle cost for an identifier should include the cost of its construction, its distribution, and its maintenance (including upgrade, interest tracking, mobilization, and GarbageCollection). Starting with a pair-wise ID and eventually upgrading is more expensive overall than starting with a global ID, but one must weigh that against the probability that a pair-wise ID never requires the upgrade.
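That weighing reduces to a simple expected-cost comparison. The cost units and the sample numbers below are hypothetical, purely to illustrate the break-even point:

```python
def prefer_pairwise(c_pair, c_global, c_upgrade, p_upgrade):
    """Start with a pair-wise ID iff its expected life-cycle cost beats
    a global ID: c_pair + p_upgrade * c_upgrade < c_global.
    All costs are abstract units; p_upgrade is the probability the
    pair-wise ID ever needs upgrading to a global one."""
    return c_pair + p_upgrade * c_upgrade < c_global

# Hypothetical illustration: suppose an upgrade costs 10x a global ID,
# but only 5% of pair-wise IDs ever need one - pair-wise still wins.
# At a 20% upgrade rate, starting global would have been cheaper.
```

The interesting variable is p_upgrade, which locality of reference (per the earlier >95% figure) tends to keep very small.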

View edit of November 3, 2009 or FindPage with title or text search