Distributed Garbage Collection (DGC) is GarbageCollection
for a DistributedSystem
Whether an object should be destroyed is provably undecidable in the general case (e.g. given partial failures). Thus, it must be assumed that any DGC protocol will have errors. Keeping the error-rate below an acceptable threshold with minimal overhead is the goal for a good DGC protocol.
- partitioned networks and node failures: must deal with a highly partial disconnected network topology where remote references may be inaccessible for unpredicted amount of time.
- mechanisms typically determine the reachability of the remote objects by means of communication between the nodes involved. However, in a mobile setting where disconnections are the rule rather the exception, the system cannot determine how long it should wait for a connection to be restored. Knowing when the communication will be restored depends on how the application reacts to disconnections. Some applications may wait for the connection to be restored to resume its computation while some others may continue their computation with a substitute service instead. Sometimes the geographic location of the devices or identify information may also influence the behaviour of mobile applications.
- security: In the general case (as for ObjectCapabilityModel in an open system), you cannot ask which identifiers another machine has, and you cannot directly tell them which identifiers you have, and you cannot tell where you might have sent identifiers you once had.
- thus depends on
- the object graph
- the context in which the objects are themselves i.e
- the semantics of the application and
- the role of the reference in the network
- the implicit context such as the location of devices
- Rearranged from
If a distributed programming system is designed to heal after node-level failures, then the garbage collector can leverage this by destroying a distributed object heuristically
if it hasn't been used in a while. Upon later discovering that the destruction was in error, use the self-healing mechanisms to restore it. Protocol and service-layer self-healing may also be leveraged in higher layers.
Partition the Identifiers:
From locality of reference, we know that most memory objects (>95%, easily) are never directly referenced from external machines, and so may be garbage-collected locally. Leverage this: use 'intermediate-scale' identifiers rather than jumping straight from local IDs to global IDs. For example, pair-wise IDs or domain-local IDs could be used. IDs with limited distribution allows more localized coordination to determine reachability, and is likely to cover most GC cases in practice... and they also reduce transport and security costs within the domain. Depending on communications patterns (especially the common use of request/reply patterns, future transport, and client-server distribution patterns) pair-wise IDs could easily
save another 95% or more, simplifying DGC by limiting or more explicitly tracking distribution.
- Some comments and replies moved to (page_anchor: opaque and pairwise)
By nature, any form of DGC must be cooperative, so make it even more so! If Alice wants to keep a particular globally identified object around on Charlie's machine, then she says "Hey, Charlie, would you hold onto that object for me?". That way Charlie doesn't need to guess - especially useful when Charlie doesn't know Alice (i.e. if Alice obtained that reference through Bob). A minimal cost would be a simple timestamp per object (a 'do not expire before' date) such that Alice could speak up again if an object is about to expire. More complex approaches may keep more information for a number of reasons (low-latency collection, process accounting, system administration), but even the overhead for full reverse-links or explicit interest-groups would be quite acceptable for objects of low-to-medium distribution. Usefully, support for subscriptions and other constructs may serve well as implicit programmer interest. An overlay network that can perform intra-node caching or subscription multi-cast may also reduce overhead for interest tracking (quashing unnecessary requests to extend the lifetime before they reach the primary hosts).
Combine with Explicit Destruction:
Allow programmers to explicitly destroy certain objects. Allow programmers to mark survival of some objects as contingent upon the survival of others. This can be a tool for process control, capability revocation, and more. This would further reduce need for DistributedGarbageCollection
; fewer collections would mean fewer heuristics failures and improved robustness.
DGC is Not. Worth. It. Once a reference to an object is communicated outwards beyond a domain over which you possess full control of communications, it is theoretically impossible to track all places it might reach... and you probably lack the authority to even ask or search. DGC beyond your domain violates security principles. However, even if you possessed the necessary authority, a network search is not scalable, especially with any additional requirements for disruption tolerant networking. IIRC, it has been proven that with certain assumptions about the network (involving disruption and partitioning) that DGC is totally undecidable... you can't -ever- decide whether something ought be deleted.
Information or objects should disappear when people with the authority to make them disappear want them to disappear (for whatever reason) and even that cannot be performed reliably due to the casual ability to replicate information. If nobody wants the information gone, it can hang around forever and be cached away to more permanent but slower storage as it falls into disuse. Available storage, after all, increases at a greater rate than original content to store, after all, even on the time dimension. Local garbage collection is a different matter... and should be performed normally because, unlike a network (to which one can always add storage), individual devices suffer limited storage.
Better would be a system where you simply demand that those remote users interested in keeping an object be sure to tag it to indicate their interest (similar to a subscription, possibly possessing an expiry clause, but not necessarily involving outwards communications when the object changes or is otherwise processed... still, essentially a reversed link). When it comes time to clean up space (if you must) simply remove some of the old stuff that nobody has tagged as interesting and that has fallen into disuse... much like removing dead threads in a forum. Anybody who cares will have a powerful social-engineering pressure to make that concern known OR to replicate the data for their own use (as security dictates). Since any network protocol must ultimately depend upon social engineering of the agents involved, might as well make it explicit.
does not mean "universal uncontrolled collection of garbage". It is here mentioned in the context of NetworkAsComputer
I never said anything about uncontrolled, but either way it doesn't matter. It still comes down to an issue of domain and social engineering of agents. If references are legally allowed to leave a domain, you simply can't collect those with a DGC... not without violating the very purpose of garbage collection in general (to recycle those resources that are no longer of potential use to any legal user of the resource). If you do have control of the full network across which you are performing collection, then you do some collection, but even in that case, DGC can be undecidable.
I assume what was meant is: Distributed in the sense of connected by network with failure modes and delays that have to be taken into account. Not more, but also not less. In this setting the object never leaves your domain. Maybe the distributed system (might simply be a large distributed VM like MozartOzLanguage
) runs on a server farm owned and controlled by you. If mobile or other remote devices are part of the system the untrusted/unowned connections have to be secured of course.
A server farm (any sort of cluster or grid) isn't generally considered 'distributed', though it certainly is concurrent. It's an informal definition, but I don't consider a system truly 'distributed' until it becomes impractical for any one agent to know the name of or talk directly to every other agent (a quality that is, perhaps, corollary to LeslieLamport?'s statement: "A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable."). Handling garbage collection in a concurrent environment is quite difficult enough. Anyhow, yes, if you do have authority to hunt down every reference to an object (because you own the domain), AND you can handle the concurrency problem (communications delays, references to an object added or deleted from nodes during collection), AND you can handle node and link communications disruption problems (nodes become inaccessible for indeterminate periods of time), THEN you could probably do some garbage collection in a distributed environment.
The name 'DistributedGarbageCollection' would additionally imply that the service itself is distributed such that a failure or partitioning of a subset of nodes performing DGC doesn't (entirely) halt the collection process, doesn't cause deadlock, doesn't cause livelock, and doesn't leave the system in an undefined state. However, I think this topic here was in reference only to collection of resources in a distributed computation environment.
Considering how difficult it is to perform simple distributed transactions with rollback in a distributed environment, I remain greatly skeptical that DGC will ever surpass those solutions that involve local agent decisions on whether to keep an object around or not.
Mozart uses local garbage collection and weighted object references (where weight is based on how -much- of the object you have; it essentially tracks whether remote references exist). It does not collect objects with weighted references, but might hand the weight off to another agent carrying a reference to that object.
RE: Partitioned Identifiers (page_anchor: opaque and pairwise)
I agree that [partitioning the identifiers] optimizes lots of cases. But the programmers should not care about it as long as he doesn't care. I mean IDs should be of one opaque type, the system would choose (and switch between) a suitable implementation and the programmer might a best use some API to read-out or influence this.
Agreed. For simplicity, remote identifiers should be of one opaque type from the programmer's perspective. In the general case the programmer doesn't work with 'identifiers' at all, but instead the protocol module emits 'proxy' objects that are (to the type-system and most programmer interactions) indistinguishable from local objects. EeLanguage
does distinguish 'far refs' from local refs to make reliability concerns more explicit. My favored design treats all objects as remote and concurrent, which supports explicit destruction (mentioned below), process accounting (e.g. shut down an activity with well-defined semantics for failed outgoing messages), and automatic distribution and replication.
I don't know what you mean by pair-wise IDs. I imagine something like one Id for both sides of a reference (my objects here has ID x which is referenced as ID y over there). Could you tell me more?
Assume Charlie needs to share an identifier to object O with Alice.
- Charlie creates a string ID for O, puts <ID,O> in a table, then sends ID to Alice.
- Alice, on her end, creates a proxy object O' that takes all incoming messages and routes them back to O in Charlie. This routing requires the ID that Charlie initially gave Alice, so that Charlie can find ID in the table.
- Additionally, Alice must do a little extra so that she knows when her O' is no longer in use and that global objects with ID may be collected. This requires some support from the local GarbageCollection mechanism, such as weak references or triggered destructor events - nothing particularly specialized.
- The use of tables partitions the GarbageCollection efforts: local GC is used to collect objects and proxy objects normally, then DGC is used to maintain the protocol routing-tables at a much more sedate pace.
- The routing becomes slightly more complicated when Alice later wishes to forward her O' to Bob. It is this complexity that distinguishes a 'global ID' from a 'pair-wise ID'.
- With a global ID - such as a UniformResourceIdentifier - Alice simply gives Bob the same global ID. At that point, Bob can talk to Charlie directly. However, this has implications for GarbageCollection, since Bob now knows about Charlie's objects, but Charlie doesn't know that Bob knows. Figuring out who holds references to a URI is very difficult.
- With pair-wise ID, only Alice may use her ID to access a particular object on Charlie. Essentially, there is a specific table just for Alice's IDs inside Charlie, so <ID,O> is unique to Alice and only messages from Alice can directly reference it. When forwarding O' to Bob, Alice has several options:
- Option 1 (chaining): create a new pair-wise ID between Alice and Bob. This makes all messages from Bob to object O route through Alice, which has negative performance and reliability implications, but might not be so bad so long as the chains are kept shallow or can be shrunk based on demand. (The chaining approach also works across different protocols and standards.)
- Option 2 (forwarding): Alice asks, "Charlie, can you forward my object X to Bob?". Charlie responds by reading <X,O> from Alice's table, then writing <Y,O> to Bob's table, then replying "Alice, Bob's object Y is the same as your object X". This requires a round-trip to Charlie to forward objects, and also requires Charlie be online at the time.
- Option 3 (certification): Alice hands Bob a certificate that reads: "Charlie, Bob gets access to my X. Signed, Alice". Bob then asks Charlie for Alice's X and also hands over the certificate, at which point Bob can ask for a more personal ID for the same object. For GarbageCollection, Alice becomes the responsible party for knowing to whom she forwards the messages and holding id X on behalf of Bob.
- Option 4 (upgrade): Alice or Charlie decides that the distribution of references to O is wide enough to expect extensive distribution in the future, such that it's worth upgrading to a global ID. The transition to the global ID can be lazy in many cases. However, new distribution uses the global ID.
As far as costs go, a 'global' ID needs to avoid naming collisions and ideally is unguessable (ObjectCapabilityModel
-secure). In practice, this isn't difficult to achieve in about 64 characters per ID, including a 160-bit base-32 of near-random object identifier plus another 160 bits to identify and fingerprint a host (fingerprint is a alternative to certificates that protects against man-in-middle attacks but hurts for human-readability). The costs to create a new ID are tiny compared to the costs of distributing it or GarbageCollection
. Multi-homed identifiers to support reliability and performance can grow somewhat more expensive.
The pairwise IDs, meanwhile, don't need to to be unguessable since there is only one recipient who has full rights to access all of the objects given to it. Presumably, they also don't need multi-homing. Thus, one can get by with 48 to 60 bit integers for pairwise IDs, and a simple increment. This may actually result in very considerable performance savings if a large portion of IDs never escape pairwise reference. However, transport outside that pair is much more expensive, requiring round-trips or certificates or distributed upgrade. There is also a total server-space cost proportional to the number of hosts with a reference to the ID (a condition that doesn't scale).
The overall life-cycle cost for an identifier should include the cost of its construction, its distribution, and
its maintenance (including upgrade, interest tracking, mobilization, and GarbageCollection
). Starting with a pair-wise ID and eventually upgrading is more expensive overall than starting with a global ID, but one must weigh that against the probability that a pair-wise ID never requires the upgrade.