Describe your experiences with distributed programming here. Stories about non-distributed systems becoming distributed, distributed systems becoming non-distributed, and scalability / performance issues are particularly interesting.
One of the projects I've worked on was developed with BigDesignUpFront. The team decided from the beginning that it was going to be written in Java, use CORBA, and be distributed across five machines.
This was a team whose prior experience was solely with C. Yes, the project failed. Umm... "failure" might be too kind. It crashed, burned, and careered out of control, taking huge chunks of the landscape with it.
That's all beside the point, though. The point is that, when we finally integrated this beast and ran its multiple processes, we found that only one of the processes was taking a significant amount of CPU time. Everything else blocked on that process. There was no need for distribution at all! If we'd had a simple design and only one CPU, we would have been just fine. The program would have been done a lot sooner, too.
But the project would have failed anyway. -- JimLittle
Another project I've worked on was also developed with BigDesignUpFront. The team decided from the beginning that it was going to be written in Java, use an in-house distribution protocol, and be client-server with a fat client.
Distribution was identified as a big risk from the beginning, and one of the first things we worked on was to get the application running in its distributed state. It took several weeks to do so, but we finally did it.
Distribution remained a major pain in the patooshka for the remainder of the project: Only one person really understood how distribution worked, and little bugs were constantly creeping into the distribution mechanism. At the end of every (month-long) iteration, it took us several days to get the application deployed and working.
Eventually, we wrote an automated deployment script and UnitTests that exercised every single remote method. After that, distribution was no longer a hassle.
One thing that we did which I think was a good idea was to include a "DistributionSwitch." This was a flag that allowed us to run the application in a single process when we were developing. It made it much easier for the developers to run their manual tests and debug. If we had been using automated UnitTests from the beginning, the DistributionSwitch would not have been so useful.
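In sketch form, the idea looks something like this (all the names here - OrderService, createOrderService, and so on - are invented for illustration, not from the actual project): the client code depends only on an interface, and the switch decides whether a factory hands back the in-process implementation or a remote proxy.

```java
// Hypothetical sketch of a "DistributionSwitch". None of these names come
// from the original project; they only illustrate the pattern.
public class DistributionSwitchDemo {
    // The service interface is the same whether the call is local or remote.
    interface OrderService {
        String trackOrder(String orderId);
    }

    // In-process implementation, used when the switch is off (development mode).
    static class LocalOrderService implements OrderService {
        public String trackOrder(String orderId) {
            return "status of " + orderId + ": IN_TRANSIT";
        }
    }

    // In a real deployment this would wrap a CORBA/RMI lookup; here it is
    // only a placeholder showing where the remote plumbing would go.
    static class RemoteOrderServiceProxy implements OrderService {
        public String trackOrder(String orderId) {
            throw new UnsupportedOperationException(
                "remote lookup not wired up in this sketch");
        }
    }

    // The DistributionSwitch itself: one flag picks the wiring.
    static OrderService createOrderService(boolean distributed) {
        return distributed ? new RemoteOrderServiceProxy() : new LocalOrderService();
    }

    public static void main(String[] args) {
        // Flip the switch off and the whole app runs in one process.
        OrderService service = createOrderService(false);
        System.out.println(service.trackOrder("42"));
    }
}
```

The point is that the client code never knows which side of the switch it is on, so manual testing and debugging happen in one process while the deployed system stays distributed.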
I'm contributing this as a counterpoint to my story above: designing in distribution up front doesn't always fail. In hindsight, though, we would have been better off if we had just used an HTML thin client rather than building our own distributed client. -- JimLittle
My coworker at a company that produced educational software converted a monolithic GUI teacher's Macintosh application into a client app and a server app (both Mac apps) by essentially cutting it in half, taking advantage of the communication protocol we were already using for student-to-teacher communication. This app had no design-up-front (nor OO), but the split into client-server took only one or two weeks. -- KeithRay
This seems like a variation on HalfObjectPlusProtocol.
Distribution: Book One - A magical first encounter
I entered the world of distribution with OliverSims as my guide.
I was taught to see how a distributed environment made the world of enterprise software simpler, not more complex. How so? Well, of course we had some problems to solve back in the early 90s - connections, communication, translation (ASCII/EBCDIC, big-endian/little-endian), naming, location, performance - but these were all solvable. Oliver would say that if something is possible in theory but not in practice, try harder. We built a new world infrastructure (Newi), now obsolete; application servers now provide a standard way of doing this. We passed our data as Semantic Data Objects; XML now provides this flexibility. We built business objects (components, as they are called today) on PCs, AS/400s, and AIX boxes, in C, in C++, in COBOL. We dropped new objects into running systems and they collaborated in unanticipated ways, and it was good.
The value gained far outweighed the cost of building this infrastructure. Client software could focus on the needs of a single user. Server software could focus on maintaining the integrity of shared resources and providing reusable, complex enterprise business functionality. A business component was the output of the development process. We had broken the monolith. We built robust, small, black-box components; they could be configured in various topologies and independently maintained and extended. Client applications could be quickly re-configured and new ones assembled, reusing existing enterprise behavior. Oliver mentored us to build systems where each component could independently evolve whilst maintaining interoperability. Organic!!
Too cool, too soon: Integrated Objects was sold and Newi was gone. You may have come across it at an OMG meeting or read about it in The Essential Distributed Objects Survival Guide (ISBN 0471129933). The ideas live on.
Distribution: Book Two - A simple business solution
My second story I'll keep short. It was short: I was at the client for 3 months. The first month was project discovery, then two months to build, test, and deploy (we used weekly deployment increments; XP had not been published at this time). The client was Bekins, a moving and storage company.
Distribution helped even during the project discovery stage. By identifying groups of business functionality (business components to us techies; session EJBs in J2EE terminology), e.g. Inventory, Order_Entry, Customer_Details, Tracking_Services, Warehouse_Management, Order_Pricing, we were able to communicate scope and complexity in terms that were meaningful to both business and technical groups. We arrived at a solution with buy-in from both technical and business teams - Order Tracking.
During the development, a layered architecture was used to manage complexity. We also used the DistributionSwitch, which enabled us to develop the user interface and server components in parallel.
This project was quite a success - winning the Application Development Trends Innovator Award for e-business deployment in 2000.
Read more about it at:
See the architecture at:
Distribution: Book Three - A repetitive disconnected nightmare
My journey now enters the dark side: complexity and fragmentation. Working as an Architect for Cambridge Technology Partners, and afterwards, I encountered several large distributed projects. Several repeating patterns were observed:
- Too much time spent by skilled technical staff trying to become experts in the many emerging technologies (Java, EJBs, HTML, XML, JNDI, LDAP, legacy connections, etc.)
- Too little time spent on determining which technology was the best fit
- Too little time spent on collaboration between teams
- All developers exposed to many tools and technologies (app servers, web servers, Java IDEs, XML editors, source control, modeling tools, legacy connectivity APIs, Swing, applets, servlets, JSPs, LDAP, EJB builders, OO-RDBMS libraries, security APIs, etc.)
- Dealing with several paradigm shifts at once: OO, Component based development, and distributed development
- Duplication of effort between teams due to lack of integration techniques and communication
- Unclear division of responsibilities and ambiguous interfaces between teams and components
- Disconnect between technical teams, management and business stakeholders
- No integration plan
- Lack of common vision/goals
Some things that could have helped but were missing:
Dark times, unhappy teams, 60-hour work weeks. TheCostOfInefficiency was huge: millions lost, projects failed.
In summary, the business opportunities were there. Simple solutions could have been found. The technologies and the talent were available. The problem: teams were fragmented, and synergy was never found. Entropy won the day.
Distribution: Book Four - A promised land
Coming soon - I hope.
There is a better way. The AgileAlliance has laid out some of the principles.
I have this theory going that RPC-based distributed systems are a failure, with the only things that could be called successes being NFS (the famous Network Failure System) and its sibling, NIS (the almost as famous Network Insecurity System). While these have been well accepted and are now somewhat industry standards, they have more than their fair share of problems, as denoted by their nicknames.
So I would like to see if there are any stories of big distributed successes using RPC-style protocols (synchronous remote function/procedure/method calling) that would prove me wrong.
I'm starting to think that asynchronous message-passing might be a more natural fit for distribution than I first thought. It is less transparent, but I now think that distribution is a different enough thing that having it done transparently is going to cause more problems than solve them. -- PierrePhaneuf
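For what it's worth, here is a minimal sketch of the contrast, using plain Java threads and queues rather than any real messaging middleware (all names are invented for illustration): the caller posts a message and chooses its own timeout, so a slow or dead server shows up as an explicit TIMEOUT rather than a mysteriously hung call that only looked local.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// A toy sketch of asynchronous message passing (invented names, in-process
// queues standing in for the network). Unlike transparent RPC, the caller
// never blocks indefinitely inside a "remote call".
public class AsyncMessagingDemo {
    // A request message carries its payload plus a queue for the reply.
    record Request(String body, BlockingQueue<String> replyTo) {}

    // Send a request, then wait a bounded time for the answer.
    public static String askWithTimeout(BlockingQueue<Request> server,
                                        String body, long timeoutMs) {
        try {
            BlockingQueue<String> replies = new LinkedBlockingQueue<>();
            server.put(new Request(body, replies));   // send; do not wait here
            // ... the caller could do other useful work at this point ...
            String reply = replies.poll(timeoutMs, TimeUnit.MILLISECONDS);
            return reply != null ? reply : "TIMEOUT"; // failure is explicit
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return "INTERRUPTED";
        }
    }

    // A trivial "server" that echoes each request back to its reply queue.
    public static Thread startServer(BlockingQueue<Request> inbox) {
        Thread server = new Thread(() -> {
            try {
                while (true) {
                    Request r = inbox.take();
                    r.replyTo().put("echo: " + r.body());
                }
            } catch (InterruptedException e) {
                // shut down
            }
        });
        server.setDaemon(true);
        server.start();
        return server;
    }

    public static void main(String[] args) {
        BlockingQueue<Request> inbox = new LinkedBlockingQueue<>();
        startServer(inbox);
        System.out.println(askWithTimeout(inbox, "hello", 1000));
    }
}
```

The design point is that distribution stays visible: the send, the wait, and the failure mode are all decisions the caller makes, rather than surprises hidden behind a procedure-call syntax.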
At one of my old companies (a large telco) they had DCE widely deployed. Whether this was 'successful' is questionable - there was so much inertia in getting new services deployed (partly due to BigDesignUpFront, but also to real worries that the CapacityPlanners had about volumes of RPCs on the mainframes) that people would resort to ScreenScraping (which meant the capacity planning was being bypassed, as well as being less reliable). Eventually the DCE licenses ran out, and when I left they were in the process of porting every RPC to CORBA (on BEA WLE). This meant that all requests for providing services via the middleware were put on hold for 1-2 years (it was a very big telco), meaning more people were resorting to hacks, bodges, and screen-scraping to get real work done. Plus ça change.
The RPCs that did work worked well and reliably, but the mess they'd got into with heavyweight QA made everyone avoid middleware like the plague. The main alternatives were using the mainframes directly, with a little office automation, and building intranet web-based interfaces. Hardly a joined-up enterprise, but they both delivered, fast, at a fraction of the cost (e.g. £60k for a tactical web solution vs. £1.5M for the 'strategic' RPC-based solution).
I was involved in one very successful RPC project at that telco - an early piece of Java/CORBA work. Here CORBA allowed us to build a better UI than the web allowed and to move some of the processing onto the client. However, it would have delivered much earlier if we'd just bought the users X clients and moved some of the logic into the database! I would consider X the biggest RPC success. -- AnonymousDonor
Ah, X is a better example, but I would be inclined to write it down as a message-passing system, or more appropriately, a hybrid. Messages from the X client to the server are called requests and messages from the server to the client are called events, and there are a few round-trip messages that map more closely to RPC (the hybrid part), but these are only a small minority (the bulk of an X application's communication is asynchronous message passing). -- PierrePhaneuf
I spent a couple of years leading a team building a distributed system. The biggest problem we faced was that we had a large quantity of real-time data being generated every second, and the managers wanted to minimize the amount of network bandwidth that they needed to buy. But they also required us to use CORBA, which is not designed for high-frequency/low-bandwidth applications. So, most of our efforts went into minimizing the number of network transactions and squeezing as much data as possible into each transaction. The entire architecture is designed around this set of principles.
Now that the system is in place and considered successful, they have decided that they will buy lots and lots of bandwidth. "What would you implement differently if bandwidth wasn't a constraint?" they ask. Everything!!!!
We've built a distributed/clustered platform for internet appliances here at Nortel. The system is pretty cool, it:
- Gives a "single system image" for operation and maintenance, meaning that all configuration changes, software upgrades (from patches to new OS images), etc. are done as if there is only one machine, even if there are really 200. (NB: this works, too :-))
- Makes it easy to add a new machine: just plug it in, tell it the IP address and password of one you already have, and it automatically installs the software and configuration and is "ready to roll".
- Handles failures, and makes sure that the O&M IP address is always online.
- Heaps of other fun stuff
This is deployed absolutely all over the place in our SSL-offload and switched-firewall products (more on the way). The platform is written in the fantabulous ErlangLanguage (with heavy use of the MnesiaDatabase), developed and maintained by on average about 10 people (mostly working with it rather than on it), and has been in use and evolving for about 4 years (I've been here for three myself, and it's been a pleasure to hack).
More info links:
What I'm trying to say is - throw away this CORBA/other-horrible-nasty-stuff and use Erlang - it's a thousand times better :-). (Even if your actual application is written in another language, Erlang is spectacular for the distribution) -- LukeGorrie
Here's an example outside of the business world: I worked on a project that used a TI DSP to preprocess data from a line-scan camera, sent it downstream to one of 4 DSPs on another number-crunching card, and finally to a PC that did the GUI and the final evaluation of the data. There was no OS, so we got to hand-code all the interrupt handlers and communication routines in C40 assembly. I think in some ways that made it easier, because we knew exactly what would happen, and the system was small enough that we didn't really need an OS.
Anyway, the scalability was close to linear. Ignoring mechanical considerations, we could have run the machine twice as fast with 8 downstream DSPs. Building a non-distributed system wasn't practical, especially given the low cost of DSP MIPS when doing things DSPs are made for. The biggest hurdle was debugging: although we had an ICE, it was really hard to debug without slowing the system down enough to hide the bugs.
The biggest problem with distributed systems, in my experience, is that it's just too hard to anticipate exactly in what order things will happen. With a single CPU you can still reason about how threads will interact, but with many CPUs it gets much harder. In some cases you can synchronize the CPUs just as you would synchronize threads, but in other cases that would defeat the purpose of a multi-processor system.
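A toy illustration of that ordering problem (invented names, with plain Java threads standing in for separate processors): two independent senders race into a shared queue, and only the set of messages that arrives is guaranteed, never the order.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Two "nodes" (here, threads) each send one message to a shared inbox.
// Which message lands first depends on scheduling, so the order can differ
// from run to run even though the contents never do.
public class OrderingDemo {
    public static List<String> runOnce() {
        BlockingQueue<String> inbox = new LinkedBlockingQueue<>();
        Thread a = new Thread(() -> inbox.add("from A"));
        Thread b = new Thread(() -> inbox.add("from B"));
        a.start();
        b.start();
        try {
            a.join();
            b.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        List<String> received = new ArrayList<>();
        inbox.drainTo(received);
        return received; // could be [from A, from B] or [from B, from A]
    }

    public static void main(String[] args) {
        System.out.println(runOnce());
    }
}
```

With two threads you can at least enumerate the interleavings; with many machines, clocks, and networks in between, that enumeration stops being feasible, which is why forcing a total order (synchronizing the CPUs) throws away the parallelism you distributed for in the first place.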
As to BDUF - in my case there was quite a bit of it, since the project involved custom software, custom hardware, and custom mechanics, plus heaps of off-the-shelf stuff as well. I think the more non-software components your system contains, the less likely you are to escape BDUF. On the other hand, I think a hybrid project like this forces everyone involved to be smarter about requirements and about changes halfway through the project. In many, many cases, the only reason BDUF fails is the "oh, it's only software" mindset.