Remote Data in Drupal: Museums and the Web 2009

by Larry Garfield

At Palantir, we frequently work with museums and universities that need to integrate large legacy data systems.

In the case of museums, that's usually a Collections Management System and/or a Digital Asset Management (DAM) system of some kind, holding information on thousands or hundreds of thousands of works in the museum's collection. In the case of a university, that's usually course information and availability coming from a registrar system. As a result, we've gotten a disturbing amount of experience integrating with oddball 3rd party systems. At last week's Museums and the Web 2009 conference in Indianapolis, Tiffany Farriss and I ran a workshop on remote data strategies based on the work we've done on several different museum sites. Of course, as a pre-conference workshop, attendance wasn't that big (even though we did fill the room), so for those who weren't at the conference we decided to put the information up online. The slides themselves aren't that useful, but consider this the novelization of the presentation.

For those playing our home game, there's sample code provided at the end. It uses Amazon as a remote data source, and requires just the base amazon.module.

Integrating 3rd party data is a "killer feature" for many sites. At the Data API Design Sprint in February 2008, we identified better integration of remote data as a major goal for Drupal's next-generation data model. Although we're not there yet, there have been improvements made for Drupal 7 and even in Drupal 6 there are many ways to go about working with 3rd party non-node data.

The key is to understand how to leverage Drupal's multiple personality disorder. Is Drupal a CMS or an application framework? The answer, as anyone who's worked with Drupal heavily knows, is "both". That's what makes it so incredibly flexible, and allows for several different ways of handling remote data. We'll look at three in detail.

I should also note that we are assuming that the legacy database is definitive for its data, and already has some way of modifying that data. That means we only need to deal with read operations, not write operations. While you could certainly implement write operations if you wanted to, in most cases you actively don't want to. Bi-directional data synchronization is a really thorny problem and if we can avoid it so much the better.

The obligatory block diagram

Drupal's general architecture, as of Drupal 6, looks like this:

[Diagram: Drupal 6 architecture overview]

On the left we have Drupal's major "app framework" subsystems. There are no doubt others, but those are the most important ones for this discussion. In the middle, we have Drupal's node system, which forms the basis of Drupal's CMS personality. It of course leverages the app framework systems, but most importantly provides a common base to which to attach other "node-related stuff", such as comments and taxonomy in core, various other contrib-based behaviors and fields (the light blue), and of course the ever-popular CCK. The ability to value-add to nodes in parallel from a variety of modules is one of Drupal's strongest features, but in most cases it really doesn't handle non-local data all that well.

And of course there's room for zillions (technical term, somewhat larger than a bajillion) of other contrib modules, both those that attach to nodes and those that do not. For now we're just showing everyone's favorite, Views.

Nearly all of that data, of course, lives in the local SQL database, and is displayed through Drupal's theming system.

Method 1: Direct Access

The first way of dealing with remote data is to leverage Drupal the Application Framework, and essentially ignore the node system entirely. That's not as bad as it sounds; Drupal's a much better application framework than some give it credit for, and even with just FAPI, the database layer, and the theme system it's possible to replicate nearly any functionality you want.

That's exactly what we did for the Art Institute of Chicago. Full details of its implementation are available on Drupal.org, so for now we'll focus just on the basics of the remote data integration. The overall approach looked something like this:

[Diagram: direct access architecture]

The yellow parts are the 3rd party system, and/or custom code. The general approach is actually quite simple. First, you build a web-services API of some sort (or just provide direct MySQL access) to your third party system. This is really the hardest part, because not all 3rd party legacy systems are created equal. Sometimes they come with really slick SOAP front-ends built in. Other times you have to custom build your own system. In nearly all cases it requires system-specific code. It also requires having a reasonably fast connection to the legacy system, for some definition of "reasonably fast". Because Drupal doesn't actually "know" about the 3rd party system itself, all searching and querying behavior needs to be executed on the legacy system. Drupal just needs an API for how to query it.
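
As a rough sketch of what that query API can look like from Drupal's side, here's a hypothetical wrapper around an XML-over-HTTP search endpoint. The endpoint URL, parameters, module name, and response format are all made up for illustration; a SOAP-based system would use PHP's SoapClient here instead, and a direct-MySQL setup would use a second database connection.

<?php
// Hypothetical query layer over a legacy collections system that exposes a
// simple XML-over-HTTP search endpoint. Everything system-specific lives
// here, so the rest of the Drupal code only ever calls mylegacy_search().

define('MYLEGACY_ENDPOINT', 'http://collections.example.org/api/search');

/**
 * Run a keyword search against the legacy system; returns an array of record IDs.
 */
function mylegacy_search($keywords, $limit = 20) {
  $url = MYLEGACY_ENDPOINT . '?' . drupal_query_string_encode(array(
    'q' => $keywords,
    'limit' => $limit,
  ));
  $response = drupal_http_request($url);
  if ($response->code != 200) {
    watchdog('mylegacy', 'Legacy search failed with code @code.', array('@code' => $response->code), WATCHDOG_ERROR);
    return array();
  }
  // Assume the endpoint returns <results><record id="123"/>...</results>.
  $ids = array();
  if ($xml = simplexml_load_string($response->data)) {
    foreach ($xml->record as $record) {
      $ids[] = (int) $record['id'];
    }
  }
  return $ids;
}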

Perhaps the most important piece of this approach is the "data entity". While you are not using nodes, you still want to have a clear separation between your data object and your business logic. That's just good software design in general, and when you don't do it the Ruby on Rails crowd makes fun of you. Object-oriented techniques work really well here, especially if you want to do lazy-loading, but are not required. What is practically required is having a dedicated load mechanism to pull up a data object that you can then render. Essentially, you're paralleling the way nodes work. There's a data object that gets loaded from a data source, you can optionally have hooks to do other fun things to it, you can use that data to generate forms using FAPI, and you can display it to the screen using drupal_render() and the theme system.
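
A minimal sketch of such a load mechanism, assuming the module is called mylegacy and that mylegacy_fetch_record() stands in for whatever system-specific call (SOAP, REST, direct MySQL) actually retrieves the raw record:

<?php
// Parallel to node_load(): a dedicated loader with a static cache so repeated
// loads in one page request hit the remote system only once. The hook it
// invokes lets other modules decorate the data object, node-style.

/**
 * Load an artwork data object from the legacy system.
 */
function mylegacy_artwork_load($id) {
  static $artworks = array();

  if (!isset($artworks[$id])) {
    $raw = mylegacy_fetch_record($id);  // System-specific retrieval, assumed.
    if (empty($raw)) {
      return FALSE;
    }
    // Keep the object a plain data entity: no business logic attached.
    $artwork = new stdClass();
    $artwork->id = $id;
    $artwork->title = $raw['title'];
    $artwork->artist = $raw['artist'];
    $artwork->year = $raw['year'];

    // Optional: let other modules react, the way hook_nodeapi('load') does.
    foreach (module_implements('mylegacy_artwork_load') as $module) {
      $function = $module . '_mylegacy_artwork_load';
      $function($artwork);
    }
    $artworks[$id] = $artwork;
  }
  return $artworks[$id];
}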

If that sounds like a lot of work, it is. Or rather it can be. It could also be very, very simple depending on your use case. As a general rule, I'd say the hardest part is the search integration with the 3rd party system. The 3rd party system will almost never support the level of granularity and power in its search system that you want, simply because it was designed before your web project existed. (If that wasn't the case, you would have just put the data in Drupal, wouldn't you?) There's roughly a 1:1 relationship between the amount of functionality you want and the amount of work required to make the FAPI and theme system do it. The fancier you get the more work it takes but the process is still fairly straightforward, especially if you know how to use the #theme key of the Form API. (If you don't, look into it. You won't be sorry.)
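
For instance, here's the kind of thing #theme lets you do with a search form. The form, theme, and module names are again just for illustration, and the submit handling is omitted.

<?php
// Hand the whole search form to a custom theme function via the #theme key,
// so the markup can be whatever the design calls for.

function mylegacy_search_form($form_state) {
  $form['keywords'] = array(
    '#type' => 'textfield',
    '#title' => t('Search the collection'),
  );
  $form['submit'] = array(
    '#type' => 'submit',
    '#value' => t('Search'),
  );
  $form['#theme'] = 'mylegacy_search_form';
  return $form;
}

/**
 * Implementation of hook_theme().
 */
function mylegacy_theme() {
  return array(
    'mylegacy_search_form' => array(
      'arguments' => array('form' => NULL),
    ),
  );
}

function theme_mylegacy_search_form($form) {
  // Lay out the elements however we like, then let drupal_render() pick up
  // whatever is left (form ID, token, and so on).
  $output = '<div class="collection-search">';
  $output .= drupal_render($form['keywords']);
  $output .= drupal_render($form['submit']);
  $output .= drupal_render($form);
  $output .= '</div>';
  return $output;
}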

On the plus side, this method gives you an incredible amount of flexibility, especially if you properly separate your data layer from the FAPI and theme code. You also get up to date data with no lag, unless you integrate caching yourself for performance. (There's also the page cache, of course, but that's no different than dealing with nodes.)

On the down side, you may need to write a fair bit of custom code this way; most of Drupal's click-and-grin functionality can't handle non-node non-local data. That also means you don't get access to the enormous archive of node-value-add modules available in contrib.

Method 2: Lazy bridge nodes

The second option builds on the first mechanism. You still need direct access to the 3rd party data source, but rather than displaying it yourself you wrap it into a thin node. You don't want to pull the entire dataset into Drupal at once, though, because that's a ton of nodes, so you lazy-create the nodes on the fly the first time they're viewed. You can see this approach in action on the Indianapolis Museum of Art web site, with full details on Drupal.org.

It looks something like this:

[Diagram: lazy bridge node architecture]

Note that we still have the same access layer to the 3rd party database (via SOAP, direct MySQL, or whatever), the same data entity, and the same dedicated search mechanism. However, rather than having users view a custom page that we build ourselves, they simply view a custom node. That custom node type has a field (generally not a CCK field, but it could be done that way) that contains the ID of its corresponding object in the 3rd party system, say an artwork ID. When the node is loaded, its hook_load() callback loads the data entity we created earlier based on that ID. Then in hook_view() we pull data out of that entity and stick it into the $node->content array as desired. That's exactly how the node API normally works, except that we're usually pulling data from the local SQL database. There's no requirement that you do so, however.
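
In Drupal 6 node API terms, and assuming the module also declares the artwork node type in hook_node_info() and keeps a {mylegacy_artwork_node} table mapping nids to legacy IDs, the two hooks might look roughly like this:

<?php
// Bridge-node hooks for a custom 'artwork' node type. These fire because the
// node type's 'module' key in hook_node_info() is 'mylegacy'; the table and
// field names are assumptions for this sketch.

/**
 * Implementation of hook_load().
 */
function mylegacy_load($node) {
  // Pull the legacy ID for this node, then the full remote data entity.
  $extra = db_fetch_object(db_query(
    'SELECT artwork_id FROM {mylegacy_artwork_node} WHERE nid = %d', $node->nid));
  if ($extra) {
    $extra->artwork = mylegacy_artwork_load($extra->artwork_id);
  }
  return $extra;
}

/**
 * Implementation of hook_view().
 */
function mylegacy_view($node, $teaser = FALSE, $page = FALSE) {
  if (!empty($node->artwork)) {
    $node->content['artwork_info'] = array(
      '#value' => '<div class="artwork-info">' . check_plain($node->artwork->title)
        . ' (' . check_plain($node->artwork->artist) . ')</div>',
      '#weight' => 1,
    );
  }
  return $node;
}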

Now that you have a node that represents the 3rd party object, you can attach other "stuff" to that node. There's no shortage of stuff to attach, too. Comments, taxonomy, CCK fields, ratings, location information: contrib has thousands of modules that can leverage nodes to do wild and crazy things. With this mechanism, you can (almost) have your cake and eat it too: remote data with local value-add capabilities.

How do we get that "bridge node" in the first place, though? The answer there is our custom search system. Rather than all search results linking to, for example, artwork/123, we first check the lookup table to see if there is already a corresponding node. If so, we link to that node from the search result. If not, we link to artwork/import/123 instead. That's a very small custom page callback that creates a new node object of our custom type, sets our custom field to 123, saves the node, and then redirects the user to the newly created node page. As far as the user knows, they just clicked a link and saw a node page with lots of information on it.
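
The import callback really is that small. A sketch, again assuming the {mylegacy_artwork_node} lookup table and the loader from earlier (in a real module, hook_insert() would write the new row to that table when the node is saved):

<?php
// artwork/import/123: create the bridge node on first view, then redirect.

/**
 * Implementation of hook_menu().
 */
function mylegacy_menu() {
  $items['artwork/import/%'] = array(
    'page callback' => 'mylegacy_import_artwork',
    'page arguments' => array(2),
    'access arguments' => array('access content'),
    'type' => MENU_CALLBACK,
  );
  return $items;
}

/**
 * Page callback: create a bridge node for a legacy artwork and redirect to it.
 */
function mylegacy_import_artwork($artwork_id) {
  // If someone else got here first, just go to the existing node.
  $nid = db_result(db_query(
    'SELECT nid FROM {mylegacy_artwork_node} WHERE artwork_id = %d', $artwork_id));
  if (!$nid) {
    $artwork = mylegacy_artwork_load($artwork_id);
    if (!$artwork) {
      return drupal_not_found();
    }
    $node = new stdClass();
    $node->type = 'artwork';
    $node->title = $artwork->title;
    $node->uid = 1;
    $node->status = 1;
    $node->artwork_id = $artwork_id;  // Written to the lookup table in hook_insert().
    node_save($node);
    $nid = $node->nid;
  }
  drupal_goto('node/' . $nid);
}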

The code for this method is fairly straightforward, but somewhat verbose due to the way the node API works. In practice we've found it easier to use a custom node type, but it could also be implemented on a core-defined node type using hook_nodeapi or a specially flagged CCK integer field. There are many ways to skin that kitten.

On the plus side, this method allows us to take remote data and value-add to it node-style, plus we get all of the advantages of direct access.

On the downside, it's somewhat complex. You also then have, potentially, two separate search systems. Drupal's own search system will know about those nodes that have been created, and whatever gets rendered for them, but won't know about the full archive. The custom search system will know about the entire 3rd party repository, but won't know about any value-add data that has been created in Drupal. Depending on your use case that may not actually be a problem, though. It wasn't for the IMA.

Method 3: Wipe/rebuild import

This method needs a new name. We tried to come up with something catchier but couldn't. Suggestions welcome. :-)

Both of the previous methods require fast, live access to the 3rd party data source. You don't always have that, however. The legacy system may not have a usable API, or the connection to it may be too slow. In some cases, such as a university registrar database for course data, allowing direct access may be against institutional policy. (Not that we speak from experience, no, not at all...) Or the 3rd party data source is Excel, or FileMaker, or MS Access 95, or some other "it worked when it was built 10 years ago" setup involving large amounts of duct tape. For whatever reason, the best you can get is a snapshot of the data.

Fortunately, that's frequently enough. If the data does not need to be accurate up to the minute, a cloned snapshot is fine. You can then write an importer for it; not to create nodes, but to create dedicated tables in the local Drupal database. Then, when you need to update the data, you simply rerun your export and import process, wipe away the old data, and rebuild it. That could be daily, weekly, or on any manual schedule. That's also why you don't want to use nodes for it: creating and deleting hundreds or thousands of nodes like that is horrifically slow, and since you don't want to allow editing of that content through Drupal there's really no need to make nodes out of them anyway.
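
A bare-bones version of such an importer, assuming the snapshot arrives as a CSV file and lands in a custom {mylegacy_artwork} table (the file format, table, and columns are all made up for this sketch):

<?php
// Wipe-and-rebuild importer. Hook it to cron, a drush command, or an admin
// form, depending on the schedule you need.

function mylegacy_import_snapshot($file) {
  $handle = fopen($file, 'r');
  if (!$handle) {
    watchdog('mylegacy', 'Could not open snapshot file %file.', array('%file' => $file), WATCHDOG_ERROR);
    return FALSE;
  }

  // Wipe the old data; the table is purely a read-only mirror.
  db_query('TRUNCATE TABLE {mylegacy_artwork}');

  // Rebuild it from the snapshot. Expected columns: id, title, artist, year.
  while (($row = fgetcsv($handle)) !== FALSE) {
    db_query("INSERT INTO {mylegacy_artwork} (artwork_id, title, artist, year)
              VALUES (%d, '%s', '%s', %d)", $row[0], $row[1], $row[2], $row[3]);
  }
  fclose($handle);
  return TRUE;
}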

But without nodes, how do you access that data? Have a look:

[Diagram: wipe/rebuild import architecture]

Once the data is in the local SQL database, properly normalized, the Views module takes over. In Drupal 6, Views can read nearly any SQL table and slice and dice it however you want; you just need to write the appropriate Views integration hooks. They can be verbose, but are not particularly hard. Depending on your use case you may also need some custom Views handlers, but those are also not hard once you understand the Views API.
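
As a taste of what those hooks look like, here's a minimal hook_views_data() for the table from the importer sketch above; a real integration would declare every column and probably a few custom handlers.

<?php
// Views 2 integration for the imported {mylegacy_artwork} table.

/**
 * Implementation of hook_views_data().
 */
function mylegacy_views_data() {
  $data['mylegacy_artwork']['table']['group'] = t('Artwork');
  $data['mylegacy_artwork']['table']['base'] = array(
    'field' => 'artwork_id',
    'title' => t('Artwork'),
    'help' => t('Artwork records imported from the legacy collections system.'),
  );
  $data['mylegacy_artwork']['artwork_id'] = array(
    'title' => t('Artwork ID'),
    'help' => t('The ID of the artwork in the legacy system.'),
    'field' => array('handler' => 'views_handler_field_numeric'),
    'argument' => array('handler' => 'views_handler_argument_numeric'),
    'filter' => array('handler' => 'views_handler_filter_numeric'),
    'sort' => array('handler' => 'views_handler_sort'),
  );
  $data['mylegacy_artwork']['title'] = array(
    'title' => t('Title'),
    'help' => t('The title of the artwork.'),
    'field' => array('handler' => 'views_handler_field'),
    'filter' => array('handler' => 'views_handler_filter_string'),
    'sort' => array('handler' => 'views_handler_sort'),
  );
  return $data;
}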

Once you have the entire dataset in Views, well, the sky's the limit. You can search the data locally, including faceted search (within reason), using Views. You can create arbitrary lists, random selections, etc. You can even create a single-record page with a simple argument handler: artwork/123 uses 123 as an argument for the view, limited to one record, and you're done. Just add theming.

On the plus side, this method works with any sort of data and does not require an active connection to the remote data. That allows it to work in a wide variety of cases. Because the data is local when it is displayed it should be quite fast, or at least no slower than the rest of the site. Plus, everything is better once you add Views to it, right?

On the downside, it does introduce a latency into the data. It will only be as up to date as the latest import. Depending on the size of your dataset the import process could be quite time consuming, too. Importing a few hundred course records takes only a few seconds. Importing 500,000 artwork records? That could take a bit longer. What "a bit longer" means depends on your data, your export format, and whether or not you do any trickery in the code, such as double-buffering your SQL tables. (Maintaining two copies, importing to the inactive one, and then switching which one you use. Weird but it works.)
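
For the curious, the double-buffering trick can be as simple as two copies of the table and a Drupal variable saying which one is live; anything that queries the data, including the Views integration above, then has to look up the active table name rather than hard-coding it. The table and variable names here are, as always, made up for the sketch.

<?php
// Import into whichever copy of the table is currently inactive, then flip a
// variable so queries start reading from the freshly built copy.

function mylegacy_active_table() {
  return variable_get('mylegacy_active_table', 'mylegacy_artwork_a');
}

function mylegacy_inactive_table() {
  $active = mylegacy_active_table();
  return ($active == 'mylegacy_artwork_a') ? 'mylegacy_artwork_b' : 'mylegacy_artwork_a';
}

function mylegacy_import_buffered($file) {
  $target = mylegacy_inactive_table();
  db_query('TRUNCATE TABLE {' . $target . '}');
  // ... same row-by-row insert as the importer above, but into $target ...

  // Once the import succeeds, flip the switch: readers see the new data
  // immediately, and the old copy becomes the next import target.
  variable_set('mylegacy_active_table', $target);
}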

Hybrid approaches

There are, of course, a variety of other ways to handle remote data. The three above are the "primitives" from which more complex approaches could be built. You could add caching of various sorts. You could combine methods 2 and 3 and read the entirety of the remote object into a node, so that it exists locally from that point on. Add in various cache-refresh mechanisms as needed. Or, if your legacy system is stagnant and you do want to eventually migrate data into Drupal, don't delete anything once it's been imported.

Alternatively, you can batch-import data into nodes. It could be that you're decommissioning the 3rd party system, and want to bring the entire dataset into Drupal. Or more likely, you have new data from another system that is periodically added and want to push it out to the web site, but then not keep it in sync after that. That's common in publishing; one example is the recently launched Journal of Foreign Affairs web site. In their case, every new issue of the magazine that is released gets exported from their publication system and imported en masse into the Drupal site as nodes. There, content editors can make web-specific tweaks as necessary before officially publishing the issue. (Watch the Drupal.org front page for a more detailed writeup of that site.)

The future

The drop is always moving, as they say, and the Data API Design Sprint of over a year ago is finally bearing fruit for remote data handling. With Fields moved into core, we have the potential to greatly redefine how Drupal handles data, both local and remote.

Imagine a diagram that looks like this:

[Diagram: field-based remote data architecture]

With Fields able to be assembled into any data entity, not just nodes but also users, comments, and so forth, what's to stop us from having them represent parts of a remote data object, which gets assembled into a local object within Drupal? Why bridge from artworks in a museum database to nodes, when we can add local fields directly to that artwork object?

For that to be possible, fields need to be able to represent remote data. They also need to be able to work together to load data in an optimal way. If you think the debates over fully-normalized tables for local data were serious, just wait until you're dealing with data that is not in a local, fast MySQL database but somewhere on the other side of a SOAP connection. The API also needs to be flexible enough to handle hybrid data sources, with objects living partially local, partially remote, and still having the same API throughout.

Will Drupal 7 be able to handle all of that? Well, we have until September 1st to make sure it does.

Comments

Thanks for the excellent writeup, Larry.

I do think hybrid data sources (field A living locally, field B served from a SOAP connection) has huge potential.

Right now the D7 pluggable 'field storage engines' are global to the site: all local SQL, or all external-through-SOAP, or all whatever. Last time I asked 'Do we intend to support per-field storage engines?', the answer was 'How will Views handle this?' :-). Building parallel queries against several storage locations and merging the results will be tricky. Does that also mean a storage engine needs to come with a full-fledged Views query builder?

Per-entity-type (or whatever it's called) storage engines are, I think, the absolute minimum we have to support for Field API backends to be useful. Otherwise, you can't use remote storage without basically killing the entire local storage system. That's really not worth the effort.

If we can go as far as per-field, that's even better. I can see that being an issue for Views, but I don't see it as a solvable problem. No one has found a solution for searching across separate datasets that doesn't involve first caching everything in an intermediary server and then searching that. We're not going to solve that in Fields API in D7.

So I'd say make the Views integration for fields backend-dependent. Yes, that means you can only search across fields in the same data source, and we're OK with that. Only a small fraction of people are going to be trying to do that anyway, and they should understand why that's an exponentially harder problem. We could still get a huge amount of added power and flexibility without it.

So let's do it. :-)

You've sparked ideas in my head for some work I'm doing now. You are a top-notch designer and you've expressed your ideas well. If you're in Columbus, OH anytime, feel free to contact me.

I have done something very similar using REST XML web services to connect Drupal to our university ERP, to make Drupal the university portal.

I'm working on abstracting some of the code to make an XML web services API, but I'm not sure whether it should stay a custom component or be elevated to a contrib module. I already have a Drupal forms-to-XML binding layer complete and am thinking about moving more of my code into a contrib project.

I really appreciated the thin node on the fly concept. I'm seriously considering adopting that approach for integrating other applications.