On Demand Video Streaming Management Tools
Aggregating Metadata Into A Single Content Management System
Decoupling Drupal from the web service to quickly aggregate complex, large-scale metadata.
- Decoupling Drupal with tools and services like REST, Elasticsearch, and Silex
- Speedy wrangling and aggregation of large-scale metadata
- Using Drupal for its administrative and content editing strengths
A quick note about this case study: Due to the complex nature of the project, and the myriad of tools and services we used to provide an effective and efficient solution to our client, we go into more technical detail than usual. Despite this, it’s a thorough and interesting read for developers and non-developers alike as it provides a clear look into our thought and development process.
Ooyala is a video technology provider that works with media companies around the world to provide data-rich streaming video solutions to very large audiences.
What They Needed
Ooyala wanted to aggregate metadata about movies, TV episodes, and other videos from their archive into a single content management system (CMS) for its clients. This clearinghouse would allow its clients to provide metadata for TV shows and movies to users via a multi-platform streaming video on demand system. However, the existing data was not always reliable or complete, so it needed varying degrees of human review to verify all data before it was sent out.
There were many layers of complexity to consider on this project:
- A requirement to merge in metadata for TV shows and movies from a third-party video service to compensate for incomplete metadata.
- Different shows needed to be available for different periods of time, depending on contract requirements.
- Depending on certain factors, shows could be previewed for users before they could be purchased.
- A 99.99% uptime requirement, with minimal latency.
- Wrangling data from a contextual standpoint using a REST API separate from the content management system.
How We Helped
Pulling in data from a Web service, curating it, and serving it out with a Web service sounds like just the thing for Drupal 8, but with its proposed release date more than a year after the project deadline, it wasn't a viable option. Drupal 7 has some support for Web services via the Services and Rest WS modules, but both are hamstrung by Drupal 7's very page-centric architecture and generally poor support for working with HTTP. We determined that we needed a better solution for this project.
Fortunately, Drupal is not the only tool in Palantir’s arsenal. After a number of rounds of discovery, we decided that a decoupled approach was the best course of action. Drupal is really good at content management and curation, so we decided to let it do what it does best. For handling the Web service component, however, we turned to the PHP microframework Silex.
Silex is Symfony2's younger sibling and therefore also a sibling of Drupal 8. It uses the same core components and pipeline as Symfony2 and Drupal 8: HttpFoundation, HttpKernel, EventDispatcher, and so forth. Unlike Symfony2 or Drupal 8, though, it does little more than wire all of those components together into a "routing system in a box"; all of the application architecture, default behavior, everything is left up to you to decide. That makes Silex extremely flexible and also extremely fast, at the cost of being on your own to decide what "best practices" you want to use.
In our testing, Silex was able to serve a basic Web service request in less than a third the time of Drupal 7. Because it relies on HttpFoundation it is also far more flexible for controlling and managing non-HTML responses than Drupal 7, including playing nicely with HTTP caching. That makes Silex a good choice for many lightweight use cases, including a headless Web service.
This decision opened up the question of how to get data from Drupal to Silex, as Silex doesn't have a built-in storage system. Pulling data directly from Drupal's SQL tables was an option, but since the data stored in those often requires processing by Drupal to be meaningful, this wasn’t a viable option. Additionally, the data structure that was optimal for content editors was not the same as what the client API needed to deliver. We also needed that client API to be as fast as possible, even before we added caching.
An intermediary data store, built with Elasticsearch, was the solution here. The Drupal side would, when appropriate, prepare its data and push it into Elasticsearch in the format we wanted to be able to serve out to subsequent client applications. Silex would then need only read that data, wrap it up in a proper hypermedia package, and serve it. That kept the Silex runtime as small as possible and allowed us to do most of the data processing, business rules, and data formatting in Drupal.
Elasticsearch is an open source search server built on the same Lucene engine as Apache Solr. Elasticsearch, however, is much easier to set up than Solr, in part because it is semi-schemaless. Defining a schema in Elasticsearch is optional unless you need specific mapping logic, and then mappings can be defined and changed without needing a server reboot. It also has a very approachable JSON-based REST API, and setting up replication is incredibly easy.
While Solr has historically offered better turnkey Drupal integration, Elasticsearch can be much easier to use for custom development, and has tremendous potential for automation and performance benefits.
With three different data models to deal with (the incoming data, the model in Drupal, and the client API model) we needed one to be definitive. Drupal was the natural choice to be the canonical owner due to its robust data modeling capability and it being the center of attention for content editors. Our data model consisted of three key content types:
- Program: An individual record, such as "Batman Begins" or "Cosmos, Episode 3". Most of the useful metadata is on a Program, such as the title, synopsis, cast list, rating, and so on.
- Offer: A sellable object; users buy Offers, which refer to one or more Programs
- Asset: A wrapper for the actual video file, which was stored not in Drupal but in the client's digital asset management system.
We also had two types of curated Collections, which were simply aggregates of Programs that content editors created in Drupal. That allowed for displaying or purchasing arbitrary groups of movies in the UI.
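The relationships between these content types can be sketched as plain data classes. This is illustrative Python, not the project's Drupal configuration: the field names (`program_ids`, `file_ref`, and so on) and the link from an Asset to a single Program are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Program:
    """An individual record that carries most of the metadata."""
    id: int
    title: str
    synopsis: str = ""
    cast: List[str] = field(default_factory=list)
    rating: str = ""


@dataclass
class Offer:
    """A sellable object that refers to one or more Programs."""
    id: int
    program_ids: List[int]


@dataclass
class Asset:
    """A wrapper pointing at a video file stored outside the CMS."""
    id: int
    program_id: int  # assumed relationship, for illustration only
    file_ref: str    # hypothetical identifier in the external asset system


@dataclass
class Collection:
    """An editor-curated aggregate of Programs."""
    name: str
    program_ids: List[int] = field(default_factory=list)


batman = Program(id=1, title="Batman Begins")
offer = Offer(id=10, program_ids=[batman.id])
asset = Asset(id=100, program_id=batman.id, file_ref="dam://example/1")
picks = Collection(name="Editor picks", program_ids=[batman.id])
```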
Incoming data from the client's external systems is POSTed against Drupal, REST-style, as XML strings. A custom importer takes that data and mutates it into a series of Drupal nodes, typically one each of a Program, Offer, and Asset. We considered the Migrate and Feeds modules, but both assume a Drupal-triggered import and had pipelines that were over-engineered for our purpose. Instead, we built a simple import mapper using PHP 5.3's support for anonymous functions. The end result was a few very short, very straightforward classes that could transform the incoming XML files into multiple Drupal nodes. After a document is imported successfully, we also send out a status message.
Once the data is in Drupal, content editing is fairly straightforward: a few fields, some entity reference relationships, and so on. Since it was only an administrator-facing system, we used the default Seven theme for the whole site.
The only significant divergence from "normal" Drupal was splitting the edit screen into several, since the client wanted to allow editing and saving of only parts of a node. This was a challenge, but we were able to make it work using Panels' ability to create custom edit forms and some careful massaging of fields that didn't play nicely with that approach.
Publication rules for content were quite complex, as they involved content being publicly available only during selected windows, and those windows were based on the relationships between different nodes. That is, Offers and Assets had their own separate availability windows, and Programs should be available only if an Offer or Asset said they should be; if the Offer and Asset differed, the logic became complicated very quickly. In the end, we built most of the publication rules into a series of custom functions fired on cron that would simply cause a node to be published or unpublished.
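The idea of a mapper built from anonymous functions can be illustrated with a short sketch. The project used PHP closures; this version uses Python lambdas, and the XML element names, field names, and dictionary-based "nodes" are all invented for the example rather than taken from the real feed.

```python
import xml.etree.ElementTree as ET

# Hypothetical mappings: target field name -> function that extracts the value
# from the parsed XML element. Element names are invented for illustration.
PROGRAM_MAP = {
    "title": lambda el: el.findtext("title", default="").strip(),
    "synopsis": lambda el: el.findtext("synopsis", default="").strip(),
    "cast": lambda el: [m.text.strip() for m in el.findall("cast/member") if m.text],
}
OFFER_MAP = {
    "price": lambda el: float(el.findtext("offer/price", default="0")),
}
ASSET_MAP = {
    "file_ref": lambda el: el.findtext("asset/ref", default=""),
}


def apply_map(element, mapping, node_type):
    """Build one node-like dict by running every extractor against the element."""
    node = {"type": node_type}
    for field_name, extract in mapping.items():
        node[field_name] = extract(element)
    return node


def import_document(xml_string):
    """Turn one incoming XML document into a Program, an Offer, and an Asset."""
    root = ET.fromstring(xml_string)
    return [
        apply_map(root, PROGRAM_MAP, "program"),
        apply_map(root, OFFER_MAP, "offer"),
        apply_map(root, ASSET_MAP, "asset"),
    ]


SAMPLE = """
<record>
  <title>Cosmos, Episode 3</title>
  <synopsis>An example synopsis.</synopsis>
  <cast><member>Host One</member><member>Host Two</member></cast>
  <offer><price>2.99</price></offer>
  <asset><ref>dam://example/42</ref></asset>
</record>
"""

nodes = import_document(SAMPLE)
```

Keeping each field's extraction logic in its own small function means adding or changing a field touches one entry in a mapping rather than the import loop itself.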
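To make the window logic concrete, here is a minimal sketch of one possible rule, run from a cron-style function. The actual rules are not spelled out above, so the rule used here (a Program is published only while at least one Offer window and at least one Asset window are open) is purely an assumption, as are all function and field names.

```python
from datetime import datetime
from typing import List, Optional, Tuple

# (start, end); None on either side means the window is open-ended.
Window = Tuple[Optional[datetime], Optional[datetime]]


def window_is_open(window: Window, now: datetime) -> bool:
    start, end = window
    if start is not None and now < start:
        return False
    if end is not None and now >= end:
        return False
    return True


def should_be_published(offer_windows: List[Window],
                        asset_windows: List[Window],
                        now: datetime) -> bool:
    """Assumed rule: one Offer window AND one Asset window must be open."""
    offer_open = any(window_is_open(w, now) for w in offer_windows)
    asset_open = any(window_is_open(w, now) for w in asset_windows)
    return offer_open and asset_open


def run_cron(programs: List[dict], now: datetime) -> None:
    """Flip each program's published flag to match its current windows."""
    for program in programs:
        program["published"] = should_be_published(
            program["offer_windows"], program["asset_windows"], now
        )


programs = [{
    "id": 1,
    "offer_windows": [(datetime(2014, 1, 1), datetime(2014, 6, 1))],
    "asset_windows": [(None, None)],
    "published": False,
}]

run_cron(programs, datetime(2014, 3, 1))
during_window = programs[0]["published"]

run_cron(programs, datetime(2014, 7, 1))
after_window = programs[0]["published"]
```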
On node save, then, we either wrote a node to our Elasticsearch server (if it was published) or deleted it from the server (if unpublished); Elasticsearch handles updating an existing record or deleting a non-existent record without issue. Before writing out the node, though, we customized it a great deal. We needed to clean up a lot of the content, restructure it, merge fields, remove irrelevant fields, and so on. All of that was done on the fly when writing the nodes out to Elasticsearch.
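The write-or-delete decision and the on-the-fly reshaping might look roughly like the sketch below. `FakeIndex` is an in-memory stand-in, not the Elasticsearch client, and the node fields (`cast`, `crew`, `internal_notes`) are hypothetical.

```python
class FakeIndex:
    """In-memory stand-in for the search index used in this example."""

    def __init__(self):
        self.docs = {}

    def put(self, doc_id, doc):
        self.docs[doc_id] = doc  # creates a new record or overwrites an old one

    def delete(self, doc_id):
        self.docs.pop(doc_id, None)  # deleting a missing record is a no-op


def to_document(node):
    """Reshape an editor-facing node into the client-facing document."""
    return {
        "id": node["nid"],
        "title": node["title"].strip(),
        # Merge two hypothetical editor fields into a single client field.
        "people": sorted(set(node.get("cast", []) + node.get("crew", []))),
        # Fields such as internal_notes are dropped by simply not copying them.
    }


def on_node_save(node, index):
    """Published nodes are written to the index; unpublished ones are removed."""
    if node["published"]:
        index.put(node["nid"], to_document(node))
    else:
        index.delete(node["nid"])


index = FakeIndex()
node = {
    "nid": 5,
    "title": "  Example Movie ",
    "cast": ["B", "A"],
    "crew": ["A", "C"],
    "internal_notes": "not for clients",
    "published": True,
}
on_node_save(node, index)
indexed_doc = index.docs.get(5)

node["published"] = False
on_node_save(node, index)
on_node_save(node, index)  # removing it a second time is harmless
```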
A related note: for performance reasons, we deferred the actual processing to Drupal's queue system. That neatly avoided race conditions around accessing nodes during node save and kept the user interface fast and responsive.
There was one other requirement: since the incoming data was often incomplete, we also needed to import data from RottenTomatoes.com. For that we built a two-layer system: one layer is a generic PHP package, using the Guzzle library, that expresses Rotten Tomatoes content as PHP objects, while the other bridges that package to create Drupal nodes populated from Rotten Tomatoes data. We then matched up Rotten Tomatoes movies and reviews with the client's source data and allowed editors to elect to use data from Rotten Tomatoes in favor of their own where appropriate. That data was merged in during the indexing process as well, so once data is in Elasticsearch it doesn't matter where it came from. We also exposed Critic Reviews to Elasticsearch so that client applications could see reviews of movies and user ratings before buying.
Incoming requests from client applications never hit Drupal. They only ever hit the Silex app server. The Silex app doesn't even have to do much. For the wire format we selected the Hypertext Application Language, or HAL. HAL is a very simple JSON-based hypermedia format used by Drupal 8, Zend Apigility, and others, and is an IETF draft specification. It also has a very robust PHP library available that we were able to use. Since Elasticsearch already stores and returns JSON, it was trivial to map objects from Elasticsearch into HAL. The heavy lifting was just in deriving and attaching the appropriate hypermedia links and embedded values. Keyword and other search queries were simply passed through to Elasticsearch and the results used to load the appropriate records.
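The deferral pattern can be sketched with a plain in-process queue. This is not Drupal's queue API; the function names are hypothetical, and the point is only that the save path records a job while a later worker loads a fresh copy of the node and indexes it.

```python
from collections import deque

queue = deque()


def enqueue_for_indexing(nid):
    """Called during save: record the work and return immediately."""
    if nid not in queue:  # avoid piling up duplicate jobs for one node
        queue.append(nid)


def process_queue(load_node, index_node):
    """Called later (for example on cron): load fresh copies and index them."""
    processed = []
    while queue:
        nid = queue.popleft()
        index_node(load_node(nid))
        processed.append(nid)
    return processed


storage = {7: {"nid": 7, "title": "Draft"}}
indexed = []

enqueue_for_indexing(7)
storage[7]["title"] = "Final"  # edited again before the worker runs
enqueue_for_indexing(7)        # the second save does not add a duplicate job

done = process_queue(lambda nid: dict(storage[nid]), indexed.append)
```

Because the worker loads the node when it runs rather than when the job was queued, it always indexes the latest saved state.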
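The merge step at indexing time could work along the lines of the sketch below. The field names and the `prefer_external` mechanism are assumptions that stand in for whatever per-field choice editors actually made; the sample values are placeholders, not real Rotten Tomatoes data.

```python
EMPTY = (None, "", [])


def merge_metadata(source, external, prefer_external=()):
    """Combine client metadata with third-party metadata.

    - Fields missing or empty in `source` are filled from `external`.
    - Fields named in `prefer_external` use the external value when it exists.
    """
    merged = dict(source)
    for key, value in external.items():
        if value in EMPTY:
            continue  # never overwrite with an empty external value
        if key in prefer_external or merged.get(key) in EMPTY:
            merged[key] = value
    return merged


source = {"title": "Example Movie", "synopsis": "", "rating": "PG-13"}
external = {
    "title": "Example Movie (External Title)",
    "synopsis": "External synopsis text.",
    "rating": "PG",
    "critic_score": 85,
}

merged = merge_metadata(source, external, prefer_external={"rating"})
```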
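As a rough illustration of the mapping into HAL, the sketch below wraps a stored document with `_links` and `_embedded` members, the two reserved properties HAL defines. The URL layout, the `offers` relation name, and the document fields are assumptions, and the project itself used a PHP HAL library rather than hand-built dictionaries.

```python
import json


def to_hal(doc, base_url="/api"):
    """Wrap a stored program document in a HAL-style resource."""
    # Plain properties: everything except the fields turned into links/embeds.
    resource = {k: v for k, v in doc.items() if k not in ("offer_ids", "reviews")}
    resource["_links"] = {
        "self": {"href": f"{base_url}/programs/{doc['id']}"},
        "offers": [
            {"href": f"{base_url}/offers/{offer_id}"}
            for offer_id in doc.get("offer_ids", [])
        ],
    }
    reviews = doc.get("reviews", [])
    if reviews:
        resource["_embedded"] = {
            "reviews": [
                {**review,
                 "_links": {"self": {"href": f"{base_url}/reviews/{review['id']}"}}}
                for review in reviews
            ]
        }
    return resource


doc = {
    "id": 12,
    "title": "Example Movie",
    "offer_ids": [3, 4],
    "reviews": [{"id": 9, "quote": "An example quote."}],
}
hal = to_hal(doc)
payload = json.dumps(hal)  # served with Content-Type: application/hal+json
```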
Finally, we wrapped the HAL object up in Symfony's Response object, set our HTTP caching parameters and ETags, and sent the message on its way.
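The caching step can be sketched independently of any framework. The example below derives an ETag from the response body and answers a matching `If-None-Match` value with a 304. It does not use Symfony's Response class, and the `max_age` value and hashing scheme are assumptions chosen for illustration.

```python
import hashlib
import json


def build_response(resource, if_none_match=None, max_age=300):
    """Return (status, headers, body) for a cacheable, read-only resource."""
    # Serialize deterministically so the same data always yields the same ETag.
    body = json.dumps(resource, sort_keys=True, separators=(",", ":"))
    etag = '"' + hashlib.sha256(body.encode("utf-8")).hexdigest()[:16] + '"'
    headers = {
        "Content-Type": "application/hal+json",
        "Cache-Control": f"public, max-age={max_age}",
        "ETag": etag,
    }
    if if_none_match == etag:
        return 304, headers, ""  # the client's cached copy is still valid
    return 200, headers, body


resource = {"id": 12, "title": "Example Movie"}
status, headers, body = build_response(resource)
status_again, _, body_again = build_response(resource, if_none_match=headers["ETag"])
_, changed_headers, _ = build_response({"id": 12, "title": "Changed"})
```

With headers like these, a cache in front of the API can serve repeat requests itself and revalidate cheaply when an entry expires.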
A big advantage of the split architecture is that spinning up a new Silex instance is trivial. There is no meaningful configuration beyond identifying the Elasticsearch server to use, and most code is pulled down via Composer. That means spinning up multiple instances of the API server for redundancy, high availability, or performance is virtually no work. We didn't need to worry, though; the API is read-only, so with correct use of HTTP headers and a basic Varnish server in front of it, the API is surprisingly snappy.
A big part of Drupal's maturity as a CMS is realizing that it isn’t the be-all end-all answer to all problems. For Ooyala and its clients, Drupal was great for managing content, but not for serving a web API. Fortunately, Palantir’s knowledge of the upcoming Drupal 8 release and its reliance on the Symfony pipeline let us pair Drupal with Silex – which is great for serving a Web API but not all that hot for managing and curating content. Ultimately, Palantir chose the right tool for the job, and the project benefited from this greatly.
Ooyala now has a robust and reliable API that is able to serve client applications we never even touched ourselves; Ooyala’s clients get what they want; end users have a fast and responsive Web service powering their media applications. In addition, Palantir had the opportunity to get our hands dirty with another member of the Symfony family – an investment that will pay off long-term with Drupal 8 and the growing popularity of Symfony within the PHP ecosystem.
Great for Ooyala; great for Palantir; great for the community.
Image by Todd Lappin "Above Suburbia" under CC BY-NC 2.0, modified with green overlay and the addition of arrows.