Migrating XML in Drupal 8

Migrate in Drupal 8 is a flexible and powerful tool - you just need to know where to look. 

Drupal 8 is here, which means I have had the privilege of working on my first D8 projects and the migrations that accompany them. I want to share some of the key findings I’ve taken away from the experience.

Migrate in Drupal 8 is awesome as long as you know what you are looking at. It is flexible, powerful and relatively easy to read. But as is the case with most things, a lot of its power is tucked away where it is hard to find if you don't know where to look. This is definitely the case with the Migrate Plus XML data parser plugin, which is presently available only in the dev version of Migrate Plus. It is a pretty solid tool for migrating from a variety of XML-based sources, and today we are going to talk about how to use it.

The first thing we have to consider is where our data is coming from. Migrate Plus expects this information in the form of a URL, which gives us two options:

  1. our source is from outside the website, like an RSS feed; or
  2. it is stored locally.

If you have an external URL, all you need to do is plug it into the urls parameter. If your source is stored locally, you will either need to construct a URL for the source or store it in the private file directory and reference it with the private:// stream wrapper. I would go for the latter as it involves less overhead. At this point your migration source should look something like this:

source:
   plugin: url
   data_fetcher_plugin: http
   data_parser_plugin: xml
   urls: private://migration.xml
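
For the external case, the only thing that should need to change is the urls value. A minimal sketch, assuming a feed at a placeholder address (example.com is not a real source):

source:
   plugin: url
   data_fetcher_plugin: http
   data_parser_plugin: xml
   # Hypothetical feed address; swap in your own.
   urls: https://example.com/feed.xml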

This brings us to parsing out the XML. All of the selectors we will be talking about use XPath. The first thing you need to do is define the item selector so Migrate can identify the individual items to migrate into your chosen destination. For example, if we were migrating posts from a WordPress export it might look something like this:

item_selector: /rss/channel/item[wp:post_type="post"]

Next up we need to map all of our fields to nice, readable machine names that we can use in the process part of the migration. Each field has a name that identifies it in other parts of the migration, a label describing what sort of data we will find in that XML element, and a selector so the migration can pull that data out of the XML file. Each selector is evaluated relative to the item matched by the item selector:

fields:
   -
    name: title
    label: Content title
    selector: title
   -
    name: post_id
    label: Unique content ID
    selector: wp:post_id
   -
    name: content
    label: Body of the content
    selector: content:encoded
   -
    name: post_tag
    label: Tags assigned to the content item
    selector: 'category[@domain="post_tag"]/@nicename'

If you are using anything more complicated than the XML node names, you will need to wrap the selector in quotes so YAML treats it as a string. The selectors are passed to XPath in the data parser, so you can get pretty precise in selecting XML nodes.
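
As an example of a more precise selector: a WordPress export stores post metadata as wp:postmeta elements, each with a wp:meta_key and a wp:meta_value child. A field entry along these lines should pull out a single meta value. The field name and the choice of the _thumbnail_id key are only for illustration:

   -
    # Hypothetical field; not part of the migration built in this post.
    name: thumbnail_id
    label: ID of the featured image attachment
    selector: 'wp:postmeta[wp:meta_key="_thumbnail_id"]/wp:meta_value'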

All that is left to do is define the ids, which tell Migrate which source field uniquely identifies each item, and you have your source all ready to go:

ids:
   post_id:
     type: integer

Put it all together and you should have something that looks like this:

source:
   plugin: url
   data_fetcher_plugin: http
   data_parser_plugin: xml
   urls: private://migration.xml
   item_selector: /rss/channel/item[wp:post_type="post"]
   fields:
      -
       name: title
       label: Content title
       selector: title
      -
       name: post_id
       label: Unique content ID
       selector: wp:post_id
      -
       name: content
       label: Body of the content
       selector: content:encoded
      -
       name: post_tag
       label: Tags assigned to the content item
       selector: 'category[@domain="post_tag"]/@nicename'
   ids:
      post_id:
         type: integer

A note on prefixed namespaces: you can see we mixed XML nodes that have prefixes with those that don’t. Sometimes Migrate handles this with no problem at all; sometimes it refuses to fetch data from XML nodes that don’t have prefixes. As far as I can tell, it does this when one of the nodes in the item_selector path has a prefix (although it doesn’t seem to have this problem with prefixed nodes inside the item_selector's filters). If you have a data source with a prefixed parent node, you can still get non-prefixed children by using the following syntax:

name: description
label: Content description
selector: '*[local-name()="description"]'

This selects XML nodes with the given local name regardless of prefix, which is very handy when the node you want has no prefix at all.
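
With the source defined, the field names become available in the process part of the migration. A rough sketch of what that might look like for our WordPress posts, assuming the destination is a node type with the machine name article and a standard body field (neither is specified above, so adjust to your site):

process:
   # Assumed bundle; use whatever content type you are migrating into.
   type:
      plugin: default_value
      default_value: article
   # Left side is the destination field, right side is the source field name.
   title: title
   body/value: content
destination:
   plugin: entity:node

The post_tag field is left out here because mapping it to taxonomy terms needs extra process steps that depend on how your vocabularies are set up.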