WCSEA 2015: Matt Johnson — Content Migration: Beyond WXR

I’m at WordCamp Seattle today and am posting notes from sessions throughout the day. These are posted right after the session, and could be a little rough.

This is a talk from Matt Johnson, team lead at Alley Interactive. Find Matt at @xmatt on Twitter and at alleyinteractive.com on the web. Here’s the WordCamp.org session description.

What is migration

There’s an old site, and you’re making a new one. Your old site has content; you need to move it to the new site.

Client attitudes vary. Some are obsessed with migration. These are great because they give you a clear heads-up. Others don’t think about it at all unless you do.

Clients are often surprised when things get complicated, because they imagined it would be really simple. That got laughter in the audience, but Matt pointed out it makes sense: Content migration isn’t something you notice unless it goes wrong, so most people don’t know to think about it.

Content migration can be the fun part of a project

Content migration can involve some of the more interesting problems to solve, such as reverse-engineering weird legacy systems. You may get to write code just to extract the content, and then clean up old content produced by bad code. Then there’s the satisfaction of processing hundreds of thousands of posts with a single CLI command.

Content migration can be the least fun part of a project

Bad things can happen, such as legacy content in Windows-1252 encoding when WP speaks UTF-8. Sometimes metadata is completely missing.
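One common fix for that encoding mismatch, sketched in plain PHP (the helper name is made up for illustration, and it assumes the legacy bytes really are Windows-1252), is to transcode each field before handing it to WordPress:

```php
// Hypothetical helper: transcode a legacy Windows-1252 string to the
// UTF-8 that WordPress expects, using PHP's standard iconv extension.
function legacy_to_utf8( $text ) {
    // "\x93" and "\x94" are Windows-1252 curly quotes; left alone,
    // they are invalid UTF-8 bytes and render as mojibake.
    return iconv( 'Windows-1252', 'UTF-8', $text );
}

echo legacy_to_utf8( "\x93quoted\x94" ); // “quoted”
```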

“Oh hey, we have this microsite we forgot to tell you about until right now, a week before the launch. Can we just merge it into the main site’s migration?”

With techniques from this presentation, the answer to that can be “Yes (here’s a change order) and yes!”

The basics

Content migration is moving all of your user-generated content from one place to another, accurately. Sometimes the old data maps to the new data really easily. Other times, migration is part of a project that also overhauls the site’s information architecture — as opposed to just a “face lift”. In these cases, newly migrated data needs its structure changed from the old information architecture to the new one.

The scale of the project does not necessarily correlate with the difficulty of migration. A project with major information architecture changes and relatively little content can have a much more difficult content migration than a project with lots of content and very few information architecture changes.

The types of migration approaches

The easiest: WXR out, and WXR in. WXR stands for WordPress Extended RSS. WXR files are generated by the Export item in the WordPress Tools menu, and ingested by the Import tool.
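For reference, the same round trip can be driven from the command line with WP-CLI’s built-in export and import commands (the paths here are illustrative):

```shell
# On the old site: write WXR files to a directory.
wp export --dir=/tmp/wxr

# On the new site: pull them in, creating missing authors as needed.
# (The WordPress Importer plugin must be installed for `wp import`.)
wp import /tmp/wxr/*.xml --authors=create
```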

You can almost never use this method. Common reasons include that the old site isn’t WordPress; the new site handles images differently; or the new site has a new information architecture, such as:

  • When switching from users-as-authors to Co-Authors Plus (a plugin that allows you to have authors on the byline who are not users in WordPress, which news organizations frequently want)
  • When loading custom metadata into Fieldmanager [?]
  • Re-mapping taxonomies
  • Re-mapping content types (custom post types)

Approaches to migration

Plan A: Make your own WXR

This is unwieldy. It requires writing custom code, and it’s code that must write XML, which is hard. You’re also still limited by the format of WXR.

As long as you’re writing custom migration code, why not take total control?

Plan B: Fix up your data after WXR

Run a WXR import, see what went wrong or is missing, then troubleshoot using a WP-CLI script to finish the job.

A detour into using WP-CLI: WP-CLI is perfect for this, especially its extensibility, which allows you to write custom WordPress code to run on demand from the command line. You could write custom code on a tools page instead, but there are runtime limits, and you need to work harder to create (even limited) UI that you just don’t need on the command line.

Doing this is easy:

if ( defined( 'WP_CLI' ) && WP_CLI ) {
    require_once( MY_THEME_DIR . '/inc/class-migration-cli.php' );
    // Register the class below so `wp migration <subcommand>` works.
    WP_CLI::add_command( 'migration', 'Migration_CLI' );
}

/**
 * /inc/class-migration-cli.php
 *
 * In this example, for the sake of brevity, we're omitting
 * output for debugging, which unless you're an evil genius,
 * you need.
 */
class Migration_CLI extends WP_CLI_Command {
    public function fix_my_data( $args, $assoc_args ) {
        $per_page = 100;
        $page = 0;
        do {
            $posts = get_posts( array(
                // Your WP_Query arguments here.
                'posts_per_page' => $per_page,
                'offset' => $per_page * $page++
            ) );
            foreach ( $posts as $post ) {
                // Do your stuff here.
                wp_update_post( $post );
            }
        } while ( $per_page === count( $posts ) );
    }
}

To run this command, just do:

$ cd /var/www/my_wp_site.com
$ wp migration fix_my_data
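The $assoc_args parameter in the examples is how WP-CLI hands your command its --flags; a quick sketch of reading them (these particular flag names are made up for illustration):

```php
// WP-CLI parses `wp migration fix_my_data --dry-run --per-page=50`
// into an associative array like this before calling your method:
$assoc_args = array(
    'dry-run'  => true, // bare flags come through as boolean true
    'per-page' => '50', // flag values come through as strings
);

$dry_run  = isset( $assoc_args['dry-run'] );
$per_page = isset( $assoc_args['per-page'] ) ? (int) $assoc_args['per-page'] : 100;
```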

Plan C: Goodbye WXR, Hello ETL

The advantage here is to get away completely from the WXR limitations.

ETL is a venerable computer science term that means extract, transform, load. It’s the most common pattern for custom migration scripts.

Another custom WP-CLI migration class example:

/**
 * /inc/class-migration-cli.php
 */
class Migration_CLI extends WP_CLI_Command {
    public function migrate_data( $args, $assoc_args ) {
        $this->connect_to_legacy_source();
        // has_legacy_data() returns true if there's more to process
        while ( $this->has_legacy_data() ) {
            // Extract:
            // get_legacy_post() gets the next row and increments a counter.
            // This can also do a lot of your heavy lifting in extraction.
            $row = $this->get_legacy_post();
            $post = array(
                'post_type' => 'post',
                'post_title' => $row['title'],
                'post_content' => $row['content'],
                'post_date' => date( 'Y-m-d H:i:s', strtotime( $row['date'] ) )
            );
            // Transform:
            if ( $row['is_slideshow'] ) {
                $post['post_type'] = 'slideshow';
            }
            // Load:
            $post_id = wp_insert_post( $post );
            update_post_meta( $post_id, 'legacy_id', $row['id'] );
            if ( $row['is_slideshow'] ) {
                update_post_meta( $post_id, 'slides', $this->get_legacy_slides( … ) );
            }
        }
    }
}

Important: update_post_meta() is “idempotent,” which means you can run it multiple times and get the same result as running it once. You can make post creation idempotent too, as opposed to creating duplicate posts each time you run the migration, by checking whether a WP post already exists for this legacy ID and updating it if it does:

if ( $post_id = $this->new_post_exists( $row['id'] ) ) {
    $post['ID'] = $post_id;
    wp_update_post( $post );
} else {
    $post_id = wp_insert_post( $post );
}

[This is brilliant!] This means you can iteratively improve your data if you need to add to it, instead of starting from scratch each time with a clean database.

  • has_legacy_data(): returns true until no legacy items are left.
  • get_legacy_post(): returns an array with the next legacy item.
  • get_legacy_slides(): returns some special structured data (like slides in a slideshow).
  • new_post_exists(): returns the post_id of the WP post with this legacy id, or false if there isn’t one.
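Matt didn’t show the body of new_post_exists(); a minimal sketch, assuming the legacy_id post meta written during the load step, might look like this:

```php
// Hypothetical implementation: find the WP post that was created from
// a given legacy ID by querying the 'legacy_id' meta set at load time.
protected function new_post_exists( $legacy_id ) {
    $posts = get_posts( array(
        'post_type'      => 'any',
        'post_status'    => 'any',
        'meta_key'       => 'legacy_id',
        'meta_value'     => $legacy_id,
        'posts_per_page' => 1,
        'fields'         => 'ids', // we only need the ID
    ) );
    return ! empty( $posts ) ? $posts[0] : false;
}
```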

Typical Legacy data formats to work with

  • A MySQL database (any weird schema)
  • A pile of XML or JSON files
  • An RSS feed
  • A REST API (yay!)
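Whatever the source, the extract step usually boils down to normalizing each record into the $row shape used in the ETL loop above. A sketch for the pile-of-JSON-files case (the legacy field names here are assumptions for illustration):

```php
// Decode one legacy JSON record into the $row array the ETL loop expects.
function extract_row_from_json( $json ) {
    $data = json_decode( $json, true ); // true => associative arrays
    return array(
        'id'           => $data['nid'],
        'title'        => $data['headline'],
        'content'      => $data['body'],
        'date'         => $data['published_at'],
        'is_slideshow' => ! empty( $data['slides'] ),
    );
}

$row = extract_row_from_json(
    '{"nid":7,"headline":"Hi","body":"<p>Hi</p>","published_at":"2015-10-31","slides":[]}'
);
```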

This post is part of the thread: 2015 WordCamp Seattle Live Notes – an ongoing story on this site. View the thread timeline for more context on this post.