WCSEA 2015: Panel — WordPress at Scale: Enterprise, Media, and Education

I’m at WordCamp Seattle today and will be post­ing notes from ses­sions through­out the day. These are posted right after the session, and could be a little rough.

This was a panel discussion moderated by Grant Landram, with Evan Cordulack, Jeremy Felt, Josh Kadis, and Nathan Letsinger. Follow those links to their Twitter pages. Here is the WordCamp.org session description.


ModeratorGrant Landram: Senior project manager at 10up.com, previously with FreshMuse.

Jeremy Felt: Platform Manager at Washington State University. Previously senior engineer at 10up.

Josh Kadis: Internal and external product development at Alley Interactive, and Seattle dev meetup organizer. Previously the senior technologist at Quartz.

Evan Cordulack: Engineering manager for the Seattle Times.

Nathan Letsinger: Product lead at Grist, designer and developer.

Topic 1: What is “Scale”?

What does scale mean or look like with you and your organization?

Josh: Scale means traffic, pageviews, load on server. But there’s also internal users, scale of code base, database.

Evan: Scale means the bigger the database gets, then the things we thought WP was good at out of the box left us with some hangups. We still fight with that every day.

Nathan: Scale means we needed to deal with just the shear number of users. 2 dozen or 600 users who need to login raises a lot of concerns. Could be security. Many people logging in at the admin level means you should evaluate security. We have 700 bylines for posts, and didn’t want that many fake users in WP.

What special considerations do you have?

Jeremy: When you’re no longer running a single WP site, a lot of caching considerations get really complicated. Also a lot depends on your budget. With more money, have a sysadmin work on this for you. If you don’t, there’s a lot you can get away with on a single Linode with Memcached. Even if you’re paying someone to do this for you, it makes sense to understand what they’re doing for you, so you can write code that works well in sync with the systems you have in place.

Josh: The use case most people have for WP is one where every non-logged-in user gets the same HTML markup. If you’re working on a BuddyPress site, then that’s a totally different story. But normally, traffic can quickly become a non-issue with full page caching.

Evan: We try to use caching at every level. We use Akamai, then Varnish, Memcached… and at the WP API level, we use transients everywhere to take advantage of the Memcached object cache. We managed to totally kill page loading speed early on in develompent. Once we started taking advantage of caching APIs, it shaved off seconds. There are things built-in that get you really far, really fast — whatever level you’re at, even if you don’t have object systems in place. And it’ll make you feel like a pro.

A lot of managed hosts disable caching for logged-in users. So if you have high logged-in user traffic, you need to be more strategic about your caching strategy.

Nathan: You want to worry about caching. For us that meant full page caching. A story about this: Before we were on WP, we had that problem. Our best days for editors had huge traffic spikes. Those were our worst days for our developers, when we were having to spin up new servers, etc. We learned about Batchache from the WordPress community, and ported it over to our old system. But in the process we learned enough about the WP ecosystem that it convinced us to embrace WP generally.

Nathan: You need to know the constraints your caching systems should have in place, so that you know what not to do to break things. For example, we can’t add database tables without also having to modify our caching systems to accommodate those. I actually found these constraints freeing.

What other tools are you using?

How do you make choose which systems to use?

Nathan: It also depends on your passion. We’re not sysadmins, but we were playing that role part time. Can we hire someone to do this for us? We decided to pay somoene to just solve this for us. Even if we felt we were paying too much, there’s a lot of value in peace of mind. So when your traffic is spiking on a Saturday, I can keep sipping my Manhatten and not worry about it.

Topic 2: Scaling Strategies

Dev Team Sizes for your organizations

Jeremy: 20 or so on the team doing content, design, dev, everything. We’re supporting 600 sites and around 1,000 users. Central univ support.

Josh: We have 40 people at Alley Interactive including non devs (PMs, design, etc.). On a given project we may have as few as a dev and a PM. But sites we support with caching strategies are as large as the NY Post, which is one of the largest WP sites anywhere. They’re hosted on WP.com VIP which handles scaling, traffic, caching in the same way for us as it does for anyone else. That enables us to be the dev team for them with a relatively small team because we don’t need to worry about it.

Evan: Dozen devs. While there are a lot of users in the system at a time, there are also content being injected from various places constantly.

Nathan: It’s 2 devs and myself at Grist.

Key parts of your tool set (tech, process, organizational) for scaling

Jeremy: Have a local environment matching production as closely as possible is important. Then after that I rely on Ngynx and MySQL (although Zach Brown would recommend alternatives to MySQL like MariaDB).

Josh: For sites with very large database, e.g. New York Post which has a million database rows, we use Elasticsearch for queries (not just user-facing search). Even if you already know the ID of a post, in a large database a post-specific query can be expensive.

Nathan: Lots of users writing means you need something like a calendar for scheduling, and some kind of chat program so everyone’s in touch w/ each other. This is pretty essential for the overall team.

The Reliability of Elasticsearch

[A question from me, hence the greater detail in this document:] Josh mentioned depending on Elasticsearch for Queries. Kyle Kingsbury posted some research into Elasticsearch last year that raised questions about its reliability. Have you encountered that or had to deal with it in production?

Jeremy: Automattic makes heavy use of Elasticsearch, with people doing the work full time. If they can trust it and they’re using it at insane scale, even though they have to rebuild their indexes occasionally, then it’s probably fine.

Evan: There are a lot of queries that are surprisingly taxing at scale. Times Elasticsearch has failed for us are usually times that we did something wrong. You can write integration stuff that’ll make your life easier. Inevitably you’ll have to reindex and it’ll be a drag.

Zach Brown, from the audience: Basically, don’t use it as canonical. It’s just a way to access info quickly.

Workflow for content to moves from outside of WP into WP

Jeremy: Even though we think TinyMCE and DFW are neat, nobody uses it, because people pass things around in Word and then copy/paste them into WP.

Josh: At Quartz it became the policy that stories had to be written in WP. I could still tell who wasn’t doing that. But we were able to help allay concerns from journalists just because people get a little put off by the interface because if it’s a little off-putting to them, it’ll become familiar eventually. “Word was new to you too, once, and this will get better eventually.”

Evan: We don’t have anything like an editorial calendar, and we don’t have a good way to do this.

[After a Follow up question from audience about content/workflow policy governance and enforcement:] We rely from WP for curation; what shows up where. That’s where WP does the most work for us. Revisions of the story will happen in a different editing platform, reason being we have to feed the printing press. Some would like to do digital first, but “that is not my department”.

Nathan: Autosave has been great for the concern about browsers crashing.

Do you run your sites through performance analyzer tools?

Jeremy: I obsess over websitetest.org. On launch days I’m refreshing every 10 minutes trying new things, trying to get others out.

Josh: We’re taking the opposite approach: Starting a project by structuring build tasks, so that when you get to the end, Grunt or Gulp or whatever you’re using has already been done. [In other words, this is a philosophy of beginning with high performance as your baseline, rather than revising to get it right at the end.]

Evan: Since these tools are easy to use once you show someone, then everyone cares about performance. This is good and bad because meetings get really weird really fast (talking about which DOM events matter with non-devs is strange), but good to have everyone considering it.

Do you use tools to find performance issues specifically in code?

Jeremy: John Blackbourn’s Query Monitor plugin. Using things like xdebug or PHPStorm with step debugging can identify loops that are running multiple times and slowing things down.

Josh: Peer code review is huge for performance because you can rely on the experience of your team, seeing what things are slow, what’s caused problems for them in the past, etc. Leverage your embodied experience.

Besides plugin updates and installs, what changes when you move to a multiserver load-balanced environment

Jeremy: I’m only running one VM, but… it becomes more to manage, more pipes to keep connected, so more things to worry about going down. But, scaling horizontally by having more redundant things can be a less painful thing where everything’s beefy and fast…

Evan: We have a ton of internal conversations about this, and everyone has an opinion. Whatever it is, take your time. If you’re using a cloud platform, you can experiment with scaling things up and down. (If you’re having to do this in production, that’s another story.)

For large editorial teams, how do you manage training as you make continual changes?

Jeremy: Every Friday morning we have an open lab on campus for people to come talk. We say “we’re working on this, here’s a preview” and we get good feedback.

Josh: It’s important to treat dev of internal WP functionality as a UX problem in the same way you would treat a design change on the front-end. Without completely redesigning the admin interface there is still a lot you can do to make a feature intuitive and easy for people to figure out (or totally baffling and unintuitive). There’s a degree of design thinking you can apply to internal plugin dev that I’ve found helpful — even briefly and semi-formally for an internal plugin can really help speed up adoption.

Evan: If new changes require one-time actions before you can use them or before the UX returns to normal, and people don’t know about it, that upsets people. It’s important to give people a heads-up about changes that are coming.

Architecture constraints migrating to WP from other platforms

Josh: The constraint is less about the total number of content objects than it is about the differences in information architecture. The great thing about migrations is that you’re typically not that concerned with performance (within reason). So when you start the migration script if you’re talking about 6 hours or 10 hours to finish, it doesn’t matter so much… But if you can’t get a clean map of information architecture, that can be difficult.

Evan: We had a lot of problems with DB performance. It would get to a certain size where we were timing out, particularly around post_meta. It was failing in mysterious ways. We were in a hosted environment where there were settings we didn’t have access to. Lots of little roadblocks to clear, remove one, hit another, so we had to get through a bunch of that.

Nathan: We use EditFlow, which includes the editorial calendar. There are many media companies and even mid-sized blogs using this who need a calendar. This is an area for growth. We’re testing CoSchedule, which is monthly paid service, and has great features, but is off of WP’s architecture. I think this is an area where media companies have a need.

Have you encountered bloat from leaving revision management active?

Josh: On one occasion I’ve run a wp-cli script to clear out revisions on very old posts.

Nathan: That’s WordPress.com VIP’s problem. :)

What do we need to know about what’s coming?:

Jeremy: The easy answer is the JSON REST API and having everything available to you via JSON.

Nathan: Improvements to taxonomy terms will open a lot of opportunities.