
WordPress Export Format

2021-12-05

I’ve been working on parsing a WordPress.com XML export file and converting it to Markdown, and I can’t say that I’m a big fan of the XML format. Some issues with the format really bug me and make the files tedious to work with. It feels a bit like they needed an export just to tick a box, rather than a feature that has been cared for over time.

Before I start bashing on the format, I must say that it’s excellent that the platform allows you to export all your information. It’s becoming more and more difficult to do in today’s world of isolated silos! 👍

The format

Let’s just try to break some things down. I’ve omitted some data and replaced it with ... where I felt it necessary to keep it readable.

<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" ...>
  <channel>
    <title>TITLE</title>
    <link>HOMEPAGE_URL</link>
    <description>DESCRIPTION</description>
    <pubDate>Wed, 06 Oct 2021 18:55:00 +0000</pubDate>
    <language>en</language>
    <wp:wxr_version>1.2</wp:wxr_version>
    <wp:base_site_url>http://wordpress.com/</wp:base_site_url>
    <wp:base_blog_url>HOMEPAGE_URL</wp:base_blog_url>
    <wp:author>
      <wp:author_id>ID</wp:author_id>
      <wp:author_login>NAME</wp:author_login>
      <wp:author_email>EMAIL</wp:author_email>
      <wp:author_display_name><![CDATA[NAME]]></wp:author_display_name>
      <wp:author_first_name><![CDATA[]]></wp:author_first_name>
      <wp:author_last_name><![CDATA[]]></wp:author_last_name>
    </wp:author>
...

So far so good, but this is where the behaviour I expected ends.

...
<wp:category>
  <wp:term_id>ID</wp:term_id>
  <wp:category_nicename>NAME</wp:category_nicename>
  <wp:category_parent/>
</wp:category>
<wp:category>
  <wp:term_id>ID_2</wp:term_id>
  <wp:category_nicename>NAME</wp:category_nicename>
  <wp:category_parent/>
</wp:category>
...

All of a sudden, there are multiple categories without an enclosing parent. I know it’s not necessary, but it sure makes it easier to parse the data if you know that all upcoming elements are of one specific type. This behaviour is present throughout the file and is found in tags, items, comments, metadata and some other elements. At least it’s consistent, so you know what to expect.
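To show what this means in practice, here is a minimal Python sketch that picks the categories out of the channel by tag name, since there is no wrapper element to ask for. The sample document, the ids and names in it, and the function name read_categories are made up for illustration, and the namespace URI for the wp: prefix is an assumption, because the snippets in this post elide the namespace declarations.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URI for the wp: prefix; the post elides the declarations.
NS = {"wp": "http://wordpress.org/export/1.2/"}

# Hypothetical miniature export, modelled on the category snippet above.
SAMPLE = """<rss version="2.0" xmlns:wp="http://wordpress.org/export/1.2/">
  <channel>
    <title>TITLE</title>
    <wp:category>
      <wp:term_id>1</wp:term_id>
      <wp:category_nicename>news</wp:category_nicename>
      <wp:category_parent/>
    </wp:category>
    <wp:category>
      <wp:term_id>2</wp:term_id>
      <wp:category_nicename>travel</wp:category_nicename>
      <wp:category_parent/>
    </wp:category>
  </channel>
</rss>"""


def read_categories(xml_text):
    """Return (term_id, nicename) pairs for every wp:category in the channel."""
    channel = ET.fromstring(xml_text).find("channel")
    # No enclosing element exists, so the categories are filtered out of
    # the channel's direct children by tag name.
    return [
        (
            int(cat.findtext("wp:term_id", namespaces=NS)),
            cat.findtext("wp:category_nicename", namespaces=NS),
        )
        for cat in channel.findall("wp:category", NS)
    ]


categories = read_categories(SAMPLE)
print(categories)
```

A tree-based parser hides most of the pain here. With a streaming parser the missing wrapper is more noticeable, because there is no single end tag that tells you the list of categories is complete.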

The next thing that bothers me is item, which is an entity that can be many different things, namely:

The type of an item is defined by its wp:post_type field. This makes it difficult to understand what each field represents for the different types. Categories, metadata and comments are all pushed directly into the item element without any enclosing element.

...
<item>
  <title>TITLE</title>
  <link>EXTERNAL_URL</link>
  <pubDate>Tue, 28 Sep 2021 08:22:00 +0000</pubDate>
  <dc:creator>CREATOR</dc:creator>
  <guid isPermaLink="false">INTERNAL_URL</guid>
  <description/>
  <content:encoded><![CDATA[]]></content:encoded>
  <excerpt:encoded><![CDATA[]]></excerpt:encoded>
  <wp:post_id>ID</wp:post_id>
  <wp:post_date>2021-09-28 10:22:00</wp:post_date>
  <wp:post_date_gmt>2021-09-28 08:22:00</wp:post_date_gmt>
  <wp:post_modified>2021-09-28 12:20:40</wp:post_modified>
  <wp:post_modified_gmt>2021-09-28 10:20:40</wp:post_modified_gmt>
  <wp:comment_status>open</wp:comment_status>
  <wp:ping_status>open</wp:ping_status>
  <wp:post_name>NAME_OF_POST</wp:post_name>
  <wp:status>publish</wp:status>
  <wp:post_parent>0</wp:post_parent>
  <wp:menu_order>0</wp:menu_order>
  <wp:post_type>post</wp:post_type>
  <wp:post_password/>
  <wp:is_sticky>0</wp:is_sticky>
  <category domain="category" nicename="SLUGIFIED_NAME"><![CDATA[PRETTY_NAME]]></category>
...
</item>
...
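Because every entry is an item, a parser has to read wp:post_type before it knows what it is looking at. A minimal Python sketch of that dispatch could look like the following. The sample data, the "page" entry and the function name items_by_type are hypothetical, and the wp: namespace URI is again an assumption since it is elided in the snippets.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Assumed namespace URI for the wp: prefix; the post elides the declarations.
NS = {"wp": "http://wordpress.org/export/1.2/"}

# Hypothetical miniature export with two items of different types.
SAMPLE = """<rss version="2.0" xmlns:wp="http://wordpress.org/export/1.2/">
  <channel>
    <item>
      <title>Hello</title>
      <wp:post_id>1</wp:post_id>
      <wp:post_type>post</wp:post_type>
      <category domain="category" nicename="news"><![CDATA[News]]></category>
    </item>
    <item>
      <title>About</title>
      <wp:post_id>2</wp:post_id>
      <wp:post_type>page</wp:post_type>
    </item>
  </channel>
</rss>"""


def items_by_type(xml_text):
    """Group all item elements by the text of their wp:post_type child."""
    channel = ET.fromstring(xml_text).find("channel")
    grouped = defaultdict(list)
    for item in channel.findall("item"):
        post_type = item.findtext("wp:post_type", default="unknown", namespaces=NS)
        grouped[post_type].append({
            "id": int(item.findtext("wp:post_id", namespaces=NS)),
            "title": item.findtext("title"),
            # Categories sit directly inside the item, next to everything else.
            "categories": [
                c.get("nicename")
                for c in item.findall("category")
                if c.get("domain") == "category"
            ],
        })
    return dict(grouped)


grouped = items_by_type(SAMPLE)
print(sorted(grouped))
```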

Why would you use this kind of format?

It’s probably not just one of these, but a combination of different choices made over the years that has made it what it is today.

A Suggestion to improve the format

A good practice is to not complain about something if you can’t give some well-structured feedback or ideas for improvement. I’ll try to suggest what I believe is an improvement over the existing XML format!

<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" ...>
  <channel>
    <title>TITLE</title>
    <link>HOMEPAGE_URL</link>
    <description>DESCRIPTION</description>
    <pubDate>Wed, 06 Oct 2021 18:55:00 +0000</pubDate>
    <language>en</language>
    <wp:wxr_version>1.2</wp:wxr_version>
    <wp:base_site_url>http://wordpress.com/</wp:base_site_url>
    <wp:base_blog_url>HOMEPAGE_URL</wp:base_blog_url>
    <wp:author>
      <wp:author_id>ID</wp:author_id>
      <wp:author_login>NAME</wp:author_login>
      <wp:author_email>EMAIL</wp:author_email>
      <wp:author_display_name><![CDATA[]]></wp:author_display_name>
      <wp:author_first_name><![CDATA[]]></wp:author_first_name>
      <wp:author_last_name><![CDATA[]]></wp:author_last_name>
    </wp:author>
    <wp:posts>
      <wp:post>
        <id>ID</id>
        <title>TITLE</title>
        ...
        <wp:comments>
          <wp:comment>
          ...
          </wp:comment>
        </wp:comments>
      </wp:post>
      ...
    </wp:posts>
    ... tags, attachments, etc.
    <wp:categories>
      <wp:category>
        ...
      </wp:category>
    </wp:categories>
  </channel>
</rss>
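With wrapper elements like these, reading the file becomes a matter of following a path. The Python sketch below is only an illustration of that idea. The suggested format does not exist, so the sample document, the id children inside the comments and the function name read_posts are hypothetical, as is the wp: namespace URI.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URI for the wp: prefix; the post elides the declarations.
NS = {"wp": "http://wordpress.org/export/1.2/"}

# Hypothetical document in the suggested nested format.
SAMPLE = """<rss version="2.0" xmlns:wp="http://wordpress.org/export/1.2/">
  <channel>
    <wp:posts>
      <wp:post>
        <id>1</id>
        <title>Hello</title>
        <wp:comments>
          <wp:comment><id>10</id></wp:comment>
          <wp:comment><id>11</id></wp:comment>
        </wp:comments>
      </wp:post>
      <wp:post>
        <id>2</id>
        <title>World</title>
        <wp:comments/>
      </wp:post>
    </wp:posts>
  </channel>
</rss>"""


def read_posts(xml_text):
    """Read every post under wp:posts, with the number of nested comments."""
    channel = ET.fromstring(xml_text).find("channel")
    return [
        {
            "id": int(post.findtext("id")),
            "title": post.findtext("title"),
            # The wrapper guarantees that everything below it is a comment.
            "comments": len(post.findall("wp:comments/wp:comment", NS)),
        }
        for post in channel.findall("wp:posts/wp:post", NS)
    ]


posts = read_posts(SAMPLE)
print(posts)
```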

My changes include:

The drawback with this approach is that it would add a lot of extra bytes to the file, but I believe I can justify the change with:

A completely different approach would be to base the export format on JSON instead. I feel it would be a little more in keeping with the times.
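As a rough idea of what that could look like, here is a small Python sketch that writes and reads back a JSON export. Nothing like this exists in WordPress as far as this post is concerned: the structure, the key names and the helper to_json are all made up for illustration.

```python
import json

# Hypothetical shape of a JSON export; every key name here is invented.
export = {
    "title": "TITLE",
    "link": "HOMEPAGE_URL",
    "authors": [{"id": 1, "login": "NAME"}],
    "categories": [{"id": 1, "nicename": "NAME"}],
    "posts": [
        {
            "id": 1,
            "type": "post",
            "title": "TITLE",
            "categories": ["NAME"],
            "comments": [],
        }
    ],
}


def to_json(data):
    """Serialise the export as indented JSON, keeping non-ASCII text readable."""
    return json.dumps(data, indent=2, ensure_ascii=False)


text = to_json(export)
restored = json.loads(text)
print(restored["posts"][0]["title"])
```

Lists are explicit in JSON, so an empty comments list and a missing one can no longer be confused, which is the same benefit the wrapper elements give the XML version.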

Closing thoughts

WordPress is a highly successful project, one that I’ve used myself, but it seems to be having some difficulty keeping up with the times. I suppose the idea behind WordPress is to enable basic tasks and let plugins handle the more advanced/polished use cases.

Written by
Andreas