Nov 21, 2019

Fixing my content after migrating to Jekyll

Thanks to migrating to Jekyll, I could finally fix years of neglected content. Here is how I did it.

Articles in this series:

Migrated from Ghost to Jekyll
Migrating content from Ghost to Jekyll
How I set up this blog on Jekyll
How I improved my Jekyll setup
How I improved my Jekyll SEO
Fixing my content after migrating to Jekyll (this article)

The case of missing metadata

I’ve been writing for quite a few years now. Originally, I started on self-hosted WordPress and later moved to Ghost. Back in the day, my setup wasn’t that sophisticated. The only metadata I used were keywords. Since my theme didn’t require a preview image or an excerpt, I didn’t provide them, only to pay the price for it now.

I’ve been building websites since forever and I’m fully aware of the importance of providing metadata. And yet, for some reason, I hadn’t done it for my own blog. Only after I moved from WordPress to Ghost did I start paying more attention to it, specifying proper excerpts and preview images. But by then, I had accumulated quite some debt. Roughly 800 blog posts that I wrote in the past were missing a preview image, an excerpt, or both.

Theoretically, all the content was there. For years I’ve been opening my articles with an image and using the first paragraph as the excerpt. But extracting that information and converting it to proper metadata in hosted Ghost was more complexity than I was willing to take on. Until I got another opportunity.

When I migrated my blog from Ghost to Jekyll, it changed from a collection of content accessible through a CMS to a bunch of Markdown files on my disk. And that came with a big benefit. Whatever issue I had in my content, I could fix it at scale with just the right script. And fixing these issues was not just about making things aesthetically pleasing. Missing images and excerpts were negatively impacting my content published in AMP format, which could theoretically affect my SEO. So with a bit of work, I fixed missing preview images and excerpts, providing my readers with a proper experience, even if they went all the way back to my very first article. Here is how I did it.

Metadata in Jekyll

In Jekyll, as well as in other static site generators, metadata is defined in a section at the top of the page called the front matter. It’s a collection of key-value properties, some of which are specified by the generator and others that are defined by your theme. Here is a sample front matter that I’m using on this blog:

---
title: What version of Node can I use for the SharePoint Framework?
slug: node-version-sharepoint-framework
description: 'Theoretically, you could use any version of Node.js with the SharePoint Framework but there are caveats.'
image: /assets/images/2016/05/banner.jpg
pubDate: '2019-11-11 18:51:54'
tags:
 - node-js
 - office-365-development
 - sharepoint-development
 - sharepoint-framework
---

On this blog, I write the front matter in YAML. For each article, I define properties such as layout, title, slug, tags, and whether it’s featured or hidden, but also the preview image and excerpt, which were missing for the majority of my old content.
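To make the split between front matter and body concrete, here is a minimal sketch in plain Node.js with no dependencies. It is not how Jekyll or gray-matter parse a file: `splitFrontMatter` is a hypothetical helper that only separates the block between the two `---` lines from the rest of the file and leaves the YAML unparsed.

```javascript
// Hypothetical helper, for illustration only: separates the raw front matter
// block from the body. It does not parse the YAML itself.
function splitFrontMatter(text) {
  // The file must open with a '---' line; the front matter ends at the next
  // line that contains only '---'.
  const m = text.match(/^---[ \t]*\r?\n([\s\S]*?)\r?\n---[ \t]*(?:\r?\n|$)([\s\S]*)$/);
  if (!m) {
    // No front matter found: treat the whole file as body.
    return { frontMatter: null, body: text };
  }
  return { frontMatter: m[1], body: m[2] };
}

// Made-up sample post, assembled in memory.
const samplePost = [
  '---',
  'title: Sample post',
  'image: /assets/images/2016/05/banner.jpg',
  '---',
  '',
  'First paragraph of the article.'
].join('\n');

const parts = splitFrontMatter(samplePost);
console.log(parts.frontMatter);
console.log(parts.body.trim());
```

For real content, a dedicated parser is the safer choice, because it also turns the YAML into an object and can write it back.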

Setting the preview image

With a few exceptions, all the articles I wrote in the past started with an image that was a part of the content. To match the structure of this blog, I needed to move that image out of the content and put it in the front matter instead. Here is the script that I used:

var fs = require('fs'),
  path = require('path'),
  matter = require('gray-matter'),
  argv = require('yargs').argv;

var postsDir = argv.postsDir;
var hasError = false;

if (postsDir === undefined || postsDir.length === 0) {
  console.error('ERROR: Specify the path to the Jekyll posts directory using the postsDir argument');
  hasError = true;
}

if (hasError) {
  console.log();
  console.log('Sample usage:');
  console.log('node index.js --postsDir=./_posts');
  process.exit(1); // exit with a non-zero code to signal the error
}

const postFiles = fs.readdirSync(postsDir);
postFiles.forEach(f => {
  console.log('Processing ' + f + '...');
  const filePath = path.resolve(postsDir, f);
  const postFile = fs.readFileSync(filePath, 'utf-8');
  const file = matter(postFile);
  if (file.data.image) {
    console.log('- has image');
    return;
  }
  // match an <img> tag at the very start of the content and capture its src
  const m = file.content.match(/^\s*<img[^>]+src="([^"]+)"[^>]*>.*/);
  if (!m || !m[1]) {
    console.log('- no image found');
    return;
  }

  file.data.image = m[1];
  file.content = file.content.replace(m[0], '');
  fs.writeFileSync(filePath, file.stringify(), 'utf-8');
  console.log('- DONE');
});

The script starts by iterating over all files in the posts folder specified using the postsDir argument. For each of these files, it checks whether the image is already defined in the front matter. To simplify working with the front matter, I use the gray-matter npm package, which takes care of properly handling the different styles and formats of front matter for both reading and writing. If the file has the image defined, it’s skipped. If not, the script tries to extract an image from the very start of the content. If it succeeds, it writes the image’s URL to the front matter’s image property and removes the image from the body. Finally, it saves the file back to disk.
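To see what the regular expression does and does not pick up, here is a small sketch that runs it against in-memory strings instead of files. It reuses the expression from the script; `extractLeadingImage` and the sample strings are hypothetical and only there for illustration.

```javascript
// Same regular expression as in the script: it only matches an HTML <img>
// tag with a double-quoted src at the very start of the content.
const leadingImage = /^\s*<img[^>]+src="([^"]+)"[^>]*>.*/;

// Hypothetical helper: returns the image URL and the content without the tag.
function extractLeadingImage(content) {
  const m = content.match(leadingImage);
  if (!m || !m[1]) {
    return { image: null, content: content };
  }
  // A replacer function keeps '$' sequences from being treated specially.
  return { image: m[1], content: content.replace(m[0], () => '') };
}

// Made-up post bodies.
const htmlPost = '\n<img src="/assets/images/2016/05/banner.jpg" alt="Banner">\n\nFirst paragraph.';
const markdownPost = '![Banner](/assets/images/2016/05/banner.jpg)\n\nFirst paragraph.';

const fromHtml = extractLeadingImage(htmlPost);
const fromMarkdown = extractLeadingImage(markdownPost);

console.log(fromHtml.image);          // URL taken from the src attribute
console.log(fromHtml.content.trim()); // body without the <img> tag
console.log(fromMarkdown.image);      // null: Markdown image syntax is not matched
```

Posts that open with a Markdown image or with text before the image are reported as "no image found" and are left untouched.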

With the blog files in a Git repo, it’s easy to track the changes and verify that everything has been done as expected. With images properly set, it was time to tackle the next issue at hand.

Setting the excerpt

Besides the preview images, my older articles were missing an excerpt in the metadata. The excerpt is used for several things, from the blog post archive listing to search engine result previews, so I wanted to ensure that every article had one.

Again, I ensured that all my articles had them using a simple script I wrote:

var fs = require('fs'),
  path = require('path'),
  matter = require('gray-matter'),
  removeMd = require('remove-markdown'),
  argv = require('yargs').argv;

var postsDir = argv.postsDir;
var hasError = false;

if (postsDir === undefined || postsDir.length === 0) {
  console.error('ERROR: Specify the path to the Jekyll posts directory using the postsDir argument');
  hasError = true;
}

if (hasError) {
  console.log();
  console.log('Sample usage:');
  console.log('node index.js --postsDir=./_posts');
  process.exit(1); // exit with a non-zero code to signal the error
}

const postFiles = fs.readdirSync(postsDir);
postFiles.forEach(f => {
  console.log('Processing ' + f + '...');
  const filePath = path.resolve(postsDir, f);
  const postFile = fs.readFileSync(filePath, 'utf-8');
  const file = matter(postFile);
  if (file.data.excerpt && file.data.excerpt.trim().length > 0) {
    console.log('- has excerpt');
    return;
  }

  // use the first line of the body, stripped of Markdown formatting, as the excerpt
  const firstP = removeMd(file.content.trim().split('\n')[0]);
  file.data.excerpt = firstP;
  fs.writeFileSync(filePath, file.stringify(), 'utf-8');
  console.log('- DONE');
});

Once again, the script starts by iterating over all files in the posts folder. For each one, it checks whether the excerpt is already defined. If not, it takes the first line of the article’s content, which is the first paragraph, removes the Markdown formatting using the remove-markdown package and uses the result as the excerpt. With the changes in place, it updates the file.
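Taking the first line works when every paragraph sits on a single line. If your posts hard-wrap paragraphs across several lines, a variant along these lines might be a better fit. This is only a sketch: `firstParagraph` is a hypothetical helper, not part of the script, and it leaves the Markdown formatting in place so that the result can still be passed to remove-markdown.

```javascript
// Hypothetical helper: returns the first paragraph of a Markdown body,
// treating one or more blank lines as the paragraph separator.
function firstParagraph(content) {
  const paragraphs = content.trim().split(/\r?\n\s*\r?\n/);
  // Join hard-wrapped lines of the paragraph into a single line.
  return paragraphs[0]
    .split(/\r?\n/)
    .map(line => line.trim())
    .join(' ');
}

// Made-up body with a paragraph wrapped over two lines.
const body = '\nJekyll stores metadata\nin the **front matter**.\n\nSecond paragraph.';

// Approach used in the script: first line only.
const firstLine = body.trim().split('\n')[0];
// Variant: the whole first paragraph.
const wholeParagraph = firstParagraph(body);

console.log(firstLine);
console.log(wholeParagraph);
```

For content with one line per paragraph, both approaches return the same text.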

Automation using JavaScript

You might have noticed that in both cases I’ve used JavaScript to automate the process. Why? Why not something else like PowerShell or Bash? The answer is simple: simplicity.

To fix my articles, I had to deal with the processing of front matter and Markdown. There are over 1 million packages on npm, including a few that help with both of these things. Using existing packages allowed me to focus on the necessary changes for my content rather than building the plumbing for processing front matter and Markdown. All that was left was to apply some rudimentary understanding of JavaScript and Node.js to parse the files and their contents.

Both scripts are available on GitHub.

In the next articles in this series, I’ll tell you more about how I set up hosting this blog and the tools that I use for writing.