Bobby Russell

Repetitive Text Processing and Posting to WordPress Made Easy with Code

Posted on November 1st, 2013 by bobby.

Manually processing thousands of documents sucks. Not only is it mindless and boring, but it takes up a huge amount of time and is prone to error. But there is latent value in having content out there in the wilds of the web, so it is worth understanding how to process documents quickly and efficiently so that they can yield their value to their owner.

I encountered just such a problem a while back. It had been a strategy of ours for quite some time to spam content into the web in an effort to gain higher volumes of search traffic. For the most part, I can tell you that this strategy is a successful one. But the main problem (for us) has always been scalability: there are only so many articles that one can read, edit, and post in a given amount of time, and I had been spending about a week per month on just 100 articles. Manually completing this task was never really a problem for me, but it was nonetheless a complete and total waste of time (I estimate that I’ve spent close to 300 hours doing this kind of work.)

The boss had approached me with a request to expand this scheme to include an additional site, and to increase the rate of content expansion from 100 articles a month to 200. At that rate, I would have to spend all day, every day, reading, editing, and posting articles. Clearly this was unacceptable. I would need to find a way to get these articles published without wasting tons of time doing so. Why not do it with code?


This whole situation came about because of our desire to build up some lead generation, particularly lead generation with organic search traffic. The idea is simple: bring people to a site (or in our case, sites) and cajole them into doing something (in our case, filling out a lead form.) An excellent and sure-fire way of doing so is to do some keyword research, figure out what people are looking for in your niche, and create content that is relevant to those questions or topics. The more particular these questions or topics are, the easier it is to get the right page to show up as a result for the relevant query. The downside to an approach like this is that it requires a significant amount of research (usually) and time to create and publish content that serves the purpose of generating leads.

In our case, figuring out what to write about was pretty straightforward. Each article only required some boilerplate text and a sprinkling of relevant keywords. We were able to find a content producer that could get us fairly well-written articles at a rate of about $.01 per word. The boilerplate was to run anywhere between 400 and 500 words, putting the price per article at about $4.50. Considering that these leads are worth about $15 a pop, and considering that some of these articles would draw multiple leads and phone sales in per day, they were well worth what we paid.

The Problem

But even cheap content like this comes with a price. When you’re only paying a penny for a word, it’s nearly impossible to get those words from native English speakers. Many of the writers that do this kind of work are from the Philippines or Eastern Europe, and they don’t always have the best grammar or writing style. Thus, there was a slew of common mistakes that needed a-fixin’. This would be the time consuming, manual work. Automating the remaining part of the process would make the workflow manageable.
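To give a sense of what that cleanup looks like: the most mechanical of these mistakes can be knocked out by a simple regex pass before a human ever reads the article. Here’s a minimal sketch of the idea — the patterns below are hypothetical stand-ins, not our actual list:

```python
import re

# Hypothetical examples of common mechanical mistakes; a real list would
# be assembled by hand while reviewing the writers' articles.
FIXES = [
    (re.compile(r'\balot\b'), 'a lot'),
    (re.compile(r'\batleast\b'), 'at least'),
    (re.compile(r'\s+([,.;:])'), r'\1'),  # drop spaces before punctuation
    (re.compile(r'  +'), ' '),            # collapse runs of spaces
]

def apply_fixes(text):
    """Run every pattern/replacement pair over the article text."""
    for pattern, replacement in FIXES:
        text = pattern.sub(replacement, text)
    return text
```

A pass like this doesn’t replace the human review, but it shrinks it.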

What we ultimately needed to do was to parse these documents, post the articles to a WordPress site, and review the articles for the aforementioned grammatical errors before the articles were live to the public.

Some Considerations

First, we would have to consider the cost. After all, we’re doing this to make money, and even though the margins on this are high, other segments of this part of the business had much lower margins, so it was necessary to do this work at as little cost as possible. We resolved to not spend any additional money on the creation and processing of articles than we already were (again, these costs were the cost of creation, and my own time,) and ideally we would spend a whole lot less than we had previously. With these parameters, we had a couple of options:

  1. Make the writers do the work for us
  2. Hire someone else to help with the editing
  3. Write code to automate the process

Ruling Out the Writers

Our first thought was to perhaps try to extract more work out of the writers. After all, we were paying good money to these guys for their work, so why not have them format the articles and then post them to the WordPress site themselves? But, after about 10 seconds of deliberation, it became pretty clear why this wouldn’t work.

To keep the costs low, we used super low-skilled writers working in foreign countries, and getting these guys to follow special instructions could be potentially disastrous (yes, this sounds a bit cynical, but I ran this thought by the editor broker who we were getting the hookup from and he agreed.) Secondly, we didn’t really want to give anyone access to our site, even as a guest. There were too many opportunities for things to go wrong.

Ruling Out a New Hire

We then turned our attention to a new hire. This seemed like a somewhat reasonable solution at first, but we later realized that this simply wasn’t a scalable solution. It would take time for us to hire and train a new editor. Even if we were ready to bring someone else on, they would undoubtedly be expensive, which would violate our specification for an inexpensive solution. And even if we could find someone who was affordable and picked up quickly on what we were trying to do (which was unlikely,) they would almost certainly be slower than I was.*

* Not trying to boast, but I had spent a lot of time doing this particular flavor of editing in the past and can type and write quite quickly and efficiently.

Going Procedural

Given that the first two options kind of sucked, we decided that the only acceptable course of action was to do things procedurally.

Parsing the Document

There were a couple of decisions that needed to be made when parsing these documents. First, we had to figure out if we wanted to parse Word documents or html documents. Each had an interesting set of pros and cons.

Parsing docx files was relatively easy, as there is an existing module that extracts the plaintext out of the file straight away and outputs it to a file. But with this approach, we lost all of the Word formatting. This was problematic because there was no straightforward way to recreate this formatting in HTML; there wasn’t any straightforward logic to add formatting in programmatically either.

The other option was to use an xml parser to do the job. This was a pretty yucky way to do the job (have you ever seen the html output of Word docs?) but at least we would be able to retain our writers’ formatting. Despite the hassle of dealing with bizarre markup, this was ultimately the simplest solution for extracting the latent html content.
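To make “bizarre markup” concrete: Word’s HTML export buries the text under MsoNormal classes, inline styles, and nested span tags. The cleanup boils down to a whitelist: keep a handful of basic formatting tags, drop every attribute, and throw the rest away. Here’s a rough sketch using only the Standard Library (the modern html.parser module; Python 2’s equivalent was HTMLParser, and the tag whitelist is illustrative):

```python
from html.parser import HTMLParser

# Basic formatting tags worth keeping; everything else is Word cruft.
KEEP = {'p', 'ul', 'ol', 'li', 'em', 'strong', 'b', 'u', 'h1', 'h2', 'h3'}

class WordMarkupStripper(HTMLParser):
    """Rebuild the document, keeping only whitelisted tags and
    dropping every attribute (inline styles, classes, etc.)."""
    def __init__(self):
        super().__init__()
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag in KEEP:
            self.out.append('<%s>' % tag)  # attributes deliberately dropped

    def handle_endtag(self, tag):
        if tag in KEEP:
            self.out.append('</%s>' % tag)

    def handle_data(self, data):
        self.out.append(data)

def strip_word_markup(html_text):
    parser = WordMarkupStripper()
    parser.feed(html_text)
    return ''.join(parser.out)
```

BeautifulSoup (used below) handles the same job with far less ceremony, which is a big part of why we reached for it.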

Choosing a Library

There are a bunch of xml/html parsing libraries available for Python, including the Standard Library’s xml and HTMLParser modules and third-party packages like lxml. For this job, however, I chose to use BeautifulSoup. Even though it’s not in the Standard Library (I prefer to use the Standard Library whenever possible,) it offered a lot of useful functionality that would have been difficult and time consuming to write on my own.

After the technology decisions were made, we had to choose a design scheme to implement the core functionality.

Putting the Pieces Together

The main choice we had to make was whether to make the posting process fully automatic or semi-automatic. With fully automatic posting, we would be able to drip articles into the Google index 24/7 (so long as our supply of articles was at a steady drip.) But that could introduce Quality Control problems that would be difficult to address with code alone. Therefore, we settled on a semi-automated process that would:

  1. Process the raw article files
  2. Insert these files directly into the WP database
  3. Review the article, checking for grammatical errors
  4. Submit the changes
  5. Repeat until there aren’t any more articles

With this approach, we were able to concentrate the actual work down to the essentials: reviewing an article for errors, fixing any that exist, and scheduling the article for posting.

The Code

This code requires that we have relatively clean article files. The article files that are parsed by this code are .htm files with ONLY THE DISPLAY INFORMATION SAVED INTO THE HTML! You can use other files if you like, but this is the simplest way of dealing with this type of data.
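In Word, that means saving with “Web Page, Filtered” rather than the full “Web Page” option, since the unfiltered export is littered with mso- inline styles and Office-specific tags. If you want a quick sanity check on incoming files, something like this works — the heuristic here is my own, not part of the original workflow:

```python
def looks_filtered(html_text):
    """Crude heuristic: Word's unfiltered HTML export is full of
    mso- inline styles and <o:...> Office tags; the 'Web Page,
    Filtered' export is not."""
    lowered = html_text.lower()
    return 'mso-' not in lowered and '<o:' not in lowered
```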

Anyway, (with permission from my employer) here’s the spaghetti code:

Document Preprocessing

import sys
from bs4 import BeautifulSoup
from article_loader import load_files

BASH_STDOUT = sys.stdout

Helper Functions

def unwrap_tags(tag_list):
    """Unwrap the content from within specific tags.

    tag_list: a list of tag ResultSets (e.g. every 'p' tag, every 'em' tag, ...)
    """
    for tags in tag_list:
        for t in tags:
            # pull the text out of nested formatting tags
            if t.b is not None:
                t.b.unwrap()
            if t.u is not None:
                t.u.unwrap()
            if t.em is not None:
                t.em.unwrap()
            # strip every attribute (Word's inline styles, classes, etc.)
            for k in list(t.attrs.keys()):
                del t[k]

def remove_attrs(attr_list, contents):
    """Specify a list of tag names whose attributes should be removed."""
    tag_list = []
    for tag_type in attr_list:
        tag_list.append(contents.find_all(tag_type))
    unwrap_tags(tag_list)

Main Functions

def print_files(input_path, output_path, clean_attrs=['p','span','ul','li','em']):
    """Print the processed html files to the file system."""
    file_list = load_files(input_path)
    for file in file_list:
        new_name = file.split('.')
        f = BeautifulSoup(open(input_path + file))
        contents = f.div
        contents['class'] = "page_content"
        remove_attrs(clean_attrs, contents)
        clean_contents = str(contents).rstrip()
        clean_contents = unicode(clean_contents, errors='ignore')
        # redirect stdout so print writes straight into the output file
        sys.stdout = open(output_path + new_name[0] + '.html', 'w')
        print clean_contents
    sys.stdout = BASH_STDOUT

The Process Spawner

import datetime
import json
import sys
from os import path
from subprocess import Popen, PIPE
from html_parser import load_files

BASH_STDOUT = sys.stdout

def fix_names(file_name, separator='_', prefixes=['new', 'south', 'west', 'north', 'rhode']):
    """Correct the way that state names are stored within the array.

    file_name:  file name string that will be used to extract state information
    separator:  a string separating the terms of the file name
    prefixes:   special cases for state names that have 2 words
    """
    states = path.splitext(file_name)[0]
    states = states.split(separator)
    for i, state in enumerate(states):
        if state in prefixes:
            # join the two halves of a two-word state name ('new' + 'york')
            states[i] = state + " " + states[i+1]
            del states[i+1]
    return states

def get_parent_page_list(list_file):
    """
    list_file: a tab separated States, ID list (requires a WP database query or script to produce)
    returns a dict where k, v == state, id
    """
    state_ids_dict = {}
    with open(list_file) as f:
        for line in f.readlines():
            id_list = line.split("\t")
            state_ids_dict[id_list[0]] = int(id_list[1].rstrip())
    return state_ids_dict

def create_title(file_name):
    """Create a title from a filename.

    returns an array of title-cased state name strings
    """
    states = fix_names(file_name)
    for i, state in enumerate(states):
        states[i] = state.title()
    return states

def create_post(FILE_LIST, input_path, output_path, page_id_list,
                initial_post_delay, increment, author_id,
                title_prefix, comments='closed'):
    """Create a post file.

    FILE_LIST:  a list of files from the input_path directory
    input_path: a string location to the input directory (generally the OUTPUT of the htmlparser process)
    output_path: a string location to the output directory (generally your JSON_DIR)
    page_id_list: dictionary of page (string): id (int)
    initial_post_delay: an int that sets the number of hours from now before the first post goes live
    increment: an int that sets the number of hours between each post
    author_id: an int that specifies the author of the post
    title_prefix: a string that sets the prefix of the article's title
    comments: a string that determines whether comments are turned off or on (defaults to off)
    """
    post_date = datetime.datetime.now() + datetime.timedelta(hours=initial_post_delay)
    timedelta = datetime.timedelta(hours=increment)
    for file in FILE_LIST:
        print "The File:", file
        file_name = file
        states = create_title(file)
        file = open(input_path + file).read().encode('utf-8', errors='ignore').replace('\n', '')
        data = {
                'post_author'   : author_id,
                'post_content'  : file,
                'post_date'     : post_date.strftime("%Y-%m-%d %T"),
                'post_title'    : title_prefix + (' '.join(states)),
                'post_type'     : 'page',
                'comment_status': comments,
                'post_name'     : (' to '.join(states)),
                'post_status'   : 'future',
                'post_parent'   : page_id_list[states[0]],
        }
        json_obj = json.dumps(data)
        new_file = output_path + file_name + ".json"
        sys.stdout = open(new_file, 'w')
        print json_obj
        sys.stdout = BASH_STDOUT
        print "SUCCESS! Wrote " + new_file + " to the filesystem."
        print "Post Scheduled for " + post_date.strftime("%Y-%m-%d %T")
        post_date += timedelta

def process_spawner(JSON_PATH):
    """Turns JSON into WordPress posts.

    JSON_PATH: a string JSON directory location
    """
    JSON_FILES = load_files(JSON_PATH)
    for file in JSON_FILES:
        proc = Popen(["php", "json_to_post.php", file], stdout=PIPE)
        stdout = proc.communicate()[0]
        print stdout
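For reference, each file the spawner hands to json_to_post.php is just a serialized argument array for WordPress’s wp_insert_post(). A hypothetical example (every value below is made up):

```json
{
    "post_author": 2,
    "post_content": "<div class=\"page_content\"><p>...</p></div>",
    "post_date": "2013-11-05 09:00:00",
    "post_title": "Moving From New York Florida",
    "post_type": "page",
    "comment_status": "closed",
    "post_name": "New York to Florida",
    "post_status": "future",
    "post_parent": 42
}
```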

Insertion Script (PHP)

<?php
require_once( "/path/to/wp/wp-load.php" );

echo "Starting app...\n";

$file_name = $argv[1];
$string = file_get_contents( "json_files/" . $file_name );
$json_array = json_decode( $string, true );

function make_title( $json_array ) {
	$title = $json_array["post_title"] . " | American Auto Move";
	return $title;
}

function make_metadesc( $json_array ) {
	$meta_desc = "Description Prefix " . $json_array["post_name"] . " Description Postfix";
	return $meta_desc;
}

$post_id = wp_insert_post( $json_array, $wp_error );
if ( $post_id === 0 ) {
	echo "Oops, something went wrong\n";
} else {
	echo "Posted a page! The Page ID is " . $post_id . "\n";
}

echo "Changing the template...\n";
$updated = update_post_meta( $post_id, '_wp_page_template', 'pageandposts.php' );

$new_title = make_title( $json_array );
echo "Changing title to " . $new_title . "...\n";
$change_title = update_post_meta( $post_id, '_yoast_wpseo_title', $new_title );
echo "DONE!\n";

$new_metadesc = make_metadesc( $json_array );
echo "Making new meta description: " . $new_metadesc . "...\n";
$change_metadesc = update_post_meta( $post_id, '_yoast_wpseo_metadesc', $new_metadesc );
echo "The Meta Description is\n";
echo get_post_meta( $post_id, '_yoast_wpseo_metadesc', true );
echo "DONE!\n";

To run everything, just create a driver script that runs the steps above over every file in a given directory and then calls process_spawner(). I happen to prefer Python for most programming tasks, and I used that here, but you can use PHP to write the entire process if you’d like (in fact, it would probably be a lot cleaner that way.)
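The driver itself can be tiny. The one detail worth getting right is ordering: since create_post() schedules each article a fixed number of hours after the previous one, the driver should hand files over in a deterministic order. A sketch of just that piece (the directory and extension are placeholders):

```python
import os

def articles_in_order(directory, extension='.json'):
    """Return the article files in a directory in sorted order, so the
    post-scheduling increments are applied deterministically."""
    return [name for name in sorted(os.listdir(directory))
            if name.endswith(extension)]
```

The driver then just feeds that list into the processing and spawning functions.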

There are CLEARLY MANY OPTIMIZATIONS THAT CAN BE MADE TO THIS CODE. It could (and should) be object oriented, for example. It would also be great to interact with this code via a web MVC interface. Persistent storage and logging of activities are also great ideas. Unfortunately, a lot of my work is maintenance of code these days, so I don’t get a whole lot of time to refactor things.

The point here is that it’s pretty easy to do something like this. Take this quick and dirty code example, hack it, and use it however you like!