Bobby Russell
I/O

2014: My Year of Productivity

January 2nd, 2014 bobby.

Greetings to all, and a Happy New Year! It’s been over two months since my last post, so I’ll have a lot to talk about today. But first, here’s my obligatory recitation of goals and aspirations for the New Year:

  1. Increase my WPM to 100
  2. Master vim
  3. Develop a systematic debugging workflow
  4. Read the docs more often
  5. (And as a result of 1-4) Increase my productivity by 3x

It’s gonna be a pretty tall order to get all of these completed, but I believe that each is a necessary condition of becoming a serious developer. Here’s an explanation of each line item:

Spend Less Time Typing

I spent a lot of time typing in 2013. In fact, despite being a computing professional for 3+ years, I probably typed more than twice what I typed in 2012. Why? Because I spent some time practicing typing, and as it turns out, practicing typing necessarily begets more typing, because you have to type to practice typing!* The faster I type, the more I type, and because I’m typing more, I get even more practice with typing!

* What?

tl;dr: I want to spend less time typing so that I can type more things.

Use My Tools Better

That you type is not enough for the professional developer. What you type, what you have to type, and what you don’t have to type matter more. There is an argument to be made that performing a task accurately is more important than performing it quickly, but with vim, there’s no excuse to sacrifice either. From time to time I have believed that I was decent with vim, and as it turns out, that is mere delusion. Vim is large; it contains multitudes. Learning to use even 10% of vanilla, no-plugin-bullshit vim allows a developer to wield more power than they are probably responsible enough to handle.

Kill Bugs Good

There are few things in life more infuriating than the insidious bug. I like to think that I do a decent job with bug prophylactics, but I must say that I have spent more than my fair share of time tracking down and eliminating bugs. A lot of the time it comes down to taking a deep breath, analyzing the situation, and stepping away from the problem for a minute so that cooler heads can prevail (or so that your own head cools a bit.) But there are other, less subtle and more straightforward approaches to bug extermination.

RTFM

I think that 2013 was my Year of Google. I found myself checking out Stack Overflow a lot, MacGyvering together suboptimal solutions before ultimately exegeting the docs and refactoring (or, often, throwing out) my old solutions. While it’s nice to just Stack Overflow your way through some of your own problems, in my heart of hearts I know that Googling a solution is penny wisdom and pound foolishness. RTFM > LMGTFY, bro.

Thus, My Year of Productivity

My goal this year is to produce triple the amount of code that I produced last year while improving the quality of what I do produce. I think this is both possible and measurable, albeit not necessarily easy, and I think that improving in the aforementioned areas is a pretty good place to begin.

Increase my WPM to 100

From 2011 to 2013, I increased my typing speed from about 50 WPM to roughly 80 WPM for natural language text transcription. When accounting for less oft-used keys (the number row, curly braces, etc.,) I sped up from about 40 WPM to almost 70 WPM. Copying someone else’s code for the sake of practice (thanks, typing.io!,) I am anywhere from 40–50 WPM presently, and was probably nowhere near that in 2011. This year, I want to get my natural language typing up to about 100 WPM; getting my programming typing up to 70 WPM would be excellent.

I think that I can accomplish this by 1) practicing the less often used keys for about 15 minutes a day, and 2) trying my damnedest to NEVER miss ANY key (but especially the funky keys,) even if it means slowing down for a second to look at the key, or just thinking about/visualizing where my hands are on the keyboard before executing a keystroke that feels funny. I’m sure I will continue to play typing games in 2014 to continuously improve my natural language typing.

Mastering vim

If you are not constantly thinking about how to improve your vim, you will be left in ‘the Stone Age of Computing.’ I have to admit that I was much more enthusiastic about learning vim in 2012 than I was in 2013. Perhaps it was a case of not wanting to put the cart before the horse, but it was far more likely a case of being lazy, and once again, of being penny wise and pound foolish. I do not want to repeat in 2014 the mistakes I made in 2013, and I consider one of my cardinal sins to have been forsaking the marginal improvement of my command of vim over time.

While I do doubt that I will “master” vim in the foreseeable future, I do believe that I can make significant improvements to my vim fluency by approaching my learning of it more consistently and systematically. Gary Bernhardt has suggested that learning the entirety of the simple (straight-up character) normal mode commands is a good place to start. It seems as reasonable a place to start as any.

Develop a Systematic Debugging Workflow

Thinking about it now, I imagine that this is probably going to be the hardest problem for me to solve this year. It seems that solutions to bugs just kind of bubble up into my consciousness from thin air. It almost feels like I’m inspired into bug fixes from time to time, which is really kind of crazy when you think about how stressful it is to crunch bugs on a deadline.

This year, my plan to systematically debug my systems is to kill bugs before they ever manifest themselves in production. That means a greater attentiveness to unit testing, and a deeper knowledge of design patterns and the problems that they address.
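
To make that concrete, here’s a minimal sketch of the kind of unit test I have in mind, using Python’s standard unittest module. The slugify() function is invented for illustration, not code from any real project of mine:

import unittest

def slugify(title):
    # collapse a post title into a url-safe slug
    return '-'.join(title.lower().split())

class TestSlugify(unittest.TestCase):
    def test_basic_title(self):
        self.assertEqual(slugify('My Year of Productivity'),
                         'my-year-of-productivity')

    def test_extra_whitespace(self):
        # exactly the sort of edge case I would rather catch here than in production
        self.assertEqual(slugify('  Spaced   Out  '), 'spaced-out')

if __name__ == '__main__':
    unittest.main()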

Read the docs more often

Like seriously, read the damned docs. Just do it. God.

Final Word

It’s fun to challenge yourself. It’s even more fun when you’re challenging yourself with a purpose. There are real stakes in making a concerted effort to improve your productivity: the more you get done with your time, the more you earn. After all, time is money. Whether you’re earning more money by doing more with your time, or by doing better work so that your time is worth more, productivity talks and bullshit walks.

Work smarter in 2014, not harder.


Repetitive Text Processing and Posting to WordPress Made Easy with Code

November 1st, 2013 bobby.

Manually processing thousands of documents sucks. Not only is it mindless and boring, but it takes up a huge amount of time and is prone to error. But there is latent value in having content out there in the wilds of the web, so it is worth understanding how to process documents quickly and efficiently in order that they may yield their value to their owner.

I encountered just such a problem a while back. It had been a strategy of ours for quite some time to spam content into the web in an effort to gain higher volumes of search traffic. For the most part, I can tell you that this strategy is a successful one. But the main problem (for us) has always been scalability: there are only so many articles that one can read, edit, and post in a given amount of time, and I had been spending about a week per month on just 100 articles. Manually completing this task was never really a problem for me, but it was nonetheless a complete and total waste of time (I estimate that I’ve spent close to 300 hours doing this kind of work.)

The boss had approached me with a request to expand this scheme to include an additional site, and to increase the rate of content expansion from 100 articles a month to 200. At that rate, I would have to spend all day, every day, reading, editing, and posting articles. Clearly this was unacceptable. I would need to find a way to get these articles published without wasting tons of time doing so. Why not do it with code?

Motivation

This whole situation came about because of our desire to build up some lead generation, particularly lead generation with organic search traffic. The idea is simple: bring people to a site (or in our case, sites) and cajole them into doing something (in our case, filling out a lead form.) An excellent and sure-fire way of doing so is to do some keyword research, figure out what people are looking for in your niche, and create content that is relevant to those questions or topics. The more particular these questions or topics are, the easier it is to get the right page to show up as a result for the relevant query. The downside to an approach like this is that it requires a significant amount of research (usually) and time to create and publish content that serves the purpose of generating leads.

In our case, figuring out what to write about was pretty straightforward. Each article only required some boilerplate text and a sprinkling of relevant keywords. We were able to find a content producer that could get fairly well-written articles to us at a rate of about $.01 per word. The boilerplate ran anywhere between 400 and 500 words, putting the price per article at about $4.50. Considering that these leads are worth about $15 a pop, and that some of these articles would draw in multiple leads and phone sales per day, they were well worth what we paid.

The Problem

But even cheap content like this comes with a price. When you’re only paying a penny per word, it’s nearly impossible to get those words from native English speakers. Many of the writers who do this kind of work are from the Philippines or Eastern Europe, and they don’t always have the best grammar or writing style. Thus, there was a slew of common mistakes that needed a-fixin’. This would be the time-consuming, manual work. Automating the rest of the process would make the workflow manageable.

What we ultimately needed to do was parse these documents, post the articles to a WordPress site, and review the articles for the aforementioned grammatical errors before they went live to the public.

Some Considerations

First, we had to consider cost. After all, we’re doing this to make money, and even though the margins on this are high, other segments of this part of the business had much lower margins, so it was necessary to do this work at as little cost as possible. We resolved not to spend any more money on the creation and processing of articles than we already were (again, those costs were the cost of creation, plus my own time,) and ideally we would spend a whole lot less than we had previously. With these parameters, we had a couple of options:

  1. Make the writers do the work for us
  2. Hire someone else to help with the editing
  3. Write code to automate the process

Ruling Out the Writers

Our first thought was to perhaps try to extract more work out of the writers. After all, we were paying good money to these guys for their work, so why not have them format the articles and then post them to the WordPress site themselves? But, after about 10 seconds of deliberation, it became pretty clear why this wouldn’t work.

To keep costs low, we used super low-skilled writers working in foreign countries, and getting these guys to follow special instructions could have been potentially disastrous (yes, this sounds a bit cynical, but I ran this thought by the editor broker who we were getting the hookup from, and he agreed.) Secondly, we didn’t really want to give anyone access to our site, even as a guest. There were too many opportunities for things to go wrong.

Ruling Out a New Hire

We then turned our attention to a new hire. This seemed like a somewhat reasonable solution at first, but we later realized that it simply wasn’t scalable. It would take time for us to hire and train a new editor. Even if we were ready to bring someone else on, they would undoubtedly be expensive, which would violate our specification for an inexpensive solution. And even if we could find someone who was affordable and picked up quickly on what we were trying to do (which was unlikely,) they would almost certainly be slower than I was.*

* Not trying to boast, but I had spent a lot of time doing this particular flavor of editing in the past, and I can type and write quite quickly and efficiently.

Going Procedural

Given that the first two options kind of sucked, we decided that the only acceptable course of action was to do things procedurally.

Parsing the Document

There were a couple of decisions that needed to be made when parsing these documents. First, we had to figure out if we wanted to parse Word documents or html documents. Each had an interesting set of pros and cons.

Parsing docx files was relatively easy, as there is an existing module that extracts the plaintext out of the file straight away and outputs it to a file (sketched below.) But with this approach, we would lose all of the Word formatting. This was problematic because there was no straightforward way to recreate the formatting in HTML, and there wasn’t any straightforward logic for adding formatting programmatically, either.
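
Here’s roughly what that extraction looks like. I’m sketching against the python-docx API here, which may or may not be the exact module we used, so treat this as illustrative:

# plaintext extraction sketch -- assumes the python-docx package;
# every bit of Word's bold/italic/list formatting is lost along the way
from docx import Document

doc = Document('article.docx')
with open('article.txt', 'w') as out:
    for paragraph in doc.paragraphs:
        out.write(paragraph.text.encode('utf-8') + '\n')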

The other option was to use an XML parser to do the job. This was a pretty yucky way to do it (have you ever seen the HTML output of Word docs?) but at least we would be able to retain our writers’ formatting. Despite the hassle of dealing with bizarre markup, this was ultimately the simplest solution for extracting the latent HTML content.

Choosing a Library

There are a bunch of XML/HTML parsing libraries available to Python, including lxml and the Standard Library’s xml and HTMLParser modules. For this job, however, I chose to use BeautifulSoup. Even though it’s not in the Standard Library (and I prefer to use the Standard Library whenever possible,) it offered a lot of useful functionality, not available in the Standard Library, that would have been difficult and time-consuming to write on my own. A small taste of it is below.
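
As a taste of why BeautifulSoup was attractive here, cleaning Word’s inline noise takes a couple of lines per tag. The markup below is invented for illustration:

from bs4 import BeautifulSoup

messy = '<div><p class="MsoNormal"><span style="mso-bidi-font-size:12.0pt">' \
        'Hello, <b>world</b>!</span></p></div>'
soup = BeautifulSoup(messy)

for span in soup.find_all('span'):
    span.unwrap()   # keep the contents, drop the tag itself
for p in soup.find_all('p'):
    p.attrs = {}    # throw away Word's class soup

print soup.div
# <div><p>Hello, <b>world</b>!</p></div>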

After the technology decisions were made, we had to choose a design scheme to implement the core functionality.

Putting the Pieces Together

The main choice we had to make was whether to make the posting process fully automatic or semi-automatic. With fully automatic posting, we would be able to drip articles into the Google index 24/7 (so long as our supply of articles kept up a steady drip.) But that could introduce quality control problems that would be difficult to address with code alone. Therefore, we settled on a semi-automated process that would:

  1. Process the raw article files
  2. Insert these files directly into the WP database
  3. Review the article, checking for grammatical errors
  4. Submit the changes
  5. Repeat until there aren’t any more articles

With this approach, we were able to concentrate the actual work down to the essentials: reviewing an article for errors, fixing them if they exist, and scheduling the article for posting.

The Code

This code requires that we have relatively clean article files. The article files that are parsed by this code are .htm files with ONLY THE DISPLAY INFORMATION SAVED INTO THE HTML! You can use other files if you like, but this is the simplest way of dealing with this type of data.

Anyway, (with permission from my employer) here’s the spaghetti code:

Document Preprocessing

import sys
 
from bs4 import BeautifulSoup
 
from article_loader import load_files
 
"""
CONSTANTS
"""
 
BASH_STDOUT = sys.stdout
 
 
"""
Helper Functions
"""
 
def unwrap_tags(tag_list):
    """
    Unwrap the content from within specific tags
    tag_list: a list of ResultSets of Tags (the output of find_all),
    not a list of tag name strings
    """
    for result_set in tag_list:
        for t in result_set:
            # pull the contents out of any nested b/u/em tags
            if t.b is not None:
                t.b.unwrap()
            if t.u is not None:
                t.u.unwrap()
            if t.em is not None:
                t.em.unwrap()
            # strip every attribute from the tag itself
            for k in list(t.attrs.keys()):
                del t[k]
 
def remove_attrs(attr_list, contents):
    """
    Strip the formatting from every instance of the given tag types
    attr_list: a list of tag name strings (e.g. 'p', 'span')
    contents:  the BeautifulSoup tag to clean
    """
    tag_list = []
    for tag_type in attr_list:
        tag_list.append(contents.find_all(tag_type))

    unwrap_tags(tag_list)
 
"""
Main Functions
"""
 
def print_files(input_path, output_path, clean_attrs=['p','span','ul','li','em']):
    """
    Print the processed html files to the file system.
    """
    file_list = load_files(input_path)
    for file in file_list:
        new_name = file.split('.')
        soup = BeautifulSoup(open(input_path + file))
        # the article body lives in the document's first div
        contents = soup.div
        contents['class'] = "page_content"
        remove_attrs(clean_attrs, contents)
        # strip trailing whitespace and drop any non-ascii garbage left over from Word
        clean_contents = str(contents).rstrip()
        clean_contents = unicode(clean_contents, errors='ignore')
        # point stdout at the output file so a bare print writes the document out
        sys.stdout = open(output_path + new_name[0] + '.html', 'w')
        print clean_contents
    sys.stdout = BASH_STDOUT

The Process Spawner

import datetime
 
import json
 
import sys
 
from os \
        import path
 
from subprocess \
        import Popen, PIPE
 
from html_parser \
        import load_files
 
"""
Constants
"""
 
BASH_STDOUT = sys.stdout
 
"""
Helpers
"""
 
def fix_names(file_name, separator='_', prefixes=['new', 'south', 'west', 'north', 'rhode']):
    """
    Corrects the way that names are stored within the array
    file_name:  file name string that will be used to extract state information
    separator:  a string separating the terms of the file name
    prefixes:   special cases for state names that have 2 words
    """
    states = path.splitext(file_name)[0]
    states = states.split(separator)

    # glue two-word state names back together, e.g. ['new', 'york'] -> ['new york']
    for i, state in enumerate(states):
        if state in prefixes:
            states[i] = state + " " + states[i+1]
            del states[i+1]
    return states
 
def get_parent_page_list(list_file):
    """
    list_file: a tab separated States, ID list (requires a WP database query or script to produce)
    returns a dict where k, v == state, id
    """
    state_ids_dict = {}
    with open(list_file) as f:
        states_and_ids = f.readlines()
        for line in states_and_ids:
            id_list = line.split("\t")
            state_ids_dict[id_list[0]] = int(id_list[1].rstrip())
    return state_ids_dict
 
 
def create_title(file_name):
    """
    Create a title from a filename
    file_name: the article file name, from which the state names are extracted
    returns an array of title-cased state name strings
    """
    states = fix_names(file_name)
    for i, state in enumerate(states):
        states[i] = state.title()
    return states
 
"""
Functions
"""
 
def create_post(FILE_LIST, 
                input_path,
                output_path,
                page_id_list, 
                author_id,
                title_prefix,
                initial_post_delay=1, 
                increment=8, 
                comments='closed'
    ):
    """
    Create a post file
    FILE_LIST:  a list of files from the input_path directory
    input_path: a string location to the input directory (generally the OUTPUT of the htmlparser process)
    output_path: a string location to the output directory (generally your JSON_DIR)
    page_id_list: dictionary of page (string): id (int)
    initial_post_delay: an int that sets the number of hours from now before the first post goes live
    increment: an int that sets the number of hours between each post
    author_id: an int that specifies the author of the post
    title_prefix: a string that sets the prefix of the articles title
    comments: a string that determines whether comments are turned off or on (defaults to off)
    """
    post_date = datetime.datetime.today() + datetime.timedelta(hours=initial_post_delay)
    timedelta = datetime.timedelta(hours=increment)
 
    for file in FILE_LIST:
        print "The File:", file
        file_name = file
        states = create_title(file)
        # decode (not encode) so stray non-utf-8 bytes are dropped instead of crashing
        file = open(input_path + file).read().decode('utf-8', errors='ignore').replace('\n', '')
        data = {
                'post_author'   : author_id,
                'post_content'  : file,
                'post_date'     : post_date.strftime("%Y-%m-%d %T"),
                'post_title'    : title_prefix + (' '.join(states)),
                'post_type'     : 'page',
                'comment_status': comments,
                'post_name'     : (' to '.join(states)),
                'post_status'   : 'future',
                'post_parent'   : page_id_list[states[0]],
                }
        json_obj = json.dumps(data)
        new_file = output_path + file_name + ".json"
        sys.stdout = open(new_file, 'w')
        print json_obj
        sys.stdout = BASH_STDOUT
        print "SUCCESS! Wrote " + new_file + " to the filesystem."
        print "Post Scheduled for " + post_date.strftime("%Y-%m-%d %T"),
 
        post_date += timedelta
 
def process_spawner(JSON_PATH):
    """
    Turns JSON into WordPress posts
    JSON_PATH: a string JSON directory location
    """
    JSON_FILES = load_files(JSON_PATH)
    # hand each json payload to the PHP insertion script
    for file in JSON_FILES:
        proc = Popen(["php", "json_to_post.php", file],
                     stdin=PIPE,
                     stdout=PIPE,
                     )
        stdout = proc.communicate()[0]
        print stdout

Insertion Script (PHP)

<?php
require_once( "/path/to/wp/wp-load.php" );
 
echo "Starting app...\n";
 
$file_name = $argv[1];
 
$string = file_get_contents( "json_files/" . $file_name );
$json_array = json_decode( $string, true );
 
function make_title( $json_array ) {
	$title = $json_array["post_title"] . " | American Auto Move";
	return $title;
}
 
function make_metadesc( $json_array ) {
	$meta_desc = "Description Prefix " . $json_array["post_name"] . " Description Postfix";
	return $meta_desc;
}
 
$post_id = wp_insert_post( $json_array );
 
if ( $post_id === 0 ) {
	echo "Oops, something went wrong\n";
	die;
} else {
	echo "Posted a page! The Page ID is " . $post_id . "\n";
}
 
echo "Changing the template...\n";
 
$updated = update_post_meta($post_id, '_wp_page_template', 'pageandposts.php');
 
$new_title = make_title( $json_array );
 
echo "Changing title to " . $new_title . "...\n";
 
$change_title = update_post_meta( $post_id, '_yoast_wpseo_title', $new_title);
 
echo "DONE!\n";
 
$new_metadesc = make_metadesc( $json_array );
 
echo "Making new meta description to " . $new_metadesc . "...\n";
 
$change_metadesc = update_post_meta( $post_id, '_yoast_wpseo_metadesc', $new_metadesc);
 
echo "The Meta Description is\n";
 
echo get_post_meta( $post_id, '_yoast_wpseo_metadesc', true);
 
echo "DONE!\n";
 
?>

To run everything, just create a driver script that chains these pieces together over every file in a given directory; a sketch follows. I happen to prefer Python for most programming tasks, and I used it here, but you could write the entire process in PHP if you’d like (in fact, it would probably be a lot cleaner that way.)
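
Here’s a minimal driver sketch. The module and file names (post_creator, state_ids.tsv) and the directory paths are placeholders standing in for my actual setup, so adjust to taste:

# driver.py -- glue the whole pipeline together (paths and IDs are placeholders)
from html_parser import print_files, load_files
from post_creator import create_post, get_parent_page_list, process_spawner

RAW_DIR   = 'raw_articles/'    # .htm files from the writers
CLEAN_DIR = 'clean_articles/'  # processed html
JSON_DIR  = 'json_files/'      # json payloads for the PHP script

# 1. strip the Word cruft out of the raw html
print_files(RAW_DIR, CLEAN_DIR)

# 2. build a json payload for every cleaned article
page_ids = get_parent_page_list('state_ids.tsv')  # tab-separated "state\tid" dump from WP
create_post(load_files(CLEAN_DIR), CLEAN_DIR, JSON_DIR, page_ids,
            author_id=1, title_prefix='Moving from ')

# 3. hand each payload off to WordPress
process_spawner(JSON_DIR)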

There are CLEARLY MANY OPTIMIZATIONS THAT COULD BE MADE TO THIS CODE. It could (and should) be object-oriented, for example. It would also be great to interact with this code via a web MVC interface. Persistent storage and logging of activity are also great ideas. Unfortunately, a lot of my work is maintenance of code these days, so I don’t get a whole lot of time to refactor things.

The point here is that it’s pretty easy to do something like this. Take this quick and dirty code example, hack it, and use it however you like!


Cantor’s Diagonal Argument

November 1st, 2013 bobby.

The coolest thing that I’ve learned so far in my Discrete Mathematics class is Cantor’s Diagonal Argument. This argument reveals that there are at least two kinds of infinite cardinality: countably infinite and uncountably infinite. A countably infinite set is countable (i.e., there is a procedure for enumerating its elements one by one.)

A countably infinite set is a set such that there is a 1:1 correspondence, otherwise known as a bijection, between the elements of the set and the set of all Counting numbers ({1,2,3,4,5…}); for example, the even numbers are countably infinite, since f(n) = 2n pairs every Counting number with exactly one even number. A set is uncountably infinite if it is infinite and there isn’t a bijection between it and the Counting numbers. Alright, fine… but how do you know when you can count the elements of a set and when you can’t?

That’s what Cantor shows. Suppose the set of Real numbers (between 0 and 1, say) is countable. Then you can enumerate it (like you can with all countable sets, DUH!) as a list of decimal expansions $b_1, b_2, b_3, \ldots$, where $b_n = 0.b_{n1}b_{n2}b_{n3}\ldots$ Written out row by row, the digits of this list form a matrix. Now define a sequence $\bar{b}$ that traverses the diagonal of that matrix and changes every digit it meets:

$$\bar{b} = 0.\bar{b}_{11}\bar{b}_{22}\ldots\bar{b}_{nn}\ldots, \qquad \bar{b}_{nn} \neq b_{nn},$$

where the double index $nn$ marks the position of that particular digit along the diagonal. Because each $\bar{b}_{nn}$ is by definition different from $b_{nn}$, $\bar{b}$ can never be the same as the $b_n$ that shares its index: the two must differ at the $n$th digit.

If you can count the Real numbers, then you should be able to enumerate them when you’re asked to. There’s a shitload of sequences in the set of Real numbers, so we can’t actually write them all down, but we can still reason about $\bar{b}$! Since $\bar{b}$ differs from every sequence $b_n$ on the list in at least one digit, it appears nowhere on the list, and so there isn’t a bijection between the Counting numbers and the Real numbers. If that’s true, then we can’t say that the two sets have the same cardinality: both are infinite, but the set of Real numbers has a cardinality that is somehow bigger than that of the set of Counting numbers. And because of the way $\bar{b}$ is built, this works against any proposed list: there’s always some diagonal position $nn$ where $\bar{b}$ and $b_n$ differ, by virtue of the fact that the list is indexed by the countably infinite Counting numbers!
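
If it helps to see the trick run, here’s a toy sketch of my own (not from class): given any finite list of digit sequences, flipping the diagonal produces a sequence that can’t be any row of the list.

# toy diagonalization: differ from the nth sequence at the nth digit
def diagonal_flip(sequences):
    return [(sequences[n][n] + 1) % 10 for n in range(len(sequences))]

b = [[1, 4, 1, 5],
     [2, 7, 1, 8],
     [3, 3, 3, 3],
     [0, 5, 7, 7]]

bbar = diagonal_flip(b)
print bbar       # [2, 8, 4, 8] -- differs from row n at digit n
print bbar in b  # False, no matter what rows you start with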

Cantor was such a smartass. My mind is blown right now.