Data Pipeline Architecture for Job Posting Analytics

In my previous post, I presented a data dashboard that allowed the viewer to slice and dice 2017/18 Infocomm Job Postings from the Careers@Gov portal. Here, I will explain the data pipeline that I used to perform the necessary ETL operations for preparing the data dashboard.

Architecture

[Figure: Data pipeline for Careers@Gov]

As the architecture above shows, it is a pretty straightforward setup. I have chosen to run it on top of Docker to avoid being encumbered by the installation of the necessary software and libraries (e.g. the hassle of installing Python3/pip3 alongside Python2, or installing NLTK without the GUI because I run on a headless server).

One key advantage of this approach is that I can run my pipeline on any host without worrying too much about the underlying OS details.

Pipeline Step Details

The pipeline comprises three main parts: transformation (left/blue), analytics (middle/green), and visualization (right/red).

Step 1 – Data Transformation

Data from the Careers@Gov portal arrives as one raw HTML file per job posting. The first step is to use a simple Python + Beautiful Soup script to extract the desired section (i.e. a specific div tag) and strip away the unnecessary tags.

$ docker run --rm -i --user $(id -u):$(id -g) \
  -v $PWD:/scripts \
  -v $PWD/../FolderA-RawData:/FolderA-RawData \
  -v $PWD/../FolderB-JobPostingOnly:/FolderB-JobPostingOnly \
  -w /scripts/ python:3.7-alpine /scripts/SH1-RemoveHTML.sh

As the command above shows, I am using the vanilla Python 3.7 container that is built from the Alpine base (lean images ftw!). I mount all my data input, data output, and script folders as volumes, and make use of the Python engine within the container to do the processing.
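
For illustration, the core of the HTML-stripping script (wrapped by SH1-RemoveHTML.sh) could look something like the minimal sketch below. The div class name is a placeholder rather than the portal's actual markup, and Beautiful Soup would need to be pip-installed into the container first.

# sketch of the HTML-stripping step (illustrative only)
# the div class "job-posting" is a placeholder, not the portal's actual markup
# requires: pip install beautifulsoup4
import sys
from bs4 import BeautifulSoup

def extract_posting(html_path, out_path):
    with open(html_path, encoding="utf-8") as f:
        soup = BeautifulSoup(f, "html.parser")
    # keep only the div that holds the job posting body, drop everything else
    posting = soup.find("div", class_="job-posting")
    text = posting.get_text(separator="\n", strip=True) if posting else ""
    with open(out_path, "w", encoding="utf-8") as f:
        f.write(text)

if __name__ == "__main__":
    extract_posting(sys.argv[1], sys.argv[2])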

After all the HTML tags have been stripped away, I noticed that the files still contained non-ASCII UTF-8 characters, often left behind by bullet points, tab spaces, or curly quotes within the data file. To wrangle these away, I use a nifty combination of the xargs and iconv commands:

$ ls ../FolderB-JobPostingOnly | xargs -L 1 -I {} sh \
  -c "iconv -f utf-8 -t ascii//IGNORE//TRANSLIT \
  -o ../FolderB-JobPostingOnly/{} ../FolderB-JobPostingOnly/{}"
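
If you prefer to keep everything in Python instead of shelling out to iconv, roughly the same cleanup could be done with a small script like the sketch below (an alternative illustration, not what the pipeline actually runs):

# sketch of an ASCII cleanup step, similar in spirit to iconv //TRANSLIT//IGNORE
# (alternative illustration only; the pipeline actually uses xargs + iconv)
import sys
import unicodedata

def to_ascii(text):
    # decompose accented characters, then drop anything that still is not ASCII
    normalized = unicodedata.normalize("NFKD", text)
    return normalized.encode("ascii", "ignore").decode("ascii")

if __name__ == "__main__":
    path = sys.argv[1]
    with open(path, encoding="utf-8") as f:
        cleaned = to_ascii(f.read())
    with open(path, "w", encoding="utf-8") as f:
        f.write(cleaned)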

Step 2A – Fixed Field Extraction

In all job postings, there are several fixed fields that appear consistently, such as the hiring ministry/agency and the job type (e.g. Permanent or Internship). I use another Python script to extract these fields and return them in a delimited format:

$ docker run --rm -i --user $(id -u):$(id -g) \
  -v $PWD:/scripts \
  -v $PWD/../FolderB-JobPostingOnly:/FolderB-JobPostingOnly \
  -v $PWD/../FolderC-JobKeyFields:/FolderC-JobKeyFields \
  -w /scripts/ python:3.7-alpine /scripts/SH2-ExtractKeyFields.sh

The command is identical to the one listed in Step 1, with the only difference being the folder and script names.
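
As a rough illustration, the extraction inside SH2-ExtractKeyFields.sh could be along the lines of the sketch below. The field labels are placeholders rather than the portal's exact wording; the pipe delimiter matches the sort command used in the next step.

# sketch of the fixed-field extraction (illustrative only)
# the field labels below are placeholders, not the portal's exact wording
import re
import sys

FIELDS = ["Job ID", "Agency", "Job Type", "Closing Date"]

def extract_fields(text):
    values = []
    for label in FIELDS:
        # look for lines like "Job ID: 12345" in the stripped text
        match = re.search(rf"{label}\s*[:\-]?\s*(.+)", text, re.IGNORECASE)
        values.append(match.group(1).strip() if match else "")
    return "|".join(values)

if __name__ == "__main__":
    with open(sys.argv[1], encoding="utf-8") as f:
        print(extract_fields(f.read()))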

Step 2A (Add-on) – Job Post Deduplication

As part of my own requirements, I deduplicate job postings by Job ID, treating a posting as a duplicate when the end date of the previous posting falls within 30 days of the newer posting. Doing so makes the data easier to visualize, but loses the count of re-postings within a given duration.

$ thisJobOutput=`date +%Y-%m-%d_%H:%M:%S_%Z` && \
  ls ../FolderC-JobKeyFields/ | \
  xargs -I {} cat ../FolderC-JobKeyFields/{} >> ../FolderD-SantizedAndDeduped/raw-$thisJobOutput && \
  sort -t"|" -k1,1 -k10,10n -k9,9M -k8,8n ../FolderD-SantizedAndDeduped/raw-$thisJobOutput >> ../FolderD-SantizedAndDeduped/sorted-$thisJobOutput && \
  docker run --rm -i --user $(id -u):$(id -g) \
  -v $PWD:/scripts \
  -v $PWD/../FolderD-SantizedAndDeduped:/FolderD-SantizedAndDeduped \
  -w /scripts/ python:3.7-alpine python S3-SantizeAndDedup.py /FolderD-SantizedAndDeduped/sorted-$thisJobOutput > ../FolderD-SantizedAndDeduped/processed-$thisJobOutput

The code block above 1) combines the individual job postings into a single file, 2) sorts the jobs based on their Job ID and posting date, and 3) deduplicates them using a third Python script.

The intermediate and output filenames are suffixed with a timestamp, so that repeated runs of this step (e.g. when new data is available) will not overwrite previous results.
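
Conceptually, the deduplication inside S3-SantizeAndDedup.py does something along the lines of the sketch below. It assumes pipe-delimited rows already sorted by Job ID and date; the column positions and date format are assumptions for illustration, not the real file layout.

# simplified sketch of the dedup logic: drop a re-posting of the same Job ID
# when it appears within 30 days of the previous posting's end date
# (column positions and date format are assumptions, not the real layout)
import sys
from datetime import datetime, timedelta

ID_COL, POST_DATE_COL, END_DATE_COL = 0, 7, 9   # hypothetical positions
DATE_FMT = "%d %b %Y"                           # assumed format, e.g. "05 Mar 2018"

def dedup(lines):
    kept = []
    last_end = {}   # Job ID -> end date of the last posting we kept
    for line in lines:
        cols = line.rstrip("\n").split("|")
        job_id = cols[ID_COL]
        posted = datetime.strptime(cols[POST_DATE_COL], DATE_FMT)
        ended = datetime.strptime(cols[END_DATE_COL], DATE_FMT)
        prev_end = last_end.get(job_id)
        if prev_end is not None and posted - prev_end <= timedelta(days=30):
            continue   # re-posting within 30 days of the previous end date: skip
        kept.append(line.rstrip("\n"))
        last_end[job_id] = ended
    return kept

if __name__ == "__main__":
    with open(sys.argv[1], encoding="utf-8") as f:
        print("\n".join(dedup(f)))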

Step 2B – Document Clustering

The document clustering branch of the pipeline has the following steps:

  1. For every de-duplicated Job ID, remove the fixed fields, convert the remaining text to lower case, then perform part-of-speech tagging, lemmatization, and stop word removal.
  2. Express every Job ID as a bag of words (ideally just its nouns) and use scikit-learn to cluster the postings. Finally, return the cluster that each Job ID belongs to (a rough sketch of this branch follows below).
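
The sketch below illustrates that branch, using NLTK for the preprocessing and scikit-learn for the clustering. The noun-only filter and the number of clusters are illustrative choices, not the exact settings used in the pipeline.

# rough sketch of the clustering branch: NLTK preprocessing into a noun-only
# bag of words, then k-means via scikit-learn (cluster count is illustrative)
# requires the NLTK corpora: punkt, averaged_perceptron_tagger, wordnet, stopwords
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import KMeans

def to_noun_bag(text):
    lemmatizer = WordNetLemmatizer()
    stop = set(stopwords.words("english"))
    tokens = nltk.word_tokenize(text.lower())
    tagged = nltk.pos_tag(tokens)
    # keep lemmatized nouns that are alphabetic and not stop words
    nouns = [lemmatizer.lemmatize(w) for w, tag in tagged
             if tag.startswith("NN") and w.isalpha() and w not in stop]
    return " ".join(nouns)

def cluster_postings(posting_texts, n_clusters=8):
    docs = [to_noun_bag(t) for t in posting_texts]
    counts = CountVectorizer().fit_transform(docs)   # bag-of-words matrix
    labels = KMeans(n_clusters=n_clusters, random_state=0).fit_predict(counts)
    return labels   # cluster index for each Job ID, in input order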

More improvements and streamlining can be done for this part of the pipeline, and I'll probably revisit it when the time comes.

Step 3 – Data Visualization

I chose Google Data Studio as it allowed me to quickly share a self-serve data dashboard with others. It was also a good excuse to learn a new tool for data visualization 🙂

Summary

While designing the data pipeline for analyzing Careers@Gov job postings, one of my key considerations was that it should be made of reusable, cross-platform components.

And while there are definitely areas for improvement, the current state of the pipeline makes a good minimum viable product that can be launched and refined over time. If you have cool suggestions to improve the work done here, feel free to reach out! 🙂
