In my earlier post, I used MongoDB to identify trends in several years of documents from a previous volunteer role. In this post, I'll share the process, a sample document schema, and the queries used to derive answers to those questions.
While tidying up 10 years' worth of digital documents from a previous volunteer role, I found my excuse to get familiar with MongoDB: identifying trends and answering questions I had often pondered.
I stumbled upon Paint.NET back in college while looking for free image-editing software for poster creation. It has since been my go-to whenever I need to do some image/photo editing. A while ago, I helped a friend thicken/bold some handwritten words in a scan of a concert poster so as … Continue reading Thickening Words in Paint.NET
Using SQL queries to generate reports spanning several days of data can take a non-trivial amount of time. While it is tempting to simply throw more hardware at the problem, that does little to address the underlying problem of inefficient queries. Inefficient queries are precursors to their final production-ready counterparts, similar to developing software whereby the … Continue reading Optimizing Redshift SQL Queries Via Query Plan Estimates
In my previous post, I presented a data dashboard that allowed the viewer to slice and dice 2017/18 Infocomm Job Postings from the Careers@Gov portal. Here, I will explain the data pipeline that I used to perform the necessary ETL operations for preparing the data dashboard. Architecture: Based on the architecture above, you can tell … Continue reading Data Pipeline Architecture for Job Posting Analytics
Foreword: All opinions expressed in this post are mine alone and are not representative of any organization or group, regardless of affiliation status. The work/effort done here used purely my own resources and time (which also explains the untimeliness of this post). This is a hobby project; you would be ill-advised to make decisions/judgements based … Continue reading Data Dashboard: 2017/18 Infocomm Job Postings on Careers@Gov (Part 2)
I was recently given the opportunity to optimize a query that processed a total of 660 million rows. The problem with this query was that it took 150 minutes to complete, provided that it did not time out (which it did ~40% of the time). The query timing out caused two key problems, namely: 1) … Continue reading Query Optimization – Processing 660 Million Rows Twice as Fast
I am curious about the skills in demand for infocomm jobs within the Civil Service. In my personal capacity, I set out to analyze and summarize my observations of trends within infocomm job postings on Careers@Gov.
In my previous post on TeBaC-NET, I talked about why I created it. In this post, I talk about why I designed it the way I did. Design Considerations: #1 Cross-Platform. One of the most important considerations is that it should be platform agnostic. A simple tool that can run on any … Continue reading TeBaC-NET Design Considerations
I was recently exploring spaCy for some NLP work and found that the default model was not sufficient for tagging entities in the domain I was exploring. The documentation was very helpful in explaining how I could train the named entity recognizer's statistical model, but I needed training and evaluation data. While I could … Continue reading Text Based Custom Named Entity Tagger (TeBaC-NET)