Data Pipeline Architecture for Job Posting Analytics

In my previous post, I presented a data dashboard that let the viewer slice and dice 2017/18 Infocomm Job Postings from the Careers@Gov portal. Here, I will explain the data pipeline I used to perform the ETL operations that prepared the data for that dashboard. …
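
The full post walks through the actual architecture; purely as a rough illustration of the ETL idea, here is a minimal Python/pandas sketch. The file names and column names ("posting_date", "job_title") are hypothetical placeholders, not the pipeline described in the post.

    import pandas as pd

    # Extract: load the raw postings (hypothetical file name)
    raw = pd.read_csv("job_postings_raw.csv")

    # Transform: parse dates and drop rows missing key fields
    # ("posting_date" and "job_title" are placeholder column names)
    raw["posting_date"] = pd.to_datetime(raw["posting_date"], errors="coerce")
    clean = raw.dropna(subset=["job_title", "posting_date"])

    # Load: write the cleaned table for the dashboard layer to consume
    clean.to_csv("job_postings_clean.csv", index=False)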

Data Dashboard: 2017/18 Infocomm Job Postings on Careers@Gov (Part 2)

Foreword: All opinions expressed in this post are mine alone and do not represent any organization or group, regardless of affiliation. The work here was done purely with my own resources and time (which also explains the untimeliness of this post). This is a hobby project, and you would be ill-advised to make decisions or judgements based …

Text Based Custom Named Entity Tagger (TeBaC-NET)

I was recently exploring spaCy for some NLP work and found that the default model was not sufficient for tagging entities in the domain I was exploring. The documentation was very helpful in explaining how to train the named entity recognizer's statistical model, but I needed training and evaluation data. While I could …
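
For context, spaCy's NER (v2.x, current at the time) trains on character-offset annotations like the ones below, and producing such tuples is exactly the job of a custom tagging tool. The sentence, offsets, and JOB_ROLE label here are made-up examples; only the format follows the spaCy 2.x documentation.

    import random
    import spacy

    # (text, annotations) pairs with character-offset entity spans;
    # the sentence and the JOB_ROLE label are hypothetical examples
    TRAIN_DATA = [
        ("Looking for a Hadoop administrator in Singapore",
         {"entities": [(14, 34, "JOB_ROLE")]}),
    ]

    nlp = spacy.blank("en")
    ner = nlp.create_pipe("ner")  # spaCy 2.x API
    nlp.add_pipe(ner)
    ner.add_label("JOB_ROLE")

    optimizer = nlp.begin_training()
    for _ in range(10):
        random.shuffle(TRAIN_DATA)
        for text, annotations in TRAIN_DATA:
            nlp.update([text], [annotations], sgd=optimizer)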

Apache Hadoop Data Capacity Planning

Planning capacity for a Hadoop cluster is not easy, as there are many factors to consider across the software, hardware, and data aspects. Planning a cluster with too little data capacity and/or processing power may limit the operations and analytics that can be run on it, while planning for every possible scenario may be …
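
On the data side, a common back-of-envelope estimate multiplies the expected logical data size by the HDFS replication factor, then adds head room for intermediate/scratch data. The figures below are illustrative assumptions, not numbers from the post:

    # Illustrative assumptions only -- substitute your own estimates
    initial_data_tb = 10.0       # current dataset size
    monthly_growth_tb = 1.0      # expected ingest per month
    months = 12                  # planning horizon
    replication_factor = 3       # HDFS default
    temp_fraction = 0.25         # head room for MapReduce/Spark intermediates

    logical_tb = initial_data_tb + monthly_growth_tb * months
    raw_tb = logical_tb * replication_factor / (1 - temp_fraction)
    print(f"Estimated raw HDFS capacity: {raw_tb:.1f} TB")  # 88.0 TB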

Spark Analytics on B-Cycle Open Dataset (Austin, TX) using PySpark and Jupyter Notebook

As the final course of my Specialist Diploma in Big Data Management, I used a pseudo-distributed Apache Spark cluster with PySpark to analyze the B-Cycle trip and kiosk data set from Austin, Texas. The data set was downloaded from Austin's open data portal (https://data.austintexas.gov/) and consists of data from late 2014 to mid 2017. …
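
A minimal PySpark sketch of the kind of analysis involved is below; the CSV file name and the "Checkout Kiosk" column header are assumptions, so check the actual headers on the open data portal before running.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("bcycle-demo").getOrCreate()

    # File name and "Checkout Kiosk" column are assumed placeholders
    trips = spark.read.csv("austin_bcycle_trips.csv",
                           header=True, inferSchema=True)

    # Example aggregation: ten busiest checkout kiosks by trip count
    (trips.groupBy("Checkout Kiosk")
          .count()
          .orderBy(F.desc("count"))
          .show(10))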

Architecting an Environment to Share and Collaborate with Jupyter Notebooks

Jupyter Notebooks are very useful for developing (and sharing) data analytics. Their flexibility also allows them to be used for much more than that: teaching materials, self-guided learning of programming languages, and the (re)publication of academic papers and ebooks are other interesting uses. A while back, I helped architect and implement a collaborative environment that allowed …
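
The post details the actual architecture; one common building block for such a shared environment is JupyterHub, so purely as a sketch (the usernames are placeholders, and the post's setup may well differ):

    # jupyterhub_config.py -- minimal multi-user hub sketch
    c = get_config()  # injected by JupyterHub at startup

    c.JupyterHub.ip = "0.0.0.0"      # listen on all interfaces
    c.JupyterHub.port = 8000
    c.Spawner.default_url = "/lab"   # land users in JupyterLab
    c.Authenticator.admin_users = {"admin"}  # placeholder username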

Remote Access to a Public Jupyter Notebook Server

Jupyter Notebook is a great way to share documents with collaborators (e.g., team members) on analytic use cases. Of course, it is not limited to that; as the project describes itself: "The Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and explanatory text." …
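
The mechanics follow the official documentation for the classic notebook server: generate a config file, set a password, and bind to a public interface, ideally behind TLS. The certificate paths below are placeholders.

    # One-time setup (shell):
    #   jupyter notebook --generate-config
    #   jupyter notebook password
    #
    # Then in ~/.jupyter/jupyter_notebook_config.py:
    c = get_config()

    c.NotebookApp.ip = "0.0.0.0"      # listen on all interfaces
    c.NotebookApp.port = 8888
    c.NotebookApp.open_browser = False
    # Strongly recommended: serve over TLS (placeholder paths)
    c.NotebookApp.certfile = "/path/to/mycert.pem"
    c.NotebookApp.keyfile = "/path/to/mykey.key"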