While tidying up 10 years' worth of digital documents from a previous volunteer role, I found my excuse to get familiar with MongoDB: identifying trends and answering questions I had often pondered.
In my previous post, I presented a data dashboard that allowed the viewer to slice and dice 2017/18 Infocomm Job Postings from the Careers@Gov portal. Here, I will explain the data pipeline I used to perform the ETL operations needed to prepare the data dashboard. Architecture: Based on the architecture above, you can tell … Continue reading Data Pipeline Architecture for Job Posting Analytics
Foreword: All opinions expressed in this post are mine alone, and are not representative of any organization or group, regardless of affiliation status. The work/effort here was done purely using my own resources and time (which also explains the untimeliness of this post). This is a hobby project; you would be ill-advised to make decisions/judgements based … Continue reading Data Dashboard: 2017/18 Infocomm Job Postings on Careers@Gov (Part 2)
I am curious about the skills in demand for infocomm jobs within the Civil Service. In my personal capacity, I set forth to analyze and summarize my observations of the trends within infocomm job postings on Careers@Gov.
In my previous post on TeBaC-NET, I talked about why I created it. In this post, I talk about why I created it the way it is. Design Considerations: #1 Cross Platform. One of the most important considerations is that it should be platform agnostic. A simple tool that can run on any … Continue reading TeBaC-NET Design Considerations
I was recently exploring spaCy for some NLP work, and found that the default model was not sufficient for tagging entities in the domain I was exploring. The documentation was very helpful in explaining how I could train the statistical model of the named entity recognizer, but I needed training and evaluation data. While I could … Continue reading Text Based Custom Named Entity Tagger (TeBaC-NET)
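For context, spaCy's NER trainer (in the v2-era API) consumes annotated examples as character-offset tuples, which is roughly the output a tagging tool would need to produce. A minimal sketch of that format is below; the text and the `FRAMEWORK` label are illustrative placeholders, not from the post:

```python
# Training data in the shape spaCy's v2-era NER trainer expects:
#   (text, {"entities": [(start_char, end_char, label)]})
# Offsets are character positions; end is exclusive.
TRAIN_DATA = [
    ("Apache Spark runs on Hadoop YARN.",
     {"entities": [(0, 12, "FRAMEWORK"), (21, 32, "FRAMEWORK")]}),
]

# Sanity-check that each offset span matches the intended surface text
for text, annotations in TRAIN_DATA:
    for start, end, label in annotations["entities"]:
        print(label, "->", text[start:end])
```

Getting these offsets exactly right by hand is tedious and error-prone, which is precisely the gap a dedicated tagging tool fills.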
Planning capacity for a Hadoop cluster is not easy, as there are many factors to consider - from the software, hardware, and data aspects. Planning a cluster with too little data capacity and/or processing power may limit the operations/analytics that can be run on it, while planning for every possible scenario may be … Continue reading Apache Hadoop Data Capacity Planning
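To make the data-aspect trade-off concrete, here is a back-of-envelope sizing sketch. The replication factor of 3 is HDFS's default; the 25% headroom for intermediate output and non-HDFS use is an assumed illustrative figure, not a prescription from the post:

```python
def raw_hdfs_capacity_tb(usable_data_tb, replication=3, overhead_fraction=0.25):
    """Rough raw disk needed: replicate the data, then leave headroom
    for intermediate job output and non-HDFS use of the disks."""
    replicated = usable_data_tb * replication
    return replicated / (1 - overhead_fraction)

# e.g. 100 TB of data at replication 3 with 25% headroom:
print(f"{raw_hdfs_capacity_tb(100):.0f} TB raw")  # 300 / 0.75 = 400 TB
```

Even this simplistic estimate shows how quickly raw disk requirements outgrow the nominal data size, which is why over- and under-provisioning are both easy mistakes to make.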
As the final course of my Specialist Diploma in Big Data Management, I used a pseudo-distributed Apache Spark cluster with PySpark to analyze the B-Cycle Trip and Kiosk data set from Austin, Texas. The data set was downloaded from Austin's open data portal (https://data.austintexas.gov/), and comprised data from late 2014 to mid 2017. … Continue reading Spark Analytics on B-Cycle Open Dataset (Austin, TX) using PySpark and Jupyter Notebook
Jupyter Notebooks are very useful for developing (and sharing) data analytics. In addition, their flexibility allows them to be used for much more than that - teaching materials, self-learning programming languages, and (re)publication of academic papers and ebooks are other interesting uses. A while back, I helped architect and implement a collaborative environment that allowed … Continue reading Architecting an Environment to Share and Collaborate with Jupyter Notebooks
Jupyter Notebook is a great way to share documents with collaborators (e.g. team members) working on analytic use cases. Of course, it is not limited to that: The Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and explanatory text. … Continue reading Remote Access to a Public Jupyter Notebook Server
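For reference, exposing a classic Jupyter Notebook server for remote access typically comes down to a few lines in its config file. This is a minimal sketch, not the full hardening (TLS, firewalls) such a setup needs; the port is an assumed default, and the `c` object is provided by Jupyter's config loader, so this fragment is not standalone Python:

```python
# ~/.jupyter/jupyter_notebook_config.py
# Generated and secured beforehand with:
#   jupyter notebook --generate-config
#   jupyter notebook password   # stores a hashed login password
c.NotebookApp.ip = '0.0.0.0'        # listen on all interfaces, not just localhost
c.NotebookApp.port = 8888           # assumed port; pick any open one
c.NotebookApp.open_browser = False  # headless server: don't launch a local browser
```

With that in place, collaborators reach the server at `http://<server-address>:8888` and authenticate with the configured password.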