Planning capacity for a Hadoop cluster is not easy, as there are many factors to consider across software, hardware, and data. Planning a cluster with too little data capacity and/or processing power may limit the operations/analytics that can be run on it, while planning for every possible scenario may be unrealistic due to budget and/or space constraints.
In this post, I will focus on planning the data capacity of a Hadoop cluster. I draw these steps from a past project whose objective was to build an in-house Hadoop cluster for experimentation.
Please note that:
- The numbers/approach here are geared towards the planning stage, where not everything is known
- There are many ways to perform data capacity planning, and the outline below is just one of them
- Depending on your setup, some steps may be optional (e.g. if you run separate Hadoop clusters for ELT and analytics, or a separate file system for logging)
Step 1: Initial Data Set Size
The first thing to know is the size and growth of the data set to be processed. The easiest case is when both are fixed (e.g. a historical data set with a known size). If these are not known, it will be good to find similar data sets and/or run actual data collection processes for a reasonable period and extrapolate the results.
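For the unknown-growth case, the extrapolation step can be sketched as a simple projection from a short sample run. A minimal sketch (the function name, sample sizes, and campaign length below are all hypothetical illustrations, not real measurements):

```python
# Rough linear extrapolation of data set size from a short sample run.
# All numbers are made-up placeholders for illustration.

def extrapolate_size_gb(daily_samples_gb, horizon_days):
    """Project total data volume from observed daily sizes.

    daily_samples_gb: sizes (in GB) collected on each sample day
    horizon_days: how long the campaign/collection will run
    """
    avg_per_day = sum(daily_samples_gb) / len(daily_samples_gb)
    return avg_per_day * horizon_days

# e.g. 7 days of sampled campaign traffic, projected over 180 days
sample = [1.2, 0.9, 1.1, 1.4, 1.0, 0.8, 1.3]
projected = extrapolate_size_gb(sample, 180)
```

A linear projection like this is only a starting point; campaign data often spikes around launch, so it is worth padding the estimate.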
Example Data Sets:
- NYC TLC Trip Record Data of 2017 for all yellow and green street cabs (known size and no growth)
- Twitter data on a sales campaign to be launched in 6 months time (unknown size and growth)
If tackling the second example, it may be good to find similar data sets (e.g. a previous sales campaign of the current and/or a rival company) and extrapolate from there.
Step 2: HDFS Replication Factor
HDFS replicates data across nodes to: 1) ensure that data is not lost if certain nodes become unavailable, and 2) improve data locality for parallel data processing.
In typical production clusters, the replication factor is left at the default of 3. Increasing or decreasing the replication factor has implications that are outside the scope of this post.
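The replication factor is controlled by the `dfs.replication` property in `hdfs-site.xml`; leaving it at the default of 3 looks like:

```xml
<!-- hdfs-site.xml: cluster-wide default replication factor -->
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>
```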
Step 3: Transformation of Raw Data
In layman's terms, a data set with high data quality has records that are structured, sanitized, and in the right format for the analytical needs. Depending on the data set's quality, several rounds of data transformation may be needed.
For every round of transformation, a new data set of similar size to the original will be produced – except for steps that filter the data set on a field. It will be good to make a list of transformation steps and see if any of them can be combined, so as to reduce the number of intermediate data sets generated. There is no right or wrong way to do it, but a good approach is to ask whether an intermediate data set has a chance of being useful in the future.
For example, given the following fictitious data record:
- 22 April 2017, @luppeng : Hello World #firstpost , 3 comments, 90 likes, 62 reblogs
The following data transformation steps might be necessary:
- Sanitize data fields to remove leading and trailing spaces
- Convert semi-colon to comma to create an author column
- Combine comments, likes, and reblogs to one common field
- Transform fields to expected formats
However, after talking to the data scientists/analysts who will be working on the data, they express uncertainty about whether combining the comments, likes, and reblogs fields into a common field (e.g. "reactions") is a good idea. Therefore, we can consider combining steps 1 and 2, while leaving 3 and 4 as separate transformations. In the event that the decision is not to combine these fields, the initial transformations (steps 1 and 2) do not need to be executed again.
Based on the example above, the required capacity for the transformation step is now an additional 3 times the size of the initial data set. (This is the age-old trade-off between processing time and storage space.)
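The transformation overhead can be expressed as a simple multiple of the raw size – a back-of-the-envelope helper I am sketching here, not tied to any real tooling:

```python
def transformation_overhead_gb(raw_gb, non_filtering_steps):
    """Each non-filtering transformation round writes an intermediate
    data set roughly the size of the original."""
    return raw_gb * non_filtering_steps

# 100 GB raw, 3 retained rounds (steps 1+2 combined, step 3, step 4)
extra = transformation_overhead_gb(100, 3)
```

Filtering steps will produce smaller intermediates, so treat this as an upper bound.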
Step 4: Intermediate Data Sets from Analytics of Transformed Data
With the right data format, the next step is to look at the type of analytics that will be executed.
Some analytics generate intermediate data sets to be used for training, or as a requirement for the next round of analysis. Knowing how many rounds of analysis, and if any intermediate data sets will be produced from these steps, will help better calculate the required capacity.
If the size of the transformed data is not known at this stage, use the raw data set size as a good gauge.
Step 5: Intermediate Storage for Map-Reduce Phases
Additional space is required to store the intermediate results of the Map-Reduce phases. After the map step, its output is written to disk before being shuffled to the reducers.
This is especially so if Hadoop speculative execution is enabled (duplicate attempts of slow tasks are launched so that slower nodes do not hold back the whole job), since duplicate attempts produce duplicate intermediate output.
A good starting factor would be 10% of the processed data set size.
Step 6: Storage of Processed Data
After the data has been analyzed, it is important to know where the processed data (if any) will be stored. Will it be kept in HDFS to be used as part of a feedback loop to fine-tune the model, or in a traditional RDBMS (e.g. MySQL, PostgreSQL) on a different system for data visualization or processing by third-party BI tools?
Knowing the size of the processed data set, and where it will be stored, is another important factor in determining the data capacity of the Hadoop cluster.
Step 7: Logging Overheads
Apache Hadoop, other third party Apache projects/tools, and even the Operating System produce logs and/or audit trails that require additional disk space. This is critical if the organization is required to keep these logs as part of audit and/or compliance.
A good estimate would be around 10% of the raw data set size. Another strategy would be to reduce this estimate but rotate the logs, archiving older ones onto a separate file system.
For Consideration: RAID Configuration Overheads
This should only affect the number of nodes (and the hardware costs) of the cluster. If the cluster's disks will use a RAID configuration other than JBOD, the redundancy overhead will add to the required raw disk capacity.
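The RAID overhead can be folded in by dividing usable capacity by the scheme's storage efficiency – a rough helper using the standard textbook efficiency figures (e.g. 0.5 for RAID 1/10):

```python
def raw_disk_needed_gb(usable_gb, raid_efficiency):
    """Raw disk capacity required to expose `usable_gb` of space
    under a RAID layout with the given storage efficiency
    (e.g. 1.0 for JBOD/RAID 0, 0.5 for RAID 1/10)."""
    return usable_gb / raid_efficiency

# Exposing 1,295 GB of usable space on RAID 10 (50% efficiency)
needed = raw_disk_needed_gb(1295, 0.5)
```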
Worked Example
Putting all the steps together for a fictitious project:
- Step 1: Raw data set size – 100 GB
- Step 2: HDFS replication factor – Standard 3 times replication
- Step 3: Data Transformation – 2 separate steps that do not filter the raw data set
- Step 4: Intermediate Data Sets from Analytics – 75% of transformed data set size
- Step 5: Intermediate Map-Reduce Storage – 10% of transformed data set size
- Step 6: Storage of Processed Data – 50% of transformed/raw data set size to be stored in HDFS
- Step 7: Logging Overheads – 10% of raw data set size
- Required data capacity = [Raw Data Set Size + (Data Transformation Steps * Raw Data Set Size) + Intermediate Data Sets from Analytics + Processed Data Set Size] * Replication Factor + Intermediate Map-Reduce Storage + Logging Overheads
Putting it into numbers:
- Required data capacity = [100 GB + (2 * 100 GB) + 0.75 * 100 GB + 0.5 * 100 GB] * 3 + (0.1 * 100 GB) + (0.1 * 100 GB) = 1,295 GB
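The formula above can be wrapped in a small function to try out different assumptions (a sketch mirroring the worked example; the variable names and defaults are mine):

```python
def required_capacity_gb(raw_gb, replication=3, transform_steps=2,
                         analytics_frac=0.75, processed_frac=0.5,
                         mr_frac=0.1, logging_frac=0.1):
    """Data capacity estimate following Steps 1-7.

    Replicated in HDFS: the raw data, the transformation
    intermediates, the analytics intermediates, and the processed
    output. Map-Reduce scratch space and logs sit outside HDFS,
    so they are not multiplied by the replication factor.
    """
    hdfs = raw_gb * (1 + transform_steps + analytics_frac + processed_frac)
    return hdfs * replication + raw_gb * (mr_frac + logging_frac)

capacity = required_capacity_gb(100)  # matches the 1,295 GB example above
```

This makes it easy to see, for example, how sensitive the total is to the replication factor or to an extra transformation round.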
The steps outlined above are intended to be used as guidelines – each project varies in its own unique way. Some setups may have different configurations or existing systems/tools that can reduce the amount of data being stored in HDFS.
A good start is to understand what each step addresses, and compare it with your existing system/future setup to see if it is applicable. All the best! 🙂