CAPACITY PLANNING IN CASSANDRA – Project topics materials

Get Complete Project Material File(s) Now! »

Capacity planning approaches

This research area is in it nascent stage and there is very less related work published in the literature. However, from the Cassandra Documentation[5], [6] the capacity planning approach is identified.
According to the Cassandra Documentation[9], capacity planning is based on few factors such as Row overhead, Column overhead, Indexes etc. Row and column overhead is the extra space required to store data in the tables. These factors are again based on many parameters such as number of rows in the tables, length of primary key for each row of data, number of columns in the data etc. This capacity planning approach is widely followed for predicting disk storage. This process starts with determining the column overhead which is calculated as 15 or 23 bytes plus the column value size and the column name size depending whether the column is non-expiring or not. Here by expiring rows columns refer to the column which have a time to live value i.e. the data in the column will be expired after a certain time .Then we find the row overhead i.e. 23 bytes for each row stored in the Cassandra. Then we calculate the sizes of the primary indexes of the table. All these are added to the usable disk size i.e. the size of data based on the datatypes of the data. Now finally we multiply the replication factor. For additional details regarding the factors refer to Section 4.8.
Another capacity planning model was proposed by Anil. This approach is followed in Accenture [10]. The capacity planning is carried out by calculating the row and column overheads for all the rows and columns in the Cassandra. These overheads are added to the usable disk space i.e. the number of rows multiplied by the size of data in each row. Indexes are calculated and added to this. In addition to the Datastax approach, here the free space for compaction to trigger is also calculated based on the type of compaction strategy selected. Compaction is triggered to delete stale and obsolete data. For this process to trigger free space is required.
According to Edward Capriolo’s approach on Capacity planning in Cassandra [11], smaller key sizes and smaller column name sizes save disk space. This indicates that key size and column name size directly affect the size of the disk. He also mentions about the free space required for compaction to trigger must be of the size of the largest SStable present in the Cassandra storage system. He also suggests to plan for the future disk space needs before the disks runs out. This would ensure that the data is intact and not lost.
The Cassandra capacity planning area of research is in its nascence stage and vey less research has been performed. A generic capacity planning model is the need of the hour and many researches are been carried to predict the exact disk storage when Cassandra is deployed. Via this research a capacity model is made which predicts the disk storage required for Cassandra deployment with maximum accuracy and covering all the factors affecting disk storage exhaustively.

RESEARCH METHOD

This part of the document deals with research method chosen to answer the research questions. The motivation for choosing the research method is mentioned and the rejection criteria for the other research methods are also mentioned.
Research Method Selection:-
Once the research questions were formed, the next objective was to select a research method to answer those questions. In this scenario Case Study is chosen as the research method. Case study is one of the commonly used research method for research. It is used when a deep empirical investigation has to be performed on a real time context or subject[12]. In this scenario, a model is to be built which would predict the disk storage capacity of the Cassandra storage system. Investigation is required on different parameters and factors on which the Cassandra disk storage system is affected. This model shall be used to predict the storage capacity for Ericsson’s Voucher generating system based on the identified factors. The factors are characterized broadly into two categories they are, Cassandra-specific factors and Product-specific factors. Cassandra specific factors are those which can be tuned based on Cassandra’s requirements and product specific are those which are based on the product or product requirements.
Why Case study? Alternatives rejection criteria explained:-
Apart from case study there are many other research methods used in computer science. They are experiments, action research, surveys etc.
Experiments [13], [14] are rejected because the factors which affect the Cassandra are not fully known. So an experiment cannot be conducted on factors which are not fully known and determined.
Surveys[15] are not chosen as their results would only provide Cassandra users perspective and opinion on capacity planning but will not help in building a model to calculate the disk storage.
Action Research [16], [17] is used when an effect of an action is to be known. Here, the aim is to identify different factors affecting Cassandra storage system. So, Action research is not chosen.
Case study[17], [18] is chosen because the factors are not fully known and they differ from case to case. In this particular case the effect of factors on Cassandra is to be determined. Also case study is the best chosen method when a subject is to be deeply investigated. This investigation on the subject would answer the research questions.

Literature Review:-

Literature review is conducted based on the guidelines provided in Creswell’s book “Research Design” [19]. Using the literature review the related work and gaps in the research area are identified. The literature review is carried out in a 7-step procedure.
Step1: Identifying the keywords in the research area. The keywords are to be used in searching for related articles, books, journals etc. from the present literature. The key words in this research area are: “Cassandra”, “Capacity planning”.
Step 2: Once the keywords are identified, they are used for searching relevant journals, articles, books etc. in different databases like INSPEC, Google Scholar, Engineering village.
Step 3: Few relevant articles and journals are selected from the search results. These formed the base for the literature review. More keywords are deduced from these articles and the search strings are enhanced such as “sizing”, “data storage”.
Step 4: Now using the enhanced keywords the database were searched. This time the selection of articles from the search results were based on abstracts and results present in them. Articles which were closely related to the research were selected.
Step 5: Now that the articles are being selected, simultaneously the literature map is created i.e. what is to be included in the literature review. This is created based on the research papers and articles obtained from the step 4. Using the literature map a clear picture of the research can be obtained. The literature map is shown below as
Step 6: Now the literature review is written which contains the related work i.e. research which is going on or done in the present research area. It is seen that very less research is done in this research area.
Step 7: From the literature review research gaps are identified and proposal is submitted to fill them. Here a model for dimensioning Cassandra is the proposed solution.
The literature review was also conducted on the Datastax Cassandra Documentation[8] to get acquainted with the Capacity planning model mentioned in it which is widely followed for capacity planning in Cassandra. In section 2.1 detailed description of the capacity planning model mentioned in Datastax documentation is mentioned. The model which was built was based on the Capacity planning procedure by Datastax and also few additional parameters extracted from the interviews. All the parameters were analyzed and a new capacity planning model was proposed based on the Datastax capacity planning procedure and the parameters extracted from the interviews. For the additional parameters interviews (Data collection method) were conducted as a part of the Case study research method. The coming section gives the detailed description of the case study and its design.

Case Study Design

The following processes are involved in the case study design.

Case selection and Unit of analysis

A case is selected to elevate the factors which affect the disk storage in Cassandra data storage system. This case selection process is the inception of the case study and is the base for the research. This case will act as an input to determine the factors affecting the case. The unit of analysis is the Voucher system project at Ericsson R&D. The voucher system uses Cassandra data storage system as its storage system and the capacity planning has to be done before it is deployed. This is done by first analyzing the voucher system and its requirements. This would be the starting point before the actual capacity planning is done. Interviews were conducted on different people who work in Cassandra Team at Ericsson R&D, India. There is a dedicated Cassandra team which works for the development and maintenance of Cassandra on Voucher system project. Once the interviews are finished, the data is analyzed and the factors affecting the Cassandra disk storage are gathered from the interview results. These factors are used in building a generic model which predicted the disk storage required to deploy Cassandra successfully.

READ Stochastic Approximation and Least-Squares Regression

Case Study Process

The present research process is shown in Figure 4. From the interviews factors affecting the disk storage are extracted. On analysis of the factors, parameters affecting disk stoarge are extracted. These parameters are fed as input to the Capacity planning model. The output from the model is the predicted disk storage.
Case study process or protocol refers to the information regarding the case study data collection and analysis procedures. It contains all the aspects of the Case study approach as a research method[12]. The Figure 5 shows how the case study is actually performed in this research. The sections in this figure are explained in detail in the coming sections.

Data Collection Process

Data required for the case study is gathered from different sources ranging from direct observations to interviews. This process is done in order to achieve data triangulation i.e. gathering data from different sources which would allow analysis in different aspects and perspectives.[20]

Standard Cassandra Documents

The data collection process was started by first analyzing the Datastax Cassandra Documentation[5], [6]. These are the standard documents for all Cassandra related authentic information. Datastax is a company which provides all types of Cassandra based services to its clients. On analysis of these documents some factors are identified and were noted in memos. The document has a section separately dealing with the capacity planning procedures. Also this
document was used to understand the Cassandra functioning, terminology and architecture.

Interviews Design

The next step in the data collection process is the interviews. Interviews are conducted when the practitioners or normal people’s opinion is to be taken. Here interviews are planned to collect information on factors affecting Cassandra data storage system. Interviews are chosen over surveys because the one-to-one conversations with the Cassandra users will be used to gather the information required for the Case study. One-to-one interviews was chosen as it would give individual opinion on the Cassandra capacity planning. Before the interviews are conducted, a questionnaire is to be prepared so that the participants can be interviewed on only those questions. A single interview was divided into 4 parts. In the first part mutual introduction between the interviewer and the interviewee was held. Following this the interviewee is explained about the research and the reason for the interview. In the next part of the interview the interviewee is asked about his experience with Cassandra and capacity planning strategies in Cassandra. This is followed by the last section where the interviewee suggests how the capacity planning can be improved while deploying Cassandra. All the interviews are transcribed thoroughly. During the interviews memos are also noted when the interviewee is answering. These memos are used when the interview results are analyzed.
All the interview results are maintained confidentially. All the questions in the interviews were semi-structured. A mail was sent to all the interviewees before the interview to book an appointment. This helped in managing the time for interviews.

Formulating the Questionnaire

To conduct the interviews a questionnaire was required. This questionnaire should consist of all the questions to which the interviewees would answer and the information can be analyzed for the case study. The questionnaire contained semi-structure questions which were also open ended i.e. the interviewee can give an open exploratory answer which is not restricted to any options[17]. The questionnaire formulation was made in two phases, they are:
1. Literature review and standard Cassandra documents: From the available literature and Cassandra documents the initial questionnaire was made. The keywords from the research questions were identified and literature survey was conducted. This was base for the initial questionnaire. Also the Cassandra documentation[5], [6] by the Datastax was also used in formulating the questionnaire. The questions are framed in a manner where the respondent can give the answer exploratory without any constraints. This was the initial questionnaire which was made and used.
2. Once the initial questionnaire was ready, interviews were conducted and were transcribed completely. If there were any responses which were unclear or ambiguous then a new question is added to the initial questionnaire to make it more efficient. This process is continued until a saturation point is reached. This made the interviews more efficient.
There were no previously conducted interviews in this field in the literature. So they were not considered while making the questionnaire.

Transcription Process

While the interviews are going on, the responses given by the interviewees are noted in a transcript which is used for analysis. All the responses were written down. Apart from these whenever the interviewee made special remarks like “always”, “never” etc. they are saved in a memo. These are then asked to other interviewee whether it is “always” or “never”. The interviews are not recorded as it is up to the interviewee’s wish[19].

Data Analysis process

Once the interviews were in progress simultaneously the data analysis process was also started. For the analysis, Grounded theory was used. Grounded theory refers to a systematic methodology using which a new theory is derived from the analysis of the data. Grounded theory is said to be a superior approach for any data analysis of unstructured qualitative data[21], [22].The data analysis using grounded theory is done in 5 phases as shown in the

Open-Coding (Extracting concepts/codes from the data)

After the pre-coding phase, Coding process is started in the data analysis process. In this phase from the data collected codes/concepts are identified. This phase is also known as the open coding. A code or a concept is a phrase which refers to a part of the text in the transcripts. The codes are the basic building blocks for the data analysis phase[22]. For example if an interviewee talks about “Row Overhead” as a potential factor affecting the disk storage then we mark “Row overhead” as a code. Now that we have a code named “Row overhead”, whenever an interviewee refers to this in his responses that part of the text is mapped to this code and thus making the analysis phase simpler. Codes are manually identified from the transcripts and are marked as underlined or highlighted in the Microsoft Word.
Initially codes are identified as the most occurring words in the transcripts. Therefore we identify the frequency of each word in the transcript and mark the first 30 most frequent words as 30 separate codes. Apart from these other words are also analyzed and codes are made. This is to ensure that during the analysis no text parts of the transcripts are left without analyzing. The identification of codes process is conducted until a saturation point is reached i.e. no more codes can be extracted from the present data in the transcripts. A total of 52 codes are extracted from the transcripts in total.

Grouping the Codes into Categories (Axial-Coding)

Once all the codes are identified, the next step is group similar type of codes into groups. This phase is also known as axial coding. This would help in analyzing codes in groups and thus codes can be distinguished on the basis of groups. The grouping is done by comparing each code, if any similarity they are grouped together. In this research scenario, all the 52 codes are divided into two groups namely Cassandra-specific factors and Product specific factors. The main reason behind grouping the factors is to identify respectively the Cassandra-specific and product specific factors by comparing the codes and also to enhance the data analysis process by breaking down the analysis process[22]. The grouping of the codes is shown in the Figure 7.

Table of contents :

1 INTRODUCTION
1.1 WHAT IS CASSANDRA?
1.2 CASSANDRA ARCHITECTURE
1.3 WHY CASSANDRA?
1.4 CAPACITY PLANNING IN CASSANDRA
1.5 WHY CAPACITY PLANNING?
1.6 RESEARCH GAP
1.7 RESEARCH QUESTIONS
2 RELATED WORK
2.1 CAPACITY PLANNING APPROACHES
3 RESEARCH METHOD
3.1 CASE STUDY DESIGN
3.1.1 Case selection and Unit of analysis
3.1.2 Case Study Process
3.1.2.1 Data Collection Process
3.1.2.1.1 Standard Cassandra Documents
3.1.2.1.2 Interviews Design
3.1.2.1.3 Formulating the Questionnaire
3.1.2.1.4 Transcription Process
3.1.2.2 Data Analysis process
3.1.2.2.1 Pre-coding of the data
3.1.2.2.2 Open-Coding (Extracting concepts/codes from the data)
3.1.2.2.3 Grouping the Codes into Categories (Axial-Coding)
3.1.2.2.4 Checking codes importance by frequency
3.1.2.2.5 Validation of Analyzed Data
3.2 PLANNING THE CASE STUDY
3.2.1 Interviews planning
3.2.2 Interviewee selection
4 RESULTS
4.1 VOUCHER SYSTEM AND CASSANDRA CAPACITY PLANNING
4.1.1 Voucher System at Ericsson
4.1.2 Cassandra Data storage system for Voucher System
4.1.3 Capacity planning for Voucher System’s Cassandra
4.2 SYNOPSIS OF THE INTERVIEWS CONDUCTED
4.2.1 Transcription process
4.3 PRE-CODING
4.4 OPEN-CODING (EXTRACTING CODES)
4.5 GROUPING OF CODES INTO CATEGORIES
4.6 CODE FREQUENCY
4.7 VALIDATION OF THE CODES
4.8 FACTORS AFFECTING DISK STORAGE IN CASSANDRA
4.8.1 Cassandra specific Factors
4.8.2 Product/Project specific Factors
4.8.3 Parameters for the extracted factors
4.9 DESIGNING THE MODEL
5 ANALYSIS
5.1 ANALYSIS OF THE MODEL
5.2 VALIDATING CAPACITY PLANNING MODEL USING ERICSSON’S VOUCHER SYSTEM
5.2.1 Proposed model capacity planning calculations
6 DISCUSSIONS AND VALIDITY THREATS
6.1 DISCUSSIONS
6.2 VALIDITY THREATS
6.2.1 Internal threats
6.2.2 External threats
6.2.3 Construct validity
6.2.4 Reliability
7 CONCLUSIONS
7.1 CONCLUSIONS
7.1.1 Relating to the research questions
7.2 FUTURE WORK
REFERENCES
APPENDIX