Support Vector Machine Application for Classification of Tempe Fermentation Maturity with Information Gain Selection Feature

Abstract

Tempe is one of the ingredients of traditional Indonesian cuisine. Making tempe requires a soybean fermentation process that is generally still carried out in an open environment, so the maturation time is slow and erratic. A detector is therefore needed during tempe fermentation to determine when the tempe has reached optimal maturity. Such detection can be performed with image processing, using various feature extractions in a classification process. This research uses texture features extracted with the GLCM method and color features from eight color spaces: RGB, HSV, LAB, CMYK, YUV, HSI, HCL, and LCH. Because this many features cause a high computational load, the Information Gain approach is used to select features. Classification is then carried out with the Support Vector Machine (SVM) method using linear, polynomial, Gaussian, and sigmoid kernel variants. Tempe images taken during fermentation are divided into unripe, ripe, and rotten classes, with a total of 410 images as the dataset. In testing SVM with Information Gain (SVM+IG), the sigmoid kernel produced the fastest computation time, 2.18 seconds at a 30:70 split ratio, while the slowest was 2.50 seconds with the linear kernel at an 80:20 split ratio, which also produced the highest accuracy of 96.74%. In testing SVM without Information Gain, the Gaussian kernel produced the fastest time, 2.28 seconds, and the slowest was 3.00 seconds with the polynomial kernel at a 40:60 split ratio. On average, SVM+IG is thus faster than SVM without IG, which obtains slower computation times. Based on this, the study aims to apply the SVM method to classify tempe fermentation images with feature selection using Information Gain.


Introduction
The tempe fermentation process using traditional methods in an open environment has several obstacles, such as a slow and erratic maturation time. This is caused by less-than-ideal environmental conditions, especially fluctuating air temperature and uncontrolled humidity. To overcome these problems, a dedicated incubator with temperature and humidity control has been developed for tempe fermentation, an important step toward improving the quality of tempe production [1], [2].
In previous studies, room temperature was controlled within the range of 30-35°C with 60-75% humidity. Even so, artisans still had to monitor the maturity of the tempe directly by estimating the time required [1]. Tempe is still made using Rhizopus sp., which grows well at an ideal temperature between 28-35°C and humidity below 65-70%. A maturity detector aims to ensure that the tempe has reached the desired level of maturity before the fermentation process is stopped. With a good detector, artisans can know when the tempe is ripe, reducing uncertainty in the fermentation process and increasing the overall efficiency of tempe production.
Detecting the ripeness of fermenting tempe can build on machine learning-based image processing, as has been studied for fruit ripeness using the Support Vector Machine (SVM). The SVM method has been tested for detecting the ripeness of tomatoes [3], melons [4], and citrus fruits [5]. The tests in these studies [3], [4], [5] showed quite good results, with an average accuracy of more than 70%. This study therefore proposes the SVM approach for classifying tempe fermentation maturity. However, image processing still requires feature extraction, such as texture features using GLCM and color features from various color spaces such as RGB, HSV, LAB, CMYK, YUV, HSI, HCL, and LCH. Extracting more features for classification can potentially increase accuracy, but at the cost of a high computational load.
One approach to reducing this large number of features is the Information Gain method. Studies [6] and [7] used the Information Gain approach for feature selection together with SVM for sentiment analysis classification, reaching accuracies above 80%. The advantage of Information Gain is that unimportant features are excluded from further computation, producing concise data with only the best features [8]. Previous research [9] showed that Information Gain feature selection can reduce features by as much as 89% on Indonesian-language abstract documents. In this study, tempe images taken during the fermentation process are classified into unripe, ripe, or rotten classes. The extracted image features include GLCM texture features and eight color feature sets (RGB, HSV, LAB, CMYK, YUV, HSI, HCL, LCH), from which the best are then selected using the Information Gain approach.

Methods
The second process is preprocessing, in which each tempe image is cropped and converted to grayscale. The purpose of cropping is to increase the clarity and focus of the image and to reduce noise that can affect color and texture segmentation. The third process, feature extraction, segments the image and extracts its features. At this stage, GLCM (Gray-Level Co-occurrence Matrix) texture features and eight color feature sets (RGB, HSV, LAB, CMYK, YUV, HSI, HCL, LCH) are used to obtain numerical values that represent the image to be processed. These features characterize the color and texture of the tempe image and are used to determine its maturity class.
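As a sketch, the cropping and grayscale-conversion steps might look like the following; the crop coordinates and the standard luminance weights are illustrative assumptions, since the paper does not specify them:

```python
import numpy as np

def preprocess(image, crop_box):
    """Crop a region of interest, then convert RGB to grayscale.

    `crop_box` = (top, bottom, left, right) is a hypothetical
    parameter; the paper does not give the actual crop coordinates.
    """
    top, bottom, left, right = crop_box
    roi = image[top:bottom, left:right]              # cropping step
    # Standard luminance weights for RGB -> grayscale conversion
    gray = roi @ np.array([0.299, 0.587, 0.114])
    return gray.astype(np.uint8)

# Tiny synthetic "tempe image" (8x8 RGB) just to exercise the function
img = np.full((8, 8, 3), 200, dtype=np.uint8)
gray = preprocess(img, (2, 6, 2, 6))
print(gray.shape)  # (4, 4)
```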

Stages of Research
This research classifies tempe fermentation maturity using SVM with Information Gain feature selection; the stages of the process are presented in Figure 1. Figure 1 shows the data processing flow of the system that classifies tempe fermentation maturity from tempe images. The first process is data collection, in which tempe images are captured and categorized as unripe, ripe, or rotten. The data collected consisted of 137 unripe tempe images, 137 ripe tempe images, and 136 rotten tempe images.
The Gray-Level Co-occurrence Matrix (GLCM) provides rich information about textures and patterns in an image. Features that can be extracted from the GLCM include energy, contrast, homogeneity, and entropy; these can be used for classification, segmentation, and further image analysis [10]. In the RGB color model, each pixel is represented by three color components: red (R), green (G), and blue (B). These components determine the intensity of each color channel and allow a wide range of colors to be formed; each component ranges from 0 to 255 [11]. HSV (Hue, Saturation, Value) represents image color in terms of hue, saturation, and value: hue identifies the color type, saturation expresses the purity of the color (how much white is mixed into it), and value expresses how much light the color receives [12]. In the LAB color feature, the image is represented in the L*a*b* color space: L* denotes lightness, while a* and b* encode the color directions, with +a* the red axis, -a* the green axis, +b* the yellow axis, and -b* the blue axis. The L*a*b* color space enables precise color communication between organizations and their supply chains, ensuring that products are made to the intended color specification [13]. The CMYK (Cyan, Magenta, Yellow, Key/black) model is a subtractive color model (it uses light reflection to produce colors through a mixing process) and is used in color printing [14], [15]. The YUV color feature is a color space commonly used in image compression: YUV subsamples the chroma components to reduce the amount of image data transmitted, without undue effect on human vision.
In YUV, the chroma components (U and V) are separated from the luminance component (Y), which supports image segmentation [16]. HSI (Hue, Saturation, Intensity) is an image representation based on the color framework closest to human visual perception; it covers both color and grayscale information. Hue is the angle between the reference color and the saturation vector; the reference color is usually red, but other choices are possible, and H values between 0 and 360 degrees are measured from the red axis. Saturation expresses how much a pure color is diluted by white light [17]. HSL (Hue, Saturation, and Lightness) is a color system used in computer graphics and image processing, with three main components: hue, saturation, and lightness [18]. In the LCH (Lightness, Chroma, Hue) representation, the L* axis represents lightness, the c* axis represents chroma (the saturation or purity of the color), and the h* axis represents hue, the color type. The hue values form a circle of colors when the color sphere is cut horizontally through the middle; this circular axis is called Hue or h° [19].
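To make the GLCM texture features concrete, the following is a minimal hand-rolled sketch that builds a horizontal co-occurrence matrix and derives the four features named above; real pipelines would typically use a library routine such as scikit-image's graycomatrix, and the number of gray levels here is an illustrative choice:

```python
import numpy as np

def glcm_features(gray, levels=8):
    """Compute a horizontal-offset GLCM and four Haralick-style
    features: contrast, energy, homogeneity, entropy."""
    # Quantize intensities (0-255) into `levels` bins
    q = (gray.astype(float) / 256 * levels).astype(int)
    glcm = np.zeros((levels, levels))
    # Count pixel pairs at offset (0, 1): each pixel with its right neighbor
    for i, j in zip(q[:, :-1].ravel(), q[:, 1:].ravel()):
        glcm[i, j] += 1
    p = glcm / glcm.sum()                     # normalize to probabilities
    idx = np.arange(levels)
    di = idx[:, None] - idx[None, :]          # gray-level differences
    contrast = (p * di**2).sum()
    energy = (p**2).sum()
    homogeneity = (p / (1 + np.abs(di))).sum()
    entropy = -(p[p > 0] * np.log2(p[p > 0])).sum()
    return contrast, energy, homogeneity, entropy

# A perfectly uniform image has zero contrast/entropy and maximal energy
feats = glcm_features(np.zeros((16, 16), dtype=np.uint8))
print(feats)
```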
The fourth process is feature selection, in which the best color and texture features are selected and the feature set is reduced using the Information Gain method. Information Gain focuses computation on the important feature data and is also used to rank features by importance; after the Information Gain process, only the best features are retained without reducing effectiveness [20]. Information Gain is computed with Equations (1) and (2):

Entropy(S) = -Σ p_i log2(p_i) (1)

Gain(S, A) = Entropy(S) - Σ_{v ∈ Values(A)} (|S_v| / |S|) · Entropy(S_v) (2)

where p_i is the proportion of S belonging to class i, Values(A) is the set of values taken by feature A, S is the training set, and S_v is the subset of S for which feature A has value v.
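Equations (1) and (2) can be implemented directly; the sketch below assumes a discrete-valued feature column (continuous features would first need binning):

```python
import numpy as np

def entropy(labels):
    # Shannon entropy of a class-label array, Equation (1)
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def information_gain(feature, labels):
    """Gain(S, A) = Entropy(S) - sum_v |S_v|/|S| * Entropy(S_v),
    Equation (2), for one discrete-valued feature column."""
    total = entropy(labels)
    n = len(labels)
    remainder = 0.0
    for v in np.unique(feature):
        mask = feature == v
        remainder += mask.sum() / n * entropy(labels[mask])
    return total - remainder

# A feature that perfectly predicts the class recovers the full entropy
print(information_gain(np.array([0, 0, 1, 1]), np.array([0, 0, 1, 1])))  # 1.0
```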
The selected features are used as input for the classification stage. The fifth process is machine learning (classification), in which the tempe maturity class is determined using the Support Vector Machine (SVM) classification method. SVM is a popular and effective machine learning method for classification and regression; it works by building a model that separates two classes through the best hyperplane, the one that maximizes the margin between the classes [21]. The kernels have the following equations.

Linear SVM: the linear kernel is commonly used for classification and regression problems, predicting the target from a linear combination of the input features, Equation (3):

K(x_i, x_j) = x_i · x_j (3)

Polynomial SVM: the polynomial kernel transforms a non-linear feature space into a higher-dimensional feature space, in which the non-linear problem can be solved linearly, Equation (4):

K(x_i, x_j) = (x_i · x_j + c)^d (4)

Gaussian SVM: the Gaussian kernel is the kernel most often used for classification problems on data sets that are not linearly separable, because it generally gives very good predictions, Equation (5):

K(x_i, x_j) = exp(-||x_i - x_j||^2 / (2σ^2)) (5)

The last process, the evaluation stage, determines the performance of the test results. At this stage, the classification system is thoroughly evaluated to measure its quality. One of the metrics commonly used in evaluating classification systems is accuracy, which describes the degree to which a classification system produces correct results. In the context of classification, accuracy is calculated by comparing the number of correctly classified cases to the total number of evaluated cases, expressed as a percentage [22], [23].
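The kernels of Equations (3)-(5), plus the sigmoid kernel used later in the experiments, can be written out as plain functions; the parameter values (degree d, constant c, σ, and the sigmoid slope) are illustrative defaults, not values reported by the paper:

```python
import numpy as np

def linear(x, y):
    return x @ y                                        # Eq. (3)

def polynomial(x, y, d=2, c=1.0):
    return (x @ y + c) ** d                             # Eq. (4)

def gaussian(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma**2))  # Eq. (5)

def sigmoid(x, y, alpha=0.01, c=0.0):
    # Sigmoid (hyperbolic tangent) kernel, bounded in (-1, 1)
    return np.tanh(alpha * (x @ y) + c)

x_i = np.array([1.0, 2.0])
x_j = np.array([3.0, 4.0])
print(linear(x_i, x_j), polynomial(x_i, x_j))  # 11.0 144.0
```

In practice these correspond to the kernel='linear', 'poly', 'rbf', and 'sigmoid' options of a library classifier such as sklearn.svm.SVC.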

Results and Discussion
In this study, the SVM method was used with four different kernels: polynomial, linear, Gaussian, and sigmoid [24]. Testing was performed by dividing the data into training data and test data using a split ratio, i.e., the proportion of training data to test data used in the classification process. Split ratios from 10:90 to 90:10 were used in each classification run. The purpose of varying the split ratio is to compare classification performance from lowest to highest accuracy, so that the configuration with the best accuracy can be chosen. The classification of tempe can be seen in Table 1.
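The split-ratio sweep can be sketched as follows; the random seed and the synthetic feature matrix are illustrative (in the actual study, X would hold the selected GLCM and color features for the 410 images):

```python
import numpy as np

def split(X, y, train_frac, seed=0):
    """Shuffle and split data by ratio (train_frac=0.8 is an 80:20 split).
    A hand-rolled stand-in for library helpers such as
    sklearn.model_selection.train_test_split."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(y))
    cut = int(len(y) * train_frac)
    return X[order[:cut]], X[order[cut:]], y[order[:cut]], y[order[cut:]]

# 410 samples mirroring the dataset: 137 unripe, 137 ripe, 136 rotten
X = np.arange(410 * 2, dtype=float).reshape(410, 2)
y = np.repeat([0, 1, 2], [137, 137, 136])
for frac in (0.1, 0.3, 0.8, 0.9):   # split ratios 10:90, 30:70, 80:20, 90:10
    X_tr, X_te, _, _ = split(X, y, frac)
    print(f"{int(frac * 100)}:{100 - int(frac * 100)} -> "
          f"{len(X_tr)} train / {len(X_te)} test")
```

At the 80:20 ratio this yields 328 training and 82 test samples, matching the counts reported in the results.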

Presenting the Results
The data used in classifying the tempe images consisted of 137 unripe, 137 ripe, and 136 rotten tempe images, for a total of 410 images. Table 2 shows the results of training SVM with Information Gain feature selection. The confusion matrix with the highest value, at the 90:10 split ratio, can be seen in Figure 2, and the corresponding accuracy values are listed in Table 3.
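The confusion matrices reported here, and the accuracy derived from them, can be computed as in the following sketch, with class indices 0, 1, and 2 standing for unripe, ripe, and rotten, and the example predictions being purely illustrative:

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes=3):
    # Rows = true class, columns = predicted class
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

def accuracy(cm):
    # Correctly classified cases / total evaluated cases, as a percentage
    return np.trace(cm) / cm.sum() * 100

y_true = np.array([0, 0, 1, 1, 2, 2])   # unripe, ripe, rotten
y_pred = np.array([0, 0, 1, 2, 2, 2])   # one ripe sample misread as rotten
cm = confusion_matrix(y_true, y_pred)
print(round(accuracy(cm), 2))  # 83.33
```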

Table 3. Accuracy of SVM with Information Gain testing per split ratio.

Based on Table 3, the confusion matrix with the highest score, at the 80:20 split ratio, can be seen in Figure 3.
In addition to evaluating performance by accuracy, the computational time was also measured. This measurement was carried out for each training and testing run, both for SVM without Information Gain and with Information Gain; Table 4, Table 5, Table 6, and Table 7 present the time measurements in turn. Tables 4 and 5 compare training time with and without the Information Gain feature selection: training with Information Gain requires less time than training without it. Tables 6 and 7 show the test results with and without Information Gain; this comparison shows that testing without Information Gain takes longer than testing with it. Thus, the Information Gain feature selection makes the testing process more time-efficient.

Figure 5 and Figure 6 are bar charts that make Tables 4 through 7 easier to read: the higher the bar, the slower the processing time, and the lower the bar, the faster. The tests used training/test compositions from a 10:90 to a 90:10 split ratio over the existing dataset of 410 images (137 unripe, 137 ripe, and 136 rotten). In these tests, the Gaussian SVM gave the best accuracy: 96.79% at an 80:20 split ratio, with 328 training images and 82 test images.
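The timing measurements can be reproduced with a simple wall-clock wrapper around each training or testing call; the `timed` helper and its example workload below are illustrative, not part of the paper's code:

```python
import time

def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed wall-clock seconds),
    as done for the training/testing times in Tables 4-7."""
    t0 = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - t0

# Example: time an arbitrary computation standing in for model.fit/predict
result, seconds = timed(sum, range(1_000_000))
print(result)  # 499999500000
```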

Conclusion
From the research carried out, it can be concluded that detecting tempe fermentation maturity with the SVM method works well. On the designed dataset, the SVM method with the sigmoid kernel misclassifies relatively few samples, with an accuracy of 90.24% at a 90:10 data split ratio; with the linear kernel, the test accuracy likewise reaches about 90.24% at a 90:10 split ratio. The polynomial kernel reaches 96.74% at an 80:20 data split ratio, and the Gaussian kernel also reaches 96.74% at an 80:20 data split ratio. In testing the SVM method with Information Gain, the fastest time was 2.18 seconds with the sigmoid kernel at a 30:70 split ratio, and the slowest was 2.50 seconds with the linear kernel at an 80:20 split ratio. The SVM method without Information Gain obtained its fastest time of 2.28 seconds with the Gaussian kernel at a 30:70 split ratio, and its slowest of 3.00 seconds with the polynomial kernel at a 40:60 split ratio. For this classification task, it can be concluded that SVM with Information Gain is faster than SVM without the Information Gain feature selection, while maintaining high accuracy.