Abnormal Data Analysis in Process Industries Using Deep Learning Method
Wen Song, Wei Weng, Shigeru Fujimura
Graduate School of Information, Production and Systems, Waseda University, Kitakyushu, Japan
[email protected]

Abstract — This research concerns abnormal data analysis in factories of process industries. In a processing factory, there are many sensors which transmit their values to each other. Workers in a process factory need to be alerted when the values of some sensors are abnormal. In our research, the main target is to detect potential abnormal values from the different sensors of process industries. Since the values contain noise and delays, we first use cross-correlation and the wavelet transformation to remove them. Then, we use a deep learning method to train a model on the processed data and use the model to detect potential abnormal values. Finally, we evaluate the trained model on data extracted from a real process factory. The result shows that our model performs well.

Keywords — Data analysis, wavelet denoising, deep learning
I. INTRODUCTION

Nowadays, the traditional factory of process industries is no longer able to fulfill current manufacturing demands. Along with the development of the process industry, there is a great demand for information technology; therefore, data analysis plays an important role in the process industry. In the process industry, the sensor is a common large-scale data acquisition device which measures information and transmits the measurement as electrical signals. To detect potential abnormal values from different sensors, we proceed in the following three steps.

Firstly, we discover the relations among different sensors. Practically, big data applications based on sensor data have become popular: Michael A. Hayes [1] and Miriam A. Capretz have proposed a contextual anomaly detection technique for streaming sensor networks; they pay particular attention to the relationship between contexts within a single sensor's behavior. U. Surya Kameswari [2] and I. Ramesh Babu extract the steady state of each sensor and use a clustering method to detect the stability of a sensor. Most current research points out that the relations among different sensors can be very helpful for a factory to extract potential patterns and thereby decrease cost. Therefore, we spend more effort on the relations among the sensors.

Secondly, one of the drawbacks of current research is data accuracy: most sensor signals contain several kinds of noise, which reduces the accuracy of the data. Therefore, we consider that the values of the sensors should be denoised before processing. However, most research, such as Liu F. et al. [3], has detailed information about the data, which makes it easy to define a threshold and the SNR (signal-to-noise ratio) of the signal. Since the data from process industries is not of this common kind, we try to find a way to denoise the signal without this information. We tried several signal denoising methods such as SVD, FFT, and the wavelet transform, and we finally chose wavelet denoising because it gave the best result.

The third step is our main procedure, abnormal value detection, which can only be processed after the first two steps. Abnormal data refers to individual values within a data set that deviate significantly from the other values of the set. The emergence of abnormal data usually indicates an abnormal state of machines in process industries. If we can find or even predict this kind of abnormal data, we may inform the workers before a tragedy happens. For abnormal data analysis, Meng J. L. et al. [4] used the K-means cluster algorithm to classify the data into several kinds. Shilton et al. [5] propose an SVM approach to multiclass classification and anomaly detection in wireless sensor networks; their work requires that the data be pre-classified into categories, and data points which cannot be classified are then considered abnormal. Xie et al. [6] have proposed an online anomaly detection algorithm which uses a histogram-based approach to detect anomalies within hierarchical wireless sensor networks. Meanwhile, deep learning has become popular and begun to play an important role. Recently, deep learning has transformed speech recognition, image classification, text understanding, and many other areas. It gradually formed an approach in which raw data is fed into an end-to-end model that directly outputs the result. This not only makes things simpler but also improves the self-adjustment ability across the cooperating layers, which can greatly improve the accuracy of the task.

To conclude, we mainly researched the relations between different sensors and the denoising method which makes the values more accurate, and we then use a deep learning method to mine the signals. This paper considers the methodology of mining useful information from the signals of process industries.
9781538609484/17/$31.00 ©2017 IEEE
II. METHODOLOGY

In this section, we will introduce the details of how we cope with the data. It mainly includes five subsections. The first subsection shows the flow chart of the whole procedure; the second shows how we recognize and remove the delay of the signals; the third shows the way we remove the noise by using
Proceedings of the 2017 IEEE IEEM
wavelet transformation. The fourth subsection shows how we process the data into the training set and testing set. Finally, the last subsection shows the procedure of creating and training the deep learning model. Since the relationships among the different signals are very complicated, we first try to extract the relationship between each pair of signals.

A. The flow chart of the procedure
calculated the cross-correlations of all the pairs, and then find the related ones to proceed with the following steps.
Figure 2. A schematic diagram of cross-correlation
Since the delay can be very large, when we use the cross-correlation some values do not take part in the calculation (which usually occurs when the signal has a wide fluctuation), and that causes a deviation. Therefore, we tried several ways to normalize the data, and we finally chose to subtract the mean of the whole column from each value (decided according to the shape of most of the data). In this way, the deviation decreases to the lowest level. With this method, most of the delays among related sensors can be removed. Only a delay longer than the considered range will be missed (however, the range of delay is controlled manually, and a delay larger than the considered range is almost impossible).

C. Remove noise from the data

In this part, we use the wavelet transformation to remove the noise from the data. In a processing factory, the signals from the sensors retain some noise. Since all these signals are recorded from machines and sensors, noise fills the signals and can affect the procedure of data analysis. For abnormal value detection, the vibration amplitude could be completely different because of the noise. Therefore, we need a way to remove the noise from the signal. The data is extracted from different sensors in a chemical process factory, and most of the noise is white noise [7]. Moreover, we have no idea about the SNR (signal-to-noise ratio), so we can only choose a robust but highly accurate method to remove these incidental noises. Since the time-frequency domain has good localization characteristics, wavelet analysis has been widely used in many fields. The white noise is suppressed through the characteristics of the wavelet decomposition coefficients, in which the weakly correlated signal contained in the sequence is also collated, providing more suitable data for processing.
The following wavelet analysis method basically eliminates the potential white noise in the sequence and extracts practical information from the plant signal concisely and effectively. Wavelet denoising can be generalized into three main methods: (1) Mallat's [8] modulus maxima denoising based on wavelet coefficients; (2) Liu's [3] beamforming correlation wavelet denoising; (3) the wavelet threshold denoising method proposed by Donoho and Johnstone [9]. Since the wavelet threshold method is a development of the other two methods and seemed to
Figure 1. The flow chart of procedures
The whole procedure is divided into the seven parts shown above. The first step is explained in Section II.B; the second step in Section II.C; the third and fourth steps in Section II.D; the fifth and sixth steps in Section II.E; and the final step in the conclusion.

B. Recognize delay from signals

The signals of different sensors contain some delay. For different machines, signals are transmitted through different media, which may take different times to transfer. For each pair of signals, a potential order may exist between them, which means a delay may occur according to the different transmission times. This delay may cause different kinds of problems, such as incorrect predictions and overfitting, which are not acceptable in the following steps. Since we do not know the range of the delay time of each signal, we need a direct way to calculate the delay and remove it. In this part, we used the cross-correlation function to remove the delay. In signal processing, the cross-correlation shows how similar two signals are as a function of the displacement of one signal relative to the other. However, cross-correlation is only meaningful for related signals, so we calculate the cross-correlation and then check whether the pair is related; if it is not related, the cross-correlation is not considered. The time delay between the two signals is given by the argument at which the maximum of the cross-correlation is obtained (or the minimum, if the two signals are negatively correlated):

τ_delay = argmax_t ((f ⋆ g)(t))    (1)

Here f and g are the two signals and ⋆ is the cross-correlation operator. However, the cross-correlation function is mainly used for two correlated signals.
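As a rough illustration, the delay estimate of eq. (1) can be sketched in Python with NumPy. The function name `estimate_delay` and the trimming shown in the comment are our own illustrative choices, not the paper's implementation:

```python
import numpy as np

def estimate_delay(f, g):
    """Estimate how many samples g lags behind f via cross-correlation.

    Both signals are mean-subtracted first, the normalization discussed
    in the text, so that wide fluctuations do not bias the correlation.
    A positive return value means g is a delayed copy of f; for a
    negatively correlated pair, argmin would be used instead of argmax.
    """
    f = np.asarray(f, dtype=float) - np.mean(f)
    g = np.asarray(g, dtype=float) - np.mean(g)
    c = np.correlate(g, f, mode="full")        # (f * g)(t) over all lags
    return int(np.argmax(c)) - (len(f) - 1)    # argmax of eq. (1)

# Once the delay d is known, the pair is aligned by trimming:
# f_aligned, g_aligned = f[:-d], g[d:]   (for d > 0)
```

With NumPy's `mode="full"` output, the zero-lag term sits at index `len(f) - 1`, which is why that offset is subtracted to recover the signed delay.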
According to our data, most pairs of different signals do not have a strong relationship with each other. So here, we first
signal. Assume the two endpoints of the piece of data are points A(x_A, y_A) and B(x_B, y_B). Here x_B − x_A is a constant which has been decided in advance according to the actual situation of the data. If the absolute value of the slope of the segment AB (which is (y_B − y_A)/(x_B − x_A)) is larger than a
be more effective in our research, we finally chose the soft-threshold wavelet denoising method. The basic idea of wavelet threshold denoising proposed by Donoho is the following. After processing the signal through the wavelet transform (using the Mallat algorithm), the wavelet coefficients generated by the signal contain the important information of the signal, which means the signal's wavelet coefficients after decomposition are large, while the wavelet coefficients of the noise are smaller than those of the signal. By selecting a suitable threshold, only coefficients larger than the threshold are considered to be generated by the signal and are preserved; coefficients smaller than the threshold are considered to be generated by noise and are set to zero to achieve denoising. The basic steps are: (1) Decomposition: decompose the signal with an N-layer wavelet decomposition; (2) Threshold processing: after decomposing, quantify the coefficients of each layer with a suitable threshold value; (3) Reconstruction: reconstruct the signal with the processed coefficients. The basic problems of wavelet threshold denoising involve three aspects: the selection of the wavelet basis, the selection of the threshold, and the choice of the threshold function. For the wavelet basis, we chose the Daubechies wavelet [10] db8 with a vanishing moment order of 4. For the threshold, we simply apply soft thresholding using the universal threshold [11], defined as

λ = σ √(2 log N)    (2)

where σ is a robust estimator of the standard deviation of the finest-level detail coefficients and N is the length of one signal. Here, we use the standardized median absolute deviation (MAD) for this estimator:

σ = MAD(β) / 0.6745    (3)
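A minimal sketch of this soft-threshold scheme, with the universal threshold of eq. (2) and the MAD estimate of eq. (3), is shown below. To keep the sketch dependency-free, a one-level Haar transform stands in for the paper's db8 decomposition; a real implementation would use a db8 multilevel decomposition (e.g. via PyWavelets):

```python
import numpy as np

def mad_sigma(d):
    """Eqs. (3)-(4): noise level from the standardized MAD of the detail
    coefficients (0.6745 makes the MAD consistent for Gaussian noise)."""
    return np.median(np.abs(d - np.median(d))) / 0.6745

def soft_threshold(d, lam):
    """Shrink coefficients toward zero; values below lam are set to zero."""
    return np.sign(d) * np.maximum(np.abs(d) - lam, 0.0)

def haar_denoise(x):
    """One-level wavelet soft-threshold denoising with the universal
    threshold of eq. (2). Haar substitutes for db8 in this sketch."""
    x = np.asarray(x, dtype=float)
    n = len(x) - (len(x) % 2)                    # even-length working part
    a = (x[0:n:2] + x[1:n:2]) / np.sqrt(2.0)     # approximation coefficients
    d = (x[0:n:2] - x[1:n:2]) / np.sqrt(2.0)     # detail coefficients
    lam = mad_sigma(d) * np.sqrt(2.0 * np.log(len(x)))
    d = soft_threshold(d, lam)
    out = np.empty(n)
    out[0::2] = (a + d) / np.sqrt(2.0)           # inverse Haar transform
    out[1::2] = (a - d) / np.sqrt(2.0)
    return out
```

Because white noise spreads evenly across the detail coefficients while a smooth plant signal concentrates in the approximation, thresholding the details removes noise energy while barely distorting the signal.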
threshold k, it will be recorded as a sample. This threshold k depends on how many samples (of width x_B − x_A) you want to extract from one signal. We use the following equation to calculate the threshold:

k = ((y_max − y_min) / (x_max − x_min)) · Ω    (5)
Here Ω takes values from 0.01 to 0.20, depending on the actual situation of the data: the greater the amplitude is, the smaller Ω should be. Assume that every time we extract m points from one signal, detecting from the first point to the last point; we finally obtain several samples from one signal. For each sample, since we get m points, we take their values (x_1, ..., x_m) as the eigenvalues of the sample. Up to now, we have real samples with eigenvalues, but we still need some abnormal values to diversify the data set. For each sample, we simply choose one of the m eigenvalues and modify it to another value. This value should be neither too big nor too small, so we define it as

x̃ = x + τ (max − min)    (6)

where τ describes the degree of deviation of the abnormal value, and max and min are the biggest and smallest of the m values. However, according to the data we got, the differences between the eigenvalues are too small for training, so we use the Z-score to normalize each sample:

z = (X − μ) / σ    (7)

where X represents each of the features in the sample, μ is the mean of all the features, and σ is the standard deviation of the whole sample. The Z-score makes the sum of the features equal 0 and the variance equal 1; this amplifies the differences, which is more suitable for the gradient descent algorithm in the following steps. For the labels of the samples, since the data can only be in two situations, normal and abnormal, a discrete encoding is suitable, so we use one-hot encoding. Assume we label a sample as (l_1, l_2). For data extracted from the real data, (l_1, l_2) is set to (0, 1), and for abnormal data it is set to (1, 0). Finally, for one sample, we have m values for features and 2 values for labels.
For all the samples, we pick one-fifth of the whole set as the testing set and the rest as the training set.
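The whole preparation pipeline above — slope-gated extraction, abnormal-value injection, Z-score normalization, one-hot labels, and the one-fifth test split — can be sketched as follows. The exact form of the injected deviation follows our reading of eq. (6), and all function names are illustrative:

```python
import numpy as np

def extract_samples(signal, m=30, k=0.5):
    """Slide an m-point window over the signal; keep windows whose
    endpoint slope |y_B - y_A| / (x_B - x_A) exceeds the threshold k."""
    out = []
    for s in range(len(signal) - m + 1):
        seg = np.asarray(signal[s:s + m], dtype=float)
        if abs(seg[-1] - seg[0]) / (m - 1) > k:
            out.append(seg)
    return out

def make_abnormal(sample, tau, rng):
    """Inject one deviated eigenvalue controlled by tau (eq. (6))."""
    s = sample.copy()
    i = rng.integers(len(s))
    s[i] += tau * (s.max() - s.min())
    return s

def zscore(sample):
    """Eq. (7): zero mean and unit standard deviation per sample."""
    return (sample - sample.mean()) / sample.std()

def build_dataset(signal, m=30, k=0.5, tau=1.0, seed=0):
    """Return (train, test) splits of (features, one-hot labels):
    normal -> (0, 1), abnormal -> (1, 0); one fifth goes to testing."""
    rng = np.random.default_rng(seed)
    X, Y = [], []
    for seg in extract_samples(signal, m, k):
        X.append(zscore(seg)); Y.append([0.0, 1.0])                     # normal
        X.append(zscore(make_abnormal(seg, tau, rng))); Y.append([1.0, 0.0])
    X, Y = np.array(X), np.array(Y)
    idx = rng.permutation(len(X))
    split = len(X) // 5
    test, train = idx[:split], idx[split:]
    return (X[train], Y[train]), (X[test], Y[test])
```

Injecting one abnormal eigenvalue per real sample keeps the two classes balanced, which simplifies training the classifier downstream.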
For a univariate data set (X_1, ..., X_n), the MAD is defined as the median of the absolute deviations from the data's median:

MAD = median(|X_i − median(X)|)    (4)

D. Process the data for training and testing

Now we have several signals without delay or noise. However, it is still difficult to extract information from the data for training and testing. Here we present a method to extract the features in the data for labeled learning.
Figure 3. An example of extracting data
As the figure above shows, each time we extract a piece of fixed-width data from the signal, and we repeat this extraction from the beginning to the end of the whole
E. Create and train the model

The model is built from an autoencoder and logistic regression. The feature extraction part is an autoencoder with m neurons in the middle layer, trained by unsupervised learning. The classification part has three layers: an ordinary input layer with m inputs, one hidden layer with m neurons, and an output layer with 2 outputs. The autoencoder extracts the features from the data, filters out useless information, and compresses the whole information into m neurons. For the classifier, we use an m×m full connection, to which we add the hyperbolic tangent activation function written below:
Here we use TensorFlow [12] (a popular deep learning framework) to train the model, since this framework can automatically use the backpropagation algorithm to efficiently determine how the variables affect the cost value to be minimized. In our training procedure, each time we randomly choose a mini-batch to take part in the training and leave it out of the remaining training pool; only after the whole training set has been trained do the already-trained batches go back into the pool.

III. RESULTS

A. Data Description

The data source is real data recorded from 373 sensors over one year in a Japanese chemical process factory system named "PLANET MEISTER". These data are extracted from the different machine signals in the data source. The signals are separated into three kinds: manipulated value, set value, and process value. Finally, we extracted the signals whose time interval is one minute, which means the signals are displayed as several process-data signals. We reloaded all the data and chose "Time" as the index for all the signals. An example is shown below.

B. Result of recognizing delay

We only considered delays of less than a day, which means the delay t is larger than −1440 and less than 1440. If the delay t is positive and the two signals are related, signal 2 is affected by signal 1; if it is negative, signal 1 is affected by signal 2.
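The sampling scheme described above — draw mini-batches at random without replacement, and refill the pool only once every sample has been used — is standard epoch-wise shuffling. A framework-agnostic sketch (the paper itself trains with TensorFlow, and the function name is our own):

```python
import numpy as np

def minibatches(X, Y, batch_size, rng):
    """Yield (x_batch, y_batch) pairs covering the whole training set
    exactly once per epoch, in random order without replacement."""
    idx = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        take = idx[start:start + batch_size]
        yield X[take], Y[take]

# One training epoch would then look like:
# for xb, yb in minibatches(X_train, Y_train, 32, rng):
#     ...one gradient-descent step on (xb, yb)...
```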
f(x) = (e^x − e^{−x}) / (e^x + e^{−x})    (8)

where e is the base of the natural logarithm. This is a nonlinear function with values in (−1, 1). The reasons why we add this connection and activation are the following.

It helps discard some location information. Within one sample, the location where the abnormal value occurs does not matter; any position in the sample can hold an abnormal value.
It helps discard the purely linear mapping. Since we want to deepen the network, only a nonlinear function can effectively increase the depth of the network, and tanh() helps us deepen it.
It helps train the model. Since we will use the gradient descent algorithm to train the model, adding a normalized (bounded) activation function is more suitable for training.
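Putting the pieces together, the forward pass of the classification part — the tanh hidden layer of eq. (8), the softmax evidence of eq. (9), and the cross-entropy cost of eq. (10) — can be sketched with randomly initialized weights. The autoencoder pre-training of the first layer is omitted here, and the initialization scale is our own choice:

```python
import numpy as np

m = 30                                   # points (eigenvalues) per sample
rng = np.random.default_rng(0)
W1, b1 = 0.1 * rng.standard_normal((m, m)), np.zeros(m)   # m*m full connection
W2, b2 = 0.1 * rng.standard_normal((2, m)), np.zeros(2)   # 2-way output layer

def tanh(x):
    """Eq. (8): nonlinear activation with values in (-1, 1)."""
    return (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))

def softmax(z):
    e = np.exp(z - z.max())              # shift logits for numerical stability
    return e / e.sum()

def forward(x):
    h = tanh(W1 @ x + b1)                # hidden layer with m neurons
    return softmax(W2 @ h + b2)          # evidence Y of normal vs. abnormal

def cross_entropy(y_true, y_pred):
    """The cost minimized during training: -sum(y' * log(y))."""
    return -np.sum(y_true * np.log(y_pred))
```

In practice the gradients of this loss with respect to W1, b1, W2, b2 would be computed automatically by TensorFlow's backpropagation, as described in the training procedure.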
Between the second and third layers, we use the softmax function as the activation to classify the data. Since the input samples are accompanied by some unrelated interference, we also need to add a bias. Therefore, the evidence Y of whether an input sample is normal or abnormal is:

Y = softmax(Σ_j W_{i,j} x_j + b_i)    (9)

where x is the hidden layer with m neurons, j indexes the hidden neurons, i is the label of whether it is normal or abnormal, W is the weight, and b_i is the bias of label i. For training the model, we first need to define an indicator to evaluate whether the model is good. In deep learning, we usually define an indicator describing how bad a model is, called the loss or cost, which represents how far our model is from the desired outcome; we then try to minimize this indicator. A very common cost function is cross-entropy, which originates in information-theoretic compression coding but later became an important tool in fields from game theory to machine learning. It is defined as follows:
H_{y'}(y) = −Σ_i y'_i log(y_i)    (10)

where y is the predicted probability distribution of our sample and y' is the true distribution (the one-hot vector of labels).

Figure 4. An example of removing delay (before removing)

Figure 5. An example of removing delay (after removing)
The values on the left are the detailed per-minute values of the signals, and the image on the right is a plot of the data without the delay. After removing the delay between the two signals, we can see that the two signals rise and fall almost simultaneously, which means the delay has been removed completely.

IV. CONCLUSION

Through this procedure of data analysis in a process factory, we have built a model to detect abnormal values in the data, and it works well according to the evaluation result. It also shows that data analysis methods can help the development of the process industry. However, some problems remain. The model does not perform well when the abnormal value does not deviate significantly from the real value; we will try to modify our model to make it more accurate. Under all conditions, the standard deviation across models is a little high, and we think a factory may need a more stable model. According to our current research, recurrent neural networks (RNNs) seem to be more sensitive to continuity, so we may try to combine them with our current model in the future. To conclude, we will keep studying these issues and try to build a better model.
C. Result of removing noise
Figure 6. An example of removing noise
According to the pictures above, the images on the bottom show the signals before denoising and the images on the top show the signals after denoising. We can see that most of the white noise has been removed from the signals.

D. Result of the deep learning model

We trained 20 models, trained each of them 2000 times, and then used the testing set to evaluate the accuracy of each model. In our experiment, we take Ω at 0.05 and x_B − x_A at 29 (which means 30 points in one sample). So every time we extract 30 points from one signal, detecting from the first point to the last, and we finally obtain several samples from each signal. According to the different values of τ mentioned in (6), the accuracy is distributed as below: TABLE I
REFERENCES
[1] U. Surya Kameswari and I. Ramesh Babu, "Sensor data analysis and anomaly detection using predictive analytics for process industries," 2015 IEEE Workshop on Computational Intelligence: Theories, Applications and Future Directions (WCI), 2015, pp. 1-8, DOI: 10.1109/WCI.2015.7495528.
[2] M. A. Hayes and M. A. Capretz, "Contextual anomaly detection framework for big sensor data," Journal of Big Data, vol. 2, no. 1, 2015, pp. 1-22.
[3] F. Liu, W. Wang, et al., "MEMS gyro's output signal de-noising based on wavelet analysis," International Conference on Mechatronics and Automation, IEEE, 2007, pp. 1288-1293.
[4] J. Meng, H. Shang, and L. Bian, "The application on intrusion detection based on K-means cluster algorithm," International Forum on Information Technology and Applications, IEEE, 2009, pp. 150-152.
[5] A. Shilton, S. Rajasegarar, and M. Palaniswami, "Combined multiclass classification and anomaly detection for large-scale wireless sensor networks," Intelligent Sensors, Sensor Networks and Information Processing (ISSNIP), 2013 IEEE Eighth International Conference on, 2013, pp. 491-496.
[6] M. Xie, J. Hu, and B. Tian, "Histogram-based online anomaly detection in hierarchical wireless sensor networks," Trust, Security and Privacy in Computing and Communications (TrustCom), 2012 IEEE 11th International Conference on, 2012, pp. 751-759.
[7] Li Anying, Chen Ke, Song He, and Lei Yu, "The industry data analysis processing model design: the regional health disease trend analysis model," Cloud Computing and Big Data (CCBD), 2014 International Conference on, 2014, pp. 130-133, DOI: 10.1109/CCBD.2014.11.
[8] E.-C. Chang, S. Mallat, and C. Yap, "Wavelet foveation," Applied & Computational Harmonic Analysis, vol. 9, no. 3, 2000, pp. 312-335.
[9] I. M. Johnstone and B. W. Silverman, "Wavelet threshold estimators for data with correlated noise," Journal of the Royal Statistical Society, vol. 59, no. 2, 1997, pp. 319-351.
[10] J. A. Antonino-Daviu, et al., "Validation of a new method for the diagnosis of rotor bar failures via wavelet transform in industrial induction machines," IEEE Transactions on Industry Applications, vol. 42, no. 4, 2006, pp. 990-996.
[11] D. L. Donoho and J. M. Johnstone, "Ideal spatial adaptation by wavelet shrinkage," Biometrika, vol. 81, no. 3, 1994, pp. 425-455.
[12] TensorFlow, https://www.tensorflow.org/resources/
MODEL ACCURACY

The degree of deviation    Average    Std.      Lowest
0 < τ ≤ 0.5                0.71       0.053     0.64
0.5 < τ ≤ 1                0.915      0.0192    0.89
1 < τ ≤ 2                  0.964      0.0102    0.94
As we can see from the table, the numbers show the evaluation accuracy. "Average" is the average accuracy over the 20 models for each condition, "Std." is the standard deviation of the accuracy over the 20 models, and "Lowest" is the lowest value among the 20 models under each condition. For τ at a high level, where the abnormal value has a relatively significant deviation from the real value, the model has a high detection accuracy and points out almost all the abnormal values. However, for τ at a low level, the model does not detect the abnormal values with high accuracy.