5 Literature Review

Addressing the objectives of this thesis requires adequate knowledge in the domains of maintenance engineering, research, knowledge discovery, and statistical modeling. This literature review introduces and summarizes academic work in these domains in order to provide a solid foundation for the analysis. Figure 5.1 gives a brief overview of the topics covered in pursuit of the research objective. The following sections of this chapter synthesize the maintenance strategies discussed in the literature and then provide an overview of the statistical methods used with these strategies.

Figure 5.1: Overview of literature review topics

5.2 Knowledge Discovery Process

A Knowledge Discovery Process (KDP) represents an iterative framework for identifying characteristics or patterns in data and understanding how to apply them with domain knowledge. Swiniarski, Pedrycz, and Kurgan (2007) provides a wealth of information regarding KDPs, including supporting arguments for structuring a KDP as a standardized process model. In summary, the text advocates for a KDP that is ultimately useful to the user, is logical in approach and structure, follows established domain principles, and fosters standardization in data and procedures.

In addition to providing examples of KDPs suited for research or industry, Swiniarski, Pedrycz, and Kurgan (2007) provides an example of a hybrid model that can be applied in a broad range of domains. The six steps of this KDP are:

  1. Understanding of the problem domain. This first step includes familiarization with the problem as well as the domain experts, terminology, standards, and restrictions.
  2. Understanding of the data. The second step encompasses all aspects of data collection based on the domain understanding established in step one.
  3. Preparation of the data. Preparation is arguably the most intensive step of the KDP, as it provides a strong foundation for a successful and thorough analysis. It involves cleaning and formatting the data and correcting for noise or missing values, and may also include methods such as dimensionality reduction, feature selection, or summarization in order to satisfy the input requirements of the problem.
  4. Data mining. The data mining step involves using functions relevant to the problem or domain to extract knowledge or insights from the cleaned and prepared data.
  5. Evaluation of the discoverable knowledge. The fifth step involves a thorough evaluation of the extracted information and an assessment of its value and contribution to the analysis. This step may include additional consultations with domain experts in order to assess the validity or novelty of the information.
  6. Use of the discoverable knowledge. The last step includes a detailed plan of how the extracted information is to be put to use in the current domain.

A similar process is introduced as the data processing pipeline by Aggarwal (2015), as represented in Figure 5.2. Although this pipeline is structured slightly differently from the hybrid KDP, it also places strong emphasis on a systematic approach to solving knowledge-intensive problems.

Figure 5.2: The data processing pipeline from Aggarwal (2015)

In addition to their inclusion in structured KDPs, most of these topics are frequently studied and expanded upon in their own right.

5.2.1 Data Collection

Although data collection is mostly driven or governed by the demands of the problem at hand, there are many relevant aspects that are common in all use cases. As detailed in Swiniarski, Pedrycz, and Kurgan (2007), an individual responsible for data collection must identify the types, techniques, amount, and quality of the data necessary to solve the problem. During this process, domain knowledge is key for understanding the requirements of the problem and identifying attributes of the information necessary for an insightful solution.

5.2.2 Data Preparation

There exists a vast amount of literature regarding data processing and data cleaning, as they are most often required for the analysis of real-world data. Aggarwal (2015) details key topics of data preparation such as data cleaning, data structuring, and the integration of data coming from multiple sources. As many real-world data sources contain errors, inconsistent formatting, or other issues that prevent them from being ready for analysis, data cleaning is often required as an initial step. Additionally, data may need further structuring, including character string manipulation or variable transformations, in order to satisfy the requirements of the analysis. The desired data may also be distributed across multiple data structures, requiring the researcher to integrate, or merge, the sources by a common reference value.

Swiniarski, Pedrycz, and Kurgan (2007) also provide an introduction to several popular methods for imputing missing values when removing incomplete observations would not be a favorable solution. Additionally, Josse, Tierney, and Vialaneix (n.d.) provides an extensive repository detailing the available packages in R for exploring and resolving missing data of myriad forms. The repository provides guidance for identifying the scope of the missing data and choosing an appropriate imputation method based on the level of complexity and the requirements of the problem.
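As a brief illustration of these preparation steps, the following R sketch combines basic cleaning, merging on a common key, and a simple median imputation. The work-order table `events`, the reference table `equipment`, and all column names are hypothetical; more sophisticated imputation approaches, such as those catalogued in the repository above, follow the same general pattern.

```r
library(dplyr)

# Hypothetical raw sources: a work-order log and an equipment master table
events <- tibble(
  equipment_id   = c("P-101", "p-101", "P-102", NA),
  downtime_hours = c(4.5, NA, 2.0, 1.0),
  comment        = c("Seal leak", "SEAL LEAK", "Bearing noise", "Misc")
)
equipment <- tibble(
  equipment_id = c("P-101", "P-102"),
  type         = c("pump", "pump")
)

cleaned <- events %>%
  filter(!is.na(equipment_id)) %>%                        # drop records with no key
  mutate(
    equipment_id   = toupper(equipment_id),               # standardize formatting
    comment        = tolower(comment),
    downtime_hours = if_else(is.na(downtime_hours),       # simple median imputation
                             median(downtime_hours, na.rm = TRUE),
                             downtime_hours)
  ) %>%
  left_join(equipment, by = "equipment_id")               # merge on a common reference value
```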

5.2.3 Text Mining

As mentioned in the KDP, data or text mining is the process of extracting additional information from clean and processed data that may otherwise not be available for analysis. The information may be domain-specific or in some way provide additional explanatory power during analysis.

As text mining has become increasingly popular, there is extensive literature detailing the concepts and methods used for data and text mining. An introduction to text mining is provided by Swiniarski, Pedrycz, and Kurgan (2007), detailing the ability of text mining methods to extract a large number of descriptive features from semi-structured (e.g., data tables) or unstructured (e.g., free-form text) text. The concept of an Information Retrieval System is introduced as a system of methods for extracting and characterizing information about a subject.

Robinson and Silge (2017) provide an intuitive guide to text mining in R using the tidytext package and demonstrate its capabilities for the analysis of data in the tidy format. The text covers tokenization, the process of breaking raw text into individual tokens, or terms, structured in the tidy format so that they are compatible with other tidy structures and functions. In text mining, a data frame of tokens is typically called a corpus and represents the core data source of the intended analysis. In practice, functions that remove stop words, frequent yet uninformative terms, are commonly used to greatly reduce the volume of text to process. Additionally, the text contains examples and demonstrations of text mining tools such as frequency analysis, sentiment analysis, correlation analysis, and n-gram analysis.

Frequency analysis represents a high-level summary of a corpus and refers to parsing the corpus and identifying the frequency with which each token, or term, occurs. Similarly, sentiment analysis refers to a Natural Language Processing (NLP) method in which each token is compared to a lexicon, a large index of colloquial vocabulary, in order to identify the sentiment of the token. In such methods, the sentiment typically represents the scale of emotion, positive or negative, associated with the term, based on common usage. Combining frequency analysis and sentiment analysis provides methods for identifying the underlying sentiment of entire articles or books.
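To make these ideas concrete, the sketch below follows the general tidytext workflow described by Robinson and Silge (2017): tokenization, stop-word removal, term frequencies, and a simple sentiment join. The `comments` data frame and its contents are hypothetical placeholders for free-text maintenance comments.

```r
library(dplyr)
library(tidytext)

# Hypothetical corpus: one free-text maintenance comment per row
comments <- tibble(
  id   = 1:3,
  text = c("pump seal failed and was replaced",
           "minor vibration noted, no action required",
           "bearing failure caused unexpected stoppage")
)

tokens <- comments %>%
  unnest_tokens(word, text) %>%       # tokenize into one term per row (tidy format)
  anti_join(stop_words, by = "word")  # remove common stop words

# Frequency analysis: how often each term occurs in the corpus
tokens %>% count(word, sort = TRUE)

# Sentiment analysis: match tokens against the Bing sentiment lexicon
tokens %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(sentiment)
```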

Furthermore, correlation analysis identifies the correlation coefficient between respective pairs of terms occurring together in the same comment. Although a high correlation for a pair of words does not imply that the words occur frequently, it does imply that the words tend to appear together, or not at all. The correlation coefficient is defined in Equation (5.1) using the components defined in Table 7.3.

\[\begin{equation} \phi = \frac{n_{11} n_{00} - n_{10} n_{01}}{\sqrt{n_{1 \cdot} n_{0 \cdot} n_{\cdot 0} n_{\cdot 1}}} \tag{5.1} \end{equation}\]

Additionally, n-grams refer to consecutive sequences of \(n\) words that frequently occur together in a corpus. This method combines elements of both frequency and correlation analysis, as it identifies recurring structures within the corpus. The result of n-gram analysis is a network of term structures that represents the relationships between the most frequent terms in a corpus. The n-gram network provides a high-level summary of the text that is best understood with a visual representation.
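Continuing the hypothetical `comments` and `tokens` objects from the previous sketch, the following lines illustrate one way to compute pairwise term correlations (the phi coefficient of Equation (5.1)) with the widyr package and to tokenize bigrams; the grouping column `id` is an assumption standing in for whatever unit of text the correlations are computed within.

```r
library(widyr)

# Pairwise correlation (phi coefficient) between terms co-occurring in the same comment
tokens %>%
  pairwise_cor(word, id, sort = TRUE)

# Bigram (n = 2) analysis: consecutive two-word sequences in the corpus
comments %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  count(bigram, sort = TRUE)
```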

Another NLP method for text mining is part-of-speech tagging, whereby each token is processed by an NLP model to identify the root of the word and its respective part of speech. The UDPipe R package performs several NLP functions, including part-of-speech tagging, and allows for the use of pre-trained or customized models.
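A minimal part-of-speech tagging sketch with the udpipe package, again using the hypothetical `comments` data, might look as follows; the model download step is only required once.

```r
library(udpipe)

# Download and load a pre-trained English model (only needed once)
dl       <- udpipe_download_model(language = "english")
ud_model <- udpipe_load_model(file = dl$file_model)

# Annotate the hypothetical comments, then inspect token, lemma, and part-of-speech tags
annotated <- udpipe_annotate(ud_model, x = comments$text, doc_id = as.character(comments$id))
as.data.frame(annotated)[, c("doc_id", "token", "lemma", "upos")]
```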

Fridolin (2019) also maintains a detailed repository of R packages intended for use in the natural language processing of both tidy and non-tidy data structures.

Data Standardization

In addition to the practice of numerical standardization, there are many resources regarding standardization of the structure and format of data sources, such as those offered by the International Organization for Standardization (ISO). Such standards include ISO 14224:2016 (Collection and exchange of reliability and maintenance data for equipment) and ISO 13306:2010 (Maintenance - Maintenance terminology) in the maintenance field.

Although intended for use in the petrochemical industry, (“ISO 14224:2016 (Collection and Exchange of Reliability and Maintenance Data for Equipment)” 2016) provides a detailed framework for reliability and maintenance (RM) data that is generally applicable to any maintenance-intensive industry. Specifically, the text provides a guide to the process and methods involved in data collection, placing a strong emphasis on the quality of data. In addition to outlining the taxonomy and subdivision of data collection regarding safety, reliability, maintenance, and business processes, the document also provides detailed recommendations for the structure and format of stoppage event information, such as Failure Mechanism and Maintenance Action, for different types of equipment.

(“ISO 13373-1:2002 (Vibration Condition Monitoring - Part 1: General Procedures)” 2002) provides an introductory overview of suggested standard procedures regarding the use of vibration measurements for the purpose of condition monitoring. This text summarizes the concept of condition monitoring, along with recommendations regarding data collection, types of measurements, transducer types, and data analysis. (“ISO 13373-2:2016 (Vibration Condition Monitoring - Part 2: Processing, Analysis and Presentation of Vibration Data)” 2016) provides a follow-up to the first document, delving deeper into the specific methods available for the analysis of vibration measurements.

Statistical Methods for Analyzing Event Data

Once the data have been prepared, one popular family of techniques for analyzing event data is survival analysis, or reliability analysis. The terms survival analysis and reliability analysis refer to the study of event occurrence along a time scale. The events may be single or recurring, with the subject of interest being the rate of occurrence, the event count, or a time-to-event measure. In survival terms, the event might represent the onset of disease; in reliability terms, the event may be equipment failure, a threshold of degradation, or a stoppage. In other words, the reliability of a component represents the probability of the component surviving (not failing) at least until a specific point in time. For the purpose of this analysis, the terms survival and reliability will be used interchangeably, as they are equivalent. In any case, reliability analysis involves probabilistically modelling the durations of observed events in order to predict the time until a future event occurs.

There is abundant literature regarding the statistical methods available for the analysis of event data. In providing an overview of such methods, Lawless (2007) makes a primary distinction between methods aimed at modelling counting processes and those for modelling gap times. In the text, a counting process is defined as \(N(s,t)\), representing the cumulative number of events occurring during the time interval \((s,t]\). Counting processes are most often the outcome of interest when the underlying event process is such that the recurring events do not effectively change the event process itself. In such situations, the recurring events are not marked by an associated intervention, which would change the process itself. An example of a counting process in which interventions are not required following events is the count or rate of occurrence of epileptic seizures.

Gap times are defined as \(W_{j}=T_{j}-T_{j-1}\), where \(W_{j}\) represents the time between the \((j-1)\)st and \(j\)th event. Conversely, gap times are typically the outcome of interest when the recurrence of an event is relatively rare and is marked by an intervention that affects the underlying process. Although the models used for event counts and gap times are very similar, the distinction between methods is typically motivated by the objectives of the analysis and the characteristics of the underlying event process (Lawless 2007).

Additionally, Lawless (2007) mentions that regardless of the type of event process under study, two features of the process are typically of interest, namely time trends and event clustering. In the text, a time trend is defined as a systematic change to the event process that occurs over time. In the case of mechanical equipment, time trends may manifest as an increase in the intensity, or rate, of failure occurrence, or in the duration of gap times between failures. Such a time trend may be representative of a change in the behavior of an equipment item as a result of an accumulation of wear, or in response to a change in maintenance policies or the effectiveness of maintenance actions.

On the other hand, Lawless (2007) defines event clustering simply as “the tendency for events to cluster together in time.” Similar to the notion of a time trend, event clustering represents a potential change in the underlying event mechanism in response to the occurrence of other events close together in time. When studying the recurrence of events of multiple types, event clustering considers the possibility that the close proximity or frequency of one type of event may influence the occurrence of another type.

An additional consideration in terms of the methods employed for analyzing a process of recurrent events is the type of covariates available for study. As in Lawless (2007), covariates are most commonly classified as fixed or time-varying, based on the relationship between the covariate value and time. Fixed covariates are also referred to as time-independent covariates, as their values do not change during the event process. Examples of time-independent covariates are birth year, identification number, or treatment group.

On the other hand, time-varying covariates are also referred to as time-dependent variables, as their values typically change throughout the event process. Examples of time-dependent covariates include age, weight, or presence of infection. An additional distinction is frequently made between internal and external time-varying covariates, or equivalently between internal and ancillary time-varying covariates. Time-dependent variables are typically “internal” variables, which represent values corresponding to the intrinsic properties of the focus of study, or observational unit. Internal variables are typically the product of a stochastic process. Conversely, ancillary or external variables are those that change value as a result of an external influence, which may affect more than a single observational unit (Kleinbaum, Klein, and Samet 2006). As Lawless (2007) describes, external variables are typically determined independently of the underlying stochastic event process, although they may still be time-dependent.

Cox and Oakes (1998) introduces an additional category of time-dependent covariates called evolutionary covariates, which depend only on the history, \(\mathscr{H}_{t}\), of the event process. In this context, the history of the event process refers to “the history of failures, censoring and of any other random features of the problem all up to time \(t\)” (Cox and Oakes 1998).

5.2.4 Models for Event Counts

As summarized in Lawless (2007), event counts are often represented by a Poisson process, “which describes situation[s] where events occur randomly in such a way that the numbers of events in non-overlapping time intervals are statistically independent.” The text also mentions that Poisson processes are often used to model events that are considered “incidental”, or caused by random external factors. In the context of modelling event counts or occurrence rates, the Poisson process is defined by the intensity function:

\[\begin{equation} \lambda_{i}\left(t | H_{i}(t)\right)=\lim _{\Delta t \downarrow 0} \frac{\operatorname{Pr}\left\{\Delta N_{i}(t)=1 | H_{i}(t)\right\}}{\Delta t}=\rho(t),\quad t>0 \end{equation}\]

where \(\rho(t)\) is a non-negative integrable function. When the function \(\rho(t)\) is constant, the process is called a Homogeneous Poisson Process (HPP); otherwise it is a Non-Homogeneous Poisson Process (NHPP). Poisson process models may be non-parametric, semi-parametric, or fully parametric, and may include covariates.

Both the homogeneous Poisson and renewal process models are based on the assumption that the times between events are independent and identically distributed. In the context of reliability, these assumptions are indicative of equipment that is entirely replaced following failure. In this regard, both models are considered perfect repair models, such that when operation resumes, the equipment is “as good as new” (Wu and Scarf 2017; Lindqvist 2006). The homogeneous Poisson process typically models reliability in terms of the mean time between failures (MTBF), with event counts following a Poisson distribution. The more general renewal process is equivalent to the HPP when the times between events are exponentially distributed, but it can also incorporate other distributions, such as the Weibull (Yanez, Joglar, and Modarres 2002).

In contrast to the HPP, the non-homogeneous Poisson process is suited to describing “as bad as old” restoration following failures. The NHPP follows from the HPP model, but includes an intensity function that varies with time (Wu and Scarf 2017). In this respect, NHPP models are able to describe long-term trends in reliability, such as “wearing in” (growth) or “wearing out” (degradation) (Tanwar, Rai, and Bolia 2014). Additionally, NHPP models are capable of describing the reliability of repairable systems (Hartler 1989; Coetzee 1997).
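As a simple illustration of the distinction between the HPP and NHPP, the following base-R sketch simulates event times from a constant-intensity process and from a process with an increasing (power-law) intensity, using the standard thinning algorithm; both intensity functions are illustrative assumptions rather than fitted models.

```r
set.seed(1)

# Homogeneous Poisson process: constant intensity rho(t) = 0.1 failures per day
hpp_times <- cumsum(rexp(50, rate = 0.1))

# Non-homogeneous Poisson process with an increasing ("wear-out") power-law intensity,
# simulated by thinning a candidate HPP run at the upper bound lambda_max on [0, 365]
rho        <- function(t) 0.02 * t^0.5   # illustrative intensity function
t_end      <- 365
lambda_max <- rho(t_end)                 # rho is increasing, so its maximum is at t_end

candidate  <- cumsum(rexp(1000, rate = lambda_max))          # candidate event times
candidate  <- candidate[candidate <= t_end]
nhpp_times <- candidate[runif(length(candidate)) <= rho(candidate) / lambda_max]
```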

5.2.5 Models for Gap Times

In contrast to studying the count or rate of event occurrence, gap times are often of interest when events are rare enough that each individual occurrence is of interest in its own right. Although sometimes denoted separately, the methodology for survival times and gap times is equivalent.

The survival function is commonly denoted as: \[\begin{equation} S(t)=1-F(t)=\operatorname{Pr}\{T>t\}, \quad 0 \leq t<\infty \end{equation}\]

The function \(F(t)\) is the Cumulative Distribution Function (CDF), which represents the probability that the observed duration \(T\) will be less than or equal to \(t\). With \(f(x)\) representing the Probability Density Function (PDF), the CDF is denoted as:

\[\begin{equation} F(t)=\operatorname{Pr}\{T \leq t\}=\int_{0}^{t} f(x) dx, \quad 0<t<\infty \end{equation}\]

Given a sample of observed failure or survival times, the empirical survivor function can be derived as (Collett 2003):

\[\begin{equation} \hat{S}(t)=\frac{\text { Number of individuals with survival times } \geqslant t}{\text { Number of individuals in the data set }} \end{equation}\]

Another facet of reliability analysis is the concept of the hazard function, which represents the instantaneous hazard rate. This hazard rate is interpreted as the probability of failure by a future time point, given that the equipment has been operational until the present time (Zacks 2012). The hazard function is denoted as:

\[\begin{equation} h(t)=\frac{f(t)}{S(t)}=\lim _{\delta \rightarrow 0} \frac{\operatorname{Pr}(t<T<t+\delta | T>t)}{\delta} \end{equation}\]

Non-parametric Estimation

The Kaplan-Meier (KM) estimator provides a non-parametric estimate of the survival function. The KM estimate \(\hat{S}(t)\), where \(n_{j}\) is the number of observations still alive at time \(t_{j}\) and \(d_{j}\) is the number of deaths at \(t_{j}\), is defined as:

\[\begin{equation} \hat{S}(t)=\prod_{j=1}^{k}\left(\frac{n_{j}-d_{j}}{n_{j}}\right) \end{equation}\]

Collett (2003) provides a description of the Nelson-Aalen (NA) estimator, which provides an alternative estimate of the survival function \(S(t)\). Although the NA estimate may perform better than the KM estimate for small samples, the KM estimator can be considered an approximation of the NA estimate, especially at short survival times. The NA estimate \(\tilde{S}(t)\), where \(n_{j}\) is the number of observations still alive at time \(t_{j}\) and \(d_{j}\) is the number of deaths at \(t_{j}\), is defined as:

\[\begin{equation} \tilde{S}(t)=\prod_{j=1}^{k} \exp \left(-d_{j} / n_{j}\right) \end{equation}\]
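Both estimators are readily obtained with the survival package in R. The sketch below uses a small set of hypothetical gap times with a censoring indicator; the Nelson-Aalen based estimate is requested here through the `stype`/`ctype` arguments of `survfit`, one of several equivalent ways to obtain it.

```r
library(survival)

# Hypothetical gap times (days between failures); status = 1 for an observed
# failure, 0 for a censored interval
gaps <- data.frame(
  gap_days = c(12, 30, 7, 45, 22, 60, 15, 90),
  status   = c(1,  1,  1, 0,  1,  1,  0,  1)
)

# Kaplan-Meier estimate of the survival (reliability) function
km_fit <- survfit(Surv(gap_days, status) ~ 1, data = gaps)
summary(km_fit)

# Nelson-Aalen based estimate, obtained as exp(-cumulative hazard)
na_fit <- survfit(Surv(gap_days, status) ~ 1, data = gaps, stype = 2, ctype = 1)
summary(na_fit)
```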

Semi-parametric Estimation

The Cox Proportional Hazards (PH) model, introduced by Sir David Cox, designates a model in which the hazard function, \(h(t)\), is the product of a baseline hazard function \(h_{0}(t)\) and an exponential term, \(\exp \left(z^{\prime} \beta\right)\). The baseline hazard function depends only upon time \(t\) and represents the hazard when all covariate values are zero. The second term in the product, \(\exp \left(z^{\prime} \beta\right)\), depends only upon the value of the covariates \(z^{\prime}\) and does not depend upon time. As such, the main characteristic of PH models is that, for any given time \(t\), a change in covariate values represents a proportional change in the respective hazard function. As summarized by Lengerich (n.d.), “[t]his assumption [PH] means if a covariate doubles the risk [hazard] of the event on day one, it also doubles the risk of the event on any other day.”

Proportional hazards models are very commonly used to solve regression problems in survival analysis, but are also applicable in engineering reliability. The PH model is a technique used to quantify the effect of covariates (environmental factors, maintenance actions, etc.) on a baseline hazard function (the instantaneous failure rate) (Moore 2016). In contrast to other methods, the Cox proportional hazards model is a semi-parametric model in that, although the regression parameters are estimated, the baseline hazard function is never specified and remains unknown (Kleinbaum, Klein, and Samet 2006).

In a Cox model, estimating the beta coefficients is all that is necessary for inferring the effect of covariates on the hazard function. As summarized by Cleves et al. (2010), “in a proportional hazards model the effect of covariates is multiplicative (proportional) with respect to the hazard.”

The Cox PH model has many equivalent representations in the literature, but is often defined as: \[\begin{equation} h(w | x)=h_{0}(w) \exp \left(x^{\prime} \beta\right) \end{equation}\]

where \(w\) is the gap time between events, \(h_{0}(w)\) is the baseline hazard function, and \(x^{\prime}\) is the covariate vector.
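A minimal sketch of fitting such a model with the survival package is shown below; the gap times, censoring indicators, and covariates (`load`, `planned`) are hypothetical placeholders for the kinds of environmental and maintenance-related covariates mentioned above.

```r
library(survival)

# Hypothetical recurrent-failure data: gap times with an environmental covariate
# (load) and a maintenance-related covariate (planned = 1 for planned interventions)
fail <- data.frame(
  gap_days = c(12, 30, 7, 45, 22, 60, 15, 90),
  status   = c(1,  1,  1, 0,  1,  1,  0,  1),
  load     = c(80, 55, 90, 40, 70, 35, 85, 30),
  planned  = c(0,  1,  0,  1,  0,  1,  0,  1)
)

# Semi-parametric Cox proportional hazards model: h(w | x) = h0(w) * exp(x'beta)
cox_fit <- coxph(Surv(gap_days, status) ~ load + planned, data = fail)
summary(cox_fit)   # exp(coef) gives the multiplicative effect on the hazard

# Check the proportional hazards assumption using scaled Schoenfeld residuals
cox.zph(cox_fit)
```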

As explained in Therneau (2000), the extended Cox model allows for stratification according to covariates, such that the observations are divided into disjoint strata or groups. Each stratum has its own baseline hazard function but shares common values for the coefficient vector \(\beta\). Thus, the hazard for interfailure duration \(i\) in stratum \(k\) has the form \(h_{k}(t)e^{X_{i}\beta}\).

Stratification is useful because it allows for adjustment for confounding covariates, or covariates which do not satisfy the proportional hazards assumption. An unfortunate aspect of stratification in the extended Cox model is that, because the baseline hazard function is not estimated, the effect or importance of the strata is not estimated (Therneau 2000). Extended Cox models may include interactions between strata and covariates, which identify whether the effect of covariates differs by stratum. Including every covariate-by-stratum interaction is equivalent to modeling each stratum separately (Therneau 2000).

Additionally, an extended Cox model including both time-independent and time-dependent covariates takes the form: \[\begin{equation} h_{k}(t, \mathbf{X}(t))=h_{k}(t) \exp \left[\sum_{i=1}^{p_{1}} \beta_{i} X_{i}+\sum_{j=1}^{p_{2}} \delta_{j} X_{j}(t)\right] \end{equation}\]

with \(X_{1}, X_{2}, \ldots, X_{p_{1}}\) the time-independent covariates and \(X_{1}(t), X_{2}(t), \ldots, X_{p_{2}}(t)\) the time-dependent covariates of interest (Kleinbaum, Klein, and Samet 2006). However, when the extended Cox model includes time-dependent covariates, it may no longer satisfy the proportional hazards assumption, as the ratio of hazards for two sets of covariate values may then vary with time.
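The sketch below illustrates one common way to fit a stratified extended Cox model with a time-varying covariate in R, using the counting-process (start, stop] data format; the equipment identifiers, intervals, `vibration` covariate, and `line` stratum are all hypothetical.

```r
library(survival)

# Hypothetical counting-process (start, stop] records with a time-dependent
# covariate (vibration measured at the start of each interval) and a stratum (line)
intervals <- data.frame(
  id        = c(1, 1, 1, 2, 2),
  start     = c(0, 20, 45, 0, 30),
  stop      = c(20, 45, 80, 30, 70),
  status    = c(1, 1, 0, 1, 1),
  vibration = c(2.1, 3.4, 4.8, 1.9, 2.7),
  line      = c("A", "A", "A", "B", "B")
)

# Extended Cox model: a separate baseline hazard per stratum (line),
# with a common coefficient for the time-varying covariate
ext_fit <- coxph(Surv(start, stop, status) ~ vibration + strata(line), data = intervals)
summary(ext_fit)
```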

Fully Parametric Estimation

Although non- and semi-parametric methods yield conclusions about survival times and the effects of covariates, fully parametric models may be preferable in some situations. Given an assumption about the underlying distribution, parameters can be estimated to fully specify the survival and hazard functions, allowing for a complete model capable of simulation (Kleinbaum, Klein, and Samet 2006).

The Weibull proportional hazards model, sometimes referred to as a Weibull analysis, is a fully parametric extension of the proportional hazards model in that it assumes a Weibull distribution for the failure times (Collett 2003). An example of a Weibull proportional hazards model is given in Jardine, Anderson, and Mann (n.d.), where it is used to assess the effect of oil composition on aircraft and marine engines. The Weibull proportional hazards model is a unique case of the PH model that is equivalent to the accelerated failure time model (Moore 2016).

Additionally, the Accelerated Failure Time (AFT) model is a fully parametric technique in which the survival function is assumed to follow a specific parametric distribution. The presence of covariates in the model contributes an acceleration factor, which represents the extent to which the time until failure is shortened or lengthened. As summarized by Cleves et al. (2010), “in an AFT model the effect of covariates is multiplicative (proportional) with respect to the survival time”. The Weibull distribution is frequently chosen in AFT models and can be parameterized as:

\[\begin{equation} S(t)=\exp \left(-\left(\frac{t}{\mu}\right)^{\alpha}\right), \quad \log (\mu)=x'\beta \end{equation}\]
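Assuming the same hypothetical gap-time data frame (`fail`) from the Cox sketch above, a Weibull model can be fitted with the survival package's `survreg` function, which uses a log-linear AFT parameterization.

```r
library(survival)

# Fully parametric Weibull model for the hypothetical gap-time data used above
aft_fit <- survreg(Surv(gap_days, status) ~ load + planned,
                   data = fail, dist = "weibull")
summary(aft_fit)

# survreg uses the log-linear AFT parameterization: coefficients act
# multiplicatively on survival time; the Weibull shape parameter is 1 / scale
1 / aft_fit$scale
```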

Cox and Oakes (1998) details various methodologies for comparing distributional families for the purpose of parametric survival modelling, mentioning consideration of the convenience for statistical inference, comparison of behavior and fit at different time durations, and evaluation of log-transformations of hazard and time, among others.

General Intensity-Based Models

Lawless (2007) provides an in-depth summary of broad classes of “hybrid” intensity-based models that allow for the inclusion of both calendar-time trends and gap times. As illustrated in the text, such models may be applicable either when changes in the underlying event process occur or when a subject’s propensity for event occurrence changes over time. In the context of equipment failures, such models may be able to reflect time trends, such as equipment degradation, or changes to the equipment itself, such as repairs performed following failures.

The Trend Renewal Process (TRP) model is similar to the NHPP in that it features a trend function, analogous to the intensity function of the NHPP. However, the TRP model is unique in that it can describe trends in failure occurrence in addition to accommodating different types of repair. In Gamiz and Lindqvist (2016), which details thorough use of the TRP, it is described as “the least common multiple of the RP and the NHPP”. Another example of a TRP model applied to engine failure data, including a comparison with NHPP and RP models, is given in Elvebakk, Lindqvist, and Heggl (1999). A description of the usefulness of the TRP model by Lindqvist (2006) notes that it is capable of illustrating the three dimensions of repairable systems: quality of repair, existence of trend, and heterogeneity between systems.

The Generalized Renewal Process (GRP), detailed by Kijima (1989), incorporates two sub-models built around the concept of virtual age. Both of the models originally introduced by Kijima involve a stochastic term on the unit interval, representing the quality of repair (its effectiveness in reducing age), used to determine the virtual age following each subsequent repair. In the context of generalized renewal theory, the concept of virtual age seeks to differentiate between the operational age of a component and its actual health relative to a new component. Extensions of the original Kijima models, namely the arithmetic reduction of intensity (ARI) and arithmetic reduction of age (ARA) models, are described by Tanwar, Rai, and Bolia (2014). While the ARA models follow directly from the virtual age models, the ARI model describes repair effectiveness in terms of the change in reliability (failure intensity) immediately prior to, and following, failure. A further description of the GRP models and their usage in repairable systems is provided by Yanez, Joglar, and Modarres (2002).

5.3 Multivariate Techniques

In addition to statistical techniques for modelling recurrent events, several other multivariate methods may prove useful for accommodating numerous covariates or building predictive models for classification. Dimensionality reduction techniques such as principal component analysis facilitate a reduction in the number of model inputs while still attempting to explain the maximum variation in the data. Supervised learning techniques such as artificial neural networks and logistic regression are highly flexible approaches in which classification models can be trained using multivariate inputs.

Principal component analysis (PCA) is a multivariate technique primarily used for dimension reduction or in cases of strong correlation between predictors. The product of PCA is a set of new variables, the principal components, which are linear combinations of the original variables. The results of PCA may sometimes be the desired objective of an analysis, but most often the resulting principal components are used in further analysis. As described in Everitt and Hothorn (2011), the principal components have an ordering such that the first component “explains” the largest amount of variation in the data, with each subsequent component explaining a smaller amount of variation.

As outlined in Sharma (1996), when there are \(p\) original predictor variables, \(x_{1},x_{2},\cdots,x_{p}\), PCA is intended to identify \(p\) linear combinations of these predictors, as defined below:

\[\begin{equation} \begin{split} {\xi_{1}=w_{11} x_{1}+w_{12} x_{2}+\cdots+w_{1 p} x_{p}} \\ {\xi_{2}=w_{21} x_{1}+w_{22} x_{2}+\cdots+w_{2 p} x_{p}} \\ {\vdots} \\ {\xi_{p}=w_{p 1} x_{1}+w_{p 2} x_{2}+\cdots+w_{p p} x_{p}} \end{split} \end{equation}\]

where \(\xi_{p}\) is the \(p\)th principal component and \(w_{ij}\) is the weight of the \(j\)th variable on the \(i\)th principal component, such that the principal components are uncorrelated and the \(w_{ij}\) are orthogonal. Although the total variation in the original variables can only be captured by using all \(p\) principal components, a subset of the identified principal components may be used to explain a desired amount of the original variation.

Rencher (2002) provides a detailed summary of the procedures and consequences of selecting a number of principal components, and notes that the greatest risk is retaining components that are sample specific or variable specific. In the text, sample specific refers to components that do not generalize to the population under study, and variable specific refers to components that represent only a single variable rather than a combination of variables. The presence of variable specific principal components can typically be identified by comparing the values of the component loadings, which are useful for interpretation of the components. Common methods for choosing the number of principal components to retain include choosing a cutoff for the cumulative amount of variation to be explained, typically 70-90%, comparing each eigenvalue to the average eigenvalue, and inspecting such values using scree plots.

Principal components are typically interpreted using the respective component loadings, which represent the correlation between the original variables and the new variables. As summarized by Sharma (1996), the loadings “give an indication of the extent to which the original variables are influential or important in forming new variables”.
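A brief sketch of PCA in base R on hypothetical condition-monitoring measurements is shown below; `prcomp` standardizes the variables, and the loadings and scree plot support the selection and interpretation steps described above.

```r
# Hypothetical condition-monitoring measurements with correlated columns
set.seed(2)
sensors <- data.frame(
  vibration = rnorm(100, mean = 3,  sd = 0.5),
  temp      = rnorm(100, mean = 60, sd = 5),
  pressure  = rnorm(100, mean = 10, sd = 1)
)
sensors$power <- 2 * sensors$vibration + rnorm(100, sd = 0.2)  # strongly correlated predictor

# Principal component analysis on standardized variables
pca <- prcomp(sensors, center = TRUE, scale. = TRUE)

summary(pca)     # proportion of variance explained by each component
pca$rotation     # loadings: weights of the original variables on each component
screeplot(pca)   # scree plot to support choosing how many components to retain
```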

An Artificial Neural Network (ANN) refers to a function intended to mimic the behavior of a biological neuron (Blockeel 2016). A biological neuron is a cell that “fires” in response to an accumulation of biological inputs, according to a specific activation threshold.

The most basic representation of an ANN is the single-layer perceptron defined as:

\[\begin{equation} y=f\left(\sum_{i=1}^{n}w_{i}\cdot x_{i}+b\right) \end{equation}\]

where \(f\) is the transfer function, or activation function, \(w_{i}\) are the weights, \(x_{i}\) the \(n\) inputs, and \(b\) the bias or activation term (Blockeel 2016). In practice, the transfer function may take several different forms, such as the logistic function or the hyperbolic tangent function, among others.

As described in Blockeel (2016), Multi-layer Perceptrons (MLP), commonly referred to as feed-forward neural networks, are ANNs consisting of multiple layers such that the output from one layer is the input of the next layer. Hidden layers are the layers of neurons that exist between the input layer and the output layer. The number of hidden layers, in addition to the number of neurons in each layer, affects the complexity of the derived approximation.

As illustrated in Bishop (1996), there are many algorithms available for training ANNs, with the most common being backpropagation. In short, this algorithm represents a propagation of errors back through the network, in order to minimize a specific error function. According to Bishop (1996), “most training algorithms involve an iterative procedure for minimization of an error function, with adjustments to the weights being made in a sequence of steps.” Additional popular training algorithms include the Newton method, the Levenberg-Marquardt method, and the Quasi-Newton method.
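As an illustration, the sketch below fits a single-hidden-layer feed-forward network with the nnet package to hypothetical, simulated failure data; the covariates, sample size, and network settings are arbitrary choices for demonstration rather than recommendations.

```r
library(nnet)

# Hypothetical classification problem: predict failure (1) vs. no failure (0)
# from two condition-monitoring covariates
set.seed(3)
dat <- data.frame(vibration = runif(200, 1, 5),
                  temp      = runif(200, 40, 90))
dat$fail <- rbinom(200, 1, plogis(-8 + 1.2 * dat$vibration + 0.06 * dat$temp))

# Single-hidden-layer feed-forward network (logistic activations) with 4 hidden units,
# fitted by maximum conditional likelihood (entropy = TRUE) with weight decay
net <- nnet(fail ~ vibration + temp, data = dat, size = 4,
            entropy = TRUE, decay = 0.01, maxit = 500, trace = FALSE)

# Predicted failure probability for a new observation
predict(net, newdata = data.frame(vibration = 4.5, temp = 85))
```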

Kutner et al. (2005) provides a detailed summary of the advantages and disadvantages of ANN models as compared to traditional statistical modelling. The primary disadvantages of ANN usage are that model parameters are generally uninterpretable, and covariate effects must be identified through the use of conditional effects plots, among other tools. Additionally, tools for traditional model diagnostics for outliers, lack-of-fit testing, and covariate significance testing are less established.

On the other hand, ANNs are not contingent upon many of the common independence and distributional assumptions of traditional statistical models. Furthermore, usage of ANNs allows for modelling of complex response surfaces when using large samples. Additionally, Kutner et al. (2005) notes that the usage of bounded logistic activation functions makes ANNs more robust to the influence of extreme outliers.

Logistic regression is a member of the Generalized Linear Model (GLM) family and represents a useful technique for modelling a binary outcome with a linear predictor. As detailed in Kutner et al. (2005), a multiple logistic regression can be used to predict Bernoulli random variables, \(Y_{i}\), with expected values \(E\left\{Y_{i}\right\}=\pi_{i}\), where:

\[\begin{equation} E\left\{Y_{i}\right\}=\pi_{i}=\frac{\exp \left(\mathbf{X}_{i}^{\prime} \beta\right)}{1+\exp \left(\mathbf{X}_{i}^{\prime} \beta\right)} \end{equation}\]

Logistic regression is defined by the logit canonical link function, such that:

\[\begin{equation} \log\left(\frac{\pi}{1-\pi}\right) =\mathbf{X}^{\prime} \boldsymbol{\beta} \end{equation}\]

where \(\mathbf{X}^{\prime} \boldsymbol{\beta}\) represents the linear predictor.
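For comparison with the neural network sketch above, the same hypothetical data can be fitted with a logistic regression using `glm`; the exponentiated coefficients are directly interpretable as odds ratios.

```r
# Logistic regression (a GLM with the logit link) on the same hypothetical data
logit_fit <- glm(fail ~ vibration + temp, data = dat,
                 family = binomial(link = "logit"))

summary(logit_fit)     # Wald tests for covariate significance
exp(coef(logit_fit))   # exponentiated coefficients are odds ratios

# Predicted probability of failure for a new observation
predict(logit_fit, newdata = data.frame(vibration = 4.5, temp = 85), type = "response")
```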

Although logistic regression is contingent upon traditional distributional assumptions, this allows for the use of classical diagnostic tools for significance testing, residual analysis, and outlier detection. Furthermore, the use of the logit link function allows for straightforward interpretation of the respective covariate effects.

In addition to standard logistic regression, ordinal and multinomial logistic regression may prove useful for classification problems. Other classification models include random forests, deep learning, decision trees, and support vector machines (Blockeel 2016).

In summary, given divergent data sets collected from corrective maintenance actions and from preventive and condition-based maintenance strategies, it is necessary to process the data prior to analysis. This procedure must take into consideration the different structures, formats, and information contained within each source. Through the use of a KDP tailored to the characteristics of the different data sets, the data will be prepared for analysis and later integration. Eventually, the data will be analyzed using the techniques described above in order to derive a decision support framework.