# 7 Results and discussion

The following chapter will present the results of the research after applying the previously outlined methodology. As all data sets were collected prior to the start of the research, this chapter will begin with a detailed description of the steps undertaken for pre-processing the data. The pre-processing section will include an explanation of the information contained within each data source and what value it will contribute to the analysis. After pre-processing, the data is now ready for analysis, which will begin with a descriptive analysis of each data set In addition to providing a high-level summary of each data, the descriptive analysis will highlight any nuances or insights that may already be appearing. Along with the descriptive analysis, the process of performing a criticality analysis, in which important plants sections and equipment are identified, will be presented. Following the identification of critical equipment, the process of data integration will be expounded, while detailing how the integrated data can be used for predictive modelling.

Next, basic reliability models will be estimated using the failure event history of the critical equipment. The basic analysis will identify and present reliability characteristics of the critical equipment that can be extended in the final predictive models. Finally, the basic models will be extended using the entirety of the integrated dataset in order to generate models to predict reliability based on past events, production level, and vibration condition monitoring.

## 7.1 Data pre-processing

### 7.1.1 Plant stoppage event data

After aggregating all of the stoppage event records into a single data frame, duplicate headings and formatting were removed to leave only the raw data. As illustrated in Figure 6.2, values such as date and subsection were only populated for the first relevant stoppage, requiring additional scripting to identify rows for which these values should be carried over. Date, time, and other manual entry errors were manually corrected as discovered, and any duplicate event records were removed.

Given that the maintenance engineers recorded stoppage events in shifts, if a piece of equipment was already stopped at the start of a new work day(6:00am), a new stoppage entry was created for the current day, and a code indicating “before the shift” was assigned to the stop time of the new event. Similarly, if an equipment remained stopped at the end of a shift(6:00am), a code indicating “continued stoppage” was assigned to the equipment start-up time. As a result, unless a stoppage occurred entirely within a single shift, the total stoppage duration was computed by tracking stoppage events across multiple days and sometimes multiple report files. In addition to inflating the number of daily events recorded, spreading information about a single stoppage event across multiple files increases the risk of human error in data entry.

To resolve this issue, the full data frame was parsed to identify stoppage events that span multiple rows, such that the beginning timestamp of the stoppage was recorded in the first row and the end timestamp of the stoppages was recorded in the last row. These rows were merged into a single event in only a single row, listing both beginning and ending timestamps of the stoppage. This correction greatly reduced the number of data rows, ensuring that each row contained the complete information for only a single event. Cleaning and merging the stoppage records reduced the number of rows from 37549 to 30380, while retaining all available information.

Finally, as it is important to be able to distinguish between different types of stoppage events(e.g., failure, planned maintenance, etc.) to achieve the objective of the analysis, several unique categories of stoppages were identified based on the noted responsible organization. The coded abbreviation and description of each of the stoppage categories is provided in Table 7.1, in which failure events are further distinguished as either mechanical failures(MECH) or electrical failures(ELE).

Table 7.1: Cement plant stoppage categories
Code Stoppage category description
SC Circumstantial stoppage not related to mechanical failure of breakdown
SM Stoppage caused by lack of raw materials
BF Stoppage due to lack of storage space
SP Stoppage related to utilities or related projects
SD Stoppage related to planned shutdown
SL Stoppage in response to low sales volume
ELE Stoppage caused by failure of electrical system
ENG Stoppage to perform planned maintenance
MECH Stoppage due to mechanical equipment failure
MOB Stoppage due to lack of mobile equipment(eg. front-end loader)
ICT Stoppage related to IT infrastructure
PROC Stoppage caused by plant processes not classified as breakdowns
PM Stoppage related to preparation for planned maintenance

#### Text mining for failure mechanism and maintenance action extraction

In addition to equipment code, plant section and interval time stamps, each stoppage record contains a Comments field in which the person who recorded the stoppage could insert additional information about the event. In total there were 30380 total stoppage events, each containing a short sentence describing details relevant to the event. The first step in analyzing the text was to turn this list of sentences into a list of single words, called tokens. In this format, the tokens could be analyzed to identify their frequency of occurrence, in which stoppage category they occur, their sentiment, and even their part-of-speech. Using the tidytext package, tokenizing the comments resulted in 163643 total tokens, or 1535 unique words, as depicted in Table 7.2.

Table 7.2: Number of tokens and unique tokens
Number of Tokens Number of Unique Tokens
After removing stopwords 122296 1461
After removing numerics and punctuation 119676 1122
Note:
Any token containing numeric characters was removed
Punctuation was removed from each token while preserving the remaining characters

Although tokenizing the stoppage comments separates each word from the rest of the sentence, the identifier of the originating sentence, as well as the stoppage category can be retained. Grouping the tokens by stoppage category makes it possible to identify the most frequently used words to describe stoppages for each category. Figure 7.1 shows the 10(unless fewer than 10) most frequently used words for each respective stoppage category. This comparison is informative as it not only describes how categories of stoppages differ from one another, it begins to describe the spectrum of stoppages that occur within each category. As shown in the figure, the most frequently used terms under the Planned Maintenance(PM) category are “Planned”, “maintenance”, “preparation”, and nothing else. In contrast, the Mechanical(MECH) category shows a variety of common terms to describe stoppages, namely “tripped”, “repaired”, “leaking”, etc..

As the concepts of failure mechanism and maintenance action are only applicable for failure events, further insights will focus of the events categorized under MECH and ELE.Considering only failure events consists of 4837 stoppages and 24749 tokens (835 unique tokens). In addition to identifying the frequency of word occurrence, Sentiment Analysis is a popular technique for text mining, which involves assigning a sentiment, or subjective emotional association, to each token(Robinson and Silge 2017). In the tidytext package, the sentiment of a word is assigned based on either of three lexicons; the AFINN lexicon assigns a sentiment in terms of emotions(joy, fear, anger, etc.), the bing lexicon assigns a sentiment as either positive or negative, and the nrc assigns an integer score between -5 and 5. For the purpose of extracting additional information about equipment stoppage events, associating words with emotions may not be useful, but binary classification of words as either negative or positive may provide a step towards separating words describing the failure mode from those describing the related maintenance action. Furthermore, using a range of integer values to represent the sentiment may demonstrate the the potential to describe the severity of failures modes and maintenance actions.

The results of a sentiment analysis on the failure stoppage comments using the Bing lexicon are shown in Figure 7.2. Each graph indicates the frequency of the 10 most frequently used terms for both positive and negative sentiments. As evident by the figure, the initial classification of words by sentiment appears intuitive, since words related to failure modes, “alarm”, “fault”, “leaking”, etc., are classified as negative, and words associated with maintenance actions, “rectify”, “cleared”, “restored”, etc., are classified as positive. However, words typically associated with a negative sentiment(by the nrc lexicon) are used far more frequently than words associated with a positive sentiment. This could indicate that the majority of the information contained in the Comments section of each stoppage event refers to the failure mode.

In addition to identifying the frequency with which unique words are used to describe stoppages, it is also possible to identify the most commonly used pairs of words, called bigrams. The network graph in Figure 7.3 depicts a portion(bigrams occurring 50 or more times) of the most frequently occurring bigrams in relation to other bigrams. In the network graph, each node represents a unique word, and nodes are connected when both words occur in the same bigram. The size and opacity of the line represents the frequency with which the bigrams occur. Nodes directly connected to many other nodes represent words that frequently appear with other words. For example, faulty, alarm, and leaking are nodes with many connections, indicating that they frequently appear next to many different words. This can be validated from the stoppage records, which indicate numerous failures caused by faults and alarms. These words are likely being used to describe the failure mode of the respective stoppages. This same type of representation can be performed with n-grams, where n is the number of words occurring in succession.

To gain more insight into the words occurring together, we employed correlation analysis. Correlation analysis identifies the correlation coefficient, as defined in Equation (5.1), between respective pairs of words occurring together in the same comment. Although a high correlation for a pair of words does not imply that the words occur very frequently, it does imply that words occur together, or not at all. Table 7.4 shows the 15 pairs of words with the highest respective correlations. $$$\phi = \frac { n _ { 11 } n _ { 00 } - n _ { 10 } n _ { 01 } } { \sqrt { n _ { 1 .} n _ { 0. } n _ { . 0 } n _ { .1 } }} \tag{5.1}$$$

Table 7.3: Pairwise frequency combinations used to calculate correlation
Has word 2 No word 2 Total
Has word 1 $$n _ { 11 }$$ $$n _ { 10 }$$ $$n _ { 1. }$$
No word 1 $$n _ { 01 }$$ $$n _ { 00 }$$ $$n _ { 0. }$$
Total $$n _ { .1 }$$ $$n _ { .0 }$$ $$n$$
Table 7.4: 15 highest correlations between pairs of words
Word 1 Word 2 Correlation
public holiday 1.0000000
house keeping 0.9994159
hot gases 0.9985030
motion detector 0.9973061
planned maintenance 0.9970205
weighout taking 0.9946017
physical weighout 0.9911390
impact crusher 0.9894996
sample analysis 0.9863776
physical taking 0.9857886
tertiary materials 0.9857834
touched table 0.9845732
final product 0.9799933
shifting mills 0.9764481

As illustrated in Table 7.4 some pairs of words with high correlations are less informative, such as “public holiday”. On the other hand, some pairs such as “hot gases” and “tertiary materials” may imply a failure due to lack of or inability to produce, respectively.

Additional information was extracted from the comments of each stoppage by tagging each tokenised word with the respective part-of-speech. The UDPipe R package performs several Natural Language Processing functions using pre-trained or customized models. In this case, a pre-trained model was used to annotate the tokenised terms with the relevant part-of-speech. The graphs in Figure 7.4 display the 10 most commonly used terms to describe failures,divided by their part-of-speech. The full definitions of the part-of-speech abbreviations are provided in Table 8.6 in the appendix.

Despite several miss-classifications when using the pre-trained model, classification by part-of-speech proved the concept of using categorization to extract relevant information from the stoppage comments. As shown in the figure, the list of most common verbs contain words such as tripped and leaking which could be identified as specific failure modes of a stoppage. The list also contains repaired and replaced which could be identified as the respective maintenance actions performed in response to the stoppages. The lists of adjectives and nouns also show potential to identify a specific component of a unit of equipment( motor, pump, switch, etc.), as well as its operational status( faulty, empty, worn, etc.).

The use of a customized model with a standardized lexicon of failure modes and maintenance actions may make it possible to extract and classify the failure mode, unit component, and maintenance action for each individual stoppage. This could allow maintenance engineers to estimate more specific reliability models according to specific failure modes or maintenance actions.

Continuing with the subset of failures only, an additional sentiment analysis was performed on the most frequently used verbs, shown in Figure 7.5. The results of this analysis appear to confirm the notion that the stoppage comments contain far more negatively associated words than positive ones, in terms of both number of unique words and total number of words. Despite this fact, the pre-trained model appears to classify failure modes(negative verbs) and maintenance actions(positive verbs) quite accurately.

As the ultimate goal of the text mining was to extract a failure mechanism and maintenance action according to (“ISO 14224:2016 (Collection and Exchange of Reliability and Maintenance Data for Equipment)” 2016) standards, the terms identified in Figure 7.5 were manually mapped to the failure mechanism and maintenance action categories in Tables 8.1 and 8.2 in the appendix. After an initial mapping, the text mining procedure was repeated iteratively, until all possible failures could be categorized. The remaining failure events for which the description did not provide enough information for classification were assigned to the other category.

The results of this text mining endeavor are summarized in Table 7.5 and Table 7.6, which show the distribution of respective failure mechanism and maintenance action categories extracted from all of the failure events.

Table 7.5: Frequency of failure mechanisms
FM Meaning Count
1.0 General Mechanical 2470
1.1 Mechanical Leakage 531
1.2 Abnormal Vibration 70
1.3 Alignment Failure 130
1.4 Deformation 6
1.5 Looseness 145
1.6 Sticking 26
2.4 Wear 32
2.5 Breakage 482
2.7 Overheating 426
3.0 General Instrument Failure 127
4.4 Faulty Power 2
6.3 Miscellaneous Other 274
Table 7.6: Frequency of maintenance actions
MA Meaning Count
1 Replace 668
2 Repair 1034
5 Refit 19
9 Inspection 18
12 Other 165

### 7.1.2 Vibration measurement data

As the vibration data was provided in a large HTML document, Python was used to scrape the data from the table elements and insert the entries into a new dataframe. Whitespace, formatting, and duplicate headings were removed, and human input errors for the date, equipment code, and measurement type were manually corrected. Since vibration measurements that were recorded on the same day were observed as close together in time as possible, they were combined into single rows creating a multivariate entry.

In this format, each row represents observations taken on a unique component of a unique equipment on a given day, with up to 9 variables measured. The 9 vibration variables and descriptions of the type of vibration being measured are provided in Table 7.7.

Table 7.7: Description of vibration measurement variables
Variable Unit Description
ANDE mm/s Axial non-drive end
HDE mm/s Horizontal drive end
HDE.CAV mm/s Horizontal drive end cavitation
HDE.ENV gE Horizontal drive end enveloping
HNDE mm/s Horizontal non-drive end
HNDE.ENV gE Horizontal non-drive end enveloping
HNDE.FL mm/s Horizontal non-drive end
VDE mm/s Vertical drive end
VNDE mm/s Vertical non-drive end

As the vibration measurements were observed manually using a portable probe, some observations were incomplete such that values were recorded for some variables and not for others. A summary of the proportion of missing values, along with the respective combinations and proportions is provided in Figure 8.1 in the appendix. Approximately 75% of the vibration measurements were complete observations, with HDE.CAV, ANDE, and ADE being the most frequently missing, with 53, 44, and 40 missing values respectively. In order to make use of these partial observations, the missing values were imputed from the remaining vibration measurement through the use of the PMM (Predictive Mean Matching) algorithm from the MICE(Multivariate Imputation by Chained Equations) package in R. The PMM algorithm imputes a missing value by identifying observations with similar remaining covariates and randomly draws a value from these observed candidates to ensure a plausible outcome.

### 7.1.3 Monthly production data

As the plant was divided into only 10 subsection for the purpose of recording monthly production, this is the smallest data source. Aggregation of this data involved manually coping the values from each monthly report into a new dataframe. Once aggregated, the final table contained one row for each sub-section, with the columns representing the total production tonnage for each month.

Given that the maximum production rate for each sub-section of the plant was also provided, but in tons per hour, this figure was be multiplied by the number of hours in each respective month to identify the maximum monthly production. Dividing the observed monthly total by the maximum monthly total yielded a measure of relative production rate as compared to the maximum capacity. This provides an effective measure of the relative load that a sub-section of the plant experienced each month.

## 7.2 Descriptive analysis of respective data

The following sections will provide a descriptive analysis of each data set after being pre-processed.

### 7.2.1 Plant stoppage event data

After processing the historical stoppage data, it is useful to generate some summary measures relating the data to the structure and organization of the cement plant. Given that for each stoppage event, the time and date of the beginning and end of the event are recorded, as well as a classification of system, subsystem, and component, insights are available at each organizational level. Using the stoppage categorization and plant section outlined in the previous chapter, we compared a frequency breakdown of the stoppages. Table 7.8 provides a breakdown of the total number of stoppages by category for each respective section of the plant. As shown in the table, Section 2 observes the single largest number of stoppages, 7880, which are categorized as PROC. Although Section 2 observes the largest number of stoppages, Section 6 observes the largest amount of failure( ELE or MECH) events, 1098 and 1225 respectively. A graphical comparison of the stoppage frequency breakdown is provided in Figure 7.7.

Table 7.8: Total number of stoppages by category per plant section
Plant Sections
Category 2 3 4 5 6 7 8
BF 38 NA 15 548 65 NA NA
ELE 376 559 14 326 1098 144 NA
ENG 258 62 NA 8 250 398 NA
ICT NA NA NA NA NA 35 NA
MECH 236 551 8 185 1225 113 2
MOB 62 1 NA NA 39 NA NA
PM 119 63 NA 1 257 377 NA
PROC 7880 1011 12 709 4188 89 1
SC 360 207 13 750 679 2707 NA
SD 10 5 NA 112 9 8 3
SL NA NA NA NA 1 2618 NA
SM 12 NA NA 36 60 11 1
SP 93 235 NA 401 543 183 NA
Note:
NA indicates that no stoppages of that category were observed

### 7.2.2 Criticality analysis

Figure 7.6 displays a comparison of the total number of stoppages, as well as the percentage contribution to overall plant downtime broken down by stoppage category and also plant section. The two left plots, in which stoppage events are grouped by the plant section in which they occur, identify plant sections 2 and 6 as having both the highest total number of stoppages(top left plot) as well as being responsible for the largest contributions to overall plant downtime(bottom left plot). As indicated by the two right plots, process(PROC) and circumstantial(SC) stoppages are the most frequently occurring categories, as well as the largest contributors to overall plant downtime. Neither circumstantial stoppages(e.g., lack of alumina or hot gases) nor process related stoppages(e.g., silo change over) are considered equipment failure events.

Figure 7.7 provides a breakdown of the total number of stoppage events, per category, for each of the respective plant sections. Following the previous assessment that identified section 2 as the the most critical section in terms of all stoppage events, it is now evident from Figure 7.7 that almost all of these stoppages(nearly 8000) are process stoppages(PROC) with very few stoppages categorized as failure events. Although the majority of stoppage events in section 6 are also process related, this section has the largest number of failure events. Additionally, sections 3 and 5 are responsible for the next highest number of failure events, respectively. In light of this, section 6 is identified as a critical section based on not only the number of failures, but also the contribution to downtime.

Following from the observation that Section 6 observes the largest number of failures, the following analysis will consider only those failure events. As suggested by Jardine and Tsang (2013), a Pareto chart and jackknife plot can help identify equipment items that are prone to frequent or long stoppage events, which may be of particular interest going forward.

The plot in Figure 7.8 depicts the total number of failure events in Section 6 according to the related equipment(only equipment with more than 20 recorded stoppages). As shown in the plot, cement mill 4 (equipment 06CM41) is responsible for approximately 400 of the stoppages in the historical record, roughly twice as many as the next equipment.Furthermore, cement mills 5 and 1(06CM51 and 06CM11) are each responsible for over 100 failure events. As such, section 6 contains an additional cement mill, cement mill 3(06CM31), responsible for over 50 stoppages. Due to the large number of failure events(688) and being responsible for the core function of section 6(cement grinding), the cement mills are identified as the most critical equipment.

### 7.2.3 Event clustering(All stoppage events)

Despite the occurrence of failure events being of primary interest to maintenance engineers, as identified previously, they are far outnumbered by non-failure stoppage events from the procedural and circumstantial categories. As such, it is important to understand their relevance in understanding and modelling the occurrence of failure events.

For the purpose of looking closer at non-failure events, we compare the mean stoppage duration for each category of stoppages across all plant sections, as shown in Figure 7.9. These plots indicate that on average SD stoppages, which correspond to planned shutdowns, last much longer than any other category. In contrast with the frequency plots in Figure 7.7, which identified process(PROC) as the most frequent stoppage category for almost every plant section, the mean stoppage duration plots indicate that procedural stoppages have short durations, on average. Similarly, circumstantial(SC) stoppage events, which have the second highest frequency of occurrence, have extremely short durations in all plant sections.

As all sections of the plant experience frequent and repeated stoppages of relatively short duration, it is logical to presume that these events may have a negative impact on related mechanical equipment. As mechanical equipment is prone to natural wear from normal operating conditions, it may also experience additional wear from repeatedly stopping and restarting. As the amount of wear on a mechanical equipment accumulates, the likelihood of experiencing an imminent failure increases. As such, it may be important to identify clusters, or patterns, of events that may indicate a relationship between failure and non-failure related events.

As described by Lawless (2007) when studying recurrent events of multiple states, visually identifying clusters of time-ordered events may indicate a relationship between the frequency, order, or time between events. Despite identifying the cement mills in section 6 as the most critical equipment items in terms of failure events, the total number of events makes visual representation difficult, as points will be continuously overlapping. Section 3(raw mill) will be used as an example of the type of clustering that can occur between failure and non-failure events, as the number of events is frequent enough to be observed without hindering visual observation.

As shown in Table 7.9 the raw mill has experienced 117 mechanical failures and 64 maintenance interventions(ENG and PM), in addition to 161 procedural and 205 circumstantial stoppages. As an increase in wear is more likely to have an impact on mechanical failures as opposed to electrical ones, this failure type will be the focus.

Table 7.9: Number of stoppage events per category for plant section 3(Raw Mill)
ELE ENG MECH PM PROC SC SD
42 62 117 2 161 205 5

Figure 7.10 contains three plots which display event occurrences, according to their time ordering, against a time scale, representing time(in days) since the first failure. In order to clearly visualize the events, each plot only shows events which occurred between 500 and 700 days since the first failure. For the top two plots, a cluster of black triangles represents a period of time in which the raw mill was repeatedly stopped and restarted for non-failure reasons, potentially causing an accumulation of wear. It is of particular interest to identify whether the presence and frequency of these non-failure stoppages has an effect of the interfailure durations(the time between red circles) of mechanical equipment, essentially causing failures to occur sooner. Conversely, events of planned maintenance,identified in the bottom plot, would theoretically have a positive impact on interfailure durations, thereby prolonging the time until the next failure event occurs. The chosen method for accounting for these non-failure events will be detailed in the section about data integration.

### 7.2.4 Vibration measurement data

After collecting and cleaning the vibration measurements into a single data frame, a descriptive analysis was performed to provide insight regarding the scope of information the data contains. In total, the vibration data consists of 1634 observations made on different equipment throughout the plant. Figure 7.11 contains a comparison of boxplots for each of the 10 available vibration measurements, with the colored box representing the interquartile range(IQR), or 1st and 3rd quantiles, in addition to the median value. All variables are measured in units of mm/s except for HDE.ENV and HNDE.ENV which are measured in units of gE, which represents enveloping acceleration. The enveloping observations measure vibrations in a higher frequency caused by rolling element bearing or gear mesh problems.

As evident from the plots, the median values are all relatively small, indicating that machines typically operate with minimal vibration. Both plots show frequent outlying observations, with several extreme outliers(200-500mm/s range) observed which are not shown. Given the units of measurement, and the fact that these observations were recorded via a handheld probe, it is more likely that the most extreme outliers are the result of observational error, and not correct values.

As the individual vibration measurements are observed and recorded, the probe assigns a health status of Normal, Alert, or Danger, which is also recorded. Figure 7.12 contains a breakdown of the range of recorded observations, per variable, according to the respective assigned status. The variable HDE.ENV is not included as all observed values were labeled a normal status.

As illustrated in the plots, the range of values corresponding to normal observations is the largest for each respective variable. Conversely, the range of values corresponding to alert status observations is significantly smaller, with values close to 5 for all variables. Additionally, the range of values for danger status observations is consistently above the IQR for alert status observations, while still occasionally having lower values than some normal observations.

This indicates that the alert status may be assigned by the probe in a proprietary manner accounting for more than just the observed value, or that there may be error or inconsistency in observation status assignment. At any rate, this insight provides context for what magnitude of vibration would be considered dangerous to the operational ability of equipment. Table 8.3 in the appendix contains a summary of the quantiles for each vibration variable and status combination.

As vibration measurements are neither continuously observed nor observed at regular intervals, the number of observations per equipment item is highly varied. Additionally, depending upon the equipment, vibrations may be observed at different locations on the equipment, identified by component(i.e., fan, motor, bearing, etc.). As such, in order to use these historical records to describe the health and event patterns for equipment items, it is important to understand the the quantity of available information for relevant plant sections and equipment.

Figure 7.13 provides a breakdown of the number of vibration readings by equipment and section, denoting the proportion from respective component locations. Since some equipment have relatively few vibration observations, only the top 6 equipment per section are shown in the figure.

Based on the plots, it is evident that the equipment in section 5 is responsible for the largest sample of vibration observations, with all of these equipment items being fans(xFNx equipment code). Additionally, fans appear to be some of the most frequently monitored equipment in the remaining sections of the plant. Based on the largest representative sample size, fans will be of primary interest in later condition-based models.

Despite the vibration measurements being observed at irregular intervals, observing the changes in vibrations over time may describe changes in the “health” of equipment over time, which would otherwise be unknown without such an indicator.

Figure 7.14 contains a time series plot of the available vibration measurements for fan 2 in section 5(05FN02), which tracks the progression of values throughout time. The bottom plot shown a time series of the 8 vibration variables measured in mm/s along with two horizontal dashed lines at values of 3.5 and 6, representing approximate minimum values for being classified as alert or danger by the recording probe.

As evident from the plot, during periods where observations are more frequent, the vibration levels, and health of the equipment, has a large amount variation of variation over time. The two enveloping variables in the top plot also display distinct variation, despite all measurements of these variables for fan 2 being classified as normal.

As shown in Table 8.3 in the appendix, the minimum observed alert value for the HNDE.ENV variable was 6.09gE. The near-horizontal lines between December, 2016 and September, 2017 represent a period of time in which no vibration readings were recorded for fan 2.

Several fans throughout the plant were identified as having been monitored via vibration measurements, and have also experienced failure events, thus having near-complete records in all data sources. These fans are referenced by equipment ID “03FN08”, “05FN01”, “05FN02”, “05FN31”, “05FN33”, “06FN11”, “06FN41”, “06FN51”, “06FN52”, “06FN53”, and “07FN01”. When building predictive models for reliability using using fan data, this subset will be used unless otherwise specified. Unfortunately, as the variable HNDE.FL will not be included in this subset as it is not measured on fans, but on motors.

Although each vibration variable represents velocity or acceleration in a specific direction, measured at a specific location on the equipment, different parts of the equipment do not vibrate in isolation. As shown in the time series in Figure 7.14, each of the observed variables tend to follow the same patterns in time, implying that as the equipment vibrates more, the increase is is typically reflected in all variables. However, the sparsity of observations in the time series may not adequately reflect the relationship between variables, making them appear far more similar than they are.

Figure 7.15 contains a correlation matrix for all available vibration variables, with values close to 1 indicating high positive correlation, values close to 0 indicating no correlation, and values represented by “x” indicate values that are not observed together. As shown in the figure, the variables with the highest correlation of 0.79 are ADE and ANDE, which measure vibrations in the axial direction on the drive-end(ADE) and non-drive-end(ANDE) of an equipment. Furthermore, axial drive-end(ADE) and horizontal drive-end(HDE), as well as ANDE and vertical drive-end(VDE), have strong correlations of 0.63.

#### Principal component analysis on the vibration measurements

Based on the correlation between most of the vibration measurements and the larger number of predictors, principal component analysis was performed in order to offer a reduced set of uncorrelated covariates for future condition-based reliability models using the original observations. For the purpose of this PCA, only the data from the subset of fans is used, as the results of the PCA will be used to model fan reliability.

As described in Rencher (2002), a cutoff of 80% for the cumulative amount of explained variance was used to justify retaining the first 4 principal components. Additionally, as these components will be used in later models to predict reliability, it is advantageous to retain too many components now, as excess components can simply be excluded during model comparison. Table 8.4 in the appendix contains the component loadings for the first 5 principal components using 9 of the original vibration variables.

A biplot of the PCA results, as shown in Figure 8.2 of the appendix, revealed a strong influence of the outlying values previously described. In response, values above 50mm/s for ADE, ANDE, HDE, HNDE, and VDE were reduced to the mean values for each variable. After reducing the outlying values, a second PCA was performed in order to provide additional predictors for future modeling. Similar to the first PCA, the second technique identified 4 components to explain more than 80% of the variation in the original variables. Table 8.5 in the appendix contains the component loadings for the first 4 principal components after reducing the outlying values. A biplot of the results from the second PCA is shown in Figure 7.16.

Based on the loadings for the second PCA, the first principal component largely represents HDE.CAV and HDE.ENV, which are both horizontal drive-end vibrations, as their loadings have the largest absolute values. Component 2 is dominated by ANDE and HNDE, which are both non-drive-end measurements. Component 3 has the largest correlations with HDE and ADE, which are both drive-end measurements. Additionally, component 4 is correlated highest with VDE and ANDE, both of which are drive-end measurements. The relationships between the first two principal components and the original observations are best illustrated in the respective bigram.

### 7.2.5 Monthly production data

In addition to historical event records, the plant also maintains a record of total monthly production output(measured in tons) for 9 subsections of the plant. The figures are reported by subsection as some plant sections contain multiple subsections which may be operating in parallel(e.g., cement mills) or processing different materials(e.g., coal mill and kiln). Section 5, for example, is made up of 4 cement mills operating in parallel subsections, with total material output being recorded for each.

As the production records were recorded during the same 3-year period as the stoppage event records, this information will provide an important indication of the production performance by vital equipment leading to failure events.

Figure 7.17 contains a comparison of the monthly relative production for each of the four cement mills, over the 3-year period. As evident in the plot, cement mills 4 and 5 maintain an average monthly production rate that fluctuates around 50% for the majority of the monitoring duration. In contrast, cement mills 1 and 3 operate at considerably lower average monthly production rates, and even experience periods of complete shutdown(these shutdowns are indicated in the stoppage event records). This information corroborates with the earlier conclusion from the descriptive analysis of the stoppage event records, that cement mills 4 and 5 experience more failure events than the remaining cement mills, which may be attributed to the fact that they are used most often.

## 7.3 Data integration

Following the pre-processing and descriptive analysis of all data sources there is a better understanding of what information each data set contains. Using the criticality analysis to compare areas of the plant based on the frequency of failure events has identified the cement mills located in section 6 as the most critical. In addition to experiencing the highest number of failure events, the cement mills also experience a large number of non-failure stoppages due to raw material or other resource shortages. Although the cement mills themselves were not monitored for vibrations, the respective stoppage event and production records will be integrated.

The descriptive analysis of the vibration data has concluded that fans are the most frequently monitored equipment throughout the plant. As the fans are responsible for performing temperature regulation for large equipment throughout the plant, such as the kiln in section 5, the condition of the fans is critical to the functionality of each section.

As both the cement mills and fans have been identified as critical equipment throughout the plant, the following section will detail the process of data integration for each.

### 7.3.1 Integrated cement mill data

In its prepared form, each entry in the stoppage record data represents a stoppage event, denoting the category of stoppage, beginning and ending timestamp, equipment, stoppage duration(repair time), and respective failure mechanism and maintenance action(if failure event).

In order to prepare the data for integration, the event records representing failure durations should be turned into records representing interfailure durations, or gap times. Whereas the original data identified when the equipment failed, and the duration of downtime, the integrated data must represent how long the equipment was operational for, prior to failure, and what occurred to the equipment during this time. The following steps were performed to integrate the data sets for the cement mills in section 6.

1. Select only the failure events for equipment “06CM11”, “06CM31”, “06CM41”, and “06CM51”.
2. Using these failure events, convert the failure records into gap times. As the first gap time corresponds to the time between the first and second failures, the sample size for gap times is two records smaller than the sample size of failure events. Since the failure mechanism, maintenance action, and repair time have been extracted earlier, they are retained such that each value represents what occurred prior to the gap time. The gap times now represent all available information regarding failure events, with each row corresponding to a single gap time.
3. To incorporate the production records into the gap times, the beginning and end of each respective gap time is referenced with the production data. If the gap time occurs within a single month, the corresponding monthly production rate is assigned to the gap time. If the gap time spans multiple months, a data row is inserted for each month for which production is recorded. As multiple rows now represent a single gap time, each data row will be identified as a segment for which a new variable, stat, is used to identify whether the segment ended with failure of the equipment. In this format, stat is set to 0 for all but the final segment, indicating that the equipment was continuously operational until the final segment. The production rate for each segment is considered a time-dependent covariate, as the segments are used to track the changes in production while the equipment is approaching the next failure.
4. A similar step is used to incorporate information regarding the non-failure events that occur within the gap times for each respective equipment. For each segment, the beginning and end timestamps can be referenced against the non-failure events for each equipment. In order to account for maintenance interventions(planned maintenance not initiated by failures) that occur during the gap time, a new variable maint. is introduced, which keeps a running total of these events. Additionally, a variable numstop is used to keep a running total of the stoppages that are neither failure-related nor maintenance interventions. Since both of these new variables are also time-dependent variables, new time segments are inserted when either of these values changes. When iterating through the consecutive gap times, each row corresponds to a segment of time in which only one of the time-dependent covariates change, while all other variables remain constant.

Although the integration procedure considerably increases the complexity of the data, it provides a framework for accounting for maximal information regarding the behavior of an equipment between failure events. Table 7.10 provides a sample of the first interfailure duration for cement mill 1 after integration of the data. Due to the size of the dataframe, the additional covariate values are not shown, but a description of all covariates is provided in Table 7.11.

Table 7.10: Sample of the intregrated cement mill data for the first interfailure duration for cement mill 1
id seg stat start stop tstart tstop equipment
1 1 0 2015-01-27 04:20:00 2015-01-30 06:00:00 0.000000 3.069444 06CM11
1 2 0 2015-01-30 06:00:00 2015-01-31 23:59:59 3.069444 4.819433 06CM11
1 3 0 2015-01-31 23:59:59 2015-02-27 08:10:00 4.819433 31.159722 06CM11
1 4 0 2015-02-27 08:10:00 2015-02-28 23:59:59 31.159722 32.819433 06CM11
1 5 0 2015-02-28 23:59:59 2015-03-01 13:40:00 32.819433 33.388889 06CM11
1 6 0 2015-03-01 13:40:00 2015-03-01 14:47:00 33.388889 33.435417 06CM11
1 7 0 2015-03-01 14:47:00 2015-03-01 22:30:00 33.435417 33.756944 06CM11
1 8 0 2015-03-01 22:30:00 2015-03-02 05:15:00 33.756944 34.038194 06CM11
1 9 1 2015-03-02 05:15:00 2015-03-02 16:30:00 34.038194 34.506944 06CM11

As shown in Table 7.10, the id variable indicates all the rows that correspond to the first interfailure duration. The interfailure durations are numbered only to facilitate estimation, and are not treated as ordered. The seg variable indicates the segment, or time interval, which is a further subdivision of the gap time. As this interfailure duration contains 9 segments, it indicates that there have been 8 changes in the time dependent variables for cement mill 1 during this gap time.

As evident in Table 7.10, the stat variable indicates when the equipment finally fails, and is necessary for model estimation. The start and stop variables keep track of the segment time via timestamp, and tstart and tstop track the progression of each segment by counting the days since the start of the interfailure duration.

A description of each covariate is provided in Table 7.11, which is classified into identifiers, time-independent covariates, and time-dependent covariates. The identifiers are used to keep track of the time, equipment, and equipment status. The time-independent covariates do not change throughout the interfailure duration, and represent what occurred to the equipment immediately prior to becoming operational. The time-dependent covariates represent the incoming information about what is occurring to the cement mills, and is accounted for based on the time it occurs.

Table 7.11: List of covariates and descriptions for integrated cement mill data
Variable Type Variable Description
Identifiers id Identifier for each interfailure duration
seg Identifier for time-dependent segment of duration
stat Status of whether the time segment ended in failure
start Timestamp for start of each segment
stop Timestamp for end of each segment
tstart Time(in days) since start of first segment
tstop Time(in days) since start of first segment
equipment Equipment being observed
Time-independent ma Previous maintenance action performed
fm Failure mechanism of previous failure event
ma Maintenance action of previous failure event
reptime Repair time(downtime) for previous failure event
Time-dependent prod Relative production rate for associated equipment
numstop Cumulative count of non-failure stoppages
maint Cumulative count of planned maintenance actions

### 7.3.2 Integrated fan data

Since the fans are also monitored for vibrations, the integration of data sources follows the same procedure, but with an additional step to insert the vibration measurements as time-dependent covariates, as indicated below.

1. Select only the failure events for equipment “03FN08”, “05FN01”, “05FN02”, “05FN31”, “05FN33”, “06FN11”, “06FN41”, “06FN51”, “06FN52”, and “06FN53”.
2. The same step(2) is repeated as for cement mills
3. The same step(3) is repeated as for cement mills
4. The same step(4) is repeated as for cement mills
5. A similar step is performed to integrate the vibrations measurements into the data frame. For each segment, the vibration data is queried to retrieve observations for the respective equipment. For each observation, a new segment is inserted into the interfailure duration, keeping the remaining covariates unchanged. For the first segment, representing the time interval in which the equipment just became operational, the vibration level is unknown, but assumed to be at an acceptable level. As such the values for the first time segment are imputed under this assumption.

The integration of the data sources for fans follows a similar procedure, and provides a similar framework for accounting for the changes the fans experience between failure events. Table 7.12 provides a sample of the first interfailure duration for section 3, fan 8 after integration of the data. The additional covariate values are not shown, but a description of all covariates is provided in Table 7.13.

Table 7.12: Sample of the intregrated fan data for the first interfailure duration for section 3 fan 8
id seg stat start stop tstart tstop equipment
1 1 0 2016-10-30 00:28:00 2016-10-31 23:59:59 0.000000 1.980544 03FN08
1 2 0 2016-10-31 23:59:59 2016-11-30 23:59:59 1.980544 31.980544 03FN08
1 3 0 2016-11-30 23:59:59 2016-12-31 23:59:59 31.980544 62.980544 03FN08
1 4 0 2016-12-31 23:59:59 2017-01-31 23:59:59 62.980544 93.980544 03FN08
1 5 0 2017-01-31 23:59:59 2017-02-28 23:59:59 93.980544 121.980544 03FN08
1 6 0 2017-02-28 23:59:59 2017-03-31 23:59:59 121.980544 152.980544 03FN08
1 7 0 2017-03-31 23:59:59 2017-04-30 23:59:59 152.980544 182.980544 03FN08
1 8 0 2017-04-30 23:59:59 2017-05-31 23:59:59 182.980544 213.980544 03FN08
1 9 0 2017-05-31 23:59:59 2017-06-30 23:59:59 213.980544 243.980544 03FN08
1 10 0 2017-06-30 23:59:59 2017-07-14 07:22:00 243.980544 257.287500 03FN08
1 11 0 2017-07-14 07:22:00 2017-07-26 12:05:00 257.287500 269.484028 03FN08
1 12 0 2017-07-26 12:05:00 2017-07-27 03:28:00 269.484028 270.125000 03FN08
1 13 0 2017-07-27 03:28:00 2017-07-31 23:59:59 270.125000 274.980544 03FN08
1 14 0 2017-07-31 23:59:59 2017-08-31 23:59:59 274.980544 305.980544 03FN08
1 15 1 2017-08-31 23:59:59 2017-09-17 02:22:00 305.980544 322.079167 03FN08
Table 7.13: List of covariates and descriptions for integrated fan data
Variable Type Variable Description
Identifiers id Identifier for each interfailure duration
seg Identifier for time-dependent segment of duration
stat Status of whether the time segment ended in failure
start Timestamp for start of each segment
stop Timestamp for end of each segment
tstart Time(in days) since start of first segment
tstop Time(in days) since start of first segment
equipment Equipment being observed
Time-independent ma Previous maintenance action performed
fm Failure mechanism of previous failure event
ma Maintenance action of previous failure event
reptime Repair time(downtime) for previous failure event
Time-dependent prod Relative production rate for associated equipment
numstop Cumulative count of non-failure stoppages
maint Cumulative count of planned maintenance actions
ANDE Axial non-drive end
HDE Horizontal drive end
HDE.CAV Horizontal drive end cavitation
HDE.ENV Horizontal drive end enveloping
HNDE Horizontal non-drive end
HNDE.ENV Horizontal non-drive end enveloping
HNDE.FL Horizontal non-drive end
VDE Vertical drive end
VNDE Vertical non-drive end
comp1 First principal component score from primary PCA
comp2 Second principal component score from primary PCA
comp3 Third principal component score from primary PCA
comp4 Fourth principal component score from primary PCA
compb1 First principal component score from secondary PCA
compb2 Second principal component score from secondary PCA
compb3 Third principal component score from secondary PCA
compb4 Fourth principal component score from secondary PCA

As evident by Table 7.12, the integrated fan data follows an identical structure as for cement mills integrated data set, apart from the addition of more covariates. However, the major difference between the two is that because each vibration measurement corresponds to a new observation of several different variables(e.g., ADE, ANDE, etc.) at the same point in time, all vibration variables change at the same time. In the case of the cement mills, for each additional segment in a gap time, a maximum of 1 variable has been updated, but for the fans, a maximum of 17 variables(9 original measurements, 8 principal component scores) have been updated.

A description of each covariate is provided in Table 7.13, which is also classified into identifiers, time-independent covariates, and time-dependent covariates. The meaning of these covariates is consistent between data sources, with the exception of including additional time-dependent covariates to incorporate the change in vibrations being observed.

## 7.4 Basic reliability models

### 7.4.1 Basic cement mill reliability

In the following analysis, a subset of failure events for cement mills from plant section 6 will be used. Considering this data, the random variables is the time between failures, which will be referred to as interfailure duration.

As illustrated in Table 7.14, which contains descriptive statistics for the current subset of interfailure times, the sample mean for cement mill 4 is more than 3 times the median interfailure time. Furthermore, the maximum interfailure time is more than 10 times the sample mean value, and the sample mean is very close to the 3rd quantile value. These statistics indicate that the sample distribution is highly skewed to the right, suggesting that extremely large interfailure durations do exist, yet are far less frequent.

Table 7.14: Descriptive statistics for interfailure times(days) for section 6 cement mills
Mill Min. Q1 Median Mean Q3 Max.
ALL 0.001 0.125 0.948 5.971 4.805 223.757
1 0.003 0.141 0.967 8.849 4.454 223.757
3 0.007 0.188 4.309 19.051 12.415 171.381
4 0.001 0.086 0.768 2.673 3.146 33.939
5 0.007 0.473 2.120 8.518 8.785 116.829

#### 7.4.1.1 Trend assessment

In the study of recurrent events, there are several common methods for identifying the presence of a trend, which can refer to several different aspects.

As suggested by Lindqvist (2006), the plot in Figure 7.18, represents a trend chart of the cumulative number of failures against the time since the first failure for cement mill 4. The cumulative number of failures is simply an ordering of the approximately 400 failure events that have occurred for cement mill 4 in the historical records. The failure rate is visualized against a time scale representing the approximately 1,100 days over which the cement mill was monitored.

This type of plot is one method for assessing the presence of trend failure occurrence rate, either increasing or decreasing. In the literature, this plot is derived from an adaptation of the Nelson-Aelen estimate, but is equivalent to what is shown below. In terms of this plot, a straight diagonal line would indicate no trend. In this instance, apart from a large increase near 300 days since first failure, the line appears relatively straight, indicating no obvious trend in failure rate.

Although this plot suggests that the rate of failure occurrence for cement mill 4 has remained relatively constant, the visual representation emphasizes the rate of occurrence, making it difficult to compare the actual gap times, or interfailure durations. Comparing the actual observed gap times, as presented in Figure 7.19 provides a complementary assessment.

The plots in Figure 7.19 contain a slightly different representation of failure occurrence for each of the 4 cement mills in plant section 6. Similar to the plot in Figure 7.18, each failure event is identified by the order in which it occurred. However, in these plots, the time scale represents the interfailure duration, as opposed to the time since first failure. As the interfailure duration represents the duration of time in which an equipment was operational, or time between failures, this representation allows for easier comparison of equipment behavior.

In these plots, each point represents an interfailure duration; points on the far left indicate instances where the cement mill failed after a shorter duration, and points on the far right indicate instances when the cement mill was operational for a long time before failure occurred. A negative trend in interfailure duration, indicating possible degradation, would be represented by shrinking interfailure durations. As evident in the plots, neither cement mill appears to have a distinct negative trend. In fact, cement mill 5 experienced most of the exceptionally long interfailure durations near the end of the three-year historical record.

In addition to visual inspection, Lawless (2007) recommends using a parametric approach to formally test the presence of trend in failure occurrence. By employing a Cox model such that $$h_{i j}(w)=h_{0}(w) \exp \left(z_{i j} \beta\right)$$, with $$z_{i j}$$ representing the $$j$$th gap time of the $$i$$th cement mill, and a test for $$H_{0} : \beta=0$$. In this case, the null hypothesis represents no trend, and alternative hypothesis represents a non-zero trend. When performed individually for each cement mill, the Cox model reported no trend, supporting the conclusion based on visual inspection.

Alternatively, Table 7.15 contains respective Pearson correlation coefficients of the interfailure durations and ordered failure numbers for each cement mill. The small values further support the conclusion of no trend, allowing for the usage of standard modelling procedures, with the assumption that failure events occur independently. As no trend was detected, the analysis will continue with traditional survival analysis techniques.
Table 7.15: Pearson correlations coefficients between interfailure duration and ordered failure number
Cement Mills
1 3 4 5
0.1640296 0.1398092 0.0516574 0.256835

#### 7.4.1.2 Non-parametric reliability estimation

Figure 7.20 shows a comparison of the Kaplan-Meier(KM) survival curves for each of the four cement mills. Each line represents the decreasing probability of a respective cement mill being functional at a particular point in time after having been operational for the specified number of days. The dashed lines represent a comparison of the median survival times for each of the cement mills. The median survival time can be interpreted as the time at which half of the previous failures have occurred.

The KM procedure provides a non-parametric estimation of the survival function $$\hat{S}(t)$$ for each cement mill using the observed interfailure durations.

Another facet of reliability analysis is the concept of the hazard function, which represents the instantaneous hazard rate. This hazard rate is interpreted as the probability of failure by a future time point, given that the equipment has been operational until the present time(Zacks 2012). Although not provided here, it is possible to derive the hazard function $$h(t)$$ from the survival function $$\hat{S}(t)$$, and the PDF $$f(t)$$, and vice versa.

$$$h(t)=\frac{f(t)}{S(t)} \quad \quad S(t)=\exp \left\{-\int_{0}^{t}h(x) d x\right\}$$$

Figure 7.20 also contains a plot of the Cumlative Hazard Function $$H(t)$$ for each of the respective cement mills. The cumulative hazard for time $$t$$ is the integral of the hazard function $$h(t)$$ from 0 to $$t$$, and can be interpreted as the number of failures one would expect by time $$t$$, if the failure process were repeatable(Cleves et al. 2010).

Figure 7.20 shows that median survival time is largest for cement mill 3(close to 5 days), and second largest for cement mill 5(close to 2.5 days). The median survival times for cement mills 1 and 4 appear approximately equal from the plotted KM estimates. Additionally, cement mill 3 appears to have the highest survival probability for most time durations.

In addition to visual comparison, the KM estimates for each cement mill may be compared using the log-rank test, which will identify whether the cement mills have differing reliability estimates. A p-value of 0 for the $$\chi^{2}_{3}$$ estimate of the log-rank test suggests that the survival functions of the four cement mills are not identical. Based on this result, we will incorporate the cement mill ID as a categorical covariate in the forthcoming parametric models.

The plots in Figure 7.21 shows a comparison of the KM survival estimate for each cement mill against fitted Weibull, Log-normal, Gamma, Exponential, Log-logistic, and Gompertz distributions using the flexsurvreg function from the flexsurv package in R. As shown in the plots, the regression model using a Weibull distribution provides a reasonable approximation of the KM estimate for each of the cement mills.

Furthermore, Table 7.16 contains a comparison of the Akike’s Information Criterion(AIC) from each of the respective fitted models. The smallest AIC value of 2681.79 supports the notion that a Weibull model provides the closest approximation of the survival function for each cement mill. As such, the Weibull model will be used for the basis of a parametric model of the reliability characteristics of each cement mill.

Table 7.16: AIC comparison of model distribution options
Weibull Log-normal Gamma Exponential Log-logistic Gompertz
2681.794 2712.833 2731.35 3468.762 2740.371 3211.258

#### 7.4.1.3 Parametric reliability estimation

Based on the conclusions that the reliability of the four cement mills is not equivalent but is well approximated by a Weibull distribution, parametric models can be estimated to model the behavior of each cement mill. The Weibull function is extremely popular for modelling the hazard function for reliability analysis, and has unique implications for Accelerated Failure Time(AFT) and Proportional Hazards(PH) estimation(Kleinbaum, Klein, and Samet 2006).

As outlined in Kleinbaum, Klein, and Samet (2006), the AFT assumption implies that the effect of covariates is multiplicative(proportional) with respect to the survival time $$S(t)$$, whereas the PH assumption implies that the effect of covariates is multiplicative(proportional) with respect to the hazard function $$h(t)$$. When the process is well represented by a Weibull function, each assumption implies the other, however, the distinction between the two formulations determines the interpretation of the estimated effect of the covariates.

The appropriateness of a Weibull fit, as well as the AFT and PH assumptions, can be assessed graphically by plotting the relationship between $$log(-log[\hat{S}_{c}(t_{i})])$$ and $$log(t_{i})$$, where $$\hat{S}_{c}(t_{i})$$ is the Kaplan-Meier survival estimate for each cement mill $$c$$. Figure 7.22 contains a plot of the complementary log-log transformation for each of the respective cement mills, along with a fitted least squares line to assess linearity, which would suggest that a Weibull hazard function would be appropriate for modelling.

If each of the relationships appear linear, then parallel lines would imply that both PH and AFT assumptions appear valid, given the properties of the Weibull hazard function. In this case, the lines largely appear straight, supporting the use of a Weibull distribution, but clearly intersect indicating that both the PH and AFT assumptions may not be valid.

Assuming that the reliability of each cement mill is not equivalent, but each can be adequately represented by a Weibull function, we can begin to estimate respective model parameters. Using the survreg function from the Survival package, we can specify a Weibull distribution and obtain parameter estimates using maximum likelihood estimation. Figure 7.23 shows a comparison of the output of an estimated parametric regression model against the KM survival estimate for each cement mill.

The fitted Weibull model uses the parameterization of $$\alpha$$ and $$\mu$$ as the shape and scale terms respectively. The shape parameter is held constant, while the cement mill identifier is used as a categorical covariate with a multiplicative effect on the scale parameter. In this parameterization, cement mill 1 is used as the baseline, with covariates $$x_{1}$$, $$x_{2}$$, and $$x_{3}$$ representing dummy variables for cement mills 3, 4, and 5.

$$$S(t)=exp(-(\frac{t}{\mu})^{\alpha}), \;\;\;\;\;log(\mu)=\beta_{0}+\beta_{1}x_{1}+\beta_{2}x_{2}+\beta_{3}x_{3}$$$

Table 7.17 contains the estimates of the intercept $$\beta_{0}$$ and $$\beta_{1}$$, $$\beta_{2}$$, and $$\beta_{3}$$ coefficients. Individually, each coefficient represents the logarithm of the ratio of survival times of each mill with the baseline mill(mill 1). As such, positive coefficients such as $$\beta_{1}$$ and $$\beta_{3}$$ represent longer survival times for mills 3 and 5, compared with mill 1. Conversely, the negative coefficient for $$\beta_{2}$$ suggests shorter survival times for mill 4.

Table 7.17: Beta coefficient estimates
Coefficients
$$\beta_{0}$$ $$\beta_{1}$$ $$\beta_{2}$$ $$\beta_{3}$$
1.149169 1.020933 -0.8008054 0.3558786

Table 7.18 contains the respective shape and scale parameters for each of the cement mills derived from the estimated coefficients. As the cement mill ID is included in the model as a categorical covariate, only one model is being estimated, and thus the shape parameter $$\alpha$$ remains constant. The dummy variables for each cement mill have a linear effect on the the log of the scale parameter $$\mu$$, with each respective value reported in the table.

Table 7.18: Parameter estimates for each cement mill
Cement Mills
Parameter 1 3 4 5
Shape: $$\alpha$$ 0.4995278 0.4995278 0.4995278 0.4995278
Scale: $$\mu$$ 3.1555706 8.7591758 1.4167478 4.5043695

### 7.4.2 Basic fan reliability

The analysis of fan reliability will follow a similar procedure as for the cement mills in plant section 6, as discussed in the previous section. However, since the plant contains a large number of fans operating throughout different sections of the plant, this analysis will use two subsets of data. As identified in the descriptive analysis of the vibration dataset, the first and more general set of equipment consists of 10 fans for which complete records of stoppage events, vibration measurements, and plant subsection production rate, are recorded. The second dataset is a subset of the first, consisting of only the fans from plant section 5(05FN01, 05FN02, 05FN31, and 05FN31), which will be used for parametric estimation. For each respective part of the analysis, the dataset being used will be noted.

#### 7.4.2.1 Non-parametric reliability estimation

Figure 7.24 provides the respective KM estimates of the reliability function $$\hat{S}(t)$$ and cumulative hazard function $$\hat{H}(t)$$ for each of the fans. Each plot has been restricted to show only the first 30 days days of each interfailure time in order to better visualize differences in the KM estimate at earlier durations. Although not visually depicted in the figures, the interfailure durations of the fans are much longer than those previously observed for the cement mills. As the cement mills are large equipment providing the majority of the core function of processing raw materials, and the fans provide supportive functionality through cooling, this difference in interfailure durations is intuitive.

#### 7.4.2.2 Trend assessment

In following with the methodology of the analysis, the subset of fans in plant section 5 was assessed in order to evaluate the possibility of trend in the interfailure durations. Figure 7.25 contains a plot of all failures for each fan, according to their ordered failure number and interfailure duration.

Similar to the conclusion drawn from the cement mills, there appears to be no indication of a trend in the interfailure durations for any fan in section 5. Furthermore, estimating a Cox model for each fan using the ordered failure number as the only covariate, as suggested by Lawless (2007), further indicated no presence of trend. Furthermore, the small values for the Pearson correlation coefficients provided in Table 7.19 further support this conclusion.

In addition to providing evidence against the presence of trend, the plots in Figure 7.25 provide further context regarding the history of failures for each fan. Based on the highest ordered failure numbers, it is clear than fans 31 and 33 experience far more failure events than fans 1 and 2.

Table 7.19: Pearson correlations coefficients between interfailure duration and ordered failure number
Fan
1 2 31 33
0.0369481 0.0333637 -0.0033797 -0.0394446

#### 7.4.2.3 Parametric reliability estimation

In further analysis with the subset of fans from plant section 5, the plots in Figure 7.26 show a comparison of the KM survival estimate for each fan against fitted Weibull, Log-normal, Gamma, Exponential, Log-logistic, and Gompertz distributions using the flexsurvreg function from the flexsurv package in R. As shown in the plots, the regression model using a Weibull distribution provides a reasonable approximation of the KM estimate for for both fans 31 and 33.

In contrast, for fans 1 and 2, the Weibull model appears to underestimate reliability during some short time durations, as compared to the KM estimate. Despite the poor fit of the Weibull model for short durations, it still provides a visually similar approximation for longer interfailure durations. Furthermore, the unique behavior of the KM estimate for fan 2 may be representative of the small number of failure occurrences, as represented in Figure 7.25.

Table 7.20 contains a comparison of the Akike’s Information Criterion(AIC) from each of the respective fitted models. The smallest AIC value of 1256.74 supports the notion that a Weibull model provides the closest approximation of the survival function for each fan.

Table 7.20: AIC comparison of model distribution options
Weibull Log-normal Gamma Exponential Log-logistic Gompertz
1256.738 1278.83 1258.585 1473.096 1280.537 1425.85

Following from the conclusions that the reliability of the four fans in section 5 is not equivalent but is well approximated by a Weibull distribution, parametric models can be estimated to model the behavior of respective fan.

As performed in the analysis of the cement mill data, the assumptions of a well fitting Weibull distribution, as well as proportional hazard will be assessed via the complementary log-log transformation. Figure 7.27 contains a plot of the relationship between $$log(-log[\hat{S}_{c}(t_{i})])$$ and $$log(t_{i})$$, along wit ha least squares line, for each of the fans in section 5.

As evident by the plot, linearity of the points suggests that the Weibull distribution is a sufficient candidate for modelling the reliability of each fan. Furthermore, despite the respective lines for each fan being nearly parallel, fan 2 intersects fans 1 and 33, indicating that the proportional hazards assumption is not valid. This plot further illustrates the unique behavior of fan 2, which may be a consequence of the fewer number of observed failures.

Assuming that the reliability of each fan is not equivalent, but each can be adequately represented by a Weibull function, we can estimate respective model parameters using maximum likelihood estimation. Figure 7.28 shows a comparison of the output of an estimated parametric regression model against the KM survival estimate for each fan. As seen in this plot, the discrepancies between the Weibull model and the KM estimate for fans 1 and 2 are quite pronounced for interfailure durations of less than 30 days. As the Weibull appears to closely approximate the KM estimate for the remaining two fans, this may indicate that fans 1 and 2 have experienced relatively few failures after short durations. Given the close approximation for fans 31 and 33, it supports the continued use of the Weibull distribution.

Similar to the cement mill analysis, the fitted Weibull model uses the parameterization of $$\alpha$$ and $$\mu$$ as the shape and scale terms respectively. The shape parameter is held constant, while the fan identifier is used as a categorical covariate with a multiplicative effect on the scale parameter. In this parameterization, fan 31 is used as the baseline, with covariates $$x_{1}$$, $$x_{2}$$, and $$x_{3}$$ representing dummy variables for fans 1, 2, and 33. Fan 31 is chosen as the baseline as it has experienced the highest number of failure events, which is more likely to provide an example of realistic fan performance.

$$$S(t)=exp(-(\frac{t}{\mu})^{\alpha}), \;\;\;\;\;log(\mu)=\beta_{0}+\beta_{1}x_{1}+\beta_{2}x_{2}+\beta_{3}x_{3}$$$

Table 7.21 contains the estimates of the intercept $$\beta_{0}$$ and $$\beta_{1}$$, $$\beta_{2}$$, and $$\beta_{3}$$ coefficients. Individually, each coefficient represents the logarithm of the ratio of survival times of each fan with the baseline fan(fan 1). As such, positive coefficients such as $$\beta_{1}$$ and $$\beta_{2}$$ represent longer survival times for fans 1 and 2, compared to fan 31. Conversely, the negative coefficient for $$\beta_{3}$$ suggests shorter survival times for fan 22.

Table 7.21: Beta coefficient estimates
Coefficients
$$\beta_{0}$$ $$\beta_{1}$$ $$\beta_{2}$$ $$\beta_{3}$$
1.699014 1.290246 1.980185 0.493133

Table 7.18 contains the respective shape and scale parameters for each of the fans derived from the estimated coefficients. As the fan ID is included in the model as a categorical covariate, only one model is being estimated, and thus the shape parameter $$\alpha$$ remains constant. The dummy variables for each fan have a linear effect on the the log of the scale parameter $$\mu$$, with each respective value reported in the table.

Table 7.22: Parameter estimates for each fan
Fans
Parameter 1 2 31 33
Shape: $$\alpha$$ 0.4831104 0.4831104 0.4831104 0.4831104
Scale: $$\mu$$ 5.4685521 19.8709716 39.6146630 8.9544166

## 7.5 Integrated predictive reliability models

In addition to basic models for estimating the reliability of each cement mill without covariates, there is an array of additional methods capable of accommodating all of the previously extracted information.

### 7.5.1 Integrated models for cement mills

Table 7.23 contains the frequency of relative production output according to quantile for each cement mill. As shown in the table, it appears that all cement mills do not experience the same relative production load. For example, cement mill 1 experiences only 1st(0-36%) and 2nd(36-46%) quantile production loading, with cement mill 3 experiencing only 1st quantile production loading. Conversely, cement mill 4 experiences loading across the entire range, while mill 5 experiences only high loading(Q3:46-53%, Q4:53-67%). Based on this information, it may be valuable to account for production rate when modelling reliability.

Table 7.23: Count of interfailure durations by production rate
Equipment Production Rate Percentage
06CM41 Q1 8
06CM41 Q2 37
06CM41 Q3 31
06CM41 Q4 24
06CM11 Q1 75
06CM11 Q2 25
06CM31 Q1 100
06CM51 Q2 23
06CM51 Q3 31
06CM51 Q4 46
a Q1:0-36%, Q2:36-46%, Q3:46-53%, Q4:53-67%

#### 7.5.1.1 Stratified extended Cox models for cement mills

As explained in Therneau (2000) an extension of the Cox model allows for stratification, such that the observations are divided into disjoint strata. Each strata has its own baseline hazard function, but common coefficient values for the coefficient vector $$\beta$$. Thus, the hazard function for interfailure duration $$i$$ in stratum $$k$$ has the form $$h_{k}(t)e^{X_{i}\beta}$$. Stratification is useful because it allows for adjustment of confounding covariates, or covariates which do not satisfy the proportional hazards assumptions, such as time-dependent covariates. An unfortunate aspect of stratification in the extended Cox model is that as the baseline hazard function is not estimated, the effect or importance of the strata is not estimated(Therneau 2000).

Furthermore, fitted Cox models may include the interaction between strata and covariates, which identifies whether the effect of covariates differs by strata. Including each covariate by strata interaction is equivalent to modeling each strata separately(Therneau 2000).

An extended Cox model including both time-independent and time-dependent covariates takes the form: $$$\begin{array}{r}{h_{k}(t, \mathbf{X}(t))=h_{k}(t) \exp \left[\sum_{i=1}^{p_{1}} \beta_{i} X_{i}\right.} {+\sum_{j=1}^{p_{2}} \delta_{j} X_{j}(t) ]}\end{array}$$$ With $$X_{1}, X_{2}, \ldots X_{p_{1}}$$ the time-independent covariates and $$X_{1}(t), X_{2}(t), \ldots X_{p_{2}}(t)$$ the time-dependent covariates of interest(Kleinbaum, Klein, and Samet 2006). However, when the extended Cox model includes time-dependent covariates, it may no longer satisfy the proportional hazards assumption, as both the baseline hazard and the covariates depend upon time. The proportional hazards assumption will be verified for all covariates.

Table ?? contains a comparison of extended Cox model fits for the available cement mill data using the “coxph” function in the Survival package in R. Model 1 is the full model, containing all covariates, as well as all covariate by strata interactions. Model 2 contains all main effects, while Model 3 contains only the significant main effects of Production Replace.

As the Replace covariate indicates whether the previous maintenance action involved replacement or not(0:no, 1:yes), the baseline value is set to 0. Thus, the model is estimating the effect of replacement when compared to non-replacement. Additionally, as Model 1 includes interaction terms with each cement mill, cement mill 4 was chosen as the baseline, as it is more operational, and at the widest range of production capacity.

In the case of this analysis the model is stratified meaning that the estimated coefficients for the main effects represents an effect that is present in all cement mills. Including an interaction between a maint effect and a strata, seeks to identify whether the the effect of a variable is different for a particular cement mill.

Despite Model 1 indicating a significant interaction between Production and CM1, Model 3 results in a lower AIC without including any interactions. Furthermore, as cement mill 4 is the baseline, this interaction implies that the effect of production is not the same for cement mill 1. This notion is expected, because as shown in Table 7.23, cement mill 1 spends 75% of operating time at Q1 production levels and 25% at Q2 production levels, having never operated at higher capacity rates. As such, the interaction term will be excluded from consideration in favor of Model 3.

The use of the proportional hazards framework assumes that the effect of covariates is constant over time. PH models that include time-dependent covariates imply a restriction on these covariates such that $$\delta(t)=\delta$$(Therneau 2000). This means that although the value of the time-dependent covariates change over time, the estimated coefficient must not. Thus, if the estimated coefficient is constant over time, the time-dependent covariate satisfies the proportional hazards assumption.

In order to evaluate the PH assumption, Therneau (2000) recommend the both a statistical test of the Schoenfield residuals and visual inspection of the residual plots. The cox.zph function from the Survival package, which tests for an interaction between a covariate and time, was used to formally test the proportional hazards assumption for both models 2 and 3. When testing for a change in the estimates over log(time), the results for model 2 and 3 indicate that all covariates except for Production satisfy the proportional hazards assumption. However, when performing the same test using the real time scale, Production is now proportional under model 3.

As the proportionality assumption refers to whether the estimated coefficient for Production changes over time, this behavior can be assessed and interpreted using plots of the Schoenfield residuals against time. Figure 7.29 contains the Schoenfield residual plot for the estimate of Production under model 3 on both real-time(left plot) and log(time)(right plot) scales. The blue lines represent the estimated coefficient for Production under model 3, the circles represent the Schoenfield residuals, and the solid line represents a LOWESS(locally weighted scatterplot smoothing) fit of the estimated coefficient. Under proportionality of Production the LOWESS fitted line would be horizontal, representing a constant value for the estimated coefficient over time.

Despite a significant test result, neither plot indicates a clear trend in the Schoenfield residuals. The left plot shows that the LOWESS fit is severely effected by the observed failures that occur between the 50-100 day range where observations are relatively sparse. In the right plot, the LOWESS fit demonstrates an upward trend near the larger survival times, where observations are more sparse. In both plots, the departure of the fitted line from the estimated coefficient value(blue line) is not well supported by the observed residuals, which do not exhibit an extreme trend. Based on analysis of the residual plots, the results of the statistical test will be ignored, and the proportional hazards assumptions will be considered satisfied.

For Model 3, the estimated coefficients for $$\beta_{Replace}$$ and $$\delta_{Production}$$ are provided, along with respective standard errors in parenthesis. An estimate of 0.0121 for $$\delta_{Production}$$, yields a 95% confidence interval of [1.004, 1.021], representing the respective hazard ratio of a 1 unit increase in production rate(measured in percent). This confidence interval indicates that the hazard, or instantaneous risk of failure, increases between 0.4% and 2.5% following a 1% increase in production. This estimate aligns with intuition regarding maintenance and reliability, in that an increase in machine utilization relative to maximum capacity would negatively effect reliability due to increased wear and tear.

Additionally, an estimate of -0.328 for $$\beta_{Replace}$$ yields a 95% confidence interval of [0.521, 0.996] for the respective hazard ratio comparing interfailure durations following “replacement” maintenance actions versus “non-replacement” maintenance actions. In other words, this confidence interval suggests that replacing a failed component yields between a .004% and 47.9% decrease in the instantaneous risk of failure as compared to a repair. This estimate is also intuitive as a replacement action would ideally have a renewal effect, thereby improving reliability immediately after.

 Models (1) (2) (3) Production 0.012** (0.005) 0.012*** (0.004) 0.012*** (0.004) Replace -0.187 (0.292) -0.376** (0.185) -0.328** (0.175) Maint. -0.146 (0.131) -0.114 (0.094) Numstop 0.010 (0.012) 0.012 (0.009) Repairtime 0.027 (0.375) 0.008 (0.023) Production*CM1 -0.00005 (0.0005) Production*CM3 -0.095 (0.450) Production*CM5 (0.000) Replace*CM1 -0.528 (1.060) Replace*CM3 (0.000) Replace*CM5 0.612 (0.507) Maint.*CM1 (0.000) Maint.*CM3 0.330 (0.299) Maint.*CM5 -0.067 (0.353) Numstop*CM1 -0.109 (0.236) Numstop*CM3 -0.007 (0.025) Numstop*CM5 0.060 (0.046) Rep.time*CM1 -0.002 (0.029) Rep.time*CM3 -0.027 (0.375) Rep.time*CM5 -0.001 (0.389) reptime:equipment06CM51 -0.752 (1.100) AIC 6085.17 6067.33 6064.63 Observations 3,001 3,001 3,001 Log Likelihood -3,024.587 -3,028.664 -3,030.313 Wald Test 25.970 (df = 18) 15.480*** (df = 5) 13.310*** (df = 2) LR Test 23.803 (df = 18) 15.648*** (df = 5) 12.350*** (df = 2) Score (Logrank) Test 22.593 (df = 18) 15.389*** (df = 5) 11.956*** (df = 2) Note: p<0.1; p<0.05; p<0.01

#### 7.5.1.2 Parametric model estimation for cement mills

Although the extended Cox model provided insight into the effects of production rate and replacement on the hazard ratio, nothing is known about the actual baseline hazard function of the cement mills. As such, a fully parametric model is used to provide a complete solution for modelling the reliability of each cement mill.

Table ?? contains a comparison of fully parametric AFT models estimated using the “aftreg” function from the eha package in R, using a Weibull baseline function. As the results of the extended Cox procedure ruled out interaction effects, Model 1 includes only main effects, of which Production, Replace, and Maint. are significant. Model 2 is a reduction of Model 1, eliminating Repairtime and Numstop as insignificant terms. In order to fully parametrize the model, scale and shape parameters $$\mu_{k}$$ $$\alpha_{k}$$ are estimated for each of the strata($$k$$), with the estimated coefficients being common between strata.

In contrast to the extended Cox model, the estimated coefficients of an AFT model represent logarithms of ratios of survival times. In other words, positive coefficients indicate longer survival times, and negative coefficients indicate decreased survival times.

In terms of the estimates for Model 2, $$exp(\delta_{Production})=0.977$$ indicates that a one unit increase in production rate corresponds to a .977(95%CI[.962, .993]) ratio of relative survival times, or a 2.3% reduction in survival time. Additionally, $$exp(\beta_{Replace})=1.848$$ which suggests that survival durations immediately following replacement actions are a 1.848(95%CI[1.73, 1.97]) times longer than those following non-replacement actions. Finally, $$exp(\beta_{Maint.})=1.291$$, which suggests that a single additional planned maintenance event results in survival times that are 1.291(95%CI[1.107, 1.506]) times longer.

All three of these main effects estimates are intuitive, and align with the inference drawn from the extended Cox model estimation. In summary, an increase in production rate corresponds to decreased equipment reliability, while performing replacement actions, and increasing the number of planned maintenance interventions both significantly increase the reliability of an equipment.

 Models (1) (2) Production -0.023*** (0.008) -0.023*** (0.008) Replace 0.601* (0.331) 0.614* (0.318) Repairtime -0.009 (0.048) Maint. 0.219** (0.089) 0.256*** (0.077) Numstop 0.010 (0.013) log(scale):CM4 1.400*** (0.388) 1.420*** (0.387) log(shape):CM4 -0.635*** (0.040) -0.639*** (0.040) log(scale):CM1 1.390*** (0.290) 1.405*** (0.286) log(shape):CM1 -0.745*** (0.078) -0.756*** (0.076) log(scale):CM3 2.178*** (0.356) 2.196*** (0.355) log(shape):CM3 -0.860*** (0.109) -0.861*** (0.109) log(scale):CM5 2.587*** (0.455) 2.604*** (0.455) log(shape):CM5 -0.441*** (0.074) -0.439*** (0.074) Observations 3,001 3,001 Log Likelihood -1,314.651 -1,315.007 Note: p<0.1; p<0.05; p<0.01

As a fully parametric model with scale and shape $$\mu_{k}$$ and $$\alpha_{k}$$ estimates for each strata $$k$$, these estimates can be used to model the effect of covariates on the individual reliability of each cement mill. Using the reported estimates from Model 2, where $$\hat{S}_{k}(t)$$ represents the reliability function for each cement mill $$k$$, and log(scale) in Table ?? corresponds to $$\beta_{k}$$, the AFT model has the following parameterization:

\begin{align} &\hat{S}_{k}(t)=exp(-(\frac{t}{\mu_{k}})^{\alpha_{k}}),\notag\\ &\;\;\;\;\;log(\mu_{k})=\beta_{k}+\beta_{Replace}Replace+\beta_{Maint.}Maint.+\delta_{Production}Production \end{align}

Figure 7.30 contains a comparison of the parametric AFT fit from Model 2 for each of the cement mills at differing production levels(20%, 40%, and 60%). As a follow-up from the earlier interpretation of the effect of production rate on interfailure time, these plots visually express how an increase in production rate results in a decrease in the reliability, for each cement mill. Such a fully parametric solution may be especially useful for capacity planning or when planning future maintenance interventions, as it allows for simulation of the reliability an single cement mill. Moreover, parametric models could facilitate the use of real-time reliability monitoring as production demands and environmental factors change.

In addition to calculating the reliability for each cement fill, conditioned on certain covariate values, a parametric model allows for easy calculation of the median time until failure as well as the mean time before failure(MTBF), conditional on covariates. Given the same parameterization as used for Model 2, where the time until failure $$T \sim Weibull(exp (X' \hat{\beta}),\hat{\alpha})$$, the MTBF is the expected value of $$T$$ such that:

$$$MTBF=\hat{E}(t_{i})= \int_{0}^{\inf}\hat{S}(t)dt=\exp (X' \hat{\beta}) \Gamma(1+\hat{\sigma})$$$ where $$\hat{\sigma}=\frac{1}{\hat{\alpha}}$$,and $$\Gamma$$ represents the Gamma function(Liu and Lim 2018).

Furthermore, the median time until failure conditional on covariates $$X$$ is: $$$\operatorname{median}(t_{i})=\exp (X' \hat{\beta})(\log (2))^{\hat{\sigma}}$$$

Following from the procedures demonstrated by Liu and Lim (2018), it is possible to estimate confidence intervals for the MTBF using the Delta method to derive the standard error of the MTTF. The standard error can be estimated as follows: $$$S E=\left\{\left( \begin{array}{c}{\frac{\partial E\left(\hat{t}_{i}\right)}{\partial \hat{\boldsymbol{\beta}}}} \\ {\frac{\partial E\left(t_{i}\right)}{\partial \hat{\sigma}}}\end{array}\right)^{t} \Sigma_{\hat{\sigma} \hat{\boldsymbol{\beta}}} \left( \begin{array}{c}{\frac{\partial E\left(\hat{t}_{i}\right)}{\partial \hat{\boldsymbol{\beta}}}} \\ {\frac{\partial E\left(t_{i}\right)}{\partial \tilde{\sigma}}}\end{array}\right)\right\}^{\frac{1}{2}}$$$ with $$\sum_{\hat{\sigma}} \hat{\boldsymbol{\beta}}$$, the covariance matrix of both $$\hat{\beta}$$ and $$\hat{\sigma}$$.

For example, the MTBF of cement mill 4, conditional on 50% production, no replacement, with 2 maintenance interventions can be derived as follows: \begin{align} \exp (X' \hat{\beta}) \Gamma(1+\hat{\sigma})&=exp(\beta_{k}+\beta_{Replace}Replace+\beta_{Maint.}Maint.+\delta_{Production}Production)(1+\frac{1}{\hat{\alpha}})\notag\\ &=exp(\beta_{k}+\beta_{Replace}(0)+\beta_{Maint.}(2)+\delta_{Production}(50))(1+\frac{1}{\hat{\alpha}})\notag\\ &=exp(1.42+(0.614\cdot0)+(0.256 \cdot 2)+(-0.023 \cdot 50))(1+\frac{1}{0.528})\notag\\ &\approx 3.97 days \end{align}

Functions from the ciTools package in R were adapted in order to estimate the standard error of the MTBF using the AFT parameter estimates from Model 2. Figure 7.31 contains a comparison of the MTBF for each cement mill, conditional on a range of covariate values, against the observed failures. Specifically, the points on each plot indicate observed cement mill failures according to respective relative production rate and failure time. The shape of the point indicates whether the previous maintenance action involved a replacement or not. The size of the point indicates the cumulative number of maintenance interventions(preventative, planned, etc.,) that were carried out during the gap time. The solid line and red ribbon represent the MTBF and respective 95% CI conditional on no replacement, two maintenance interventions, and indicated production level. The dashed line and blue ribbon represent the respective MTBF and 95% CI conditional on replacement with remaining covariates identical.

As evident in the plots, most of the failures above the confidence intervals represent gap times immediately following replacement(triangles), gap times in which multiple maintenance interventions occurred(large circles), or both(large triangles or large circles). This visual representation supports the earlier conclusions regarding the affect of each covariate on reliability, as well as the insights into the difference in behavior between cement mills.

### 7.5.2 Integrated models for fans

In addition to performing a basic parametric model estimation, the large quantity of available covariates for the interfailure duration of each fan make estimation of more advanced models possible. Similar to the cement mill analysis, an extended Cox model will be used to estimate a semi-parametric reliability model, given the large number of time-dependent covariates. Additionally, in order to facilitate improved estimation, the full set of fan data will be used to fit the extended Cox models.

As there is a large number of covariates, an extended Cox model will be estimated using a subset of the covariates, namely Production, Replace, Repairtime, and Numstop, in addition to interactions between the covariates and strata. The aim of the first estimation will be to identify which, if any, of these covariates are significant, along with possible interactions. The results of this estimation will be carried over into an estimation using the condition monitoring covariates, namely the vibration measurements, The results of the second estimation will provide a basis for later fully-parametric modelling.

#### 7.5.2.1 Stratified extended Cox models for fans

Table ?? contains a comparison of extended Cox model fits for the available fan data using the “coxph” function in the Survival package in R. Model 1 is the full model, containing all covariates, including all covariate by strata interactions. Model 2 is a reduction of Model 1, containing the same covariates with the exception of the insignificant Replace by strata interaction term. Model 3 contains only the main effects of Production, Replace, Repairtime, and Numstop. Although Repairtime is no longer significant in Model 3, it was retained as it was significant in the additional models. Statistical tests of the Schoenfield residuals for each model indicate that all covariates satisfy the proportional hazards assumption.

As the Replace covariate indicates whether the previous maintenance action involved replacement or not(0:no, 1:yes), the baseline value is set to 0. Thus, the model is estimating the effect of replacement when compared to non-replacement. Additionally, as Models 1 and 2 include interaction terms with each strata, section 5 fan 31 was chosen as the baseline, as it is had experienced a large number of failure events.

In the case of this analysis the model is stratified meaning that the estimated coefficients for the main effects represents an effect that is present in all fans. Including an interaction between a maint effect and a strata, seeks to identify whether the the effect of a variable is different for a particular fan.

As evident in the table, Models 2 and 3 have similar AIC values, despite Model 3 containing far fewer covariate values. Given that the model parameters are being estimated using data from 10 fans operating in different areas of the plant, it is not surprising that some of the interaction terms appear to be significant.

For Model 3, the estimated coefficients for $$\delta_{Production}$$, $$\beta_{Replace}$$, $$\beta_{Repairtime}$$, and $$\beta_{Numstop}$$,are provided, along with respective standard errors in parenthesis. An estimate of -0.0183 for $$\delta_{Production}$$, yields a 95% confidence interval of [0.967, 0.997], representing the respective hazard ratio of a 1 unit increase in production rate(measured in percent). This confidence interval indicates that a 1% increase in production rate corresponds to between a 0.3% and 3.3% reduction in the instantaneous risk of failure. Unlike the cement mill analysis, an increase in the the production rate is estimated to reduce the risk of failure for fans. Since the fans operate at high speeds, there will be more variability in fan speed during light production, when cooling requirements fluctuate. As the production and need for consistent cooling increases, the fans operate with less variability, which may reduce the propensity for wear resulting from speed variations.

Additionally, an estimate of -0.586 for $$\beta_{Replace}$$ yields a 95% confidence interval of [0.355, 0.873] for the respective hazard ratio comparing interfailure durations following “replacement” maintenance actions versus “non-replacement” maintenance actions. Based on this confidence interval, performing a replacement action results in between a 12.5% and 64.5% reduction in hazard as compared to a repair. This result corroborates the result obtained from the cement mill analysis.

Furthermore, an estimate of 0.017 for $$\beta_{Repairtime}$$ yields a 95% confidence interval of [0.996, 1.038] for the respective hazard ratio resulting from a 1 unit(hour) increase in repair time(downtime). As this CI includes 1, Model 3 suggests that repairtime may not have an effect on the hazard function immediately after. Finally, an estimate of 0.174 for $$\delta_{Numstop}$$ yields a 95% confidence interval of [1.069, 1.324] for the respective hazard ratio following each additional non-failure stoppage event. This estimate suggests that each time the equipment is stopped for neither failure nor maintenance, the instantaneous risk of failure increases between 6.9% and 32.4%. This may be attributed to factors such as corrosion, start-stop wear, or load variations of the respective equipment during non-failure stoppages.

 Models (1) (2) (3) prod -0.025*** (0.008) -0.025*** (0.008) -0.018** (0.007) rep1 -0.483** (1.031) -0.418 (0.312) -0.586** (0.280) reptime -0.255* (0.178) -0.257** (0.176) 0.017 (0.013) numstop 0.736** (0.370) 0.736** (0.370) 0.174*** (0.084) cluster(id) -0.005 (0.003) -0.005 (0.003) rep0:equipment05FN02 (0.000) rep1:equipment05FN02 (0.000) rep0:equipment03FN08 (0.000) rep1:equipment03FN08 (0.000) rep0:equipment05FN01 -0.132 (1.119) rep1:equipment05FN01 (0.000) rep0:equipment05FN33 (0.000) rep1:equipment05FN33 (0.000) rep0:equipment06FN11 13.682*** (2,001.091) rep1:equipment06FN11 (0.000) rep0:equipment06FN41 0.023 (1.289) rep1:equipment06FN41 (0.000) rep0:equipment06FN51 (0.000) rep1:equipment06FN51 (0.000) rep0:equipment06FN52 (0.000) rep1:equipment06FN52 (0.000) rep0:equipment06FN53 -0.005 (1.230) rep1:equipment06FN53 (0.000) reptime:equipment05FN02 0.297** (0.179) 0.300** (0.177) reptime:equipment03FN08 -0.282 (0.451) -0.280 (0.450) reptime:equipment05FN01 0.270** (0.179) 0.274** (0.177) reptime:equipment05FN33 0.853* (0.425) 0.855* (0.424) reptime:equipment06FN11 (0.000) -8.680*** (1,263.847) reptime:equipment06FN41 0.334** (0.187) 0.337** (0.186) reptime:equipment06FN51 -1.308** (0.860) -1.306** (0.860) reptime:equipment06FN52 -1.391*** (1.598) -1.389*** (1.598) reptime:equipment06FN53 0.167 (0.195) 0.166 (0.190) numstop:equipment05FN02 -2.092*** (1.133) -2.093*** (1.133) numstop:equipment03FN08 (0.000) (0.000) numstop:equipment05FN01 -0.224 (0.470) -0.209 (0.466) numstop:equipment05FN33 -0.581* (0.383) -0.582* (0.383) numstop:equipment06FN11 (0.000) (0.000) numstop:equipment06FN41 -1.171*** (0.690) -1.189*** (0.676) numstop:equipment06FN51 (0.000) (0.000) numstop:equipment06FN52 (0.000) (0.000) numstop:equipment06FN53 -1.335** (0.789) -1.337** (0.790) AIC 1822.25 1816.3 1819.58 Observations 805 805 805 Log Likelihood -889.124 -889.148 -905.791 Wald Test 339.280*** (df = 22) 304.370*** (df = 19) 18.110*** (df = 4) LR Test 47.451*** (df = 22) 47.402*** (df = 19) 14.116*** (df = 4) Score (Logrank) Test 45.927*** (df = 22) 45.677*** (df = 19) 14.422*** (df = 4) Note: p<0.1; p<0.05; p<0.01

A final estimation using extended Cox models was performed to evaluate the usage of the different condition monitoring covariates, namely the original vibration measurements and the results from the two PCA procedures. Building upon the results from Model 3, Table ?? contains a comparison of an additional extended Cox model estimation with the inclusion of the vibration measurements. Model 4 includes the main effects from Model 3, in addition to the 9 available vibration measures. Models 5 and 6 both contain terms for Production, Repairtime, Numstop, and Replace, in addition to the principal component scores from their respective procedures, denoted as A and B.

As evident in the table, all but 3 of the terms included in Model 4 are significant, all are significant in Model 5, and in Model 6 all but the last three principal component scores are significant. Additionally, despite containing more covariates, Model 4 results in the lowest respective AIC value. This suggests that retaining the original vibration measurements instead of a dimension reduction may result in an improved model fit, in addition to having an easier interpretation. Furthermore, regardless of the condition monitoring covariates included, each model results in similar estimates for the first 4 coefficients.

 Models (4) (5) (6) Production -0.017** (0.007) -0.018** (0.007) -0.019** (0.007) Repairtime 0.019** (0.013) 0.019* (0.013) 0.017* (0.013) Numstop 0.185*** (0.082) 0.177*** (0.084) 0.180*** (0.083) Replace -0.576** (0.299) -0.575** (0.284) -0.561** (0.282) ADE -1.291*** (0.399) ANDE 0.504*** (0.331) HDE -0.367 (0.903) HDE.CAV 1.171 (1.080) HDE.ENV -1.569 (4.230) HNDE 0.546** (0.415) HNDE.ENV 4.294** (2.803) VDE 0.232*** (0.114) VNDE -1.065*** (0.511) PCA1 1.459* (0.877) PCA2 -3.052*** (1.402) PCA3 3.822* (2.289) PCA4 2.498** (1.461) PCB1 -0.331* (0.191) PCB2 0.205 (0.246) PCB3 -0.334 (0.308) PCB4 0.058 (0.215) AIC 1807.39 1822 1823.62 Observations 805 805 805 Log Likelihood -890.695 -903.001 -903.810 Wald Test 95.500*** (df = 13) 26.830*** (df = 8) 23.830*** (df = 8) LR Test 44.308*** (df = 13) 19.698** (df = 8) 18.079** (df = 8) Score (Logrank) Test 47.668*** (df = 13) 21.086*** (df = 8) 18.874** (df = 8) Note: p<0.1; p<0.05; p<0.01

#### 7.5.2.2 Parametric model estimation for fans

Although the extended Cox model provided insight into the effects of maintenance related condition monitoring covariates on the hazard ratio, nothing is known about the actual baseline hazard function of the fans. As such, we advance the analysis with a fully parametric model to provide a complete solution for modelling the reliability of each fan. Although the extended Cox models were estimated using data from all fans, the fully parametric model will be estimated using only fans from section 5.

Table ?? contains a comparison of fully parametric AFT models estimated using the “aftreg” function from the eha package in R, using a Weibull baseline function. Model 1 includes the 4 maintenance related covariates in addition to the original vibration measurements. Model 2 is a subset of Model 1, including the only significant covariates from Model 1. After estimating the reduced model, some of the significant terms from Model 1 are no longer significant in Model 2. Model 3 contains a further reduction, retaining only significant covariates from Model 2.

The estimated coefficients of an AFT model represent logarithms of ratios of survival times; positive coefficients indicate longer survival times, and negative coefficients indicate decreased survival times. In terms of the estimates for Model 3, $$exp(\beta_{Repairtime})=0.968$$ indicates that a one unit(hour) increase in repair time corresponds to a .968(95%CI[.950, .987]) ratio of relative survival times. In other words, each hour of repair time corresponds to between a 1.3% and 5% reduction in reliability immediately following the maintenance action. Given this result, a longer repair time may be indicative of a more severe failure, or a failure for which an appropriate maintenance action proves difficult. Moreover,longer repair time implies intensive maintenance actions, and the possibility of human error, which may reduce equipment reliability.

Additionally, $$exp(\delta_{ANDE})=0.644$$ which suggests that a one unit(mm/s) increase in ANDE(axial non-drive end) vibration corresponds to a 0.644(95%CI[.459, .904]) ratio of relative survival times. This confidence interval indicates that the survival decreases between 9.6% and 54.1% with a 1mm/s increase in ANDE vibration. Finally, $$exp(\delta_{HNDE.ENV})=0.001$$ which suggests that a one unit(gE) increase in HNDE.ENV(horizontal non-drive end enveloping) vibration corresponds to a 0.001(95%CI[.000001, .191]) ratio of relative survival times. Similarly, a 1gE increase in HNDE.ENV vibration corresponds to between a 99.99% and 80.9% decrease in survival time. Each of the estimated effects for fan vibrations are intuitive, given their purpose of monitoring the health condition of the equipment.

 Models (1) (2) (3) Production 0.011 (0.011) Repairtime -0.027*** (0.009) -0.032*** (0.011) -0.032*** (0.011) Numstop -0.086 (0.159) Replace 0.459 (0.552) ADE 0.573** (0.283) 0.327 (0.249) ANDE -0.509** (0.259) -0.481*** (0.160) -0.440** (0.173) HDE 0.323 (0.617) HDE.CAV -0.884 (0.662) HDE.ENV 7.266* (4.181) 3.705 (4.589) HNDE -0.012 (0.361) HNDE.ENV -8.212*** (3.061) -8.849*** (2.985) -7.493** (2.977) VDE -0.112 (0.235) VNDE -0.106 (0.435) log(scale):Fan31 3.441*** (1.221) 2.637*** (0.991) 3.864*** (0.864) log(shape):Fan31 -0.912*** (0.090) -0.910*** (0.090) -0.911*** (0.090) log(scale):Fan2 5.886*** (1.349) 5.573*** (0.978) 6.690*** (0.910) log(shape):Fan2 0.563** (0.230) 0.319 (0.200) 0.355* (0.204) log(scale):Fan1 4.903*** (1.392) 4.672*** (1.037) 5.920*** (0.917) log(shape):Fan1 -0.381** (0.155) -0.404*** (0.150) -0.409*** (0.152) log(scale):Fan33 4.206*** (1.207) 3.367*** (0.977) 4.595*** (0.848) log(shape):Fan33 -0.678*** (0.102) -0.680*** (0.101) -0.689*** (0.101) Observations 526 526 526 Log Likelihood -603.363 -607.256 -608.826 Note: p<0.1; p<0.05; p<0.01

As a fully parametric model with scale and shape $$\mu_{k}$$ and $$\alpha_{k}$$ estimates for each strata $$k$$, these estimates can be used to model the effect of covariates on the individual reliability of each cement mill. Using the reported estimates from Model 2, where $$\hat{S}_{k}(t)$$ represents the reliability function for each cement mill $$k$$, and log(scale) in Table ?? corresponds to $$\beta_{k}$$, the AFT model is represented in Equation (7.1).

\begin{align} &\hat{S}_{k}(t)=exp(-(\frac{t}{\mu_{k}})^{\alpha_{k}}),\notag\\ &\;\;\;\;\;log(\mu_{k})=\beta_{k}+\beta_{Repairtime}Repairtime+\delta_{ANDE}ANDE+\delta_{HNDE.ENV}HNDE.ENV \tag{7.1} \end{align}

Figure 7.32 contains a comparison of the parametric AFT fit from Model 3 for each of the fans at differing ANDE vibration levels(3mm/s, 4mm/s, and 5mm/s), with Repairtime and HNDE.ENV held constant at their mean values(3.96hrs and .102gE respectively).

From the earlier interpretation of the effect of vibration readings on interfailure time, these plots visually express how an increase in vibrations results in a decrease in the reliability, for each fan. The plot representing fan 2 further illustrates behavior that is different from the other fans. This is supported by the difference in the log(scale) and log(shape) parameter estimates for fan 2 as compared to the estimates for the remaining fans, as shown in Table ??.

As demonstrated in the cement mill analysis, it is possible to use the parameter estimates from Model 3 to derive the MTBF, conditional on covariates, for each of the respective fans. For example, the MTBF of fan 31, conditional on 4mm/s of ANDE vibration, a 4hr previous repairtime, and an HNDE.ENV measurement of .1gE can be derived using Equation (7.2).

\begin{align} \exp (X' \hat{\beta}) \Gamma(1+\hat{\sigma})&=exp(\beta_{k}+\beta_{Repairtime}Repairtime\notag \\ &+\delta_{ANDE}ANDE+\delta_{HNDE.ENV}HNDE.ENV)(1+\frac{1}{\hat{\alpha}})\notag \\ &=exp(\beta_{k}+\beta_{Repairtime}(4)+\delta_{ANDE}(4)+\delta_{HNDE.ENV}(.1))(1+\frac{1}{\hat{\alpha}})\notag\\ &=exp(3.86+(-.032 \cdot 4)+(-.440 \cdot 4)+(-7.493 \cdot .1))(1+\frac{1}{0.402})\notag\\ &\approx 11.17 days \tag{7.2} \end{align}

Functions from the ciTools package in R were adapted in order to estimate the standard error of the MTBF using the AFT parameter estimates from Model 3. Figure 7.33 contains a comparison of the MTBF for each fan, conditional on a range of covariate values, against the observed failures.

Specifically, the points on each plot indicate observed fan failures according to respective ANDE vibration value and failure time. The size of the point indicates the duration of the last repair(in hours) prior to the start of the gap time. The solid line and red ribbon represent the MTBF and respective 95% CI conditional on a 2hr repair time and indicated production level. The dashed line and blue ribbon represent the respective MTBF and 95% CI conditional on 10hr repair time with remaining covariates identical. For both confidence intervals, the HNDE.ENV value is held constant at its mean of .102gE.

As evident in the plots, the MTBF decreases substantially as the value of the ANDE vibrations increases. In contrast with the representation from the cement mill analysis, which included production level, number of maintenance interventions, and replacement status, these plots only depend upon values that are recorded as the result of a manufacturing process. However, this plot may also be helpful for guiding production decisions, as it provides insight into the consequences of operating equipment that is experiencing high vibration levels, or has undergone long repairs.

## 7.6 Classification models

In addition to modelling reliability using traditional survival analysis techniques, the wide array of covariates that have been extracted from the data sources allow for an equally wide selection of modelling tools. During this research, the role of covariates representing observed vibration measurements have been of key interest, as they, ideally, represent the current health status of the equipment. Such indicators of the condition of the equipment are especially valuable as they are virtually non-invasive, meaning that the equipment can be monitored without the need to stop and re-start the machine. In contrast, the majority of the extracted covariates are byproducts of the frequent stoppage events that the equipment experiences.

In theory, a maintenance engineer that is able to continuously(or intermittently) monitor the condition of equipment is able to use this insight for the purpose of either scheduling a maintenance intervention or identifying the optimal maintenance action for the next failure. Over time, this procedure of maintenance decision support would be reflected in the historical records of both the condition indicators and the respective past maintenance actions. The following section will explore a predictive modelling approach in an attempt to classify maintenance actions using historical covariates.

### 7.6.1 Predicting maintenance action from covariates

The aim of the first predictive model was to train artificial neural network(ANN) using the integrated fan dataset to predict the respective maintenance action conditional on respective covariates. As the fan dataset contains historical records pertaining to a specific equipment during a unique time between two failures, there are multiple rows representing the change in the time-dependent covariates leading up to each failure. In order to consistently provide a fixed number of inputs to the ANN, only the final observation in each gap time is used to train the network. This implies that the ANN is training using only the values observed immediately prior to failure, and not the history of the gap time.

In addition, 70% of the data is selected for use to train the model, with the remaining 30% used to evaluate the model performance. All models were trained using the neuralnet package, which allows the user to determine the number of neurons and hidden layers, the desired error function, the activation function, and the training algorithm, among other features.

As the larger fan dataset will be used for the predictive modelling, the 12 of the covariates used in extended Cox Model 4 in the previous section will be retained, in addition to a covariate representing the time of the imminent failure(Tstop). Although a model comparison is not provided, several ANNs were trained and compared using resilient backpropagation, sum of squared error(SSE) and cross-entropy(CE) error functions, in addition to logistic and hyperbolic-tangent activation functions. The results of this model were largely equivalent regardless of function choice, and the results of a network with a single hidden layer containing 8 neurons using the SSE and logistic functions is summarized by the confusion matrix in Table 7.25.

When evaluated with the test set, the accuracy of the network was 9.6%, as reflected in the confusion matrix. Such a poor classification results is not entirely surprising given the non-uniform distribution of maintenance actions, as illustrated previously in Table 7.6. As a result, there are relatively few samples in some categories for the purpose of training, and even fewer in the smaller testing set. The results of this model are summarized in the confusion matrix presented in Table 7.24, which also reports the sensitivity and specificity for each category of maintenance action.

Table 7.24: Confusion matrix ANN predicting category of future maintenance action from covariates
Predicted Class.
Observed Class.
1 2 4 5 9 12 Sensitivity Specificity
1 9 3 77 1 0 1 1 0.03
2 0 0 1 0 0 1 0 0.98
4 0 1 0 0 0 0 0 0.94
5 0 0 0 0 0 0 0 1
9 0 0 0 0 0 0 0 1
12 0 0 0 0 0 0 0 1

### 7.6.2 Predicting replacement probability from covariates

Given the difficulties of the previous model in classifying maintenance actions into sparsely represented categories, a further analysis was performed after dichotomizing the maintenance action into a binary indicator of whether the maintenance action was a Replace or any other category. Using a binary response not only provides a more balanced distribution with which to train and test, it allows for the usage and comparison of traditional statistical classifiers such as a logistic regression.

Table 7.25: Comparison of models for predicting replacement
Model Type MSE Accuracy Sensitivity Specificity
1 6 0.105 0.894 0.182 0.988
2 8 0.110 0.883 0.273 0.964
3 12 0.087 0.915 0.364 0.988
4 10 0.085 0.915 0.364 0.988
5 GLM 0.142 0.840 0.000 0.952
6 GLM 0.106 0.872 0.000 0.988
Note:
Type: GLM for logistic regression, otherwise it indicates the number
of neurons in the respective hidden layers

Table 7.25 provides a comparison of ANNs with several different configurations against two logistic regression models. Models 1, 2, 3, and 4 all include the same 13 covariates, namely Production, Numstop, Tstop, Repairtime, ADE, ANDE, HDE, HDE.CAV, HDE.ENV, HNDE, HNDE.ENV, VDE, and VNDE. All four models were trained using resilient backpropagation, a SSE error function, and a logistic activation function with 6, 8, 12, and 10 neurons in a single hidden layer respectively. During exploration, modeling including more than one hidden layer were attempted, however the procedure did not converge. The logistic regression corresponding to Model 5 contained the same 13 covariates and the ANN models. As the significance tests provided by Model 5 indicated that Production was the only significant covariate, Model 6 was estimated with only production as a predictor.

As the motivation behind these models is to identify conditions that indicate an equipment is likely to be replaced, a replacement value of 1, indicating that replacement was performed, is classified as a positive. In the context of the classification results in Table 7.25, sensitivity corresponds to the the percentage of positives correctly identified. In other words, this represents when the model correctly identified that a replacement occurred. Conversely, specificity corresponds to when the model correctly identified that the equipment did not need replacing. As evident in the table, Model 1 performs the best classification in terms of overall accuracy, sensitivity, and specificity.

Table 7.26: Confusion matrix for Model 1
Predicted
Observed
0 1
0 82 9
1 1 2

The confusion matrix for Model 1 is provided in Table 7.26, in which the columns indicate the observed value of replacement, and the rows indicate the predicted value of replacement. Although the accuracy and specificity of model 1 are quite high, the sensitivity is still quite low. The confusion matrix in Table 7.26 provides additionalcontext to these values, as it identifies that there are only 11 replacement actions compared to 83 non-replacement actions in the testing set. The fact that the model almost always classifys observations as non-replace indicates that it is unable to identify a consistent pattern between the covariates and the resulting maintenance actions. Given our knowledge that the plant maintenance engineers are currently unable to use all available data to decide on the appropriate maintenance action, it is not surprising that the model suggests these actions are inconsistent. If the plant can use the framework of integration and analysis demonstrated throughout this research to identify a set of maintenance rules based on equipment conditions(e.g., if vibration exceeds threshold, perform inspection), then classification models would theoretically be able to identify these rules via patterns in future data.