Data Mining 101

Martin Fairbank

I heard it said recently that more data have been generated in the last ten minutes than in all the time from the prehistoric era until 2003.

Measuring things and storing the acquired data has become much easier and cheaper over the years. Most pulp and paper mills have years of data in their data historian systems that can be mined for value. But where should you start?

I've been manually collecting data for many years on the fuel efficiency of the cars I've owned, but I've never done much with the data. So I thought I'd use this blog post to take a journey with you on a data mining exercise!

Data mining is about understanding patterns, then using that information to optimize a system. It turns data into information which, combined with process knowledge, can help in decision-making.

The first way you usually look at your data is as a time series. You may have to do a lot of work to get your data into this format, such as importing it and converting it into understandable units. This is called data pre-processing. Here are all my fuel efficiency data from 1991 to 2017, calculated from the distance traveled between tanks of gas.
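As a rough illustration of this pre-processing step, here is a minimal Python sketch that turns a fill-up log into fuel consumption in L per 100 km. The `FillUp` record, its field names and the sample numbers are hypothetical, and it assumes the tank is filled completely each time, so the fuel added matches the fuel burned since the previous fill-up.

```python
from dataclasses import dataclass


@dataclass
class FillUp:
    odometer_km: float  # odometer reading at the pump (hypothetical field)
    litres: float       # fuel added to refill the tank (hypothetical field)


def consumption_l_per_100km(fillups):
    """Return L/100 km for each interval between consecutive fill-ups."""
    results = []
    for prev, curr in zip(fillups, fillups[1:]):
        distance = curr.odometer_km - prev.odometer_km
        if distance <= 0:
            continue  # skip intervals with bad odometer entries
        # Fuel added now replaces what was burned over this distance.
        results.append(100.0 * curr.litres / distance)
    return results


# Made-up log entries, for illustration only.
log = [FillUp(10000.0, 40.0), FillUp(10500.0, 35.0), FillUp(11000.0, 40.0)]
rates = consumption_l_per_100km(log)
print(rates)
```

Each value in `rates` becomes one point on the time series, dated by the fill-up that closed the interval.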

Clustering

The first thing you'll notice about this time series plot is that there are three groups of points, corresponding to three different cars. Finding groups of data that behave differently is called "clustering" the data. To understand the patterns, it's important to focus on one cluster at a time, and it may be necessary to do some data "cleaning" and filtering. A few very high points can be seen in the middle of the Car 2 cluster. These came from a multi-day road trip I made hauling a trailer, which reduced fuel efficiency. I would remove these points when trying to model the "normal process".
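The grouping and cleaning described above can be sketched in a few lines of Python. This is only one possible approach, not necessarily the one used for the plots here: it assumes each record already carries a car label, and it flags outliers with a median-based rule (a point is dropped if it lies more than `k` median absolute deviations from the cluster median). The function names, the threshold `k` and the sample values are all hypothetical.

```python
from collections import defaultdict
from statistics import median


def split_by_car(records):
    """Group (car, value) pairs into one list of values per car."""
    groups = defaultdict(list)
    for car, value in records:
        groups[car].append(value)
    return dict(groups)


def drop_outliers(values, k=3.0):
    """Keep values within k median absolute deviations of the median."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return list(values)  # no spread to judge against; keep everything
    return [v for v in values if abs(v - med) <= k * mad]


# Made-up consumption values in L/100 km; 12.5 stands in for a trailer trip.
records = [
    ("Car 2", 8.0), ("Car 2", 8.2), ("Car 2", 7.9),
    ("Car 2", 8.1), ("Car 2", 12.5),
    ("Car 3", 6.0), ("Car 3", 6.1),
]
clusters = split_by_car(records)
cleaned = drop_outliers(clusters["Car 2"])
print(cleaned)
```

A median-based rule is used in this sketch because a handful of extreme points barely moves the median, whereas it can noticeably shift a mean and standard deviation.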

Pattern recognition

To the right of Car 2's data and to the left of Car 3's data, you can see an up-and-down pattern in the data. This was a period where the cars were used almost exclusively for highway driving. I've expanded the Car 2 data for this period below.

It's clearly a seasonal pattern, with better fuel efficiency in the summer. To develop a model for this, I need the corresponding temperatures, and as it happens, Environment Canada has historical temperature records available for download from the web. After some manipulation, I was able to come up with a rough correlation between daytime temperature and fuel efficiency, shown below. It tells me that for every 5° drop in temperature, the car used roughly another 0.2 L per 100 km.
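A correlation like this amounts to fitting a straight line through temperature and consumption pairs. Here is a minimal least-squares sketch in Python. The data points are synthetic, chosen to lie exactly on a line with the slope quoted above (0.2 L per 100 km per 5°, or 0.04 per degree); they are not the measured data, and the 8.0 L per 100 km intercept and the Celsius scale are assumptions made for the example.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Synthetic points for illustration: temperature (assumed deg C) vs L/100 km.
temps = [-10.0, 0.0, 10.0, 20.0]
consumption = [8.4, 8.0, 7.6, 7.2]

slope, intercept = fit_line(temps, consumption)
extra_per_5_degree_drop = -5.0 * slope
print(slope, intercept, extra_per_5_degree_drop)
```

With real data the points scatter around the line, so it is worth checking the size of that scatter before trusting the slope.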

Now obviously I don't have a very good model here. Other factors that would come into play include the average speed (at 120 km/h, a vehicle uses about 20 percent more fuel than at 100 km/h), the weight being carried, and wind direction and speed. In this example, I don't have these data, but you get the idea. By building a good model based on past data and including all the variables affecting the process, you can predict what will happen to your process in the future. This makes it possible to set alarms that trigger when the process is not operating properly, and can ultimately lead to more sophisticated systems such as model-based control.
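To make the alarm idea concrete, here is a small Python sketch that compares each new reading with what a simple temperature model predicts and flags the ones that deviate by more than a chosen tolerance. The model coefficients (an 8.0 L per 100 km intercept and a slope of -0.04 per degree, assumed Celsius), the 0.5 L per 100 km tolerance and the readings are all hypothetical placeholders; a real system would take them from a model fitted to plant data.

```python
def predict(temp_c, slope=-0.04, intercept=8.0):
    """Expected consumption (L/100 km) from a hypothetical linear model."""
    return intercept + slope * temp_c


def alarms(readings, tolerance=0.5):
    """Return readings whose residual from the model exceeds the tolerance."""
    flagged = []
    for temp_c, observed in readings:
        residual = observed - predict(temp_c)
        if abs(residual) > tolerance:
            flagged.append((temp_c, observed, round(residual, 2)))
    return flagged


# Made-up (temperature, observed consumption) pairs for illustration.
readings = [(20.0, 7.3), (0.0, 8.1), (10.0, 9.0)]
flagged = alarms(readings)
print(flagged)
```

The tolerance is the key design choice: too tight and normal scatter triggers false alarms, too loose and real problems go unnoticed.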

My experience with data mining is that it works better with some data sets than others. The simpler the process, and the less it's affected by random or difficult-to-measure parameters, the more robust the model will be. Generally if you are trying to model energy processes, such as a boiler, reboiler or turbine, a good model can be developed, because most relationships in the model are linear, and the number of variables required is not too high. This can also lead to good cost savings opportunities, since any improvement in energy efficiency is easy to translate into dollars.

On the other hand, I have not seen data mining work well for complex phenomena such as wet end breaks on a paper machine. There are many things that can cause a break, and some of them, such as build-up of pitch or slime, are difficult to measure and predict. About twenty years ago, the industry thought high-speed video cameras would eliminate breaks by providing a thorough understanding of break causes. Like data mining, however, video remains just another tool that a smart papermaker can use to approach the ideal of zero breaks.

Software packages are available to take care of the grunt work of data mining. Although I created the graphs in this blog post using Excel, these packages can handle data import, data cleaning and filtering, clustering, data visualization, modeling and other tasks much more easily. An example is the EXPLORE software available from CanmetENERGY, which presented a workshop on this topic at PaperWeek Canada this February.


Martin Fairbank, Ph.D.

Martin Fairbank has worked in the forest products industry for 31 years,
including many years for a pulp and paper producer and two years with
Natural Resources Canada. With a Ph.D. in chemistry and experience in
process improvement, product development, energy management and lean
manufacturing, Martin currently works as an independent consultant,
based in Montreal. He is also an author, having recently published
Resolute Roots, a history of Resolute Forest Products and its
predecessors over the last 200 years.



