
Engineering to Data Science: A Life of Forecasting — and How I Miss Physics

July 18, 2018

This post was originally published on Medium in July 2018.

Before moving firmly into data science I spent five years in the oil and gas sector as a petroleum engineer, working in several countries and everywhere from offshore platforms in the Gulf of Guinea to offices in the middle of the West African jungle. These experiences have shaped me as an engineer and as an individual, and have had a tremendous influence on my outlook when it comes to quantitative analysis.

The bulk of my work lay in production optimization and forecasting: essentially, how can we produce more oil with what we have (or even less), and what will we produce in the future? All of this had one objective: to understand the value of an oil field asset and its future potential.

I often see many parallels between what I did as a petroleum engineer using physical models and what I do now as a data scientist predicting consumer demand for ecommerce. Nonetheless, the problems encountered in the forecasting models diverge widely. Ultimately, I came to the realisation that, even with all the horribly dirty data I had to manage as an engineer, I strongly miss the foundations of physics to help guide my decision-making. The laws of engineering and physics provide known boundary conditions when trying to understand what the data and information are telling you. Such laws are difficult to define when we analyse the rationale of a person buying consumer goods.

Forecasting in the Oil and Gas Sector

As an engineer working in central Africa, my work in essence took a bunch of physically measured data (features) and transformed them into predictive models, often time-dependent ones. The key to this analysis, as in demand forecasting, is understanding the data: what to look at, how to combine variables efficiently to best represent your system, and how to quantify the uncertainty. Yet there is a fundamental difference in approach: engineering starts from physical models, demand forecasting from statistical ones. Personally, I believe that combining both camps is fundamental to moving the industry forward.

An oil forecast, like any other, needs data. An oil well has generally been producing for a number of years (0–40 years), and for most wells you will at a minimum have a standard set of data: how much fluid is produced per day, the pressure at the surface and the temperature. With these elements it is straightforward to build out a forecast. The problem is that although you have data, it varies enormously in quality and quantity. This is the real challenge for the engineer, and it makes it a fascinating job.

Your data can come from anything from fancy remote sensors 3000m underground, sending signals to your laptop every second, to a guy jumping into a pickup or boat to visit the oil wells and read a gauge. The latter happens more often than you imagine; in fact, I have been this guy. Because of how it is collected and what is being collected, the data is sporadic, hugely unreliable at times, and sometimes just non-existent.

This is actually symptomatic of the oil industry: there is very little consistency, especially when it comes to data, and that is useful to understand when you begin forecasting. One of the few consistencies you will find is that with every drop in oil price you will likely also see a drop in the quality of your data. This has a significant impact on time-series forecasting. Often your data is reliable right at the start of an oil well's life (this is where the most capital investment is) and then degrades over time, through malfunctioning sensors, poorer monitoring and the well generally no longer being the sexiest thing in a company's portfolio.

Beyond the simple availability of data, the actual measurements carry huge uncertainty. It is difficult to comprehend the complexity of an oil reservoir and the fluid it produces, let alone the difficulty you sometimes have in believing your measurements. In the end, when you think about an oil reservoir, you are talking about placing tiny little pin pricks (wells ~12 inches wide) into the earth at 2,000–4,000m deep, and then using these as your only eyes and ears in something that can span tens of kilometres.

So the difficulty in oil and gas forecasting comes largely from your data quality and availability. And although I make it sound like a horrible problem (and you will often see engineers fighting just to make sense of their data to deliver their view of the future), what you also have is a physical system. This means that even if I spent weeks or months developing my models, I could give another engineer 5–10 parameters and on his calculator he could check whether my forecast was physically reasonable. In the end, the processes we try to predict are governed at a macroscopic level by the simple pressure, temperature and volume relationships we learnt in secondary school physics. With physics you have a ground truth, a guiding rope. It gives you a hell of a lot more confidence when you are telling someone to invest tens of millions of dollars.

The Ecommerce Side

Now, this week I have been working on an entirely different problem: I am working for an ecommerce company, building out demand forecasting models. Essentially, I am trying to predict what people will buy next week on the site, so that I can have enough stock to fulfil their demand at the point they buy. This comes with a number of constraints: I don't want to buy too much stock (over-predicting) as it costs me money, and I don't want to under-predict too much, as I will then have delays in fulfilling a customer's order (less happy customer and/or lost clientele). It is a financial tradeoff: working out what the optimal stock quantity is at any given time.
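One textbook way to frame this tradeoff is the newsvendor rule: stock up to the demand quantile given by cost_under / (cost_under + cost_over). The sketch below is only an illustration of that idea, not the model I actually built; the function name, the cost figures and the sample history are all made up for the example.

```python
def newsvendor_quantity(demand_samples, cost_over, cost_under):
    """Return the stock level that balances over- and under-stocking costs.

    demand_samples: past daily (or weekly) demand observations.
    cost_over:  cost of one unit bought but not sold (hypothetical figure).
    cost_under: cost of one unit of demand not fulfilled (hypothetical figure).
    """
    if not demand_samples:
        raise ValueError("need at least one demand observation")
    ordered = sorted(demand_samples)
    n = len(ordered)
    # Smallest observed demand q whose empirical CDF reaches the critical
    # ratio cost_under / (cost_under + cost_over). The comparison is
    # written without division so integer costs stay exact.
    for rank, quantity in enumerate(ordered, start=1):
        if rank * (cost_under + cost_over) >= cost_under * n:
            return quantity
    return ordered[-1]


# Made-up intermittent sales history for a single product.
history = [0, 0, 0, 1, 0, 2, 0, 0, 10, 0]

cautious = newsvendor_quantity(history, cost_over=1, cost_under=4)
aggressive = newsvendor_quantity(history, cost_over=1, cost_under=19)
print(cautious, aggressive)
```

The more a missed sale costs relative to a unit of surplus stock, the further into the tail of the demand history the rule pushes the stock level.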

I have a simple set of data coming from the sales side (the orders we have per day for a given product) and from Google Analytics I can take the traffic information for the site; by combining the two data sets you get the conversion rate. I am doing this for ~20,000 products so it needs to be industrialised. It is also a discontinuous data set — you can go a week selling nothing and all of a sudden sell 10 products.
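As a rough sketch of that combination step, the snippet below joins per-day order counts with per-day traffic to get a conversion rate. The dictionary layout, the function name and the numbers are assumptions made for illustration; a real pipeline would pull these from the sales database and a Google Analytics export.

```python
def conversion_rates(orders, sessions):
    """Join order counts with session counts per (day, product_id).

    Both arguments map (day, product_id) -> count. Days with traffic but no
    orders get a rate of 0.0; days with no recorded traffic are skipped,
    because the rate is undefined there.
    """
    rates = {}
    for key, visits in sessions.items():
        if visits <= 0:
            continue
        rates[key] = orders.get(key, 0) / visits
    return rates


# Made-up data for one hypothetical product, "sku-1".
orders = {("2018-07-02", "sku-1"): 3}
sessions = {
    ("2018-07-01", "sku-1"): 120,
    ("2018-07-02", "sku-1"): 150,
    ("2018-07-03", "sku-1"): 0,
}

rates = conversion_rates(orders, sessions)
for key in sorted(rates):
    print(key, rates[key])
```

Keeping the zero-order days in the result matters here: with discontinuous sales, the zeros are most of the signal.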

How I Sometimes Miss Physical Systems

So in my ecommerce case I am dealing with unimaginably clean data compared to my life in the oil field. I have "sensors" measuring everything that people are doing on the site, and the quality of this data is high: not only is it collected by Google but, most critically, it is not collected by a human working the night shift, spending 8 hours driving or taking a boat to remote locations to read a gauge, let alone by me trying to note down a pressure in my notebook at 2am in a monsoon.

So when I initially started working on these problems I was like wow, I am bathing in good data. But then when I started getting into the nitty gritty of the forecast I realised quickly I didn't have a ground truth, I didn't have my guide. In forecasting oil wells you always had the hand of physics to give you confidence. And when I reached out for a friend to give me confidence for next week's stock forecast I found it all of a sudden a pretty lonely place.

The thing is, without physics it is difficult to know whether what you are doing is right. Tomorrow I don't know whether we will sell more or less. I can say that yesterday we sold X and the day before Y. But about tomorrow I don't know anything concrete. I can give you probabilities and tell you that I am pretty certain we will sell more than 1, but I really have very few bounds on my system. In oil forecasting I could support my forecast by saying "well yesterday we produced 200 barrels which means the pressure decreased by x and that means today I will probably produce 5 barrels less."
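To make that kind of back-of-the-envelope reasoning concrete, here is a toy sketch. It assumes a deliberately simplified linear "tank" model, where rate is proportional to reservoir pressure and pressure falls in proportion to the volume produced. The productivity index and storage values are invented so that the numbers match the 200 barrels and 5 barrels in the quote; a real well would need a proper reservoir model.

```python
def next_day_rate(rate_today, productivity_index, storage):
    """Estimate tomorrow's rate from today's under a linear tank model.

    rate_today:         production today, in barrels per day.
    productivity_index: barrels per day gained per psi of pressure
                        (hypothetical value in the example below).
    storage:            barrels produced per psi of pressure drop
                        (hypothetical value in the example below).
    """
    pressure_drop = rate_today / storage  # psi lost after one day of production
    return rate_today - productivity_index * pressure_drop


def forecast_rates(initial_rate, productivity_index, storage, days):
    """Roll the one-day step forward to get a simple deterministic forecast."""
    rates = [initial_rate]
    for _ in range(days):
        rates.append(next_day_rate(rates[-1], productivity_index, storage))
    return rates


# 200 barrels yesterday, 10 psi of pressure lost, 5 barrels less today.
print(next_day_rate(200, productivity_index=0.5, storage=20))
print(forecast_rates(200, productivity_index=0.5, storage=20, days=2))
```

The point is not the model itself but that two parameters and a calculator are enough to tell whether a forecast is in the right neighbourhood, which is exactly the check I have no equivalent for on the demand side.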

In the end, in the oil field you can create complex probabilistic models, but behind them you can also build nice deterministic models based on physics to give you confidence. The buying mentality of the populace, the impact of Google's search algorithm, or a competitor's advertising campaign and how it affects my traffic: these I don't know.

With demand forecasting you can, and must, continue to create better models, trying to capture more proxies for probable user behaviour. But ultimately, this leads to more complex models which become harder to explain. And although the performance on historical data can be much improved, it can never give the same confidence, at least for me. What we do here is applied and researched extensively in financial forecasting and other probabilistic systems; they can be highly successful. But we should always be critical of what we actually know and what our model could possibly learn. Because ultimately, at least for me, I am worried that one of those big fat tails is gonna slap me in the face.