With this notebook I show how to create a synthetic dataset of elastic properties (Vp, Vs, density, etc.) starting from well logs. The idea is that the wells cannot possibly sample all the complexity and variations occurring across a reservoir. So what I do is to use various techniques to synthesize new data and create an abstraction of my reservoir to investigate alternative and plausible scenarios (change of pore fluids; lithological variations); another possible use of is to create a training dataset to calibrate 3D seismic volumes (see also the discussion on training data in Avseth et al., 2005, p.126).
What I also want to do is to demonstrate the use of Python to build your own tools and workflows and therefore move away from a "black-box approach".
This material was featured in the Geophysical Tutorial columns in the April and June 2015 issues of The Leading Edge.
Alessandro, April 2015
History:
.ix
), plus various other small corrections.To-do:
np.random.multivariate_normal
about covariance not positive-semidefinite.The idea is to show how to do these things:
And while doing this, I will also show some cool features of Python, such as using the Pandas library to explore geophysical data, defining evergreen functions like Gassmann and Voigt-Reuss-Hill, do fancy plots and make interactive tools.
The well log I will use is part of the test datasets available on the Quantitative Seismic Interpretation book website:
https://srb.stanford.edu/quantitative-seismic-interpretation
In particular, I use "Well 2", contained in the "Project Data" zipfile. Plenty of background information for this particular well is available on Per Avseth's PhD thesis, again available from the Stanford website:
https://srb.stanford.edu/theses
The actual file I will use in this notebook, qsiwell2.csv
is an elaboration of several other files contained in that zip archive (merged some logs, created others, etc.). See qsiwell2_dataprep.py
in the repo for details.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
%matplotlib inline
# comment out the following if you're not on a Mac with HiDPI display
%config InlineBackend.figure_format = 'retina'
First a few words on the libraries I am using:
Note the use of the Pandas library to digest log data; it's a library built to handle big datasets and make your life easier (as we shall see later). This is my way to do things. You are welcome to use Python's built-in structures if you're inclined to do so or need to keep things minimal (Matt and Evan at Agile Geoscience do wonderful things relying on the standard triumvirate numpy/scipy/matplotlib).
For those interested in using their own .las
files, there is another library (http://wiki.scipy.org/Cookbook/LASReader) that can be used.
Let's load qsiwell2.csv
into the Pandas DataFrame L
:
L=pd.read_csv('qsiwell2.csv', index_col=0)
to check what's now inside L
:
L.columns
Index(['VP', 'VS', 'RHO_OLD', 'GR', 'NPHI', 'RHO', 'SWE', 'SWX', 'VPVS', 'IP', 'IS', 'VSH', 'RHOm', 'RHOf', 'PHIE'], dtype='object')
As you can see, I have all the required logs for a quantitative analysis: the elastic logs (compressional and shear velocity Vp
and Vs
, density RHO
) as well as the main petrophysical logs (water saturation Sw
, shale volume VSH
and porosity PHI
).
Logs stored into a Pandas DataFrame are easily accessed using the syntax DataFrame.log
or DataFrame['log']
, e.g. L.VP
is the P-wave velocity and L['PHIE']
is effective porosity.
To select more than one log, use the syntax DataFrame[['log1','log2']]
. Mix and match with filters such as DataFrame.index>2100
to restrict the data only to those values below 2100m and you can get all the information you need:
print('Average Ip between 1500 and 2000 m:')
print(L[(L.index>=2100) & (L.index<2200)][['VP','RHO']].mean())
Average Ip between 1500 and 2000 m: VP 2551.785823 RHO 2.216325 dtype: float64
If I had a categorical variable, for example a litho-facies log like the LFC
(litho-fluid class) log that I will show later on in this notebook, I could use the syntax DataFrame[DataFrame.LFC==2]['log1','log2']]
to only select data belonging to a specific class.
To remove a specific log (e.g., that RHO_OLD
) I use the drop
method with the optional axis=1
(required otherwise Pandas will search for a row named TVD
instead of a column):
L=L.drop(['RHO_OLD'],axis=1)
Let's check if I have really deleted RHO_OLD
:
L.columns
Index(['VP', 'VS', 'GR', 'NPHI', 'RHO', 'SWE', 'SWX', 'VPVS', 'IP', 'IS', 'VSH', 'RHOm', 'RHOf', 'PHIE'], dtype='object')
To rename a log use the .rename
method:
L.rename(columns={'RHOf':'RHOfluid'}, inplace=True)
L.columns
Index(['VP', 'VS', 'GR', 'NPHI', 'RHO', 'SWE', 'SWX', 'VPVS', 'IP', 'IS', 'VSH', 'RHOm', 'RHOfluid', 'PHIE'], dtype='object')
But what if I need to rename multiple logs? Simply assign a new list of names to the DataFrame columns; for example, if I have a 3 logs called log1
, log2
, log3
, I can rename them all in one go using this syntax:
DataFrame.columns=['newlog1','newlog2','newlog3']
Pandas is the best way I have ever seen to manage large and complex datasets; the basic Dataframe is a compact, portable and powerful data structure that allows you to quickly inspect your data, either using standard Python functions (like min()
, .max()
, .mean()
) or the describe
method:
print('Depth min/max: {:.1f}/{:.1f}'.format(L.index.min(),L.index.max()))
print('Vp average: {:.2f}'.format(L.VP.mean()))
Depth min/max: 2013.3/2640.5 Vp average: 2977.10
L[['VP','VS','RHO','PHIE']].describe()
VP | VS | RHO | PHIE | |
---|---|---|---|---|
count | 4117.000000 | 4117.000000 | 2701.000000 | 2701.000000 |
mean | 2977.098761 | 1371.293952 | 2.225045 | 0.291759 |
std | 449.355552 | 297.056552 | 0.061961 | 0.032196 |
min | 1439.900000 | 688.800000 | 2.036616 | 0.106844 |
25% | 2594.500000 | 1141.600000 | 2.185285 | 0.276218 |
50% | 3036.100000 | 1415.300000 | 2.223544 | 0.297311 |
75% | 3251.100000 | 1595.000000 | 2.264000 | 0.310519 |
max | 4431.000000 | 2427.800000 | 2.545268 | 0.376396 |
The next step is to calculate a Litho-Fluid Class log (LFC), where I separate groups of data identified by similar lithological and/or pore fluid content. The values of the LFC log will be assigned following these rules:
First I restrict my analysis on a smaller depth window (first line in the code snippet below) because the petrophysical logs are not computed everywhere. Then I create the "flag" logs ssb
, sso
, ssg
and sh
(i.e., logs made of samples that can only be 1 or 0, i.e. True or False) using cut-off values on VSH
(shale volume) and SW
(water saturation):
L=L[(L.index>=2100) & (L.index<=2400)]
sand_cutoff=0.20
ssb=((L.VSH <= sand_cutoff) & (L.SWE >= 0.9)) # brine sand
sso=((L.VSH <= sand_cutoff) & (L.SWE < 0.9)) # oil sand
sh=(L.VSH > sand_cutoff) # shales
Notice that there is no ssg
log (occurrence of gas sands) yet, as the insitu log did not record any gas sand. We will deal with that later.
I use the above flag logs to create the LFC
log and store it into L
with the other logs:
temp_lfc=np.zeros(np.shape(L.VSH))
temp_lfc[ssb.values]=1 # LFC will be 1 when ssb (brine sand flag) is True
temp_lfc[sso.values]=2 # LFC will be 2 when sso (oil sand flag) is True
temp_lfc[sh.values]=4 # LFC will be 4 when sh (shale flag) is True
L.insert(L.columns.size, 'LFC', temp_lfc) # the first argument is the position, in this way LFC will be the last one.
Now, let's see if this is all ok; the total number of samples after zooming in the 2100-2400 m depth window is:
np.shape(L.VSH)
(1968,)
and the following prints out how many samples each facies is made of:
print('brine sst={}, oil sst={}, shale={}'.format(np.count_nonzero(ssb),np.count_nonzero(sso),np.count_nonzero(sh)))
brine sst=706, oil sst=134, shale=1128
And a final check to make this newly defined LFC
log only has values within the range 1 to 4 (there will be no undefined samples in this particular depth interval, i.e. classes with LFC=0
):
print('LFC min: {}, LFC max: {}'.format(L.LFC.min(), L.LFC.max()))
LFC min: 1.0, LFC max: 4.0
It is very easy to do plots of all types with Pandas, and to show that here's a one-liner that plots Vp histograms for each class (remember that at this point we only have classes 1, 2 and 4, i.e. brine sand, oil sand and shales):
L.IP.hist(bins=50, color='black', by=L.LFC, figsize=(12,2), layout=(1,3), lw=0)
array([<matplotlib.axes._subplots.AxesSubplot object at 0x7f0dc877d2b0>, <matplotlib.axes._subplots.AxesSubplot object at 0x7f0dc84f45f8>, <matplotlib.axes._subplots.AxesSubplot object at 0x7f0dc851e908>], dtype=object)
First I need a custom colormap for my classes, i.e. a discrete colormap with following classes-colors association:
This is the way to define this colormap:
import matplotlib.colors as colors
# 0=undef 1=bri 2=oil 3=gas 4=shale
ccc = ['#B3B3B3','blue','green','red','#996633',]
cmap_facies = colors.ListedColormap(ccc[0:len(ccc)], 'indexed')
Let's now display a summary view of all the logs:
ztop=2100; zbot=2400
ll=L[(L.index>=ztop) & (L.index<=zbot)]
cluster=np.repeat(np.expand_dims(ll['LFC'].values,1),100,1)
f, ax = plt.subplots(nrows=1, ncols=4, figsize=(8, 6))
ax[0].plot(ll.VSH, ll.index, '-g', label='Vsh')
ax[0].plot(ll.SWE, ll.index, '-b', label='Sw')
ax[0].plot(ll.PHIE, ll.index, '-k', label='phi')
ax[1].plot(ll.IP, ll.index, '-', color='0.5')
ax[2].plot(ll.VPVS, ll.index, '-', color='0.5')
im=ax[3].imshow(cluster, interpolation='none', aspect='auto',cmap=cmap_facies,vmin=0,vmax=4)
cbar=plt.colorbar(im, ax=ax[3])
cbar.set_label((12*' ').join(['undef', 'brine', 'oil', 'gas', 'shale']))
cbar.set_ticks(range(0,1)); cbar.set_ticklabels('')
for i in ax[:-1]:
i.set_ylim(ztop,zbot)
i.invert_yaxis()
i.grid()
i.locator_params(axis='x', nbins=4)
ax[0].legend(fontsize='small', loc='lower right')
ax[0].set_xlabel('Vcl/phi/Sw'), ax[0].set_xlim(-.1,1.1)
ax[1].set_xlabel('Ip [m/s*g/cc]'), ax[1].set_xlim(3000,9000)
ax[2].set_xlabel('Vp/Vs'), ax[2].set_xlim(1.5,3)
ax[3].set_xlabel('LFC')
ax[1].set_yticklabels([]); ax[2].set_yticklabels([]); ax[3].set_yticklabels([]); ax[3].set_xticklabels([]);
Now let's see the same data in crossplot domain that should be familiar to geophysicists, i.e. acoustic impedance (Ip) versus Vp/Vs ratio:
plt.figure(figsize=(8,6))
plt.scatter(L.IP,L.VPVS,20,L.LFC,marker='o',edgecolors='none',alpha=0.7,cmap=cmap_facies,vmin=0,vmax=4)
plt.xlim(3000,9000); plt.ylim(1.5,3); plt.grid(); plt.xlabel('Ip'); plt.ylabel('Vp/Vs')
cbar=plt.colorbar()
cbar.set_label((15*' ').join(['undef', 'brine', 'oil', 'gas', 'shale']))
cbar.set_ticks(range(0,1)); cbar.set_ticklabels('')
#-----------------------------------------------------------------
#-- alternative way to label the colorbar:
#cbar.set_label('0=undef,1=brine,2=oil,3=gas,4=shale')
#cbar.set_ticks(range(0,4+1)); cbar.set_ticklabels(range(0,4+1));
#-----------------------------------------------------------------
cbar.set_alpha(1)
cbar.draw_all()
What about that red color in the legend plot above? Well, that was reserved for gas sands -- but we don't have any in the insitu case. Let's go make some of those!
Fluid replacement modeling (FRM in short), based on Gassmann's equation, is one of the key activities for any kind of quantitative work (see Wang, 2001 and Smith et al., 2003, for an exhaustive overview).
It is used to model the response of elastic logs (i.e., Vp, Vs and density) in different fluid conditions, thus allowing to study how much a rock would change in terms of velocity (or impedances) if it was filled with gas instead of brine for example; but what it also does is to bring all elastic logs to a common fluid denominator to focus only on lithological variations, disregarding fluid effects.
The inputs to FRM are $k_s$ and $\mu_s$ (saturated bulk and shear moduli which we can get from recorded Vps, Vs and density logs), $k_d$ and $\mu_d$ (dry-rock bulk and shear moduli), $k_0$ (mineral bulk modulus), $k_f$ (fluid bulk modulus) and porosity, $\varphi$. Reasonable estimates of mineral and fluid bulk moduli and porosity are easily computed and are shown below. The real unknowns, what is arguably the core issue of rock physics, are the dry-rock moduli.
And here we come to Gassmann's equation; I can use it in its inverse form to calculate $k_d$:
$$ k_d = \frac{k_s \cdot ( \frac{\varphi k_0}{k_f} +1-\varphi) -k_0}{\frac {\varphi k_0}{k_f} + \frac{k_s}{k_0} -1-\varphi} $$Then I use Gassmann's again in its direct form to calculate the saturated bulk modulus with the new fluid:
$$k_s = k_d + \frac { (1-\frac{k_d}{k_0})^2} { \frac{\varphi}{k_f} + \frac{1-\varphi}{k_0} - \frac{k_d}{k_0^2}}$$Shear modulus is not affected by pore fluid so that it stays unmodified throughout the fluid replacement process.
$$\mu_s = \mu_d$$Bulk density is defined via the following equation:
$$\rho = (1-\varphi) \cdot \rho_0 + \varphi \cdot \rho_f $$And this is how I define it all as a Python function:
def gassmann(vp1, vs1, rho1, rho_fl1, k_fl1, rho_fl2, k_fl2, k0, phi):
vp1=vp1/1000
vs1=vs1/1000
rho2 = rho1-phi*rho_fl1+phi*rho_fl2
mu1 = rho1*vs1**2.
k1 = rho1*vp1**2-(4./3.)*mu1
kdry= (k1 * ((phi*k0)/k_fl1+1-phi)-k0) / ((phi*k0)/k_fl1+(k1/k0)-1-phi)
k2 = kdry + (1- (kdry/k0))**2 / ( (phi/k_fl2) + ((1-phi)/k0) - (kdry/k0**2) )
mu2 = mu1
vp2 = np.sqrt(((k2+(4./3)*mu2))/rho2)
vs2 = np.sqrt((mu2/rho2))
return [vp2*1000, vs2*1000, rho2, k2]
What this function does is to get the relevant inputs which are:
vp1
, vs1
, rho1
: measured Vp, Vs, and density (saturated with fluid 1)rho_fl1
, k_fl1
: density and bulk modulus of fluid 1rho_fl2
, k_fl2
: density and bulk modulus of fluid 2k0
: mineral bulk modulusphi
: porosityAnd returns vp2
,vs2
, rho2
, k2
which are respectively Vp, Vs, density and bulk modulus of rock with fluid 2.
I have mentioned above the possibility to get estimates of mineral bulk modulus $k_0$; the thing to know is that another assumption in Gassmann's equation is that it works only on monomineralic rocks; an actual rock is always a mixture of different minerals. A good approximation to get a mixed mineralogy bulk modulus k0
is to use Voigt-Reuss-Hill averaging (check these notes by Jack Dvorkin for a more rigorous discussion or this wikipedia entry).
So I code another function, vrh
, to do that:
def vrh(volumes,k,mu):
f=np.array(volumes).T
k=np.resize(np.array(k),np.shape(f))
mu=np.resize(np.array(mu),np.shape(f))
ax = 0 if f.ndim==1 else 1
k_u = np.sum(f*k,axis=ax)
k_l = 1./np.sum(f/k,axis=ax)
mu_u = np.sum(f*mu,axis=ax)
with np.errstate(divide='ignore', invalid='ignore'):
mu_l = 1./np.sum(f/mu,axis=ax)
k0 = (k_u+k_l)/2.
mu0 = (mu_u+mu_l)/2.
return k_u, k_l, mu_u, mu_l, k0, mu0
I can use the same function to also compute the fluid bulk modulus log which is usually done via Reuss average (the lower bound k_l
in the vrh
function above):
First I need to define the various mineral and fluid elastic constants:
rho_qz=2.65; k_qz=37; mu_qz=44 # mineral properties, quartz (i.e., sands)
rho_sh=2.81; k_sh=15; mu_sh=5 # mineral properties, clay (i.e., shales)
rho_b=1.09; k_b=2.8 # fluid properties, brine
rho_o=0.78; k_o=0.94 # fluid properties, oil
rho_g=0.25; k_g=0.06 # fluid properties, gas
Then I calculate the original (insitu) fluid density rho_fl
and bulk modulus k_fl
, and the average mineral bulk modulus k0
:
# mineral mixture bulk and shear moduli, k0 and mu0
shale=L.VSH.values
sand=1-shale-L.PHIE.values
shaleN = shale/(shale+sand) # normalized shale and sand volumes
sandN = sand/(shale+sand)
[k_u, k_l, mu_u, mu_l, k0, mu0]=vrh([shaleN, sandN],[k_sh,k_qz],[mu_sh,mu_qz])
# fluid mixture bulk modulus, using the same vrh function but capturing the Reuss average (second output)
water=L.SWE.values
hc=1-L.SWE.values
[tmp, k_fl, tmp, tmp, tmp, tmp]=vrh([water, hc],[k_b,k_o],[0,0])
# fluid mixture density
rho_fl=water*rho_b + hc*rho_o
...and put it all together using the frm
function defined above:
[vpb, vsb, rhob, kb]=gassmann(L.VP, L.VS, L.RHO, rho_fl, k_fl, rho_b, k_b, k0, L.PHIE)
[vpo, vso, rhoo, ko]=gassmann(L.VP, L.VS, L.RHO, rho_fl, k_fl, rho_o, k_o, k0, L.PHIE)
[vpg, vsg, rhog, kg]=gassmann(L.VP, L.VS, L.RHO, rho_fl, k_fl, rho_g, k_g, k0, L.PHIE)
Now I create 3 sets of copies of the original elastic logs stored in my DataFrame L
(L.VP
, L.VSB
, L.RHO
) for the three fluid scenarios investigated (and I will append an appropriate suffix to identify these 3 cases, i.e. _FRMB
for brine, _FRMG
for gas and _FRMO
for oil, respectively).
These three sets will be placeholders to store the values of the actual fluid-replaced logs (vpb
, vsb
, rhob
, etc.).
The syntax I use to do this is:
L['VP_FRMB'][ssb|sso]=vpb[ssb|sso]
Which means, copy the values from the output of fluid replacement (vpb
, vsb
, rhob
, etc.) only where there's sand (vpb[ssb|sso]
), i.e. only when either the flag logs ssb
or sso
are True).
I also compute the additional elastic logs (acoustic and shear impedances IP
, Is
, and Vp/Vs ratio, VPVS
) in their fluid-replaced version.
L['VP_FRMB']=L.VP
L['VS_FRMB']=L.VS
L['RHO_FRMB']=L.RHO
L['VP_FRMB'][ssb|sso]=vpb[ssb|sso]
L['VS_FRMB'][ssb|sso]=vsb[ssb|sso]
L['RHO_FRMB'][ssb|sso]=rhob[ssb|sso]
L['IP_FRMB']=L.VP_FRMB*L.RHO_FRMB
L['IS_FRMB']=L.VS_FRMB*L.RHO_FRMB
L['VPVS_FRMB']=L.VP_FRMB/L.VS_FRMB
L['VP_FRMO']=L.VP
L['VS_FRMO']=L.VS
L['RHO_FRMO']=L.RHO
L['VP_FRMO'][ssb|sso]=vpo[ssb|sso]
L['VS_FRMO'][ssb|sso]=vso[ssb|sso]
L['RHO_FRMO'][ssb|sso]=rhoo[ssb|sso]
L['IP_FRMO']=L.VP_FRMO*L.RHO_FRMO
L['IS_FRMO']=L.VS_FRMO*L.RHO_FRMO
L['VPVS_FRMO']=L.VP_FRMO/L.VS_FRMO
L['VP_FRMG']=L.VP
L['VS_FRMG']=L.VS
L['RHO_FRMG']=L.RHO
L['VP_FRMG'][ssb|sso]=vpg[ssb|sso]
L['VS_FRMG'][ssb|sso]=vsg[ssb|sso]
L['RHO_FRMG'][ssb|sso]=rhog[ssb|sso]
L['IP_FRMG']=L.VP_FRMG*L.RHO_FRMG
L['IS_FRMG']=L.VS_FRMG*L.RHO_FRMG
L['VPVS_FRMG']=L.VP_FRMG/L.VS_FRMG
Finally, I will add three more LFC logs that will be companions to the new fluid-replaced logs.
The LFC log for brine-replaced logs will be always 1 whenever there's sand, because fluid replacement will have acted on all sand points and replaced whatever fluid we had originally with brine. Same thing for LFC_O
and LFC_G
(LFC for oil and gas-replaced logs): they will always be equal to 2 (oil) or 3 (gas) for all the sand samples. That translates into Python like:
temp_lfc_b=np.zeros(np.shape(L.VSH))
temp_lfc_b[ssb.values | sso.values]=1 # LFC is 1 when either ssb (brine sand flag) or sso (oil) is True
temp_lfc_b[sh.values]=4 # LFC 4=shale
L['LFC_B']= temp_lfc_b
temp_lfc_o=np.zeros(np.shape(L.VSH))
temp_lfc_o[ssb.values | sso.values]=2 # LFC is now 2 when there's sand (ssb or sso is True)
temp_lfc_o[sh.values]=4 # LFC 4=shale
L['LFC_O']= temp_lfc_o
temp_lfc_g=np.zeros(np.shape(L.VSH))
temp_lfc_g[ssb.values | sso.values]=3 # LFC 3=gas sand
temp_lfc_g[sh.values]=4 # LFC 4=shale
L['LFC_G']= temp_lfc_g
And this is the same summary plot that I have used above, updated to show the fluid changes in the elastic logs Ip and Vp/Vs. It is also zoomed into the reservoir between 2150 and 2200 m, and the LFC log is the original one, i.e. it reflects the insitu case.
ztop=2150; zbot=2200
ll=L[(L.index>=ztop) & (L.index<=zbot)]
cluster=np.repeat(np.expand_dims(ll['LFC'].values,1),100,1)
f, ax = plt.subplots(nrows=1, ncols=4, figsize=(8, 6))
ax[0].plot(ll.VSH, ll.index, '-g', label='Vsh')
ax[0].plot(ll.SWE, ll.index, '-b', label='Sw')
ax[0].plot(ll.PHIE, ll.index, '-k', label='phi')
ax[1].plot(ll.IP_FRMG, ll.index, '-r')
ax[1].plot(ll.IP_FRMB, ll.index, '-b')
ax[1].plot(ll.IP, ll.index, '-', color='0.5')
ax[2].plot(ll.VPVS_FRMG, ll.index, '-r')
ax[2].plot(ll.VPVS_FRMB, ll.index, '-b')
ax[2].plot(ll.VPVS, ll.index, '-', color='0.5')
im=ax[3].imshow(cluster, interpolation='none', aspect='auto',cmap=cmap_facies,vmin=0,vmax=4)
cbar=plt.colorbar(im, ax=ax[3])
# cbar.set_label('0=undef,1=brine,2=oil,3=gas,4=shale')
# cbar.set_ticks(range(0,4+1));
cbar.set_label((12*' ').join(['undef', 'brine', 'oil', 'gas', 'shale']))
cbar.set_ticks(range(0,1)); cbar.set_ticklabels('')
for i in ax[:-1]:
i.set_ylim(ztop,zbot)
i.invert_yaxis()
i.grid()
i.locator_params(axis='x', nbins=4)
ax[0].legend(fontsize='small', loc='lower right')
ax[0].set_xlabel('Vcl/phi/Sw'), ax[0].set_xlim(-.1,1.1)
ax[1].set_xlabel('Ip [m/s*g/cc]'), ax[1].set_xlim(3000,9000)
ax[2].set_xlabel('Vp/Vs'), ax[2].set_xlim(1.5,3)
ax[3].set_xlabel('LFC')
ax[1].set_yticklabels([]); ax[2].set_yticklabels([]); ax[3].set_yticklabels([]); ax[3].set_xticklabels([]);
Let's have a look at the results in the Ip versus Vp/Vs crossplot domain; I will now plot 4 different plots to compare the initial situation to the results of the 4 fluid replacements:
f, ax = plt.subplots(nrows=1, ncols=4, sharey=True, sharex=True, figsize=(16, 4))
ax[0].scatter(L.IP,L.VPVS,20,L.LFC,marker='o',edgecolors='none',alpha=0.5,cmap=cmap_facies,vmin=0,vmax=4)
ax[1].scatter(L.IP_FRMB,L.VPVS_FRMB,20,L.LFC_B,marker='o',edgecolors='none',alpha=0.5,cmap=cmap_facies,vmin=0,vmax=4)
ax[2].scatter(L.IP_FRMO,L.VPVS_FRMO,20,L.LFC_O,marker='o',edgecolors='none',alpha=0.5,cmap=cmap_facies,vmin=0,vmax=4)
ax[3].scatter(L.IP_FRMG,L.VPVS_FRMG,20,L.LFC_G,marker='o',edgecolors='none',alpha=0.5,cmap=cmap_facies,vmin=0,vmax=4)
ax[0].set_xlim(3000,9000); ax[0].set_ylim(1.5,3);
ax[0].set_title('original data');
ax[1].set_title('FRM to brine');
ax[2].set_title('FRM to oil');
ax[3].set_title('FRM to gas');
for i in ax: i.grid()
Through FRM I have created an extended dataset. What I want to do now is a further abstraction, moving away from the intricacies and local irregularities of the real data with the goal of creating a fully synthetic dataset representing an idealized version of a reservoir.
The first step is to study some basic statistical properties, i.e. tendency, dispersion and correlation between certain elastic properties for each litho-fluid class.
Central tendency is simply described by calculating the mean values of some desired elastic property for all the existing classes; dispersion and correlation are summarised with the covariance matrix, which can be written like this (for two generic variables X and Y):
[ var_X cov_XY ]
[ cov_XY var_Y ]
if I had three variables instead:
[ var_X cov_XY cov_XZ ]
[ cov_XY var_Y cov_YZ ]
[ cov_XZ cov_YZ var_Z ]
Where var_X
is the variance of property X, i.e. a measure of dispersion about the mean, while the covariance cov_XY
is a measure of similarity between two properties X and Y. A detailed description of the covariance matrix can be found at this wikipedia entry.
Python allows me to easily perform these calculations; the covariance matrix will be stored 'flattened' to make it easily accessible even if it has redundancies s(e.g. cov_XY
in the 2-variable scenario):
[var_X cov_XY cov_XY var_Y]
For the rest of this exercise I will work with the two variables used so far, i.e. Ip and Vp/Vs.
First I need to prepare a few things to make the procedure easily extendable to other situations (e.g., using more than two variables):
To create a megalog (step 1) I use the following code:
lognames0=['LFC','IP','VPVS']
lognames1=['LFC_B','IP_FRMB', 'VPVS_FRMB']
lognames2=['LFC_O','IP_FRMO', 'VPVS_FRMO']
lognames3=['LFC_G','IP_FRMG', 'VPVS_FRMG']
ww0=L[pd.notnull(L.LFC)][lognames0].values
ww1=L[pd.notnull(L.LFC)][lognames1].values
ww2=L[pd.notnull(L.LFC)][lognames2].values
ww3=L[pd.notnull(L.LFC)][lognames3].values
ww=pd.DataFrame(np.vstack((ww0,ww1,ww2,ww3)), columns=lognames0)
What I have done here is to first define 4 lists containing the names of the logs I need. Then I extract into 4 separate temporary DataFrames that are automatically converted to simple Numpy arrays (with .values
) different sets of logs. For example, ww0
contains only the LFC
,IP
,VPVS
, while ww1
contains only LFC_B
,IP_FRMB
, VPVS_FRMB
.
I then merge the 4 datasets together and convert it back to a Pandas DataFrame assigning as column names LFC
,IP
,VPVS
(at this point, I treat all samples the same regardless of whether they were the insitu values or the fluid-replaced ones).
In other words, I have created a megalog (ww
) that includes all values of Ip and Vp/Vs that are both measured for a certain facies, and synthetically created through fluid substitution.
For step 2 first I build the headers for a Pandas DataFrame to store mean and covariances for each class:
nlfc=int(ww.LFC.max())
nlogs=len(ww.columns)-1 # merged data always contain a facies log so total number of logs is 3-1
names_mean=['']*nlogs # create empty list of strings for the mean columns header
names_cov=['']*nlogs**2 # same as above but for the covariance columns
for i,aa in enumerate(names_mean): # populate names_mean with unique names
names_mean[i]='mean'+str(i)
for i,aa in enumerate(names_cov): # populate names_cov with unique names
names_cov[i]='c'+str(i)
I will now create a DataFrame which is dynamically dimensioned and made of nfacies
rows, i.e. one row per each facies, and 1+n+m+1
columns, where n
is the number of columns with the mean values of Ip and Vp/Vs, m
is the length of the flattened covariance matrix. In this example where we have only two properties this means:
LFC
)mean0
and mean1
)c0
is the variance of Ip, c1
and c2
will be the covariance Ip-Vp/Vs and they will be equal, c3
is the variance of Vp/Vs)SAMPLES
)stat=pd.DataFrame(data=None,
columns=['LFC']+names_mean+names_cov+['SAMPLES'],
index=np.arange(nlfc))
stat['LFC']=range(1,nlfc+1)
This is how the stat
DataFrame looks like now:
stat
LFC | mean0 | mean1 | c0 | c1 | c2 | c3 | SAMPLES | |
---|---|---|---|---|---|---|---|---|
0 | 1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | 2 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | 3 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 | 4 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
To recap: right now I just have an empty box, made of four rows (because we have 4 classes: shale, brine, oil and gas sands), and for each row we have 2 columns to store the averages of property n.1 (Ip) and n.2 (Vp/Vs), 4 columns to store the flattened covariance matrices and finally 1 column with the number of samples for each class (that will inform me on the robustness of the analysis, i.e. if I have too few samples then I need to consider how my statistical analysis will not be reliable).
The following snippet shows how to populate stat
and prints out the number of samples in each class; also note the use of .flatten()
at line 5 that linearize the covariance matrix as discussed above:
for i in range(1,nlfc+1):
temp=ww[ww.LFC==i].drop('LFC',1)
stat.loc[stat.LFC==i,'SAMPLES']=temp.count()[0]
stat.loc[stat.LFC==i,names_mean[0]:names_mean[-1]]=np.mean(temp.values,0)
stat.loc[stat.LFC==i,names_cov[0]:names_cov[-1]]=np.cov(temp,rowvar=0).flatten()
print('LFC={}, number of samples={}'.format(i, temp.count()[0]))
LFC=1, number of samples=1546 LFC=2, number of samples=974 LFC=3, number of samples=840 LFC=4, number of samples=4512
Now let's look back at stat
and see how it has been filled up with all the information I need:
stat
LFC | mean0 | mean1 | c0 | c1 | c2 | c3 | SAMPLES | |
---|---|---|---|---|---|---|---|---|
0 | 1 | 6790.21 | 2.11403 | 199721 | -27.9541 | -27.9541 | 0.0205225 | 1546 |
1 | 2 | 6184.98 | 2.00664 | 337593 | -16.7249 | -16.7249 | 0.0234101 | 974 |
2 | 3 | 5816.3 | 1.939 | 360001 | 8.66782 | 8.66782 | 0.020416 | 840 |
3 | 4 | 6087.85 | 2.31668 | 492525 | -98.0249 | -98.0249 | 0.0562869 | 4512 |
I can also interrogate stat
to know for example the average Ip for the litho-fluid class 1 (brine sands):
stat[stat.LFC==1]['mean0']
0 6790.21 Name: mean0, dtype: object
or the variance of Vp/Vs for class 2:
stat[stat.LFC==1]['c3']
0 0.0205225 Name: c3, dtype: object
Obviously I need to remember that the first property is Ip, so that's why I am querying the column mean0
(mean1
contains the mean values for the second property, in this case Vp/Vs). I could have named in a different way the columns just to avoid any confusion, agreed.
Remember that Python works with zero-based lists and vectors, so the first one has always an index of 0. If I were working with 3 properties, e.g. Ip, Vp/Vs and density, then the average density value for a hypothetical class 5 would be: stat[stat.LFC==5]['mean2']
;
To display graphically the same information and compare brine sands (class 1) with gas sands (class 3) I use Pandas' scatter_matrix
:
i=1
for i in [1,3]:
pd.plotting.scatter_matrix(ww[ww.LFC==i].drop('LFC',1), color='black', diagonal='kde', alpha=0.1, density_kwds={'color':'#000000','lw':2})
plt.suptitle('LFC=%d' % i)
I can now use all this information to create a brand new synthetic dataset that will replicate the average behaviour of the reservoir complex and at the same time overcome typical problems when using real data like undersampling of a certain class, presence of outliers, spurious occurrence of anomalies.
To create the synthetic datasets I use a Monte Carlo simulation relying on multivariate normal distribution to draw samples that are random but correlated in the elastic domain of choice (Ip and Vp/Vs).
This is the code:
NN=300
mc=pd.DataFrame(data=None,
columns=lognames0,
index=np.arange(nlfc*NN), dtype='float')
for i in range(1,nlfc+1):
mc.loc[NN*i-NN:NN*i-1,'LFC']=i
First I define how many samples per class I want, then I create an empty Pandas DataFrame with these specs:
LFC
, IP
and VPVS
, stored in lognames0
, previously used to dimension the stat
DataFrame);Finally, I fill in the LFC
column with the numbers assigned to each class. I use the nlfc
variable that contains the total number of classes, introduced earlier when creating the stat
DataFrame; then I loop over the range 1 to nlfc
, and assign 1 to rows 1-299, 2 to 300-599, and so on.
The next code snippet is at the core of the entire procedure.
for i in range(1,nlfc+1):
mean = stat.loc[i-1,names_mean[0]:names_mean[-1]].values.astype(np.float32, copy=False)
sigma = np.reshape(stat.loc[i-1,names_cov[0]:names_cov[-1]].values,(nlogs,nlogs)).astype(np.float32, copy=False)
# print('class {}: mean={}, sigma={}'.format(i,mean,sigma))
mc.loc[mc.LFC==i,1:] = np.random.multivariate_normal(mean,sigma,NN,check_valid='ignore')
For each class (another loop over the range 1-nlfc
) I extract the average value mean
and the covariance matrix sigma
from the stat
DataFrame, then put them into the Numpy np.random.multivariate_normal
method to draw randomly selected samples from the continuous and correlated distributions of the properties Ip and Vp/Vs.
So the mc
DataFrame is now made of 3 columns (LFC
, IP
, VPVS
) and 1200 rows:
mc.describe()
LFC | IP | VPVS | |
---|---|---|---|
count | 1200.0000 | 1200.000000 | 1200.000000 |
mean | 2.5000 | 6265.112630 | 2.100294 |
std | 1.1185 | 661.104336 | 0.217301 |
min | 1.0000 | 4299.261056 | 1.548199 |
25% | 1.7500 | 5787.597321 | 1.950863 |
50% | 2.5000 | 6288.131324 | 2.074322 |
75% | 3.2500 | 6759.082611 | 2.230949 |
max | 4.0000 | 7992.950163 | 2.914140 |
And these are the results, comparing the original, augmented dataset (i.e. the results of fluid replacement merged with the insitu log, all stored in the DataFrame ww
defined earlier when calculating the statistics) with the newly created synthetic data:
f, ax = plt.subplots(nrows=1, ncols=2, sharey=True, sharex=True, figsize=(8, 4))
scatt1=ax[0].scatter(ww.IP,ww.VPVS,20,ww.LFC,marker='o',edgecolors='none', alpha=0.2, cmap=cmap_facies,vmin=0,vmax=4)
scatt2=ax[1].scatter(mc.IP,mc.VPVS,20,mc.LFC,marker='o',edgecolors='none', alpha=0.5, cmap=cmap_facies,vmin=0,vmax=4)
ax[0].set_xlim(3000,9000); ax[0].set_ylim(1.5,3.0);
ax[0].set_title('augmented well data');
ax[1].set_title('synthetic data');
for i in ax: i.grid()
One very cool feature of these IPython notebooks, is the possibility to go interactive. I have created a separate notebook available for download in the repository https://github.com/aadm/geophysical_notes
See also the awesome collection of tutorials curated by Matt Hall for The Leading Edge and available here: https://github.com/seg/tutorials