Can we extract text from images using Tesseract, then process that text for lithological data?
We'll use a Python wrapper for Tesseract.
This library looks cool but I can't get it to work.
I also tried a lightweight one, pytesseract, but it failed with an image error.
Let's try using PyOCR.
from PIL import Image
import pyocr
import pyocr.builders
tools = pyocr.get_available_tools()
tool = tools[0]
tool.get_available_languages()
['eng']
text = tool.image_to_string(Image.open('Samples.png'), builder=pyocr.builders.TextBuilder())
print(text)
0.0 m 15.0 m 30.0 m 75.0 In 80.0 m 85.0 m ~ 15.0 m 30.0 75.0 80.0 85.0 90.0 m SAMPLE AND CORE DESCRIPTIONS Chevron Irving Bras d’0r #Z B! A. Berti No samples. Drift, subangular to subroundad small pebbles of quartz, granite, metesediment, ett. No samples . Fractured anhydrite and dolomite in more or less equal amounts, fractures 2-3 mm apart, <l mm wide, healed with nnhydrite and/nr gypsum; lOZ of sample is white to milky-white gypsum; anhydrite l#him to light grey microcrystslline to medium grained, dolomite, light grey-brown to brown, earthy to finely crystalline, trace ZZ porosity, no fluorescence. Fractured anhydrite and dolomite as above with about 102 gypsum in sample; crate smell, nan-ew black arte-ks in dolomite (Totganic or bitumen). Anhydrite, white to light tu medium brown with irregular stringera dolomite .tattered throughout rather than fractured as above, dolomite, light brown, miorocryatalline, trato porosity.
This is the performance with no training.
If training is required, this might help.
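Short of training, some of the misreads above (for example 'nnhydrite' and 'microcrystslline') could perhaps be patched after the fact by fuzzy-matching each word against a list of expected terms. The sketch below uses `difflib` from the standard library. The vocabulary `LITHOLOGY_TERMS`, the helper names `correct_word` and `correct_text`, and the 0.8 cutoff are all assumptions made for illustration, not something tested on the full log.

```python
import difflib
import re

# Hypothetical vocabulary; a real one would come from a lithology dictionary.
LITHOLOGY_TERMS = ['anhydrite', 'dolomite', 'gypsum', 'granite', 'quartz',
                   'bitumen', 'porosity', 'microcrystalline', 'metasediment']


def correct_word(word, vocabulary=LITHOLOGY_TERMS, cutoff=0.8):
    """Return the closest vocabulary term, or the word unchanged if none is close.

    Note that a corrected word comes back in lowercase, as stored in the vocabulary.
    """
    matches = difflib.get_close_matches(word.lower(), vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else word


def correct_text(text, vocabulary=LITHOLOGY_TERMS, cutoff=0.8):
    """Apply correct_word to every alphabetic token; leave digits and punctuation alone."""
    return re.sub(r'[A-Za-z]+',
                  lambda m: correct_word(m.group(0), vocabulary, cutoff),
                  text)


sample = 'healed with nnhydrite and/nr gypsum; anhydrite to microcrystslline'
print(correct_text(sample))
```

A lower cutoff would catch more misreads but risks rewriting ordinary words into lithology terms, so the threshold would need checking against real output.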
This Mashape app claims to do text recognition. For all I know, it is also using Tesseract.
import unirest
response = unirest.post("https://imagevision-text-search-v1.p.mashape.com/textSearch/detectText",
    headers={
        "X-Mashape-Key": "PUT MASHAPE KEY HERE",
        "Content-Type": "application/x-www-form-urlencoded"
    },
    params={
        "objecturl": "https://dl.dropboxusercontent.com/u/14965965/Samples.png"
    }
)
response.body
{u'imageDimensions': [860, 686], u'message': u'SUCCESS', u'objectUrl': u'https://dl.dropboxusercontent.com/u/14965965/Samples.png', u'resource': u'detectText', u'text': [{u'id': 1, u'textCoordinates': [69, 259, 134, 274], u'textString': u'300m'}], u'textDetected': True, u'transactionId': u'1234', u'version': u'IV-1.2.22.29'}
OK, that didn't work very well: it detected only a single string, '300m'.
Next, let's try extracting hashtags from the Tesseract text, using the Aylien app on Mashape.
import urllib
s = urllib.quote_plus(text)
base_url = "https://aylien-text.p.mashape.com/hashtags?text="
headers = {"X-Mashape-Authorization": "PUT MASHAPE KEY HERE"}
r = unirest.get(base_url+s, headers=headers)
r.body['hashtags']
[u'#Dolomite', u'#Anhydrite', u'#Gypsum', u'#Porosity', u'#Bitumen', u'#Granite', u'#Fluorescence', u'#Crystal']
parts = text.split('\n\n')
parts
['\n0.0 m', '15.0 m', '30.0 m', '75.0 In', '80.0 m', '85.0 m ~', '15.0 m', '30.0', '75.0', '80.0', '85.0', '90.0 m', 'SAMPLE AND CORE DESCRIPTIONS', 'Chevron Irving Bras d\xe2\x80\x990r #Z\nB! A. Berti', 'No samples.', 'Drift, subangular to subroundad small pebbles of\nquartz, granite, metesediment, ett.', 'No samples .', 'Fractured anhydrite and dolomite in more or less\nequal amounts, fractures 2-3 mm apart, <l mm\nwide, healed with nnhydrite and/nr gypsum; lOZ of\nsample is white to milky-white gypsum; anhydrite\nl#him to light grey microcrystslline to medium\ngrained, dolomite, light grey-brown to brown,\nearthy to finely crystalline, trace ZZ porosity,\nno fluorescence.', 'Fractured anhydrite and dolomite as above with\nabout 102 gypsum in sample; crate smell, nan-ew\nblack arte-ks in dolomite (Totganic or bitumen).', 'Anhydrite, white to light tu medium brown with\nirregular stringera dolomite .tattered throughout\nrather than fractured as above, dolomite, light\nbrown, miorocryatalline, trato porosity.\n']
results = [unirest.get(base_url+urllib.quote_plus(e), headers=headers) for e in parts]
hashtags = [r.body['hashtags'] for r in results]
hashtags
[[], [], [], [], [], [], [], [], [], [], [], [], [u'#CORE', u'#CongressOfRacialEquality'], [u'#ChevronCorporation'], [], [u'#Granite'], [], [u'#Dolomite', u'#Anhydrite', u'#Gypsum', u'#Porosity', u'#Fluorescence', u'#Crystal'], [u'#Dolomite', u'#Anhydrite', u'#Gypsum', u'#Bitumen'], [u'#Dolomite', u'#Anhydrite', u'#Porosity']]
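The split parts also seem to carry the interval structure of the log: a column of top depths, then a column of base depths, then a header, then one description per interval. Here is a minimal sketch of pairing them up. It assumes that the first half of the depth-like parts are tops and the second half are bases, and that the last non-depth parts are the descriptions. The names `DEPTH` and `pair_intervals` are made up for this example, and `sample_parts` is an abbreviated copy of the list above.

```python
import re

# A part counts as a depth if it is a decimal number followed by at most a few
# short tokens: the unit, or OCR noise such as 'In' or '~'.
DEPTH = re.compile(r'^\s*(\d+\.\d+)(?:\s+\S{1,2})*\s*$')


def pair_intervals(parts):
    """Pair (top, base) depths with descriptions.

    Assumption: the first half of the depth values are tops, the second half
    are bases, and the last len(tops) non-depth parts are the descriptions.
    """
    depths, others = [], []
    for part in parts:
        match = DEPTH.match(part)
        if match:
            depths.append(float(match.group(1)))
        else:
            others.append(' '.join(part.split()))  # collapse line breaks
    n = len(depths) // 2
    tops, bases = depths[:n], depths[n:2 * n]
    descriptions = others[-n:] if n else []
    return list(zip(tops, bases, descriptions))


# Abbreviated copy of the parts list, for illustration only.
sample_parts = ['\n0.0 m', '15.0 m', '30.0 m', '15.0 m', '30.0', '75.0',
                'SAMPLE AND CORE DESCRIPTIONS', 'No samples.',
                'Drift, subangular to subroundad small pebbles of\n'
                'quartz, granite, metesediment, ett.',
                'No samples .']

for top, base, description in pair_intervals(sample_parts):
    print('%.1f-%.1f m: %s' % (top, base, description))
```

This relies entirely on the column order that Tesseract happened to produce for this image, so a differently laid-out page would need its own pairing rule.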