This is a quick demonstration of using ElasticSearch from Python.
Some material is taken from https://www.elastic.co/guide/en/elasticsearch/guide/current/index.html; it's a great book!
Some settings you need to change before launching a node:
cluster.name: names the cluster the node belongs to, whether or not nodes on your network auto-discover each other.
node.name: makes it easy to tell which node is in trouble.
Some more options:
bootstrap.mlockall: set to true, for performance.
network.host: set to 127.0.0.1, for security.
We use the pyelasticsearch package in this demo; it wraps the ElasticSearch RESTful API for Python.
Install it using pip install pyelasticsearch.
Import the ElasticSearch class:
from pyelasticsearch import ElasticSearch
Configure the URL of the ElasticSearch node. There are more parameters, but this is enough for now.
es = ElasticSearch('http://localhost:9200')
We check the cluster health first:
es.health()
{'active_primary_shards': 0, 'active_shards': 0, 'cluster_name': 'elasticsearch_tai-dev', 'initializing_shards': 0, 'number_of_data_nodes': 1, 'number_of_in_flight_fetch': 0, 'number_of_nodes': 1, 'number_of_pending_tasks': 0, 'relocating_shards': 0, 'status': 'green', 'timed_out': False, 'unassigned_shards': 0}
All we care about for now is the 'green' status, which means everything is OK.
Fact: the health method is a wrapper around calling GET /_cluster/health?pretty directly through the REST API.
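To make the wrapper relationship concrete, here is a minimal sketch of the same request using only the standard library. The helper names health_url and fetch_health are hypothetical, and the base URL assumes the local node used above. fetch_health needs a running node, so it is defined but not called here.

```python
import json
from urllib.request import urlopen


def health_url(base_url):
    """Build the cluster-health endpoint URL for a node (hypothetical helper)."""
    return base_url.rstrip('/') + '/_cluster/health'


def fetch_health(base_url):
    """Call the endpoint directly and decode the JSON body.

    Requires a reachable ElasticSearch node; not called in this sketch.
    """
    with urlopen(health_url(base_url)) as response:
        return json.loads(response.read().decode('utf-8'))


print(health_url('http://localhost:9200'))
```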
Before we can do anything with ElasticSearch, we need to index our documents into it.
es.index('library',  # Index name
         'books',    # Type name
         {
             'title': 'A very interesting name',
             'name': {
                 'first': 'Hugh',
                 'last': 'Jackman'
             },
             'publish_date': '2015-07-02',
             'price': 20,
         },
         id=1  # Doc ID
         )
{'_id': '1', '_index': 'library', '_type': 'books', '_version': 1, 'created': True}
es.get('library', 'books', 1)
{'_id': '1', '_index': 'library', '_source': {'name': {'first': 'Hugh', 'last': 'Jackman'}, 'price': 20, 'publish_date': '2015-07-02', 'title': 'A very interesting name'}, '_type': 'books', '_version': 1, 'found': True}
If the document does not exist, an error is raised:
try:
    es.get('library', 'books', 123)
except Exception:  # the missing document surfaces as an HTTP 404 error
    print("This is an error!")
WARNING:elasticsearch:GET /library/books/123 [status:404 request:0.004s]
This is an error!
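If a missing document is an expected case, the try/except can be wrapped in a small helper. This is only a sketch and not part of pyelasticsearch: get_or_none is a hypothetical name, and FakeClient is a stand-in used so the example runs without a server. It catches Exception broadly, like the demo above, which also hides connection problems; real code would probably narrow it to the library's not-found error.

```python
def get_or_none(client, index, doc_type, doc_id):
    """Return the document, or None if the lookup raises (hypothetical helper)."""
    try:
        return client.get(index, doc_type, doc_id)
    except Exception:
        return None


class FakeClient:
    """Stand-in for the real client so the sketch runs without a server."""

    def __init__(self, docs):
        self._docs = docs

    def get(self, index, doc_type, doc_id):
        # Mimic the real client: a missing document raises an error.
        return self._docs[(index, doc_type, doc_id)]


client = FakeClient({('library', 'books', 1): {'_id': '1', 'found': True}})
print(get_or_none(client, 'library', 'books', 1))
print(get_or_none(client, 'library', 'books', 123))
```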
The ID is optional; if you omit it, ElasticSearch generates one (an ugly one):
es.index('library',  # Index name
         'books',    # Type name
         {
             'title': 'Another interesting name',
             'name': {
                 'first': 'Tom',
                 'last': 'Cruise'
             },
             'publish_date': '2015-08-02',
             'price': 21,
         },
         )
{'_id': 'AU5N-j9DhyYCAHYcFB3R', '_index': 'library', '_type': 'books', '_version': 1, 'created': True}
Get me that book:
es.get('library', 'books', 'AU5N-j9DhyYCAHYcFB3R')
{'_id': 'AU5N-j9DhyYCAHYcFB3R', '_index': 'library', '_source': {'name': {'first': 'Tom', 'last': 'Cruise'}, 'price': 21, 'publish_date': '2015-08-02', 'title': 'Another interesting name'}, '_type': 'books', '_version': 1, 'found': True}
es.update('library',  # Index name
          'books',    # Type name
          id=1,       # Doc ID
          doc={
              'title': 'A very interesting name 2',
              'name': {
                  'first': 'Hugh',
                  'last': 'Jackman'
              },
              'publish_date': '2015-07-03',
              'price': 30,
          },
          )
{'_id': '1', '_index': 'library', '_type': 'books', '_version': 2}
It worked, though the method is kind of ugly.
es.get('library', 'books', 1)
{'_id': '1', '_index': 'library', '_source': {'name': {'first': 'Hugh', 'last': 'Jackman'}, 'price': 30, 'publish_date': '2015-07-03', 'title': 'A very interesting name 2'}, '_type': 'books', '_version': 2, 'found': True}
The update method performs a partial update; only the fields given in doc are changed:
es.update('library',  # Index name
          'books',    # Type name
          id=1,       # Doc ID
          doc={
              'price': 90,
          },
          )
{'_id': '1', '_index': 'library', '_type': 'books', '_version': 3}
es.get('library', 'books', 1)
{'_id': '1', '_index': 'library', '_source': {'name': {'first': 'Hugh', 'last': 'Jackman'}, 'price': 90, 'publish_date': '2015-07-03', 'title': 'A very interesting name 2'}, '_type': 'books', '_version': 3, 'found': True}
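As a rough mental model, a partial update merges the given doc into the stored _source, recursing into nested objects and overwriting plain values. The sketch below imitates that merge in plain Python; apply_partial_update is a hypothetical name, and the real merge happens on the server, so treat this as an illustration only.

```python
import copy


def apply_partial_update(source, doc):
    """Return a new dict with doc merged into source (toy model of the merge)."""
    merged = copy.deepcopy(source)
    for key, value in doc.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Nested objects are merged field by field.
            merged[key] = apply_partial_update(merged[key], value)
        else:
            # Plain values (and new fields) simply replace what was there.
            merged[key] = copy.deepcopy(value)
    return merged


book = {
    'title': 'A very interesting name 2',
    'name': {'first': 'Hugh', 'last': 'Jackman'},
    'publish_date': '2015-07-03',
    'price': 30,
}
updated = apply_partial_update(book, {'price': 90})
print(updated['price'], updated['title'])
```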
es.delete('library', 'books', 1)
{'_id': '1', '_index': 'library', '_type': 'books', '_version': 4, 'found': True}
try:
    es.get('library', 'books', 1)
except Exception:  # the deleted document now yields an HTTP 404 error
    print('Not found!')
WARNING:elasticsearch:GET /library/books/1 [status:404 request:0.004s]
Not found!
Input data:
users = [{ "email" : "john@smith.com", "name" : "John Smith", "username" : "@john" },
{ "email" : "mary@jones.com", "name" : "Mary Jones", "username" : "@mary" }]
tweet = [{ "date" : "2014-09-13", "name" : "Mary Jones", "tweet" : "Elasticsearch means full text search has never been so easy", "user_id" : 2 },
{ "date" : "2014-09-14", "name" : "John Smith", "tweet" : "@mary it is not just text, it does everything", "user_id" : 1 },
{ "date" : "2014-09-15", "name" : "Mary Jones", "tweet" : "However did I manage before Elasticsearch?", "user_id" : 2 },
{ "date" : "2014-09-16", "name" : "John Smith", "tweet" : "The Elasticsearch API is really easy to use", "user_id" : 1 },
{ "date" : "2014-09-17", "name" : "Mary Jones", "tweet" : "The Query DSL is really powerful and flexible", "user_id" : 2 }]
Bulk indexing:
es.bulk((es.index_op(user, id=i) for i, user in enumerate(users)),
        index='demo',
        doc_type='user')
{'errors': False, 'items': [{'index': {'_id': '0', '_index': 'demo', '_type': 'user', '_version': 1, 'status': 201}}, {'index': {'_id': '1', '_index': 'demo', '_type': 'user', '_version': 1, 'status': 201}}], 'took': 886}
es.bulk((es.index_op(t, id=i) for i, t in enumerate(tweet)),
        index='demo',
        doc_type='tweet')
{'errors': False, 'items': [{'index': {'_id': '0', '_index': 'demo', '_type': 'tweet', '_version': 1, 'status': 201}}, {'index': {'_id': '1', '_index': 'demo', '_type': 'tweet', '_version': 1, 'status': 201}}, {'index': {'_id': '2', '_index': 'demo', '_type': 'tweet', '_version': 1, 'status': 201}}, {'index': {'_id': '3', '_index': 'demo', '_type': 'tweet', '_version': 1, 'status': 201}}, {'index': {'_id': '4', '_index': 'demo', '_type': 'tweet', '_version': 1, 'status': 201}}], 'took': 53}
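For reference, the bulk API itself takes newline-delimited JSON: an action line followed by a source line for each document. The sketch below builds roughly that kind of body for the two users; bulk_body is a hypothetical helper, and the exact bytes pyelasticsearch sends may differ (the index and type travel separately, as in the calls above).

```python
import json


def bulk_body(docs):
    """Build a newline-delimited bulk body that indexes docs with IDs 0, 1, ..."""
    lines = []
    for doc_id, doc in enumerate(docs):
        lines.append(json.dumps({'index': {'_id': doc_id}}))  # action line
        lines.append(json.dumps(doc))                         # source line
    # The bulk format requires a trailing newline.
    return '\n'.join(lines) + '\n'


users = [
    {'email': 'john@smith.com', 'name': 'John Smith', 'username': '@john'},
    {'email': 'mary@jones.com', 'name': 'Mary Jones', 'username': '@mary'},
]
body = bulk_body(users)
print(body)
```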
es.search({})
{'_shards': {'failed': 0, 'successful': 5, 'total': 5}, 'hits': {'hits': [{'_id': '4', '_index': 'demo', '_score': 1.0, '_source': {'date': '2014-09-17', 'name': 'Mary Jones', 'tweet': 'The Query DSL is really powerful and flexible', 'user_id': 2}, '_type': 'tweet'}, {'_id': '0', '_index': 'demo', '_score': 1.0, '_source': {'email': 'john@smith.com', 'name': 'John Smith', 'username': '@john'}, '_type': 'user'}, {'_id': '0', '_index': 'demo', '_score': 1.0, '_source': {'date': '2014-09-13', 'name': 'Mary Jones', 'tweet': 'Elasticsearch means full text search has never been so easy', 'user_id': 2}, '_type': 'tweet'}, {'_id': '1', '_index': 'demo', '_score': 1.0, '_source': {'email': 'mary@jones.com', 'name': 'Mary Jones', 'username': '@mary'}, '_type': 'user'}, {'_id': '1', '_index': 'demo', '_score': 1.0, '_source': {'date': '2014-09-14', 'name': 'John Smith', 'tweet': '@mary it is not just text, it does everything', 'user_id': 1}, '_type': 'tweet'}, {'_id': '2', '_index': 'demo', '_score': 1.0, '_source': {'date': '2014-09-15', 'name': 'Mary Jones', 'tweet': 'However did I manage before Elasticsearch?', 'user_id': 2}, '_type': 'tweet'}, {'_id': '3', '_index': 'demo', '_score': 1.0, '_source': {'date': '2014-09-16', 'name': 'John Smith', 'tweet': 'The Elasticsearch API is really easy to use', 'user_id': 1}, '_type': 'tweet'}], 'max_score': 1.0, 'total': 7}, 'timed_out': False, 'took': 6}
Simple match:
es.search('name:john', index='demo')
{'_shards': {'failed': 0, 'successful': 5, 'total': 5}, 'hits': {'hits': [{'_id': '0', '_index': 'demo', '_score': 0.625, '_source': {'email': 'john@smith.com', 'name': 'John Smith', 'username': '@john'}, '_type': 'user'}, {'_id': '1', '_index': 'demo', '_score': 0.625, '_source': {'date': '2014-09-14', 'name': 'John Smith', 'tweet': '@mary it is not just text, it does everything', 'user_id': 1}, '_type': 'tweet'}, {'_id': '3', '_index': 'demo', '_score': 0.19178301, '_source': {'date': '2014-09-16', 'name': 'John Smith', 'tweet': 'The Elasticsearch API is really easy to use', 'user_id': 1}, '_type': 'tweet'}], 'max_score': 0.625, 'total': 3}, 'timed_out': False, 'took': 282}
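A string query like 'name:john' uses the query-string search syntax, which is passed as the q parameter of the _search endpoint. As a sketch of what the equivalent raw URL might look like, assuming the local node used above (lite_search_url is a hypothetical helper):

```python
from urllib.parse import urlencode


def lite_search_url(base_url, index, query_string):
    """Build a query-string search URL for one index (hypothetical helper)."""
    return '{}/{}/_search?{}'.format(
        base_url.rstrip('/'), index, urlencode({'q': query_string}))


print(lite_search_url('http://localhost:9200', 'demo', 'name:john'))
```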
Then there is the Query API. We can avoid it for a while, but we can't escape it:
query = {
    'query': {'match': {'name': 'john'}}
}
es.search(query, index='demo')
{'_shards': {'failed': 0, 'successful': 5, 'total': 5}, 'hits': {'hits': [{'_id': '0', '_index': 'demo', '_score': 0.625, '_source': {'email': 'john@smith.com', 'name': 'John Smith', 'username': '@john'}, '_type': 'user'}, {'_id': '1', '_index': 'demo', '_score': 0.625, '_source': {'date': '2014-09-14', 'name': 'John Smith', 'tweet': '@mary it is not just text, it does everything', 'user_id': 1}, '_type': 'tweet'}, {'_id': '3', '_index': 'demo', '_score': 0.19178301, '_source': {'date': '2014-09-16', 'name': 'John Smith', 'tweet': 'The Elasticsearch API is really easy to use', 'user_id': 1}, '_type': 'tweet'}], 'max_score': 0.625, 'total': 3}, 'timed_out': False, 'took': 9}
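The raw responses are verbose, so a tiny helper can make them easier to read. This is a sketch: summarize_hits is a hypothetical name, and the sample below is a trimmed copy of the response above (two of the three hits, with _source shortened).

```python
def summarize_hits(response):
    """Return (type, id, score) for each hit in a search response dict."""
    return [(hit['_type'], hit['_id'], hit['_score'])
            for hit in response['hits']['hits']]


# Trimmed copy of the response shown above.
sample = {
    'hits': {
        'hits': [
            {'_id': '0', '_index': 'demo', '_score': 0.625, '_type': 'user',
             '_source': {'name': 'John Smith'}},
            {'_id': '3', '_index': 'demo', '_score': 0.19178301, '_type': 'tweet',
             '_source': {'name': 'John Smith'}},
        ],
        'max_score': 0.625,
    },
}
for row in summarize_hits(sample):
    print(row)
```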
How about 2 terms?
query = {
    'query': {'match': {'name': 'john mary'}}
}
es.search(query, index='demo')
{'_shards': {'failed': 0, 'successful': 5, 'total': 5}, 'hits': {'hits': [{'_id': '0', '_index': 'demo', '_score': 0.22097087, '_source': {'email': 'john@smith.com', 'name': 'John Smith', 'username': '@john'}, '_type': 'user'}, {'_id': '0', '_index': 'demo', '_score': 0.22097087, '_source': {'date': '2014-09-13', 'name': 'Mary Jones', 'tweet': 'Elasticsearch means full text search has never been so easy', 'user_id': 2}, '_type': 'tweet'}, {'_id': '1', '_index': 'demo', '_score': 0.22097087, '_source': {'email': 'mary@jones.com', 'name': 'Mary Jones', 'username': '@mary'}, '_type': 'user'}, {'_id': '1', '_index': 'demo', '_score': 0.22097087, '_source': {'date': '2014-09-14', 'name': 'John Smith', 'tweet': '@mary it is not just text, it does everything', 'user_id': 1}, '_type': 'tweet'}, {'_id': '4', '_index': 'demo', '_score': 0.028130025, '_source': {'date': '2014-09-17', 'name': 'Mary Jones', 'tweet': 'The Query DSL is really powerful and flexible', 'user_id': 2}, '_type': 'tweet'}, {'_id': '2', '_index': 'demo', '_score': 0.028130025, '_source': {'date': '2014-09-15', 'name': 'Mary Jones', 'tweet': 'However did I manage before Elasticsearch?', 'user_id': 2}, '_type': 'tweet'}, {'_id': '3', '_index': 'demo', '_score': 0.028130025, '_source': {'date': '2014-09-16', 'name': 'John Smith', 'tweet': 'The Elasticsearch API is really easy to use', 'user_id': 1}, '_type': 'tweet'}], 'max_score': 0.22097087, 'total': 7}, 'timed_out': False, 'took': 179}
And a phrase?
query = {
    'query': {'match_phrase': {'name': 'john mary'}}
}
es.search(query, index='demo')
{'_shards': {'failed': 0, 'successful': 5, 'total': 5}, 'hits': {'hits': [], 'max_score': None, 'total': 0}, 'timed_out': False, 'took': 147}
Unlike the get method, search does not raise an error when nothing matches; it just returns an empty list of hits, which is much less scary.
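The difference between the last two queries can be imitated in plain Python. In this toy model, match succeeds if any query term appears in the field, while match_phrase needs the terms next to each other and in order. The functions below are illustrative stand-ins with simplified whitespace tokenization, not how ElasticSearch is implemented.

```python
def tokens(text):
    """Very rough tokenizer: lowercase, split on whitespace."""
    return text.lower().split()


def match_any(field_value, query):
    """Toy 'match': true if any query term appears in the field."""
    return bool(set(tokens(query)) & set(tokens(field_value)))


def match_phrase(field_value, query):
    """Toy 'match_phrase': true if the query terms appear consecutively, in order."""
    field, wanted = tokens(field_value), tokens(query)
    return any(field[i:i + len(wanted)] == wanted
               for i in range(len(field) - len(wanted) + 1))


names = ['John Smith', 'Mary Jones']
print([n for n in names if match_any(n, 'john mary')])
print([n for n in names if match_phrase(n, 'john mary')])
print([n for n in names if match_phrase(n, 'john smith')])
```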
We can write boolean combinations with must, must_not and should:
Does John Smith mention "API" in his tweet?
query = {
    "query": {
        "bool": {
            "must": [
                {
                    "match_phrase": {
                        "name": "john smith"
                    }
                },
                {
                    "match": {
                        "tweet": "API"
                    }
                }
            ]
        }
    }
}
es.search(query, index='demo')
{'_shards': {'failed': 0, 'successful': 5, 'total': 5}, 'hits': {'hits': [{'_id': '3', '_index': 'demo', '_score': 0.38595587, '_source': {'date': '2014-09-16', 'name': 'John Smith', 'tweet': 'The Elasticsearch API is really easy to use', 'user_id': 1}, '_type': 'tweet'}], 'max_score': 0.38595587, 'total': 1}, 'timed_out': False, 'took': 13}
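Deeply nested dictionaries get hard to read, so one option is a small builder function. The sketch below is an assumption of mine rather than anything pyelasticsearch provides: bool_query is a hypothetical helper that assembles the same structure as the query above.

```python
def bool_query(must=None, should=None, must_not=None):
    """Assemble a bool query body, including only the clause lists given."""
    clauses = {}
    if must:
        clauses['must'] = list(must)
    if should:
        clauses['should'] = list(should)
    if must_not:
        clauses['must_not'] = list(must_not)
    return {'query': {'bool': clauses}}


query = bool_query(must=[
    {'match_phrase': {'name': 'john smith'}},
    {'match': {'tweet': 'API'}},
])
print(query)
```

The resulting dictionary can be passed to es.search(query, index='demo') exactly like the hand-written one.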
We can rank the importance of the statements in a combination using the boost field.
We try it with 'DSL' and 'API':
query = {
    "query": {
        "bool": {
            "should": [
                {
                    "match": {
                        "tweet": {
                            "query": "DSL",
                            "boost": 5,
                        }
                    }
                },
                {
                    "match": {
                        "tweet": "API"
                    }
                }
            ]
        }
    }
}
es.search(query, index='demo')
{'_shards': {'failed': 0, 'successful': 5, 'total': 5}, 'hits': {'hits': [{'_id': '4', '_index': 'demo', '_score': 0.04016714, '_source': {'date': '2014-09-17', 'name': 'Mary Jones', 'tweet': 'The Query DSL is really powerful and flexible', 'user_id': 2}, '_type': 'tweet'}, {'_id': '3', '_index': 'demo', '_score': 0.0029369325, '_source': {'date': '2014-09-16', 'name': 'John Smith', 'tweet': 'The Elasticsearch API is really easy to use', 'user_id': 1}, '_type': 'tweet'}], 'max_score': 0.04016714, 'total': 2}, 'timed_out': False, 'took': 10}
Now change the boost, and the order changes:
query = {
    "query": {
        "bool": {
            "should": [
                {
                    "match": {
                        "tweet": {
                            "query": "DSL",
                            "boost": 0.5,
                        }
                    }
                },
                {
                    "match": {
                        "tweet": {
                            "query": "API"
                        }
                    }
                }
            ]
        }
    }
}
es.search(query, index='demo')
{'_shards': {'failed': 0, 'successful': 5, 'total': 5}, 'hits': {'hits': [{'_id': '3', '_index': 'demo', '_score': 0.025078464, '_source': {'date': '2014-09-16', 'name': 'John Smith', 'tweet': 'The Elasticsearch API is really easy to use', 'user_id': 1}, '_type': 'tweet'}, {'_id': '4', '_index': 'demo', '_score': 0.0072710635, '_source': {'date': '2014-09-17', 'name': 'Mary Jones', 'tweet': 'The Query DSL is really powerful and flexible', 'user_id': 2}, '_type': 'tweet'}], 'max_score': 0.025078464, 'total': 2}, 'timed_out': False, 'took': 9}
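Note that a match clause has two shapes in the queries above: a short form (field mapped to text) and a long form (field mapped to a dict with query and boost). A sketch of a helper that picks the right shape, with the hypothetical name match_clause:

```python
def match_clause(field, text, boost=None):
    """Build a match clause; use the long form only when a boost is given."""
    if boost is None:
        return {'match': {field: text}}
    return {'match': {field: {'query': text, 'boost': boost}}}


should = [
    match_clause('tweet', 'DSL', boost=0.5),
    match_clause('tweet', 'API'),
]
query = {'query': {'bool': {'should': should}}}
print(query)
```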
Highlight the result:
query = {
    "query": {
        "bool": {
            "must": [
                {
                    "match_phrase": {
                        "name": "john smith"
                    }
                },
                {
                    "match": {
                        "tweet": "API"
                    }
                }
            ]
        }
    },
    "highlight": {
        "fields": {
            "tweet": {}
        }
    }
}
es.search(query, index='demo')
{'_shards': {'failed': 0, 'successful': 5, 'total': 5}, 'hits': {'hits': [{'_id': '3', '_index': 'demo', '_score': 0.38595587, '_source': {'date': '2014-09-16', 'name': 'John Smith', 'tweet': 'The Elasticsearch API is really easy to use', 'user_id': 1}, '_type': 'tweet', 'highlight': {'tweet': ['The Elasticsearch <em>API</em> is really easy to use']}}], 'max_score': 0.38595587, 'total': 1}, 'timed_out': False, 'took': 15}
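The highlighted fragments sit under each hit's highlight key, next to _source. A minimal sketch for collecting them, where highlights is a hypothetical helper and the sample is a trimmed copy of the response above:

```python
def highlights(response, field):
    """Collect the highlighted fragments of one field across all hits."""
    fragments = []
    for hit in response['hits']['hits']:
        # Hits without a highlight for this field contribute nothing.
        fragments.extend(hit.get('highlight', {}).get(field, []))
    return fragments


# Trimmed copy of the response shown above.
sample = {
    'hits': {
        'hits': [
            {'_id': '3', '_type': 'tweet',
             'highlight': {
                 'tweet': ['The Elasticsearch <em>API</em> is really easy to use']}},
        ],
    },
}
print(highlights(sample, 'tweet'))
```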
Find all tweets posted after '2014-09-15':
query = {
    "query": {
        "filtered": {
            "filter": {
                "range": {
                    "date": {
                        "gt": "2014-09-15"
                    }
                }
            }
        }
    }
}
es.search(query, index='demo')
{'_shards': {'failed': 0, 'successful': 5, 'total': 5}, 'hits': {'hits': [{'_id': '4', '_index': 'demo', '_score': 1.0, '_source': {'date': '2014-09-17', 'name': 'Mary Jones', 'tweet': 'The Query DSL is really powerful and flexible', 'user_id': 2}, '_type': 'tweet'}, {'_id': '3', '_index': 'demo', '_score': 1.0, '_source': {'date': '2014-09-16', 'name': 'John Smith', 'tweet': 'The Elasticsearch API is really easy to use', 'user_id': 1}, '_type': 'tweet'}], 'max_score': 1.0, 'total': 2}, 'timed_out': False, 'took': 8}
How about listing only John Smith's tweets posted after 2014-09-15?
query = {
    "query": {
        "filtered": {
            "query": {
                "match_phrase": {
                    "name": "John Smith"
                }
            },
            "filter": {
                "range": {
                    "date": {
                        "gt": "2014-09-15"
                    }
                }
            }
        }
    }
}
es.search(query, index='demo')
{'_shards': {'failed': 0, 'successful': 5, 'total': 5}, 'hits': {'hits': [{'_id': '3', '_index': 'demo', '_score': 0.38356602, '_source': {'date': '2014-09-16', 'name': 'John Smith', 'tweet': 'The Elasticsearch API is really easy to use', 'user_id': 1}, '_type': 'tweet'}], 'max_score': 0.38356602, 'total': 1}, 'timed_out': False, 'took': 12}
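A note for readers on newer servers: the filtered query used above was deprecated in ElasticSearch 2.0 and removed in 5.0, where a bool query with a filter clause takes its place. The sketch below rewrites one form into the other; filtered_to_bool is a hypothetical helper, it only handles the simple shape used in this demo, and scores for filter-only queries may differ between versions.

```python
def filtered_to_bool(body):
    """Rewrite {'query': {'filtered': ...}} into the bool/filter form."""
    filtered = body['query']['filtered']
    bool_clauses = {}
    if 'query' in filtered:
        bool_clauses['must'] = filtered['query']     # scored part
    if 'filter' in filtered:
        bool_clauses['filter'] = filtered['filter']  # unscored yes/no part
    rewritten = dict(body)  # shallow copy; the input body is left untouched
    rewritten['query'] = {'bool': bool_clauses}
    return rewritten


old_query = {
    'query': {
        'filtered': {
            'query': {'match_phrase': {'name': 'John Smith'}},
            'filter': {'range': {'date': {'gt': '2014-09-15'}}},
        }
    }
}
new_query = filtered_to_bool(old_query)
print(new_query)
```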
All the fancy things above worked mostly because of Analysis.
Analysis = Tokenization + Token filters
Analyzer = Character filters + Tokenizer + Token filters
Analyzers are language-specific. As of July 2015, Vietnamese is not supported, so we won't go into much detail here.
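To give a feel for the pipeline (character filters, then a tokenizer, then token filters), here is a toy analyzer in plain Python. Everything in it is an assumption for illustration: the function names, the HTML-stripping character filter, and the three-word stopword list are made up and do not reproduce any built-in ElasticSearch analyzer.

```python
import re

STOPWORDS = {'the', 'is', 'to'}  # tiny made-up list, for illustration only


def strip_html(text):
    """Character filter: replace HTML tags with spaces before tokenizing."""
    return re.sub(r'<[^>]+>', ' ', text)


def tokenize(text):
    """Tokenizer: split the text into word tokens."""
    return re.findall(r'\w+', text)


def lowercase(tokens):
    """Token filter: normalize case so 'API' and 'api' match."""
    return [token.lower() for token in tokens]


def remove_stopwords(tokens):
    """Token filter: drop very common words."""
    return [token for token in tokens if token not in STOPWORDS]


def analyze(text):
    """Analyzer = character filters + tokenizer + token filters."""
    return remove_stopwords(lowercase(tokenize(strip_html(text))))


print(analyze('The Elasticsearch <em>API</em> is really easy to use'))
```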
A mapping is a kind of schema in ElasticSearch. It is generated automatically if we don't customize it.
es.get_mapping('demo', 'tweet')
{'demo': {'mappings': {'tweet': {'properties': {'date': {'format': 'dateOptionalTime', 'type': 'date'}, 'name': {'type': 'string'}, 'tweet': {'type': 'string'}, 'user_id': {'type': 'long'}}}}}}
We can add a new field using the put_mapping method:
es.put_mapping('demo', 'tweet',
               {'tweet':
                   {'properties':
                       {'very_new_field': {'type': 'string'}}}})
es.get_mapping('demo', 'tweet')
{'demo': {'mappings': {'tweet': {'properties': {'date': {'format': 'dateOptionalTime', 'type': 'date'}, 'name': {'type': 'string'}, 'tweet': {'type': 'string'}, 'user_id': {'type': 'long'}, 'very_new_field': {'type': 'string'}}}}}}
We can't change the mapping of an existing field, though:
try:
    es.put_mapping('demo', 'tweet',
                   {'tweet':
                       {'properties':
                           {'very_new_field': {'type': 'long'}}}})
except Exception:  # the conflicting type is rejected with HTTP 400
    print("Error")
WARNING:elasticsearch:PUT /demo/tweet/_mapping [status:400 request:0.068s]
Error
So if the field types matter to you, specify your mapping before indexing to make sure things go the way you want.
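As a sketch of what specifying the mapping up front could look like, the dictionary below mirrors the auto-generated tweet mapping shown earlier (without the date format detail). The commented-out call assumes pyelasticsearch's create_index accepts index settings that include mappings; it needs a running node and an index that does not exist yet, so it is left as a comment.

```python
# Explicit mapping for the tweet type, mirroring the generated one above.
tweet_mapping = {
    'tweet': {
        'properties': {
            'date': {'type': 'date'},
            'name': {'type': 'string'},
            'tweet': {'type': 'string'},
            'user_id': {'type': 'long'},
        }
    }
}

# Index settings carrying the mapping, to be sent when the index is created.
index_settings = {'mappings': tweet_mapping}

# Assumed usage, requires a running node and a not-yet-existing index:
# es.create_index('demo', settings=index_settings)

print(sorted(index_settings['mappings']['tweet']['properties']))
```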