Demo using ElasticSearch in Python¶

This is a quick demonstration of using ElasticSearch from Python.

Some material is taken from the guide at https://www.elastic.co/guide/en/elasticsearch/guide/current/index.html (it's a great book!).

elasticsearch.yml¶

Some changes you need to make before launching a node:

Change cluster.name to control which nodes on your network auto-discover each other and join the same cluster
Change node.name so you can easily tell which node is in trouble

Some more options:

Lock the memory by setting bootstrap.mlockall to true, for performance
Set network.host to 127.0.0.1, for security
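
As a rough sketch, the four settings above could end up in elasticsearch.yml as plain key: value lines. The snippet below only renders them from a Python dict. The cluster and node names are made-up placeholders, and the setting names are the ones used by the ElasticSearch version in this demo.

```python
# Hypothetical values: pick your own cluster and node names.
settings = {
    'cluster.name': 'my-demo-cluster',   # placeholder name
    'node.name': 'node-1',               # placeholder name
    'bootstrap.mlockall': True,          # lock the process memory
    'network.host': '127.0.0.1',         # listen on localhost only
}

def to_yaml_lines(settings):
    """Render flat settings as 'key: value' lines for elasticsearch.yml."""
    lines = []
    for key, value in settings.items():
        if isinstance(value, bool):
            value = 'true' if value else 'false'  # YAML booleans are lowercase
        lines.append('{}: {}'.format(key, value))
    return lines

print('\n'.join(to_yaml_lines(settings)))
```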

pyelasticsearch¶

In this demo we use the pyelasticsearch package, which wraps the ElasticSearch RESTful API for Python.

Install it using pip install pyelasticsearch

Set things up¶

Import the ElasticSearch class.

In [ ]:

from pyelasticsearch import ElasticSearch

Configure the URL of the ElasticSearch node. There are more parameters, but the defaults are good enough for now.

In [2]:

es = ElasticSearch('http://localhost:9200')

Check the cluster health first:

In [11]:

es.health()

Out[11]:

{'active_primary_shards': 0,
 'active_shards': 0,
 'cluster_name': 'elasticsearch_tai-dev',
 'initializing_shards': 0,
 'number_of_data_nodes': 1,
 'number_of_in_flight_fetch': 0,
 'number_of_nodes': 1,
 'number_of_pending_tasks': 0,
 'relocating_shards': 0,
 'status': 'green',
 'timed_out': False,
 'unassigned_shards': 0}

All we care about for now is the 'green' status, which means everything is OK.

Fact: the health method is a wrapper for calling GET /_cluster/health?pretty directly through the REST API.
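
To make the wrapper idea concrete, here is a minimal sketch of the two pieces involved: the URL that the health check ends up requesting, and a check on the returned dict. The helper names cluster_health_url and is_green are invented for this illustration and are not part of pyelasticsearch.

```python
def cluster_health_url(base_url):
    """Build the URL of the cluster health endpoint for a node."""
    return base_url.rstrip('/') + '/_cluster/health'

def is_green(health):
    """True when the cluster reports 'green' and the request did not time out."""
    return health.get('status') == 'green' and not health.get('timed_out', False)

# A trimmed copy of the response shown above.
sample = {'cluster_name': 'elasticsearch_tai-dev',
          'status': 'green',
          'timed_out': False}

print(cluster_health_url('http://localhost:9200'))
print(is_green(sample))
```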

CRUD: create-read-update-delete¶

Before we can do anything with ElasticSearch, we need to index our documents into it.

Index¶

In [12]:

es.index('library', # Index name
         'books',   # Type name
         {
            'title': 'A very interesting name',
            'name': {
                'first': 'Hugh',
                'last': 'Jackman'
            },
            'publish_date': '2015-07-02',
            'price': 20,
         },
         id=1        # Doc ID
        )

Out[12]:

{'_id': '1',
 '_index': 'library',
 '_type': 'books',
 '_version': 1,
 'created': True}

Read¶

In [13]:

es.get('library', 'books', 1)

Out[13]:

{'_id': '1',
 '_index': 'library',
 '_source': {'name': {'first': 'Hugh', 'last': 'Jackman'},
  'price': 20,
  'publish_date': '2015-07-02',
  'title': 'A very interesting name'},
 '_type': 'books',
 '_version': 1,
 'found': True}

If the document does not exist, an error is raised:

In [35]:

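from pyelasticsearch import ElasticHttpNotFoundError

try:
    es.get('library', 'books', 123)
except ElasticHttpNotFoundError:
    # Raised when ElasticSearch answers with a 404
    print("This is an error!")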

WARNING:elasticsearch:GET /library/books/123 [status:404 request:0.004s]

This is an error!

The ID is optional. If you omit it, ElasticSearch generates one (an ugly one):

In [18]:

es.index('library', # Index name
         'books',   # Type name
         {
            'title': 'Another interesting name',
            'name': {
                'first': 'Tom',
                'last': 'Cruise'
            },
            'publish_date': '2015-08-02',
            'price': 21,
         },
        )

Out[18]:

{'_id': 'AU5N-j9DhyYCAHYcFB3R',
 '_index': 'library',
 '_type': 'books',
 '_version': 1,
 'created': True}

Get me that book:

In [19]:

es.get('library', 'books', 'AU5N-j9DhyYCAHYcFB3R')

Out[19]:

{'_id': 'AU5N-j9DhyYCAHYcFB3R',
 '_index': 'library',
 '_source': {'name': {'first': 'Tom', 'last': 'Cruise'},
  'price': 21,
  'publish_date': '2015-08-02',
  'title': 'Another interesting name'},
 '_type': 'books',
 '_version': 1,
 'found': True}
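
The two index calls above (with and without an ID) correspond to two slightly different REST requests. The sketch below shows that shape as described in the ElasticSearch guide: PUT to a full address when the ID is known, POST to the type when ElasticSearch should generate one. The function index_request is a hypothetical helper for illustration and says nothing about how pyelasticsearch builds requests internally.

```python
import json

def index_request(index, doc_type, doc, id=None):
    """Return the (method, path, body) an index call roughly corresponds to."""
    body = json.dumps(doc)
    if id is None:
        # No ID given: POST to the type and let ElasticSearch generate one.
        return 'POST', '/{}/{}/'.format(index, doc_type), body
    # Explicit ID: PUT the document at that exact address.
    return 'PUT', '/{}/{}/{}'.format(index, doc_type, id), body

book = {'title': 'A very interesting name', 'price': 20}

print(index_request('library', 'books', book, id=1)[:2])
print(index_request('library', 'books', book)[:2])
```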

Update¶

In [23]:

es.update('library', # Index name
          'books',   # Type name
          id = 1,    # Doc ID
          doc = {
             'title': 'A very interesting name 2',
             'name': {
                 'first': 'Hugh',
                 'last': 'Jackman'
             },
             'publish_date': '2015-07-03',
             'price': 30,
          },
        )

Out[23]:

{'_id': '1', '_index': 'library', '_type': 'books', '_version': 2}

It worked, although the method call is rather ugly.

In [24]:

es.get('library', 'books', 1)

Out[24]:

{'_id': '1',
 '_index': 'library',
 '_source': {'name': {'first': 'Hugh', 'last': 'Jackman'},
  'price': 30,
  'publish_date': '2015-07-03',
  'title': 'A very interesting name 2'},
 '_type': 'books',
 '_version': 2,
 'found': True}

The method performs a partial update, so we only need to pass the fields that change:

In [25]:

es.update('library', # Index name
          'books',   # Type name
          id = 1,    # Doc ID
          doc = {
             'price': 90,
          },
        )

Out[25]:

{'_id': '1', '_index': 'library', '_type': 'books', '_version': 3}

In [26]:

es.get('library', 'books', 1)

Out[26]:

{'_id': '1',
 '_index': 'library',
 '_source': {'name': {'first': 'Hugh', 'last': 'Jackman'},
  'price': 90,
  'publish_date': '2015-07-03',
  'title': 'A very interesting name 2'},
 '_type': 'books',
 '_version': 3,
 'found': True}
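
The effect of a partial update can be pictured as a recursive merge of the doc argument into the stored _source: nested objects are merged, other fields are overwritten or added. The following is only a local model of that behaviour in plain Python (merge_partial is a made-up name); the real merge happens inside ElasticSearch.

```python
import copy

def merge_partial(source, partial):
    """Return a copy of source with partial merged in, recursing into dicts."""
    merged = copy.deepcopy(source)
    for key, value in partial.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Nested objects are merged field by field.
            merged[key] = merge_partial(merged[key], value)
        else:
            # Scalars (and anything else) simply replace the old value.
            merged[key] = copy.deepcopy(value)
    return merged

stored = {'name': {'first': 'Hugh', 'last': 'Jackman'},
          'price': 30,
          'publish_date': '2015-07-03',
          'title': 'A very interesting name 2'}

updated = merge_partial(stored, {'price': 90})
print(updated['price'], updated['title'])
```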

Delete¶

In [36]:

es.delete('library', 'books', 1)

Out[36]:

{'_id': '1',
 '_index': 'library',
 '_type': 'books',
 '_version': 4,
 'found': True}

In [38]:

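from pyelasticsearch import ElasticHttpNotFoundError

try:
    es.get('library', 'books', 1)
except ElasticHttpNotFoundError:
    # The document is gone, so ElasticSearch answers with a 404
    print('Not found!')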

WARNING:elasticsearch:GET /library/books/1 [status:404 request:0.004s]

Not found!

Bulk indexing and Search¶

Bulk index¶

Input data:

In [40]:

users = [{ "email" : "john@smith.com", "name" : "John Smith", "username" : "@john" }, 
        { "email" : "mary@jones.com", "name" : "Mary Jones", "username" : "@mary" }]

tweet = [{ "date" : "2014-09-13", "name" : "Mary Jones", "tweet" : "Elasticsearch means full text search has never been so easy", "user_id" : 2 },
        { "date" : "2014-09-14", "name" : "John Smith", "tweet" : "@mary it is not just text, it does everything", "user_id" : 1 },
        { "date" : "2014-09-15", "name" : "Mary Jones", "tweet" : "However did I manage before Elasticsearch?", "user_id" : 2 },
        { "date" : "2014-09-16", "name" : "John Smith", "tweet" : "The Elasticsearch API is really easy to use", "user_id" : 1 },
        { "date" : "2014-09-17", "name" : "Mary Jones", "tweet" : "The Query DSL is really powerful and flexible", "user_id" : 2 }]

Bulk indexing:

In [48]:

es.bulk((es.index_op(user, id=i) for i, user in enumerate(users)),
        index='demo',
        doc_type='user')

Out[48]:

{'errors': False,
 'items': [{'index': {'_id': '0',
    '_index': 'demo',
    '_type': 'user',
    '_version': 1,
    'status': 201}},
  {'index': {'_id': '1',
    '_index': 'demo',
    '_type': 'user',
    '_version': 1,
    'status': 201}}],
 'took': 886}

In [49]:

es.bulk((es.index_op(t, id=i) for i, t in enumerate(tweet)),
        index='demo',
        doc_type='tweet')

Out[49]:

{'errors': False,
 'items': [{'index': {'_id': '0',
    '_index': 'demo',
    '_type': 'tweet',
    '_version': 1,
    'status': 201}},
  {'index': {'_id': '1',
    '_index': 'demo',
    '_type': 'tweet',
    '_version': 1,
    'status': 201}},
  {'index': {'_id': '2',
    '_index': 'demo',
    '_type': 'tweet',
    '_version': 1,
    'status': 201}},
  {'index': {'_id': '3',
    '_index': 'demo',
    '_type': 'tweet',
    '_version': 1,
    'status': 201}},
  {'index': {'_id': '4',
    '_index': 'demo',
    '_type': 'tweet',
    '_version': 1,
    'status': 201}}],
 'took': 53}
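
Behind the scenes, the bulk API expects a newline-delimited body in which every document is preceded by an action line. Here is a minimal sketch of one possible form of that body for index operations. The helper bulk_body is hypothetical, and pyelasticsearch may lay out the action lines differently (for example by putting the index and type in the URL instead).

```python
import json

def bulk_body(docs, index, doc_type):
    """Build a newline-delimited bulk body that indexes docs with IDs 0, 1, 2..."""
    lines = []
    for i, doc in enumerate(docs):
        action = {'index': {'_index': index, '_type': doc_type, '_id': str(i)}}
        lines.append(json.dumps(action))  # what to do and where
        lines.append(json.dumps(doc))     # the document itself
    # The bulk format requires a trailing newline after the last line.
    return '\n'.join(lines) + '\n'

users = [{'email': 'john@smith.com', 'name': 'John Smith', 'username': '@john'},
         {'email': 'mary@jones.com', 'name': 'Mary Jones', 'username': '@mary'}]

body = bulk_body(users, 'demo', 'user')
print(body)
```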

Search¶

Search all¶

In [54]:

es.search({})

Out[54]:

{'_shards': {'failed': 0, 'successful': 5, 'total': 5},
 'hits': {'hits': [{'_id': '4',
    '_index': 'demo',
    '_score': 1.0,
    '_source': {'date': '2014-09-17',
     'name': 'Mary Jones',
     'tweet': 'The Query DSL is really powerful and flexible',
     'user_id': 2},
    '_type': 'tweet'},
   {'_id': '0',
    '_index': 'demo',
    '_score': 1.0,
    '_source': {'email': 'john@smith.com',
     'name': 'John Smith',
     'username': '@john'},
    '_type': 'user'},
   {'_id': '0',
    '_index': 'demo',
    '_score': 1.0,
    '_source': {'date': '2014-09-13',
     'name': 'Mary Jones',
     'tweet': 'Elasticsearch means full text search has never been so easy',
     'user_id': 2},
    '_type': 'tweet'},
   {'_id': '1',
    '_index': 'demo',
    '_score': 1.0,
    '_source': {'email': 'mary@jones.com',
     'name': 'Mary Jones',
     'username': '@mary'},
    '_type': 'user'},
   {'_id': '1',
    '_index': 'demo',
    '_score': 1.0,
    '_source': {'date': '2014-09-14',
     'name': 'John Smith',
     'tweet': '@mary it is not just text, it does everything',
     'user_id': 1},
    '_type': 'tweet'},
   {'_id': '2',
    '_index': 'demo',
    '_score': 1.0,
    '_source': {'date': '2014-09-15',
     'name': 'Mary Jones',
     'tweet': 'However did I manage before Elasticsearch?',
     'user_id': 2},
    '_type': 'tweet'},
   {'_id': '3',
    '_index': 'demo',
    '_score': 1.0,
    '_source': {'date': '2014-09-16',
     'name': 'John Smith',
     'tweet': 'The Elasticsearch API is really easy to use',
     'user_id': 1},
    '_type': 'tweet'}],
  'max_score': 1.0,
  'total': 7},
 'timed_out': False,
 'took': 6}
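
Search responses get verbose quickly, so a small helper that keeps only the type, ID and score of each hit can make them easier to scan. This is a convenience sketch over the response structure shown above; summarize_hits is an invented name, and the sample dict is a trimmed stand-in for a real response.

```python
def summarize_hits(response):
    """Return (type, id, score) for each hit, in the order it was returned."""
    return [(hit['_type'], hit['_id'], hit['_score'])
            for hit in response['hits']['hits']]

# A trimmed response with the same shape as the output above.
sample = {
    'hits': {
        'total': 2,
        'max_score': 1.0,
        'hits': [
            {'_index': 'demo', '_type': 'tweet', '_id': '4', '_score': 1.0,
             '_source': {'name': 'Mary Jones'}},
            {'_index': 'demo', '_type': 'user', '_id': '0', '_score': 1.0,
             '_source': {'name': 'John Smith'}},
        ],
    },
}

for row in summarize_hits(sample):
    print(row)
```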

Match¶

Simple match

In [55]:

es.search('name:john', index='demo')

Out[55]:

{'_shards': {'failed': 0, 'successful': 5, 'total': 5},
 'hits': {'hits': [{'_id': '0',
    '_index': 'demo',
    '_score': 0.625,
    '_source': {'email': 'john@smith.com',
     'name': 'John Smith',
     'username': '@john'},
    '_type': 'user'},
   {'_id': '1',
    '_index': 'demo',
    '_score': 0.625,
    '_source': {'date': '2014-09-14',
     'name': 'John Smith',
     'tweet': '@mary it is not just text, it does everything',
     'user_id': 1},
    '_type': 'tweet'},
   {'_id': '3',
    '_index': 'demo',
    '_score': 0.19178301,
    '_source': {'date': '2014-09-16',
     'name': 'John Smith',
     'tweet': 'The Elasticsearch API is really easy to use',
     'user_id': 1},
    '_type': 'tweet'}],
  'max_score': 0.625,
  'total': 3},
 'timed_out': False,
 'took': 282}

The Query API: we can hide from it for some time, but we can't escape it:

In [59]:

query = {'query':
            {'match': {'name': 'john'}}
        }

In [60]:

es.search(query, index='demo')

Out[60]:

{'_shards': {'failed': 0, 'successful': 5, 'total': 5},
 'hits': {'hits': [{'_id': '0',
    '_index': 'demo',
    '_score': 0.625,
    '_source': {'email': 'john@smith.com',
     'name': 'John Smith',
     'username': '@john'},
    '_type': 'user'},
   {'_id': '1',
    '_index': 'demo',
    '_score': 0.625,
    '_source': {'date': '2014-09-14',
     'name': 'John Smith',
     'tweet': '@mary it is not just text, it does everything',
     'user_id': 1},
    '_type': 'tweet'},
   {'_id': '3',
    '_index': 'demo',
    '_score': 0.19178301,
    '_source': {'date': '2014-09-16',
     'name': 'John Smith',
     'tweet': 'The Elasticsearch API is really easy to use',
     'user_id': 1},
    '_type': 'tweet'}],
  'max_score': 0.625,
  'total': 3},
 'timed_out': False,
 'took': 9}

How about 2 terms?

In [61]:

query = {'query':
            {'match': {'name': 'john mary'}}
        }
es.search(query, index='demo')

Out[61]:

{'_shards': {'failed': 0, 'successful': 5, 'total': 5},
 'hits': {'hits': [{'_id': '0',
    '_index': 'demo',
    '_score': 0.22097087,
    '_source': {'email': 'john@smith.com',
     'name': 'John Smith',
     'username': '@john'},
    '_type': 'user'},
   {'_id': '0',
    '_index': 'demo',
    '_score': 0.22097087,
    '_source': {'date': '2014-09-13',
     'name': 'Mary Jones',
     'tweet': 'Elasticsearch means full text search has never been so easy',
     'user_id': 2},
    '_type': 'tweet'},
   {'_id': '1',
    '_index': 'demo',
    '_score': 0.22097087,
    '_source': {'email': 'mary@jones.com',
     'name': 'Mary Jones',
     'username': '@mary'},
    '_type': 'user'},
   {'_id': '1',
    '_index': 'demo',
    '_score': 0.22097087,
    '_source': {'date': '2014-09-14',
     'name': 'John Smith',
     'tweet': '@mary it is not just text, it does everything',
     'user_id': 1},
    '_type': 'tweet'},
   {'_id': '4',
    '_index': 'demo',
    '_score': 0.028130025,
    '_source': {'date': '2014-09-17',
     'name': 'Mary Jones',
     'tweet': 'The Query DSL is really powerful and flexible',
     'user_id': 2},
    '_type': 'tweet'},
   {'_id': '2',
    '_index': 'demo',
    '_score': 0.028130025,
    '_source': {'date': '2014-09-15',
     'name': 'Mary Jones',
     'tweet': 'However did I manage before Elasticsearch?',
     'user_id': 2},
    '_type': 'tweet'},
   {'_id': '3',
    '_index': 'demo',
    '_score': 0.028130025,
    '_source': {'date': '2014-09-16',
     'name': 'John Smith',
     'tweet': 'The Elasticsearch API is really easy to use',
     'user_id': 1},
    '_type': 'tweet'}],
  'max_score': 0.22097087,
  'total': 7},
 'timed_out': False,
 'took': 179}

And a phrase?

In [62]:

query = {'query':
            {'match_phrase': {'name': 'john mary'}}
        }

es.search(query, index='demo')

Out[62]:

{'_shards': {'failed': 0, 'successful': 5, 'total': 5},
 'hits': {'hits': [], 'max_score': None, 'total': 0},
 'timed_out': False,
 'took': 147}

Unlike the get method, search does not raise an error when nothing matches; it just returns an empty list of hits, which is much less scary.
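
Why does match on 'john mary' return all seven documents while match_phrase returns none? A toy model in plain Python can illustrate the difference: match is satisfied when any of the terms occurs, whereas match_phrase needs the terms next to each other and in order. This sketch ignores scoring and the real analysis chain; the function names are made up.

```python
import re

def tokens(text):
    """Very rough stand-in for analysis: lowercase word tokens."""
    return re.findall(r'\w+', text.lower())

def toy_match(query, field_value):
    """True if any query term occurs in the field (the default 'or' behaviour)."""
    field_terms = set(tokens(field_value))
    return any(term in field_terms for term in tokens(query))

def toy_match_phrase(query, field_value):
    """True if the query terms occur next to each other, in order."""
    q, f = tokens(query), tokens(field_value)
    if not q:
        return False
    return any(f[i:i + len(q)] == q for i in range(len(f) - len(q) + 1))

for name in ['John Smith', 'Mary Jones']:
    print(name, toy_match('john mary', name), toy_match_phrase('john mary', name))
```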

Boolean combination¶

We can write boolean combinations with must, must_not and should.

For example, does John Smith mention "API" in his tweets?

In [67]:

query = \
{
    "query": {
        "bool": {
            "must": [
                {
                    "match_phrase": {
                        "name": "john smith"
                    }
                },
                {
                    "match": {
                        "tweet": "API"
                    }
                }
            ]
        }
    }
}

es.search(query, index='demo')

Out[67]:

{'_shards': {'failed': 0, 'successful': 5, 'total': 5},
 'hits': {'hits': [{'_id': '3',
    '_index': 'demo',
    '_score': 0.38595587,
    '_source': {'date': '2014-09-16',
     'name': 'John Smith',
     'tweet': 'The Elasticsearch API is really easy to use',
     'user_id': 1},
    '_type': 'tweet'}],
  'max_score': 0.38595587,
  'total': 1},
 'timed_out': False,
 'took': 13}

We can rank the importance of the statements in a combination using the boost field.

Let's try it with 'DSL' and 'API':

In [76]:

query = \
{
    "query": {
        "bool": {
            "should": [
                {
                    "match": {
                        "tweet": {
                            "query": "DSL",
                            "boost": 5,
                        }                        
                    }
                },
                {
                    "match": {
                        "tweet": "API"
                    }
                }
            ]
        }
    }
}

es.search(query, index='demo')

Out[76]:

{'_shards': {'failed': 0, 'successful': 5, 'total': 5},
 'hits': {'hits': [{'_id': '4',
    '_index': 'demo',
    '_score': 0.04016714,
    '_source': {'date': '2014-09-17',
     'name': 'Mary Jones',
     'tweet': 'The Query DSL is really powerful and flexible',
     'user_id': 2},
    '_type': 'tweet'},
   {'_id': '3',
    '_index': 'demo',
    '_score': 0.0029369325,
    '_source': {'date': '2014-09-16',
     'name': 'John Smith',
     'tweet': 'The Elasticsearch API is really easy to use',
     'user_id': 1},
    '_type': 'tweet'}],
  'max_score': 0.04016714,
  'total': 2},
 'timed_out': False,
 'took': 10}

Now change the boost, and the order changes:

In [77]:

query = \
{
    "query": {
        "bool": {
            "should": [
                {
                    "match": {
                        "tweet": {
                            "query": "DSL",
                            "boost": 0.5,
                        }                        
                    }
                },
                {
                    "match": {
                        "tweet": {
                            "query": "API"
                        }
                    }
                }
            ]
        }
    }
}

es.search(query, index='demo')

Out[77]:

{'_shards': {'failed': 0, 'successful': 5, 'total': 5},
 'hits': {'hits': [{'_id': '3',
    '_index': 'demo',
    '_score': 0.025078464,
    '_source': {'date': '2014-09-16',
     'name': 'John Smith',
     'tweet': 'The Elasticsearch API is really easy to use',
     'user_id': 1},
    '_type': 'tweet'},
   {'_id': '4',
    '_index': 'demo',
    '_score': 0.0072710635,
    '_source': {'date': '2014-09-17',
     'name': 'Mary Jones',
     'tweet': 'The Query DSL is really powerful and flexible',
     'user_id': 2},
    '_type': 'tweet'}],
  'max_score': 0.025078464,
  'total': 2},
 'timed_out': False,
 'took': 9}
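
Writing these nested dicts by hand gets tedious, so here is a small sketch of two helpers that build the bool queries used above. Both match_clause and bool_query are hypothetical names for this demo, not pyelasticsearch functions; they only produce the same dicts we typed manually.

```python
def match_clause(field, text, boost=None):
    """Build a match clause, using the long form only when a boost is given."""
    if boost is None:
        return {'match': {field: text}}
    return {'match': {field: {'query': text, 'boost': boost}}}

def bool_query(must=(), should=(), must_not=()):
    """Build a bool query, leaving out the sections that are empty."""
    sections = {'must': must, 'should': should, 'must_not': must_not}
    body = {name: list(clauses) for name, clauses in sections.items() if clauses}
    return {'query': {'bool': body}}

# Same query as the "John Smith mentions API" example.
john_on_api = bool_query(must=[{'match_phrase': {'name': 'john smith'}},
                               match_clause('tweet', 'API')])

# Same query as the first boost example.
dsl_boosted = bool_query(should=[match_clause('tweet', 'DSL', boost=5),
                                 match_clause('tweet', 'API')])
print(dsl_boosted)
```

The resulting dicts can be passed to es.search(query, index='demo') exactly like the hand-written ones.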

We can also highlight the matching terms in the result:

In [80]:

query = \
{
    "query": {
        "bool": {
            "must": [
                {
                    "match_phrase": {
                        "name": "john smith"
                    }
                },
                {
                    "match": {
                        "tweet": "API"
                    }
                }
            ]
        }
    },
    "highlight": {
        "fields": {
            "tweet": {}
        }
    }
}

es.search(query, index='demo')

Out[80]:

{'_shards': {'failed': 0, 'successful': 5, 'total': 5},
 'hits': {'hits': [{'_id': '3',
    '_index': 'demo',
    '_score': 0.38595587,
    '_source': {'date': '2014-09-16',
     'name': 'John Smith',
     'tweet': 'The Elasticsearch API is really easy to use',
     'user_id': 1},
    '_type': 'tweet',
    'highlight': {'tweet': ['The Elasticsearch <em>API</em> is really easy to use']}}],
  'max_score': 0.38595587,
  'total': 1},
 'timed_out': False,
 'took': 15}
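
The highlighted fragments come back wrapped in <em> tags, as the output above shows. If we only want the matched terms, a short regular expression over the highlight section is enough. This is a sketch that assumes the <em> tags seen in this output; highlighted_terms is an invented helper.

```python
import re

def highlighted_terms(hit, field):
    """Collect the text inside <em>...</em> from a hit's highlight fragments."""
    fragments = hit.get('highlight', {}).get(field, [])
    terms = []
    for fragment in fragments:
        terms.extend(re.findall(r'<em>(.*?)</em>', fragment))
    return terms

# The single hit from the output above, trimmed to the relevant keys.
hit = {'_id': '3',
       '_type': 'tweet',
       'highlight': {'tweet': ['The Elasticsearch <em>API</em> is really easy to use']}}

print(highlighted_terms(hit, 'tweet'))
```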

Filter¶

Find all tweets posted after '2014-09-15':

In [83]:

query = \
{
    "query": {
        "filtered": {
            "filter": {
                "range": {
                    "date": {
                        "gt": '2014-09-15'
                    }
                }
            }
        }
    }
}

es.search(query, index='demo')

Out[83]:

{'_shards': {'failed': 0, 'successful': 5, 'total': 5},
 'hits': {'hits': [{'_id': '4',
    '_index': 'demo',
    '_score': 1.0,
    '_source': {'date': '2014-09-17',
     'name': 'Mary Jones',
     'tweet': 'The Query DSL is really powerful and flexible',
     'user_id': 2},
    '_type': 'tweet'},
   {'_id': '3',
    '_index': 'demo',
    '_score': 1.0,
    '_source': {'date': '2014-09-16',
     'name': 'John Smith',
     'tweet': 'The Elasticsearch API is really easy to use',
     'user_id': 1},
    '_type': 'tweet'}],
  'max_score': 1.0,
  'total': 2},
 'timed_out': False,
 'took': 8}

How about listing only John Smith's tweets posted after 2014-09-15?

In [85]:

query = \
{
    "query": {
        "filtered": {
            "query": {
                "match_phrase": {
                    "name": "John Smith"
                }
            },
            "filter": {
                "range": {
                    "date": {
                        "gt": '2014-09-15'
                    }
                }
            }
        }
    }
}

es.search(query, index='demo')

Out[85]:

{'_shards': {'failed': 0, 'successful': 5, 'total': 5},
 'hits': {'hits': [{'_id': '3',
    '_index': 'demo',
    '_score': 0.38356602,
    '_source': {'date': '2014-09-16',
     'name': 'John Smith',
     'tweet': 'The Elasticsearch API is really easy to use',
     'user_id': 1},
    '_type': 'tweet'}],
  'max_score': 0.38356602,
  'total': 1},
 'timed_out': False,
 'took': 12}
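
The two filtered queries above share one shape: a filter, plus an optional query. The sketch below builds that shape with a hypothetical helper, filtered_query. Note that later ElasticSearch releases deprecated and then removed the filtered query in favour of a bool query with a filter clause, so a second helper shows that form too; check which one your server version accepts.

```python
def filtered_query(filter_clause, query_clause=None):
    """Build the 'filtered' query used in this demo."""
    filtered = {'filter': filter_clause}
    if query_clause is not None:
        filtered['query'] = query_clause
    return {'query': {'filtered': filtered}}

def bool_filter_query(filter_clause, query_clause=None):
    """Same intent using bool + filter, the form later versions expect."""
    bool_body = {'filter': filter_clause}
    if query_clause is not None:
        bool_body['must'] = query_clause
    return {'query': {'bool': bool_body}}

after_15th = {'range': {'date': {'gt': '2014-09-15'}}}
by_john = {'match_phrase': {'name': 'John Smith'}}

print(filtered_query(after_15th, by_john))
print(bool_filter_query(after_15th, by_john))
```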

Analysis and Analyzer¶

Most of the fancy things above work because of analysis.

Analysis = Tokenization + Token filters

Analyzer = Character filters + Tokenizer + Token filters

Analyzers are language-specific. As of July 2015, Vietnamese is not supported, so we won't say much more about them here.
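
To give a feel for the three stages of an analyzer, here is a toy pipeline in plain Python: one character filter, one tokenizer and two token filters. It is only an illustration of the idea, not a reimplementation of any built-in ElasticSearch analyzer, and the stopword list is an arbitrary choice for this example.

```python
import re

def strip_html(text):
    """Character filter: replace HTML tags with spaces before tokenizing."""
    return re.sub(r'<[^>]+>', ' ', text)

def tokenize(text):
    """Tokenizer: split the text into word tokens."""
    return re.findall(r'\w+', text)

def lowercase(tokens):
    """Token filter: normalize case so 'John' and 'john' match."""
    return [t.lower() for t in tokens]

def remove_stopwords(tokens, stopwords=('the', 'is', 'to')):
    """Token filter: drop very common words (arbitrary list for this demo)."""
    return [t for t in tokens if t not in stopwords]

def analyze(text):
    """Run the stages in order: character filter, tokenizer, token filters."""
    return remove_stopwords(lowercase(tokenize(strip_html(text))))

print(analyze('The <b>Elasticsearch</b> API is really easy to use'))
```

The lowercase step is the reason a search for 'john' finds documents whose name is 'John Smith'.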

Mapping¶

A mapping is a kind of schema in ElasticSearch. It is generated automatically if we don't customize it.

In [93]:

es.get_mapping('demo', 'tweet')

Out[93]:

{'demo': {'mappings': {'tweet': {'properties': {'date': {'format': 'dateOptionalTime',
      'type': 'date'},
     'name': {'type': 'string'},
     'tweet': {'type': 'string'},
     'user_id': {'type': 'long'}}}}}}

We can add a new field using the put_mapping method:

In [96]:

es.put_mapping('demo', 'tweet',
               {'tweet':
                {'properties':
                 {'very_new_field': {'type': 'string'}}}})

es.get_mapping('demo', 'tweet')

Out[96]:

{'demo': {'mappings': {'tweet': {'properties': {'date': {'format': 'dateOptionalTime',
      'type': 'date'},
     'name': {'type': 'string'},
     'tweet': {'type': 'string'},
     'user_id': {'type': 'long'},
     'very_new_field': {'type': 'string'}}}}}}
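
The nested mapping response is easier to read when flattened to a simple field-to-type table. A minimal sketch, assuming the response structure shown above (field_types is a made-up helper):

```python
def field_types(mapping, index, doc_type):
    """Flatten a get_mapping response to {field name: field type}."""
    properties = mapping[index]['mappings'][doc_type]['properties']
    return {field: spec.get('type') for field, spec in properties.items()}

# The response from the cell above.
mapping = {'demo': {'mappings': {'tweet': {'properties': {
    'date': {'format': 'dateOptionalTime', 'type': 'date'},
    'name': {'type': 'string'},
    'tweet': {'type': 'string'},
    'user_id': {'type': 'long'},
    'very_new_field': {'type': 'string'}}}}}}

for field, field_type in sorted(field_types(mapping, 'demo', 'tweet').items()):
    print(field, field_type)
```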

We can't change the mapping of an existing field, though:

In [98]:

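from pyelasticsearch import ElasticHttpError

try:
    es.put_mapping('demo', 'tweet',
                   {'tweet':
                    {'properties':
                     {'very_new_field': {'type': 'long'}}}})
except ElasticHttpError:
    # ElasticSearch rejects the conflicting type with a 400
    print("Error")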

WARNING:elasticsearch:PUT /demo/tweet/_mapping [status:400 request:0.068s]

Error

So if you must, specify your mapping before indexing to make sure things go the way you want.
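
As a sketch of what setting the mapping up front could look like, the dict below describes the tweet fields explicitly, using the same types ElasticSearch inferred earlier. The index name demo2 is a placeholder, and the create_index call in the comment is an assumption about how you would pass this body with pyelasticsearch, so check the signature in your installed version.

```python
import json

# Explicit mapping for the tweet type, mirroring the auto-generated one.
tweet_mapping = {
    'mappings': {
        'tweet': {
            'properties': {
                'date': {'type': 'date'},
                'name': {'type': 'string'},
                'tweet': {'type': 'string'},
                'user_id': {'type': 'long'},
            }
        }
    }
}

# Assumption: on a fresh index, something like
#     es.create_index('demo2', settings=tweet_mapping)
# would create the index with this mapping before any document is indexed.
print(json.dumps(tweet_mapping, indent=2, sort_keys=True))
```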