## Web scraping

Scraping is often the only way to get values from websites that don't provide an API. Locating the right value can be tricky, but this example should help you get started. The workflow is very similar to the one the scrape sensor uses.

### Get the value

Importing the needed modules.

In [1]:
import requests
from bs4 import BeautifulSoup


We want to scrape the counter for all our implementations from the Component overview.

The section (extracted from the source) which is relevant for this example is shown below.

...
<div class="grid__item one-sixth lap-one-whole palm-one-whole">
<div class="filter-button-group">
<a href='#all' class="btn">All (444)</a>
<a href='#featured' class="btn featured">Featured</a>
<a href='#alarm' class="btn">
Alarm
(9)
</a>
...


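The line <a href='#all' class="btn">All (444)</a> contains the counter. The number grows as components are added, which is why the live outputs further down show a different value than this excerpt.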

In [2]:
URL = 'https://home-assistant.io/components/'


The page is retrieved with requests and then parsed with BeautifulSoup.

In [3]:
raw_html = requests.get(URL).text
data = BeautifulSoup(raw_html, 'html.parser')
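
The cell above parses whatever the server returns, including error pages. A minimal sketch of a more defensive fetch follows; the helper name `fetch_soup`, the 10-second default timeout, and the injectable `get` parameter are illustrative assumptions and not part of the original notebook.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, timeout=10, get=requests.get):
    """Download url and return the parsed document (hypothetical helper).

    `get` defaults to requests.get and can be swapped for a stub,
    so the function can be exercised without network access.
    """
    response = get(url, timeout=timeout)
    # Raise an exception on 4xx/5xx instead of parsing an error page.
    response.raise_for_status()
    return BeautifulSoup(response.text, "html.parser")


# Usage, equivalent to the cell above:
# data = fetch_soup(URL)
```

With this variant a failed request surfaces as an exception rather than as a selector that silently matches nothing.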


Now you have the complete content of the page. CSS selectors can be used to identify the counter, and there are several ways to get to the part in question. Because select() returns a list of matches, we only need to identify the position of the right element in that list.

In [4]:
print(data.select('a')[10])

<a class="btn" href="#all">All (791)</a>

In [5]:
print(data.select('.btn')[0])

<a class="btn" href="#all">All (791)</a>


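nth-of-type(x) matches an element that is the x-th of its type among its siblings. Unlike Python list indices, x is counted from 1, and the selector still returns a list.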

In [6]:
print(data.select('a:nth-of-type(11)'))

[<a class="btn" href="#all">All (791)</a>]
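
The difference between a list index and nth-of-type is easier to see on a small document. The sketch below uses made-up markup (not the real Components page) with two groups of links; the names `SAMPLE_HTML`, `third_link`, and `second_in_each_group` are illustrative.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration only: two sibling groups of links.
SAMPLE_HTML = """
<div id="nav"><a href="#home">Home</a><a href="#docs">Docs</a></div>
<div id="filters"><a href="#all">All (791)</a><a href="#alarm">Alarm (9)</a></div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Python list indexing is 0-based and counts matches across the whole document.
third_link = soup.select("a")[2]

# nth-of-type is 1-based and counts only siblings with the same tag name,
# so it can match one element per parent.
second_in_each_group = soup.select("a:nth-of-type(2)")

print(third_link["href"])                         # #all
print([a["href"] for a in second_in_each_group])  # ['#docs', '#alarm']
```

On a page where the links are spread over several parent elements, nth-of-type may therefore return more than one match, so it is worth checking the length of the result.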


To make your selector as robust as possible, it's recommended to match on something unique to the element, such as its id or the URL in its href attribute.

In [7]:
print(data.select('a[href="#all"]'))

[<a class="btn" href="#all">All (791)</a>]


In the scrape sensor, value extraction is handled by value_template. The next two steps are included only to show the complete manual process.

We only need the actual text.

In [8]:
print(data.select('a[href="#all"]')[0].text)

All (791)
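
Indexing with [0] raises an IndexError when the selector matches nothing, for example after a page redesign. A small sketch of a guarded lookup is shown here on made-up markup; the helper name `first_text` is an assumption for illustration, while select_one() and get_text() are regular BeautifulSoup methods.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration only.
SAMPLE_HTML = (
    '<div><a class="btn" href="#all">All (791)</a>'
    '<a class="btn featured" href="#featured">Featured</a></div>'
)
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")


def first_text(document, selector):
    """Return the stripped text of the first match, or None if nothing matches."""
    element = document.select_one(selector)
    return element.get_text(strip=True) if element is not None else None


print(first_text(soup, 'a[href="#all"]'))      # All (791)
print(first_text(soup, 'a[href="#missing"]'))  # None
```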


This is a string and can be manipulated. We focus on the number; the slice below assumes a three-digit counter.

In [9]:
print(data.select('a[href="#all"]')[0].text[5:8])

791
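
The fixed slice [5:8] only works while the counter has exactly three digits: for "All (1024)" it would yield "102". One possible alternative is a regular expression that captures whatever digits sit between the parentheses. This is a sketch; the function name `extract_count` is illustrative.

```python
import re


def extract_count(label):
    """Return the first integer found inside parentheses, or None."""
    match = re.search(r"\((\d+)\)", label)
    return int(match.group(1)) if match else None


print(extract_count("All (791)"))   # 791
print(extract_count("All (1024)"))  # 1024
print(extract_count("Featured"))    # None
```

The result is an int rather than a string, which may or may not be what the consumer of the value expects.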


This is the number of platforms/components currently listed in the Component overview, i.e. those available in Home Assistant.

The details you identified here can be reused to configure the select option of the scrape sensor. The most efficient way to do that is to apply nth-of-type(x) to your selector.

### Send the value to the Home Assistant frontend

The "Using the Home Assistant Python API" notebook contains an introduction to the Python API of Home Assistant and to Jupyter notebooks. Here we send the scraped value to the Home Assistant frontend.

In [10]:
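import homeassistant.remote as remote

HOST = '127.0.0.1'
PASSWORD = 'YOUR_PASSWORD'  # assumption: placeholder, not in the original cell

# Assumed completion: the original cell stops before the API call that returns True.
api = remote.API(HOST, PASSWORD)
new_state = data.select('a[href="#all"]')[0].text[5:8]
# 'sensor.scraped_counter' is a hypothetical entity ID chosen for illustration.
remote.set_state(api, 'sensor.scraped_counter', new_state=new_state)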

True