!wget https://setup.johnsnowlabs.com/nlu/colab.sh -O - | bash
import nlu
Installing NLU 3.4.3rc2 with PySpark 3.0.3 and Spark NLP 3.4.2 for Google Colab ...
Successfully installed nlu-tmp-3.4.3rc10
https://www.kaggle.com/kashnitsky/news-about-major-cryptocurrencies-20132018-40k
import pandas as pd
import nlu
!wget http://ckl-it.de/wp-content/uploads/2020/12/small_btc.csv
df = pd.read_csv('/content/small_btc.csv').title
df
2022-04-15 11:41:28 (14.6 MB/s) - ‘small_btc.csv’ saved [22244914/22244914]
0       Bitcoin Price Update: Will China Lead us Down?
1       Key Bitcoin Price Levels for Week 51 (15 – 22 ...
2       National Australia Bank, Citing Highly Flawed ...
3       Chinese Bitcoin Ban Driven by Chinese Banking...
4       Bitcoin Trade Update: Opened Position
                             ...
1995    Bitcoin Bill Pay Company Living Room of Satosh...
1996    NYDFS Extends BitLicense Bitcoin Regulation Co...
1997    Bitfinex Passes Stefan Thomas’s Proof Of Solve...
1998    Cryptocurrency Exchange Platform AlphaPoint Pa...
1999    Want to Buy And Sell Bitcoin Fast and Secure? ...
Name: title, Length: 2000, dtype: object
import nlu
# Predict emotions on the dataset with the NLU emotion classifier
emotion_df = nlu.load('emotion').predict(df)
emotion_df
classifierdl_use_emotion download started this may take some time. Approximate size to download 21.3 MB [OK!]
tfhub_use download started this may take some time. Approximate size to download 923.7 MB [OK!]
sentence_detector_dl download started this may take some time. Approximate size to download 354.6 KB [OK!]
| | emotion | emotion_confidence | sentence | sentence_embedding_use |
---|---|---|---|---|
0 | fear | 0.998173 | Bitcoin Price Update: Will China Lead us Down? | [0.05829371139407158, -0.036904484033584595, -... |
1 | joy | 0.997696 | Key Bitcoin Price Levels for Week 51 (15 – 22 ... | [0.038088250905275345, -0.04514157399535179, -... |
2 | fear | 0.999997 | National Australia Bank, Citing Highly Flawed ... | [0.05034318566322327, -0.01303655095398426, -0... |
3 | fear | 0.999135 | Chinese Bitcoin Ban Driven by Chinese Banking ... | [0.055152829736471176, -0.05237917602062225, -... |
4 | joy | 0.998864 | Bitcoin Trade Update: Opened Position | [0.05926975607872009, -0.056463420391082764, -... |
... | ... | ... | ... | ... |
1996 | fear | 0.998281 | NYDFS Extends BitLicense Bitcoin Regulation Co... | [0.0639236643910408, -0.05505230277776718, -0.... |
1997 | fear | 0.772052 | Bitfinex Passes Stefan Thomas’s Proof Of Solve... | [0.059178080409765244, -0.041498005390167236, ... |
1998 | joy | 0.999348 | Cryptocurrency Exchange Platform AlphaPoint Pa... | [0.05369672179222107, -0.023480931296944618, -... |
1999 | fear | 0.998905 | Want to Buy And Sell Bitcoin Fast and Secure? | [0.0626637190580368, -0.05945301055908203, -0.... |
1999 | fear | 0.998905 | Try CoinRNR | [0.02854502573609352, 0.05557611957192421, 0.0... |
2160 rows × 4 columns
emotion_df.emotion.value_counts().plot.bar(figsize=(20,14), title='Emotion Distribution of Bitcoin News Articles')
<matplotlib.axes._subplots.AxesSubplot at 0x7f7e1d797150>
key_df = nlu.load('yake').predict(df)
key_df
| | document | keywords | keywords_confidence |
---|---|---|---|
0 | Bitcoin Price Update: Will China Lead us Down? | update | 0.5798862558280943 |
0 | Bitcoin Price Update: Will China Lead us Down? | china | 0.5798862558280943 |
0 | Bitcoin Price Update: Will China Lead us Down? | china lead | 0.5066323531331214 |
1 | Key Bitcoin Price Levels for Week 51 (15 – 22 ... | price | 0.5798862558280943 |
1 | Key Bitcoin Price Levels for Week 51 (15 – 22 ... | levels | 0.5798862558280943 |
... | ... | ... | ... |
1998 | Cryptocurrency Exchange Platform AlphaPoint Pa... | growth | 0.26804494089513314 |
1998 | Cryptocurrency Exchange Platform AlphaPoint Pa... | support growth | 0.1840422979793308 |
1999 | Want to Buy And Sell Bitcoin Fast and Secure? ... | bitcoin fast | 0.3579604335906263 |
1999 | Want to Buy And Sell Bitcoin Fast and Secure? ... | try coinrnr | 0.2564243599387429 |
1999 | Want to Buy And Sell Bitcoin Fast and Secure? ... | sell bitcoin fast | 0.28203029979078753 |
6085 rows × 3 columns
To count each keyword, call `.explode()` on the `keywords` column and then take the value counts of the resulting rows:
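As a minimal sketch of that pattern on a made-up two-row frame (the column name `keywords` mirrors the NLU output; the data itself is invented for illustration):

```python
import pandas as pd

# Toy frame mimicking the YAKE output: one list of keywords per document
toy = pd.DataFrame({
    'document': ['doc a', 'doc b'],
    'keywords': [['bitcoin', 'price'], ['bitcoin', 'china']],
})

# explode() turns each list element into its own row, so value_counts()
# then counts individual keywords rather than whole lists
exploded = toy.explode('keywords')
counts = exploded.keywords.value_counts()
print(counts)
```

The same chained call, followed by `.plot.bar(...)`, produces the keyword-frequency chart below.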
key_df.explode('keywords').keywords.value_counts()[0:100].plot.bar(title='Top 100 Keywords in BTC News Articles', figsize=(20,8))
<matplotlib.axes._subplots.AxesSubplot at 0x7f7e1fb3da10>
To reduce the dimensionality of the data and get better keyword-extraction results, we can apply the built-in stemmer to our dataset, especially to merge occurrences of terms like `bitcoin` and `bitcoins`.

Note that lemmatizing and normalizing could also be applied for further dimensionality reduction, but they would not fix the previously mentioned example.
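To see why stemming merges the two counts, here is a deliberately crude sketch. This is not the Porter-style stemmer behind `nlu.load('stem')`; it only strips a trailing `s`, purely to illustrate the merging effect:

```python
def crude_stem(token: str) -> str:
    # Naive illustration only: strip a trailing 's' from longer tokens.
    # The real Spark NLP stemmer applies many more suffix rules.
    if token.endswith('s') and len(token) > 3:
        return token[:-1]
    return token

tokens = ['bitcoin', 'bitcoins', 'price', 'prices', 'us']
stems = [crude_stem(t) for t in tokens]
# 'bitcoin'/'bitcoins' and 'price'/'prices' collapse to single stems,
# while short tokens like 'us' are left untouched
print(stems)
```

A lemmatizer, by contrast, maps inflected forms to dictionary lemmas and would not necessarily collapse `bitcoins` onto `bitcoin`, since neither is in a standard dictionary.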
stem_df = nlu.load('stem').predict(df, output_level = 'document')
stem_df['stem_string'] = stem_df.stem.str.join(' ')
stem_df
| | document | stem | stem_string |
---|---|---|---|
0 | Bitcoin Price Update: Will China Lead us Down? | [bitcoin, price, updat, :, will, china, lead, ... | bitcoin price updat : will china lead u down ? |
1 | Key Bitcoin Price Levels for Week 51 (15 – 22 ... | [kei, bitcoin, price, level, for, week, 51, (,... | kei bitcoin price level for week 51 ( 15 – 22 ... |
2 | National Australia Bank, Citing Highly Flawed ... | [nation, australia, bank, ,, cite, highli, fla... | nation australia bank , cite highli flawe data... |
3 | Chinese Bitcoin Ban Driven by Chinese Banking ... | [chines, bitcoin, ban, driven, by, chines, ban... | chines bitcoin ban driven by chines bank crisi ? |
4 | Bitcoin Trade Update: Opened Position | [bitcoin, trade, updat, :, open, posit] | bitcoin trade updat : open posit |
... | ... | ... | ... |
1995 | Bitcoin Bill Pay Company Living Room of Satosh... | [bitcoin, bill, pai, compani, live, room, of, ... | bitcoin bill pai compani live room of satoshi ... |
1996 | NYDFS Extends BitLicense Bitcoin Regulation Co... | [nydf, extend, bitlicens, bitcoin, regul, comm... | nydf extend bitlicens bitcoin regul comment pe... |
1997 | Bitfinex Passes Stefan Thomas’s Proof Of Solve... | [bitfinex, pass, stefan, thomas’, proof, of, s... | bitfinex pass stefan thomas’ proof of solvenc ... |
1998 | Cryptocurrency Exchange Platform AlphaPoint Pa... | [cryptocurr, exchang, platform, alphapoint, pa... | cryptocurr exchang platform alphapoint partner... |
1999 | Want to Buy And Sell Bitcoin Fast and Secure? ... | [want, to, bui, and, sell, bitcoin, fast, and,... | want to bui and sell bitcoin fast and secur ? ... |
2000 rows × 3 columns
We can see that `bitcoins` no longer appears as its own keyword; its occurrences have been merged into the `bitcoin` count, together with many other spelling variants of Bitcoin in the dataset.
stem_df = nlu.load('yake').predict(stem_df.stem_string)
stem_df.explode('keywords').keywords.value_counts()[0:100].plot.bar(title='Top 100 Keywords in Stemmed BTC News Articles', figsize=(20,8))
<matplotlib.axes._subplots.AxesSubplot at 0x7f7e1d771cd0>
stem_df.explode('keywords').keywords.value_counts()[1:100].plot.bar(title='Keywords 2-100 in Stemmed BTC News Articles', figsize=(20,8))
<matplotlib.axes._subplots.AxesSubplot at 0x7f7e1e511990>
The YAKE annotator exposes the following configurable parameters:

- `setNKeywords` — increase the number of keywords extracted
- `setMinNGrams` — minimum N-grams a keyword should have
- `setMaxNGrams` — maximum N-grams a keyword should have
- `setWindowSize` — window size for co-occurrence
- `setThreshold` — keyword score threshold
- `setStopWords` — the words to be filtered out; by default English stop words from Spark ML

import nlu
yake_pipe = nlu.load('yake')
yake_pipe.print_info()
The following parameters are configurable for this NLU pipeline (you can copy-paste the examples):

>>> component_list['yake_keyword_extraction'] has settable params:
component_list['yake_keyword_extraction'].setMinNGrams(1) | Info: Minimum N-grams a keyword should have | Currently set to : 1
component_list['yake_keyword_extraction'].setMaxNGrams(3) | Info: Maximum N-grams a keyword should have | Currently set to : 3
component_list['yake_keyword_extraction'].setNKeywords(3) | Info: Number of Keywords to extract | Currently set to : 3
component_list['yake_keyword_extraction'].setWindowSize(3) | Info: Window size for Co-Occurrence | Currently set to : 3
component_list['yake_keyword_extraction'].setThreshold(-1.0) | Info: Keyword Score threshold | Currently set to : -1.0
component_list['yake_keyword_extraction'].setStopWords(['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'now', "i'll", "you'll", "he'll", "she'll", "we'll", "they'll", "i'd", "you'd", "he'd", "she'd", "we'd", "they'd", "i'm", "you're", "he's", "she's", "it's", "we're", "they're", "i've", "we've", "you've", "they've", "isn't", "aren't", "wasn't", "weren't", "haven't", "hasn't", "hadn't", "don't", "doesn't", "didn't", "won't", "wouldn't", "shan't", "shouldn't", "mustn't", "can't", "couldn't", 'cannot', 'could', "here's", "how's", "let's", 'ought', "that's", "there's", "what's", "when's", "where's", "who's", "why's", 'would']) | Info: the words to be filtered out; by default it's the English stop words from Spark ML

>>> component_list['tokenizer'] has settable params:
component_list['tokenizer'].setTargetPattern('\S+') | Info: pattern to grab from text as token candidates. Defaults \S+ | Currently set to : \S+
component_list['tokenizer'].setContextChars(['.', ',', ';', ':', '!', '?', '*', '-', '(', ')', '"', "'"]) | Info: character list used to separate from token boundaries | Currently set to : ['.', ',', ';', ':', '!', '?', '*', '-', '(', ')', '"', "'"]
component_list['tokenizer'].setCaseSensitiveExceptions(True) | Info: Whether to care for case sensitiveness in exceptions | Currently set to : True
component_list['tokenizer'].setMinLength(0) | Info: Set the minimum allowed length for each token | Currently set to : 0
component_list['tokenizer'].setMaxLength(99999) | Info: Set the maximum allowed length for each token | Currently set to : 99999

>>> component_list['document_assembler'] has settable params:
component_list['document_assembler'].setCleanupMode('shrink') | Info: possible values: disabled, inplace, inplace_full, shrink, shrink_full, each, each_full, delete_full | Currently set to : shrink
yake_pipe['yake_keyword_extraction'].setNKeywords(4)
key_df = yake_pipe.predict(df)
key_df
| | document | keywords | keywords_confidence |
---|---|---|---|
0 | Bitcoin Price Update: Will China Lead us Down? | update | 0.5798862558280943 |
0 | Bitcoin Price Update: Will China Lead us Down? | china | 0.5798862558280943 |
0 | Bitcoin Price Update: Will China Lead us Down? | lead | 0.5798862558280943 |
0 | Bitcoin Price Update: Will China Lead us Down? | china lead | 0.5066323531331214 |
1 | Key Bitcoin Price Levels for Week 51 (15 – 22 ... | price | 0.5798862558280943 |
... | ... | ... | ... |
1998 | Cryptocurrency Exchange Platform AlphaPoint Pa... | support growth | 0.1840422979793308 |
1999 | Want to Buy And Sell Bitcoin Fast and Secure? ... | sell bitcoin | 0.3579604335906263 |
1999 | Want to Buy And Sell Bitcoin Fast and Secure? ... | bitcoin fast | 0.3579604335906263 |
1999 | Want to Buy And Sell Bitcoin Fast and Secure? ... | try coinrnr | 0.2564243599387429 |
1999 | Want to Buy And Sell Bitcoin Fast and Secure? ... | sell bitcoin fast | 0.28203029979078753 |
8070 rows × 3 columns
key_df.explode('keywords').keywords.value_counts()[0:100].plot.bar(title='Top 100 Keywords in BTC News Articles', figsize=(20,12))
<matplotlib.axes._subplots.AxesSubplot at 0x7f7e1f5bd350>
yake_pipe['yake_keyword_extraction'].setMinNGrams(2)
yake_pipe['yake_keyword_extraction'].setMaxNGrams(4)
key_df = yake_pipe.predict(df)
key_df
| | document | keywords | keywords_confidence |
---|---|---|---|
0 | Bitcoin Price Update: Will China Lead us Down? | bitcoin price | 0.7475647452220192 |
0 | Bitcoin Price Update: Will China Lead us Down? | china lead | 0.3774989624964526 |
0 | Bitcoin Price Update: Will China Lead us Down? | lead us | 0.5619156399368569 |
0 | Bitcoin Price Update: Will China Lead us Down? | china lead us | 0.49160495247060043 |
1 | Key Bitcoin Price Levels for Week 51 (15 – 22 ... | key bitcoin | 0.7475647452220192 |
... | ... | ... | ... |
1998 | Cryptocurrency Exchange Platform AlphaPoint Pa... | bitfinex to support growth | 0.3685173882155852 |
1999 | Want to Buy And Sell Bitcoin Fast and Secure? ... | sell bitcoin | 0.2923195563311814 |
1999 | Want to Buy And Sell Bitcoin Fast and Secure? ... | bitcoin fast | 0.2923195563311814 |
1999 | Want to Buy And Sell Bitcoin Fast and Secure? ... | try coinrnr | 0.15815767906792633 |
1999 | Want to Buy And Sell Bitcoin Fast and Secure? ... | sell bitcoin fast | 0.20049687371139055 |
7365 rows × 3 columns
key_df.explode('keywords').keywords.value_counts()[0:100].plot.bar(title='Top 100 Keywords in BTC News Articles', figsize=(20,12))
<matplotlib.axes._subplots.AxesSubplot at 0x7f7e1c960c90>