# Importing the libraries
import bs4 as bs
import urllib.request
import re
# Candidate Wikipedia article slugs to summarize (q4 is the one fetched below)
q1, q2, q3, q4 = (
    "Mustafa_Kemal_Atatürk",
    "Donald_Trump",
    'Data_science',
    'Machine_learning',
)
# Getting the data source: fetch the Wikipedia article for q4.
# quote() percent-encodes non-ASCII slug characters (e.g. "Atatürk") so the URL is valid.
from urllib.parse import quote
source = urllib.request.urlopen('https://en.wikipedia.org/wiki/' + quote(q4)).read()
# Parsing the data / creating BeautifulSoup object
soup = bs.BeautifulSoup(source, 'lxml')
# Fetching the article body: concatenate the text of every <p> tag.
# str.join builds the string in one pass instead of quadratic `text +=` in a loop.
text = "".join(paragraph.text for paragraph in soup.find_all('p'))
# Preprocessing the data: strip Wikipedia citation markers like [12],
# then collapse all runs of whitespace into single spaces.
# (Removed leftover REPL statements `type(text)` / `str` — they had no effect.)
text = re.sub(r'\[[0-9]*\]', ' ', text)
text = re.sub(r'\s+', ' ', text)
import nltk
# Download the English stopword corpus (network side effect; no-op if already cached)
nltk.download('stopwords')
import heapq
# Build a lowercase, letters-only copy of the text for word counting.
# (The citation/whitespace substitutions formerly repeated here were exact
# duplicates of the ones already applied to `text` above, so they are dropped.)
# `text` itself keeps its punctuation so sentence tokenization still works.
clean_text = text.lower()
clean_text = re.sub(r'\W', ' ', clean_text)   # drop punctuation/symbols
clean_text = re.sub(r'\d', ' ', clean_text)   # drop digits
clean_text = re.sub(r'\s+', ' ', clean_text)  # collapse whitespace
# Tokenize the punctuated text into sentences (scoring units for the summary)
sentences = nltk.sent_tokenize(text)
# English stopword list, converted to a set: the counting loop below tests
# membership once per token, and set lookup is O(1) vs O(n) for a list.
stop_words = set(nltk.corpus.stopwords.words('english'))
# Console output captured from the run (not code):
# [nltk_data] Downloading package stopwords to [nltk_data] /Users/uzaycetin/nltk_data... [nltk_data] Package stopwords is already up-to-date!
# Word counts: frequency of every non-stopword token in the cleaned text.
# dict.get with a default replaces the `in d.keys()` check + double lookup.
word2count = {}
for word in nltk.word_tokenize(clean_text):
    if word not in stop_words:
        word2count[word] = word2count.get(word, 0) + 1
# Converting counts to weights: normalize each count by the maximum count,
# giving relative frequencies in (0, 1].
if word2count:  # guard: max() would raise ValueError on an empty article
    maxi = max(word2count.values())
    for key in word2count:  # a dict iterates its keys directly; no .keys() needed
        word2count[key] = word2count[key] / maxi
# Sentence scores: each short sentence scores the sum of the weights of its
# words that appear in word2count.
sent2score = {}
for sentence in sentences:
    # The length cutoff is independent of the inner word loop, so it is
    # hoisted out (originally it was re-evaluated for every token).
    # Only sentences shorter than 15 space-separated words are considered.
    if len(sentence.split(' ')) < 15:
        for word in nltk.word_tokenize(sentence.lower()):
            if word in word2count:
                sent2score[sentence] = sent2score.get(sentence, 0) + word2count[word]
# Pick the ten highest-scoring sentences and print them as the summary.
best_sentences = heapq.nlargest(10, sent2score, key=sent2score.get)
print('---------------------------------------------------------')
print('\n'.join(best_sentences))
# Example output captured from the run (not code):
# ---------------------------------------------------------
# Machine learning approaches in particular can suffer from different data biases. In machine learning, genetic algorithms found some uses in the 1980s and 1990s. Software suites containing a variety of machine learning algorithms include the following : :25 Machine learning, reorganized as a separate field, started to flourish in the 1990s. As a scientific endeavour, machine learning grew out of the quest for artificial intelligence. Machine learning and statistics are closely related fields. The name machine learning was coined in 1959 by Arthur Samuel. Machine learning poses a host of ethical questions. Sparse dictionary learning has also been applied in image de-noising. A popular heuristic method for sparse dictionary learning is K-SVD.