#!/usr/bin/env python
# coding: utf-8
#
# # Detecting novel anomalies in Twitter
# ### Delia Rusu and Mattia Boni Sforza
# #### PyData London 2016
# # About Us
# ### Delia Rusu
#
# PhD: Natural Language Processing and Machine Learning
#
# delia@knowsis.com
#
# https://github.com/deliarusu
#
#
# ### Mattia Boni Sforza
#
# MSc: Statistics
#
# mattia@knowsis.com
#
# https://github.com/mattiabs
#
# # Knowsis
# Knowsis is a social-intelligence company providing social media analytics for finance.
#
#
#
# # Knowsis
#
# Topics of research:
# - named entity recognition, disambiguation and linking
# - event detection
# - sentiment analysis
# - social network analysis
# - **anomaly detection**
# - **novelty detection**
#
# ## Q: What is the task?
# ### A: Detect Novel Anomalies in Twitter
# System which promptly flags the user when there is a **novel anomaly** in **social activity**
#
# - **Novel**?
# * Not previously encountered
# - Novel **Anomaly**?
# * Something that deviates from what is standard, normal, or expected
# - Novel Anomaly in **social activity**?
# * Unexpected amount of tweets in short time which creates a spike in tweets volume
#
#
#
# ## Novel anomalies detection pipeline
#
#
#
# # Main Challenges
#
# - Identify unusual activity **promptly**
# - Avoid false anomalies
#
#
#
# In[1]:
from IPython.core.display import Image, display
display(Image('images/VW_chart.jpg', unconfined=True))
# ## What you need
#
# 1. Define a variable to measure
# * what is the main characteristic that makes the data anomalous?
# 2. Establish if the observed variable is unexpected
# * how to quantify the unexpected nature of this quantity?
# ## Approach
#
# 1. Variable to measure is the count of tweets given a fixed time window (aka rate)
# 2. Consists of two steps:
# * Predict the tweets count given some predictor variables (characteristics of the data known in advance)
# * Use the prediction to establish how (un)likely it is to observe such a tweets volume
#
# OTHER WAYS TO DO IT
# pretty important
# means you can apply this approach to any problem which involves events happening in time (sensor, serer requests, biochimical..)
#
# Define features which make the measured variable unexpected
# Consider a different quantity (or quality) and define what anomalous means with respect to this new target (inter-arrival times, another quantity, another approach bayesian)
# ## Model the tweets rate
#
# - Static volume
# - Dinamic amount to account for seasonality
# - Include short time memory to only detect spikes
#
#
#
# Natural way to model counts of events happening in time is to use a Poisson process $N(t)$ with parameter $\lambda$, the rate at which the events happen
#
# $$ \mathbb{E}[N(t)] = \textbf{expected number of events occurred until time } t = \lambda t$$
#
# A Poisson regression models the logarithm of the expected number of events as a **linear combination** of predictor variables
#
# $$ \log(\mathbb{E}[N]) = log(\lambda) = \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k $$
#
# with $\beta_i$ coefficients of the model estimated based on volume history and $x_i$ observable quantities of the datapoint.
#
# ## Predictor variables for $\lambda$
# Consider seasonality at different scales
# - day of the week $x_w$
# - hour of the day $x_h$
# - month (?)
# - ...
#
# Consider recent behaviour
# - count in the previous time window $x_{t-1}$
# - count 2 time windows before $x_{t-2}$
# - ...
# - count for the day so far or yesterday (?)
#
# Other variables?
# - ...
#
# ## Detect the anomaly
#
# With the expected rate $\lambda_{pred}$, how (un)likely is that we see a number of tweets $N$ higher than the one we have so far?
#
# If this event has probability lower than $\alpha$, it is anomalous!
#
# $$\mathbb{P}(N \geq n_{obs} | \lambda_{pred}) < \alpha$$
#
# ## Pros
#
# - timely (predict in advance)
# - unsupervised (no need for annotations)
# - easily adapt to different time scales
# - easy to extend and include more predictors
# - applicable to discrete data of different nature
# - Generalised Linear Model applicable to different distribution families
#
# ## ...and Cons
#
# - needs other algorithms to determine if the anomaly is geneuine (spam driven, somehow expected, ...)
# - higly sensitive (try Zero inflated Poisson)
# - variance and expected value are the same (try Negative Binomial)
#
# # Novelty Detection
# Given an **anomaly period**, identify which tweets are **novel**
#
# Observation: tweets tend to cluster in **stories**
# Given an **anomaly period**, identify the tweets which are ~~novel~~ **not part of old stories**
#
# # VW Example
#
#
# # Challenges
# - **streaming** setting - each tweet needs to be processed in **bounded space** and **time**
# - bounded space - constant amount of stories in memory
# - bounded time - Locality Sensitive Hashing (Indyk and Motwani, 1998)
# - how much history is required to decide if a tweet is novel
# - stories can be reoccurring
#
# # First Story Detection
#
#
# # Locality Sensitive Hashing
#
# - approximate nearest neighbor task
# - hashing each tweet into buckets to maximize the probability of collision for tweets which are close
# - requires a measure of distance for tweets
# Advantage:
# - LSH works well if tweet is close to its nearest neighbor
# Disadvantage:
# - LSH fails to return the nearest neighbor for distant tweets
# - Solution: if the LSH doesn't return a neighbor, check novelty against most recent tweets
#
# # Novelty Flow
#
#
# # Conclusions
#
# - general anomaly detection algorithm which is:
# - applicable to different problems
# - easily understandable
# - fast training and detection
# - works well in a streaming pipeline
#
# - two-step novelty detection
# - fast lookup using LSH
# - additional lookup in inverted index for the most recent tweets
#
# # Thank You!
#
#
#
# Presentation & Notebooks:
#
# https://github.com/knowsis/novel-twitter-anomalies-pydatalondon2016/
#
#
#
#
# delia@knowsis.com
#
# mattia@knowsis.com
#
#