%matplotlib notebook
from stemgraphic.alpha import word_scatter, stem_scatter
from stemgraphic.stopwords import EN
import cufflinks as cf
cf.go_offline()
In stem and word scatter, we covered the basics of comparing two different sources. But scatter will also take a third source as an argument, in which case it displays the data in a 3D view. We will reuse our two stories from before and add a third one:
src1 = '../datasets/The Red Headed League by Arthur Conan Doyle.txt'
src2 = '../datasets/A Case of Identity by Arthur Conan Doyle.txt'
src3 = '../datasets/The Final Problem by Arthur Conan Doyle.txt'
3D scatter plots are most informative in interactive form, where one can zoom in, rotate, and look at the data from different angles, so we'll go straight into that mode:
stem_scatter(src1, src2, src3);
These are raw counts: stop_words are not removed and the data is not normalized. Typically, when building n-grams with stem-and-leaf, we want to normalize the data so that the counts for sources 2 and 3 are adjusted up or down based on their size relative to source 1. For the leaf count, we'll go with 2 and no stem or leaf skip, so we are looking at standard trigrams (a one-character stem plus two leaf characters) at the beginning of words:
stem_scatter(src1, src2, src3, normalize=True, leaf_order=2);
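To make the idea behind these two options concrete, here is a minimal sketch in plain Python. It is not stemgraphic's implementation: the helper names `leading_ngram_counts` and `normalize_to_first` are hypothetical, the texts are tiny stand-ins, and "size" is assumed to mean the total number of words counted in each source.

```python
from collections import Counter
import re


def leading_ngram_counts(text, size=3):
    """Count the first `size` letters of every word.

    Words shorter than `size` are kept whole (a choice made for this sketch).
    """
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(word[:size] for word in words)


def normalize_to_first(reference, other):
    """Rescale the counts in `other` so their total matches `reference`.

    Assumption: "size" means the total number of counted words.
    """
    factor = sum(reference.values()) / sum(other.values())
    return {key: count * factor for key, count in other.items()}


# Tiny stand-in texts (hypothetical, not the Conan Doyle stories).
text1 = "the red headed league met on the street"
text2 = "the case of the strange street"

counts1 = leading_ngram_counts(text1)   # 8 words in total
counts2 = leading_ngram_counts(text2)   # 6 words in total
scaled2 = normalize_to_first(counts1, counts2)

print(counts1.most_common(3))
print(scaled2)
```

After scaling, the second source's counts sum to the same total as the first, so a trigram that is equally frequent in both texts lands near the diagonal regardless of how long each text is.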
And we probably want a percentage value instead of a direct count. We will also use whole percentages (double-click on x, y or z in the legend to see the data that is more common in that document, or on = to see where all three sources have the same count):
stem_scatter(src1, src2, src3, leaf_order=2, normalize=True, percentage=True, whole=True);
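As a rough illustration of what switching from counts to whole percentages means, the sketch below converts a set of counts to percentages of the total and optionally rounds them. The function `to_percentages` and the sample counts are hypothetical, and stemgraphic may compute and round its values differently.

```python
def to_percentages(counts, whole=False):
    """Express each count as a percentage of the total.

    With whole=True the percentages are rounded to integers.
    """
    total = sum(counts.values())
    percentages = {key: 100 * count / total for key, count in counts.items()}
    if whole:
        return {key: round(value) for key, value in percentages.items()}
    return percentages


# Hypothetical trigram counts for a single source.
counts = {"the": 4, "str": 2, "hol": 1}

exact = to_percentages(counts)
rounded = to_percentages(counts, whole=True)

print(exact)
print(rounded)  # {'the': 57, 'str': 29, 'hol': 14}
```

Rounding to whole values makes exact ties between sources more likely than with fractional percentages, which is presumably what makes the = entry in the legend useful.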
ax, df = stem_scatter(src1, src2, src3, percentage=True, normalize=True, interactive=False, project=True, project_only=False);
/home/fdion/DionResearch/stemgraphic/stemgraphic/alpha.py:1361: UserWarning: Log_scale is not working currently due to an issue in matplotlib
Switching to words. In interactive mode, it is possible to have labels with two or three sources:
ax, df = word_scatter(src1, src2, jitter=True, percentage=True, normalize=True, label=True, stop_words=EN);
ax, df = word_scatter(src1, src2, src3, percentage=True, normalize=True, label=True, stop_words=EN);
In non-interactive mode (meaning matplotlib is used instead of cufflinks; %matplotlib notebook still provides rotation and translation, whereas %matplotlib inline is static), labels are currently only available with 2 sources and a linear scale, due to a matplotlib issue.
ax, df = word_scatter(src1, src2, src3, alpha=0.7, fig_xy=(10, 10),
                      interactive=False, percentage=True, normalize=True, label=True, stop_words=EN);
/home/fdion/DionResearch/stemgraphic/stemgraphic/alpha.py:1371: UserWarning: Labels do not currently work in log scale due to an incompatibility in matplotlib. Set log_scale=False to display text labels.
ax, df = word_scatter(src1, src2, fig_xy=(10, 10), interactive=False, jitter=True, percentage=True, normalize=True,
                      label=True, log_scale=False, stop_words=EN);