CS 121 Lecture 6: Code and Data

All the usual announcements...

In [55]:

%%html
'<iframe src="http://free.timeanddate.com/countdown/i5vf6j5p/n43/cf11/cm0/cu4/ct1/cs1/ca0/co0/cr0/ss0/cac09f/cpc09f/pct/tcfff/fs100/szw576/szh243/iso2017-09-19T10:07:00" allowTransparency="true" frameborder="0" width="177" height="35"></iframe>'

''

On grades

The midterm and final will be designed so that you will do fine as long as you:

Read all lecture notes carefully before lecture.
Make an honest attempt (on your own!) at all non-bonus problems.
Go over the feedback and understand what you got wrong (use office hours if needed).
Use the sections and Piazza to solidify your understanding.

In [12]:

import random, lzma
# import lz4.block  # third-party; only needed for the alternative compressor commented out below

def printnum(a):
    # Print the integer a with thousands separators, e.g. 1,000,000
    print("{:,d}".format(a))

def compress(b):
    return lzma.compress(b,preset=9)
    # return lz4.block.compress(b,mode='high_compression',compression=12)

In [13]:

letters = bytes("abcdefghijklmnopqrstuvwxyz","ascii")

longtext = bytearray(1000000) 
for i in range(len(longtext)):
    longtext[i] = random.choice(letters)

In [14]:

longtext[50:60]

Out[14]:

bytearray(b'vptswbixml')

In [15]:

shorttext = compress(longtext) # use built-in Python3 lzma library

shorttext is the LZMA compression of 1,000,000 random letters.

Approximately how long is shorttext, in bits?

a. 1,000,000

b. 125,000

c. 4,700,000

d. 8,000,000

In [16]:

printnum(len(shorttext)*8)

4,847,872
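One way to sanity-check this number is a back-of-the-envelope estimate, sketched below. It assumes each letter is drawn uniformly and independently from 26 symbols, so no lossless encoding can use fewer than log2(26) bits per letter on average. The names `ALPHABET_SIZE` and `NUM_LETTERS` are introduced here for illustration only.

```python
import math

ALPHABET_SIZE = 26       # lowercase letters a-z, as in `letters` above
NUM_LETTERS = 1_000_000  # length of longtext

# Information content of one uniformly random letter, in bits
bits_per_letter = math.log2(ALPHABET_SIZE)

# Average-case lower bound for encoding the whole string
lower_bound_bits = NUM_LETTERS * bits_per_letter

print("{:.3f} bits per letter".format(bits_per_letter))
print("{:,d} bits in total".format(round(lower_bound_bits)))
```

The estimate comes out at roughly 4.7 million bits, so the measured 4,847,872 bits is only about 3% above it: the compressor removes most of the slack in the 8-bits-per-letter byte encoding, but it cannot go below the randomness that is actually in the data.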

English text is about 40% vowels. Suppose we drew longtext2 as 1M random letters from a distribution with that vowel frequency, and set shorttext2 = compress(longtext2). Then shorttext2 would be:

a. Shorter than shorttext

b. Longer than shorttext
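A rough way to reason about this question is to compare entropies, as in the sketch below. It makes a simplifying assumption that is not in the lecture: the 40% vowel mass is split evenly among the 5 vowels and the remaining 60% evenly among the 21 consonants. The helper `entropy_bits` and the constant `VOWEL_MASS` are hypothetical names used only for this illustration.

```python
import math

VOWELS = "aeiou"
CONSONANTS = "bcdfghjklmnpqrstvwxyz"
VOWEL_MASS = 0.4  # assumed total probability of drawing a vowel

def entropy_bits(probs):
    # Shannon entropy in bits of a discrete distribution
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Uniform distribution over all 26 letters (how longtext was generated)
uniform = [1 / 26] * 26

# Skewed distribution: vowels share 40%, consonants share 60% (assumption)
skewed = ([VOWEL_MASS / len(VOWELS)] * len(VOWELS)
          + [(1 - VOWEL_MASS) / len(CONSONANTS)] * len(CONSONANTS))

print("uniform: {:.3f} bits per letter".format(entropy_bits(uniform)))
print("skewed:  {:.3f} bits per letter".format(entropy_bits(skewed)))
```

Under this assumption the skewed distribution has lower entropy than the uniform one, which suggests a good compressor should produce a shorter output for longtext2. How close lzma gets to that bound in practice would need to be measured.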

In [7]:

120480684*8 / 10**9  # Compression of 10^9 characters of English Wikipedia XML dump

Out[7]:

0.963845472

The best compression algorithms compress English text to less than one bit per character (see, e.g., the Large Text Compression Benchmark).