#!/usr/bin/env python
# coding: utf-8

# ![header](header.png)
#
# # Strings
#
# String manipulation is widely regarded as one of Python's strong points. These segments of text contain many rich features, and it is straightforward in Python to pull strings apart and re-combine them in interesting ways.
#
# Working with strings is an important skill in Python. For one, reading data into a program will often take the form of a file containing text, and so we must be proficient in extracting the bits we're interested in. Moreover, we are humans. We do not want the computer to merely spit out a list of numbers after performing a calculation; we want our outputs presented in a way that we can actually read.
#
# Strings are an immutable data type. The operations we perform involve forging new strings from old, rather than modifying the original string.
#
# ## Slicing and dicing
#
# To get a "slice" of a string is similar to getting an item from a list (in fact we can also get slices of lists by the same syntax). However, we specify a start and end index, rather than just an index:

# In[5]:

a_string = "This is a piece of string"
print(a_string[8:16])

# We can also leave one end "open"

# In[8]:

print(a_string[:4])
print(a_string[5:])

# To get the last x letters, we can count backwards with negative numbers:

# In[9]:

print(a_string[-6:]) # will get the last 6 letters!

# Strings can be concatenated (joined together) using the addition operator:

# In[11]:

print("Hallo " + "Welt")

# If we wish to use this technique to include, for instance, numbers, then we must first convert the number into a string, as + has a different meaning for strings and numbers.

# In[5]:

year = 2017
print("The year is " + str(year) + "!")

# These tools already provide a flexible system for string manipulation. Many of the functions that work on sequence-like data will also work on strings.
# For instance:

# In[2]:

print(len("How long is a piece of string?"))

# In[3]:

print(list("A list of letters"))

# ## Special characters and escape sequences
#
# Certain special characters are represented with a backslash followed by a letter, called an escape sequence. Python considers this pairing to be a single character, even though it looks like two characters on the screen. For example, a new line is represented with \n.

# In[4]:

print("Split a string\nonto two lines")

# Escape sequences also allow you to use quotation marks inside a string without Python thinking you are closing the string. Finally, if you actually do want to insert a backslash, then \\ is the escape sequence to use.
#
# A full list of escape sequences can be found [here](http://www.techpaste.com/2014/06/escape-sequences-python/). Probably you don't know what all of these do. Neither do I. Why not try some out anyway?
#
# ## Formatting strings
#
# Many languages, such as C and its derivatives, allow a segment of text to receive inputs using a funny looking syntax in which % signs appear everywhere. Python supports this syntax, and in Python 2 this was the preferred way to format strings. In Python 3, however, we have the more powerful .format() method.
#
# Have a good look at this section, as it provides the ideal tools for giving useful, readable outputs. However, string formatting is virtually its own mini-language, and is a lot to take in at once. The important thing is to know that Python can do all these things. You can work out the details as and when you need them.
#
# The most basic use of string formatting is to insert data from your program into a string. This can be any data that has a suitable string representation, such as numbers. There are many options for doing this. The first is simply by position:

# In[7]:

destination = "St. Ives"
wivescount = 7
poem = "As I was going to {} I met a man with {} wives".format(destination, wivescount)
print(poem)

# Notice that we have two {}s and provide two arguments to the format function. The order in which we provide the arguments is the order in which they appear in the text. We can reference the arguments more explicitly by including the position (this is useful if the same argument will appear several times):

# In[12]:

poem = """As I was going to {0}
I met a man with {1} wives,
those {1} wives had {1} sacks.""".format(destination, wivescount) # triple quotes allow multi-line paragraphs
print(poem)

# So here we can see that argument 0 is referenced once, while argument 1 appears three times. If we don't want to worry about position, we can use keyword arguments:

# In[14]:

poem = """Those {count} sacks had {count} {animals}""".format(animals="cats", count=wivescount)
print(poem)

# We can supply .format() with any data structure, and access its items in the usual way:

# In[21]:

travellers = ["wives", "sacks", "cats", "kits"]
# access list items
poem = "{0[3]}, {0[2]}, {0[1]} and {0[0]}, how many going to {1}?".format(travellers, destination)
print(poem)

# .format() also provides ways to represent floats and large integers.

# In[24]:

large_int = 23455453424
print("There were {0:,} travellers to {1}".format(large_int, destination)) # :, adds comma as thousands separator.

# In[39]:

correct_answer = 2802
rebuke = "That answer is {0:f} times too big!".format(large_int/correct_answer) # default (6 decimal places)
print(rebuke)
rebuke = "That answer is {0:.2f} times too big!".format(large_int/correct_answer)
print(rebuke) # two decimal places
rebuke = "That answer is {0:,.2f} times too big!".format(large_int/correct_answer)
print(rebuke) # two decimal places and comma separators

# Yet another use for string formatting is in aligning text. Text can be left-aligned, right-aligned, or centered.
# Let's place 3 words on separate lines, with a line width of 30 characters, each with a different alignment:

# In[1]:

print( "{:<30}\n{:>30}\n{:^30}".format("left", "right", "center") )

# So, this is no more mysterious than "{}\n{}\n{}", but we use the <, >, ^ characters to show alignment, followed by the line width.

# ## More ways to carve up strings
#
# In the Data Structures example video, we met a function called partition() that splits up a string if you provide it with a separator character. There are many, many functions on strings that perform similar tasks.
#
# For example, we have splitting and joining, which allow easy conversion from lists to strings and strings to lists. A string can be split into individual words using the split() function:

# In[40]:

listofwords = "These words will form a list".split()
print(listofwords)

# split() will also take an argument to specify a different delimiter.
#
# The counterpart to this is join(), which is called on the string you wish to use as a separator, and takes a list as its argument. This is a faster algorithm for gluing together a bunch of words than repeated use of the + operator, and it can easily be combined with a list comprehension, too.

# In[42]:

# just straight up join, good for making a word
# we here use join on an empty string to glue the letters directly
word = "".join(['H', 'e', 'l', 'l', 'o'])
print(word)

# In[43]:

# here we use join with spaces to separate, making a sentence
reunited_words = " ".join(listofwords)
print(reunited_words)

# In[46]:

# more complicated example. the joining string here is a comma followed by new line
# we also use a list comprehension to capitalize each item in the list
shopping_list = ['bread', 'bananas', 'beans', 'beer']
readable_shopping = ",\n".join([item.capitalize() for item in shopping_list])
print(readable_shopping)

# ## Cutting off the beginning or end of a line; the string module
#
# It is somewhat common when working with strings to wish to remove a chunk of text from the beginning or end of a line. As an example, suppose I have copied and pasted a numbered list from the internet, and I wish to remove the numbers. The trouble is, the numbers have different numbers of digits, so I can't just do a straight up slice on each line.

# In[45]:

best_python_books = """1. Dive Into Python 3
2. Automate The Boring Stuff With Python
3. Python For Everyone
4. Python Cookbook, 3rd Ed
5. Python For Data Analysis
6. Fluent Python
7. Violent Python
8. Think Python
9. Learn Python The Hard Way
10. Problem Solving with Algorithms and Data Structures Using Python
11. Python Crash Course
"""

# The tool that comes to our rescue is .strip(), which takes as its argument a collection of characters as a string. Python will remove those characters from both ends of a string, stopping at each end when it reaches a character not contained in the argument (its siblings .lstrip() and .rstrip() work on the left or right end only). If we give it no argument, it just removes whitespace (spaces, tabs and newlines). To break the list into separate lines, we'll use .splitlines(), which is like split(), but splits at line breaks.

# In[46]:

lines = best_python_books.splitlines()
print(lines)

# Now we need to remove the leading characters. Python provides a useful module called string in its standard library, that contains lots of useful strings, as well as additional functions for working with strings.
# We want to remove the numbers at the start, so we can say:

# In[47]:

import string
# gives us a string containing all the numbers
print(string.digits)

# While this actually requires more keystrokes than simply writing the numbers "0123456789", the string module contains many other collections of characters like this, such as punctuation:

# In[48]:

print(string.punctuation)

# In[49]:

print(string.ascii_letters)

# These strings are useful for checking facts about other strings. For example, this snippet will check if there is any punctuation in a string:

# In[50]:

def has_punc(text):
    import string
    for c in string.punctuation:
        if c in text:
            return True
    return False

print( has_punc("Has no punctuation") )
print( has_punc("Has punctuation."))

# Anyway, back to our problem. We want to remove the numbers, full stops and whitespace from the start of each string in the list. Here we go:

# In[51]:

chars_to_remove = ". "+string.digits # make a string containing all bad chars
# lstrip() strips the left-hand end only; plain strip() would also eat the
# trailing "3" from "Dive Into Python 3"
nice_list = [book.lstrip(chars_to_remove) for book in lines]
list_as_string = "\n".join(nice_list)
print(list_as_string)

# Clearly, to use lstrip() (or strip()), we have to be pretty confident about the format of our data. If I had a book called "20 Cool Python Programs" in the list, then the "20" part would have been stripped out as well.
#
# ## Some quick transformations of strings
#
# The developers of Python kindly include many single-word ways to make quick adjustments to strings. We just demonstrate a bunch of them here; their functioning should be self-explanatory:

# In[68]:

my_string = "Some advanced string theory"
print(my_string.capitalize())
print(my_string.lower())
print(my_string.upper())
print(my_string.swapcase())
print(my_string.title())
print(my_string.replace("advanced", "basic"))

# You can read more about how to use strings [here](https://docs.python.org/3/library/stdtypes.html#text-sequence-type-str)
# and about the string module [here](https://docs.python.org/3/library/string.html).
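# Coming back to the numbered book list for a moment: here is a minimal sketch of how strip(), lstrip() and partition() behave on that kind of line. The entry "12. 20 Cool Python Programs" is made up for illustration and is not in the list above; everything else uses only built-in string methods.

```python
import string

# Characters to strip: full stop, space, and the ten digits.
chars_to_remove = ". " + string.digits

line = "1. Dive Into Python 3"

# strip() works on both ends, so the trailing "3" is lost as well.
both_ends = line.strip(chars_to_remove)

# lstrip() works on the left end only, so the title survives intact.
left_only = line.lstrip(chars_to_remove)

# A title that itself starts with a number defeats lstrip() too.
# (This list entry is hypothetical, invented for the example.)
tricky = "12. 20 Cool Python Programs"
over_stripped = tricky.lstrip(chars_to_remove)

# partition() splits at the first ". " only, so the rest is left untouched.
number, separator, title = tricky.partition(". ")

print(both_ends)
print(left_only)
print(over_stripped)
print(title)
```

# partition() is only one way round the problem, and it assumes that every line really does contain ". " straight after the number.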
# # Text
#
# You may be looking at the title of this section and thinking "wait, haven't we just learned all about text?". And the answer is yes... sort of. We've been learning about how to work with text inside Python. We've just been willy-nilly writing some strings and slicing them up. What we have not talked about, and is unfortunately an important thing to be aware of, is text that is out there in the wild.
#
# ## The bad news
#
# Some terminology first. A character is the abstract meaning behind the little squiggles you see on the screen that form text. The squiggles themselves are called "glyphs". However, the glyphs cannot be the characters, since in different fonts, the same character can be represented by two very different glyphs. So for instance, the character that is "the first uppercase letter of the latin alphabet and sounds like 'ey'" is represented by the glyph "A" in one font, and the glyph "$\mathfrak{A}$" in another. We don't need to worry about glyphs, since this is determined by the font of our computer, not our program.
#
# When we work with text in Python, we just see the strings as "a sequence of characters". This is great, and is exactly how you should think of strings when you manipulate them in Python. Trouble is, computers do not store characters; they store numbers written in binary. This means there is a natural problem of deciding which characters correspond to which numbers.
#
# Back in the days of yore, computer users mostly just shared software around their own universities and workplaces, at the very least within their own country. This means that lots of different character encodings sprang up -- methods for turning the bytes (numbers) in a file into text on the screen. In the UK and US, the common solution was ASCII encoding, which due to our relatively small alphabet meant that each character could be stored in a single byte as a number between 0 and 127.
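# If you want to see these numbers for yourself, Python's built-in ord() and chr() convert between a character and its number. A small sketch (the sample text "Hello, World!" is just an arbitrary choice):

```python
# ord() gives the number behind a character; chr() goes the other way.
print(ord("A"))
print(chr(97))

# Look up the number for every character in a short piece of English text.
ascii_codes = [ord(character) for character in "Hello, World!"]
print(ascii_codes)

# ASCII only covers the numbers 0 to 127, so check they all fit.
fits_in_ascii = all(code < 128 for code in ascii_codes)
print(fits_in_ascii)
```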
#
# These days, there is something called the internet, which means that text is now flying all over the world, from the US, which uses only a few characters; to Europe, which uses more because of all the accents; to Greece, Russia, and the Middle East, which have different alphabets; to China, which has literally thousands of characters. Also, in the 21st century, there are emojis.
#
# So if you're planning on interacting with text that comes from a source outside of your own computer, you might run into problems.
#
# ## Unicode
#
# Unicode is a system for organizing the characters of the world, as well as a set of standards for encoding these characters. In a nutshell, each character is assigned to a Unicode "code point". For instance the letter "A" is "0041". This is not an encoding, but merely an organization method. At least now we can organize the characters of the world into some kind of coherent structure, a bit like a periodic table of elements but for text. According to Wikipedia, at the time of writing the current version of Unicode contains 136,755 characters from 139 writing systems, plus additional symbols that are not from any particular writing system, such as emoji. When you create a string in Python, it is considered to be a sequence of Unicode code points.
#
# Note that Unicode code points are generally given in hexadecimal. This is simply a way of writing down numbers using 16 symbols instead of the usual ten (or two, in binary), meaning larger numbers can be written using fewer digits. Since we have only ten numerals, letters are used for the remaining six; the digits used in hexadecimal are 0123456789ABCDEF. A is ten, E is fourteen, F is fifteen. 10 is sixteen, 11 is seventeen, all the way to FF, which is 255 in decimal, and then we move on to 100 (256 in decimal). This is just a way to shorten long numbers in binary -- each hex digit corresponds to 4 binary digits (bits), meaning that a byte is represented by two hex digits. There's rarely any need to do arithmetic with hex-represented numbers, or figure out what the decimal/binary representation is. Just be aware that when you see, say, "A3" when discussing a Unicode code point, it is actually a number.
#
# There remains the question then of how to encode these characters into bytes. Unicode has several official standards, and many other encodings use the Unicode system. The reason for different standards is that European computer users, say, whose languages' characters could be stored in 2 bytes, did not want to have to use additional bytes to store all the Chinese characters they seldom used, for example by allocating 4 bytes per character. By having different encoding systems, we've solved one problem, but it's not very flexible. What about times when we need those occasional Chinese characters? What about emoji?
#
# Then two bright sparks, Ken Thompson and Rob Pike, came up with a new encoding system called UTF-8. In their system the number of bytes per character could vary. The standard ASCII characters, for example, could be stored as 1 byte. European characters, 2 bytes. 3 bytes is enough to store the Chinese characters in common use. Other, less common symbols, such as emoji, are relegated to taking up 4 bytes. And all of these can co-exist within the same file. There's no need to switch encoding midway through.
#
# There is a price to pay, as with most nice things in life. For a program to search through UTF-8 encoded text now takes longer, because it can't count on each of the characters being the same length (in terms of bytes); it has to inspect the bytes to see when one character ends and the next begins.
#
# Nonetheless, UTF-8 is now the standard encoding on the web, accounting for over 90% of websites. The Python interpreter expects Python files to be written in UTF-8. Python "just works" when it reads UTF-8 encoded text.
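# To make the variable-length idea concrete, here is a rough sketch that encodes four sample characters (chosen arbitrarily for illustration) as UTF-8 and counts the bytes each one needs. It also checks the hexadecimal claim from earlier, that FF is 255.

```python
# One ASCII letter, one accented letter, one Chinese character, one emoji.
samples = ["A", "é", "中", "😀"]

byte_lengths = {}
for character in samples:
    encoded = character.encode("utf-8")
    byte_lengths[character] = len(encoded)
    # ord() gives the code point as a number; {:04X} writes it as at least 4 hex digits.
    print("U+{:04X} takes {} byte(s) in UTF-8: {}".format(ord(character), len(encoded), encoded))

# int() with base 16 reads a hexadecimal string; hex() goes the other way.
print(int("FF", 16))
print(hex(255))
```

# Only the code point and the raw bytes are printed, so the output stays readable even in a terminal that cannot display the characters themselves.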
# ## The good news
#
# Working with these different encodings in Python is actually fairly straightforward for the most part. Your mission when working with text in Python should be to convert it to a string as soon as is possible. In string form, you are working directly with characters -- that is to say, Unicode code points -- and don't have to worry about what these characters are actually stored as in memory.
#
# The two functions you need are str.encode() and bytes.decode() (where str and bytes are the snippet you want to en-/decode). The argument taken is the name of the encoding to be used, given as a string -- the default encoding is UTF-8.

# In[4]:

my_name = "Sam"
my_name_as_bytes = my_name.encode()
print(type(my_name_as_bytes))

# If we print my_name_as_bytes, what do we get?

# In[5]:

print(my_name_as_bytes)

# It looks the same, but with a little b at the beginning. But it is certainly not the same. These are no longer characters, but representations of sequences of bytes. Confused?
#
# Here is what is going on. There is some good news here, actually: the most commonly used characters in English form part of the ASCII character set, which if we recall stores each character as a number between 0 and 127. One of the clever aspects of UTF-8 encoding is that the encodings for these particular characters are identical to their ASCII encodings. You can open a file written in ASCII using a UTF-8 decoding algorithm and you wouldn't know the difference.
#
# Now, when Python represents a sequence of bytes, rather than give you a sequence of numbers, it represents each byte with an ASCII character where it can, as this is usually easier to interpret. Because the letters of Sam are in the ASCII character set, this is no problem!
# However, the same cannot be said for another string, for example:

# In[6]:

jacket = "This jacket costs £30"
jacket_as_bytes = jacket.encode()
print(jacket_as_bytes)

# Now we can see the £ sign has a strange representation, because this character is not ASCII (owing to the fact that ASCII is an American standard, so only the dollar sign is part of the system). \x just means "heXadecimal". In other words, these bytes don't have an ASCII representation, so they are given as their hex values instead, and the \x makes us aware of this.
#
# However, these are still valid UTF-8 bytes, and a program reading these bytes as UTF-8 can recover the original:

# In[7]:

print(jacket_as_bytes.decode('UTF-8'))

# Tada! So what happens if we try to encode the last string as ASCII?

# In[8]:

print(jacket.encode('ASCII')) # raises UnicodeEncodeError

# We get an error, because the £ sign simply has no corresponding ASCII number.
#
# So, soon we'll be looking at how to open a (text) file in Python. How do we know how to decode the text? In general, sadly, we don't. For files that use one of the Unicode encodings, the file may have a so-called byte-order mark (BOM) at the beginning, containing 2 to 4 bytes which indicate the encoding used (though it also might not -- more information on that [here](http://codesnipers.com/?q=node/68)). Other than that, you may have to use some trial and error. Presuming you have a rough idea of what the file should look like when decoded, you could write a loop that tries some different encodings for you. Finally, included with Anaconda is a library, chardet, that will attempt to determine the encoding for you. We demonstrate its usage here. Let's first create a string with some non-ASCII characters in:

# In[37]:

# Here I have taken a paragraph from a French newspaper, containing some accented characters
french_paragraph = "Vite, de l'ombre ! Un pic de chaleur est attendu en France mardi, selon les instituts météorologiques, qui tablaient la veille sur des températures allant jusqu'à... 38°C ! «On assiste à une dépression sur le proche Atlantique, décrypte Frédéric Decker, météorologue chez MeteoNews. Celui-ci va avoir un effet de pompe à chaleur en faisant remonter de l'air en provenance d'Espagne et du Maghreb.»"

# Let's see how many non-ASCII characters are in the string
# For this, we'll use a set, a data structure we haven't met before.
# Sets are unordered collections with no duplicates -- so it won't
# count the same characters twice!
non_ascii = set()
for character in french_paragraph:
    # ASCII covers the numbers 0-127, so anything above that is non-ASCII
    # (curses.ascii.isascii makes the same test, but curses is not available on Windows)
    if ord(character) > 127:
        non_ascii.add(character)
print("The string contains {} distinct non-ASCII characters".format(len(non_ascii)))

# Now we will encode it into some different formats:

# In[38]:

encodings = ['UTF-8', 'UTF-16', 'ISO-8859-1', 'macintosh']
french_bytes = {}
for code in encodings:
    french_bytes[code] = french_paragraph.encode(code)

# Now to see if the library is able to have a good guess at how this paragraph has been encoded:

# In[28]:

import chardet
# the dictionary.items() function gives tuples of key-value pairs
for truecode, byte_string in french_bytes.items():
    # chardet.detect() will make this library attempt to guess
    result = chardet.detect(byte_string)
    guess = result['encoding']
    print(result)
    print("Correct answer is {}. chardet's guess is: {}".format(truecode, guess))

# So it managed to correctly guess 3/4 encodings. The moral is that this library can help you out, but you mustn't trust it blindly!

# # Quiz/Exercises
#
# 1. Find out what the default character encoding is on your computer. Import the locale module and use locale.getpreferredencoding().
# 2. If you save this notebook (or copy the below cell into a new module/IPython) and the accompanying file "obituary.txt" into the same folder, the following code will open the file, which is in an unknown encoding, and store its contents as a bytes object. This is a full article from the Daily Telegraph; do not print this bytes object: it's quite long!

# In[4]:

with open('obituary.txt', 'rb') as f:
    obituary = f.read()
print(type(obituary))

# Exercises with the obituary:
# 1. Figure out what encoding this file uses, perhaps using guesswork or chardet, and decode it. Remember: your first mission when working with text is to turn it into a string!
# 2. How many words are in the obituary?
# 3. How many sentences are in the obituary? (hint: every full stop in the article terminates a sentence)
# 4. What is the commonest word in the article?
# 5. What is the commonest word longer than 5 letters?
# 6. What is the longest word in the article?
# 7. Which non-ASCII characters are in the obituary?
# 8. Think of suitable ASCII replacement characters for the non-ASCII characters, and create a version of the obituary called obituary_ascii using these replacements, such that obituary_ascii.encode('ascii') does not raise an error.
#
# Video solution below

# In[2]:

from IPython.display import YouTubeVideo
YouTubeVideo("T8LzuaQH8GQ")

# In[ ]:
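# Earlier we suggested that you could write a loop to try out different encodings. One possible shape for such a loop is sketched below. The name try_encodings is a hypothetical helper invented here, the candidate list is only an example, and the sample bytes are made from the jacket string rather than from the obituary.

```python
def try_encodings(raw_bytes, candidates):
    """Return a dict mapping each candidate encoding that decodes without error to the decoded text."""
    successes = {}
    for name in candidates:
        try:
            successes[name] = raw_bytes.decode(name)
        except UnicodeDecodeError:
            # These bytes are not valid in this encoding, so move on.
            pass
    return successes

# Pretend we were handed these bytes without being told the encoding.
sample = "This jacket costs £30".encode("ISO-8859-1")

results = try_encodings(sample, ["ASCII", "UTF-8", "ISO-8859-1"])
print(list(results))
```

# Decoding without an error does not prove the guess is right: ISO-8859-1 assigns a character to every possible byte, so it never fails. You still need to read the result and judge whether it looks like sensible text.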