Python Programming

Lecture 13 Wordcloud & Algorithm Intro

13.1 Wordcloud & Encoding

  • A long string

from matplotlib import pyplot as plt
from wordcloud import WordCloud
 
string = 'Importance of relative word frequencies for font-size. 
With relative_scaling=0, only word-ranks are considered. With 
relative_scaling=1, a word that is twice as frequent will have 
twice the size. If you want to consider the word frequencies 
and not only their rank, relative_scaling around .5 often looks good.'

font = r'C:\Windows\Fonts\Arial.TTF'
wc = WordCloud(font_path=font, #如果是中文必须要添加这个,否则会显示成框框
               background_color='white',
               width=1000,
               height=800,
               ).generate(string)
wc.to_file('s1.png') #保存图片
plt.imshow(wc)  #用plt显示图片
plt.axis('off') #不显示坐标轴
plt.show() #显示图片
  • Loading text file

from matplotlib import pyplot as plt
from wordcloud import WordCloud
 
filename = 'Harry Potter.txt'
with open(filename , encoding="utf-8") as f_obj:
    contents = f_obj.read()

font = r'C:\Windows\Fonts\Arial.TTF'
wc = WordCloud(font_path=font, 
               background_color='white',
               width=1000,
               height=800,
               ).generate(contents)
wc.to_file('s2.png') 
plt.imshow(wc)  
plt.axis('off') 
plt.show() 
  • Loading Chinese text file

from matplotlib import pyplot as plt
from wordcloud import WordCloud

filename = 'sanguo.txt'
with open(filename , encoding="utf-8") as f_obj:
    contents = f_obj.read()


font = r'C:\Windows\Fonts\simkai.ttf'

wc = WordCloud(font_path=font, 
               background_color='white',
               width=1000,
               height=800,
               ).generate(contents)
wc.to_file('s3.png') 
plt.imshow(wc)  
plt.axis('off') 
plt.show() 
  • Using jieba

from matplotlib import pyplot as plt
from wordcloud import WordCloud
import jieba
filename = 'sanguo.txt'
with open(filename , encoding="utf-8") as f_obj:
    contents = f_obj.read()

s = jieba.lcut(contents) 
txt = " ".join(s)
print(txt)
font = r'C:\Windows\Fonts\simkai.ttf'
wc = WordCloud(font_path=font, 
               background_color='white',
               width=1000,
               height=800,
               ).generate(txt)
wc.to_file('s4.png') 
plt.imshow(wc)  
plt.axis('off') 
plt.show() 

Character encoding Encoding

  • ASCII
  • Unicode
  • UTF-8
  • Python Crash Couse (Chapters we do not cover: Chapter 12 - 14, 18 - 20)

    • Chapter 12 -14: Alien Invasion

    • Chapter 18 - 20: Django

  • Python for Everybody (Chapters we do not cover: Chapter 11 - 13, 15 - 16)

    • Chapter 11: Regualer Expressions

    • Chapter 12: Networked Programs 12.4 - 12.8 (urlib, BeautifulSoup)

    • Chapter 13: Using Web Services (XML, JSON, API)

    • Chapter 15: Databases and SQL

    • Chapter 16: Visualizing data (Network, Word Cloud)

13.2 Algorithm Intro

  • Complexity, Big-O (Searching)
  • Recursion, D&C
  • Sorting Algorithms
  • 0/1 Knapsack Problem
    • Greedy Algorithm
    • Dynamic Programming
  • Another stream of the basic knowledge about programming is algorithm and data structure.

  • An algorithm is a set of instructions for accomplishing a task.

  • Every piece of code could be called an algorithm.

  • To learn this, you only need to know some basic algebra and Discrete mathematics.

  • Let's start with a simple algorithm.

Binary Search
  • Binary search is an algorithm; its input is a sorted list of elements. If an element you’re looking for is in that list, binary search returns the position where it’s located. Otherwise, binary search returns null.

  • Here's an example of how binary search works. I'm thinking of a number between 1 and 100.

  • You have to try to guess my number in the fewest tries possible. With every guess, I'll tell you if your guess is too low, too high, or correct.

  • Suppose you start guessing like this: 1, 2, 3, 4 ...

  • If you perform a simple search, and my number was 99, it could take you 99 guesses to get there!

  • Here’s a better technique. Start with 50.

  • In general, for any list of $n$, binary search will take $log_2n$ steps to run in the worst case, whereas simple search will take $n$ steps.

  • Running time

  • Any time I talk about an algorithm, I'll discuss its running time. Generally you want to choose the most efficient algorithm— whether you're trying to optimize for time or space.

  • For simple search, the maximum number of guesses is the same as the size of the list. This is called linear time.

  • Binary search runs in logarithmic time.

Big O notation
  • Big O notation is special notation that tells you how fast an algorithm is.

  • You need to know how the running time increases as the list size increases. That’s where Big O notation comes in.

  • Big O notation tells you how fast an algorithm is. For example, suppose you have a list of size n. Simple search needs to check each element, so it will take n operations. The run time in Big O notation is O(n).

  • Big O notation lets you compare the number of operations. It tells you how fast the algorithm grows. Run times grow at very different speeds

  • Binary search needs log n operations to check a list of size n. It's $O(log n)$.

  • Sometimes the performance of an algorithm depends on the exact values of the data rather than simply the size of the problem. For these kinds of algorithms we need to characterize their performance in terms of best case, average case and worst-case performance.

  • $O(log n)$, $O(n)$, $O(n * log n)$, $O(n^2)$, $O(2^n)$, $O(n!)$ (P and NP)

13.3 Recursion

Base case and recursive case

def fact(n):
    if n==1:
        return 1 #Base case
    else:
        return n * fact(n - 1) #recursive case
  • The recursive case is when the function calls itself. The base case is when the function does not call itself again... so it doesn't go into an infinite loop.

The stack

def greet(name):
    print('hello '+ name + '!')
    greet2(name)
    print('getting ready to say bye...')
    bye()

def greet2(name):
    print('how are you, ' + name + '?')
def bye():
    print('Ok, bye!')
The call stack with recursion
Divide and Conquer (D&C, 分而治之)
  • D&C gives you a new way to think about solving problems. When you get a new problem, you don't have to be stumped. Instead, you can ask, "Can I solve this if I use divide and conquer?"

  • You want to divide this farm evenly into square plots. You want the plots to be as big as possible.

  • To solve a problem using D&C, there are two steps:

  • 1. Figure out the base case. This should be the simplest possible case.

  • 2. Divide or decrease your problem until it becomes the base case.

  • What is the largest square size you can use? You have to reduce your problem. Let’s start by marking out the biggest boxes you can use.

  • There’s a farm segment left to divide. Why don’t you apply the same algorithm to this segment?

  • If you find the biggest box that will work for this size, that will be the biggest box that will work for the entire farm. You just reduced the problem from a 1680 × 640 farm to a 640 × 400 farm!

  • To recap, here's how D&C works:

  • 1. Figure out a simple case as the base case.

  • 2. Figure out how to reduce your problem and get to the base case.

  • D&C isn't a simple algorithm that you can apply to a problem. Instead, it's a way to think about a problem.

Summary

  • Wordcloud
  • Algorithm Intro