# Clustering Weibo Tags

I started a project last October to collect Weibo's top search data (微博热搜榜) hourly. Along with the keywords or tags (关键词), the most recent related weibos (or tweets) are collected as well. The results are saved to a JSON file, whose format is explained on this page.

In this post, I would like to explore this data set and try to cluster the tags. To be more precise, multiple tags can refer to the same event; these tags are related and may even share the same meaning. The task is to group similar tags together based on the collected data.

# Principal Component Analysis

This is a note I used as an example application in a linear algebra course I taught at Purdue University. It is slightly modified so that it is more or less self-contained.

Principal Component Analysis (PCA) is a linear algebra technique for data analysis and an application of eigenvalues and eigenvectors. PCA can be used for

1. exploratory data analysis (visualizing the data)
2. feature reduction

We will learn the basic idea of PCA and see its applications in handwritten-digit recognition, eigenfaces, and more.
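
To make the link to eigenvalues and eigenvectors concrete, here is a minimal sketch of PCA through the eigendecomposition of the sample covariance matrix. The helper name `pca` and the synthetic data are assumptions made for illustration only; they are not taken from the original note.

```python
import numpy as np


def pca(X, n_components):
    """Project the rows of X onto the top principal components (illustrative helper)."""
    # Center each feature so that the covariance matrix describes spread around the mean.
    X_centered = X - X.mean(axis=0)
    # Sample covariance matrix, shape (n_features, n_features).
    cov = np.cov(X_centered, rowvar=False)
    # eigh handles symmetric matrices and returns eigenvalues in ascending order.
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Keep the eigenvectors that belong to the largest eigenvalues.
    order = np.argsort(eigvals)[::-1][:n_components]
    components = eigvecs[:, order]
    return X_centered @ components, eigvals[order]


# Synthetic 2-D data stretched along the direction (1, 1); purely illustrative.
rng = np.random.default_rng(0)
t = rng.normal(size=200)
X = np.column_stack([t, t + 0.1 * rng.normal(size=200)])

projected, variances = pca(X, n_components=1)
```

Each eigenvalue equals the sample variance of the data along its eigenvector, which is why keeping the largest eigenvalues keeps the directions that carry the most variation.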

# Linear Regression

This is a note I used as an example application in a linear algebra course I taught at Purdue University. It is slightly modified so that it is more or less self-contained.

Starting from the least-squares solution, this note gives an introductory exploration of (linear) regression.

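import numpy as np
import sklearn.linear_model
import matplotlib.pyplot as plt
# Newer IPython releases deprecate this import in favor of
# matplotlib_inline.backend_inline.set_matplotlib_formats.
from IPython.display import set_matplotlib_formats

# Default figure size (in inches) and the inline output formats for the notebook.
plt.rcParams["figure.figsize"] = (8, 6)
set_matplotlib_formats('png', 'pdf')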


## Least-squares solution

Let $A$ be an $m \times n$ matrix and $B$ a vector in $\mathbb{R}^m$. A least-squares solution to the linear system $Ax = B$ is a vector $\hat{x} \in \mathbb{R}^n$ such that $|A \hat{x} - B| \le |A x - B|$ for all $x \in \mathbb{R}^n$. Here, $|x|$ denotes the length of the vector $x$. If the system $Ax = B$ is consistent, then a least-squares solution is just a solution.
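
As a small illustration of this definition, the sketch below computes a least-squares solution for an inconsistent system with `np.linalg.lstsq` and cross-checks it against the normal equations $A^T A \hat{x} = A^T B$, a standard characterization that is not stated in the excerpt above. The matrix `A` and vector `B` are made-up values chosen only for the example.

```python
import numpy as np

# Hypothetical overdetermined system: 4 equations, 2 unknowns, no exact solution.
A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
B = np.array([1.0, 2.0, 2.0, 4.0])

# lstsq returns the minimizer of |Ax - B| along with some diagnostics.
x_hat, residuals, rank, singular_values = np.linalg.lstsq(A, B, rcond=None)

# Cross-check: A has full column rank here, so the normal equations
# have a unique solution, which should agree with x_hat.
x_normal = np.linalg.solve(A.T @ A, A.T @ B)

# Any other x should give a residual that is at least as long.
x_other = x_hat + np.array([0.1, -0.05])
best_length = np.linalg.norm(A @ x_hat - B)
other_length = np.linalg.norm(A @ x_other - B)
```

Here `best_length` is no larger than `other_length`, which is exactly the inequality in the definition evaluated at one competing vector.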

# LeetCode Contest 209

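from typing import List

class Solution:
    def specialArray(self, nums: List[int]) -> int:
        # The answer can be at most len(nums), so try every candidate.
        for n in range(len(nums) + 1):
            # n is valid when exactly n values are >= n.
            if sum(1 for v in nums if v >= n) == n:
                return n
        return -1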


# LeetCode Contest 208

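from typing import List

class Solution:
    def minOperations(self, logs: List[str]) -> int:
        # Track how deep we are below the main folder.
        depth = 0
        for path in logs:
            if path == '../':
                # Moving up from the main folder has no effect.
                depth = max(0, depth - 1)
            elif path == './':
                pass
            else:
                depth += 1
        return depth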


# LeetCode Contest 204

from typing import List

class Solution:
    def containsPattern(self, arr: List[int], m: int, k: int) -> bool:
        n = len(arr)
        for i in range(n - m * k + 1):
            # The flag has its own name so the loop index cannot overwrite it.
            ok = True
            for s in range(m):
                for j in range(k):
                    if arr[i + s] != arr[i + s + j * m]:
                        ok = False
                        break
                if not ok:
                    break
            if ok:
                return True
        return False


# SIR to fit COVID-19 data in the U.S.

Life has been quite disrupted during this hard time of pandemic. It is especially stressful to see new cases rising again within the U.S. When will the pandemic reach its peak? I believe many have asked this question and given it serious thought. In this blog post, I want to answer it using the SIR model.
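
For readers unfamiliar with the model, here is a minimal sketch of the standard SIR equations, integrated with a simple forward Euler scheme, showing how a peak in infections can be read off a simulated curve. The function name `simulate_sir` and every parameter value below are hypothetical; nothing here is fitted to the U.S. data.

```python
import numpy as np


def simulate_sir(beta, gamma, s0, i0, r0, days, dt=0.1):
    """Integrate the SIR equations with forward Euler (illustrative helper)."""
    n = s0 + i0 + r0
    steps = round(days / dt)
    s = np.empty(steps + 1)
    i = np.empty(steps + 1)
    r = np.empty(steps + 1)
    s[0], i[0], r[0] = s0, i0, r0
    for k in range(steps):
        # dS/dt = -beta*S*I/N, dI/dt = beta*S*I/N - gamma*I, dR/dt = gamma*I
        new_infections = beta * s[k] * i[k] / n * dt
        new_recoveries = gamma * i[k] * dt
        s[k + 1] = s[k] - new_infections
        i[k + 1] = i[k] + new_infections - new_recoveries
        r[k + 1] = r[k] + new_recoveries
    t = np.arange(steps + 1) * dt
    return t, s, i, r


# Hypothetical parameters: a population of 1000 with one initial infection.
t, s, i, r = simulate_sir(beta=0.3, gamma=0.1, s0=999.0, i0=1.0, r0=0.0, days=200)

# The peak of the epidemic is the time at which the infected count is largest.
peak_day = t[np.argmax(i)]
```

In a real fit, `beta` and `gamma` would be estimated from reported case counts rather than chosen by hand.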

# LeetCode Contest 188

from typing import List

class Solution:
    def buildArray(self, target: List[int], n: int) -> List[str]:
        ret = []
        cur = 0
        for t in target:
            # Push and immediately pop every skipped number between cur and t.
            ret.extend(["Push", "Pop"] * (t - cur - 1))
            ret.append("Push")
            cur = t
        return ret