PySpark Estimator and Transformer

PySpark's pipeline is a powerful tool that encapsulates machine learning processes. We can build rather complicated pipelines using the estimators and transformers that come with PySpark's library, until we can't. In this article, I will show how to build custom estimators and transformers to make pipelines even more powerful.

Imagine that we want to build a model with some high-cardinality categorical features. Upon inspection, we find that only the most frequent values are useful, so we decide to keep those values and mask all others as "OTHERS". We will implement CardinalityReducer, which keeps only the N most frequent values in a categorical column (or a column of string type). We will implement it so that it can be fit on training sets together with other components in a pipeline.
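
Below is a minimal sketch of the idea, not the final implementation from the article: the class and parameter names (input_col, top_n) are placeholders, and the sketch omits the Params and persistence machinery a production pipeline stage would need.

from pyspark.ml import Estimator, Transformer
from pyspark.sql import functions as F

class CardinalityReducerModel(Transformer):
    """Masks values outside keep_values with the literal "OTHERS"."""
    def __init__(self, input_col, keep_values):
        super().__init__()
        self.input_col = input_col
        self.keep_values = keep_values

    def _transform(self, dataset):
        keep = F.col(self.input_col).isin(self.keep_values)
        masked = F.when(keep, F.col(self.input_col)).otherwise(F.lit("OTHERS"))
        return dataset.withColumn(self.input_col, masked)

class CardinalityReducer(Estimator):
    """Learns the top_n most frequent values of input_col during fit."""
    def __init__(self, input_col, top_n=10):
        super().__init__()
        self.input_col = input_col
        self.top_n = top_n

    def _fit(self, dataset):
        rows = (dataset.groupBy(self.input_col).count()
                .orderBy(F.desc("count"))
                .limit(self.top_n)
                .collect())
        return CardinalityReducerModel(self.input_col, [r[self.input_col] for r in rows])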

Read More

Fixing an issue in saving/loading BERT models

Recently I came across an issue when saving and loading BERT models with TensorFlow. The BERT models are provided by the Transformers library, and I used the TensorFlow backend. When saving with model.save(path) and then loading with tf.keras.models.load_model(path), I got the following TypeError or ValueError:

TypeError/ValueError: The two structures don't have the same nested structure.
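
For reference, a minimal sketch of the save/load pattern in question, assuming a standard checkpoint from the Transformers library (the checkpoint name and path are illustrative):

import tensorflow as tf
from transformers import TFBertModel

# Save a BERT model in the TensorFlow SavedModel format.
model = TFBertModel.from_pretrained("bert-base-uncased")
model.save("saved_bert")

# Loading it back is where the nested-structure error shows up.
loaded = tf.keras.models.load_model("saved_bert")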

This article documents several ways to solve the issue.

Read More

Simple SVD with Bias for Netflix Prize

In my linear algebra class this summer, I used the Netflix Prize challenge as a practical example of an application of singular value decomposition (SVD). To be more precise, I explained the term \(p_u^Tq_i\) in the simple SVD with bias model:

$$\hat{r}_{ui} = \mu + b_u + b_i + p_u^Tq_i.$$


The above model can be found in section 2.1 of this progress paper of the winning team. In this note, I will explain this model and give an implementation in Python. A C implementation of the model can be found in my GitHub repository here: https://github.com/wormtooth/netflix_svd.
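
To make the formula concrete, here is a hedged sketch of the predicted rating and one possible stochastic gradient descent update on a single observed rating (the learning rate and regularization constants are illustrative defaults, not the values used in the note):

import numpy as np

def predict(mu, b_u, b_i, p_u, q_i):
    # Predicted rating: mu + b_u + b_i + p_u^T q_i
    return mu + b_u + b_i + p_u @ q_i

def sgd_step(r_ui, mu, b_u, b_i, p_u, q_i, lr=0.005, reg=0.02):
    # Error on the observed rating r_ui, then regularized gradient updates.
    err = r_ui - predict(mu, b_u, b_i, p_u, q_i)
    b_u += lr * (err - reg * b_u)
    b_i += lr * (err - reg * b_i)
    p_u, q_i = p_u + lr * (err * q_i - reg * p_u), q_i + lr * (err * p_u - reg * q_i)
    return b_u, b_i, p_u, q_i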

Read More

Clustering Weibo Tags

I started a project last October to collect Weibo's top search data (微博热搜榜) hourly. Together with the keywords or tags (关键词), the most recent related weibos (or tweets) are collected as well. The result is saved to a JSON file, with the format explained on this page.

In this post, I would like to explore this data set and try to cluster tags. To be more precise, multiple tags can be used to refer to the same event, and these different tags are related or even share the same meaning. The task is to group similar tags together based on the data collected.

Read More

Principal Component Analysis

This is a note I used as an example of applications in the Linear Algebra course I lectured at Purdue University. It is slightly modified so that it is more or less self-contained.

Principal Component Analysis (PCA) is a linear algebra technique for data analysis and an application of eigenvalues and eigenvectors. PCA can be used for

  1. exploratory data analysis (visualizing the data)
  2. feature reduction

We will learn the basic idea of PCA and see its applications in handwritten-digit recognition, eigenfaces, etc.
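
As a preview, here is a minimal sketch of PCA via the eigen-decomposition of the covariance matrix (the random data and the number of components are placeholders):

import numpy as np

def pca(X, k):
    # Center the data, then project onto the top-k eigenvectors of the covariance matrix.
    X_centered = X - X.mean(axis=0)
    cov = np.cov(X_centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    top = eigvecs[:, np.argsort(eigvals)[::-1][:k]]
    return X_centered @ top

X = np.random.rand(100, 5)
print(pca(X, 2).shape)  # (100, 2)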

Read More

Linear Regression

This is a note I used as an example of applications in the Linear Algebra course I lectured at Purdue University. It is slightly modified so that it is more or less self-contained.

Starting from the least-squares solution, we give an introductory exploration of (linear) regression in this note.

import numpy as np
import sklearn.linear_model
import matplotlib.pyplot as plt
from IPython.display import set_matplotlib_formats

plt.rcParams["figure.figsize"] = (8, 6)
set_matplotlib_formats('png', 'pdf')

Least-squares solution

Let \(A\) be an \(m \times n\) matrix, and \(B\) be a vector in \(\mathbb{R}^m\). A least-squares solution to a linear system \(Ax = B\) is an \(\hat{x}\) such that \(|A \hat{x} - B| \le |A x - B|\) for all \(x\). Here, \(|x|\) is the length of the vector \(x\). If the system \(Ax = B\) is consistent, then a least-squares solution is just a solution.
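
For example, a least-squares solution can be computed with NumPy, either directly or through the normal equations \(A^TA\hat{x} = A^TB\) (the small inconsistent system below is a placeholder):

import numpy as np

A = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
B = np.array([1.0, 2.0, 2.0])

# Least-squares solution via NumPy.
x_hat, residuals, rank, _ = np.linalg.lstsq(A, B, rcond=None)

# The same solution from the normal equations A^T A x = A^T B.
x_normal = np.linalg.solve(A.T @ A, A.T @ B)
print(x_hat, x_normal)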

Read More

LeetCode Contest 209

This time I only solved the first three problems, and the third one took too long, so my rank was only 505.

Problem 1: Special Array With X Elements Greater Than or Equal X

Given an array, find x such that there are exactly x numbers greater than or equal to x.

Brute force: enumerate all possible values of x.

from typing import List

class Solution:
    def specialArray(self, nums: List[int]) -> int:
        for n in range(len(nums) + 1):
            if sum(1 for v in nums if v >= n) == n:
                return n
        return -1

Read More

LeetCode Contest 208

I got one WA on the second problem; everything else was AC on the first try. What slowed me down this time was, surprisingly, not typing speed but reading speed... Ranked around 150.

Problem 1: Crawler Log Folder

Given a sequence of cd-like operations, find the depth of the final folder path.

A simulation problem. Since we only need the depth, we do not care about folder names; we only track each operation's effect on the depth: '../' -> -1, './' -> 0, anything else -> +1.

from typing import List

class Solution:
    def minOperations(self, logs: List[str]) -> int:
        depth = 0
        for path in logs:
            if path == '../':
                depth = max(0, depth - 1)
            elif path == './':
                pass
            else:
                depth += 1
        return depth

Read More

LeetCode Contest 204

I occasionally show up for LeetCode contests. This time I solved all four problems, but I had one wrong submission on each of problems 1, 3, and 4. Ranked around 250.

Problem 1: Detect Pattern of Length M Repeated K or More Times

Determine whether the array contains a pattern of length m that is repeated at least k consecutive times.

Since the input size is small, a direct brute-force search suffices. My first WA came from writing n - m * k instead of n - m * k + 1.

from typing import List

class Solution:
    def containsPattern(self, arr: List[int], m: int, k: int) -> bool:
        n = len(arr)
        for i in range(n - m * k + 1):
            ok = True
            for s in range(m):
                for j in range(k):
                    if arr[i + s] != arr[i + s + j * m]:
                        ok = False
                        break
                if not ok:
                    break
            if ok:
                return True
        return False

Read More

SIR to fit COVID-19 data in the U.S.

Life is quite disrupted during this hard time of the pandemic. It is especially stressful to see new cases rising again in the U.S. When will the pandemic come to its peak? I believe many have asked this question and many have given it serious thought. In this blog post, I want to give this question an answer using the SIR model.
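
For reference, here is a hedged sketch of the standard SIR equations integrated with SciPy; the population size, \(\beta\), and \(\gamma\) below are made-up illustrative values, not the ones fitted in the post:

import numpy as np
from scipy.integrate import odeint

def sir(y, t, beta, gamma, N):
    # dS/dt = -beta*S*I/N, dI/dt = beta*S*I/N - gamma*I, dR/dt = gamma*I
    S, I, R = y
    dS = -beta * S * I / N
    dI = beta * S * I / N - gamma * I
    return [dS, dI, gamma * I]

# Illustrative parameters only, not fitted values.
N, beta, gamma = 1_000_000, 0.3, 0.1
t = np.linspace(0, 200, 201)
S, I, R = odeint(sir, [N - 1, 1, 0], t, args=(beta, gamma, N)).T
print("Peak infections around day", t[np.argmax(I)])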

Read More