PySpark Estimator and Transformer
PySpark's Pipeline is a powerful tool for encapsulating machine learning workflows. We can build fairly complicated pipelines to suit our needs using the estimators and transformers that ship with PySpark's library, until we can't. In this article, I will show how to build custom estimators and transformers to make pipelines even more powerful.
Imagine that we want to build a model with some high-cardinality categorical features. Upon inspection, we find that only the most frequent values are useful, so we decide to keep those frequent values and mask all other values as "OTHERS". We will implement a CardinalityReducer that keeps only the N most frequent values in a categorical column (or any column of string type). We will implement it so that it can be fit on a training set together with the other components of a pipeline.
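As a preview, here is a minimal sketch of the shape such an estimator can take: a `CardinalityReducer` that learns the frequent values at fit time and returns a model that does the masking at transform time. The companion class name `CardinalityReducerModel`, the param names `n` and `keepValues`, and the column names in the usage snippet are illustrative choices for this sketch; we will flesh out the details below.

```python
from pyspark import keyword_only
from pyspark.ml import Estimator, Model
from pyspark.ml.param import Param, Params, TypeConverters
from pyspark.ml.param.shared import HasInputCol, HasOutputCol
from pyspark.sql import functions as F


class CardinalityReducer(Estimator, HasInputCol, HasOutputCol):
    """Learns the N most frequent values of a string column."""

    # Illustrative param name; holds how many values to keep.
    n = Param(Params._dummy(), "n", "number of most frequent values to keep",
              typeConverter=TypeConverters.toInt)

    @keyword_only
    def __init__(self, inputCol=None, outputCol=None, n=10):
        super().__init__()
        self._setDefault(n=10)
        # keyword_only stores the caller's kwargs in self._input_kwargs
        self._set(**self._input_kwargs)

    def _fit(self, dataset):
        # Count value frequencies and collect the top N values.
        top = (dataset.groupBy(self.getInputCol())
               .count()
               .orderBy(F.desc("count"))
               .limit(self.getOrDefault(self.n))
               .collect())
        keep = [row[self.getInputCol()] for row in top]
        return CardinalityReducerModel(inputCol=self.getInputCol(),
                                       outputCol=self.getOutputCol(),
                                       keepValues=keep)


class CardinalityReducerModel(Model, HasInputCol, HasOutputCol):
    """Masks values outside the learned frequent set as "OTHERS"."""

    keepValues = Param(Params._dummy(), "keepValues",
                       "values kept as-is; everything else is masked",
                       typeConverter=TypeConverters.toListString)

    @keyword_only
    def __init__(self, inputCol=None, outputCol=None, keepValues=None):
        super().__init__()
        self._set(**self._input_kwargs)

    def _transform(self, dataset):
        keep = self.getOrDefault(self.keepValues)
        in_col = F.col(self.getInputCol())
        # Keep frequent values, replace everything else with "OTHERS".
        return dataset.withColumn(
            self.getOutputCol(),
            F.when(in_col.isin(keep), in_col).otherwise(F.lit("OTHERS")))
```

Because `_fit` returns a proper `Model`, the reducer can be dropped into a pipeline like any built-in stage, e.g. `Pipeline(stages=[CardinalityReducer(inputCol="city", outputCol="city_top", n=20), ...])`, and the frequent values are learned from the training set when the pipeline is fit.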