Since October 2019, I have a Python script scraping Weibo's top searches (微博热搜榜) hourly. Each hour's results are stored in a JSON file with its name in the pattern YYYYMMDDHH.json. YYYY is for year, MM is zero padded two-digits number for month, DD is zero padded two digits number for day, and HH is zero padded two digits number for hour. Note that the time is in UTC+8 timezone. For example, 2020100316.json stores top searches results of Oct. 3, 2020 at 4pm (in UTC+8). The first avalibale JSON file is 2019101510.json.

Within each JSON file, it has the following layout:

[
    {
        "hotness": ###,
        "tag": "xxxx",
        "weibo": [
            {
                "content": "xxxx",
                "nickname": "xxxx"
            },
            ...
        ]
    },
    ...
]
  • hotness (热度): int. An integer indicating how popular the topic is.
  • tag: str. The name of the topic.
  • weibo: list. A list of most recent tweets in this topic.
    • content: str. The conent of the tweets.
    • nickname: str. The nickname of the owner of the tweets.

A sample is provided: 2019101510.json