HyperAI

Reader-LM: Convert HTML to MarkDown Quickly and Efficiently

1. Tutorial Introduction

该教程使用的基础算力为 RTX 4090 。

Reader-LM is a series of small language models developed by Jina AI in 2024, specifically for converting raw HTML content on the web into clear and tidy Markdown format. These models include Reader-LM-0.5B and Reader-LM-1.5B, which excel in processing long texts and multilingual content, supporting context lengths up to 256K bytes.

The Reader-LM models are designed to address the need for efficient and economical data extraction from noisy web content. They outperform several large language models such as GPT-4o and Gemini-1.5-Flash in HTML to Markdown conversion tasks, while being smaller and more suitable for running in resource-constrained environments.

The model is trained on a curated collection of HTML content and its corresponding Markdown content. This tutorial demonstrates how to convert HTML to markdown using reader-lm-1.5b or reader-lm-0.5b.

请注意!模型的输入(即提示)是原始 HTML—不需要前缀指令。

2. Operation steps

1. 启动容器后点击 API 地址即可进入 Web 界面 (需要完成实名认证,无需打开工作空间)
2. WebUI Demo 详细教程
* 模型输入:一定要注意模型的输入(即提示)是原始 HTML—不需要前缀指令。

* 模型选择:jina 提供了 2 个参数量不同的模型,分别为 reader-lm-1.5B 和 reader-lm-0.5B,可根据自己的需要进行选择。

* 这里我们选择一个示例点击提交即可看到模型输出结果,一定要注意模型的输入(即提示)是原始 HTML—不需要前缀指令。
* 生成结果
  • Reader LM Output: the result of using the model output;
  • Markdownify Output: markdownify is a Python library that can convert HTML content to Markdown format. This library is particularly useful when you need to display data originally in HTML format on a platform that supports Markdown.
    • Save the file as shown in the figure below: Two md files are generated each time, the file name is timestamp + generation method, and the save directory is: ./HTML-to-Markdown/output_md/「timestamp」_「generation method」.md