Magika: Using AI to Easily Classify All Kinds of Files

By 大貓咪, filed under Python

2024-02-28

I recently came across something interesting on GitHub.

It uses an AI model to read a file's contents and identify the file's type.

Its accuracy is above 99%.

Magika

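Magika is an AI model for detecting file types. The model is only about 1 MB in size and can produce an accurate result within a few milliseconds on a single CPU.

Magika can help with email attachments, with sorting file archives, and with file-upload features on websites.

Magika detects the following file types: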

supported_content_types_list.md

Demo

Magika offers an online demo where you can try it out yourself.

Demo Link

Install

There are three ways to install it:

Install with pip (requires Python)
Run with Docker (requires Docker)
Install with npm

Pip

shell

pip install magika

Docker

shell

git clone https://github.com/google/magika
cd magika/
docker build -t magika .

Windows

If you are on Windows, run chdir first to print the current directory path, then substitute that output wherever a docker command would use $(pwd).

NPM

shell

npm install magika

Usage

Magika can be used in several ways:

Command Line Interface (CLI)
Through the Python API
In the browser

CLI

Magika provides many command-line options:

shell

# magika -h
Usage: magika [OPTIONS] [FILE]...

  Magika - Determine type of FILEs with deep-learning.

Options:
  -r, --recursive                 When passing this option, magika scans every
                                  file within directories, instead of
                                  outputting "directory"
  --json                          Output in JSON format.
  --jsonl                         Output in JSONL format.
  -i, --mime-type                 Output the MIME type instead of a verbose
                                  content type description.
  -l, --label                     Output a simple label instead of a verbose
                                  content type description. Use --list-output-
                                  content-types for the list of supported
                                  output.
  -c, --compatibility-mode        Compatibility mode: output is as close as
                                  possible to `file` and colors are disabled.
  -s, --output-score              Output the prediction's score in addition to
                                  the content type.
  -m, --prediction-mode [best-guess|medium-confidence|high-confidence]
  --batch-size INTEGER            How many files to process in one batch.
  --no-dereference                This option causes symlinks not to be
                                  followed. By default, symlinks are
                                  dereferenced.
  --colors / --no-colors          Enable/disable use of colors.
  -v, --verbose                   Enable more verbose output.
  -vv, --debug                    Enable debug logging.
  --generate-report               Generate report useful when reporting
                                  feedback.
  --version                       Print the version and exit.
  --list-output-content-types     Show a list of supported content types.
  --model-dir DIRECTORY           Use a custom model.
  -h, --help                      Show this message and exit.

  Magika version: "0.5.0"

  Default model: "standard_v1"

  Send any feedback to [email protected] or via GitHub issues.

To check a single file, use the following command:

shell

magika <file path>
# or docker
docker run -it --rm -v <folder>:/magika magika /magika/<file name>

# example
magika test.cpp
# -- output --
test.cpp: C source (code)

To get the output in JSON format, use:

shell

magika --json <file path>
# or docker
docker run -it --rm -v <folder>:/magika magika --json /magika/<file name>

# example
magika --json test.cpp
# -- output --
[
    {
        "path": "test.cpp",
        "dl": {
            "ct_label": "c",
            "score": 0.999630331993103,
            "group": "code",
            "mime_type": "text/x-c",
            "magic": "C source",
            "description": "C source"
        },
        "output": {
            "ct_label": "c",
            "score": 0.999630331993103,
            "group": "code",
            "mime_type": "text/x-c",
            "magic": "C source",
            "description": "C source"
        }
    }
]

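Here is a brief explanation of each field:

path: the path of the file.
dl: the raw prediction from the deep learning model, including its score.
output: Magika's final verdict after weighing several signals, including the model's score and confidence.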
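As a rough illustration of how these fields might be consumed, the sketch below parses the JSON shown above with Python's standard `json` module. The `summarize` helper and the `overridden` flag are hypothetical names introduced here for illustration; they are not part of Magika.

```python
import json

# Sample copied from the `magika --json test.cpp` output shown above.
SAMPLE = """
[
    {
        "path": "test.cpp",
        "dl": {
            "ct_label": "c",
            "score": 0.999630331993103,
            "group": "code",
            "mime_type": "text/x-c",
            "magic": "C source",
            "description": "C source"
        },
        "output": {
            "ct_label": "c",
            "score": 0.999630331993103,
            "group": "code",
            "mime_type": "text/x-c",
            "magic": "C source",
            "description": "C source"
        }
    }
]
"""


def summarize(raw_json):
    """Return one small summary dict per file entry (hypothetical helper)."""
    summaries = []
    for entry in json.loads(raw_json):
        dl, out = entry["dl"], entry["output"]
        summaries.append({
            "path": entry["path"],
            "label": out["ct_label"],
            "mime_type": out["mime_type"],
            "score": out["score"],
            # True when the final answer differs from the raw model prediction.
            "overridden": dl["ct_label"] != out["ct_label"],
        })
    return summaries


summary = summarize(SAMPLE)
print(summary)
```

In this sample the `dl` and `output` blocks are identical, so `overridden` is false; the flag only shows how the two blocks could be compared.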

To display the score, use the following command:

shell

magika -s <file path>
# or docker
docker run -it --rm -v <folder>:/magika magika -s /magika/<file name>

# example
magika -s test.cpp
# -- output --
test.cpp: C source (code) 99%

API

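The Python API lets you build this capability into your own programs and customize its behavior further.

Three functions are provided:

identify_bytes(b"test"): pass the content in as bytes
identify_path(Path("test.txt")): pass in the path of a file
identify_paths([Path("test.txt"), Path("test2.txt")]): pass in the paths of multiple files

To identify the type of a large file, prefer the path-based functions over the bytes-based one; they run more efficiently.

Here is an example: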

Python

from magika import Magika

m = Magika()
res = m.identify_bytes(b"#include <iostream>\nusing namespace std;\nint main(){ return 0; }\n")
print(res.output)

text

> python3 test.py
MagikaOutputFields(ct_label='c', score=0.9599996209144592, group='code', mime_type='text/x-c', magic='C source', description='C source')
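The path-based functions can be used in much the same way. The sketch below is a minimal illustration, assuming `identify_paths` returns one result per input path in the same order, and that each result exposes `output.ct_label` as in the printed `MagikaOutputFields` above. `label_paths` and `FakeDetector` are hypothetical names; the fake detector and its labels are made up so that the sketch runs without the model installed.

```python
from pathlib import Path
from types import SimpleNamespace  # only needed by the stand-in below


def label_paths(detector, paths):
    """Map each path to the content-type label in its result's `output` field."""
    path_list = [Path(p) for p in paths]
    # Assumption: results come back in the same order as the input paths.
    results = detector.identify_paths(path_list)
    return {str(p): r.output.ct_label for p, r in zip(path_list, results)}


class FakeDetector:
    """Hypothetical stand-in for Magika; its labels are not real model output."""

    def identify_paths(self, paths):
        return [
            SimpleNamespace(
                output=SimpleNamespace(
                    ct_label="c" if p.suffix == ".cpp" else "unknown"
                )
            )
            for p in paths
        ]


labels = label_paths(FakeDetector(), ["test.cpp", "notes.bin"])
print(labels)
```

With the library installed, passing `Magika()` in place of `FakeDetector()` should exercise the real model, provided the files exist on disk.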

Browser

The JS package provides two functions:

identifyBytes: returns the most likely result for the given byte stream.
identifyBytesFull: returns all candidate results for the given byte stream.

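JavaScript

// Assumes the npm package exposes Magika as a named export.
import { Magika } from "magika";

// readFile is a placeholder for however you obtain the file's bytes.
const data = await readFile('some file');
const magika = new Magika();
// Load the model before making the first prediction.
await magika.load();
const prediction = await magika.identifyBytes(data);
console.log(prediction);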

Reference

https://github.com/google/magika
