
VICTOR: a Dataset for Brazilian Legal Documents Classification

Teófilo Emídio de Campos, Pedro Henrique Luz de Araujo, Nilton Correia da Silva, Fabricio Ataides Braz

Abstract

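This paper describes VICTOR, a new dataset built from digitized legal documents of Brazil's Supreme Court. The dataset comprises more than 45,000 appeals, which contain roughly 692,000 documents (about 4.6 million pages in total). It includes labeled text data and supports two types of tasks: document type classification and theme assignment, the latter being a multi-label problem. We present baseline results using bag-of-words models, convolutional neural networks, recurrent neural networks, and boosting algorithms. We also experiment with linear-chain Conditional Random Fields to exploit the sequential nature of lawsuits, and find that they improve document type classification. Finally, we compare a theme classification approach that uses domain knowledge to filter out less informative document pages beforehand against the default approach that uses all pages. Contrary to the expectations of the Court's experts, using all available data performs better. To encourage the exploration of better models and techniques, we release the dataset in three versions that differ in size and content.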
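To illustrate why a linear-chain model can help with page sequences, the sketch below runs Viterbi decoding over per-page scores. It is not the authors' implementation: the label names (`petition`, `other`), the score values, and the function name `viterbi` are all hypothetical, and the transition scores are set by hand rather than learned as they would be in a trained CRF.

```python
from typing import Dict, List


def viterbi(emissions: List[Dict[str, float]],
            transitions: Dict[str, Dict[str, float]]) -> List[str]:
    """Return the highest-scoring label sequence for consecutive pages.

    emissions[t][label]  : score of `label` for page t (e.g. a log-probability
                           from a per-page classifier).
    transitions[a][b]    : score of label `a` being followed by label `b`.
    """
    labels = list(emissions[0])
    # best[t][label] is the best score of any path that ends in `label` at page t.
    best = [dict(emissions[0])]
    back: List[Dict[str, str]] = []
    for t in range(1, len(emissions)):
        scores, pointers = {}, {}
        for cur in labels:
            prev = max(labels, key=lambda p: best[-1][p] + transitions[p][cur])
            scores[cur] = best[-1][prev] + transitions[prev][cur] + emissions[t][cur]
            pointers[cur] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final label.
    path = [max(labels, key=lambda l: best[-1][l])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


# Hypothetical scores: page 2 looks slightly more like "other" in isolation.
emissions = [
    {"petition": -0.1, "other": -2.3},
    {"petition": -0.9, "other": -0.6},
    {"petition": -0.2, "other": -1.8},
]
# Hand-set assumption: neighbouring pages usually share a label.
transitions = {
    "petition": {"petition": -0.1, "other": -2.0},
    "other": {"petition": -2.0, "other": -0.1},
}

independent = [max(e, key=e.get) for e in emissions]
decoded = viterbi(emissions, transitions)
```

With these made-up numbers, labeling each page independently flips page 2 to `other`, while decoding the whole sequence keeps all three pages as `petition`, because the penalty for switching labels twice outweighs the small per-page preference.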

Benchmarks

Benchmark	Method	Average F1	Weighted F1
multi-label-text-classification-on-bvictor	XGBoost	0.8843	0.8957
multi-label-text-classification-on-bvictor	SVM	0.7761	0.8235
multi-label-text-classification-on-bvictor	NB	0.6335	0.6955
multi-label-text-classification-on-mvictor	SVM	0.6642	0.8137
multi-label-text-classification-on-mvictor	NB	0.3797	0.6062
multi-label-text-classification-on-mvictor	XGBoost	0.8882	0.9072
multi-label-text-classification-on-svictor	SVM	0.8246	0.8231
multi-label-text-classification-on-svictor	NB	0.5121	0.4875
multi-label-text-classification-on-svictor	XGBoost	0.8887	0.8634
text-classification-on-mvictor-type	BiLSTM	0.7092	0.9433
text-classification-on-mvictor-type	CNN	0.7061	0.9464
text-classification-on-mvictor-type	SVM	0.6792	0.9288
text-classification-on-mvictor-type	CNN + CRF	0.7505	0.9537
text-classification-on-mvictor-type	NB	0.4772	0.8477
text-classification-on-svictor-type	SVM	0.7632	0.9425
text-classification-on-svictor-type	BiLSTM	0.7281	0.9465
text-classification-on-svictor-type	NB	0.5979	0.8893
text-classification-on-svictor-type	CNN + CRF	0.7740	0.9533
text-classification-on-svictor-type	CNN	0.7584	0.9472
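
Several rows show a large gap between the two metrics (for example, NB on mvictor-type). The sketch below shows how such a gap can arise, assuming that "Average F1" is the unweighted mean of per-class F1 scores and "Weighted F1" weights each class by its number of true examples; the page does not define the metrics, so this reading is an assumption. The labels and predictions are made up, and the function names are hypothetical.

```python
from collections import Counter
from typing import List, Tuple


def per_class_f1(y_true: List[str], y_pred: List[str], label: str) -> float:
    """F1 for one class, treating `label` as positive and everything else as negative."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def average_and_weighted_f1(y_true: List[str], y_pred: List[str]) -> Tuple[float, float]:
    """Return (unweighted mean F1, support-weighted mean F1) over the true classes."""
    support = Counter(y_true)
    labels = sorted(support)
    f1 = {l: per_class_f1(y_true, y_pred, l) for l in labels}
    average = sum(f1.values()) / len(labels)
    weighted = sum(f1[l] * support[l] for l in labels) / len(y_true)
    return average, weighted


# Made-up imbalanced data: the classifier never predicts the rare class.
y_true = ["common"] * 9 + ["rare"]
y_pred = ["common"] * 10

average_f1, weighted_f1 = average_and_weighted_f1(y_true, y_pred)
```

Here the frequent class scores 18/19 and the rare class scores 0, so the unweighted average is 9/19 (about 0.47) while the weighted value is about 0.85. A model that neglects rare classes can therefore look strong on Weighted F1 and weak on Average F1.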
