3年前

Dame Jovanoski Veno Pachovski Preslav Nakov

ツイッター感情分析

RTX 5090のコンピュートリソースがわずか20時間分 $1 (価値 $7)

概要

本研究では、マケドニア語のTwitterにおける感情分析に関する取り組みを紹介する。この言語とジャンルを組み合わせた研究は先駆的なものであるため、マケドニア語のツイートに対する感情分析システムの訓練および評価に適したリソースを作成した。具体的には、ツイートレベルの感情極性（肯定的、否定的、中立的）およびフレーズレベルの感情で注釈を付したツイートコーパスを開発し、研究目的のために自由に公開した。さらに、英語に関する先行研究に触発され、マケドニア語のための大規模な感情辞書群をブートストラップにより構築した。形態論的に豊かなマケドニア語を対象としたTwitter感情分析システムの構築に向けた最初の試みとして、複数の前処理ステップおよび各種特徴量の影響を実験で示した。全体的に、実験結果はF1スコア92.16を示し、これは非常に優れた結果であり、最近のSemEval競技会で達成された英語の最良の結果と同等である。

One-sentence Summary

This work presents a pioneering system for Twitter sentiment analysis in Macedonian that leverages a freely available corpus annotated with tweet- and phrase-level sentiment polarity alongside bootstrapped large-scale lexicons, achieving an F1-score of 92.16 that matches top SemEval English benchmarks after evaluating various pre-processing steps and features.

Key Contributions

A freely available annotated corpus of Macedonian tweets is introduced, featuring both tweet-level and phrase-level sentiment polarity labels to support training and evaluation for this morphologically rich language.
Several large-scale sentiment lexicons are bootstrapped for Macedonian, and the impact of various preprocessing steps and feature sets is systematically evaluated to construct a Twitter sentiment analysis system.
Experimental evaluation demonstrates that the proposed approach achieves an F1-score of 92.16, matching the top performance levels recorded in recent SemEval competitions for English.

Introduction

The rapid expansion of social media has positioned sentiment analysis as a vital tool for tracking public opinion, informing marketing strategies, and monitoring societal trends. While English Twitter sentiment analysis has matured through standardized benchmarks and large-scale automated lexicons, Macedonian natural language processing remains significantly underdeveloped. Previous Macedonian studies relied on small, domain-specific corpora with non-public annotations and manually curated lexicons that could not scale or adequately address the morphological complexity of social media text. To address these gaps, the authors develop the first end-to-end sentiment analysis framework for Macedonian tweets. They release a native-annotated corpus covering positive, negative, and neutral polarities, bootstrap large-scale sentiment lexicons specifically for the language, and demonstrate a classification model that achieves performance on par with top-tier English systems.

Dataset

Dataset composition and sources: The authors collected approximately 500,000 tweets via the Twitter API between November 2014 and April 2015. A high-precision Naïve Bayes classifier was applied to filter out misclassified Bulgarian and Russian posts, isolating a Macedonian-only corpus. The dataset includes both tweet-level polarity labels and phrase-level sentiment annotations.
Subset details and filtering: The training set was annotated by a single native speaker for both tweet and phrase sentiment. The testing set was annotated by two native speakers, and the authors discarded 474 tweets where annotators disagreed, resulting in moderate inter-annotator agreement (Cohen's Kappa = 0.41). The final split maintains a roughly balanced ratio of positive and negative tweets, with a smaller neutral category.
Data usage and modeling: The authors use the corpus to train a logistic regression sentiment classifier weighted by TF-IDF features. The phrase-level annotations and translated lexicons are integrated into the training pipeline to bootstrap Macedonian sentiment resources and improve feature representation.
Processing and feature engineering: The preprocessing pipeline converts text to lowercase, strips URLs and @-mentions, removes 146 stopwords, and eliminates repeated characters. Negation is handled by prefixing tokens with NEG_CONTEXT_. Non-standard dialects and slang are normalized using a 173-entry mapping dictionary. The pipeline further applies rule-based and perceptron-trained POS tagging, fuzzy lemmatization via Jaro-Winkler and Levenshtein distances, and a custom 65-rule stemmer designed for Macedonian's highly inflected morphology. Final features are computed using TF-IDF weighting. No cropping or explicit metadata construction was applied beyond these linguistic transformations.

Method

The authors leverage an automated approach to construct sentiment lexicons using Pointwise Mutual Information (PMI) to determine the semantic orientation of words and phrases within textual data. This method assigns a sentiment score to each word based on its association with predefined positive and negative seed terms. The semantic orientation $SO(w)$ of a word $w$ is computed as the difference between its PMI with positive seed terms and its PMI with negative seed terms:

SO(w) = PMI(w, pos) - PMI(w, neg)

Here, $PMI(w, pos)$ quantifies how strongly $w$ co-occurs with any seed positive term, and $PMI(w, neg)$ measures its association with seed negative terms. The PMI formula is defined as:

PMI(w, pos) = \frac{P(w, pos)}{P(w)P(pos)}

where $P(w, pos)$ is the joint probability of observing $w$ alongside any seed positive term in a tweet, $P(w)$ is the marginal probability of $w$ appearing in any tweet, and $P(pos)$ is the probability of encountering any seed positive term in a tweet. A similar definition applies to $PMI(w, neg)$ . A positive $SO(w)$ value indicates a positive sentiment polarity, while a negative value reflects negative polarity, with the magnitude indicating the strength of sentiment.

This PMI-based framework forms the basis for large-scale sentiment lexicons in English, such as the Hashtag Sentiment Lexicon and Sentiment140, which extract sentiment-bearing terms from millions of tweets using hashtags or emoticons as seed indicators. In the context of Macedonian language processing, the authors adapt this method by using their manually constructed Macedonian sentiment polarity lexicon as seeds. They then mine additional sentiment-bearing words from a corpus of half a million Macedonian tweets, despite the smaller scale compared to English datasets. The use of a larger seed set is shown to be effective in enriching the lexicon. Furthermore, they explore the construction of lexicons using translated versions of existing lexicons as seeds.

To evaluate the impact of these lexicons, the authors define a set of features that are either fully or partially dependent on them. These include: the number of positive and negative words in a tweet, the ratio of positive and negative words to the total number of sentiment words, the sum of sentiment scores for all dictionary entries found, and the sum of positive and negative sentiment scores separately. Additional features include the count of positive and negative emoticons. When multiple lexicons are used simultaneously, each feature is instantiated separately for each lexicon. The classification model employs logistic regression, with base features consisting of TF-IDF-weighted unigrams and bigrams, along with emoticon presence, augmented by the sentiment-focused features derived from the lexicons.

Experiment

The evaluation follows the SemEval Twitter sentiment analysis framework, measuring performance using a balanced F-score across positive and negative classes. The first experiment validates the contribution of individual preprocessing steps, demonstrating that stopword removal, negation handling, and dialect normalization are essential for maintaining model accuracy, while part-of-speech tagging provides negligible benefits. The second experiment validates the role of sentiment resources, revealing that both manually curated and bootstrapped lexicons significantly drive performance, with the bootstrapped data proving particularly impactful for capturing nuanced sentiment cues.

{"summary": "The authors evaluate the impact of preprocessing steps and sentiment lexicons on sentiment analysis performance. Results show that certain preprocessing steps and lexicons significantly affect the F-score, with some contributing more than others.", "highlights": ["Excluding stopword removal and negation handling leads to a substantial drop in performance.", "Normalization to Standard Macedonian has a notable impact on the results.", "The manually-crafted lexicon contributes significantly, but bootstrapped lexicons have an even greater influence."]

The authors evaluate their system using a setup consistent with previous SemEval tasks, focusing on F-score as the primary metric. The dataset is split into training and testing sets with specified class distributions, and the results highlight the importance of preprocessing steps and lexicon features. Preprocessing steps such as stopword removal and negation handling significantly impact performance. The manually-crafted sentiment lexicon contributes substantially to the overall F-score. Bootstrapped lexicons are even more influential than the manually-crafted ones in improving performance.

The authors evaluate the impact of removing individual preprocessing steps on the F-score, with stopword removal and negation handling showing the most significant negative effects. Normalization to Standard Macedonian and handling of repeating characters also contribute meaningfully, while POS tagging has minimal impact. Results show that removing certain preprocessing components leads to substantial performance drops, with stopword removal and negation handling causing the largest decreases. Removing stopword removal and negation handling leads to the largest drops in F-score. Normalization to Standard Macedonian and handling of repeating characters significantly affect performance. POS tagging has negligible impact on the F-score.

The authors evaluate the impact of different preprocessing steps and sentiment lexicons on F-score performance. Results show that excluding certain lexicons leads to significant drops in performance, while the effect of removing other components is minimal. Excluding manually-crafted lexicons leads to a substantial drop in F-score. Excluding automatically-constructed lexicons causes the most significant decline in F-score. Removing translated lexicons has a negligible effect on F-score.

The authors evaluate their sentiment analysis system using a standard SemEval-style framework with dedicated training and testing splits to measure overall classification quality. One set of experiments validates the necessity of specific preprocessing techniques, revealing that stopword removal and negation handling are essential for robust performance while part-of-speech tagging proves largely unnecessary. A second series of experiments assesses different sentiment resource types, demonstrating that automatically constructed lexicons provide the greatest performance boost, followed by manually curated dictionaries, with translated resources offering minimal benefit.

ソースPDF

AIでAIを構築

アイデアからローンチまで — 無料のAIコーディング支援、すぐに使える環境、最高のGPU価格でAI開発を加速。

AI コーディング補助

すぐに使える GPU

最適な料金体系

開始する料金を見る

HyperAI Newsletters

最新情報を購読する

北京時間 毎週月曜日の午前9時 に、その週の最新情報をメールでお届けします

メール配信サービスは MailChimp によって提供されています

HyperAI

このノートブックを実行 Discordで議論

3年前

Dame Jovanoski Veno Pachovski Preslav Nakov

ツイッター感情分析

RTX 5090のコンピュートリソースがわずか20時間分 $1 (価値 $7)

ノートブックへ移動

概要

One-sentence Summary

Key Contributions

A freely available annotated corpus of Macedonian tweets is introduced, featuring both tweet-level and phrase-level sentiment polarity labels to support training and evaluation for this morphologically rich language.
Several large-scale sentiment lexicons are bootstrapped for Macedonian, and the impact of various preprocessing steps and feature sets is systematically evaluated to construct a Twitter sentiment analysis system.
Experimental evaluation demonstrates that the proposed approach achieves an F1-score of 92.16, matching the top performance levels recorded in recent SemEval competitions for English.

Introduction

Dataset

Dataset composition and sources: The authors collected approximately 500,000 tweets via the Twitter API between November 2014 and April 2015. A high-precision Naïve Bayes classifier was applied to filter out misclassified Bulgarian and Russian posts, isolating a Macedonian-only corpus. The dataset includes both tweet-level polarity labels and phrase-level sentiment annotations.
Subset details and filtering: The training set was annotated by a single native speaker for both tweet and phrase sentiment. The testing set was annotated by two native speakers, and the authors discarded 474 tweets where annotators disagreed, resulting in moderate inter-annotator agreement (Cohen's Kappa = 0.41). The final split maintains a roughly balanced ratio of positive and negative tweets, with a smaller neutral category.
Data usage and modeling: The authors use the corpus to train a logistic regression sentiment classifier weighted by TF-IDF features. The phrase-level annotations and translated lexicons are integrated into the training pipeline to bootstrap Macedonian sentiment resources and improve feature representation.
Processing and feature engineering: The preprocessing pipeline converts text to lowercase, strips URLs and @-mentions, removes 146 stopwords, and eliminates repeated characters. Negation is handled by prefixing tokens with NEG_CONTEXT_. Non-standard dialects and slang are normalized using a 173-entry mapping dictionary. The pipeline further applies rule-based and perceptron-trained POS tagging, fuzzy lemmatization via Jaro-Winkler and Levenshtein distances, and a custom 65-rule stemmer designed for Macedonian's highly inflected morphology. Final features are computed using TF-IDF weighting. No cropping or explicit metadata construction was applied beyond these linguistic transformations.

Method

SO(w) = PMI(w, pos) - PMI(w, neg)

Here, $PMI(w, pos)$ quantifies how strongly $w$ co-occurs with any seed positive term, and $PMI(w, neg)$ measures its association with seed negative terms. The PMI formula is defined as:

PMI(w, pos) = \frac{P(w, pos)}{P(w)P(pos)}

Experiment

ソースPDF

AIでAIを構築

アイデアからローンチまで — 無料のAIコーディング支援、すぐに使える環境、最高のGPU価格でAI開発を加速。

AI コーディング補助

すぐに使える GPU

最適な料金体系

開始する料金を見る

HyperAI Newsletters

最新情報を購読する

北京時間 毎週月曜日の午前9時 に、その週の最新情報をメールでお届けします

メール配信サービスは MailChimp によって提供されています

Command Palette

マケドニア語のTwitterにおける感情分析

Dame Jovanoski Veno Pachovski Preslav Nakov

ツイッター感情分析

概要

One-sentence Summary

Key Contributions

Introduction

Dataset

Method

Experiment

AIでAIを構築

HyperAI Newsletters

Command Palette

マケドニア語のTwitterにおける感情分析

Dame Jovanoski Veno Pachovski Preslav Nakov

ツイッター感情分析

概要

One-sentence Summary

Key Contributions

Introduction

Dataset

Method

Experiment

AIでAIを構築

HyperAI Newsletters

Command Palette

マケドニア語のTwitterにおける感情分析

Dame Jovanoski Veno Pachovski Preslav Nakov

ツイッター感情分析

概要

One-sentence Summary

Key Contributions

Introduction

Dataset

Method

Experiment

AIでAIを構築

HyperAI Newsletters