3 years ago

Dame Jovanoski Veno Pachovski Preslav Nakov

Twitter Sentiment Analysis

20 Hours of RTX 5090 Compute Resources for Only $1 (Worth $7)

Table of Contents

Abstract

We present work on sentiment analysis in Twitter for Macedonian. As this is pioneering work for this combination of language and genre, we created suitable resources for training and evaluating a system for sentiment analysis of Macedonian tweets. In particular, we developed a corpus of tweets annotated with tweet-level sentiment polarity (positive, negative, and neutral), as well as with phrase-level sentiment, which we made freely available for research purposes. We further bootstrapped several large-scale sentiment lexicons for Macedonian, motivated by previous work for English. The impact of several different pre-processing steps as well as of various features is shown in experiments that represent the first attempt to build a system for sentiment analysis in Twitter for the morphologically rich Macedonian language. Overall, our experimental results show an F1-score of 92.16, which is very strong and is on par with the best results for English, which were achieved in recent SemEval competitions.

One-sentence Summary

This work presents a pioneering system for Twitter sentiment analysis in Macedonian that leverages a freely available corpus annotated with tweet- and phrase-level sentiment polarity alongside bootstrapped large-scale lexicons, achieving an F1-score of 92.16 that matches top SemEval English benchmarks after evaluating various pre-processing steps and features.

Key Contributions

A freely available annotated corpus of Macedonian tweets is introduced, featuring both tweet-level and phrase-level sentiment polarity labels to support training and evaluation for this morphologically rich language.
Several large-scale sentiment lexicons are bootstrapped for Macedonian, and the impact of various preprocessing steps and feature sets is systematically evaluated to construct a Twitter sentiment analysis system.
Experimental evaluation demonstrates that the proposed approach achieves an F1-score of 92.16, matching the top performance levels recorded in recent SemEval competitions for English.

Introduction

The rapid expansion of social media has positioned sentiment analysis as a vital tool for tracking public opinion, informing marketing strategies, and monitoring societal trends. While English Twitter sentiment analysis has matured through standardized benchmarks and large-scale automated lexicons, Macedonian natural language processing remains significantly underdeveloped. Previous Macedonian studies relied on small, domain-specific corpora with non-public annotations and manually curated lexicons that could not scale or adequately address the morphological complexity of social media text. To address these gaps, the authors develop the first end-to-end sentiment analysis framework for Macedonian tweets. They release a native-annotated corpus covering positive, negative, and neutral polarities, bootstrap large-scale sentiment lexicons specifically for the language, and demonstrate a classification model that achieves performance on par with top-tier English systems.

Dataset

Dataset composition and sources: The authors collected approximately 500,000 tweets via the Twitter API between November 2014 and April 2015. A high-precision Naïve Bayes classifier was applied to filter out misclassified Bulgarian and Russian posts, isolating a Macedonian-only corpus. The dataset includes both tweet-level polarity labels and phrase-level sentiment annotations.
Subset details and filtering: The training set was annotated by a single native speaker for both tweet and phrase sentiment. The testing set was annotated by two native speakers, and the authors discarded 474 tweets where annotators disagreed, resulting in moderate inter-annotator agreement (Cohen's Kappa = 0.41). The final split maintains a roughly balanced ratio of positive and negative tweets, with a smaller neutral category.
Data usage and modeling: The authors use the corpus to train a logistic regression sentiment classifier weighted by TF-IDF features. The phrase-level annotations and translated lexicons are integrated into the training pipeline to bootstrap Macedonian sentiment resources and improve feature representation.
Processing and feature engineering: The preprocessing pipeline converts text to lowercase, strips URLs and @-mentions, removes 146 stopwords, and eliminates repeated characters. Negation is handled by prefixing tokens with NEG_CONTEXT_. Non-standard dialects and slang are normalized using a 173-entry mapping dictionary. The pipeline further applies rule-based and perceptron-trained POS tagging, fuzzy lemmatization via Jaro-Winkler and Levenshtein distances, and a custom 65-rule stemmer designed for Macedonian's highly inflected morphology. Final features are computed using TF-IDF weighting. No cropping or explicit metadata construction was applied beyond these linguistic transformations.

Method

The authors leverage an automated approach to construct sentiment lexicons using Pointwise Mutual Information (PMI) to determine the semantic orientation of words and phrases within textual data. This method assigns a sentiment score to each word based on its association with predefined positive and negative seed terms. The semantic orientation $SO(w)$ of a word $w$ is computed as the difference between its PMI with positive seed terms and its PMI with negative seed terms:

SO(w) = PMI(w, pos) - PMI(w, neg)

Here, $PMI(w, pos)$ quantifies how strongly $w$ co-occurs with any seed positive term, and $PMI(w, neg)$ measures its association with seed negative terms. The PMI formula is defined as:

PMI(w, pos) = \frac{P(w, pos)}{P(w)P(pos)}

where $P(w, pos)$ is the joint probability of observing $w$ alongside any seed positive term in a tweet, $P(w)$ is the marginal probability of $w$ appearing in any tweet, and $P(pos)$ is the probability of encountering any seed positive term in a tweet. A similar definition applies to $PMI(w, neg)$ . A positive $SO(w)$ value indicates a positive sentiment polarity, while a negative value reflects negative polarity, with the magnitude indicating the strength of sentiment.

This PMI-based framework forms the basis for large-scale sentiment lexicons in English, such as the Hashtag Sentiment Lexicon and Sentiment140, which extract sentiment-bearing terms from millions of tweets using hashtags or emoticons as seed indicators. In the context of Macedonian language processing, the authors adapt this method by using their manually constructed Macedonian sentiment polarity lexicon as seeds. They then mine additional sentiment-bearing words from a corpus of half a million Macedonian tweets, despite the smaller scale compared to English datasets. The use of a larger seed set is shown to be effective in enriching the lexicon. Furthermore, they explore the construction of lexicons using translated versions of existing lexicons as seeds.

To evaluate the impact of these lexicons, the authors define a set of features that are either fully or partially dependent on them. These include: the number of positive and negative words in a tweet, the ratio of positive and negative words to the total number of sentiment words, the sum of sentiment scores for all dictionary entries found, and the sum of positive and negative sentiment scores separately. Additional features include the count of positive and negative emoticons. When multiple lexicons are used simultaneously, each feature is instantiated separately for each lexicon. The classification model employs logistic regression, with base features consisting of TF-IDF-weighted unigrams and bigrams, along with emoticon presence, augmented by the sentiment-focused features derived from the lexicons.

Experiment

The evaluation follows the SemEval Twitter sentiment analysis framework, measuring performance using a balanced F-score across positive and negative classes. The first experiment validates the contribution of individual preprocessing steps, demonstrating that stopword removal, negation handling, and dialect normalization are essential for maintaining model accuracy, while part-of-speech tagging provides negligible benefits. The second experiment validates the role of sentiment resources, revealing that both manually curated and bootstrapped lexicons significantly drive performance, with the bootstrapped data proving particularly impactful for capturing nuanced sentiment cues.

{"summary": "The authors evaluate the impact of preprocessing steps and sentiment lexicons on sentiment analysis performance. Results show that certain preprocessing steps and lexicons significantly affect the F-score, with some contributing more than others.", "highlights": ["Excluding stopword removal and negation handling leads to a substantial drop in performance.", "Normalization to Standard Macedonian has a notable impact on the results.", "The manually-crafted lexicon contributes significantly, but bootstrapped lexicons have an even greater influence."]

The authors evaluate their system using a setup consistent with previous SemEval tasks, focusing on F-score as the primary metric. The dataset is split into training and testing sets with specified class distributions, and the results highlight the importance of preprocessing steps and lexicon features. Preprocessing steps such as stopword removal and negation handling significantly impact performance. The manually-crafted sentiment lexicon contributes substantially to the overall F-score. Bootstrapped lexicons are even more influential than the manually-crafted ones in improving performance.

The authors evaluate the impact of removing individual preprocessing steps on the F-score, with stopword removal and negation handling showing the most significant negative effects. Normalization to Standard Macedonian and handling of repeating characters also contribute meaningfully, while POS tagging has minimal impact. Results show that removing certain preprocessing components leads to substantial performance drops, with stopword removal and negation handling causing the largest decreases. Removing stopword removal and negation handling leads to the largest drops in F-score. Normalization to Standard Macedonian and handling of repeating characters significantly affect performance. POS tagging has negligible impact on the F-score.

The authors evaluate the impact of different preprocessing steps and sentiment lexicons on F-score performance. Results show that excluding certain lexicons leads to significant drops in performance, while the effect of removing other components is minimal. Excluding manually-crafted lexicons leads to a substantial drop in F-score. Excluding automatically-constructed lexicons causes the most significant decline in F-score. Removing translated lexicons has a negligible effect on F-score.

The authors evaluate their sentiment analysis system using a standard SemEval-style framework with dedicated training and testing splits to measure overall classification quality. One set of experiments validates the necessity of specific preprocessing techniques, revealing that stopword removal and negation handling are essential for robust performance while part-of-speech tagging proves largely unnecessary. A second series of experiments assesses different sentiment resource types, demonstrating that automatically constructed lexicons provide the greatest performance boost, followed by manually curated dictionaries, with translated resources offering minimal benefit.

Source PDF

Table of Contents

Build AI with AI

From idea to launch — accelerate your AI development with free AI co-coding, out-of-the-box environment and best price of GPUs.

AI Co-coding

Ready-to-use GPUs

Best Pricing

Get Started View Pricing

HyperAI Newsletters

Subscribe to our latest updates

We will deliver the latest updates of the week to your inbox at nine o'clock every Monday morning

HyperAI

Run this Notebook Discuss on Discord

3 years ago

Dame Jovanoski Veno Pachovski Preslav Nakov

Twitter Sentiment Analysis

20 Hours of RTX 5090 Compute Resources for Only $1 (Worth $7)

Go to Notebook

Table of Contents

Abstract

One-sentence Summary

Key Contributions

A freely available annotated corpus of Macedonian tweets is introduced, featuring both tweet-level and phrase-level sentiment polarity labels to support training and evaluation for this morphologically rich language.
Several large-scale sentiment lexicons are bootstrapped for Macedonian, and the impact of various preprocessing steps and feature sets is systematically evaluated to construct a Twitter sentiment analysis system.
Experimental evaluation demonstrates that the proposed approach achieves an F1-score of 92.16, matching the top performance levels recorded in recent SemEval competitions for English.

Introduction

Dataset

Dataset composition and sources: The authors collected approximately 500,000 tweets via the Twitter API between November 2014 and April 2015. A high-precision Naïve Bayes classifier was applied to filter out misclassified Bulgarian and Russian posts, isolating a Macedonian-only corpus. The dataset includes both tweet-level polarity labels and phrase-level sentiment annotations.
Subset details and filtering: The training set was annotated by a single native speaker for both tweet and phrase sentiment. The testing set was annotated by two native speakers, and the authors discarded 474 tweets where annotators disagreed, resulting in moderate inter-annotator agreement (Cohen's Kappa = 0.41). The final split maintains a roughly balanced ratio of positive and negative tweets, with a smaller neutral category.
Data usage and modeling: The authors use the corpus to train a logistic regression sentiment classifier weighted by TF-IDF features. The phrase-level annotations and translated lexicons are integrated into the training pipeline to bootstrap Macedonian sentiment resources and improve feature representation.
Processing and feature engineering: The preprocessing pipeline converts text to lowercase, strips URLs and @-mentions, removes 146 stopwords, and eliminates repeated characters. Negation is handled by prefixing tokens with NEG_CONTEXT_. Non-standard dialects and slang are normalized using a 173-entry mapping dictionary. The pipeline further applies rule-based and perceptron-trained POS tagging, fuzzy lemmatization via Jaro-Winkler and Levenshtein distances, and a custom 65-rule stemmer designed for Macedonian's highly inflected morphology. Final features are computed using TF-IDF weighting. No cropping or explicit metadata construction was applied beyond these linguistic transformations.

Method

SO(w) = PMI(w, pos) - PMI(w, neg)

Here, $PMI(w, pos)$ quantifies how strongly $w$ co-occurs with any seed positive term, and $PMI(w, neg)$ measures its association with seed negative terms. The PMI formula is defined as:

PMI(w, pos) = \frac{P(w, pos)}{P(w)P(pos)}

Experiment

Source PDF

Table of Contents

Build AI with AI

From idea to launch — accelerate your AI development with free AI co-coding, out-of-the-box environment and best price of GPUs.

AI Co-coding

Ready-to-use GPUs

Best Pricing

Get Started View Pricing

HyperAI Newsletters

Subscribe to our latest updates

We will deliver the latest updates of the week to your inbox at nine o'clock every Monday morning

Command Palette

Sentiment Analysis in Twitter for Macedonian

Dame Jovanoski Veno Pachovski Preslav Nakov

Twitter Sentiment Analysis

Abstract

One-sentence Summary

Key Contributions

Introduction

Dataset

Method

Experiment

Build AI with AI

HyperAI Newsletters

Command Palette

Sentiment Analysis in Twitter for Macedonian

Dame Jovanoski Veno Pachovski Preslav Nakov

Twitter Sentiment Analysis

Abstract

One-sentence Summary

Key Contributions

Introduction

Dataset

Method

Experiment

Build AI with AI

HyperAI Newsletters

Command Palette

Sentiment Analysis in Twitter for Macedonian

Dame Jovanoski Veno Pachovski Preslav Nakov

Twitter Sentiment Analysis

Abstract

One-sentence Summary

Key Contributions

Introduction

Dataset

Method

Experiment

Build AI with AI

HyperAI Newsletters