Description

Multi-label Classification Dataset for Brazilian Restaurant Reviews

This dataset was used in the article "Exploring Supervised Learning Models for Multi-Label Text Classification in Brazilian Restaurant Reviews". It contains 4,000 manually annotated sentences extracted from user reviews of Brazilian restaurants posted on platforms such as Google Reviews, Facebook, and TripAdvisor.

Each sentence is labeled with one or more thematic categories from the following set: food, service, ambience, general, price, drink, location, and other. The labels are presented as comma-separated strings in the categoria column.

File format: CSV with two columns, one holding the sentence text and the categoria column holding its labels.

Use cases: Multi-label classification, thematic categorization of user reviews, NLP model evaluation in Portuguese.
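As a minimal sketch of preparing the categoria column for multi-label classification, each comma-separated string can be split into a 0/1 vector over the eight categories. The helper name to_multi_hot and the two example strings are illustrative assumptions, not taken from the dataset.

```python
# Category set as listed in the dataset description.
LABELS = ["food", "service", "ambience", "general",
          "price", "drink", "location", "other"]


def to_multi_hot(categoria: str) -> list[int]:
    """Turn a comma-separated label string into a 0/1 vector over LABELS."""
    present = {part.strip() for part in categoria.split(",") if part.strip()}
    unknown = present - set(LABELS)
    if unknown:
        # Fail loudly on labels outside the documented set.
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    return [int(label in present) for label in LABELS]


# Hypothetical categoria values, for illustration only.
examples = ["food, service", "price"]
vectors = [to_multi_hot(c) for c in examples]
```

Stripping whitespace around each label keeps the parsing robust whether or not the file puts a space after each comma.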

Download: [download]

Article link: https://sol.sbc.org.br/index.php/eniac/article/view/25698

Polarity sentences (Portuguese language)

Computer-BR dataset, used in the article "Comparing Approaches to Subjectivity Classification: A Study on Portuguese Tweets". Shared online [download].

Article link: https://link.springer.com/chapter/10.1007/978-3-319-41552-9_8

Dataset based on the article "The role of text pre-processing in opinion mining on a social media language dataset", shared online [download].

Article link: https://ieeexplore.ieee.org/abstract/document/6984806

Sentences from the Google Play Store in Portuguese, with negative and positive labels.

Sentiment Lexicons (Portuguese language)

This dataset contains sentiment lexicons for the Portuguese language with 56,755 terms specific to the restaurant domain [download].

Sample sentiment lexicon

"excelente","0.9919043535940205","0.008095646405979519","positivo"

where the fields are, in order: term, p_pos, p_neg, and class.

Example

term       p_pos     p_neg     class
excelente  0.991904  0.008096  positivo
agradável  0.971788  0.028212  positivo
ruim       0.326821  0.673179  negativo
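To illustrate how such a lexicon might be used, the sketch below parses rows in the quoted CSV layout of the sample line and scores a sentence as the mean of p_pos minus p_neg over the tokens found in the lexicon. The scoring rule, the function names load_lexicon and polarity, and the test sentences are assumptions made for illustration; they are not the method of the cited paper. The sketch also assumes the file has no header row, which the description does not state.

```python
import csv
import io
import re

# Rows built from the example values above; the real file has 56,755 terms.
SAMPLE = '''"excelente","0.991904","0.008096","positivo"
"agradável","0.971788","0.028212","positivo"
"ruim","0.3268206840537858","0.6731793159462143","negativo"
'''


def load_lexicon(handle):
    """Map each term to a (p_pos, p_neg, class) tuple."""
    lexicon = {}
    for term, p_pos, p_neg, label in csv.reader(handle):
        lexicon[term] = (float(p_pos), float(p_neg), label)
    return lexicon


def polarity(text, lexicon):
    """Mean of p_pos - p_neg over known tokens; 0.0 when none are known."""
    tokens = re.findall(r"\w+", text.lower())
    scores = [lexicon[t][0] - lexicon[t][1] for t in tokens if t in lexicon]
    return sum(scores) / len(scores) if scores else 0.0


lexicon = load_lexicon(io.StringIO(SAMPLE))
print(polarity("Ambiente agradável e comida excelente", lexicon))  # positive score
print(polarity("Atendimento ruim", lexicon))  # negative score
```

To run this against the downloaded file, replace the io.StringIO object with an open file handle (UTF-8 encoding assumed).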

Code

import pandas as pd

path = 'lexicons-webmedia21.csv'
df = pd.read_csv(path)
df.head()  # preview the first rows

Citation

Please cite the following if you use the data:

Tiago de Melo. Building a Restaurant-Specific Sentiment Lexicon via Probability Theory. In: Proceedings of the Brazilian Symposium on Multimedia and the Web (WebMedia). 2021. p. 129-132. [link]

SentiProdBR (Portuguese Language)

This dataset contains sentiment lexicons for the Portuguese language with 32,009 terms in 10 product domains [download].

Sample sentiment lexicon

"laptops", "fácil", "0.9801980198019802", "0.0198019801980198", "positive"

where the fields are, in order: domain, term, p_pos, p_neg, and class.

Example

domain   term      p_pos               p_neg                class
laptops  fácil     0.9801980198019802  0.0198019801980198   positive
pets     molegngo  0.0                 1.0                  negative
food     saboroso  0.9769021739130436  0.02309782608695652  positive
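Because the same term can carry different polarity in different domains, a lookup keyed by the (domain, term) pair is one natural way to use this lexicon. The sketch below assumes the file follows the sample line exactly, including the space after each comma and no header row; the function name load_domain_lexicon is hypothetical. With that layout, Python's csv reader needs skipinitialspace=True, otherwise the quotes are kept as part of each field.

```python
import csv
import io

# First row is the sample line above; the second is built from the example
# values in the same layout.
DOMAIN_SAMPLE = '''"laptops", "fácil", "0.9801980198019802", "0.0198019801980198", "positive"
"food", "saboroso", "0.9769021739130436", "0.02309782608695652", "positive"
'''


def load_domain_lexicon(handle):
    """Map each (domain, term) pair to a (p_pos, p_neg, class) tuple."""
    lexicon = {}
    # skipinitialspace=True ignores the blank after each comma so that the
    # quoted fields are parsed correctly.
    reader = csv.reader(handle, skipinitialspace=True)
    for domain, term, p_pos, p_neg, label in reader:
        lexicon[(domain, term)] = (float(p_pos), float(p_neg), label)
    return lexicon


domain_lexicon = load_domain_lexicon(io.StringIO(DOMAIN_SAMPLE))
p_pos, p_neg, label = domain_lexicon[("laptops", "fácil")]
print(label, p_pos, p_neg)
```

pandas.read_csv accepts a skipinitialspace argument as well, which may be needed if the downloaded file really uses this spacing.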

Code

import pandas as pd

path = 'sentiprodbr.csv'
df = pd.read_csv(path)
df.head()  # preview the first rows

Citation

Please cite the following if you use the data:

Tiago de Melo. SentiProdBR: Building Domain-Specific Sentiment Lexicons for the Portuguese Language. In: Anais do XXXVI Simpósio Brasileiro de Bancos de Dados. 2021. p. XXX-YYY.