Synthetic Data in Research: Is This Genius or Just Lazy?

Introduction

Let’s cut the crap and get straight to it: are synthetic data and artificial respondents the smartest innovation in market research, or are we just being lazy as f***? In this post, I’ll break down what synthetic data is, how market research worked back in the day, and whether this AI-powered shortcut is brilliance or bulls***.

Spoiler alert: if you’ve ever paid a fortune for McKinsey or a high-end market research agency, buckle up. This one’s going to sting.

What Is AI-Driven Synthetic Market Research?

Imagine you’re conducting a survey, but instead of paying actual humans to fill it out (or begging friends and family to answer just one more goddamn questionnaire), you let artificial intelligence create synthetic respondents. These are virtual personas that mimic real people, complete with attitudes, preferences, and opinions.

Here’s the deal:

Sounds neat, right? Until you realize that the data they spit out is only as good as the model behind them and the data that model was trained on. Garbage in, garbage out, my friend.
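To make "garbage in, garbage out" concrete, here is a minimal Python sketch. Everything in it is hypothetical: the 50/50 market, the skewed panel, and the naive "synthesizer" that simply resamples what it was shown. It is not how any particular vendor builds synthetic respondents, just the simplest case of a generator inheriting the bias of its source.

```python
import random
from collections import Counter

random.seed(7)  # fixed seed so the sketch is repeatable

# Hypothetical ground truth: the real market is split 50/50 between A and B.
TRUE_SHARE_A = 0.5

# Hypothetical source data: a panel that over-represents fans of product A.
biased_panel = ["A"] * 800 + ["B"] * 200


def naive_synthesizer(source, n_respondents):
    """Create synthetic answers by resampling the source answers with replacement."""
    return random.choices(source, k=n_respondents)


synthetic_answers = naive_synthesizer(biased_panel, 10_000)
synthetic_share_a = Counter(synthetic_answers)["A"] / len(synthetic_answers)

print(f"True share preferring A:      {TRUE_SHARE_A:.0%}")
print(f"Synthetic share preferring A: {synthetic_share_a:.0%}")
```

Ten thousand synthetic respondents later, the answer still sits near the panel's 80%, not the market's 50%. More rows do not fix a skewed source.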

A Brief History of Data Collection

Before AI swaggered onto the scene, market research was gritty, expensive, and slow as hell. Here’s a quick timeline for the uninitiated:

Today: Enter AI and synthetic data to save the day—or ruin everything.

Case Study: Synthetic Data in Consumer Behavior Analysis

Let's get into the nitty-gritty of how synthetic data is shaking up consumer behavior analysis. Traditional methods? Slow, costly, and often riddled with biases. Enter synthetic data: AI-generated datasets that mirror real-world data without the privacy headaches.

The Process:

Why Bother?

But Wait, There's More:

A study by MOSTLY AI delved into the fidelity and privacy of mixed-type synthetic data. The findings? High-quality synthetic data can maintain the statistical properties of real data while safeguarding individual privacy. Source: mostly-ai/paper-fidelity-accuracy (GitHub).
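What does "maintaining the statistical properties" look like in practice? One simple check is to compare how often each category shows up in the real and the synthetic data. The sketch below uses total variation distance for that. The room-type counts are made up for illustration, and this is not claimed to be the exact metric the cited study used.

```python
from collections import Counter


def category_shares(values):
    """Return the share of each category in a list of values."""
    counts = Counter(values)
    return {category: count / len(values) for category, count in counts.items()}


def total_variation_distance(real_values, synthetic_values):
    """0.0 means identical category shares, 1.0 means no overlap at all."""
    real = category_shares(real_values)
    synthetic = category_shares(synthetic_values)
    categories = set(real) | set(synthetic)
    return 0.5 * sum(abs(real.get(c, 0.0) - synthetic.get(c, 0.0)) for c in categories)


# Hypothetical room-type columns from a real and a synthetic dataset.
real_rooms = ["Standard"] * 60 + ["Deluxe"] * 30 + ["Suite"] * 10
synthetic_rooms = ["Standard"] * 55 + ["Deluxe"] * 35 + ["Suite"] * 10

tvd = total_variation_distance(real_rooms, synthetic_rooms)
print(f"Total variation distance: {tvd:.3f}")
```

Here the distance comes out at about 0.05, meaning 5% of the probability mass sits in the wrong category. A full fidelity check would also look at numeric columns and at how pairs of columns relate, not just one column at a time.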

Conclusion: So, Is Synthetic Data a Game-Changer or Lazy as F***?

Let’s be real—synthetic data has its place, and it’s a damn exciting one. It’s faster, cheaper, and lets you test ideas at a scale that traditional methods can’t touch. But let’s not ignore the pitfalls:

The Pros

The Cons

Now, a special shoutout to all the overpriced agencies like McKinsey, Bain, and the rest of the gang. You’ve been charging eye-watering sums for market research for decades. What’s the game plan now? Oh, right—you’ll pivot to selling the idea of synthetic data with a sprinkle of your signature "expertise."

Let’s face it: companies don’t pay you for accuracy—they pay you to avoid getting fired when sh** hits the fan. When a strategy bombs, who’s first in the firing line? The research team. Next up, the creatives. The product team—where the real problems often lie—remains untouchable. Classic.

Visualizing Synthetic Data: Charts and Insights in Python

Here's a sneak peek:

# Import the required libraries
from sdv.single_table import GaussianCopulaSynthesizer
from sdv.datasets.demo import download_demo
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

# Download the demo data from SDV
data, metadata = download_demo(modality='single_table', dataset_name='fake_hotel_guests')

# Add a new 'free_gift' feature to the original data (assigned at random to about 30% of guests)
data['free_gift'] = np.random.choice([0, 1], size=len(data), p=[0.7, 0.3])
# Register the new column in the metadata so the synthesizer accepts it (assumes the SDV 1.x metadata API)
metadata.add_column(column_name='free_gift', sdtype='categorical')

# Train the SDV synthesizer and generate the synthetic data
synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(data)
synthetic_data = synthesizer.sample(num_rows=500)  # Generate 500 synthetic rows

# NOTE: the analysis below assumes the table contains 'room_rating', 'gender', 'room_type',
# 'age' and 'nights' columns. Check data.columns first; the demo table may use different names.

# Derive the analysis target: a binary column flagging high ratings
synthetic_data['high_rating'] = (synthetic_data['room_rating'] > 4).astype(int)

# Convert categorical columns to numeric dummy variables
synthetic_data = pd.get_dummies(synthetic_data, columns=['gender', 'room_type'], drop_first=True)

# Define the features and the target
X = synthetic_data.drop(['room_rating', 'high_rating'], axis=1)
X = X.select_dtypes(include=['number', 'bool'])  # the model cannot use raw text or date columns
y = synthetic_data['high_rating']

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the Random Forest model and check it on the held-out test set
classifier = RandomForestClassifier(random_state=42)
classifier.fit(X_train, y_train)
print(f"Held-out accuracy: {classifier.score(X_test, y_test):.2f}")

# Get the feature importances
importances = classifier.feature_importances_
feature_names = X.columns

# Build a DataFrame so the features can be sorted by importance
feature_importance_df = pd.DataFrame({
    'Feature': feature_names,
    'Importance': importances
}).sort_values(by='Importance', ascending=False)

# Print the feature importance table
print("Feature importance for high ratings:")
print(feature_importance_df)

# Visualize the feature importances as a pie chart
plt.figure(figsize=(8, 8))
plt.pie(
    feature_importance_df['Importance'],
    labels=feature_importance_df['Feature'],
    autopct='%1.1f%%',
    startangle=140,
    colors=plt.cm.Paired.colors
)
plt.title('Share of feature importance for high ratings', fontsize=16)
plt.show()



Feature Importance Table

The output shows which features (variables) the model leans on most when predicting whether a guest will give a high rating (here, a room_rating greater than 4).

Here’s a detailed breakdown:

Feature Importance for High Ratings:

              Feature  Importance
2              nights      0.3100
0                 age      0.2600
4    room_type_Deluxe      0.1500
6           free_gift      0.1400
5  room_type_Standard      0.1100
3         gender_Male      0.0300

Pie Chart

[Pie chart: share of feature importance for high ratings. nights 31.0%, age 26.0%, room_type_Deluxe 15.0%, free_gift 14.0%, room_type_Standard 11.0%, gender_Male 3.0%]

How to Interpret This?

nights (31.0%):

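The number of nights a guest stays is the feature the model relies on most.

Importance alone does not show direction, so whether longer stays go with higher or lower ratings has to be checked separately, for example by comparing average ratings by length of stay.

Actionable Insight: If longer stays do turn out to rate higher, focus on creating personalized experiences for guests staying multiple nights, such as offering discounts for longer stays or providing perks like free breakfast after a certain number of nights.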


age (26.0%):

The age of the guest is the second most important factor.

Certain age groups (e.g., younger or older guests) may have a tendency to give higher ratings.

Actionable Insight: Use demographic data to tailor experiences. For example:

Younger guests: Offer social activities, adventure tours, or nightlife suggestions.

Older guests: Provide quieter accommodations or wellness services like spa treatments.


room_type_Deluxe (15.0%):

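Whether or not a guest stayed in a Deluxe room is a meaningful signal for the model. Confirm that Deluxe guests actually rate higher before acting on it, since the importance score does not say which way the effect runs.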

Actionable Insight: Promote Deluxe rooms through upgrades, limited-time offers, or as part of premium packages to increase customer satisfaction.


free_gift (14.0%):

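In this demo, free_gift was assigned at random (the np.random.choice line in the code), so it cannot truly affect ratings. Its 14% share mostly shows how readily a random forest splits on noise. With a real gift flag from real stays, the same analysis would be worth reading.

Actionable Insight: Treat this as a hypothesis to test, not a finding. If real data supports it, consider small but meaningful gestures like complimentary gifts (e.g., a bottle of wine, chocolates), particularly for guests who might otherwise be neutral about their stay.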


room_type_Standard (11.0%):

The Standard room flag carries less importance than the Deluxe flag, but it still holds some influence.

Actionable Insight: Focus on improving the perceived value of Standard rooms, such as offering optional add-ons or small upgrades (e.g., better toiletries or complimentary water).


gender_Male (3.0%):

Guest gender (whether male or female) has the least influence on ratings.

Actionable Insight: Gender-specific strategies may not significantly improve satisfaction. Focus instead on other features like stay duration or room type.
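A word of caution on reading these percentages. The sketch below uses made-up data (not the SDV demo) in which ratings depend only on nights and free_gift is a coin flip. It compares the forest's built-in importance with scikit-learn's permutation importance on held-out data, then checks direction by comparing groups directly. The column names and the 20% label noise are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_guests = 2000

# Hypothetical data: ratings depend on nights only; free_gift is pure noise.
nights = rng.integers(1, 11, size=n_guests)                  # 1 to 10 nights
free_gift = rng.choice([0, 1], size=n_guests, p=[0.7, 0.3])  # random 30% get a gift
label_noise = rng.random(n_guests) < 0.2                     # flip 20% of the labels
high_rating = ((nights > 5) ^ label_noise).astype(int)

X = np.column_stack([nights, free_gift])
X_train, X_test, y_train, y_test = train_test_split(
    X, high_rating, test_size=0.2, random_state=42
)

model = RandomForestClassifier(random_state=42).fit(X_train, y_train)

# Impurity-based importance: what feature_importances_ reports.
impurity_nights, impurity_gift = model.feature_importances_

# Permutation importance: accuracy lost on held-out data when a column is shuffled.
perm = permutation_importance(model, X_test, y_test, n_repeats=20, random_state=42)
perm_nights, perm_gift = perm.importances_mean

print(f"impurity    nights={impurity_nights:.3f}  free_gift={impurity_gift:.3f}")
print(f"permutation nights={perm_nights:.3f}  free_gift={perm_gift:.3f}")

# Direction check: importance has no sign, so compare the groups themselves.
short_stay_share = high_rating[nights <= 5].mean()
long_stay_share = high_rating[nights > 5].mean()
print(f"high ratings, short stays: {short_stay_share:.2f}")
print(f"high ratings, long stays:  {long_stay_share:.2f}")
```

The coin-flip column still picks up a nonzero impurity share, while its permutation importance stays close to zero. That is the gap to watch for before turning a pie chart into a strategy.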

This code uses the Synthetic Data Vault (SDV) library to generate synthetic data based on a demo dataset. The GaussianCopulaSynthesizer learns the patterns in the source data (here, a demo dataset that is itself fake) and generates new, synthetic data that mirrors that dataset's structure and statistical properties.
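Curious what "learns the patterns" means under the hood? Here is a stripped-down sketch of the general Gaussian copula idea using NumPy and the standard library. It is not SDV's code: it assumes at least two continuous numeric columns with no ties, and SDV's real synthesizer handles far more (categorical columns, dates, missing values). The function name and the nights/spend data are hypothetical.

```python
from statistics import NormalDist

import numpy as np

_STD_NORMAL = NormalDist()
_to_normal = np.vectorize(_STD_NORMAL.inv_cdf)  # uniform (0, 1) -> normal score
_to_uniform = np.vectorize(_STD_NORMAL.cdf)     # normal score -> uniform (0, 1)


def gaussian_copula_sample(real, n_samples, seed=0):
    """Sample synthetic rows from a rank-based Gaussian copula fitted to `real`."""
    rng = np.random.default_rng(seed)
    n_rows, n_cols = real.shape

    # 1. Replace each value by its rank scaled into (0, 1), then map to normal scores.
    ranks = real.argsort(axis=0).argsort(axis=0) + 1
    normal_scores = _to_normal(ranks / (n_rows + 1))

    # 2. The correlation of the normal scores captures how the columns move together.
    corr = np.corrcoef(normal_scores, rowvar=False)

    # 3. Draw new, correlated normal scores.
    z = rng.multivariate_normal(np.zeros(n_cols), corr, size=n_samples)

    # 4. Map back: normal score -> uniform -> empirical quantile of each real column.
    u = _to_uniform(z)
    return np.column_stack([np.quantile(real[:, j], u[:, j]) for j in range(n_cols)])


# Hypothetical "real" data: skewed stay lengths, with spend rising with nights.
rng = np.random.default_rng(1)
nights = rng.gamma(shape=2.0, scale=2.0, size=1000)
spend = 50 * nights + rng.normal(0, 20, size=1000)
real = np.column_stack([nights, spend])

synthetic = gaussian_copula_sample(real, n_samples=500)

real_corr = np.corrcoef(real, rowvar=False)[0, 1]
synthetic_corr = np.corrcoef(synthetic, rowvar=False)[0, 1]
print(f"real correlation:      {real_corr:.2f}")
print(f"synthetic correlation: {synthetic_corr:.2f}")
```

The synthetic rows are new values, but each column keeps its original shape and the link between nights and spend carries over. That is the whole trick: copy the shape of each column and the way columns move together, not the individual records.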

So, what’s your take—genius or lazy as f***? 

Drop your thoughts in the comments. Just don’t let an AI bot answer for you. 😉 

Goran Peremin

With over 10 years of experience in the field, Goran has worked on a wide range of projects, including performance marketing for eCommerce, SEO, design thinking, UX design, graphic design, social media management, and influencer marketing. It’s safe to say that Goran is a master of many skills, each more refined than the last.


Goran holds great responsibility in shaping and executing all kinds of digital marketing strategies. From the high peaks of content marketing to the deep caves of SEO, PPC, email, and social media, he navigates them all. He leads Bima’s digital presence, always vigilant in measuring and reporting on the performance of all digital marketing campaigns. Indeed, he ensures they achieve their goals, both ROI and KPI, lest he incur the wrath of the marketing gods.

Goran, a Growth Marketer with a keen eye for trends and insights, tirelessly optimizes his costs and performance based on such findings.

He is a true brainstorming wizard, effortlessly conjuring up new and creative growth strategies, boldly running experiments and conversion tests to uncover the most effective paths to digital marketing success.

