Synthetic Data - What Is It and What You Need to Know About It

Synthetic data is an emerging topic in the artificial intelligence (AI) world.

More than ever, organizations turn to advanced analytics and AI to optimize their operational processes, enhance customer experience, or develop new products and services. But to stay competitive using these capabilities, organizations need broader access to internal and external data, a resource that is hard to find, not always available, and sometimes tricky to use. This is especially true of customer data. Given the increased emphasis on data protection and AI ethics in Europe, many organizations have started scaling back their AI innovation efforts. They wonder how to remain competitive in the algorithmic economy when data is abundant but cannot be used to train models.

That is why many professionals see synthetic data as a way to address these issues. To better understand its value, this article takes up some fundamental questions: What is synthetic data? How is it generated? Who benefits the most? And are there any valuable examples from everyday life?

Glasses in front of computer screens with data and codes — Photo by Kevin Ku from Unsplash

What Is Synthetic Data and How Is It Generated?

Synthetic data is data that is artificially created, typically by machine learning (ML) algorithms, rather than generated by actual events. It can be used for a wide range of activities, such as providing test data for new products and tools or adding variety to the data used to train AI models.

Many sources identify different types of synthetic data for various purposes. One article by Statice explained the three common types:

Synthetic text
Synthetic images and videos
Tabular synthetic data

Synthetic data is typically created by a generative model that is trained on the original dataset and produces synthetic records resembling the real data. The main generative models for synthetic data are Generative Adversarial Networks (GANs), Variational Autoencoders (VAEs), and autoregressive models.
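To make the idea of an autoregressive model more concrete, here is a minimal sketch in Python. It is not how production tools work: real generators usually rely on neural networks, while this toy version simply counts how often values occur. The table, column meanings, and function names are hypothetical and chosen only for illustration. The sketch generates each synthetic row one column at a time, with the second column drawn conditionally on the first.

```python
import random
from collections import Counter, defaultdict

# Hypothetical toy "real" table of (age_band, channel) pairs, for illustration only.
real_rows = [
    ("18-30", "app"), ("18-30", "app"), ("18-30", "web"),
    ("31-50", "web"), ("31-50", "web"), ("31-50", "branch"),
    ("51+", "branch"), ("51+", "branch"), ("51+", "web"),
]

def fit_autoregressive(rows):
    """Estimate P(first column) and P(second column | first column) by counting."""
    first = Counter(a for a, _ in rows)
    second_given_first = defaultdict(Counter)
    for a, b in rows:
        second_given_first[a][b] += 1
    return first, second_given_first

def sample_rows(model, n, seed=0):
    """Draw n synthetic rows, column by column, each conditioned on the previous one."""
    first, second_given_first = model
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        a = rng.choices(list(first), weights=list(first.values()))[0]
        cond = second_given_first[a]
        b = rng.choices(list(cond), weights=list(cond.values()))[0]
        rows.append((a, b))
    return rows

model = fit_autoregressive(real_rows)
synthetic_rows = sample_rows(model, 5)
print(synthetic_rows)
```

Because the sketch only samples combinations it has counted, it can never produce a pair that is absent from the source table. A learned model would generalize further, which is both its strength and a reason to verify its output.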

Put simply, synthetic data can be generated whether no real data, some real data, or plenty of real data is available. If there is no real data but there is a broad understanding of how the dataset is distributed, a random sample can be drawn from an assumed distribution. In that case, the quality of the synthetic data depends on the engineer’s grasp of the specific data environment. Where real data does not exist, synthetic data can be the right solution. When real data exists, synthetic data is generated from a distribution fitted to it (a best-fit distribution). A hybrid approach is used when only some real data exists: part of the dataset is generated from assumed distributions and other parts from actual data.
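The three situations described above can be sketched with Python's standard library. This is an illustration under stated assumptions, not a recommended pipeline: it assumes every column is roughly normally distributed, whereas in practice one would compare several distribution families before choosing a best fit. The function names, the sample measurements, and the assumed parameters are all hypothetical.

```python
from statistics import NormalDist

def synth_without_real_data(n, assumed_mean, assumed_stdev, seed=0):
    """No real data: sample from a distribution the engineer assumes."""
    return NormalDist(assumed_mean, assumed_stdev).samples(n, seed=seed)

def synth_from_real_data(real_values, n, seed=0):
    """Real data available: fit a normal distribution to it, then sample from the fit."""
    fitted = NormalDist.from_samples(real_values)
    return fitted.samples(n, seed=seed)

def synth_hybrid(real_values, n, assumed_mean, assumed_stdev, seed=0):
    """Some real data: one column comes from a fit, the other from an assumption."""
    col_fitted = synth_from_real_data(real_values, n, seed=seed)
    col_assumed = synth_without_real_data(n, assumed_mean, assumed_stdev, seed=seed + 1)
    return list(zip(col_fitted, col_assumed))

# Hypothetical height measurements in cm, for illustration only.
real_heights_cm = [162.0, 170.5, 175.0, 168.2, 181.3, 159.8, 172.4]

# Weight in kg is the column with no real data; 70 and 12 are assumed parameters.
assumed_only = synth_without_real_data(1000, assumed_mean=70.0, assumed_stdev=12.0)
fitted_only = synth_from_real_data(real_heights_cm, 1000)
hybrid_rows = synth_hybrid(real_heights_cm, 1000, assumed_mean=70.0, assumed_stdev=12.0)
```

Note that the hybrid sketch samples the two columns independently, so any real-world correlation between them is lost. Preserving such relationships is one reason generative models are used instead of per-column distributions.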

What Are the Advantages and Disadvantages of Synthetic Data?

Organizations have more data than ever at their disposal, but the critical challenge is how to use the insights from the datasets for impactful solutions.

Organizations use big data tools and advanced analytics applications to generate value from their massive datasets. Synthetic data has a significant role in the development and improvement of critical applications.

Laptop with open applications that are connected — Photo by Buffik from Pixabay

Some of the advantages of synthetic data include:

Cost – Once the generative model is set up, producing new data becomes cheap and fast. Generating synthetic data is more cost-effective and efficient than collecting real data. This is said to apply especially to the autonomous vehicle space, where collecting real data is expensive and time-consuming.

Privacy – Real data may have usage constraints due to privacy rules or other regulations. Synthetic data can replicate the important statistical properties of real data without exposing the real records, largely removing that constraint. With personal information removed, the data is far harder to trace back to the original owner, so copyright and privacy infringements can be avoided. This is critical in machine learning applications that simulate realistic user behavior, where private information must be protected.

Testing – Synthetic data can play an important role in system testing and training. It is used not only to examine the performance of existing systems but also to train new systems on scenarios that are not represented in the real data. Synthetic data is also immune to some common statistical problems, such as item nonresponse, skip patterns, and other logical constraints.

Synthetic data has various benefits, but it also has limitations. Some of them can be:

Output accuracy – Synthetic data only mimics the real data; it is not a replica and does not copy the original content exactly. This means that synthetic data may not cover some of the outliers that the original data has.

Dependence on the source – The quality of the synthetic data is closely tied to the source data and the generation model that created it. Biases in the source data can be reflected in the synthetic data.

Verification – A verification step is needed to ensure that the system has been trained appropriately and is not producing outputs that stem from assumptions built into the synthetic data.

Synthetic data generation requires time and effort – Though often easier to create than real data, good synthetic data can still be time-consuming and costly to produce.

User acceptance – Many organizations show cultural resistance to adopting synthetic data, so it may not be accepted as valid by users who have not seen its benefits before.
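Several of the limitations above (missed outliers, dependence on the source, the need for verification) can be probed with a simple comparison of real and synthetic values. The following Python sketch is only a starting point, and the function name, metrics, and sample values are hypothetical. A real validation would typically add further checks, such as training a model on synthetic data and testing it on real data.

```python
from statistics import fmean, stdev

def fidelity_report(real, synthetic):
    """Compare simple summary statistics of a real and a synthetic column."""
    real_range = max(real) - min(real)
    # Length of the interval that both datasets cover (zero if they do not overlap).
    overlap = max(0.0, min(max(real), max(synthetic)) - max(min(real), min(synthetic)))
    return {
        "mean_gap": abs(fmean(real) - fmean(synthetic)),
        "stdev_gap": abs(stdev(real) - stdev(synthetic)),
        # Share of the real value range covered by the synthetic data;
        # a low value hints that outliers in the real data are missing.
        "range_coverage": overlap / real_range if real_range > 0 else 1.0,
    }

# Hypothetical values: the real column contains one outlier (40.0).
real_values = [10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 40.0]
synthetic_values = [10.5, 11.8, 12.1, 11.2, 12.9, 11.5, 12.4]

report = fidelity_report(real_values, synthetic_values)
print(report)
```

In this example the synthetic values look plausible on their own, but the low range coverage and the gap in the means show that the outlier in the real data was not reproduced.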

Based on Gartner’s estimates, by 2024, 60% of the data used for the development of AI and analytics projects is expected to be synthetic.

Mostly AI claims that synthetic data can retain 99% of the information and value of the original dataset while protecting sensitive data from re-identification.

Who benefits the most from synthetic data? Positive experiences have been reported in automotive and robotics, financial services, healthcare, manufacturing, national security, and social media. Business functions that can benefit from synthetic data include marketing, ML, agile development and DevOps, HR, and others.

Synthetic data and real data in AI models — Photo source: aimultiple.com

Everyday Examples of Synthetic Data Usage

The best way to understand the value of synthetic data is to showcase some everyday examples.

It is said that Denmark and the other Nordic countries have high-quality health data that has the potential to enable the healthcare sector to detect diseases early, improve diagnosis, and create treatments tailored to individuals.

Finding new treatment options by analysing large quantities of data can be a problem for the healthcare sector because of the inability to share data. The University of Copenhagen is developing a method that can use original data to generate synthetic datasets to address this problem. The Novo Nordisk Foundation supports the project “Synthetic Health and Research Data (SHARED)” that should create synthetic data to be shared without compromising data security.

Another example is the Norwegian Survey on living conditions/European Health Interview Survey (EHIS). Due to confidentiality policies at Statistics Norway and the sensitive nature of health data, a new method takes advantage of the rich register data to establish synthetic data in a model-free way.

Alexa by Amazon on a desk next to a computer — Photo by Piotr Cichosz from Unsplash

“Hey, Alexa, how do you work?” Alexa is built on natural language processing (NLP), which converts speech into words, sounds, and ideas. Amazon uses synthetic data to train Alexa’s language system. Amazon’s investment in synthetic data extends to its other products and services: Amazon Go also uses synthetic data to train the algorithms behind its cashierless stores.

Similar examples are Google’s Waymo, which uses synthetic data to train its autonomous vehicles, and the Toyota Research Institute, which uses it for dynamic scene understanding, specifically for autonomous driving.

A good example in the security of financial transactions is American Express, which uses synthetic financial data to improve fraud detection. The San Francisco-based company States Title uses synthetic data for faster and safer real estate transactions.

The number of use cases that benefit from synthetic data is constantly increasing.

Conclusion

The creation and usage of synthetic data will only grow as our data becomes more complex and more closely guarded.

It is legitimate to ask who else may benefit from synthetic data, which industries and companies, and how individuals can use synthetic data in their personal lives.

Despite its limitations, the benefits of synthetic data are practical and apply in many cases.

