Ensuring data quality in a complex data environment

Data quality is often considered the first commandment of data management, and with good reason. Only high-quality data is useful, and to count as high quality it must be consistent and unambiguous. All data that is gathered, stored and consumed during business processes directly affects the success of the business.

But as the amount of data and the number of data sources grow exponentially, ensuring data quality across the entire enterprise is becoming extremely complex, and companies face considerable challenges in maintaining it. To explore the topic, we invited Harald Smith, formerly Director of Product Marketing at Syncsort, to discuss the current state of data quality, common pitfalls, the links to emerging tech and data governance, and best practices that can help companies ensure quality data across their enterprise.

Hyperight: Let’s start by introducing yourself and telling us a bit about your background and Syncsort’s area of expertise.

Harald: I’m Harald Smith, Director of Product Marketing at Syncsort, where I’m responsible for strategy around Syncsort’s market-leading enterprise data quality and integration software solutions. Syncsort’s products provide a simple way for organisations to optimise, assure, integrate and advance data, with a particular emphasis on helping enterprises prepare data (including from legacy systems) for next-generation platforms and analytics.

My background includes experience in product management and solution architecture with a focus on accelerating customer value in the integration, management, and use of information. I also have extensive expertise in product and project management, data quality products and solutions, application development including Agile methodology and UX design, technical services, and business processes.

Hyperight: The amount of data that is generated and gathered in companies is increasing at an unprecedented pace. From your perspective, how successful are companies at maintaining a high level of data quality across the enterprise, and are they investing enough in it?

Harald: From my perspective, by and large, they are not. Collecting data from diverse, high-volume sources makes it harder to ensure that the data is high quality, and harder still to ensure that it is “fit for purpose”. As enterprises move towards real-world machine learning and artificial intelligence use cases, we’ll see more mature conversations about data quality as companies get serious about its impact.

Poor data quality can cost a business an average of $15 million a year, according to Gartner, and the risk is becoming even bigger as machine learning and artificial intelligence multiply the impact exponentially. As more organisations shift to next-gen platforms like cloud, we’re also seeing more data quality issues emerge.

A new survey indicates that nearly 80 percent of AI and machine learning projects have stalled due to issues with data quality and proper labelling of data attributes. In the same survey, two-thirds of respondents cited “bias or errors in the data” as a common issue, and half reported “data not in a usable form.” And then we fall into the classic “garbage in, garbage out” reality of poor data quality, where moving the data to these new platforms (like blockchain and cloud) and replicating it simply propagates the issues further along the data pipeline in more and more instances.

Hyperight: Let’s discuss data quality and emerging tech – what are the pitfalls? Are there any special considerations for data quality and emerging tech adoption?

Harald: Data quality has rarely been black or white – there’s always a contextual aspect to data. But for emerging technologies, there are new considerations that come into play. Current data quality solutions are being stretched as data has to be evaluated, integrated, de-duplicated, cleansed and matched across a variety of data sources in preparation for more advanced use cases such as predicting customer behaviour, analysing risk or detecting fraud, all of which require analysing huge volumes of data. At the same time, issues such as bias and provenance are emerging that make it harder to discern whether certain data sources are even usable or valuable.

Emerging use cases such as machine learning or blockchain have new or different requirements and needs. Consider how simple, obvious correlations that already exist in the data might become the “insight” that machine learning finds. Or how the immutability of blockchain may propagate a data quality issue that would otherwise be updated and resolved.

To solve the challenge, companies must make an ongoing commitment. It begins, before any projects are developed, with an overarching strategy that puts data governance and data quality best practices front and centre, including the compliance requirements and the different data quality measures needed for different use cases, purposes and outcomes. Also, consider the ethical and corporate implications of these business initiatives. What happens if the AI and machine learning models receive biased, dirty, or inaccurate data? Biased or polluted outcomes could be lethal for an enterprise’s brand reputation, revenue growth or compliance standing.

Hyperight: How can ML tools help manage data quality?

Harald: Machine learning has an important role to play in data quality, particularly in addressing the limits of what manual review can cover in a reasonable amount of time. Given the high volumes and ever-increasing variety of data all enterprises are dealing with now, humans simply do not have the time to assess all data and monitor it continuously for variation. We might well notice when a sensor fails, or when a data quality metric fails to meet a threshold and an alert is triggered. But what about when false data is injected into a sensor, or unexpected duplicates are passed through? A computer program can scan vast volumes of data in mere minutes or hours and evaluate it according to your needs, whether finding missing, mismatched or duplicate entries or uncovering new and often unexpected correlations.

However, you can only take advantage of these tools when you understand the business problems and processes, as well as the data requirements and the data available to help address those problems. You also need to put in place the people who understand the data and its context, and the processes that help these people govern, validate, vet, and leverage that data.
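As a rough illustration of the automated checks mentioned above (missing, mismatched and duplicate entries, plus a threshold that triggers an alert), here is a minimal Python sketch. It is rule-based rather than machine learning, and the sample records, field names, reference list and threshold are all hypothetical; nothing here reflects Syncsort’s products.

```python
from collections import Counter

# Hypothetical sample records; the field names are illustrative only.
RECORDS = [
    {"id": 1, "email": "ann@example.com", "country": "SE"},
    {"id": 2, "email": "", "country": "SE"},
    {"id": 3, "email": "bob@example.com", "country": "Sweden"},
    {"id": 4, "email": "ann@example.com", "country": "SE"},
]

# Assumed reference list of valid country codes.
ALLOWED_COUNTRIES = {"SE", "NO", "DK"}


def profile(records, required_field, coded_field, allowed):
    """Return the ids of records with missing, mismatched or duplicate values."""
    # Missing: the required field is absent or empty.
    missing = [r["id"] for r in records if not r.get(required_field)]
    # Mismatched: the coded field is not in the reference list.
    mismatched = [r["id"] for r in records if r.get(coded_field) not in allowed]
    # Duplicates: the same non-empty required value appears more than once.
    counts = Counter(r[required_field] for r in records if r.get(required_field))
    duplicates = [r["id"] for r in records if counts[r.get(required_field)] > 1]
    return {"missing": missing, "mismatched": mismatched, "duplicates": duplicates}


def breaches(report, total, max_share):
    """Return the names of checks whose share of flagged records exceeds max_share."""
    return sorted(name for name, ids in report.items() if len(ids) / total > max_share)


report = profile(RECORDS, "email", "country", ALLOWED_COUNTRIES)
alerts = breaches(report, len(RECORDS), max_share=0.3)
print(report)
print(alerts)
```

Checks like these only catch what someone thought to encode, which is why the people and processes described above still matter for deciding which fields, reference lists and thresholds make sense.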

Hyperight: What are some best practices for achieving data quality?

Harald: First and foremost, you need to think continuously about the business problems you are trying to solve. This is a central, operating best practice that helps make the organisation data literate and ensures everyone has a clear understanding of how they need to approach the analysis and use of data to meet enterprise needs.

Using the right tools is a crucial part, but equally important is human understanding of the data so that crucial judgement can be applied. Knowing what data exists isn’t enough; context around how it’s used becomes key, because a set of data that works just fine for one function cannot necessarily be used for a different one. Particularly in cases where data sits in specific silos within an organisation (or outside it, if bringing in third-party data), you need enough people who know how to use and take advantage of the tools at hand to investigate and analyse data, often in ways that may not be immediately obvious. This depends on establishing and valuing a data-literate culture that includes people, process, and tools. And these three elements do not exist in isolation.

To ensure people become data-literate, they must be included in processes and communications that help them become informed and help them share what they know and have learned. Building out a “library” of knowledge about data, processes and approaches for using the tools at hand for data analysis is central to developing a data-literate culture.

Hyperight: What’s the relationship between data governance and data quality?

Harald: There’s a symbiotic relationship between data governance and data quality. Many people think of data governance in terms of how data should be archived, secured and kept compliant with laws and regulations. Of course, these are important components of a data governance strategy and process, but there’s been a shift in data governance thinking to fully embrace it as the overarching aspect of a data strategy, where it incorporates how data is used to facilitate accessibility, drive revenue and even monetise data.

That strategy will only be successful with strong data quality measures in place. Data governance defines the framework for what data is relevant and pertinent to the business needs, and why. Data quality provides the practices that implement the rules of that framework and measure how well the data complies with them, which helps establish the best standards and monitor the quality of the data over time. That continuous iterative cycle is what helps generate trust in data – trust that it can be leveraged effectively for the initiatives the organisation has in mind to increase revenue, reduce risk, meet compliance, and reduce costs.
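One way to picture this split is governance rules written down as data, and a data quality routine that measures how well a batch of records complies with them. The Python sketch below is only an illustration under stated assumptions: the rule set, field names and sample batch are hypothetical, and the email pattern is deliberately simplistic.

```python
import re

# Hypothetical governance rules: field name -> (description, check function).
RULES = {
    "customer_id": ("must be present", lambda v: v not in (None, "")),
    "email": (
        "must look like an email address",
        lambda v: isinstance(v, str)
        and re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", v) is not None,
    ),
    "age": (
        "must be between 0 and 120",
        lambda v: isinstance(v, int) and 0 <= v <= 120,
    ),
}


def compliance(records, rules):
    """Return the share of records that pass each rule, keyed by field name."""
    if not records:
        return {}
    scores = {}
    for field, (_description, check) in rules.items():
        passed = sum(1 for r in records if check(r.get(field)))
        scores[field] = passed / len(records)
    return scores


# Hypothetical batch of records to measure against the rules.
batch = [
    {"customer_id": "C1", "email": "ann@example.com", "age": 34},
    {"customer_id": "", "email": "not-an-email", "age": 34},
    {"customer_id": "C3", "email": "bob@example.com", "age": 150},
    {"customer_id": "C4", "email": "eve@example.com", "age": 51},
]

scores = compliance(batch, RULES)
print(scores)
```

Running the same measurement on each new batch and comparing the scores over time is one simple form of the continuous monitoring cycle described above.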

Find out the state of enterprise data quality today in Syncsort’s extensive survey, which covers the challenges and opportunities facing companies that are trying to bring data quality to the entire enterprise.

