Hyperight

B2B Innovation: LLMs and RAG Pipelines with Language Expansion – Interview with Hwalsuk Lee, Upstage AI

In this interview, we speak with Hwalsuk Lee from Upstage! Hwalsuk Lee, CTO at Upstage and a speaker at this year’s upcoming NDSML Summit, represents the company, which launched in October 2020. Upstage’s Solar LLM offers GPT-4-level performance with high efficiency and versatility, and is available via on-premises deployment and API integration.

The company’s Document AI solution uses OCR technology to automate workflows and handle unstructured data, reducing costs and streamlining operations. At the summit, Lee will discuss LLM training trends, language expansion, and the impact of Retrieval-Augmented Generation (RAG) on LLM performance. He will also address challenges in RAG implementation and the role of high-quality data in model effectiveness.

Hyperight: Can you tell us more about yourself and your organization? What are your professional background and current working focus?

Hwalsuk Lee, speaker at NDSML Summit 2024

Founded in October 2020, Upstage boosts work efficiency with industry-leading document processing engines and large language models (LLMs). Our pre-trained LLM Solar delivers GPT-4-level performance with unparalleled speed and cost-efficiency. Available via on-premises and API integration through platforms like Amazon SageMaker JumpStart, Solar offers versatility and accessibility. It serves as an alternative to larger, more resource-intensive models developed by tech giants. Our Document AI solution leverages AI-powered optical character recognition (OCR) technology to automate workflows and process unstructured data. This reduces operational costs and streamlines operations for our clients.

Solar, our flagship product, is a versatile and agile language model. It can be easily customized and fine-tuned to perform a wide range of language tasks in Korean, Japanese, English, and Thai. Our global customer base spans diverse industries, including education, healthcare, legal, finance, and telecommunications. We collaborate with these clients to build tailored LLMs, chatbots, and knowledge bases, leveraging Solar’s adaptability to meet specific industry needs. As our leading offering, Solar is spearheading our expansion into international markets. We’re currently focusing our initial efforts on entering the United States and Japan.

My professional journey has been deeply rooted in AI, with a focus on computer vision technologies. Prior to my role at Upstage, I led the Visual AI team at Naver Clova. During this period, we worked extensively on Optical Character Recognition (OCR) technologies. Document AI necessitates not only OCR but also natural language processing (NLP) technology to comprehend the meaning behind the recognized characters. Upstage has been at the forefront of building this technology, naturally evolving into the development of LLMs. This led to the creation of our proprietary Solar LLM, designed to meet the growing demand for LLMs.

Hyperight: During the NDSML Summit 2024, you will share more on B2B innovation and full-stack LLMs and RAG pipelines. What can the delegates at the event expect from your presentation?

During my presentation, delegates can look forward to a comprehensive overview of several exciting topics. Firstly, I will delve into the current state of LLM training. This will include the latest trends and methodologies for pre-training, upscaling, and fine-tuning processes as we see them in 2024.

Secondly, we’ll explore the significant advancements in language expansion. I’ll share insights into how recent innovations are enabling LLMs to support a much broader range of languages. These innovations are making them more versatile and inclusive for diverse applications.

Lastly, I’ll discuss the role of RAG, or Retrieval-Augmented Generation. We’ll look at how RAG is boosting LLM capabilities through improved OCR, layout analysis, and embedding techniques. I’ll also highlight some real-world applications of these advancements to demonstrate their practical impact.

Hyperight: Can you explain what full-stack LLMs are? How do they differ from traditional language models, especially in providing B2B solutions?

Full-stack LLMs are comprehensive technology platforms designed to enhance traditional language models and mitigate their limitations, particularly in a B2B context. The key issue in traditional LLMs is “hallucination,” where the model generates plausible but inaccurate information.

To address this, full-stack LLMs incorporate a set of modules that work together to improve accuracy and reliability. One approach is fine-tuning the LLM engine using the customer’s data, which helps the model better understand and respond to domain-specific queries. Another method is to augment the LLM with external information retrieval, known as Retrieval-Augmented Generation (RAG). This technique searches for relevant data to support the model’s response, reducing the likelihood of hallucination.

By integrating these modules, full-stack LLMs provide true B2B solutions with more accurate, reliable, and contextually relevant responses. This enhances the usability and value of language models in business applications.

Hyperight: How does Retrieval-Augmented Generation (RAG) enhance the performance and accuracy of LLMs, particularly in a B2B context?

RAG enhances LLM performance by analyzing user queries to identify key information, then searching databases or the internet for relevant data. Grounding responses in retrieved content keeps answers accurate; assuming the search itself is accurate, it largely eliminates factual errors.

For example, if asked about today’s weather, an untrained LLM might generate an incorrect answer. However, with RAG, the model retrieves the latest weather data and provides an accurate response.

In a B2B context, RAG is particularly valuable. It allows models that haven’t been trained on your company’s data to retrieve specific internal information, resulting in more accurate and reliable answers. This is especially crucial for complex business problems that are closely tied to internal data.

Therefore, if you can’t train an LLM with your company’s data, using RAG is highly recommended. It ensures accurate, reliable, and data-driven responses.
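The query → retrieve → grounded-answer flow described above can be sketched in a few lines of Python. This is a minimal toy, not Upstage's implementation: the "embedding" is just a bag-of-words count vector, the "database" is a Python list, and the final step prints the grounded prompt instead of calling a real LLM. All function names and the sample documents are hypothetical.

```python
from collections import Counter
import math

def embed(text):
    """Toy 'embedding': a bag-of-words count vector (a real system uses a trained model)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def retrieve(query, documents, top_k=1):
    """Return the top_k documents most similar to the query."""
    q = embed(query)
    ranked = sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:top_k]

def answer_with_rag(query, documents):
    """Build an LLM prompt grounded in retrieved context (prompt shown in place of a real LLM call)."""
    context = "\n".join(retrieve(query, documents))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

docs = [
    "Today's weather in Stockholm is 12°C and cloudy.",
    "Upstage was founded in October 2020.",
]
print(answer_with_rag("What is the weather today?", docs))
```

Because the weather document is retrieved and injected into the prompt, the model answers from current data rather than guessing, which is exactly the hallucination-reduction effect described above.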

Hyperight: As someone with experience in leading AI vision, what are some of the key challenges you’ve encountered when implementing RAG pipelines? How have you addressed them?

The primary challenge in building RAG pipelines is that customer data often isn’t ready for LLM integration. The data must be digitized and stored in a vector database, which is tough with unstructured data.

Many customers provide data as document images or text PDFs, from which structural information is hard to extract. To resolve this, we leverage Upstage’s Document AI, using OCR to extract text and document parsing to retrieve structural information, converting it into HTML format.

Finally, we use an embedding model to store data in a vector database. We fine-tune our Solar LLM for high-performance embeddings in multiple languages, ensuring we meet the needs of our diverse client base.
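The ingestion pipeline described above (extract text, embed each chunk, store vectors for search) can be sketched as follows. This is an illustrative toy under stated assumptions: `embed` is a normalized bag-of-words stand-in for a real embedding model such as a fine-tuned LLM encoder, `VectorStore` stands in for a real vector database, and the HTML chunks are hypothetical parser output.

```python
import math
from collections import Counter

def embed(text):
    # Stand-in for a real embedding model: a normalized bag-of-words vector.
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(v * v for v in counts.values()))
    return {w: v / norm for w, v in counts.items()}

class VectorStore:
    """Minimal in-memory vector database."""
    def __init__(self):
        self.items = []  # list of (vector, chunk) pairs

    def add(self, chunk):
        self.items.append((embed(chunk), chunk))

    def search(self, query, top_k=1):
        q = embed(query)
        scored = [(sum(q.get(w, 0) * v for w, v in vec.items()), chunk)
                  for vec, chunk in self.items]
        return [chunk for _, chunk in sorted(scored, reverse=True)[:top_k]]

# Chunks as they might come out of OCR + document parsing (converted to HTML).
store = VectorStore()
for chunk in ["<table><tr><td>Revenue</td><td>$5M</td></tr></table>",
              "<p>The contract term is 24 months.</p>"]:
    store.add(chunk)

print(store.search("What is the contract term?"))
```

In production the store would be a dedicated vector database and the embeddings would come from a multilingual model, but the flow (parse, embed, index, search) is the same.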

Hyperight: How do you see the combination of full-stack LLMs and RAG influencing the future of enterprise applications? Particularly in terms of expanding language capabilities?

Early adopters of LLMs faced challenges with hallucination, while later adopters of RAG struggled to link their data to LLMs. These challenges become more pronounced when the LLM’s performance in the customer’s primary language is subpar. The LLM remains a core component of any full-stack LLM solution.

As the field evolves, interest will focus on language performance within the customer’s specific domain, and on the model’s efficacy, in that language, at the problem the customer aims to solve. This shift highlights the need to tailor full-stack LLM and RAG solutions to meet unique language requirements. By doing so, we can expand language capabilities and overall utility for enterprise applications.

Hyperight: What role does data quality play in the effectiveness of RAG pipelines? How do you ensure high-quality data is used in training and deploying these models?

Data quality is paramount to the effectiveness of RAG pipelines. The integrity of the input data significantly influences the performance of the models. For instance, when delivering table data to the LLM, providing structured information rather than an unorganized string of characters can lead to substantial improvements in performance.

Furthermore, the impact of data quality extends to the training phase of LLM models. Whether the data’s structure is incorporated into the training data or not can make a considerable difference. Therefore, it’s crucial to ensure the essential information in the data source is preserved when connecting it to the LLM.
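The point about table data can be made concrete with a small sketch. A flattened OCR string loses the column boundaries; rendering the same values as HTML preserves which value belongs to which column before the text reaches the LLM. The helper function and the sample figures are hypothetical, not Upstage's parser.

```python
def rows_to_html(headers, rows):
    """Render table rows as HTML so an LLM sees the structure, not a flat string."""
    head = "".join(f"<th>{h}</th>" for h in headers)
    body = "".join("<tr>" + "".join(f"<td>{c}</td>" for c in row) + "</tr>"
                   for row in rows)
    return f"<table><tr>{head}</tr>{body}</table>"

# Flattened OCR output: column boundaries are gone.
flat = "Quarter Revenue Q1 5M Q2 7M"

# Structured version: each value is tied to its column.
html = rows_to_html(["Quarter", "Revenue"], [["Q1", "5M"], ["Q2", "7M"]])
print(html)
```

Given the flat string, a model must guess whether “5M” is Q1 revenue or something else; given the HTML, the association is explicit in the markup.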

To guarantee high-quality data in our training and deployment processes, we follow a rigorous data management protocol. This includes:

  1. Source verification. We only use reliable and accurate data sources.
  2. Data cleaning. We implement automated and manual data cleaning processes to eliminate errors, inconsistencies, and outliers.
  3. Data validation. We validate the data against predefined schemas and rules to ensure its correctness and completeness.
  4. Data lineage tracking. We maintain a clear record of data origins, transformations, and usage to ensure transparency and reproducibility.
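Step 3 of the protocol above, validating data against predefined schemas, can be sketched as a simple field-and-type check. This is a minimal illustration with a hypothetical schema, not the actual validation rules used in production.

```python
def validate(record, schema):
    """Check a record against a simple schema of required fields and expected types."""
    errors = []
    for field, ftype in schema.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], ftype):
            errors.append(f"wrong type for {field}: expected {ftype.__name__}")
    return errors

# Hypothetical schema for an ingested document chunk.
schema = {"doc_id": str, "text": str, "page": int}

good = {"doc_id": "a1", "text": "Revenue grew 10%.", "page": 3}
bad = {"doc_id": "a2", "page": "three"}

print(validate(good, schema))  # []
print(validate(bad, schema))
```

Records that fail validation can be routed back to the cleaning step (step 2) rather than being embedded, which keeps low-quality chunks out of the vector store.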

Hyperight: What are some AI trends you expect to see in the upcoming 12 months?

In the B2B landscape, each company will strive to identify use cases for LLMs that create tangible business value over the next year. While some companies may discover these use cases independently, those lacking expertise will likely seek consulting services.

Once appropriate use cases are identified, companies will refine LLM-related technologies to address these needs more precisely. This will lead to the development of various technologies and products under the umbrella of vertical AI. For instance, in the financial sector, if LLMs can solve a particular high-value problem, we can expect to see the emergence of financial vertical AI solutions designed to address this use case comprehensively and without the need for customer intervention.

To achieve this, additional technology modules might need to be developed for specific domains or use cases, alongside general-purpose LLMs. Currently, general-purpose LLM model providers are prominent, but in the future, firms specializing in vertical AI are likely to achieve significant sales.

Catch Lee’s presentation, “B2B Innovation: LLMs and RAG Pipelines with Language Expansion,” at the NDSML Summit this October. Don’t miss this opportunity to explore the capabilities of full-stack LLMs and how they provide true B2B solutions!

Transform Your AI Vision into Reality and Value at NDSML Summit 2024!

Since 2016, the annual Nordic Data Science and Machine Learning (NDSML) Summit has been the premier event where AI innovation meets enterprise-scale implementation. Held on October 23rd and 24th, 2024, at Sergel Hub in Stockholm, this year’s summit brings a fresh focus on the latest trends in Data Science, Machine Learning, and Artificial Intelligence. Dive into three tracks: Strategy & Organization, ML and MLOps, and Infrastructure & Engineering, each designed to address the current and future challenges in AI deployment!

Explore topics such as Generative AI and Large Language Models, robust AI techniques, Computer Vision, Robotics, and Edge Computing. With 450+ Nordic delegates, 40+ leading speakers, interactive panels, and a hybrid format for both on-site and online attendees, this summit is a place to harness the full potential of AI! GET YOUR TICKETS NOW.
