Search Engine Archives | Oxide AI
https://oxide.ai/tag/search-engine/

Blog: AI and the New Dawn of Niche Search Engines
https://oxide.ai/2023-07-28-blog-vertical-search-engines/
Fri, 28 Jul 2023

Explore the emerging role of AI and LLMs in developing niche search engines that can challenge industry giants.


Imagine a world where AI is unlocking a whole new era in search! That is what emerging AI capabilities, together with Large Language Models (LLMs) like ChatGPT, could deliver, and they are beginning to deliver it in a big way. They are not just sprucing up chatbot entertainment and gaming; they are poised to revolutionize information-seeking and enhance search with rich linguistic interfaces.

Now, this is important: for nearly two decades, entrepreneurs and businesses have been reluctant to invest in specialized, topic-oriented search that goes beyond the scope of their own websites, because of the towering dominance of giants such as Google, Bing and Yahoo. Other search giants are more narrowly focused: they hold vast amounts of data and content in proprietary systems, with many resources of public interest, but only within their specific services. Mostly, these are well-established, deep-pocketed, specialized providers like LinkedIn, eBay, Amazon, Netflix, Ancestry and large publishers. That is still a relatively small group of companies, and it rarely meets the needs of specialized audiences. But times are changing.

Thanks to the leaps and bounds AI has made recently, even a simple setup can stand shoulder-to-shoulder with the biggest search juggernauts, provided it offers a specialized vertical search. Success stories in vertical search have so far been limited, and they mainly belong to players with significant resources and/or subject areas that the giants cover poorly.

What is “vertical” search? It is a search service that is highly relevant to a particular subject domain or user population. A vertical search service often has specialized data and content of its own, along with the ability to go broad and deep into the publicly available information in its area of focus. It often also has search tools that make seeking, synthesizing and using information easier for its particular searchers. Most importantly, these capabilities, and the respect shown to users, foster a high level of trust, which leads to a feeling of community within the specialized vertical niche.

In the natural world, specialist species carve out thriving niches that sustain them over long periods of time. The analogy extends to the digital domain of search engines: even amidst a landscape dominated by billion-dollar companies, there are numerous untapped niches waiting to be explored.

 

By zeroing in on a specific niche, you can not only survive but also excel over the long term, delivering outstanding quality and relevance within your chosen sphere. 

In this blog post, we’re going to delve into this exciting opportunity, offering practical advice on how you can take part in this paradigm shift and help revolutionize search forever. So, ready to dive in? Let’s get to it!

WHY IS VERTICAL SEARCH SO CHALLENGING?


Implementing vertical search proves far more difficult than simply putting a web interface in front of a database, as numerous examples found online can attest. A common failure is a specialized search returning no results because it cannot handle typos, unrecognized keywords or synonyms. In the medical domain, failing to recognize the lived experience of patients and caregivers creates confusion and erodes trust in a specialized resource.
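
To make the typo and synonym problem concrete, here is a minimal sketch of query-term normalization using only Python's standard library. The vocabulary and synonym map are hypothetical stand-ins for a real domain lexicon.

    import difflib

    # Hypothetical controlled vocabulary and synonym map for a medical vertical.
    VOCABULARY = ["hypertension", "diabetes", "insulin", "vaccination"]
    SYNONYMS = {"high blood pressure": "hypertension", "sugar disease": "diabetes"}

    def normalize_term(term: str) -> str:
        """Map a raw query term onto the controlled vocabulary."""
        term = term.lower().strip()
        if term in SYNONYMS:  # known synonym? rewrite it
            return SYNONYMS[term]
        # Close spelling? difflib scores string similarity between 0 and 1.
        matches = difflib.get_close_matches(term, VOCABULARY, n=1, cutoff=0.8)
        return matches[0] if matches else term

    print(normalize_term("hypretension"))         # -> "hypertension" (typo)
    print(normalize_term("high blood pressure"))  # -> "hypertension" (synonym)

A production system would add stemming, multi-word analysis and domain-specific spelling models, but even a thin layer like this prevents the "zero results" dead end described above.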
 


Providing exceptional search functionality requires more than keyword recognition; understanding and responding to the user’s intent is crucial. For years this has posed a significant challenge within Natural Language Processing (NLP), and it matters enormously: without understanding what a user wants, delivering a satisfactory, relevant service is virtually impossible. Until recently, robust intent understanding was far beyond what a small search provider could afford.
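
With current LLMs, however, a rough intent classifier can be a single prompt. The sketch below assumes the openai Python package (v1+ client) and an API key in the environment; the model name and intent labels are illustrative choices, not recommendations.

    from openai import OpenAI

    INTENTS = ["find_product", "compare_options", "learn_topic", "get_support"]

    def classify_intent(query: str) -> str:
        client = OpenAI()  # reads OPENAI_API_KEY from the environment
        prompt = (
            "Classify the search query into exactly one of these intents: "
            f"{', '.join(INTENTS)}.\nQuery: {query!r}\nAnswer with the intent only."
        )
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # illustrative; any capable chat model works
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
        )
        answer = resp.choices[0].message.content.strip()
        return answer if answer in INTENTS else "unknown"

    print(classify_intent("is keto safe for type 2 diabetics?"))  # likely "learn_topic"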


However, intent recognition is not the only challenge. Anyone managing a specialized database needs to convert natural language queries into a data-compatible format to retrieve content. This was once a monumental challenge in the realm of search query parsing, AI and NLP, but the advent of current-generation Language Models is dramatically simplifying the task.
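
One minimal pattern is to have the model translate free text into a JSON filter that your database understands. In this sketch the schema, field names and the llm_complete helper are hypothetical; substitute your own model call (for instance, the client shown above).

    import json

    SCHEMA_HINT = """You translate recipe-search queries into JSON with keys:
      "max_minutes" (int or null), "diet" (string or null), "ingredients" (list).
    Return JSON only."""

    def parse_query(query: str, llm_complete) -> dict:
        raw = llm_complete(f"{SCHEMA_HINT}\nQuery: {query}")
        try:
            return json.loads(raw)
        except json.JSONDecodeError:
            # Fall back to an empty filter rather than failing the search.
            return {"max_minutes": None, "diet": None, "ingredients": []}

    # "quick vegan dinner with chickpeas" might come back as:
    # {"max_minutes": 30, "diet": "vegan", "ingredients": ["chickpeas"]}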
 


Blending internal and external information sources is critical to providing a sufficiently valuable vertical search experience. Interpreting, inter-relating and indexing information streams and knowledge models within a complex vertical domain is a task well suited to emerging AI capabilities.
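
One concrete pattern is to index curated internal records and crawled external documents in a shared embedding space, so a single query searches both while the interface can still label provenance. This sketch assumes the sentence-transformers package; the model choice and sample texts are illustrative.

    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice

    internal = ["Curated dosage guideline for drug X ...", "Clinic FAQ on drug X ..."]
    external = ["News article on a drug X trial ...", "Forum thread on side effects ..."]
    corpus = [(t, "internal") for t in internal] + [(t, "external") for t in external]

    corpus_emb = model.encode([t for t, _ in corpus], convert_to_tensor=True)

    def search(query: str, k: int = 3):
        q_emb = model.encode(query, convert_to_tensor=True)
        hits = util.semantic_search(q_emb, corpus_emb, top_k=k)[0]
        # Keep the provenance label so results can be shown as curated vs. crawled.
        return [(corpus[h["corpus_id"]][1], h["score"], corpus[h["corpus_id"]][0])
                for h in hits]

    for source, score, text in search("drug X side effects"):
        print(f"[{source}] {score:.2f}  {text[:40]}")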
 


Delivering relevant outputs is another significant challenge when implementing vertical search functionality. This is particularly challenging over extended search sessions with multiple queries. Modern AI models and tools show promise in making this task considerably more manageable, enabling greater automation.
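
To make the session aspect concrete, here is a small, library-free sketch of one possible scheme (our own illustration, not an established algorithm): documents are scored against a blend of the current query and a decayed memory of earlier queries. The embeddings are assumed to come from any model and to be unit-normalized.

    import numpy as np

    def session_vector(query_embs: list, decay: float = 0.5) -> np.ndarray:
        """Exponentially decayed average: recent queries weigh more."""
        weights = np.array([decay ** i for i in range(len(query_embs))][::-1])
        v = (weights[:, None] * np.stack(query_embs)).sum(axis=0) / weights.sum()
        return v / np.linalg.norm(v)

    def rank(doc_embs: np.ndarray, session_embs: list, mix: float = 0.3) -> np.ndarray:
        """Blend the current query with session context before scoring."""
        target = (1 - mix) * session_embs[-1] + mix * session_vector(session_embs)
        scores = doc_embs @ target  # proportional to cosine for unit-normed rows
        return np.argsort(-scores)  # best-first document indices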
 


Managing feedback, community support, and tuning (or course-correcting) requires significant human resources. Language Models can partially help automate some tasks, or at least make them more manageable.
 

The complexity of managing search functionalities extends well beyond what’s covered here, but you get the gist.

THE VALUE IN YOUR SPECIALIZED DATA


Unique data and content inherently hold value today, especially when curated by subject specialists with domain knowledge and experience. That value has increased significantly over the past year, particularly if your data sits in a database or behind a proprietary search interface, because it isn’t accessible to traditional web crawlers.

The more specific and exclusive data you have that isn’t readily available on other web pages, the more beneficial it is for you and your users, and it is wise to keep it that way. There is no need to render it as static web pages anymore: doing so can depreciate its value by turning it into crawl data that ends up in LLM training, where you lose most of the benefit. There is also a risk of inappropriate or hallucinated uses of your data, which can make it look wrong or harmful even when the underlying data itself is sound.
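
If you do publish pages, you can at least signal that AI-training crawlers are unwelcome. The robots.txt sketch below uses user-agent tokens these operators have published (OpenAI's GPTBot, Common Crawl's CCBot, Google-Extended); verify the current names before relying on them, and remember that robots.txt is advisory, not enforcement.

    # robots.txt: opting out of common AI-training crawlers
    User-agent: GPTBot
    Disallow: /

    User-agent: CCBot
    Disallow: /

    User-agent: Google-Extended
    Disallow: /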
 


Another vital facet is the value that can be extracted directly from your data. Computational AI techniques can usefully synthesize it with high-value external public content, and LLMs can be adapted to it through a process known as “fine-tuning.” This gives you leeway to increase relevance and value for your users: the parts of your search supported by a fine-tuned model can be far more relevant than a generic LLM, whose broad focus on general content dilutes the experience.
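
As a sketch of what fine-tuning preparation can look like, the snippet below exports curated Q&A records as JSONL chat examples, a common input format for LLM fine-tuning services. The field names, system prompt and records are hypothetical; adapt them to your own schema and your provider's specification.

    import json

    records = [
        {"question": "What does error E42 mean on model X?",
         "answer": "E42 indicates a blocked intake filter; clean or replace it."},
        # ... more curated pairs exported from your database
    ]

    with open("finetune.jsonl", "w", encoding="utf-8") as f:
        for r in records:
            example = {"messages": [
                {"role": "system", "content": "You are the support assistant for model X."},
                {"role": "user", "content": r["question"]},
                {"role": "assistant", "content": r["answer"]},
            ]}
            f.write(json.dumps(example, ensure_ascii=False) + "\n")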
 


Suffice it to say, there is growing potential to extract value from your data and content via vertical search; the knowledge and a range of AI tools are already here.

In conclusion, stake your claim in the search space: find a niche unreachable by the giants, where your specialization allows you to dominate. It is also beneficial to include language in your terms of service stating that any use of your data for training AI models requires written permission. While such provisions may not yet have been tested in court, proactively safeguarding your data is a wise course of action.

Blog: Are Search Engines Doomed to be Replaced by ChatGPT?
https://oxide.ai/2022-12-15-blog-llm-chatgpt-vs-search-engine/
Thu, 15 Dec 2022

Are Large Language Models (LLMs) the AI technology that will finally replace search as we know it today? The latest chatbot, ChatGPT from OpenAI, is hailed as a know-it-all candidate that could soon change the way we search. A somewhat deeper look at the problem reveals that it may require significant innovations to even begin to get there.


NATURAL LANGUAGE AND LARGE LANGUAGE MODELS (LLMS)

 

Language is a multifaceted phenomenon that can be described on neurological, biological, psychological, social, anthropological and even historical levels. It is the heart of what it is to be human, arguably more so than our emotions or art or music (those being culture-specific and not universal, unlike speaking), or non-linguistic cognitive facilities (which we share, to varying degrees, with other animals). Even people who cannot speak are submerged in a world which humans have created over hundreds of millennia through the use of words. Somehow, it feels as if trying to reduce this to a single “paradigm” or “theoretical framework” is to miss the point, like talking about a painting solely in terms of the chemical composition of the paint.

 

One easily gets the impression that the entire domains of natural language processing (NLP) and natural language understanding (NLU) have been reduced to transformer-based chat bot programs. Without a doubt, the latest LLMs—Large Language Models—are fun to play with, especially when going beyond simple conversations and into the territory of more creative activities. This category of models is able to “create” simple art, prose, music and even animations/videos. The act of giving the model the right prompt is already an art form in and of itself.  

 

We can also use this type of technology to write code. The quality of the output in this case is easier to assess, since machine-written code can be validated by running it and seeing whether it produces the desired results.  

 

Having a productive coding session with OpenAI’s ChatGPT makes a lot of sense on the surface. As long as we can describe what we want with some precision, the bot will try to compose complete functions and programs; even if these contain bugs or are written in an idiosyncratic style, it is in many cases faster to start from a 90%-working program and fix it manually than to write it from scratch. In the future, automatic coding will likely also take run-time errors and compiler messages into account, so that a correct version of the final program can be generated in one or just a few rounds.

ChatGPT clearly solves many mundane, day-to-day coding tasks. It works exceedingly well where a very specific solution to an isolated problem is desired, such as creating a function for calculating some value. Asking for boilerplate code for talking to APIs also works extremely well, since there are so many examples for it to “learn” from. In this way, ChatGPT-powered coding will likely save us a lot of time.
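
The text above notes that machine-written code can be validated by running it; here is a minimal sketch of that loop, taking model-generated code, executing it in a scratch namespace, and checking it against a known case before accepting it. The generated snippet is illustrative, and untrusted code should of course only ever be executed in a sandbox.

    # Pretend this string came back from a code-generation model.
    generated = '''
    def slugify(title):
        return "-".join(title.lower().split())
    '''

    namespace = {}
    exec(generated, namespace)  # naive; isolate this in a sandbox in real use

    # Accept the function only if it passes a known test case.
    assert namespace["slugify"]("Hello World") == "hello-world"
    print("generated function passed its check")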

 

Once we look at more complex coding tasks, we can see that there is still plenty of room for improvement. We also have to remember the old software-engineering joke: “so many programmers, and so few people to tell them what to do.” So one could question whether ChatGPT is a truly profound, world-changing development; all it might lead to is the replacement of some low-level coding with AI. At the very least, it seems creative non-coders might be able to use the technology to achieve things that previously lay beyond their grasp.

 

SEARCH AND LLMS

 

Search engines today rely to a great extent on traditional NLP, including parsing, chunking and other techniques. The largest search engines may even make use of LLMs to support search in various ways (e.g., query expansion, query understanding). But at present, LLMs are still confined to a supporting role within the domain of large-scale search.

 

This is because search is a different problem from language generation: it is the process of locating information that already exists, whereas generative LLMs build new things from old pieces of data. Even an AI capable of generating a response to every possible query (which is easy to set up today) does not mirror the actual, underlying reality, because reality is often characterized by a lack of data and by confusing noise. Granting an AI model enough introspection to know when it should not generate an output at all is a complex, unresolved problem.

If search is taken as a truth-finding process, we can see why trust in an LLM might be misplaced. By “transforming” the underlying data in various ways, LLMs introduce extra interpretation layers between data and output that may not be what the user wants. They are also notorious for synthesizing (the results are often called “hallucinations”); worse yet, they may present their hallucinations in an authoritative writing style that implies subject expertise where none is warranted.

 

In subjects such as health, this is dangerous and irresponsible: the model is generating “answers,” not finding reliable ones. NLP can be useful for interpreting human statements and feelings as one step in a more responsive, and responsible, process of seeking information, but other AI and search approaches then also need to be involved, alongside human interpretation. Context outside the scope of words matters profoundly for analysis and interpretation.

 

GOING ALL-IN WITH AN LLM AS A SEARCH ENGINE

 

An LLM-powered search engine would be an interesting exercise, but several problems need to be resolved before it is realistic. In addition to the issues raised above, there is the sheer cost of running such a service. One can imagine that an LLM engine would have to rely heavily on advertising for revenue, and we know from the past 20 years or so that this road leads to search monopolies and stagnation (read more on the topic in Next Search Possible).

Many areas of the internet feature extremely few data points. If you try to build a catalog of all retail products on sale in online marketplaces, you will soon realize that only a few percent (out of billions) come with any significant amount of useful data attached. For the rest, you have to fill the gaps by guesswork. An LLM could make this arduous task somewhat simpler, but can we trust it to be accurate? And can we evaluate that against a human annotator, or against a simpler, rule-based algorithm for filling in missing data fields? When a buyer comes prepared to part with serious money for a rare item, it is questionable to what degree they would trust an LLM over their own or another human’s judgment.

 


Information reality is also about controlled sparseness, especially on the internet. Much information is fully locked down—by design—behind paywalls because the data is a financial asset to someone. Limiting information access to earn money by selling it is an ancient business model. Thus, an infinite number of cat images float around out there for an AI to learn how to synthesize felines from; but in contrast, try to construct a bot that will accurately answer knowledge-based questions and provide reliable services. If such a thing existed, many service businesses could likely collapse; hence, this information is tightly controlled and usually not available, or at least not for free. In this case, a traditional (non-LLM) search engine helps us find what is findable, ignores the rest since it cannot be reached, and it can do this without making things up or requiring huge data volumes.


As described above, the areas where traditional search engines excel over a hypothetical search LLM are rooted in trustworthiness and authority. LLMs do not, as a rule, care about truth; they generate answers from statistical relations mined from the vast amounts of data fed to them during training. For example, imagine you want to try a diet. How do you choose the most suitable one? The noise level in this space is enormous, as everyone who has tried searching for anything diet-related knows. It’s a random game: biased input comes from all directions, the underlying science is weak, and commercial interests distort the available information. There are contradictions, ambiguity and bias everywhere. Is this really the kind of uninterpreted data we want to synthesize an answer from? Equating search with a generative statistical technology that understands nothing could prove to be a big step backwards.

 

PROS AND CONS OF LLMS AS SEARCH ENGINES

 

Problems with LLMs as Search Engines 

 

  • LLMs truly extend the concept of “one model to rule them all”. We’ve seen this pattern over and over in the world of search, and hopefully it will be replaced by a more collaborative and open approach. 
  • LLMs currently require massive training runs and are very far from the near-real-time data acquisition we expect from search engines. 
  • LLMs synthesize (generate) data. This synthesis, sometimes jokingly referred to as “hallucination,” may not reflect reality. When searching, we look for reality, preferably from the source, not an interpretation or transformation of it. 
  • LLMs are hard to validate. It may even be difficult to reproduce the same results from a given input, something we require of most scientific work. This is the reproducibility problem. 
  • Blocking LLMs from producing results where they should not provide one seems to be an unresolved problem. High-frequency queries are currently managed by manual curation, but not queries far out on the long tail, and in search the tail is very, very long. 
  • LLMs can “explain” their outputs, but only in the same way they generate answers: from the same shaky statistical ground, far from the understanding that normally underpins an explanation. Nor is there a simple way to inspect the underlying parameter space. 
  • They are very energy-consuming. Training the largest LLMs takes massive amounts of energy, and a development where everyone needs to train their own model is not sustainable, so a general discussion about access will arise. “One model to rule them all” is far from sufficient for humanity.

 

Benefits of LLMs as Search Engines 

 

  • Transformers and similar deep-learning technologies bring many advantages in the context of search. At present they primarily play supporting roles rather than actually replacing search engines. A few examples (a small sketch of query expansion follows this list): 
    • Query auto-completion 
    • Query expansion 
    • Query understanding (for example to extract query intent) 
    • Transformation of input queries to increase recall & precision 
    • Summarization of text to extract useful facts and explanations 
    • Language translation 
  • Guided search based on bot interaction, where a dialog with the user may lead to better-formulated queries or insights. Using LLMs as an interface to search engines seems a promising approach, though at their current level they are tripped up by the problems highlighted above.
  • Exploration and discovery functions, making use of creative synthesis. 
  • “Entertainment search” is a possible area for LLMs today, which is why we see so many fun chat bots emerging. 
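
To make one of these supporting roles concrete, here is a toy sketch of LLM-backed query expansion. The llm_complete helper is hypothetical; any chat-completion client can stand in for it.

    def expand_query(query: str, llm_complete) -> list:
        """Ask a model for paraphrases, then search with all of them."""
        prompt = (
            "Give 3 alternative phrasings of this search query, one per line, "
            f"preserving its meaning: {query}"
        )
        variants = [ln.strip() for ln in llm_complete(prompt).splitlines() if ln.strip()]
        return [query] + variants[:3]

    # The expanded set is OR-ed together, or issued as parallel searches, to
    # improve recall. "cheap flights NYC to SF" might expand to phrasings such
    # as "low-cost airfare New York to San Francisco".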

Blog: Next Search Possible
https://oxide.ai/2022-09-19-blog-search-stagnation/
Mon, 19 Sep 2022

The great search stagnation? Search needs a total rethink.


THE VAST DATA EXPANSION

 

I think we can all agree that digital is on an exponential growth trajectory and that digital convergence is accelerating. Old-world physical products are drawn into the expanding black hole we call the Internet, where they can be copied, stored and distributed instantly and with almost zero cost. Many familiar physical objects have been converted to digital: archival cabinets, books, vinyl and CD records, video tapes, cameras, audio mixers, and musical instruments are just a few examples. On top of that, our social interactions, behaviors and transactions are increasingly moving online. All this results in a massive increase in data volume. 

 

It’s not easy to intuitively grasp what this expansion of information entails. Our biological brains are severely limited when exposed to too much information, and quickly become numb. It often results in decision paralysis. Furthermore, we process the world slowly and sequentially, putting us at a disadvantage when dealing with the ever-changing flows of data generated by complex systems. We need new tools to offset these problems and empower us to better extract knowledge from data. 

 

SEARCH IS ONE OF HUMANITY’S MOST IMPORTANT TOOLS

 

Modern search engines can perhaps best be described as tools for locating desired information. Arguably, they are among the most important and valuable recent inventions of humanity. They amplify us in a way that would’ve required practically infinite time, many millions of librarians and libraries the size of small cities in the pre-digital age.

 

Just like a telescope extends our vision by allowing us to see far-off objects like stars, search engines essentially change our perceptual capabilities by supporting us in the selection and acquisition of information. While this is helpful, it is important to be mindful of the fact that search engines are biased; they include some things and exclude others, and they typically rank results in ways that are influenced by commercial interests, popularity and other factors. Another disadvantage is that the most widely used search engines need to cater to the average user, hence the feeling that even as the pool of information expands rapidly, we’re still stuck in the same shallow end as always.

 

One could be forgiven for thinking that the Internet is small enough to be “understandable,” given the way search engines limit and distort our perception of the available information space. When we search for health information, we mostly end up with the same few well-known sources. In some cases this is desirable, since it gives more weight to reliable sources and lessens the feeling of information overload when researching a topic. But it does not promote novelty, or the discovery of previously unseen relations in data.

 

A more serious problem that occurs particularly in large search engines is that they are exploitable in various ways. The SEO industry, for example, skews the ranking of search results so that some items are unfairly favored (given a high rank) by manipulating the content that is fed into the search engine. 

 

In addition, advertising as a primary revenue source has led search-engine innovation astray: steady revenue streams reduce the developer’s incentive to change anything for the user’s benefit. In practice, this has meant stagnation in search interaction models and, in many cases, hidden bias. When we search for anything more than a single targeted answer, we have to repeatedly modify what we search for: this is re-search. For advertisers this model is exactly what you want (many more views), so it takes a lot to change it. Manually validating resources and re-searching is, however, very time-consuming for users.

 

THE EVOLUTION OF SEARCH

 

 

Ever since the invention of writing, we have had to deal with the problem of organizing records in useful ways. For most of history, we relied mostly on indices, knowledgeable librarians and various tagging systems to accomplish this. We can call such methods of retrieval and storage “content-oriented.” Their biggest advantage, aside from ease of implementation, is that they give “users” a high degree of transparency, and that they make intuitive sense to everyone.

 

With the advent of the computer, we were able to store data in databases, often local to a single PC. (Older people will still remember dBASE and similar technologies). Although much smaller, such databases provided a similar user experience to modern-day large-scale search engines via query functions and report generation. While not as encompassing, specialized databases allow for accurate, focused search.

 

There are actually many very good specialized databases and similar data sources on the Internet, but finding them is a different story. Most large search engines do not list or rank other search engines on the front page. One may wonder why; hopefully it is not for malicious reasons, but simply because specialized search engines are not indexed by “generalist” engines: a search engine is just a tool, not an “information-rich” resource, so there is no content to index. Note that companies such as Amazon, eBay and Airbnb (to name a few) render static pages from their specialized databases precisely to ensure they are indexed by the larger search engines. They are exceptions, however; most specialized databases are not, although it is difficult to quantify the proportion since they are hard to find.

 

IS AI DESTROYING THE VALUE OF OLD-WORLD SEARCH?

 

Fast-forward to 2022 and the feeling of results saturation is greater than 5 or 10 years ago. To what extent is this impression driven by the increasing use of Machine Learning in search engine technology?

 

Machine Learning is more or less a cooler rebranding of statistics; and statistics, in turn, can be defined as the art of destroying data in order to make it more digestible for a human. A heavy reliance on statistics tends to privilege “average” results, which goes against our belief that the point of a search engine is to help users find information in the long tail. Search companies might yield slightly different results as they optimize their models in their own ways, but since the search market is cornered by very few large actors, this does not meaningfully increase diversity. 

 

Average, over-saturated results lead to value loss. Nothing is to be found except what people with similar interests already know.

 

AI and ML will influence search significantly over the coming years. They are already driving the change from matching to mapping (vector-based retrieval) in different shapes and forms. We may see new ventures in AI-powered search, but we can be fairly confident that it will drive further saturation if it is provided by only a few large organizations.
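
To illustrate the matching-to-mapping shift, the sketch below contrasts literal keyword matching with embedding-based retrieval. It assumes the sentence-transformers package; the model name and sample documents are illustrative.

    from sentence_transformers import SentenceTransformer, util

    docs = ["How to reduce blood pressure without medication",
            "Best pasta recipes for beginners"]
    query = "lowering hypertension naturally"

    # Matching: literal keyword overlap finds nothing for this paraphrase.
    print([d for d in docs if any(w in d.lower() for w in query.lower().split())])  # []

    # Mapping: embed both sides and compare directions in vector space.
    model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice
    scores = util.cos_sim(model.encode(query), model.encode(docs))
    print(docs[int(scores.argmax())])  # -> the blood-pressure document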

 

NEXT SEARCH IS COMING FAST

 

The future of search will hopefully bring many innovations. We can only pray this happens soon, because we are in desperate need of tools to amplify human cognition these days.

 

Classic 2D “paper-style” interactions (e.g., with a page of ranked results) will likely lose their value as we move to rendering technologies such as Virtual Reality and Augmented Reality. VR brings vast opportunities for newer and better ways of visualizing information; this will perhaps kill the keyword and link lists of today once and for all. The growing prevalence of VR will likely disrupt the search market seriously as new actors innovate in this space over the coming years.

 

Search needs a total rethink. We can only speculate about the trends and paradigm shifts that will influence the search of tomorrow, but some are more likely than others: 

  • Demand for significant increase in value from search by means of accuracy, exploration and discovery, interaction with content, transparency and explanations 
  • Content will be the next focus, not just finding the resources and providing a link but actually extracting meaning 
  • AI and ML will help us see that top-ten lists are yesterday’s model and that there is actually a manifold of search output types (a diagnosis, a decision, evidential support for a claim) 
  • Virtual Reality & AR push us beyond keywords and simple ranked lists as we are freed from “paper” screens 
  • Micro-transactions (cryptocurrency) will hopefully significantly reduce exploitation of human attention with advertising 

Let the next 20-30 years of the Internet become something other than the “great search stagnation” we are experiencing right now. If you can code, then work on this problem. 

 
