Search is easy, right? It can be, if you know exactly what you’re looking for, or if your data set is small, clean and static enough to index often. That’s not the reality for organizations competing in a global marketplace that runs on data insights. Data comes at you fast and furious, in a variety of languages and formats. It is often dirty, with human error contributing to frequent misspellings, abbreviations and transposed numbers. It is also dynamic and high-volume, making it impractical to pre-index data to ease downstream searches.
Fuzzy search is the key to finding answers in a sea of dirty data and a critical part of most analytics exercises. It involves locating strings that approximately—not exactly—match a pattern, allowing you to match misspelled or abbreviated words. Given the number of variables fuzzy search algorithms must cover, it is no surprise that fuzzy search is complex and time-consuming, especially with the size of data workloads today.
Google makes this look pretty easy on the surface, but the reality is that Google goes to great lengths—and costs—to index and prepare data to deliver answers to users. Unfortunately, Google-scale hardware, power and staff resources are not something most businesses can afford to enable real-time data search and analytics.
Fuzzy Search: Hamming Distance and Edit Distance
Before we look at how to do fuzzy search faster and easier, it is valuable to understand the concepts of “Hamming distance” and “edit distance” and their effects on fuzzy search latency and complexity.
Hamming distance is an easy-to-parallelize metric that compares two equal-length strings and counts the number of positions at which their bytes differ. This means the strings “Michele” and “Mishele” have a Hamming distance of one, since the third byte of the first string is “c” while the third byte of the second is “s.” Hamming distance is popular because a good implementation will run fast, even against unindexed data, especially if you use hardware-based parallel processing techniques. It is well suited to detecting slight differences between renditions of individual words, such as enabling a rapid real-time spell check.
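As a concrete illustration, here is a minimal Python sketch of the Hamming distance computation (this is only an illustration of the metric itself, not Ryft’s hardware implementation):

```python
def hamming_distance(a: str, b: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires equal-length strings")
    return sum(1 for x, y in zip(a, b) if x != y)

print(hamming_distance("Michele", "Mishele"))  # 1: only the third character differs
```

Because each position is compared independently, the loop above is trivially parallelizable, which is what makes the metric such a good fit for hardware acceleration.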
Edit distance describes how dissimilar two strings are by counting the character insertions, deletions, replacements and transpositions needed to turn one into the other. Edit distance enables approximate string matching, phrase matching, natural language processing and automatic spelling correction. It is also a good fit for bioinformatics, where it quantifies the similarity of DNA sequences.
For example, searching for “Michelle” with a given edit distance of two allows for results such as “Michele” (an edit distance of one, since a single “l” was deleted) and “Mischele” (a distance of two, since an “s” was added and an “l” was deleted). Unfortunately, you pay a price for edit distance: as it increases, so do the complexity and latency of a fuzzy search using traditional software tools. Each increase in edit distance makes every search more complex, demanding more data preparation and analysis time.
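The classic way to compute edit distance is the Levenshtein dynamic program. The sketch below covers insertions, deletions and replacements (transpositions, which the Damerau variant counts as a single edit, are omitted for brevity):

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions and
    substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

print(edit_distance("Michelle", "Michele"))   # 1: one "l" deleted
print(edit_distance("Michelle", "Mischele"))  # 2: "s" inserted, one "l" deleted
```

Note that the dynamic program fills a matrix of size len(a) × len(b) for every comparison, which hints at why fuzzy search over large, unindexed data is so expensive in software.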
Why Fuzzy Search Is Slow and Complex
Fuzzy search relies on software to build massive matrix- or tree-based data structures that determine whether the characters following a mismatch are similar enough to the original search term for the word or phrase to count as a match. Without indexing, you must build these structures at search time, for every search. As you can imagine, this can take days or weeks, which is unworkable given the pace at which data changes.
Reducing Complexity by Limiting Fuzziness
One approach to managing complexity and latency involves limiting fuzziness or “edit distance.” Common open source search tools that use indexing, such as Lucene and Elasticsearch, confine searches to a maximum edit distance of two, due in large part to the time and size bottlenecks mentioned above. When you don’t know EXACTLY what you are looking for, an edit distance of two is likely not enough.
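For illustration, this is roughly what a capped fuzzy query looks like in Elasticsearch’s query DSL (the field name `patient_note` is hypothetical; the `fuzziness` parameter is real, and Lucene-based engines reject values above two):

```python
# Illustrative Elasticsearch fuzzy query body, expressed as a Python dict.
# "patient_note" is a hypothetical field name for this example.
fuzzy_query = {
    "query": {
        "fuzzy": {
            "patient_note": {
                "value": "michelle",
                "fuzziness": 2,  # the maximum edit distance Lucene supports
            }
        }
    }
}
```

A query like this finds “Michele” and “Mischele,” but anything requiring three or more edits is simply invisible to it.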
Consider, for example, a more complex search than Michelle. Let’s say a hospital needs to search your medical history for “heart arrhythmia” to provide proper treatment. Humans—especially those in stressful situations—make mistakes. A record that says “heat arhithma” is possible, and a maximum edit distance of two wouldn’t return the results the doctor needs to understand your medical history.
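You can check this with a standard Levenshtein computation (a self-contained sketch, independent of any particular search engine):

```python
def edit_distance(a: str, b: str) -> int:
    # Standard Levenshtein dynamic program (insert/delete/substitute).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

d = edit_distance("heart arrhythmia", "heat arhithma")
print(d)       # 4 -- well beyond the usual cap of two
print(d <= 2)  # False: a capped fuzzy search would miss this record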
Indexing Isn’t the Solution. It’s the Problem.
Edit distance limits have been necessary because, as the edit distance increases, indexes grow, ballooning the size of the data that you must extract, transform, load, store and analyze. While indexing’s purpose is to speed up search, it can take anywhere from several hours to several weeks. If indexing takes days and data changes by the minute, your index is always out of date. Before the advent of the Internet and the Internet of Things, this was less of an issue. In our connected world, it’s a veritable nightmare.
Indexing can also be costly and resource-intensive. To keep up, companies throw more commodity clusters at the indexing challenge, but this adds huge costs in data center infrastructure, real estate, staff resources and power requirements. These added costs are a factor for all indexing exercises regardless of the type of data analytics ecosystem used—either traditional data center or cloud-based environment.
Low-Latency, Large-Scale Fuzzy Search Without Indexing
It’s clear the “Google way” of indexing data to enable fuzzy search isn’t always the best way. It’s also clear that limiting the fuzzy search to an edit distance of two won’t give you the answers you need or the most comprehensive view of your data. To get real-time fuzzy searches that return all relevant results, you must use a data analytics appliance that is not constrained by the sequential processing architectures underlying software parallelism. The key is hardware parallelism, not software parallelism, made possible by the hybrid FPGA/x86 compute engine at the heart of the Ryft ONE.
The Power of FPGA/x86 Hybrid Computing
At Ryft, we are all about using the right tool for the job to achieve unmatched performance. That’s why we take advantage of FPGAs for speed and efficiency and an x86 front-end for easy integration with existing big data ecosystems for high performance analytics. FPGAs, or field-programmable gate arrays, are reconfigurable chips capable of massively parallel hardware operations tailored to the problem at hand, delivering unrivaled speed with low energy requirements. An FPGA provides on-chip hardware parallelism that maps directly to the dataflow characteristics of an application’s algorithm.
By combining massively parallel FPGA processing with an x86-powered Linux front-end, 48 TB of storage, a library of algorithmic components and open APIs in a small 1U device, Ryft has created the first easy-to-use appliance to accelerate fuzzy search to match exact search speeds without indexing. You can also cluster several Ryft ONEs to analyze petabyte-scale data with the ease of managing a single box. With this power packed into a small, high-efficiency footprint, users wanting high performance analytics can now:
- Get real-time insights into any type of raw unstructured or structured data without ETL or indexing
- Analyze streaming and batch data together
- Accelerate Apache Spark ecosystems by 100X
- Speed petabyte-level workload analysis to 200 GB/second via clustering
- Slash operational costs by 70%
Fuzzy Search Is as Fast as Exact Search with Ryft
What difference can Ryft make on a popular application like fuzzy search? This video shows the results of performance testing done by Ryft’s vice president of products, Bill Dentinger. Here, he shows how the Ryft ONE makes fuzzy search as fast and easy as exact search on unstructured data—all while doubling the edit distance to four. The Ryft ONE completed a Hamming search at a data fabric rate of 12.2 gigabytes per second while enabling a more comprehensive search than any other tool.
From Six Hours to Six Seconds With the Ryft ONE
To put this all in perspective, if you used conventional tools to search unindexed Wikipedia for a commonly misspelled term with an edit distance of four, it would take about six hours. If case sensitivity is a factor, it would balloon even further. Or you could run the same search using a single Ryft ONE with an edit distance of four and get results in six seconds. Depending upon your use case, you could also cluster two or more Ryft ONEs to mine data from far larger workloads in milliseconds. Looking at it another way, it would take 3,000 commodity servers to match the performance of a single Ryft ONE, thanks to the massively parallel hardware computing power of the FPGA fabric in our hybrid FPGA/x86 compute appliance.
Find Fast Answers When You Don’t Know What You Don’t Know
With its unique balanced architecture, the Ryft ONE isn’t slowed down by conventional compute, storage or I/O bottlenecks. And it’s open and easy to integrate with existing software ecosystems running Spark and Hadoop, so you can supercharge the most complex, data-intensive workloads. This allows you to find any instance of any string of characters as an exact or a fuzzy match—no matter if the corresponding data record was created one second or one month ago.
Try Ryft Without Risk
What can Ryft do for your big data initiative? Schedule a free evaluation to put it to work on your most complex workloads.