[firebase-br] RE: RE: Querying Text in a Blob

PedroRS UOL alex.uol em uol.com.br
Sat Sep 2 23:26:39 -03 2006


>>"Também não entendi.
>>Se alguém puder dar mais detalhes ficarei muito grato."

OK,

I'll try to explain it in a very simplified way, but there are detailed
explanatory texts about this in:

"CONSTRUÇÃO DE UM PROTÓTIPO PARA INDEXAR DOCUMENTOS DE TEXTO LONGO
DIRECIONADO AO FIREBIRD"
http://www.edmilson.eti.br/publicacoes/monografia.pdf
"IMPLEMENTAÇÃO DE UM ÍNDICE INVERTIDO PARA INDEXAR DOCUMENTOS DE TEXTO LONGO
UTILIZANDO ÁRVORE B+ E FUNÇÃO HASH."
http://www.edmilson.eti.br/publicacoes/artigo_uneb.pdf
Here we go:

First, I have a table with the text stored in a blob column. Let's call
this table 'tabela_txt'; it has the following structure:

column 'texto_blob', of type blob, to store the text
column 'id', an integer which is the primary key of this table

Indexing the text consists of having another table (I'll call it
'tabela_indexacao') with the following columns:

column 'palavra_chave', of type varchar(150)
column 'id_blob', of type integer (a foreign key)
column 'id', which is the primary key of this table

In the 'id_blob' column I store the primary key of the record in
'tabela_txt' that contains the word being searched for, i.e. I store the
value of the 'id' column of 'tabela_txt'.
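
To make the structure concrete, here is a minimal DDL sketch of the two
tables (the names are the ones used above; the exact types, such as
BLOB SUB_TYPE TEXT for the text column in Firebird, are my assumptions):

/* Table holding the full documents */
CREATE TABLE tabela_txt (
    id          INTEGER NOT NULL PRIMARY KEY,
    texto_blob  BLOB SUB_TYPE TEXT  /* the long text being indexed */
);

/* Inverted-index table: one row per relevant word per document */
CREATE TABLE tabela_indexacao (
    id             INTEGER NOT NULL PRIMARY KEY,
    palavra_chave  VARCHAR(150) NOT NULL,  /* the indexed keyword */
    id_blob        INTEGER NOT NULL REFERENCES tabela_txt (id)
);

/* An index on the keyword column keeps the lookups fast */
CREATE INDEX idx_palavra_chave ON tabela_indexacao (palavra_chave);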

For example, in 'tabela_txt' I have the following record:

id=266
texto_blob='O parecer do Juiz foi favoravel ao sr.Josivaldo dos Santos'

In 'tabela_indexacao' I would then have the following:

id=1
palavra_chave='parecer'
id_blob=266

id=2
palavra_chave='Juiz'
id_blob=266

id=3
palavra_chave='favoravel'
id_blob=266

id=4
palavra_chave='Josivaldo'
id_blob=266

id=5
palavra_chave='Santos'
id_blob=266

(I only index words that matter; there is no point in indexing 'O', 'do',
'foi', 'ao'...)
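
Just as a sketch, the example rows above would be created with statements
like these (in practice the application parses the blob and generates one
INSERT per relevant word):

INSERT INTO tabela_txt (id, texto_blob)
VALUES (266, 'O parecer do Juiz foi favoravel ao sr.Josivaldo dos Santos');

INSERT INTO tabela_indexacao (id, palavra_chave, id_blob) VALUES (1, 'parecer', 266);
INSERT INTO tabela_indexacao (id, palavra_chave, id_blob) VALUES (2, 'Juiz', 266);
INSERT INTO tabela_indexacao (id, palavra_chave, id_blob) VALUES (3, 'favoravel', 266);
INSERT INTO tabela_indexacao (id, palavra_chave, id_blob) VALUES (4, 'Josivaldo', 266);
INSERT INTO tabela_indexacao (id, palavra_chave, id_blob) VALUES (5, 'Santos', 266);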

If you want to search the database for the word 'Josivaldo', you would use
the following statement:

SELECT B.texto_blob
FROM tabela_indexacao A
INNER JOIN tabela_txt B ON B.id = A.id_blob
WHERE A.palavra_chave = 'Josivaldo'

This returns the record that contains the word 'Josivaldo' in the blob
column (in this case, the 'texto_blob' column of table 'tabela_txt').
This search strategy makes the most sense for large amounts of text stored
in blob columns.

I've presented the concept in an extremely basic and simplistic form, but
an actual implementation has many more details to consider (really a lot...).

Below I've transcribed an interesting text on the subject:


3. Introductory Principles of Indexed Searching

    By Jim McKeeth <jim at mckeeth dot org>


Introduction
------------

There are really two main ways to search a large collection of text
documents. The simplest method would be to load each document and scan
through it for the search terms; this is referred to as a full text scan.
The second, much faster method is to create an index and then search the
index. An index is a list of terms found in a document or set of
documents. Each word only appears once per document, so the index is much
shorter than the original document.


Creating An Index
-----------------

Finding The Words
-----------------

In order to create an index you must first parse the document. Parsing is
the process of picking out the individual tokens (terms or words) in a
piece of text. A parser is a type of state machine. There are many
existing parsing routines available. "The Tomes of Delphi: Algorithms and
Data Structures" by Julian Bucknall contains many very good parsers. An
example: http://www.delphi3000.com/articles/article_1265.asp

A simple parser would scan through a string of text, starting at the
beginning and looking at each character. If it is a letter or number, it
is part of a word; if it is white space or punctuation, it is a separator.
Each word is added to a list (e.g. a TStringList) in the order it is found
in the document. Typically each word is converted to the same case (upper
or lower).

It is really important to consider what you are indexing and how your
index will be used when creating your index and parsing. For example, if
you are parsing HTML then you want to exclude most tags (with the obvious
exception of META tags, which are handled specially). Other times you
might only want to index summary information about each document.

Indexing The Words
------------------

Now that we have a parsed token list we need to index it. The simplest
index is just a list of each word found in a document, and a reference
to the document. This reference may be a URL, a document name or any
other unique identifier (a GUID or a foreign key to another table
describing the document). A more complex index may include the number of
times the word is found in the document or a ranking for where it is in
the document (in the title, keyword section, first paragraph, middle,
last, etc.). This additional information stored with each word is part of
what differentiates one search engine's performance from another.

Many times certain words are left out. These are called stop words. Stop
words are common words, words that will not be searched on, or words
that will not enhance the meaning of a search. Examples of stop words
include "THE, A, AN, AND, IF, BUT", words with numbers in them, or
anything else you want to filter out. Selecting stop words is another
point where search engines differentiate themselves.

Some web search engines used to leave out words like "HTML" or "WEB"
because they were so common, while other search engines would include
every word. Other search engines start with a dictionary list and only
index words found in that dictionary. This leads to trouble when you are
indexing names, technical terms or anything else not found in your
original dictionary.
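
One simple way to keep the stop-word list configurable is to store it in
its own table and have the indexer skip any token found there (a sketch;
the table name StopWords is an assumption, not something from the article):

CREATE TABLE StopWords (
    Word  VARCHAR(20) NOT NULL PRIMARY KEY
);

INSERT INTO StopWords (Word) VALUES ('THE');
INSERT INTO StopWords (Word) VALUES ('A');
INSERT INTO StopWords (Word) VALUES ('AN');
INSERT INTO StopWords (Word) VALUES ('AND');
INSERT INTO StopWords (Word) VALUES ('IF');
INSERT INTO StopWords (Word) VALUES ('BUT');

/* During indexing, the parser discards any token that matches a row here
   before it ever reaches the word table. */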

One time I was creating a search engine for a collection of newsgroup
articles. I discovered that there were UUEncoded (similar to MIME or
Base64) binaries in the articles. This resulted in my parser finding
words that were hundreds of characters long and total gibberish. I
decided to omit any word longer than 50 characters or shorter than 4.
Making the choices about what to include and what to omit is an
important decision, and will vary based on what content you are
indexing.

So here is an example table structure for your index:

Table: WordList
---------------
Document: Number (foreign key to Documents table)
Word : String[20] (if our longest word is 20 characters)
Count : Number (how many times the word is found)

The primary key would be a compound of Document and Word since each word
is listed once per document.

Table: Documents
----------------
Document : Number(primary key index)
Title : string (title of document)
Location : string (URL or full filename and path)

Optionally you could include the entire document as a blob in this
table. You could also have other tables that list terms (from the meta
section of the document) or include authors. Again, this design choice
depends on the type of documents you are indexing and the purpose of
your search engine.
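
As a sketch, that structure in SQL might look like this (the Title and
Location lengths are assumptions, and note that Count may need quoting or
renaming in dialects where COUNT is a reserved word):

CREATE TABLE Documents (
    Document  INTEGER NOT NULL PRIMARY KEY,
    Title     VARCHAR(200),   /* title of the document */
    Location  VARCHAR(500)    /* URL or full filename and path */
);

CREATE TABLE WordList (
    Document  INTEGER NOT NULL REFERENCES Documents (Document),
    Word      VARCHAR(20) NOT NULL,  /* longest indexed word is 20 characters */
    Count     INTEGER NOT NULL,      /* occurrences of the word in the document;
                                        reserved word in some dialects */
    PRIMARY KEY (Document, Word)     /* each word is listed once per document */
);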


Searching Your Index
--------------------

Once all the indexes are stored in a database you need to be able to
search the index for a document. A simple SQL statement to search for a
document that contains a single word could look like this:

SELECT *
FROM WordList
WHERE Word = :v_search_term
ORDER BY Count DESC

This returns all documents containing your single search term, ordered
by the number of times the word is found. If you want to stay in SQL,
searching on multiple terms involves a join for each term. Instead, you
could retrieve a list for each term and then merge them manually. This
is where you would support the AND, OR and NOT keywords.
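
For illustration, the join-per-term approach for an AND search on two
terms could look like this (a sketch against the WordList table above,
ranking simply by the sum of the two counts):

SELECT w1.Document, w1.Count + w2.Count AS Score
FROM WordList w1
INNER JOIN WordList w2 ON w2.Document = w1.Document
WHERE w1.Word = :v_term_1
  AND w2.Word = :v_term_2
ORDER BY Score DESC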

If you want to allow phrase searching, you could search for each word in
the phrase and then search those documents for the phrase. The same
technique could be used for the NEAR keyword. There are other, more
advanced techniques to do this that are much quicker, but they are
beyond the scope of this document.
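
A naive way to express that two-step idea in SQL, assuming the full
document text is also stored in the Documents table (in a blob column
called Body here, which is an assumption), is to narrow the candidates
with the index and then confirm the phrase on just those rows:

SELECT d.Document, d.Title
FROM Documents d
INNER JOIN WordList w1 ON w1.Document = d.Document AND w1.Word = 'FULL'
INNER JOIN WordList w2 ON w2.Document = d.Document AND w2.Word = 'TEXT'
WHERE d.Body LIKE '%full text%'  /* final check for the exact phrase;
                                    case folding is ignored for brevity */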

Once the hits are found and ranked, display the title of each document,
possibly with a summary or the context of the hits, and provide a way
for your user to reach the document.


Variations
----------

One thing Google does a little differently is that it looks at how pages
are linked. This works really well with the hyperlinked nature of the web.
For example, if you search for Borland, most pages that mention Borland
link to www.borland.com. This is assumed to indicate that
www.borland.com is a very important site about Borland. Google also
limits the number of hits you get on each domain.

Many search engines also rank pages higher if the search term appears in
the URL or title of the page. They also look at the description and
keywords meta tags for ranking. Some search engines will actually ignore
a word if it appears too often in a page. This weeds out sites that try
to artificially inflate their rankings.

Phonetics or Soundex is another technique that can be used. This could
be done with an additional table, similar to the word table, but storing
the Soundex value of each word instead of the actual word.
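
A sketch of that variation: a second index table keyed by the Soundex
code rather than the word itself (the table name is an assumption, and
the four-character codes are computed by the application or a UDF, since
not every database ships a SOUNDEX function):

CREATE TABLE SoundexList (
    Document     INTEGER NOT NULL REFERENCES Documents (Document),
    SoundexCode  CHAR(4) NOT NULL,   /* Soundex code of the indexed word */
    Count        INTEGER NOT NULL,
    PRIMARY KEY (Document, SoundexCode)
);

/* A "sounds like" search computes the code for the search term the same
   way and matches on it instead of the literal word. */
SELECT Document
FROM SoundexList
WHERE SoundexCode = :v_search_code
ORDER BY Count DESC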


Third Party Search Tools
------------------------

dtSearch  http://www.dtsearch.com/
--------

dtSearch provides a full range of text search products. These products
target everyone: from the end user to the media publisher and finally
the developer. Their dtSearch Text Retrieval Engine is the core of all
their products and they provide an API with Delphi examples. This
product is just about creating a full text index of documents and
searching it. They use their own data format. Using their API you can
create a Delphi application to index just about any document (PDF,
Office, ZIP, etc. and Text too!) and then search that index to find the
matching documents.

They recently added support for Linux and .NET but I haven't tested
these with Linux or Delphi for .NET yet.

Depending on what type of project you are working on, one of their other
product lines might be a good match. Their dtSearch Publish, for
example, lets you create a CD / DVD of searchable content quickly and
easily, while their dtSearch Web is ideal for building a web search
engine.

Rubicon  http://www.fulltextsearch.com/
-------

Tamarack Associates provides its flagship product, Rubicon, to Delphi
and C++Builder developers who want indexed searching without building a
system from scratch. Beyond the basics we cover here, they also offer
many advanced features and speed improvements.

If you visit their web site you can download a demo version of their
search tools or test their newsgroup search in Borland and Microsoft's
newsgroups. Not only does Rubicon allow searching of documents (it comes
with a parser to handle text, HTML, and RTF), but it also allows you
to search your database for data. It supports logical expressions (and,
or, not, near and like), as well as full phrases. It is all
pure Delphi source, and really offers the developer a very high level of
control over the process.

Rubicon's FastPhrase technology builds a much larger table that allows
it to perform phrase searches without performing a full text scan of the
target document. This is done by storing not only each word, but also
a concatenation of each word and its two adjacent words, for a total of
up to three entries per word.

Rubicon boasts very comprehensive database compatibility. Supported
databases include the BDE and dbExpress as well as ADO, Advantage,
Apollo, Direct Oracle Access, DBISAM, FlashFiler, Halcyon, Interbase
Express, Interbase Objects, ODBC Express and Topaz. Other databases are
supported through building your own packages. When installing Rubicon
(either the demo or the full version), be sure to follow the directions
so that it is installed correctly and supports your selected database.

Summary
-------

Based on my comparisons between dtSearch and Rubicon, they both perform
about the same in the search speed department. dtSearch's advantage is
in their additional product lines and simpler interface, but you suffer
from less configurability. Rubicon gives you access to every detail of
your search project by providing you with the complete Delphi source and
your choice of databases instead of dtSearch's closed database and DLL.


Database Search Features
------------------------

Many databases provide features to allow you to index and search
documents stored within the database. Oracle offers an additional add-in
to index documents. MySQL by default allows you to search stored text
documents just like you would any other field. DBISAM (a native Delphi
BDE replacement with no deployment) now has text indexing and searching
as well.
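
For reference, in MySQL that looks roughly like this (a sketch; the
Articles table is illustrative, and at the time this was written FULLTEXT
indexes required the MyISAM engine):

CREATE TABLE Articles (
    id    INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    body  TEXT,
    FULLTEXT (body)
) ENGINE=MyISAM;

SELECT id
FROM Articles
WHERE MATCH (body) AGAINST ('search term');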


Conclusion
----------

Searching a shorter, well organized index is much quicker than
searching an entire document. Creating and searching an index takes a
lot more planning and effort up front, but quickly pays off if the text
is searched very often. Typically, the larger and more complex the index,
the more effective the search. If your index gets too large or complex,
however, the search speed will degrade.

There are off-the-shelf searching tools available to end users and
developers alike. dtSearch and Rubicon are both extremely fast and
useful, but don't let that stop you from designing your own, especially
if you have a specialized need.

See also: http://www.dtsearch.com/dtsoftware.html#Art_of_The_Text_Query


Engº Pedro Alexandro Ramos dos Santos
RS Sistemas de Informática Ltda
-----Original Message-----
From: lista-bounces em firebase.com.br [mailto:lista-bounces em firebase.com.br] On
behalf of Jeferson Oliveira
Sent: Saturday, September 2, 2006 09:04
To: FireBase
Subject: Re: [firebase-br] RE: Querying Text in a Blob

I also didn't understand.
If someone can give more details, I will be very grateful.


Regards!
Jeferson Oliveira
