Home » Other » Suggestions & Feedback » Search functionality doesn't work accurately when keyword has punctuation/special characters
Search functionality doesn't work accurately when keyword has punctuation/special characters [message #606107] Sun, 19 January 2014 23:41 Go to next message
Lalit Kumar B
Messages: 3174
Registered: May 2013
Location: World Wide on the Web
Senior Member
Hi Frank/moderators,

I figured out this problem when I was searching for one my messages this morning. I vaguely recollected that "gazzag" also contributed to that topic, so I filtered with his username and the search keyword was "colon". And the search did not return any result /forum/fa/1600/0/. I tried with keyword "semi-colon" and it worked.

So, I did few more tests and concluded that whenever there is a non-alphanumeric/punctuation/special character( or whatever you call it) in the string, the search fails to find the keyword. For example :
If I search for a post where I wrote keyword "SQL/PLSQL", I expect the search to return this post. However, if I use keyword "SQL" or "SQL/" or "/PLSQL" or "PLSQL", that post wouldn't be fetched.

I know OraFAQ is not an Oracle database application, it is My SQL. I don't know about My SQL "LIKE" function, but in Oracle it wouldn't be a problem at all.

SQL> WITH DATA AS(
  2  SELECT 'semi-colon' col FROM dual UNION ALL
  3  SELECT 'SQL/PLSQL' FROM dual UNION ALL
  4  SELECT 'just to test blank space' FROM dual)
  5   SELECT *
  6     FROM DATA
  7    WHERE COL LIKE '%-colon%'
  8       OR COL LIKE '%SQL/%'
  9       OR COL LIKE '%to %';
 
COL
------------------------
semi-colon
SQL/PLSQL
just to test blank space


I am just assuming the code in My SQL would also be similar. The search keyword would go in the where clause of the predicate. So, what is breaking in search functionality?

Regards,
Lalit

[Updated on: Tue, 11 March 2014 13:40] by Moderator

Report message to a moderator

Re: Search functionality doesn't work accurately when keyword has punctuation/special characters [message #606167 is a reply to message #606107] Mon, 20 January 2014 12:12 Go to previous messageGo to next message
Barbara Boehmer
Messages: 9077
Registered: November 2002
Location: California, USA
Senior Member
I don't know much about MySQL either, but I am guessing that OraFAQ is probably using a full-text search using MATCH and AGAINST with a FULLTEXT index, similar to Oracle's Text search using CONTAINS with a CONTEXT index. You would receive similar results if the special characters were set as SKIPJOINS in Oracle. When this is done, the special characters are removed during indexing, so that "semi-colon" is tokenized and indexed as "semicolon" and "SQL/PLSQL" is tokenized and indexed as "SQLPLSQL". Some of the special characters also have special meaning. For example, the hyphen (-) means minus in Oracle Text. Searching for "semi-colon" means, for each row, return the score for "semi" minus the score for "colon". So, you get a score of 0, unless you escape the - with \. Searching for "-anyword" just raises an error, because it doesn't say what to subtract from. In Oracle Text there would be various ways to work around this, but when you facilitate one thing it can cause problems with another. Please see the Oracle Text demonstration below. Of course, this is all just hypothesizing that OraFAQ is doing something similar with MySQL. We will have to wait for Frank to let us know.

-- table and data:
SCOTT@orcl12c> CREATE TABLE data
  2    (col  CLOB)
  3  /

Table created.

SCOTT@orcl12c> INSERT ALL
  2  INTO data (col) VALUES ('semi-colon')
  3  INTO data (col) VALUES ('SQL/PLSQL')
  4  INTO data (col) VALUES ('just to test blank space')
  5  SELECT * FROM DUAL
  6  /

3 rows created.


-- Oracle Text lexer setting - and / as skipjoins:
SCOTT@orcl12c> BEGIN
  2    CTX_DDL.CREATE_PREFERENCE ('test_lex', 'BASIC_LEXER');
  3    CTX_DDL.SET_ATTRIBUTE ('test_lex', 'SKIPJOINS', '-/');
  4  END;
  5  /

PL/SQL procedure successfully completed.


-- Oracle Text Context index that uses lexer above and empty stoplist:
SCOTT@orcl12c> CREATE INDEX test_idx ON data (col)
  2  INDEXTYPE IS CTXSYS.CONTEXT
  3  PARAMETERS
  4    ('LEXER	   test_lex
  5  	 STOPLIST  CTXSYS.EMPTY_STOPLIST')
  6  /

Index created.


-- What is tokenized and searchable as a result of indexing:
SCOTT@orcl12c> SELECT token_text FROM dr$test_idx$i
  2  /

TOKEN_TEXT
----------------------------------------------------------------
BLANK
JUST
SEMICOLON
SPACE
SQLPLSQL
TEST
TO

7 rows selected.


-- queries that don't return any rows:
SCOTT@orcl12c> SELECT * FROM data
  2  WHERE  CONTAINS (col, '-colon') > 0
  3  /
SELECT * FROM data
*
ERROR at line 1:
ORA-29902: error in executing ODCIIndexStart() routine
ORA-20000: Oracle Text error:
DRG-50901: text query parser syntax error on line 1, column 1


SCOTT@orcl12c> SELECT * FROM data
  2  WHERE  CONTAINS (col, '\-colon') > 0
  3  /

no rows selected

SCOTT@orcl12c> SELECT * FROM data
  2  WHERE  CONTAINS (col, 'SQL') > 0
  3  /

no rows selected

SCOTT@orcl12c> SELECT * FROM data
  2  WHERE  CONTAINS (col, 'PLSQL') > 0
  3  /

no rows selected

SCOTT@orcl12c> SELECT * FROM data
  2  WHERE  CONTAINS (col, 'SQL/') > 0
  3  /

no rows selected

SCOTT@orcl12c> SELECT * FROM data
  2  WHERE  CONTAINS (col, '/PLSQL') > 0
  3  /

no rows selected


-- queries that do return results:
SCOTT@orcl12c> SELECT * FROM data
  2  WHERE  CONTAINS (col, 'to ') > 0
  3  /

COL
--------------------------------------------------------------------------------
just to test blank space

1 row selected.

SCOTT@orcl12c> SELECT * FROM data
  2  WHERE  CONTAINS (col, 'semi\-colon') > 0
  3  /

COL
--------------------------------------------------------------------------------
semi-colon

1 row selected.

SCOTT@orcl12c> SELECT * FROM data
  2  WHERE  CONTAINS (col, 'SQL/PLSQL') > 0
  3  /

COL
--------------------------------------------------------------------------------
SQL/PLSQL

1 row selected.

SCOTT@orcl12c> SELECT * FROM data
  2  WHERE  CONTAINS (col, 'semicolon') > 0
  3  /

COL
--------------------------------------------------------------------------------
semi-colon

1 row selected.

SCOTT@orcl12c> SELECT * FROM data
  2  WHERE  CONTAINS (col, 'SQLPLSQL') > 0
  3  /

COL
--------------------------------------------------------------------------------
SQL/PLSQL

1 row selected.

SCOTT@orcl12c> SELECT * FROM data
  2  WHERE  CONTAINS (col, '%colon') > 0
  3  /

COL
--------------------------------------------------------------------------------
semi-colon

1 row selected.

SCOTT@orcl12c> SELECT * FROM data
  2  WHERE  CONTAINS (col, '%\-colon') > 0
  3  /

COL
--------------------------------------------------------------------------------
semi-colon

1 row selected.

SCOTT@orcl12c> SELECT * FROM data
  2  WHERE  CONTAINS (col, 'SQL%') > 0
  3  /

COL
--------------------------------------------------------------------------------
SQL/PLSQL

1 row selected.

SCOTT@orcl12c> SELECT * FROM data
  2  WHERE  CONTAINS (col, '%PLSQL') > 0
  3  /

COL
--------------------------------------------------------------------------------
SQL/PLSQL

1 row selected.

SCOTT@orcl12c> SELECT * FROM data
  2  WHERE  CONTAINS (col, 'SQL/%') > 0
  3  /

COL
--------------------------------------------------------------------------------
SQL/PLSQL

1 row selected.

SCOTT@orcl12c> SELECT * FROM data
  2  WHERE  CONTAINS (col, '%/PLSQL') > 0
  3  /

COL
--------------------------------------------------------------------------------
SQL/PLSQL

1 row selected.


[Updated on: Mon, 20 January 2014 12:14]

Report message to a moderator

Re: Search functionality doesn't work accurately when keyword has punctuation/special characters [message #606180 is a reply to message #606167] Mon, 20 January 2014 23:11 Go to previous message
Lalit Kumar B
Messages: 3174
Registered: May 2013
Location: World Wide on the Web
Senior Member
Barbara Boehmer wrote on Mon, 20 January 2014 23:42
I am guessing that OraFAQ is probably using a full-text search using MATCH and AGAINST with a FULLTEXT index, similar to Oracle's Text search using CONTAINS with a CONTEXT index.


Ah, right Barbara. You seem to be correct, I guess that's what is happening. Did a bit of google and got similar responses. In one of the MySQL forums, I got this information :

"The '-' character is treated as a word stop even within literals. The same is true if any of the special text search modifiers are used (eg +, -, ~) so that hyphenated literals are not correctly found with full text searches."
Previous Topic: Linux Ubunt-O- (?!?)
Next Topic: Plagiarism on www.dbametrix.net #2
Goto Forum:
  


Current Time: Thu Mar 28 06:03:58 CDT 2024