Full Text Search In PostgreSQL

  1. 1. Practical full-text search in PostgreSQL Bill Karwin PostgreSQL Conference West 09 • 2009/10/17
  2. 2. Me • 20+ years experience • Application/SDK developer • Support, Training, Proj Mgmt • C, Java, Perl, PHP • SQL maven • MySQL, PostgreSQL, InterBase • Zend Framework • Oracle, SQL Server, IBM DB2, SQLite • Community contributor
  3. 3. Full Text Search
  4. 4. Text search • Web applications demand speed • Let’s compare 5 solutions for text search
  5. 5. Sample data • StackOverflow.com Posts • Data dump exported September 2009 • 1.2 million tuples • 850 MB
  6. 6. StackOverflow ER diagram
  7. 7. Naive Searching Some people, when confronted with a problem, think “I know, I’ll use regular expressions.” Now they have two problems. (Jamie Zawinski)
  8. 8. Performance issue • LIKE with wildcards: SELECT * FROM Posts WHERE body LIKE '%postgresql%' (time: 91 sec) • POSIX regular expressions: SELECT * FROM Posts WHERE body ~ 'postgresql' (time: 105 sec)
  9. 9. Why so slow? CREATE TABLE telephone_book ( full_name VARCHAR(50) ); CREATE INDEX name_idx ON telephone_book (full_name); INSERT INTO telephone_book VALUES ('Riddle, Thomas'), ('Thomas, Dean');
  10. 10. Why so slow? • Search for all with last name “Thomas” uses the index: SELECT * FROM telephone_book WHERE full_name LIKE 'Thomas%' • Search for all with first name “Thomas” doesn’t use the index: SELECT * FROM telephone_book WHERE full_name LIKE '%Thomas'
  11. 11. Indexes don’t help searching for substrings
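One way to picture why the prefix search can use the index while the suffix search cannot: a B-tree index keeps values in sorted order, so every value sharing a known prefix sits in one contiguous range. The sketch below imitates that with a sorted Python list and `bisect`. It is an illustration under that assumption, not a description of PostgreSQL internals, and the helper names `prefix_search` and `suffix_search` are made up for this example.

```python
import bisect

# Sorted list standing in for the index on full_name (illustrative only).
index = sorted(["Riddle, Thomas", "Thomas, Dean"])

def prefix_search(sorted_names, prefix):
    """Binary-search to the first candidate, then read the contiguous run of matches."""
    start = bisect.bisect_left(sorted_names, prefix)
    matches = []
    for name in sorted_names[start:]:
        if not name.startswith(prefix):
            break  # sorted order guarantees no later entry can match
        matches.append(name)
    return matches

def suffix_search(names, suffix):
    """Sort order gives no shortcut here: every entry has to be inspected."""
    return [name for name in names if name.endswith(suffix)]

last_name_thomas = prefix_search(index, "Thomas")   # like LIKE 'Thomas%'
first_name_thomas = suffix_search(index, "Thomas")  # like LIKE '%Thomas'
print(last_name_thomas)
print(first_name_thomas)
```

The prefix search stops as soon as it leaves the matching range, while the suffix search always touches every row, which is the behaviour the 91-second LIKE query suffers from at 1.2 million rows.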
  12. 12. Accuracy issue • Irrelevant or false matching words ‘one’, ‘money’, ‘prone’, etc.: body LIKE '%one%' • Regular expressions in PostgreSQL support escapes for word boundaries: body ~ '\yone\y'
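The accuracy problem can be reproduced outside the database. The sketch below uses Python's `re` module on a few made-up sample strings; note that Python spells the word-boundary escape `\b`, whereas PostgreSQL regular expressions use `\y`.

```python
import re

# Made-up sample bodies, chosen to include the false matches named on the slide.
bodies = ["one ring", "money talks", "prone to errors", "the only one"]

# Substring match, comparable to body LIKE '%one%': every row is a hit.
substring_hits = [b for b in bodies if "one" in b]

# Whole-word match using word boundaries (\b in Python, \y in PostgreSQL).
whole_word = re.compile(r"\bone\b")
word_hits = [b for b in bodies if whole_word.search(b)]

print(substring_hits)
print(word_hits)
```

Word boundaries fix the false positives, but the regular expression still has to scan every row, so the performance issue remains.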
  13. 13. Solutions • Full-Text Indexing in the RDBMS • Sphinx Search • Apache Lucene • Inverted Index • Search Engine Service
  14. 14. PostgreSQL Text-Search
  15. 15. PostgreSQL Text-Search • Since PostgreSQL 8.3 • TSVECTOR to represent text data • TSQUERY to represent search predicates • Special indexes
  16. 16. PostgreSQL Text-Search: Basic Querying SELECT * FROM Posts WHERE to_tsvector(title || ' ' || body || ' ' || tags) @@ to_tsquery('postgresql & performance'); • @@ is the text-search matching operator
  17. 17. PostgreSQL Text-Search: Basic Querying SELECT * FROM Posts WHERE title || ' ' || body || ' ' || tags @@ 'postgresql & performance'; (time with no index: 8 min 2 sec)
  18. 18. PostgreSQL Text-Search: Add TSVECTOR column ALTER TABLE Posts ADD COLUMN PostText TSVECTOR; UPDATE Posts SET PostText = to_tsvector('english', title || ' ' || body || ' ' || tags);
  19. 19. Special index types • GIN (generalized inverted index) • GiST (generalized search tree)
  20. 20. PostgreSQL Text-Search: Indexing CREATE INDEX PostText_GIN ON Posts USING GIN(PostText); time: 39 min 36 sec
  21. 21. PostgreSQL Text-Search: Querying SELECT * FROM Posts WHERE PostText @@ 'postgresql & performance'; (time with index: 20 milliseconds)
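The jump from minutes to milliseconds comes from the inverted structure of the index: each term maps to the set of rows that contain it, so an AND query becomes a set intersection instead of a table scan. The following is a toy model of that idea only. The sample posts are invented, `tokenize` is a crude stand-in for `to_tsvector` (the real function also stems words and drops stop words), and nothing here reflects how GIN is actually stored.

```python
from collections import defaultdict

# Invented sample rows: post id -> text.
posts = {
    1: "PostgreSQL performance tuning",
    2: "MySQL performance tips",
    3: "Installing PostgreSQL on Linux",
}

def tokenize(text):
    """Crude stand-in for to_tsvector: lowercase and split on whitespace."""
    return set(text.lower().split())

# term -> set of post ids containing that term (the posting list)
inverted = defaultdict(set)
for post_id, text in posts.items():
    for term in tokenize(text):
        inverted[term].add(post_id)

def search_all(*terms):
    """AND query: intersect the posting lists of all terms."""
    postings = [inverted.get(term, set()) for term in terms]
    return set.intersection(*postings) if postings else set()

hits = search_all("postgresql", "performance")
print(sorted(hits))
```

Only the posting lists for the two query terms are read, which is why query time depends far less on table size than a LIKE scan does.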
  22. 22. PostgreSQL Text-Search: Keep TSVECTOR in sync CREATE TRIGGER TS_PostText BEFORE INSERT OR UPDATE ON Posts FOR EACH ROW EXECUTE PROCEDURE tsvector_update_trigger(PostText, 'english', title, body, tags);
  23. 23. Lucene
  24. 24. Lucene • Full-text indexing and search engine • Apache Project since 2001 • Apache License • Java implementation • Ports exist for C, Perl, Ruby, Python, PHP, etc.
  25. 25. Lucene: How to use 1. Add documents to index 2. Parse query 3. Execute query
  26. 26. Lucene: Creating an index • Programmatic solution in Java... time: 8 minutes 55 seconds
  27. 27. Lucene: Indexing
        // run any SQL query
        String url = "jdbc:postgresql:stackoverflow";
        Properties props = new Properties();
        props.setProperty("user", "postgres");
        Class.forName("org.postgresql.Driver");
        Connection con = DriverManager.getConnection(url, props);
        Statement stmt = con.createStatement();
        String sql = "SELECT PostId, Title, Body, Tags FROM Posts";
        ResultSet rs = stmt.executeQuery(sql);
        // open Lucene index writer
        Date start = new Date();
        IndexWriter writer = new IndexWriter(FSDirectory.open(INDEX_DIR),
            new StandardAnalyzer(Version.LUCENE_CURRENT), true,
            IndexWriter.MaxFieldLength.LIMITED);
  28. 28. Lucene: Indexing
        // loop over SQL result; each row is a Document with four Fields
        while (rs.next()) {
          Document doc = new Document();
          doc.add(new Field("PostId", rs.getString("PostId"), Field.Store.YES, Field.Index.NO));
          doc.add(new Field("Title", rs.getString("Title"), Field.Store.YES, Field.Index.ANALYZED));
          doc.add(new Field("Body", rs.getString("Body"), Field.Store.YES, Field.Index.ANALYZED));
          doc.add(new Field("Tags", rs.getString("Tags"), Field.Store.YES, Field.Index.ANALYZED));
          writer.addDocument(doc);
        }
        // finish and close index
        writer.optimize();
        writer.close();
  29. 29. Lucene: Querying
        // define fields (names match the fields added at indexing time)
        String[] fields = new String[3];
        fields[0] = "Title"; fields[1] = "Body"; fields[2] = "Tags";
        // parse search query
        Query q = new MultiFieldQueryParser(fields, new StandardAnalyzer()).parse("performance");
        // execute the query
        Searcher s = new IndexSearcher(indexName);
        Hits h = s.search(q);
        time: 80 milliseconds
  30. 30. Sphinx Search
  31. 31. Sphinx Search • Embedded full-text search engine • Started in 2001 • GPLv2 license • Good database integration
  32. 32. Sphinx Search: How to use 1. Edit configuration file 2. Index the data 3. Query the index 4. Issues
  33. 33. Sphinx Search: sphinx.conf
        source stackoverflowsrc {
          type = pgsql
          sql_host = localhost
          sql_user = postgres
          sql_pass = xxxx
          sql_db = stackoverflow
          sql_query = SELECT PostId, Title, Body, Tags FROM Posts
          sql_query_info = SELECT * FROM Posts WHERE PostId=$id
        }
  34. 34. Sphinx Search: sphinx.conf
        index stackoverflow {
          source = stackoverflowsrc
          path = /opt/local/var/db/sphinx/stackoverflow
        }
  35. 35. Sphinx Search: Building index
        indexer -c sphinx.conf stackoverflow
        collected 1242365 docs, 720.5 MB
        sorted 88.3 Mhits, 100.0% done
        total 1242365 docs, 720452944 bytes
        total 357.647 sec, 2014423.75 bytes/sec, 3473.72 docs/sec
        time: 5 min 57 sec
  36. 36. Sphinx Search: Querying index search -c sphinx.conf -i stackoverflow -b "sql & performance" (time: 8 milliseconds)
  37. 37. Sphinx Search: Issues • Index updates are as expensive as rebuilding the index from scratch • Maintain “main” index plus “delta” index for recent changes • Merge indexes periodically • Not all data fits into this model
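The “main” plus “delta” arrangement can be sketched in a few lines. This is a toy model written for illustration, not Sphinx's implementation or API: the class name `MainPlusDelta` and its methods are hypothetical, and the sketch only handles newly added documents.

```python
class MainPlusDelta:
    """Toy model: a large, rarely rebuilt main index plus a small delta index."""

    def __init__(self):
        self.main = {}   # term -> set of doc ids; expensive to rebuild
        self.delta = {}  # same shape; holds only recently added documents

    def add(self, doc_id, text):
        # New documents touch only the small delta index.
        for term in text.lower().split():
            self.delta.setdefault(term, set()).add(doc_id)

    def search(self, term):
        # Every query has to consult both indexes.
        return self.main.get(term, set()) | self.delta.get(term, set())

    def merge(self):
        # Periodic maintenance: fold the delta into main, then empty it.
        for term, ids in self.delta.items():
            self.main.setdefault(term, set()).update(ids)
        self.delta = {}

idx = MainPlusDelta()
idx.add(1, "sql performance")
idx.merge()
idx.add(2, "sql indexing")
before_merge = idx.search("sql")
idx.merge()
after_merge = idx.search("sql")
print(sorted(before_merge), sorted(after_merge))
```

Appending works well in this model, but an edited or deleted document would still be present in the main index until the next full rebuild, which is one reading of “not all data fits into this model”.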
  38. 38. Inverted Index
  39. 39. Inverted index • Diagram: Posts, Tags, TagTypes • TagTypes: searchable words • Tags: intersection of words / Posts
  40. 40. Inverted index: Updated ER Diagram
  41. 41. Inverted index: Data definition
        CREATE TABLE TagTypes (
          TagId SERIAL PRIMARY KEY,
          Tag VARCHAR(50) NOT NULL
        );
        CREATE UNIQUE INDEX TagTypes_Tag_index ON TagTypes(Tag);
        CREATE TABLE Tags (
          PostId INT NOT NULL,
          TagId INT NOT NULL,
          PRIMARY KEY (PostId, TagId),
          FOREIGN KEY (PostId) REFERENCES Posts (PostId),
          FOREIGN KEY (TagId) REFERENCES TagTypes (TagId)
        );
        CREATE INDEX Tags_PostId_index ON Tags(PostId);
        CREATE INDEX Tags_TagId_index ON Tags(TagId);
  42. 42. Inverted index: Indexing INSERT INTO Tags (PostId, TagId) SELECT p.PostId, t.TagId FROM Posts p JOIN TagTypes t ON (p.Tags LIKE '%<' || t.Tag || '>%'); (90 seconds per tag!!)
  43. 43. Inverted index: Querying SELECT p.* FROM Posts p JOIN Tags t USING (PostId) JOIN TagTypes tt USING (TagId) WHERE tt.Tag = 'performance'; (40 milliseconds)
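The indexing and querying statements can be tried end to end on a tiny data set. The sketch below runs them in an in-memory SQLite database through Python's `sqlite3` module, which is an assumption made for portability: the slides target PostgreSQL, `SERIAL` is replaced by SQLite's `INTEGER PRIMARY KEY`, the `Posts` table is cut down to three columns, and the three sample rows are invented.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Posts (PostId INTEGER PRIMARY KEY, Title TEXT, Tags TEXT);
CREATE TABLE TagTypes (TagId INTEGER PRIMARY KEY, Tag VARCHAR(50) NOT NULL UNIQUE);
CREATE TABLE Tags (
  PostId INT NOT NULL REFERENCES Posts (PostId),
  TagId  INT NOT NULL REFERENCES TagTypes (TagId),
  PRIMARY KEY (PostId, TagId)
);

-- Invented sample data; tags are stored as '<tag1><tag2>' strings.
INSERT INTO Posts VALUES
  (1, 'Slow query', '<sql><performance>'),
  (2, 'Join syntax', '<sql>'),
  (3, 'GIN tuning', '<postgresql><performance>');
INSERT INTO TagTypes (Tag) VALUES ('sql'), ('performance'), ('postgresql');

-- Indexing step: one row per (post, tag) pair found in the Tags string.
INSERT INTO Tags (PostId, TagId)
SELECT p.PostId, t.TagId
FROM Posts p JOIN TagTypes t ON (p.Tags LIKE '%<' || t.Tag || '>%');
""")

# Querying step: look up the tag, then follow the mapping table to the posts.
rows = con.execute("""
SELECT p.PostId, p.Title
FROM Posts p JOIN Tags t USING (PostId) JOIN TagTypes tt USING (TagId)
WHERE tt.Tag = 'performance'
ORDER BY p.PostId
""").fetchall()
tag_pairs = con.execute("SELECT COUNT(*) FROM Tags").fetchone()[0]
print(rows, tag_pairs)
```

The LIKE join in the indexing step is the expensive part (the slide reports 90 seconds per tag on the full data set), while the query itself only uses equality lookups on indexed columns.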
  44. 44. Search Engine Services
  45. 45. Search engine services: Google Custom Search Engine • http://www.google.com/cse/ • DEMO ➪ http://www.karwin.com/demo/gcse-demo.html • Even big web sites use this solution
  46. 46. Search engine services: Is it right for you? • Your site is public and allows external index • Search is a non-critical feature for you • Search results are satisfactory • You need to offload search processing
  47. 47. Comparison: Time to Build Index
        LIKE predicate      none
        PostgreSQL / GIN    40 min
        Sphinx Search       6 min
        Apache Lucene       9 min
        Inverted index      high
        Google / Yahoo!     offline
  48. 48. Comparison: Index Storage
        LIKE predicate      none
        PostgreSQL / GIN    532 MB
        Sphinx Search       533 MB
        Apache Lucene       1071 MB
        Inverted index      101 MB
        Google / Yahoo!     offline
  49. 49. Comparison: Query Speed
        LIKE predicate      90+ sec
        PostgreSQL / GIN    20 ms
        Sphinx Search       8 ms
        Apache Lucene       80 ms
        Inverted index      40 ms
        Google / Yahoo!
  50. 50. Comparison: Bottom-Line
                            indexing   storage   query     solution
        LIKE predicate      none       none      11,250x   SQL
        PostgreSQL / GIN    7x         5.3x      2.5x      RDBMS
        Sphinx Search       1x         5.3x      1x        3rd party
        Apache Lucene       1.5x       10x       10x       3rd party
        Inverted index      high       1x        5x        SQL
        Google / Yahoo!     offline    offline             Service
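The multipliers on the bottom-line slide appear to be each measurement divided by the best measurement in its column. The sketch below redoes that arithmetic from the figures on the three preceding comparison slides; treating “90+ sec” as exactly 90 seconds is an assumption, and `relative_to_best` is a helper name invented here.

```python
# Figures copied from the three comparison slides.
build_minutes = {"PostgreSQL / GIN": 40, "Sphinx Search": 6, "Apache Lucene": 9}
storage_mb = {"PostgreSQL / GIN": 532, "Sphinx Search": 533,
              "Apache Lucene": 1071, "Inverted index": 101}
# "90+ sec" is taken as 90,000 ms (assumption).
query_ms = {"LIKE predicate": 90_000, "PostgreSQL / GIN": 20, "Sphinx Search": 8,
            "Apache Lucene": 80, "Inverted index": 40}

def relative_to_best(measurements):
    """Express each value as a multiple of the smallest one, to one decimal place."""
    best = min(measurements.values())
    return {name: round(value / best, 1) for name, value in measurements.items()}

indexing = relative_to_best(build_minutes)
storage = relative_to_best(storage_mb)
query = relative_to_best(query_ms)

for label, table in (("indexing", indexing), ("storage", storage), ("query", query)):
    print(label, table)
```

Most results line up with the slide (11,250x, 2.5x, 5.3x, 1.5x). Two differ slightly: 40 / 6 is about 6.7 and 1071 / 101 is about 10.6, which the slide seems to have rounded to 7x and 10x.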
  51. 51. Copyright 2009 Bill Karwin www.slideshare.net/billkarwin Released under a Creative Commons 3.0 License: http://creativecommons.org/licenses/by-nc-nd/3.0/ You are free to share (to copy, distribute and transmit this work) under the following conditions: • Attribution: You must attribute this work to Bill Karwin. • Noncommercial: You may not use this work for commercial purposes. • No Derivative Works: You may not alter, transform, or build upon this work.