Full Text Search In PostgreSQL
  1. Practical full-text search in PostgreSQL • Bill Karwin • PostgreSQL Conference West 09 • 2009/10/17
  2. Me • 20+ years experience • Application/SDK developer • Support, Training, Proj Mgmt • C, Java, Perl, PHP • SQL maven • MySQL, PostgreSQL, InterBase • Zend Framework • Oracle, SQL Server, IBM DB2, SQLite • Community contributor
  3. Full Text Search
  4. Text search • Web applications demand speed • Let’s compare 5 solutions for text search
  5. Sample data • StackOverflow.com Posts • Data dump exported September 2009 • 1.2 million tuples • 850 MB
  6. StackOverflow ER diagram
  7. Naive Searching • “Some people, when confronted with a problem, think ‘I know, I’ll use regular expressions.’ Now they have two problems.” (Jamie Zawinski)
  8. Performance issue
      • LIKE with wildcards (time: 91 sec):
          SELECT * FROM Posts WHERE body LIKE '%postgresql%'
      • POSIX regular expressions (time: 105 sec):
          SELECT * FROM Posts WHERE body ~ 'postgresql'
  9. Why so slow?
        CREATE TABLE telephone_book ( full_name VARCHAR(50) );
        CREATE INDEX name_idx ON telephone_book (full_name);
        INSERT INTO telephone_book VALUES ('Riddle, Thomas'), ('Thomas, Dean');
  10. Why so slow?
      • Search for all with last name “Thomas” (uses index):
          SELECT * FROM telephone_book WHERE full_name LIKE 'Thomas%'
      • Search for all with first name “Thomas” (doesn’t use index):
          SELECT * FROM telephone_book WHERE full_name LIKE '%Thomas'
  11. Indexes don’t help searching for substrings
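The reason an ordered index helps with a leading prefix but not with a trailing substring can be sketched in a few lines of Python. This is only an illustration, not how PostgreSQL implements B-tree lookups: a sorted list stands in for the index on full_name, and the helper names prefix_search and suffix_search are hypothetical.

```python
from bisect import bisect_left


def prefix_search(sorted_names, prefix):
    """Binary-search to the first candidate, then read matches in order."""
    i = bisect_left(sorted_names, prefix)
    matches = []
    while i < len(sorted_names) and sorted_names[i].startswith(prefix):
        matches.append(sorted_names[i])
        i += 1
    return matches


def suffix_search(names, suffix):
    """Sort order is no help here, so every entry has to be examined."""
    return [name for name in names if name.endswith(suffix)]


# A sorted list stands in for the index on telephone_book.full_name.
index = sorted(["Riddle, Thomas", "Thomas, Dean"])
by_last_name = prefix_search(index, "Thomas")    # like LIKE 'Thomas%'
by_first_name = suffix_search(index, "Thomas")   # like LIKE '%Thomas'
```

The prefix lookup jumps straight to the right place in the sorted order and stops at the first non-match; the suffix lookup has no starting point and degenerates into a full scan.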
  12. Accuracy issue
      • Irrelevant or false matching words ‘one’, ‘money’, ‘prone’, etc.:
          body LIKE '%one%'
      • Regular expressions in PostgreSQL support escapes for word boundaries:
          body ~ '\yone\y'
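The difference between substring matching and word-boundary matching can be tried out quickly in Python. This is a rough sketch with made-up sample strings; note that Python's re module writes the word boundary as \b, whereas PostgreSQL regular expressions write it as \y.

```python
import re

# Made-up sample values standing in for the body column.
bodies = ["one", "money", "prone", "no one knows"]

# Substring test, comparable to a LIKE pattern with wildcards on both sides.
substring_hits = [b for b in bodies if "one" in b]

# Whole-word test: "one" must be delimited by word boundaries.
whole_word = re.compile(r"\bone\b")
boundary_hits = [b for b in bodies if whole_word.search(b)]
```

The substring test accepts all four strings, while the boundary test keeps only those where "one" appears as a word of its own. Neither form can use an ordinary index, so the performance problem remains.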
  13. Solutions • Full-Text Indexing in the RDBMS • Sphinx Search • Apache Lucene • Inverted Index • Search Engine Service
  14. PostgreSQL Text-Search
  15. PostgreSQL Text-Search • Since PostgreSQL 8.3 • TSVECTOR to represent text data • TSQUERY to represent search predicates • Special indexes
  16. PostgreSQL Text-Search: Basic Querying (@@ is the text-search matching operator)
        SELECT * FROM Posts
        WHERE to_tsvector(title || ' ' || body || ' ' || tags)
              @@ to_tsquery('postgresql & performance');
  17. PostgreSQL Text-Search: Basic Querying (time with no index: 8 min 2 sec)
        SELECT * FROM Posts
        WHERE title || ' ' || body || ' ' || tags
              @@ 'postgresql & performance';
  18. PostgreSQL Text-Search: Add TSVECTOR column
        ALTER TABLE Posts ADD COLUMN PostText TSVECTOR;
        UPDATE Posts SET PostText =
          to_tsvector('english', title || ' ' || body || ' ' || tags);
  19. Special index types • GIN (generalized inverted index) • GiST (generalized search tree)
  20. PostgreSQL Text-Search: Indexing (time: 39 min 36 sec)
        CREATE INDEX PostText_GIN ON Posts USING GIN(PostText);
  21. PostgreSQL Text-Search: Querying (time with index: 20 milliseconds)
        SELECT * FROM Posts
        WHERE PostText @@ 'postgresql & performance';
  22. PostgreSQL Text-Search: Keep TSVECTOR in sync
        CREATE TRIGGER TS_PostText
          BEFORE INSERT OR UPDATE ON Posts
          FOR EACH ROW EXECUTE PROCEDURE
          tsvector_update_trigger(PostText, 'pg_catalog.english', title, body, tags);
  23. Lucene
  24. Lucene • Full-text indexing and search engine • Apache Project since 2001 • Apache License • Java implementation • Ports exist for C, Perl, Ruby, Python, PHP, etc.
  25. Lucene: How to use • 1. Add documents to index • 2. Parse query • 3. Execute query
  26. Lucene: Creating an index • Programmatic solution in Java... • time: 8 minutes 55 seconds
  27. Lucene: Indexing
        String url = "jdbc:postgresql:stackoverflow";
        Properties props = new Properties();
        props.setProperty("user", "postgres");
        Class.forName("org.postgresql.Driver");
        Connection con = DriverManager.getConnection(url, props);
        // run any SQL query
        Statement stmt = con.createStatement();
        String sql = "SELECT PostId, Title, Body, Tags FROM Posts";
        ResultSet rs = stmt.executeQuery(sql);
        Date start = new Date();
        // open Lucene index writer
        IndexWriter writer = new IndexWriter(FSDirectory.open(INDEX_DIR),
            new StandardAnalyzer(Version.LUCENE_CURRENT), true,
            IndexWriter.MaxFieldLength.LIMITED);
  28. Lucene: Indexing
        // loop over SQL result: each row is a Document with four Fields
        while (rs.next()) {
          Document doc = new Document();
          doc.add(new Field("PostId", rs.getString("PostId"), Field.Store.YES, Field.Index.NO));
          doc.add(new Field("Title", rs.getString("Title"), Field.Store.YES, Field.Index.ANALYZED));
          doc.add(new Field("Body", rs.getString("Body"), Field.Store.YES, Field.Index.ANALYZED));
          doc.add(new Field("Tags", rs.getString("Tags"), Field.Store.YES, Field.Index.ANALYZED));
          writer.addDocument(doc);
        }
        // finish and close index
        writer.optimize();
        writer.close();
  29. Lucene: Querying (time: 80 milliseconds)
      • Parse a Lucene query
          // define fields; names are case-sensitive and must match those used when indexing
          String[] fields = new String[3];
          fields[0] = "Title"; fields[1] = "Body"; fields[2] = "Tags";
          // parse search query
          Query q = new MultiFieldQueryParser(fields,
              new StandardAnalyzer()).parse("performance");
      • Execute the query
          Searcher s = new IndexSearcher(indexName);
          Hits h = s.search(q);
  30. Sphinx Search
  31. Sphinx Search • Embedded full-text search engine • Started in 2001 • GPLv2 license • Good database integration
  32. Sphinx Search: How to use • 1. Edit configuration file • 2. Index the data • 3. Query the index • 4. Issues
  33. Sphinx Search: sphinx.conf
        source stackoverflowsrc
        {
          type = pgsql
          sql_host = localhost
          sql_user = postgres
          sql_pass = xxxx
          sql_db = stackoverflow
          sql_query = SELECT PostId, Title, Body, Tags FROM Posts
          sql_query_info = SELECT * FROM Posts WHERE PostId=$id
        }
  34. Sphinx Search: sphinx.conf
        index stackoverflow
        {
          source = stackoverflowsrc
          path = /opt/local/var/db/sphinx/stackoverflow
        }
  35. Sphinx Search: Building index (time: 5 min 57 sec)
        indexer -c sphinx.conf stackoverflow
        collected 1242365 docs, 720.5 MB
        sorted 88.3 Mhits, 100.0% done
        total 1242365 docs, 720452944 bytes
        total 357.647 sec, 2014423.75 bytes/sec, 3473.72 docs/sec
  36. Sphinx Search: Querying index (time: 8 milliseconds)
        search -c sphinx.conf -i stackoverflow -b "sql & performance"
  37. Sphinx Search: Issues • Index updates are as expensive as rebuilding the index from scratch • Maintain “main” index plus “delta” index for recent changes • Merge indexes periodically • Not all data fits into this model
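The “main plus delta” idea can be illustrated with a toy model in Python. This is a conceptual sketch only and does not use or imitate the Sphinx API: the functions build_index, search, and merge and the sample documents are all hypothetical, and it ignores the harder case of documents that are edited or deleted after they reach the main index.

```python
from collections import defaultdict


def build_index(docs):
    """Map each lowercase word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search(word, main, delta):
    """Query both indexes and combine the hits."""
    return main.get(word, set()) | delta.get(word, set())


def merge(main, delta):
    """Fold the delta into the main index and start a fresh, empty delta."""
    for word, ids in delta.items():
        main[word] |= ids
    return main, defaultdict(set)


# The large main index is built rarely; the small delta is rebuilt often.
main = build_index({1: "sql performance tuning", 2: "java threads"})
delta = build_index({3: "sql injection"})

hits_before_merge = search("sql", main, delta)
main, delta = merge(main, delta)
hits_after_merge = search("sql", main, delta)
```

Because only the small delta is rebuilt when new rows arrive, the expensive full rebuild of the main index can be deferred to the periodic merge.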
  38. Inverted Index
  39. Inverted index (diagram: Posts, Tags, TagTypes) • TagTypes holds the searchable words • Tags is the intersection of words and Posts
  40. Inverted index: Updated ER Diagram
  41. Inverted index: Data definition
        CREATE TABLE TagTypes (
          TagId SERIAL PRIMARY KEY,
          Tag VARCHAR(50) NOT NULL
        );
        CREATE UNIQUE INDEX TagTypes_Tag_index ON TagTypes(Tag);
        CREATE TABLE Tags (
          PostId INT NOT NULL,
          TagId INT NOT NULL,
          PRIMARY KEY (PostId, TagId),
          FOREIGN KEY (PostId) REFERENCES Posts (PostId),
          FOREIGN KEY (TagId) REFERENCES TagTypes (TagId)
        );
        CREATE INDEX Tags_PostId_index ON Tags(PostId);
        CREATE INDEX Tags_TagId_index ON Tags(TagId);
  42. Inverted index: Indexing (90 seconds per tag!!)
        INSERT INTO Tags (PostId, TagId)
          SELECT p.PostId, t.TagId
          FROM Posts p JOIN TagTypes t
            ON (p.Tags LIKE '%<' || t.Tag || '>%');
  43. Inverted index: Querying (40 milliseconds)
        SELECT p.* FROM Posts p
          JOIN Tags t USING (PostId)
          JOIN TagTypes tt USING (TagId)
        WHERE tt.Tag = 'performance';
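The same inverted-index idea can be sketched in memory with Python. The post ids and tag strings below are made-up sample data, the angle-bracket tag format is assumed from the LIKE pattern in the indexing query, and posts_with_all is a hypothetical helper, so treat this as an illustration of the data structure rather than a replacement for the SQL.

```python
import re
from collections import defaultdict

# Made-up sample rows: PostId -> Tags column in the assumed <tag><tag> format.
posts = {
    1: "<postgresql><performance>",
    2: "<java><lucene>",
    3: "<sql><performance><indexing>",
}

# tag -> set of post ids, the role played by Tags joined to TagTypes.
inverted = defaultdict(set)
for post_id, tags in posts.items():
    for tag in re.findall(r"<([^<>]+)>", tags):
        inverted[tag].add(post_id)


def posts_with_all(*tags):
    """Return the sorted ids of posts that carry every one of the given tags."""
    id_sets = [inverted.get(tag, set()) for tag in tags]
    return sorted(set.intersection(*id_sets)) if id_sets else []
```

A lookup such as posts_with_all("performance") goes straight to the posts for that word without reading any post text, which is why querying is fast even though building the structure is slow.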
  44. Search Engine Services
  45. Search engine services: Google Custom Search Engine • http://www.google.com/cse/ • DEMO ➪ http://www.karwin.com/demo/gcse-demo.html (even big web sites use this solution)
  46. Search engine services: Is it right for you? • Your site is public and allows external index • Search is a non-critical feature for you • Search results are satisfactory • You need to offload search processing
  47. Comparison: Time to Build Index
        LIKE predicate      none
        PostgreSQL / GIN    40 min
        Sphinx Search       6 min
        Apache Lucene       9 min
        Inverted index      high
        Google / Yahoo!     offline
  48. Comparison: Index Storage
        LIKE predicate      none
        PostgreSQL / GIN    532 MB
        Sphinx Search       533 MB
        Apache Lucene       1071 MB
        Inverted index      101 MB
        Google / Yahoo!     offline
  49. Comparison: Query Speed
        LIKE predicate      90+ sec
        PostgreSQL / GIN    20 ms
        Sphinx Search       8 ms
        Apache Lucene       80 ms
        Inverted index      40 ms
        Google / Yahoo!
  50. Comparison: Bottom-Line
                            indexing   storage   query      solution
        LIKE predicate      none       none      11,250x    SQL
        PostgreSQL / GIN    7x         5.3x      2.5x       RDBMS
        Sphinx Search       1x         5.3x      1x         3rd party
        Apache Lucene       1.5x       10x       10x        3rd party
        Inverted index      high       1x        5x         SQL
        Google / Yahoo!     offline    offline              Service
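The multipliers on the bottom-line slide appear to be each measurement divided by the best value in its column. A short Python check, using only the figures from the three preceding comparison slides, reproduces them; the helper name relative is hypothetical, and “90+ sec” is treated as exactly 90 seconds for the arithmetic.

```python
# Figures copied from the build-time, storage, and query-speed slides.
build_min = {"PostgreSQL / GIN": 40, "Sphinx Search": 6, "Apache Lucene": 9}
storage_mb = {
    "PostgreSQL / GIN": 532,
    "Sphinx Search": 533,
    "Apache Lucene": 1071,
    "Inverted index": 101,
}
query_ms = {
    "LIKE predicate": 90_000,  # assumption: "90+ sec" taken as 90 seconds
    "PostgreSQL / GIN": 20,
    "Sphinx Search": 8,
    "Apache Lucene": 80,
    "Inverted index": 40,
}


def relative(table):
    """Express every value as a multiple of the smallest value in the table."""
    best = min(table.values())
    return {name: value / best for name, value in table.items()}


for label, table in [("indexing", build_min), ("storage", storage_mb), ("query", query_ms)]:
    for name, factor in relative(table).items():
        print(f"{label:9} {name:18} {factor:,.1f}x")
```

For example, 90,000 ms divided by Sphinx's 8 ms gives the 11,250x shown for the LIKE predicate, and 40 min divided by 6 min rounds to the 7x shown for PostgreSQL / GIN indexing.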
  51. Copyright 2009 Bill Karwin • www.slideshare.net/billkarwin • Released under a Creative Commons 3.0 License: http://creativecommons.org/licenses/by-nc-nd/3.0/ • You are free to share (to copy, distribute and transmit this work) under the following conditions:
      • Attribution. You must attribute this work to Bill Karwin.
      • Noncommercial. You may not use this work for commercial purposes.
      • No Derivative Works. You may not alter, transform, or build upon this work.