When a client needs to index PDF files for search, the best solution is to use Apache Solr with the Search API Attachments module. In this blog post, I will explain how to set up Solr on Pantheon and how to configure Solr and Search API Attachments.
What is Solr?
Apache Solr is a fast open-source Java search server.
Using Solr with Drupal 8 on Pantheon
Pantheon provides Apache Solr with most plans, including Sandbox, though it is not included in the Basic plan.
Pantheon offers complete instructions for enabling Solr with Drupal 8 on its platform. Here are some of the key steps.
To enable Apache Solr on Pantheon, go to Settings > Add Ons in your site dashboard and add the Apache Solr Index Server.
Add the Search API Pantheon module as a required dependency.
composer require "drupal/search_api_pantheon ~1.0" --prefer-dist
Commit the following modules to the server and enable them:
search_api_pantheon
search_api
search_api_solr
Add the Search Server.
/admin/config/search/search-api/add-server
Create your search index.
/admin/config/search/search-api/add-index
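The add-server and add-index pages are only available once the modules are enabled. Here is a minimal sketch of enabling them with Drush, assuming Drush is available wherever you run it (on Pantheon itself, Drush is normally invoked through Terminus rather than directly):

```shell
# Modules named in the steps above.
MODULES="search_api search_api_solr search_api_pantheon"

if command -v drush >/dev/null 2>&1; then
  # Intentionally unquoted so each module name is passed as its own argument.
  drush en $MODULES -y
else
  echo "drush not found; would run: drush en $MODULES -y"
fi
```

You can also enable the same modules from the Extend page in the admin UI if you prefer.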
Adding the Search API Attachments Module and Config
At this point you will have a working Solr search server in your Pantheon site. The next step is to add the Search API Attachments module and config.
Running Search API Attachments with Solr requires Tika. Tika is a Java library that extracts text and metadata from PDF documents (and many other file formats), so that the extracted text can be sent to Solr and indexed.
Install Search API Attachments
composer require drupal/search_api_attachments
Go to the Search API Attachments settings page at /admin/config/search/search_api_attachments and enter the following values:
Extraction method: Tika Extractor
Path to java executable: java
Path to Tika .jar file: /srv/bin/tika-app-1.18.jar
Verify that your site is able to extract text from documents by clicking “Submit and test extraction.”
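If the test extraction fails, it can help to run Tika by hand. The sketch below calls the Tika jar with the same two paths entered in the settings form. It assumes you have shell access to an environment where those paths exist (for example a local setup), and sample.pdf is a hypothetical test file in the current directory:

```shell
JAVA_BIN="java"                        # same value as "Path to java executable"
TIKA_JAR="/srv/bin/tika-app-1.18.jar"  # same value as "Path to Tika .jar file"
PDF_FILE="sample.pdf"                  # hypothetical test document

if [ -f "$TIKA_JAR" ] && [ -f "$PDF_FILE" ]; then
  # --text asks Tika for the plain-text content of the document.
  "$JAVA_BIN" -jar "$TIKA_JAR" --text "$PDF_FILE" | head -n 20
else
  echo "Tika jar or sample PDF not found; nothing to extract."
fi
```

If this prints readable text from the PDF, Java and Tika are working and any remaining problem is likely in the module settings.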
At this point we have Solr and Search API Attachments working with Tika on Pantheon.
Go to the index you just created, open its Processors tab, and enable the File attachments processor.
Adding PDF Fields
I use Media Entity to add PDF fields. You can either attach PDFs to your content type through a media field that uses the “File” media type, or add a plain “File” field directly to the content type.
Media Entity Field
File Type Field
In your search index's field settings, add the Search API Attachments fields that correspond to the file fields you created. They appear in the General fields section, named after your fields.
Re-index with the new fields. You will then be able to search text inside the PDF attachments fields.
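Re-indexing can be done from the index's admin page, or from the command line with the Drush commands that Search API provides. This is a rough sketch, assuming Drush is available; the index machine name `content` is hypothetical, so replace it with your own:

```shell
INDEX_ID="content"   # hypothetical machine name of your index

if command -v drush >/dev/null 2>&1; then
  drush sapi-r "$INDEX_ID"   # mark all items in the index for re-indexing
  drush sapi-i "$INDEX_ID"   # index the queued items
  drush sapi-s "$INDEX_ID"   # show how many items are indexed
else
  echo "drush not found; re-index from the index's admin page instead."
fi
```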
Displaying the Search Results in a View
Now that we have our Solr, indexing, and Search API Attachments settings working, it’s time to display the results. You’ll need to create a View with content from your Solr index.
Add Fields and Filter Criteria to display search results in a View page.
Your Full Text Search Filter criteria will allow fields to be searched by keywords.
Setting up Solr with Lando Locally
We now have everything working on Pantheon, but if you need to test locally, you can set up a local configuration with Lando based on this example.
name: sitename
recipe: drupal8
drupal: true
config:
  webroot: .
  php: '7.2'
  xdebug: true
  drush: ^8
proxy:
  solr:
    - solr.example.lndo.site:8983
    # Provides a nice lndo url for the solr web interface
    # at http://admin.solr.lndo.site
    - admin.solr.lndo.site:8983
services:
  database:
    type: mysql:5.6
  solr:
    # Use solr version 5.5
    type: solr:5.5
    # Optionally allow access to the database at localhost:9999
    # You will need to make sure port 9999 is open on your machine
    #
    # You can also set `portforward: true` to have Lando dynamically assign
    # a port. Unlike specifying an actual port setting this to true will give you
    # a different port every time you restart your app
    portforward: 9999
    # Optionally declare the name of the solr core.
    # This setting is only applicable for versions 5.5 and above
    core: freedom
    config:
      conf: modules/contrib/search_api_solr/solr-conf/5.x
  appserver:
    extras:
      # Apache Tika
      - apt-get update -y
      - apt-get install -y openjdk-8-jre-headless
      - apt-get install -y openjdk-8-jdk
      - mkdir -p /app/srv/bin && cd /app/srv/bin
      - cd /app/srv/bin && curl -OL http://archive.apache.org/dist/tika/tika-app-1.16.jar
      - apt-get remove openjdk-8-jdk -y
tooling:
  drush:
    service: appserver
    cmd:
      - "drush"
      - "/app/vendor/bin/drush --root=/app/ --uri=http://site-name.lndo.site"
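With the Lando file in place, the environment can be started and checked from the project root. This is a minimal sketch, assuming Lando is installed and the service name and Tika directory match the example configuration:

```shell
APP_SERVICE="appserver"   # service name from the Lando file
TIKA_DIR="/app/srv/bin"   # directory the extras step downloads Tika into

if command -v lando >/dev/null 2>&1; then
  lando start                                       # build and start the services
  lando info                                        # list services and their URLs
  lando ssh -s "$APP_SERVICE" -c "ls -l $TIKA_DIR"   # confirm the Tika jar is present
else
  echo "lando not found; install Lando before running these commands."
fi
```

If the jar is missing from the listing, rebuilding the app so the extras steps run again is a reasonable first thing to try.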
Configure the Solr Server as you did with Pantheon.
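Go to the Search API Attachments settings and enter the same values as before, pointing the Tika path at the jar downloaded by the Lando config (/app/srv/bin/tika-app-1.16.jar in the example above).
/admin/config/search/search_api_attachments
Once you submit the test, you should see the same green success message and be ready to work in your local environment with Solr and Search API Attachments.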