XPS Full Text Search in MOSS 2007 and Windows Server 2008

5. April 2010

Late in 2007 the XML Paper Specification (XPS) was published.  The means to create, view and print XPS files are integrated into your Windows OS. Dare I say it is as ubiquitous as PDF, and you may not even know it.  If you don’t have the XPS features installed, you can get them free from Microsoft or from one of the many third-party vendors deploying solutions for XPS.

Get an XPS Viewer  XPS Showcase

What is an XPS document?

The XML Paper Specification itself is platform independent, openly published, and available royalty-free. Microsoft has integrated XPS-based technologies into the Microsoft Windows Vista operating system and the 2007 Microsoft Office system. Microsoft brings additional document value to its customers, partners, and the computing industry through the XPS-based technologies.

An XPS document is any file that is saved to the XML Paper Specification, or .xps, file format. You can create XPS documents (.xps files) by using any program that you can print from in Windows; however, you can view XPS documents only by using the XPS Viewer, which is included in this version of Windows.

An XML Paper Specification (XPS) document is a document format you can use to view, save, share, digitally sign, and protect your document’s content. An XPS document is like an electronic sheet of paper: You can’t change the content on a piece of paper after you print it, and you can’t edit the contents of an XPS document after you save it in the XPS format. In this version of Windows, you can create an XPS document in any program you can print from, but you can only view, sign, and set permissions for XPS documents in the XPS Viewer.

XPS FTS in MOSS 2007

The XPS format is great for SharePoint, not only for its viewability but also for its Full Text Search (FTS) ability.  The following is a step-by-step guide to enabling and configuring XPS IFilter support in Windows Server 2008 and MOSS 2007.  Note: I have a single server running all my MOSS farm services.  If I had a distributed farm with my MOSS Index service running on a dedicated server, I would enable and configure the XPS feature on the dedicated Index server.

Step-By-Step

1. On the MOSS Index server launch the Server Manager and select Add Features, select the XPS Viewer and click Next.

image

2. Click Install.

image

3. Click Close.

image

4. From SharePoint 3.0 Central Administration select the Shared Services Provider/Search Settings, and under Crawling select File types and then select New File Type.

image

5. Enter xps and click on OK.

image

6. You will see the new file type in the list.

image

7. You can also confirm it has been enabled by reviewing the registry setting for the following key.

[HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Office Server\12.0\Search\Applications\<SITE-GUID>\Gather\Portal_Content\Extensions\ExtensionList]

image

8. Next you’ll need to enter the following details in the HKLM hive for the XPS ifilter.

[HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Office Server\12.0\Search\Setup\Filters\.xps]
        Default = (value not set)
        Extension = xps
        FileTypeBucket REG_DWORD = 0x00000001 (1)
        MimeTypes = application/xps

image

9.  In addition, you’ll need to add and set the Class ID for the XPS iFilter in the following HKLM hive location.

[HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Office Server\12.0\Search\Setup\ContentIndexCommon\Filters\Extension\.xps]

             Set the "Default" value to the CLSID of XPS IFilter.

             Default REG_SZ = {1E4CEC13-76BD-4ce2-8372-711CB6F10FD1}

image

image

10.  Next, stop and start the Office SharePoint Server Search service.

C:\>net stop osearch
C:\>net start osearch

image

11. Next, run a full or incremental crawl. If you’re interested, keep an eye on the C:\Users\<search-service-account>\AppData\Local\Temp\gthrsvc folder and you’ll see the MOSS crawl writing the images to this folder to index.  This, of course, is why you need a beefy server for indexing: lots of file I/O.

image

12.  Once the crawl is complete you can verify the XPS files have been crawled via the Crawl Log’s URL Summary.

image
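The registry values from steps 8 and 9 can also be collected into a single .reg file for import on the Index server. The following is a minimal Python sketch, assuming the key paths, value names, and CLSID exactly as listed in those steps; the helper name build_xps_reg_text and the idea of generating the file from a script are illustrative assumptions, not part of the original procedure. Review the generated text before importing it, and back up the registry first.

```python
# Sketch: render the registry values from steps 8 and 9 as .reg file text.
# Key paths, value names, and the CLSID are copied from the steps above;
# the helper name and the scripted approach are illustrative assumptions.

BASE = r"HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Office Server\12.0\Search\Setup"
XPS_IFILTER_CLSID = "{1E4CEC13-76BD-4ce2-8372-711CB6F10FD1}"


def build_xps_reg_text() -> str:
    """Return .reg file text covering the two XPS keys (hypothetical helper)."""
    lines = [
        "Windows Registry Editor Version 5.00",
        "",
        # Step 8: file-type details for the .xps extension.
        "[" + BASE + r"\Filters\.xps]",
        '"Extension"="xps"',
        '"FileTypeBucket"=dword:00000001',
        '"MimeTypes"="application/xps"',
        "",
        # Step 9: the key's default value (@) holds the XPS IFilter class ID.
        "[" + BASE + r"\ContentIndexCommon\Filters\Extension\.xps]",
        '@="' + XPS_IFILTER_CLSID + '"',
        "",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    # Print the text; saving it as a .reg file and importing it is left manual.
    print(build_xps_reg_text())
```

Keeping the values in one file makes it easier to repeat the same configuration on a dedicated Index server. The search service still needs to be restarted afterwards, as in step 10.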

Searching

When I search for the keyword “galleries” in MOSS Advanced Search, I get hits from the full text search of the XPS content.

image

image

For More Information

XML Paper Specification: Overview

ECMA International XPS Specification and Reference Guide

ECMA International XPS White Paper

Microsoft MSDN XPS Team BLOG

iFilter, MOSS, Search, SharePoint, Crawl, Windows Server 2008, XPS

Windows TIFF IFilter and SharePoint 2007

2. April 2010

The TIFF file format has been around for a while. Originally created for scanner devices in the mid ‘80s, it has seen many revisions but continues to be the de facto standard for scanners.  Prior to the release of the Windows TIFF IFilter I wouldn't have thought twice about scanning to PDF as the searchable target format for SharePoint, and in many cases I would still recommend PDF.  It’s nice to have TIFF back as an option for clients that need full text search ability in SharePoint. With the Windows TIFF IFilter, a built-in (I would not say free) Windows Server 2008 and Windows 7 feature, you can do FTS with TIFF.

Windows TIFF IFilter Overview

Windows TIFF IFilter enables you to search for Tagged Image File Format (TIFF) documents based on text content. Windows TIFF IFilter supports all TIFF documents that are compliant with Adobe TIFF Revision 6.0 specifications, and includes the most frequent compressions, such as LZW, JPG, CCITT v4, CCITT v6, uncompressed, and so forth.

When loaded, Windows TIFF IFilter performs Optical Character Recognition (OCR) processing of TIFF images, and then provides the recognized text to the caller for building the search index.

Windows TIFF IFilter can be used by Indexing Service (for Desktop Search), Microsoft Office SharePoint Server 2007 or later, Microsoft SQL Server 2008, and Microsoft SQL Server 2005.

Search result considerations

Windows TIFF IFilter focuses on text-based documents, which means that searching will be more successful for documents that contain clearly identifiable text (for example, black text on a white background), and less successful for documents that contain mixed content (for example, artistic text or text inside of pictures). Additionally, low-quality images and mixed languages can negatively impact OCR processing, and consequently, lower the quality of the search results.

Source: Microsoft TechNet

Before – Advanced Search

Before the Windows TIFF IFilter is installed and configured, you won’t get any hits on the document via a full text search.

image

image

Step-By-Step

Note: All my services are installed on a single Windows Server 2008 R2 Standard server.  If you have a distributed MOSS farm, you’ll need to install and configure the Windows TIFF IFilter on the Index server of your farm.

1. From the Server Manager select the Features node and select Add Features.

image

2. Select the Windows TIFF IFilter and select Next. (Click Next until you reach the Install window.)

image

3. Select Install.

image

4. Select Close.

image

5. Now you’ll need to add the File Services role.

image

6. Select the File Services role and select Next.

image

7. Select the Windows Server 2003 File Services.  Indexing Service will be selected automatically.  Click Next.

image

8. Select Install.

image

9. Select Close.

image

10.  Now you’ll need to install and start the Indexing Service.  In the Run command type MMC.EXE and click OK.

image

11. From the File menu select Add/Remove Snap-in…

image

12. Select the Indexing Service in the Available snap-ins and select Add.

image

13.  You’ll be prompted to select a computer.  Select the default, Local Computer, and select Finish.

image

14.  Select OK.

image

15. Close the Console (you don’t need to save).

16. Verify the service is installed and running.

image

17. Now run a Full Crawl.  You may have to do this a few times.

image

After – Advanced Search

After the Full or Incremental Crawl has completed, you can perform an Advanced MOSS Search and find the document.

image

image

You can also verify the TIFF document has been crawled via the Crawl Log.

image

The Windows TIFF IFilter Settings

If you need to change the OCR settings for the IFilter, you can do so via the Local Group Policy Editor, which has settings for the OCR language and for page OCR.  For page OCR you can change the setting to OCR every page, but this will impact server performance, so use it with caution.

image

image

Windows TIFF IFilter Installation and Operations Guide

How to install and configure the Indexing Service on a Windows Server 2008-based computer

Crawl, Search, TIFF

Maximum Crawl Size

27. June 2009

By default the maximum crawl size of a MOSS Search Server is 16 MB. In many situations you'll have documents that exceed the 16 MB default threshold.

In the example below I've created and uploaded a 20+ MB searchable PDF of a series of Leo Tolstoy's famous novels, all acquired through Project Gutenberg.

When I use the MOSS Advanced Search to look for the word Gutenberg in the contents of the file (assuming the PDF was Full Text Indexed with the PDF iFilter from Foxit Software that I have installed on my Index Server), I'm not able to find my file.

A quick look at the Crawl Log alerts me to a warning for my PDF: "The file reached the maximum download limit. Check that the full text of the document can be meaningfully crawled." As a result, I'm not able to search within the contents of the file via a Full Text Search.

To allow my PDF that is larger than 16 MB to be crawled and fully indexed, I need to add the MaxDownloadSize DWORD value under the "HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Office Server\12.0\Search\Global\Gathering Manager" key and set it to the decimal value 25 (the value is in MB).

I then need to reboot my server (or dedicated index Server).

[HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Office Server\12.0\Search\Global\Gathering Manager]

"MaxDownloadSize"=dword:00000019

 

Note: Hexadecimal 19 = Decimal 25.
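If you need a limit other than 25 MB, the hexadecimal form of the dword can be worked out with a few lines of Python. This is a small sketch, assuming only the value name and the megabyte unit described above; the function name max_download_size_reg_line is made up for illustration.

```python
# Sketch: format a MaxDownloadSize limit (in MB) as the dword line used in a
# .reg file. The value name comes from the post; the function name is an
# illustrative assumption.

def max_download_size_reg_line(size_mb: int) -> str:
    """Return the .reg line for a MaxDownloadSize of size_mb megabytes."""
    if not 0 <= size_mb <= 0xFFFFFFFF:
        raise ValueError("a REG_DWORD holds values from 0 to 4294967295")
    # .reg files write dword data as eight hexadecimal digits.
    return '"MaxDownloadSize"=dword:{:08x}'.format(size_mb)


if __name__ == "__main__":
    print(max_download_size_reg_line(25))  # "MaxDownloadSize"=dword:00000019
```

For example, a 64 MB limit would be written as dword:00000040.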

After a new Full crawl I reissue the search for "Gutenberg" and I now find my PDF and my Crawl Log does not have the warning message.

Crawl, MOSS, Search