OpenCms 6.0 documentation :: Search configuration: Documenttypes

Documenttype configuration

A documenttype node specifies which document factory extracts the contents of an OpenCms resource with a given resource type and/or mimetype into a Lucene index document. The given document factory is used for every matching combination of the specified resource types and the specified mimetypes.

<documenttype>
	<name>...</name>
	<class>...</class>
	<mimetypes>
		<mimetype>...</mimetype>
		...
	</mimetypes>
	<resourcetypes>
		<resourcetype>...</resourcetype>
		...
	</resourcetypes>
</documenttype>

Configuration nodes

The following nodes are used to specify a documenttype:

the <name> node gives the documenttype a unique name
the <class> node specifies the fully qualified class name of the document factory
zero or more <mimetype> nodes each specify a mimetype of resource contents handled by the given document factory. When a resource is indexed, its mimetype is derived from the extension of the resource name.
one or more <resourcetype> nodes each specify an OpenCms resource type handled by the given document factory
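
The mimetype matching described above can be illustrated with a documenttype that lists more than one mimetype. This is only a sketch: the name rtf and both mimetype values are assumptions, and which mimetype OpenCms actually derives for the .rtf extension depends on the MIME mappings of the installation.

```xml
<!-- Sketch only: the name and the mimetype values are assumed for illustration -->
<documenttype>
	<name>rtf</name>
	<class>org.opencms.search.documents.CmsDocumentRtf</class>
	<mimetypes>
		<!-- one <mimetype> node per mimetype the factory should handle -->
		<mimetype>text/rtf</mimetype>
		<mimetype>application/rtf</mimetype>
	</mimetypes>
	<resourcetypes>
		<resourcetype>binary</resourcetype>
	</resourcetypes>
</documenttype>
```

With such a node, a resource of type binary named, for instance, report.rtf would be handed to the RTF factory if its extension maps to either of the listed mimetypes.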

Example 1

This example shows how to configure a documenttype for PDF documents:

<documenttype>
	<name>pdf</name>
	<class>org.opencms.search.documents.CmsDocumentPdf</class>
	<mimetypes>
		<mimetype>application/pdf</mimetype>
	</mimetypes>
	<resourcetypes>
		<resourcetype>binary</resourcetype>
		<resourcetype>plain</resourcetype>
	</resourcetypes>
</documenttype>

Example 2

This example shows how to configure a documenttype for a COS module:

<documenttype>
	<name>news</name>
	<class>com.opencms.legacy.CmsCosDocument</class>
	<mimetypes/>
	<resourcetypes>
		<resourcetype>com.alkacon.news.CmsNewsContent</resourcetype>
	</resourcetypes>
</documenttype>

Available document classes

Currently, these document factories are part of the OpenCms search package:

org.opencms.search.documents.CmsDocumentGeneric
Extracts index data from a VFS resource. This factory extracts only property data such as title, description and keywords, not the content. It serves as the base class of the other document factories.
org.opencms.search.documents.CmsDocumentPlainText
Extracts index data from a document in plain text format.
org.opencms.search.documents.CmsDocumentRtf
Extracts index data from a document in Rich Text (RTF) file format.
org.opencms.search.documents.CmsDocumentPdf
Extracts index data from a document in Adobe Portable Document Format.
org.opencms.search.documents.CmsDocumentMsExcel
Extracts index data from a document in Microsoft Excel 97(-2002) file format (BIFF8).
org.opencms.search.documents.CmsDocumentMsPowerPoint
Extracts index data from a document in Microsoft PowerPoint file format.
org.opencms.search.documents.CmsDocumentMsWord
Extracts index data from a document in Microsoft Word 97 file format.
org.opencms.search.documents.CmsDocumentXmlPage
Extracts index data from a resource of type xmlpage.
All tags in the content are stripped, so the xmlpage elements can contain both XML and HTML data.
org.opencms.search.documents.CmsDocumentXmlContent
Extracts index data from a resource of type xmlcontent.
com.opencms.legacy.CmsPageDocument
Extracts index data from a resource of type page (belonging to the former xml template mechanism).
com.opencms.legacy.CmsCosDocument
Extracts index data from any COS resource based on the OpenCms CmsMasterDataSet class.
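
To illustrate how a class from this list is referenced in a documenttype, the following sketch wires CmsDocumentPlainText to resources of type plain. The name text and the mimetype text/plain are assumptions chosen for the example; the structure follows Example 1.

```xml
<!-- Sketch only: the name and the mimetype are assumed, the class is taken from the list above -->
<documenttype>
	<name>text</name>
	<class>org.opencms.search.documents.CmsDocumentPlainText</class>
	<mimetypes>
		<mimetype>text/plain</mimetype>
	</mimetypes>
	<resourcetypes>
		<resourcetype>plain</resourcetype>
	</resourcetypes>
</documenttype>
```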

Available resource types

Currently, OpenCms uses the following resource types:

binary (org.opencms.file.types.CmsResourceTypeBinary)
folder (org.opencms.file.types.CmsResourceTypeFolder)
image (org.opencms.file.types.CmsResourceTypeImage)
jsp (org.opencms.file.types.CmsResourceTypeJsp)
page (com.opencms.legacy.CmsResourceTypePage)
plain (org.opencms.file.types.CmsResourceTypePlain)
pointer (org.opencms.file.types.CmsResourceTypePointer)
xmlpage (org.opencms.file.types.CmsResourceTypeXmlPage)
xmlcontent (org.opencms.file.types.CmsResourceTypeXmlContent)
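
The short names in this list are the kind of value placed in <resourcetype> nodes, as Example 1 does with binary and plain. A minimal sketch for resources of type xmlpage might look as follows. The name xmlpage is an assumption, and the empty <mimetypes/> node simply mirrors Example 2, so it may need adjusting if resources should be matched by mimetype.

```xml
<!-- Sketch only: the name is assumed; the empty mimetypes node mirrors Example 2 -->
<documenttype>
	<name>xmlpage</name>
	<class>org.opencms.search.documents.CmsDocumentXmlPage</class>
	<mimetypes/>
	<resourcetypes>
		<!-- short resource type name from the list above -->
		<resourcetype>xmlpage</resourcetype>
	</resourcetypes>
</documenttype>
```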