Documenttype configuration
A documenttype node specifies which document factory is used to extract the contents of an OpenCms resource
with a given resource type and/or mimetype into a Lucene index document. The given
document factory is used for every combination of one of the specified resource types
with one of the specified mimetypes.
<documenttype>
    <name>...</name>
    <class>...</class>
    <mimetypes>
        <mimetype>...</mimetype>
        ...
    </mimetypes>
    <resourcetypes>
        <resourcetype>...</resourcetype>
        ...
    </resourcetypes>
</documenttype>
Configuration nodes
The following nodes are used to specify a documenttype:
- the <name> node gives the documenttype a unique name
- the <class> node specifies the fully qualified class name (package and class)
of the document factory
- zero or more <mimetype> nodes each specify a
mimetype of resource contents handled by the given document factory. When
indexing a resource, its mimetype is derived from the extension of the
resource name.
- one or more <resourcetype> nodes each
specify an OpenCms resource type of resources handled by the given
document factory
Example 1
This example shows how to configure a documenttype for PDF documents:
<documenttype>
    <name>pdf</name>
    <class>org.opencms.search.documents.CmsDocumentPdf</class>
    <mimetypes>
        <mimetype>application/pdf</mimetype>
    </mimetypes>
    <resourcetypes>
        <resourcetype>binary</resourcetype>
        <resourcetype>plain</resourcetype>
    </resourcetypes>
</documenttype>
Example 2
This example shows how to configure a documenttype for a COS module:
<documenttype>
    <name>news</name>
    <class>com.opencms.legacy.CmsCosDocument</class>
    <mimetypes/>
    <resourcetypes>
        <resourcetype>com.alkacon.news.CmsNewsContent</resourcetype>
    </resourcetypes>
</documenttype>
Available document classes
Currently, these document factories are part of the OpenCms search
package:
- org.opencms.search.documents.CmsDocumentGeneric
Extracts index data from a VFS resource. This factory
extracts only property data such as title, description and keywords, not the
content, and is used as the base class of the other document factories.
- org.opencms.search.documents.CmsDocumentPlainText
Extracts index data from a document in plain text format.
- org.opencms.search.documents.CmsDocumentRtf
Extracts index data from a document in Rich Text
(RTF) file format.
- org.opencms.search.documents.CmsDocumentPdf
Extracts index data from a document in Adobe Portable
Document Format.
- org.opencms.search.documents.CmsDocumentMsExcel
Extracts index data from a document in Microsoft Excel
97(-2002) file format (BIFF8).
- org.opencms.search.documents.CmsDocumentMsPowerPoint
Extracts index data from a document in Microsoft
PowerPoint file format.
- org.opencms.search.documents.CmsDocumentMsWord
Extracts index data from a document in Microsoft Word 97
file format.
- org.opencms.search.documents.CmsDocumentXmlPage
Extracts index data from a resource of type xmlpage. All tags in the content
are filtered out, so the xmlpage elements can contain both XML and HTML data.
- org.opencms.search.documents.CmsDocumentXmlContent
Extracts index data from a resource of type xmlcontent.
- com.opencms.legacy.CmsPageDocument
Extracts index data from a resource of type page
(belonging to the former XML template mechanism).
- com.opencms.legacy.CmsCosDocument
Extracts index data from any COS
resource based on the OpenCms CmsMasterDataSet class.
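As a sketch of how one of the factories listed above might be wired up, the following fragment maps the xmlpage resource type to the CmsDocumentXmlPage factory. The documenttype name "xmlpage" and the empty <mimetypes/> node (modelled on Example 2) are assumptions made for illustration; they are not taken from a shipped configuration.

```xml
<documenttype>
    <!-- "xmlpage" is an assumed name; any unique name would do -->
    <name>xmlpage</name>
    <class>org.opencms.search.documents.CmsDocumentXmlPage</class>
    <!-- assumption: no mimetypes listed, as in Example 2 -->
    <mimetypes/>
    <resourcetypes>
        <resourcetype>xmlpage</resourcetype>
    </resourcetypes>
</documenttype>
```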
Available resource types
Currently, OpenCms uses the following resource types:
- binary (org.opencms.file.types.CmsResourceTypeBinary)
- folder (org.opencms.file.types.CmsResourceTypeFolder)
- image (org.opencms.file.types.CmsResourceTypeImage)
- jsp (org.opencms.file.types.CmsResourceTypeJsp)
- page (com.opencms.legacy.CmsResourceTypePage)
- plain (org.opencms.file.types.CmsResourceTypePlain)
- pointer (org.opencms.file.types.CmsResourceTypePointer)
- xmlpage (org.opencms.file.types.CmsResourceTypeXmlPage)
- xmlcontent (org.opencms.file.types.CmsResourceTypeXmlContent)
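To illustrate how a resource type from this list is referenced by its short name, here is a hedged sketch of a documenttype for plain text files. The documenttype name "text" and the mimetype text/plain are assumptions chosen for the example; the class and the resource type name come from the lists above.

```xml
<documenttype>
    <!-- "text" is an assumed name for this sketch -->
    <name>text</name>
    <class>org.opencms.search.documents.CmsDocumentPlainText</class>
    <mimetypes>
        <!-- assumption: plain text resources resolve to this mimetype -->
        <mimetype>text/plain</mimetype>
    </mimetypes>
    <resourcetypes>
        <!-- short resource type name, as in the list above -->
        <resourcetype>plain</resourcetype>
    </resourcetypes>
</documenttype>
```

With such a node, the factory would apply only to resources of type plain whose name extension maps to the listed mimetype.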