forked from DSpace/DSpace
QREPO-193 create StructuredTextExtractionFilter, add it to filter.plu… #1
Draft
msutya wants to merge 5 commits into qulto-7.6.1 from QREPO-193
2403da2 QREPO-193 create StructuredTextExtractionFilter, add it to filter.plu…
beb19f0 QREPO-193 create StructuredTextExtractionFilterTest, add test pdf, up…
bb223d3 QREPO-193 refactor based on review, add licence, use property for ver…
01ae185 QREPO-193 finish testing, add constructor to Page and Pages, refactor…
5e27a94 QREPO-193 add license to StructuredPdfTextExtractionFilterTest.java
104 changes: 104 additions & 0 deletions
dspace-api/src/main/java/org/dspace/app/mediafilter/StructuredPdfTextExtractionFilter.java
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,104 @@ | ||
| /** | ||
| * The contents of this file are subject to the license and copyright | ||
| * detailed in the LICENSE and NOTICE files at the root of the source | ||
| * tree and available online at | ||
| * | ||
| * http://www.dspace.org/license/ | ||
| */ | ||
| package org.dspace.app.mediafilter; | ||
|
|
||
| import com.fasterxml.jackson.databind.SerializationFeature; | ||
| import com.fasterxml.jackson.dataformat.xml.XmlMapper; | ||
| import java.io.ByteArrayInputStream; | ||
| import java.io.InputStream; | ||
| import java.sql.SQLException; | ||
| import java.util.ArrayList; | ||
| import java.util.List; | ||
| import org.apache.pdfbox.multipdf.Splitter; | ||
| import org.apache.pdfbox.pdmodel.PDDocument; | ||
| import org.apache.pdfbox.text.PDFTextStripper; | ||
| import org.dspace.app.mediafilter.model.Page; | ||
| import org.dspace.app.mediafilter.model.Pages; | ||
| import org.dspace.content.Bitstream; | ||
| import org.dspace.content.Item; | ||
| import org.dspace.core.Context; | ||
|
|
||
| /** | ||
| * Media filter that extracts the text of a PDF page by page and stores it | ||
| * as an XML document in the STRUCTURED_TEXT bundle. | ||
| */ | ||
| public class StructuredPdfTextExtractionFilter extends MediaFilter { | ||
| private final Splitter splitter = new Splitter(); | ||
| private final XmlMapper xmlMapper = new XmlMapper(); | ||
|
|
||
| @Override | ||
| public String getFilteredName(String oldFileName) { | ||
| return oldFileName + ".xml"; | ||
| } | ||
|
|
||
| @Override | ||
| public String getBundleName() { | ||
| return "STRUCTURED_TEXT"; | ||
| } | ||
|
|
||
| @Override | ||
| public String getFormatString() { | ||
| return "XML"; | ||
| } | ||
|
|
||
| @Override | ||
| public String getDescription() { | ||
| return "Extracted Structured Text"; | ||
| } | ||
|
|
||
| @Override | ||
| public InputStream getDestinationStream(final Item item, final InputStream source, final boolean verbose) | ||
| throws Exception { | ||
|
|
||
| // Keep the source document open until every split page has been read: | ||
| // the single-page documents may share resources with it. | ||
| try (PDDocument document = PDDocument.load(source)) { | ||
| List<PDDocument> splitPages = splitter.split(document); | ||
| try { | ||
| PDFTextStripper stripper = new PDFTextStripper(); | ||
| List<Page> pageTexts = new ArrayList<>(); | ||
| for (int i = 0; i < splitPages.size(); i++) { | ||
| // Page numbers in the output are 1-based. | ||
| pageTexts.add(new Page(i + 1, stripper.getText(splitPages.get(i)))); | ||
| } | ||
| // Serialize in memory so that no temporary file is left behind. | ||
| xmlMapper.enable(SerializationFeature.INDENT_OUTPUT); | ||
| return new ByteArrayInputStream(xmlMapper.writeValueAsBytes(new Pages(pageTexts))); | ||
| } finally { | ||
| // Release every single-page document, even if extraction failed. | ||
| for (PDDocument splitPage : splitPages) { | ||
| splitPage.close(); | ||
| } | ||
| } | ||
| } | ||
| } | ||
|
|
||
| @Override | ||
| public boolean preProcessBitstream(Context c, Item item, Bitstream source, boolean verbose) throws SQLException { | ||
| return "application/pdf".equals(source.getFormat(c).getMIMEType()); | ||
| } | ||
|
|
||
| } |
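
Review note: a filter that spools its output to a temporary file has to make sure the file is removed afterwards, because `File.createTempFile` never deletes it on its own. The following is a minimal sketch, JDK only, of one possible way to hand back a stream whose backing file disappears once the stream is closed. `SelfDeletingTempStream` and its methods are hypothetical names used for illustration; they are not part of this PR or of DSpace, and `DELETE_ON_CLOSE` is documented as a best-effort option.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper for illustration only; not part of this PR or of DSpace.
final class SelfDeletingTempStream {

    private SelfDeletingTempStream() { }

    /** Writes the content to a newly created temp file and returns its path. */
    static Path spool(String content) {
        try {
            // createTempFile already picks a unique name, so no timestamp is needed.
            Path tempFile = Files.createTempFile("dspacetextextract", ".xml");
            Files.write(tempFile, content.getBytes(StandardCharsets.UTF_8));
            return tempFile;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Opens the file so that it is deleted once the returned stream is closed. */
    static InputStream openAndDeleteOnClose(Path tempFile) {
        try {
            return Files.newInputStream(tempFile, StandardOpenOption.DELETE_ON_CLOSE);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads the whole stream as UTF-8 text and closes it. */
    static String readAndClose(InputStream in) {
        try (InputStream stream = in) {
            return new String(stream.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Whether spooling to disk is worth it depends on how large the extracted XML can get; for small documents an in-memory `ByteArrayInputStream` avoids the cleanup question entirely.
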
45 changes: 45 additions & 0 deletions
dspace-api/src/main/java/org/dspace/app/mediafilter/model/Page.java
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,45 @@ | ||
| /** | ||
| * The contents of this file are subject to the license and copyright | ||
| * detailed in the LICENSE and NOTICE files at the root of the source | ||
| * tree and available online at | ||
| * | ||
| * http://www.dspace.org/license/ | ||
| */ | ||
| package org.dspace.app.mediafilter.model; | ||
|
|
||
| import java.util.Objects; | ||
|
|
||
| public class Page { | ||
|
|
||
| private int pageNumber; | ||
| private String text; | ||
|
|
||
| // No-argument constructor, used by Jackson when deserializing. | ||
| public Page() { } | ||
| public Page(int pageNumber, String text) { | ||
| this.pageNumber = pageNumber; | ||
| this.text = text; | ||
| } | ||
|
|
||
| public int getPageNumber() { | ||
| return pageNumber; | ||
| } | ||
|
|
||
| public String getText() { | ||
| return text; | ||
| } | ||
|
|
||
| @Override | ||
| public boolean equals(final Object o) { | ||
| if (this == o) | ||
| return true; | ||
| if (o == null || getClass() != o.getClass()) | ||
| return false; | ||
| final Page page = (Page) o; | ||
| return pageNumber == page.pageNumber && Objects.equals(text, page.text); | ||
| } | ||
|
|
||
| @Override | ||
| public int hashCode() { | ||
| return Objects.hash(pageNumber, text); | ||
| } | ||
| } | ||
46 changes: 46 additions & 0 deletions
dspace-api/src/main/java/org/dspace/app/mediafilter/model/Pages.java
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,46 @@ | ||
| /** | ||
| * The contents of this file are subject to the license and copyright | ||
| * detailed in the LICENSE and NOTICE files at the root of the source | ||
| * tree and available online at | ||
| * | ||
| * http://www.dspace.org/license/ | ||
| */ | ||
| package org.dspace.app.mediafilter.model; | ||
|
|
||
| import com.fasterxml.jackson.dataformat.xml.annotation.JacksonXmlElementWrapper; | ||
| import com.fasterxml.jackson.dataformat.xml.annotation.JacksonXmlProperty; | ||
| import com.fasterxml.jackson.dataformat.xml.annotation.JacksonXmlRootElement; | ||
| import java.util.List; | ||
| import java.util.Objects; | ||
|
|
||
| @JacksonXmlRootElement(localName = "pages") | ||
| public class Pages { | ||
|
|
||
| @JacksonXmlProperty(localName = "page") | ||
| @JacksonXmlElementWrapper(useWrapping = false) | ||
| private List<Page> pageList; | ||
|
|
||
| // No-argument constructor, used by Jackson when deserializing. | ||
| public Pages() { } | ||
| public Pages(final List<Page> pageList) { | ||
| this.pageList = pageList; | ||
| } | ||
|
|
||
| public List<Page> getPageList() { | ||
| return pageList; | ||
| } | ||
|
|
||
| @Override | ||
| public boolean equals(final Object o) { | ||
| if (this == o) | ||
| return true; | ||
| if (o == null || getClass() != o.getClass()) | ||
| return false; | ||
| final Pages pages = (Pages) o; | ||
| return Objects.equals(pageList, pages.pageList); | ||
| } | ||
|
|
||
| @Override | ||
| public int hashCode() { | ||
| return Objects.hash(pageList); | ||
| } | ||
| } |
94 changes: 94 additions & 0 deletions
...e-api/src/test/java/org/dspace/app/mediafilter/StructuredPdfTextExtractionFilterTest.java
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,94 @@ | ||
| /** | ||
| * The contents of this file are subject to the license and copyright | ||
| * detailed in the LICENSE and NOTICE files at the root of the source | ||
| * tree and available online at | ||
| * | ||
| * http://www.dspace.org/license/ | ||
| */ | ||
| package org.dspace.app.mediafilter; | ||
|
|
||
| import static org.junit.Assert.*; | ||
| import static org.mockito.Mockito.*; | ||
|
|
||
| import com.fasterxml.jackson.dataformat.xml.XmlMapper; | ||
| import java.io.InputStream; | ||
| import java.sql.SQLException; | ||
| import org.dspace.app.mediafilter.model.Pages; | ||
| import org.dspace.content.Bitstream; | ||
| import org.dspace.content.BitstreamFormat; | ||
| import org.dspace.content.Item; | ||
| import org.dspace.core.Context; | ||
| import org.junit.Test; | ||
|
|
||
| public class StructuredPdfTextExtractionFilterTest { | ||
|
|
||
| private static final StructuredPdfTextExtractionFilter filter = new StructuredPdfTextExtractionFilter(); | ||
| private static final XmlMapper xmlMapper = new XmlMapper(); | ||
|
|
||
| @Test | ||
| public void testGetFilteredName() { | ||
| assertEquals("multipage_test.pdf.xml", filter.getFilteredName("multipage_test.pdf")); | ||
| } | ||
|
|
||
| @Test | ||
| public void testGetBundleName() { | ||
| assertEquals("STRUCTURED_TEXT", filter.getBundleName()); | ||
| } | ||
|
|
||
| @Test | ||
| public void testGetFormatString() { | ||
| assertEquals("XML", filter.getFormatString()); | ||
| } | ||
|
|
||
| @Test | ||
| public void testGetDescription() { | ||
| assertEquals("Extracted Structured Text", filter.getDescription()); | ||
| } | ||
|
|
||
| @Test | ||
| public void testGetDestinationStream() throws Exception { | ||
| Item item = mock(Item.class); | ||
|
|
||
| // try-with-resources closes all three streams even if an assertion fails. | ||
| try (InputStream source = getMultiPagePDF(); | ||
| InputStream expectedStream = getExpectedXml(); | ||
| InputStream resultStream = filter.getDestinationStream(item, source, true)) { | ||
| assertNotNull(resultStream); | ||
| Pages expectedPages = xmlMapper.readValue(expectedStream, Pages.class); | ||
| Pages resultPages = xmlMapper.readValue(resultStream, Pages.class); | ||
| assertEquals(expectedPages, resultPages); | ||
| } | ||
| } | ||
|
|
||
| @Test | ||
| public void testPreProcessBitstream() throws SQLException { | ||
| Context context = mock(Context.class); | ||
| Item item = mock(Item.class); | ||
|
|
||
| Bitstream source = mock(Bitstream.class); | ||
| BitstreamFormat bsFormat = mock(BitstreamFormat.class); | ||
| when(source.getFormat(context)).thenReturn(bsFormat); | ||
| when(bsFormat.getMIMEType()).thenReturn("application/pdf"); | ||
|
|
||
| assertTrue(filter.preProcessBitstream(context, item, source, true)); | ||
|
|
||
| when(bsFormat.getMIMEType()).thenReturn("image/png"); | ||
| assertFalse(filter.preProcessBitstream(context, item, source, true)); | ||
| } | ||
|
|
||
| private InputStream getMultiPagePDF() { | ||
| return getClass().getResourceAsStream("multipage_test.pdf"); | ||
| } | ||
|
|
||
| private InputStream getExpectedXml() { | ||
| return getClass().getResourceAsStream("multipage_expected_result.xml"); | ||
| } | ||
|
|
||
| } |
27 changes: 27 additions & 0 deletions
dspace-api/src/test/resources/org/dspace/app/mediafilter/multipage_expected_result.xml
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,27 @@ | ||
| <pages> | ||
| <page> | ||
| <pageNumber>1</pageNumber> | ||
| <text>A Text Extraction Test Document&#xD; | ||
| for&#xD; | ||
| DSpace&#xD; | ||
| This is a text. For the next sixty seconds this software will conduct a test of the DSpace text&#xD; | ||
| extraction facility.&#xD; | ||
| This is only a text. This is a paragraph that followed the first that lived in the document that&#xD; | ||
| Jack built.&#xD; | ||
| Lorem ipsum dolor sit amet. The quick brown fox jumped over the lazy dog. Yow! Are we&#xD; | ||
| having fun yet?&#xD; | ||
| This has been a test of the DSpace text extraction system. In the event of actual content you&#xD; | ||
| would care what is written here&#xD; | ||
| </text> | ||
| </page> | ||
| <page> | ||
| <pageNumber>2</pageNumber> | ||
| <text>This is still a text.&#xD; | ||
| This is only a text, but on a separate page. This is a paragraph that followed the first that&#xD; | ||
| lived in the document that Jack built.&#xD; | ||
| Lorem ipsum dolor sit amet. The quick brown fox jumped over the lazy dog.&#xD; | ||
| This has been a test of the DSpace structured text extraction system. In the event of actual&#xD; | ||
| content you would care what is written here&#xD; | ||
| </text> | ||
| </page> | ||
| </pages> |
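
Review note: the expected result above fixes the output shape, a `<pages>` root with one `<page>` per PDF page, each carrying a `<pageNumber>` and a `<text>` element. As a rough illustration of how a consumer might read that structure back, here is a minimal sketch that uses only the JDK's DOM parser. `StructuredTextReader` and `readPages` are hypothetical names, not part of this PR or of DSpace.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

// Hypothetical consumer of the <pages> XML; for illustration only.
final class StructuredTextReader {

    private StructuredTextReader() { }

    /** Returns the text of every page, keyed by page number, in document order. */
    static Map<Integer, String> readPages(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
            Document doc = factory.newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));

            Map<Integer, String> pages = new LinkedHashMap<>();
            NodeList pageNodes = doc.getElementsByTagName("page");
            for (int i = 0; i < pageNodes.getLength(); i++) {
                Element page = (Element) pageNodes.item(i);
                // Assumes each <page> has exactly one <pageNumber> and one <text> child.
                int number = Integer.parseInt(
                    page.getElementsByTagName("pageNumber").item(0).getTextContent().trim());
                String text = page.getElementsByTagName("text").item(0).getTextContent();
                pages.put(number, text);
            }
            return pages;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse structured text XML", e);
        }
    }
}
```

Keying by `pageNumber` rather than by position keeps the reader independent of the order in which `<page>` elements happen to be written.
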
Binary file added
BIN
+39.4 KB
dspace-api/src/test/resources/org/dspace/app/mediafilter/multipage_test.pdf
Binary file not shown.