
approaches used for language detection on images ...

This discussion is connected to the gimp-user-list.gnome.org mailing list which is provided by the GIMP developers and not related to gimpusers.com.

2020-01-28 17:21:37 UTC (8 months ago)


I worked on corpora research and text cleansing can be done relatively straightforwardly. The problem is with images, images containing texts, which language, ...
Could you point me in the right direction? (I am a mathematician, so Math is not a problem for me at all)
Thank you

JWein (via www.gimpusers.com/forums)
Liam R E Quin
2020-01-29 00:47:22 UTC (8 months ago)

approaches used for language detection on images ...

On Tue, 2020-01-28 at 18:21 +0100, JWein wrote:

I worked on corpora research and text cleansing can be done relatively
straightforwardly. The problem is with images, images containing texts, which
language, ...
Could you point me in the right direction? (I am a mathematician, so Math is
not a problem for me at all)
Thank you

You need (1) feature extraction, finding the writing, (2) OCR of some sort, to turn pictures of letters into letters, and then (3) the linguistic analysis.
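
As a rough illustration of step (3) only, here is a minimal sketch that guesses the language of text that has already come out of OCR by counting stopword hits. The tiny word lists and the function name `guess_language` are assumptions made up for this example; a real system would use much larger profiles (or character n-grams), and steps (1) and (2) would be handled by an OCR engine.

```python
from collections import Counter

# Tiny illustrative stopword lists (an assumption for this sketch);
# real language profiles would be far larger.
STOPWORDS = {
    "english": {"the", "and", "of", "to", "is", "in", "that", "it"},
    "german": {"der", "die", "und", "das", "ist", "nicht", "ein", "zu"},
    "french": {"le", "la", "et", "les", "des", "est", "une", "que"},
}

def guess_language(text):
    """Return the language whose stopwords occur most often, or None."""
    words = Counter(text.lower().split())
    scores = {
        lang: sum(words[w] for w in stops)
        for lang, stops in STOPWORDS.items()
    }
    best = max(scores, key=scores.get)
    # No stopword hit at all means we have no evidence either way.
    return best if scores[best] > 0 else None
```

Counting stopwords is crude, but it tolerates a fair amount of OCR noise because it only needs a few short, frequent words to survive recognition.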

However, many images contain metadata in plain text (OK, XML or whatever) that may include language and location information.
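
A minimal sketch of that metadata check, assuming the image carries an embedded XMP packet with a Dublin Core `dc:language` entry. The function name `xmp_languages` is hypothetical, and many scanned images will have no such packet at all, in which case the result is simply empty.

```python
import re
import xml.etree.ElementTree as ET

# Qualified tag names in ElementTree's {namespace}tag notation.
DC_LANGUAGE = "{http://purl.org/dc/elements/1.1/}language"
RDF_LI = "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}li"

def xmp_languages(data: bytes):
    """Return dc:language values from the first XMP packet found in data."""
    # XMP is stored as plain XML text inside the binary file.
    match = re.search(rb"<x:xmpmeta.*?</x:xmpmeta>", data, re.DOTALL)
    if match is None:
        return []
    root = ET.fromstring(match.group(0))
    return [
        li.text
        for lang in root.iter(DC_LANGUAGE)
        for li in lang.iter(RDF_LI)
        if li.text
    ]
```

It would be called on the raw file contents, for example `xmp_languages(open(path, "rb").read())`, where `path` is whatever image file is being examined.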

I'm interested in the text cleansing, can you tell me more (off list maybe?)

Thank you!

slave liam

Liam Quin - web slave for https://www.fromoldbooks.org/
with fabulous vintage art and fascinating texts to read.
2020-01-29 12:52:02 UTC (8 months ago)

approaches used for language detection on images ...

You need (1) feature extraction, finding the writing, (2) OCR of some sort, to turn pictures of letters into letters, and then (3) the linguistic analysis.

Hey Liam:

Thank you, and yes, I could guess the way to go would be through the steps you outline, but I am pretty sure some other gimp developers have trodden those paths before and may have some tips to share.

However, many images contain metadata in plain text (OK, XML or whatever) that may include language and location information.

Most of the texts I work on are image-based PDF files, that is, pages that were scanned as images.

I'm interested in the text cleansing, can you tell me more (off list maybe?)

"Text cleansing" or "text normalization" (as some also call it, although to most people normalization is just one phase of cleansing, for example making sure that the text is "normalized" in a java.text.Normalizer.Form way) means removing all the BS-ing visual distraction and the ephemeral commercial nonsense from pages.

https://www.google.com/search?q="text+cleansing"
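
To illustrate the "normalized" part mentioned above: `java.text.Normalizer.Form` offers the Unicode forms NFC, NFD, NFKC and NFKD, and Python's `unicodedata.normalize` accepts the same form names. The helper `normalize_text` below is a made-up name for this sketch, and collapsing whitespace is an extra assumption, not part of Unicode normalization itself.

```python
import unicodedata

def normalize_text(text, form="NFC"):
    """Apply a Unicode normalization form, then collapse runs of whitespace."""
    normalized = unicodedata.normalize(form, text)
    return " ".join(normalized.split())
```

With NFC, "e" followed by a combining acute accent becomes the single character "é"; with NFKC, compatibility characters such as the "ﬁ" ligature are additionally folded to plain "fi".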

For example, gutenberg.org has made the effort to textualize lots of books, but they include a nonsensical header and footer and use hard line breaks (something that was necessary back when people used mainframes whose displays were 80 characters wide, ...)
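
A minimal sketch of that kind of cleanup, assuming the file uses marker lines beginning with `*** START OF` and `*** END OF` around the book text (older files may use different wording, so treat the markers as an assumption). The function names are made up for this example.

```python
import re

def strip_gutenberg(text):
    """Keep only the text between the START and END marker lines."""
    lines = text.splitlines()
    # Fall back to the whole file if a marker is missing.
    start = next(
        (i + 1 for i, line in enumerate(lines) if line.startswith("*** START OF")),
        0,
    )
    end = next(
        (i for i, line in enumerate(lines) if line.startswith("*** END OF")),
        len(lines),
    )
    return "\n".join(lines[start:end])

def unwrap(text):
    """Join hard-wrapped lines; blank lines still separate paragraphs."""
    paragraphs = re.split(r"\n\s*\n", text.strip())
    return "\n\n".join(" ".join(p.split()) for p in paragraphs)
```

Note that `unwrap` throws away the original line breaks, which is fine for corpus work but loses layout information that other uses may want to keep.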

This kind of nonsense has become the new normal. I work as a teacher and I see it as abusive, especially when done to students and people who are just trying to get something done. Companies internally block certain sites, types of content, pages and sections of pages; it is about time that people started doing it more aggressively on their own. Some other people tell you about "user agreements", "morality" and about "capitalism going down if people start doing that more aggressively" ;-)

I do the same kinds of things you do, but these days I am more interested in texts, especially if they relate to education. One of my research efforts relates to a corpus of the Regents exams (going back to the 1860's). They contain plenty of intertextual pictures and zero comma nada annotations, frequent language switching in the texts . . .

JWein (via www.gimpusers.com/forums)
2020-01-29 12:59:21 UTC (8 months ago)

approaches used for language detection on images ...

ONE of my research efforts relates to a corpus of the Regents exams (going back to the 1860's).

https://en.wikipedia.org/wiki/Regents_Examinations

. . . frequent language switching mostly in the sentences of multilingual texts . . .

JWein (via www.gimpusers.com/forums)
Ofnuts
2020-01-29 13:19:41 UTC (8 months ago)

approaches used for language detection on images ...

On 1/29/20 1:52 PM, JWein wrote:

You need (1) feature extraction, finding the writing, (2) OCR of some sort, to turn pictures of letters into letters, and then (3) the linguistic analysis.

Hey Liam:

Thank you, and yes, I could guess the way to go would be through the steps you outline, but I am pretty sure some other gimp developers have trodden those paths before and may have some tips to share.

Gimp is not really about OCR.

You would also have to define the range of languages you are interested in. For instance you can't OCR Cyrillic without knowing it's Cyrillic, because many glyphs are indistinguishable from similar Latin glyphs but have a different Unicode code point, and can be unrelated characters.
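
A small sketch of the look-alike problem, using only Python's standard `unicodedata` module. The helper `script_of` is a hypothetical name, and taking the first word of the Unicode character name is only a rough stand-in for a real script property lookup.

```python
import unicodedata

def script_of(ch):
    """Rough script guess: the first word of the Unicode character name."""
    return unicodedata.name(ch, "UNKNOWN").split()[0]

latin_a = "a"          # U+0061
cyrillic_a = "\u0430"  # U+0430, renders almost identically to Latin "a"
cyrillic_er = "\u0420" # U+0420, looks like Latin "P" but is the letter ER

for ch in (latin_a, cyrillic_a, cyrillic_er):
    print(hex(ord(ch)), unicodedata.name(ch), script_of(ch))
```

The first two characters are visually the same letter with different code points; the third looks like a Latin letter but is an unrelated character, which is why the OCR step needs to know the script in advance.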

Liam R E Quin
2020-01-29 20:07:10 UTC (8 months ago)

approaches used for language detection on images ...

On Wed, 2020-01-29 at 13:52 +0100, JWein wrote:

You need (1) feature extraction, finding the writing, (2) OCR of some
sort, to turn pictures of letters into letters, and then (3) the linguistic analysis.

Hey Liam:

Thank you, and yes, I could guess the way to go would be through the steps you
outline, but I am pretty sure some other gimp developers have trodden those
paths before and may have some tips to share.

I doubt it.

There _are_ some people who use GIMP to clean up images preparatory to running OCR on them, or have been in the past, but there are much better programs for that.

I asked you about text cleansing (cleaning) because it has different meanings in different contexts; i'm *certainly* not interested in losing the page apparatus or hyphenation information, although in my own work i mark them so software can skip them when wanted.

If you're doing an academic study of a book “manifestation” such things are important, but i had rather use the Text Encoding Initiative as a model than Michael Hart’s flailing Gutenberg project.

I do the same kinds of things you do

I doubt that, at least from your description, but some of it may be a language issue in reading the tone of your message. If you are doing natural language processing and semantic-Web-style text mining your needs for texts overlap with my personal projects but not so much with GIMP, which is a bitmap image editor. For example, detecting Greek words and phrases included in a 30,000-page OCR'd text by analyzing the page images would interest me (and detecting italics for that matter); if i ever have a spare few days i plan to try the (then) latest Tesseract for that.
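
For comparison, a much simpler text-side check (not the page-image analysis described above) is to look for runs of Greek-script characters in text that has already been OCR'd. This is only a sketch: the name `greek_phrases` is made up, it assumes the OCR output is in composed (NFC) form, and it will miss Greek that the OCR engine misread as Latin look-alikes.

```python
import re

# Greek and Coptic block plus Greek Extended (polytonic) block.
GREEK_WORD = r"[\u0370-\u03FF\u1F00-\u1FFF]+"
# One or more Greek words separated by whitespace form a phrase.
GREEK_RUN = re.compile(GREEK_WORD + r"(?:\s+" + GREEK_WORD + r")*")

def greek_phrases(text):
    """Return runs of consecutive Greek-script words found in text."""
    return GREEK_RUN.findall(text)
```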

Liam Quin - web slave for https://www.fromoldbooks.org/