How to derive the actual number of words per line for each chapter? #36

Open
AlGantori opened this issue May 2, 2020 · 11 comments

@AlGantori

If I understand the main page description correctly, these "scripts" render an image of each page from a font and then build the bounding rectangle for each word (glyph) generated. Is that correct?

Does it also build one line bitmap at a time for a Madina mushaf with 15 lines (rows) per page?

That sounds overly complex if all I want is the word count per line for each chapter.
Something like the following (showing word counts for Fatiha and Baqara):

   "1": [4,5,4,4,4,5,3],
   "2": [7,5,4,8,6,6,
        9,9,7,9,8,8,9, 8,9,10,9,10,8,7, 7, ..... ],
@ahmedre
Contributor

ahmedre commented May 2, 2020

yes, your understanding is correct. and yes, it builds one line at a time.
for what you want to do, i'd download and get the database from this repo and then get this data with a query (or with a script that just does this for each page). if i recall, the table here should contain the line information as well.

@AlGantori
Author

Are you sure this kind of data is not already available in some XML/JSON resource?

I have done some Node.js-based development indirectly, but I don't recognize commands in the installation notes such as the following:

ppm install dmake
ppm install dbd-mysql
ppm install yaml

Are these expected to be executed inside some CLI, or on some Linux distro?
Thanks for helping out, because at this point I am clueless.
I am running Windows 7.

@ahmedre
Contributor

ahmedre commented May 2, 2020

you don't need to do any of those commands nor run this script itself - just download the database and import it and write a script yourself.

@AlGantori
Author

AlGantori commented May 2, 2020

By database, do you mean I should download the sql folder in this repo?

I have MySQL Workbench. It's a beast, and I never got acquainted with all of its terms (Open Model, ???).
It seems oriented toward opening databases over a network connection, and I am having a hard time making it open a local file. It managed to open schema.mwb but throws me into the EER diagram mode, whereas I want to see the tables and data.

Which of these files should I be attempting to open?

image

Would you suggest a better tool than MySQL Workbench 5.2.44 CE?

By writing scripts myself, do you mean writing SQL queries to retrieve the info, perhaps from the glyph_page_line table?

Thank you for holding my hand through this.

@AlGantori
Author

AlGantori commented May 2, 2020

  • After upgrading MySQL Workbench to 8.0.20 it quit working and reported an error. These companies (Oracle, Microsoft) just ruin any piece of software they grab onto and turn it into a monster. In the end, I ditched this garbage/ziballa software, which was installing so much junk on my system.

  • I have not worked with SQL since about 1987 (Sybase SQL, prior to MS SQL) 👎

  • I guess you have to have a local server installed, HeidiSQL insisted on a connection.

  • So I installed XAMPP as the server.

  • Was able to open the 02-database.sql

  • Is the glyph_ayah the table I should be deriving word count from?

  • Since I am purely guessing, can you confirm that the row below refers to the first verse of Fatiha (in this case the Basmalah), and that this row is actually the verse number, which in my case I won't be counting as a word?

image

Will tajweed markings (e.g. small meem) appear as separate rows in this table, or are they lumped with the previous word as a single glyph?

This feels terrible: would I have to query, group and count on ayah_number, then subtract one for the verse-number glyph (the Arabic-Indic numeral) to get my word count?

I have a feeling I am going about this the wrong/difficult way.

@AlGantori
Author

AlGantori commented May 2, 2020

  • Oh, that table does not give the line_number. glyph_page_line is what I should be looking into for the line wrapping (word/token count) I am after.
  • Take the 1st line of chapter 2, which is numbered as the 3rd line because the sura title and the basmalah count as page lines/rows.
  • It has 7 words + 2 tajweed markers + 1 verse-number = 10 tokens

image

This would be its data, matching the 7 words + 2 tajweed markers + 1 verse-number = 10 tokens

image

How can I detect that glyph_id = 264 is a verse number, which I do not want to count?

@AlGantori
Author

AlGantori commented May 2, 2020

Specifically, for page 2 this database describes this particular layout:

image

Matching query

SELECT line_number, COUNT(*) AS token_count FROM `glyph_page_line` WHERE page_number = 2 GROUP BY line_number ORDER BY line_number;

The raw/net count of tokens per line follows:

image
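As a rough illustration, the per-line count query can be tried without a MySQL server by loading a few rows into an in-memory SQLite table. This is only a sketch: the table and column names (`glyph_page_line`, `glyph_id`, `page_number`, `line_number`) come from this thread, while the rows and the helper name `tokens_per_line` are made up for the example.

```python
import sqlite3

# Toy stand-in for glyph_page_line; the real data lives in the repo's MySQL dump.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE glyph_page_line ("
    "glyph_id INTEGER, page_number INTEGER, line_number INTEGER)"
)

# Made-up rows: page 2 has 3 glyphs on line 3 and 2 glyphs on line 4.
rows = [(1, 2, 3), (2, 2, 3), (3, 2, 3), (4, 2, 4), (5, 2, 4), (6, 3, 1)]
conn.executemany("INSERT INTO glyph_page_line VALUES (?, ?, ?)", rows)


def tokens_per_line(conn, page_number):
    """Return [(line_number, token_count), ...] for one page, in line order."""
    query = """
        SELECT line_number, COUNT(*) AS token_count
        FROM glyph_page_line
        WHERE page_number = ?
        GROUP BY line_number
        ORDER BY line_number
    """
    return conn.execute(query, (page_number,)).fetchall()


print(tokens_per_line(conn, 2))  # [(3, 3), (4, 2)]
```

Selecting `line_number` alongside the count, and ordering by it, keeps each count attached to its line instead of relying on the order the server happens to return groups in.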

I happen to be working with the Tajweed version, where page 2 is a bit different. That's all right, I will handle that.

Again, my current roadblock is detecting whether a token is a verse number.

@ahmedre
Contributor

ahmedre commented May 3, 2020

the glyph table will tell you what "type" the glyph is - so you can exclude the ayah markers that way.

@AlGantori
Author

AlGantori commented May 3, 2020

Wow, I can't believe I am doing a three-way join to get this. It appears that all verse-numbers are typed as "end".

image

image

Mission almost accomplished !!! ALLAHU AKBAR !!!
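The screenshots with the actual join are not preserved here, so the following is only a guess at what such a three-way join might look like, run against toy data in SQLite. The `glyph_type` table and the columns `glyph_type_id` and `name` are assumptions, as are the type names "word" and "pause"; only `glyph_page_line`, the `glyph` table, glyph_id 264 and the "end" type are mentioned in this thread.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Assumed schema: only the table names glyph and glyph_page_line are from the thread.
    CREATE TABLE glyph_type (glyph_type_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE glyph (glyph_id INTEGER PRIMARY KEY, glyph_type_id INTEGER);
    CREATE TABLE glyph_page_line (glyph_id INTEGER, page_number INTEGER, line_number INTEGER);
    INSERT INTO glyph_type VALUES (1, 'word'), (2, 'pause'), (3, 'end');
""")

# Toy line mirroring the example above: 7 words + 2 pauses + 1 verse number,
# all on page 2, line 3, with glyph_id 264 as the verse-number ("end") glyph.
type_ids = [1, 1, 1, 2, 1, 1, 2, 1, 1, 3]
for offset, type_id in enumerate(type_ids):
    glyph_id = 255 + offset
    conn.execute("INSERT INTO glyph VALUES (?, ?)", (glyph_id, type_id))
    conn.execute("INSERT INTO glyph_page_line VALUES (?, 2, 3)", (glyph_id,))


def tokens_without_verse_numbers(conn, page_number):
    """Count tokens per line on a page, skipping glyphs typed as 'end'."""
    query = """
        SELECT gpl.line_number, COUNT(*) AS token_count
        FROM glyph_page_line AS gpl
        JOIN glyph AS g ON g.glyph_id = gpl.glyph_id
        JOIN glyph_type AS gt ON gt.glyph_type_id = g.glyph_type_id
        WHERE gpl.page_number = ? AND gt.name <> 'end'
        GROUP BY gpl.line_number
        ORDER BY gpl.line_number
    """
    return conn.execute(query, (page_number,)).fetchall()


print(tokens_without_verse_numbers(conn, 2))  # [(3, 9)]
```

On this toy line the 10 raw tokens drop to 9 once the verse number is excluded; the two pause markers are still counted.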

@ahmedre
Contributor

ahmedre commented May 3, 2020

awesome al7amdulillah! make sure to not include other things like pauses (so just include words).
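One possible way to follow this advice is to keep only glyphs of a "word" type rather than excluding "end", and then fold the result into the per-chapter shape asked for at the top of the issue. The sketch below uses toy data in SQLite; the `glyph_type` table, the type names "word" and "pause", and the `sura_number` column on `glyph_ayah` are assumptions not confirmed in this thread, and `words_per_line_by_chapter` is a hypothetical helper name.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Assumed schema, filled with made-up rows below.
    CREATE TABLE glyph_type (glyph_type_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE glyph (glyph_id INTEGER PRIMARY KEY, glyph_type_id INTEGER);
    CREATE TABLE glyph_ayah (glyph_id INTEGER, sura_number INTEGER, ayah_number INTEGER);
    CREATE TABLE glyph_page_line (glyph_id INTEGER, page_number INTEGER, line_number INTEGER);
    INSERT INTO glyph_type VALUES (1, 'word'), (2, 'pause'), (3, 'end');
""")

# Made-up tokens: (type_id, sura, ayah, page, line)
tokens = [
    (1, 1, 1, 1, 2), (1, 1, 1, 1, 2), (3, 1, 1, 1, 2),
    (1, 1, 2, 1, 3), (2, 1, 2, 1, 3), (1, 1, 2, 1, 3), (1, 1, 2, 1, 3),
    (1, 2, 1, 2, 3), (3, 2, 1, 2, 3),
]
for glyph_id, (type_id, sura, ayah, page, line) in enumerate(tokens, start=1):
    conn.execute("INSERT INTO glyph VALUES (?, ?)", (glyph_id, type_id))
    conn.execute("INSERT INTO glyph_ayah VALUES (?, ?, ?)", (glyph_id, sura, ayah))
    conn.execute("INSERT INTO glyph_page_line VALUES (?, ?, ?)", (glyph_id, page, line))


def words_per_line_by_chapter(conn):
    """Map each chapter number to its list of word counts, one entry per line."""
    query = """
        SELECT ga.sura_number, gpl.page_number, gpl.line_number, COUNT(*)
        FROM glyph_page_line AS gpl
        JOIN glyph AS g ON g.glyph_id = gpl.glyph_id
        JOIN glyph_type AS gt ON gt.glyph_type_id = g.glyph_type_id
        JOIN glyph_ayah AS ga ON ga.glyph_id = gpl.glyph_id
        WHERE gt.name = 'word'
        GROUP BY ga.sura_number, gpl.page_number, gpl.line_number
        ORDER BY ga.sura_number, gpl.page_number, gpl.line_number
    """
    result = {}
    for sura, _page, _line, count in conn.execute(query):
        result.setdefault(sura, []).append(count)
    return result


print(json.dumps(words_per_line_by_chapter(conn)))  # {"1": [2, 3], "2": [1]}
```

Filtering on an allow-list of types means pauses and verse numbers both drop out. A line that holds no words at all would simply be missing from the list, which may or may not be what a layout algorithm wants.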

@AlGantori
Author

AlGantori commented May 3, 2020

  • In my current hacking I am including the Tajweed markers (here a pause) and the verse-numbers in the rendering as well. It's just that my algorithm is based on word count and on the Uthmani script (word/token sequences), which has the Tajweed markers inline, in this case the two "pause" markers.

  • I should probably use a similar approach in my JSON, perhaps expanding it to tag the Tajweed tokens with their type and maybe even a sub-type.

  • I don't see a verse-number as belonging to a hard-coded line position but rather as floating. I would not call a verse-number "end", as I want to flow them either as an aaya prefix (default) or as a suffix in future renderings.
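The ideas in this comment could be sketched roughly as follows: each token carries a type (and optionally a sub-type), while the verse number is stored on the verse itself so the renderer can place it as a prefix or a suffix. Everything here is hypothetical, including the field names (`type`, `subtype`, `number`, `tokens`), the function names and the placeholder token texts; it is not the JSON format actually used by the author.

```python
import json

# Hypothetical verse record: the verse number floats outside the token list.
verse = {
    "number": 1,
    "tokens": [
        {"text": "w1", "type": "word"},
        {"text": "p1", "type": "tajweed", "subtype": "pause"},
        {"text": "w2", "type": "word"},
    ],
}


def word_count(verse):
    """Count only tokens tagged as words, ignoring Tajweed markers."""
    return sum(1 for token in verse["tokens"] if token["type"] == "word")


def render_verse(verse, number_position="prefix"):
    """Join token texts, placing the verse number before or after them."""
    marker = f"({verse['number']})"
    texts = [token["text"] for token in verse["tokens"]]
    if number_position == "prefix":
        return " ".join([marker] + texts)
    if number_position == "suffix":
        return " ".join(texts + [marker])
    raise ValueError(f"unknown number_position: {number_position!r}")


print(word_count(verse))              # 2
print(render_verse(verse))            # (1) w1 p1 w2
print(render_verse(verse, "suffix"))  # w1 p1 w2 (1)
print(json.dumps(verse["tokens"][1]))
```

Keeping the verse number out of the token list means line-wrapping logic based on word counts never has to special-case it.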
