This repository hosts a proposal for GSoC 2020 for Boost.org

boost.astronomy.FITS_Module

Name : Gopi Krishna Menon

College : Dronacharya College of Engineering, Gurgaon, Haryana

Course : Computer Science Engineering

Degree Program : B.TECH ( Bachelor of Technology )

Email : [email protected]

Availability :

How much time do you plan to spend on your GSOC?

I intend to spend about 36 hours per week (6 hours a day, 6 days a week) during the 3-month period. On weekends, if required, I can spend more than 6 hours. I mostly work during the evening, from 6:00 pm to 11:00 pm, but during holidays I can also work early in the morning, from 3:30 am to 6:00 am.

What are your intended start and end dates?

I can start working on the project around 10th-15th May 2020: my exams end on 15th May and my 3-month summer vacation starts then, so I will be able to concentrate fully on the project. I will also require about one week (during the Community Bonding period or after) to get to the drawing board with my mentor and design the higher-level API from scratch, in order to build an API of high quality with ample flexibility. I intend to finish the project work on or before 10 August 2020.

What other factors affect your availability ?

During the month of July I would request 2 days' leave, as I will have to travel to Bangalore with my team for an exhibition as part of the McAfee Scholars Programme. Other than that, I am completely free during the summer and have no other commitments.

Background Information

I am a 3rd-year undergraduate in Computer Science and Engineering at Dronacharya College of Engineering, Gurgaon, Haryana. Some of the courses I have pursued during my first five semesters are:

  • Object Oriented Programming in C++
  • Principles of Operating System
  • Digital System Design
  • Digital Electronics
  • Theory of Computation
  • Core Java
  • Discrete Mathematics
  • Computer Architecture and Organization
  • Data Structures and Algorithms in C++
  • Microprocessors and Interfacing
  • Computer Graphics

Academic Performance

  • 3rd Semester:
    • Total Marks Secured: 926/1150
    • University Rank: 16
    • College Rank: 2
  • 4th Semester:
    • Total Marks Secured: 936/1150
    • University Rank: 12
    • College Rank: 1
  • 5th Semester:
    • Total Marks Secured: 986/1150
    • University Rank: 2 (merit list not yet declared; based on the result gazette)
    • College Rank: 2

I also secured 100/100 marks in Computer Science in the Class 12 CBSE board exams. I completed my high school education at St. Thomas Sr. Sec. School, Bahadurgarh, Haryana.

During my first and second semesters I scored around 870 and 836 respectively and did not secure any rank from the university or college.

Internship/Jobs/Courses Audited

  • I organized and mentored an Advanced C++ Programming course in my college for 6 weeks during my first year, with students from the 2nd, 4th and 6th semesters
  • I audited the Introduction to C++ course from the National Programme on Technology Enhanced Learning (NPTEL) and was in the top 5% of the class
  • Last summer I also audited a course in Ethical Hacking from Internshala

My Articles

Structured Binding in C++ : https://www.geeksforgeeks.org/structured-binding-c/

Attributes in C++ : https://www.geeksforgeeks.org/attributes-in-c/

std::any class in C++ : https://www.geeksforgeeks.org/stdany-class-in-c/

Programming Interests

I have a considerable amount of knowledge of C++ and am quite comfortable working in it, as C++ has been the daily driver for all my needs (for the last 2 years I have been working more on systems programming projects than in other fields). Along with C++, I am also proficient in languages such as C#, Visual Basic, Rust and Java.

Reason for Contributing to Boost Libraries

C++ is the best language I have worked with so far, and out of curiosity I wanted to know how libraries and new features make their way into the language. This is how I was introduced to the Boost libraries. I learned that Boost code is exceptional in quality and performance and has become the gold standard for C++ libraries, and that many C++ features originated in Boost libraries written several years earlier (2001-2003). Since then I have been a big fan of Boost and have always wanted to contribute to it, and someday to have my own libraries in Boost and in the C++ standard itself. Boost also showed me the real power of open source: making a project open source and free not only makes it available to everyone but also improves the quality of the code drastically. This is the main reason why I want to contribute to the Boost libraries.

Have you done any previous work in this area before or on similar projects?

Yes, I have been working on a similar project for quite some time (conceptually different but technically similar). The project's name is GML (Generalized Markup Language). It is basically a pseudo-programming language with an open model that allows you to code new features into the language using C++. It provides better readability and supports both structuring data and writing programs in it. It gives about an 80% performance boost over most XML parsers (RapidXML is still 20% faster than us but lacks a lot of features), offers a ton of additional features, and has an 80% smaller file-size footprint: an XML file of 100 KB would be roughly 20 KB in GML. This project was recently selected, along with 36 other projects, for the McAfee Scholars Programme from the Foundation for Advancement of Education and Research (FAER), out of a group of 400 participants. Although the projects are not conceptually similar, a lot of the technical details related to parsing and storage are.

What are your plans beyond this Summer of Code time frame for your proposed work?.

Even after the GSoC period, I will continue to contribute to Boost.Astronomy until the library is fully accepted into Boost and becomes the gold-standard FITS library for C++.

Please rate, from 0 to 5 (0 being no experience, 5 being expert), your knowledge of the following languages, technologies, or tools:

  • C++ 98/03 (traditional C++): 4
  • C++ 11/14 (modern C++): 4
  • C++ Standard Library: 4
  • Boost C++ Libraries: 3
  • Git: 3

What software development environments are you most familiar with (Visual Studio, Eclipse, KDevelop, etc.)?

Visual Studio on Microsoft Windows is the daily driver for all of my needs, and it is the environment I am most familiar with. I am also proficient with VS Code and Sublime Text when I have to work on the Linux platform.

What software documentation tool are you most familiar with (Doxygen, DocBook, Quickbook, etc.)?

I am familiar with Doxygen and have been using it, together with the Morsa MCSS framework, for automatic documentation generation for my GML project.

Project Proposal

The external API of the FITS module provides a powerful yet easy-to-use interface for reading and writing FITS data.

It consists of a collection of functions that let the user communicate with the internal classes of the FITS module. In other words, it binds each of the parser classes and provides a wrapper over them through which we can communicate with the different parsers inside the FITS module.

How would the interface be used?

To read a FITS file we just use the read function with the filename and file mode as arguments (kindly refer to the competency test solution for a basic implementation):

fits.read(filename, filemode);

We can support three file modes (this will require some changes in the parsers' codebase), namely:

  1. Stream Mode ( Medium Files )
  2. Buffered Mode ( Small Files )
  3. Memory Mapped Mode ( Large Files)

The code given below is pseudocode, but a similar working implementation for a small part of the FITS module is given as the solution for the competency test; kindly refer to it. The actual implementation may vary a little, and enforcing this kind of interface requires some changes in the main codebase.

In order to access the primary HDU, the user simply writes:

auto primary_hdu = get<primary_hdu>(fits[0]);
// In case the user doesn't know the indexes of the HDUs
auto primary_hdu = get<primary_hdu>(fits["primary_hdu"]);

After obtaining the instance, the user can do anything, such as:

auto bscale_val = get<double>(primary_hdu["BSCALE"]);
if (bscale_val.has_value()) {
    std::cout << "BSCALE : " << *bscale_val;
}

OR

auto primary_dataarr=get<something>(primary_hdu.data());

Making changes to a particular HDU is also pretty easy: just make the changes on the instance itself, i.e.

auto& primary_hdu = get<primary_hdu>(fits[0]);
auto& ascii_table = get<ascii_table>(fits[1]);

primary_hdu.insert("BSCALE", 2.0, "Set the Scaling Factor");
// Other extensions can also be modified at the same time
ascii_table.insert("SOMETHING", SOMEVALUE, "Comment");
ascii_table.setData(new_data_array); // The new data array is basically the data serialized to a string.

// After all the changes have been made
fits.writeToFile(filename); // How this works is explained below

Implementation details (summarized)

read(): the read function takes a filename and a file mode as arguments and internally calls each of the parser classes (ascii_table, binary_table, primary_hdu, image_extension) to parse the FITS file.

Introducing the file mode creates a host of problems but is not dispensable either (the performance of a memory-mapped file is far better than that of normal fstreams; kindly consult the competency test solution for a benchmark result). So in order to support multiple file modes, we would have to provide a small wrapper that encapsulates the (fstream, memory_map, buffer) interfaces and provides the same calls for reading data or setting file pointers in all three (this is actually quite easy because of the design of all three interfaces).

Internally, read() functions something like this (this pseudocode only demonstrates the core algorithm; the actual implementation will differ):

read() {
    // Create an instance of primary_header (prime_header).
    // If parsing the primary header succeeds, push the prime_header object
    // into a vector of HDUs and check the value of the EXTEND keyword;
    // else return false.
    hdu_collection.push_back(prime_header);
    auto extension_present = get<bool>(prime_header["EXTEND"]);
    if (extension_present) {
        while (!reached_end_of_file) { // the 2880-byte boundary must be respected
            // Try reading the first card on a new 2880-byte boundary
            // (extensions always start on a 2880-byte boundary).
            if (keyword is XTENSION and value in list_of_recognized_extensions) {
                // Create an instance of that particular type class with the file
                // offset provided to it (parsing should begin right from XTENSION,
                // so the file pointer must be moved 80 bytes back).
                // If parsing the known extension header succeeds, push it into
                // hdu_collection; else return false.
                hdu_collection.push_back(some_known_extension_header);
                // File pointers have automatically been updated.
            } else {
                // Create an instance of the unknown_extension class. It is similar
                // to the others, but it stores only header information and tells
                // us how many bytes to skip.
                // If parsing the header succeeds, get the number of bytes to skip
                // to the next extension:
                auto no_of_bytes_to_skip = unknown_extension.no_of_bytes_to_skip();
                // no_of_bytes_to_skip() is computed as
                // (ABS(BITPIX) * GCOUNT * (PCOUNT + NAXIS1 * NAXIS2 * ... * NAXISm)) / 8
                // Push the unknown extension onto the HDU collection; based on the
                // number of bytes to skip, store the data array as a string buffer only.
                hdu_collection.push_back(unknown_extension);
                // Advance the file position by no_of_bytes_to_skip and continue
                // to the next iteration.
            }
        }
    }
}

operator[](int index) or operator[](std::string hdu_name)

This function returns a reference to the particular HDU class object that the user wants to access or make changes to.

Usage:

auto& prime_header = get<prime_header>(fits[0]); // Internally: { return hdu_collection[index]; }
// Now the user can do anything with this instance.

// OR

auto ascii_table = get<ascii_table>(fits["ascii_table"]);
// Internally: traverse the entire vector, get the name of each stored HDU and
// check whether it is the one we want. (The cost can be reduced by using
// find_if with a visitor pattern and a lambda to get the HDU name.)

writeToFile(filename)

The writeToFile() function writes the updated data to a new file (data can be written back to the original file, but that can cause some problems). There are two basic ways to implement this function in the FITS module:

  • If the cards are stored internally as chunks of 80-byte strings (kindly refer to the "Improvements with the Existing Code Base" section below to find out why that is bad), then writing becomes pretty easy: iterate over the hdu_collection and, for each HDU, serialize the header and data as strings and write them to the file (no problems with unknown extensions, as unknown_extension class objects already store a serialized version of their data).
  • If the cards are stored as keyword-value pairs, then the changes to each HDU are recorded in its respective scheduled_vector (which stores an offset together with the card prepared as an 80-byte chunk, or a serialized string for data). The original file is then opened again and copied byte by byte; whenever a recorded offset from any HDU's scheduled vector is reached, the scheduled data is written instead of the corresponding bytes from the file. (This approach is a bit slower than the previous one.)

Improvements with the Existing Code Base

After reading through the codebase of boost::astronomy::io, I found a few things that could be improved in the library.

Use of boost::spirit::qi instead of boost::lexical_cast for conversion of value strings to their respective types

boost::lexical_cast is part of the Boost conversion framework, but it is too slow for our needs compared to boost::spirit::qi::parse.

This is because boost::lexical_cast internally uses streams to handle custom types (its specification for custom types requires that, and streams by their nature carry a significant amount of overhead). Also, boost::lexical_cast provides a lot of features that are of no use to us but carry significant overhead even for simple types such as int.

Also, we are always converting a string to some type, which literally means we are parsing a value. For this, Boost provides a special-purpose parser-building library, boost::spirit::qi; one of its functions, parse, takes linear time with negligible overhead for processing numeric values. Being bare-bones in nature, it is the fastest in town, with a considerable performance difference over lexical_cast.

Strict Conformance to the FITS standard

Currently the codebase of the FITS module does not conform to the FITS standard in many places, such as testing whether the keywords are in the correct format (8-byte chunks are copied but are not checked for conformance to the standard; this even introduces a bug, which is described later) or checking whether the required keywords are present in order. Although this does not really matter for most applications, it can be a significant problem if someone is trying to write a FITS validator application: there is no way to confirm that the file follows the FITS standard. This can be largely improved by a change in design, i.e. using a parsing policy instead of baking the rules right into the code.

Use of a parsing policy instead of hard-coding the rules

Currently the parsing rules are baked into the FITS module. The FITS standard is a pretty old standard, and with time many changes and features may be introduced that the parser will need to support. One way is to keep updating the codebase for newer changes, but can there be a better solution? A good solution, I feel, would be a parsing policy that can either be written by the user or provided by Boost itself. In simple terms, a parsing policy defines the rules for how the file is to be parsed. It is like an interface, except that it is enforced at compile time rather than at runtime, with zero performance overhead. This will serve not only Boost but also companies that want to build custom proprietary formats based on the original FITS standard. We will provide a default policy that conforms strictly to the standard, and others can write their own if necessary. This provides an immense amount of flexibility, which would ultimately increase the user base of the library (since custom parsing rules, such as parsing values with some other library for better performance, are supported without hurting performance), and it would be a small step toward keeping Boost the gold standard of C++ libraries.

To better understand this design pattern and its advantages kindly refer to the competency test solution.

Parts of design that can be changed should be allowed to change

Parsing values when they are read and storing them, instead of storing 80-byte chunks as cards

Currently the FITS module stores its cards as 80-byte strings in a vector (without any checks). There are some problems associated with this in terms of bugs, time, and space.

Bug :

Every time, 80 bytes are read and pushed into the vector without any checking. This introduces a bug: if the card is blank, i.e. it contains just a comment or nothing else, those 80 bytes are also pushed into the vector. Calling the key() function would then just return 8 blanks. In the worst case, key() can even return part of the comment if the comment starts before the 8th position, which can be misinterpreted as a keyword (due to the lack of checks) and in turn provide false information. It also wastes an additional 80 bytes for no reason. Blank cards must be handled separately and should not be stored at all.

Space Problems :

Every card requires 80 bytes, but is that really necessary? What I am trying to say is that instead of storing 80-byte chunks, we can parse the keyword and value separately and then store them in an unordered map.

Now the obvious problem is how to save a value of different types in a single variable. The low-cost, compile-time solution is to use std::variant. The value type would look something like this:

typedef std::variant<std::monostate, bool, long long, double, std::string, std::complex<long long>, std::complex<double>> value_type;

Although the type looks large (a variant is a type-safe union, and the memory required by a union is the size of its largest member), the actual size of a value_type object is just 40 bytes on both MSVC and gcc. The real size should have been around 24 bytes, but it is larger due to the small-buffer optimization inside std::string (a 32-byte buffer).

Now comparing the size differences:

(SBO enabled)

  • Current case: 80n (where n is the number of cards)
  • With variant: (32 (keyword) + 40 (value)) n = 72n
  • Difference: 8 bytes per card

(SBO disabled - I don't know of an implementation that allows this; we would have to use a string class without SBO)

  • Current case: still 80n
  • With variant: (8 (keyword) + 24 (value)) n = 32n
  • Difference: 48 bytes per card!

This problem can also be solved without the use of variant: after trimming the spaces and comments (optional) from the card, we can use the temporary-swap idiom (constant complexity) or shrink_to_fit() (complexity unspecified) to optimize the space:

std::string(card).swap(card); // The card's capacity now becomes length()
// or
card.shrink_to_fit(); // Does the same thing in some implementations

But the spaces within the card are still not accounted for, so it may still be large.

Time :

Both of the problems stated above are not as critical as this one. Every time value_of() is called, the respective card is fetched from the container in constant time (using an unordered_map); up to here it is fine. But then the code calls lexical_cast to convert the value to the required type, and this is done every time value_of() is called, even if the same keyword is looked up multiple times. We could cache the values once parsed, but that would again require extra space, and in total it would be quite large.

Moreover, if this function is inline (which it will be in most cases), a significant amount of code bloat is incurred as well.

Why parse the same value again and again if it can be parsed once and then stored? Then the lookup cost becomes very small (orders of magnitude smaller).

This is why I suggest parsing and storing the values during reading itself. One question that comes to mind is how the type of the value will be detected for each keyword. Simple: we won't detect it; rather, we will try parsing the value as each of the types in a fixed order (bool, long long, std::complex<long long>, double, std::complex<double>, and finally std::string as the fallback). It may seem costly, but it actually isn't: the average times are low, and if the parsing is done in this order with boost::spirit::qi::parse, the results of a previous attempt can be reused for the next type. (The competency test solution does not exploit this advantage yet, but a future version will.)

Another advantage (which I cannot prove as of now) is that the same kind of operations (read card, parse keyword, parse value) are repeated again and again, which means the CPU can cache them effectively, and that should improve the average value-parsing time.

Kindly refer to the competency test solution for a basic implementation of this style of parsing with a parsing policy (the value_type is not baked into the code but is exposed by the policy class, making it highly flexible).

Proposed Milestones and Schedule

In the time duration of 90 days (approximate):

May: Improving the design of the FITS module and writing internal documentation for the previous part of the project.

June: Writing the complete external API along with its documentation, and starting to test the newly written code.

July: Performing benchmarks and profiling of the new features and improving the hotspot regions with more optimized code. Alongside that, full library testing will take place by exercising the code with a significant number of FITS files (both with and without conformance to the FITS standard).

Programming Competency

Problem Statement

Develop a small class/function that will be able to read the primary header (plus points for developing ways to enter header values in the primary header and write them to the file).

Solution

Kindly visit Fits Reader (the GitHub repo that hosts the solution to the programming competency test) for instructions and implementation details on how to use the Fits Reader API (README).

To summarize, the Fits Reader API can:

  • Read the primary header or any other extension header of a FITS file
  • Write updated primary header data back to the FITS file
  • Support custom parsing policies that allow the programmer to specify his own set of parsing rules for FITS (this is the very reason all the different types of headers can be parsed and stored using this API)

Some parts of the library have not been optimized completely but will be in the next revision.
