ECE-1021

HOMEWORK #7

(Last Mod: 27 November 2010 21:38:39 )



Program A: Data Analysis, Sorting, and Searching

Download the program randgen.c from the Example Code page and run it. The program asks for the name of a text file, the number of data values to write to the file, and the minimum and maximum values to generate. It then writes that many data values to the file.

You are to write a program that asks the user for a file name, opens that file, and prints out the following information about the data in the file:

DATA FILE: mydata.dat

 

NUMBER OF PTS:      2000

MINIMUM VALUE:  1.283944

MAXIMUM VALUE: 77.387448

 

MEAN:  23.127845   STD DEV: 5.35867

 

sigmas    lower    upper    combined

  0.5     17.78%   13.29%    30.97%

  1.0     32.80%   29.12%    61.92%

The table at the end of the printout above lists the percentage of the total data that lies within the indicated number of standard deviations (sigmas) of the mean. In the example, 17.78% of the values fall between (mean - 0.5*sigma) and the mean, while 13.29% fall between the mean and (mean + 0.5*sigma). The combined value is the percentage of the values that lie within that many standard deviations of the mean on either side, and hence is the sum of the lower and upper values. Your table should print data for 0.5 to 3.0 sigmas.

For simplicity's sake, you may assume that the data file will contain values less than 1000 and no smaller than 0.001 (you control this yourself when you generate your data file). Plan your output accordingly and assume that the grader will use data sets at both ends of that range. It is acceptable to simply print all data values to six decimal places. Percentages should be reported to two decimal places.
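As a rough sketch of the input and formatting side, the fragment below reads values into a growable array and formats numbers to the stated precision. It assumes, without having checked randgen.c, that the data file holds whitespace-separated decimal values; the function names and the field widths (10 characters for values, 6 for percentages) are illustrative choices, not requirements.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Read every value from `in` into a heap array. Returns the array (caller
   frees it) and stores the count in *n; returns NULL if allocation fails.
   ASSUMPTION: values are whitespace-separated decimal numbers. */
double *read_values(FILE *in, size_t *n)
{
    assert(in != NULL && n != NULL);
    size_t cap = 1024, count = 0;
    double *data = malloc(cap * sizeof *data);
    if (data == NULL) return NULL;

    double v;
    while (fscanf(in, "%lf", &v) == 1) {
        if (count == cap) {                 /* grow the array when full */
            double *grown = realloc(data, 2 * cap * sizeof *data);
            if (grown == NULL) { free(data); return NULL; }
            data = grown;
            cap *= 2;
        }
        data[count++] = v;
    }
    *n = count;
    return data;
}

/* Format a data value to six decimal places in a 10-character field,
   wide enough for 999.999999. Returns the number of characters written. */
int format_value(char *buf, size_t size, double v)
{
    return snprintf(buf, size, "%10.6f", v);
}

/* Format a percentage to two decimal places followed by a percent sign. */
int format_percent(char *buf, size_t size, double pct)
{
    return snprintf(buf, size, "%6.2f%%", pct);
}
```

A fixed field width keeps the columns aligned whether the data sits near 0.001 or near 1000.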

After you have printed out the above information, generate a histogram from the data. The number of bins and the bin width should be reasonable for the data set. There should be an odd number of bins, and the mean should be located in the middle of the center bin. Since you need to use an array to store the contents of each bin, you may place a reasonable cap of 100 on the maximum number of bins.

Your histogram should start with the first bin that has data and continue until the last bin that has data is displayed. Bins outside of this range should not be displayed at all. Empty bins within this range must be displayed.

The following is an example format. Note that the values used are random junk; it is the format that is important.

HISTOGRAM

  BINS:  15

  WIDTH: 0.456532

  %/*: 0.48%

  MAX %: 24.23%

======================================================================

9999.020435 to 9999.020675 |****

9999.020435 to 9999.020675 |*******

9999.020435 to 9999.020675 |***********

9999.020435 to 9999.020675 |*************

9999.020435 to 9999.020675 |*********************

9999.020435 to 9999.020675 |********************

9999.020435 to 9999.020675 |*****************************************

9999.020435 to 9999.020675 |*********************************

9999.020435 to 9999.020675 |*******************************

9999.020435 to 9999.020675 |************************************

9999.020435 to 9999.020675 |**********************

9999.020435 to 9999.020675 |******************

9999.020435 to 9999.020675 |************

9999.020435 to 9999.020675 |***

9999.020435 to 9999.020675 |****

======================================================================

 

The %/* value is how many percentage points each asterisk on a line represents. This number should be chosen so that the largest bin uses close to the maximum number of asterisks that can be printed without causing the line to wrap. The above format would permit approximately fifty asterisks to be used.
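The scaling described above might look like the following sketch. The constant MAX_STARS and the function names are illustrative; the value 50 simply follows the "approximately fifty asterisks" estimate, and rounding to the nearest asterisk is an assumption.

```c
#include <assert.h>
#include <stdio.h>

#define MAX_STARS 50   /* ASSUMPTION: about fifty asterisks fit on a line */

/* Percentage points represented by one asterisk, chosen so that the
   fullest bin uses exactly MAX_STARS asterisks. */
double percent_per_star(double max_pct)
{
    assert(max_pct > 0.0);
    return max_pct / MAX_STARS;
}

/* Number of asterisks for a bin holding pct percent of the data,
   rounded to the nearest whole asterisk and capped at MAX_STARS. */
int star_count(double pct, double per_star)
{
    assert(per_star > 0.0 && pct >= 0.0);
    int stars = (int)(pct / per_star + 0.5);
    return stars > MAX_STARS ? MAX_STARS : stars;
}

/* Print one histogram line: the bin limits, a bar separator, then the stars. */
void print_bar(FILE *out, double lo, double hi, int stars)
{
    fprintf(out, "%11.6f to %11.6f |", lo, hi);
    for (int i = 0; i < stars; i++) fputc('*', out);
    fputc('\n', out);
}
```

With a largest bin of 24.00%, percent_per_star gives 0.48, so a bin holding 12.00% of the data would be drawn with 25 asterisks.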

 

After that, your program should sort the data from minimum to maximum. You should let the user choose which sorting algorithm to use and then print out the method and how long the sort took. After that, generate a table of percentile break points such as the following:

 

SORT METHOD: Bubble Sort

SORT TIME:  32.345 seconds.

 

PERCENTILE GROUPS

 MAX    8916.981726

  99    8723.324293

  95    8237.982745

  90    7923.239423

  85    7392.029384

  80    6829.293948

  75    6072.283841

  70    5582.398384

  ...

  10    2313.349858

   5    2109.928374

   0    1928.384885

 

The 0th percentile is, by definition, the value that is greater than zero percent of the values in the data set. Hence, it is simply the lowest value in the data set. There is no such thing as the 100th percentile, as this would require a value that is larger than all values in the data set, including itself, which is impossible. It is still useful to know the absolute largest value, so that is simply called "MAX" and is printed above the 99th percentile score.
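The sorting, timing, and percentile steps could be sketched as below. Bubble sort is shown only because it appears in the sample output; the function names are hypothetical. Timing uses the standard clock() function, which measures processor time rather than wall-clock time. The percentile lookup uses index floor(p*n/100) in the sorted array, which is one of several conventions in use and is an assumption here; it yields the minimum for p = 0 and never runs past the last element for p up to 99.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Bubble sort, ascending; stops early once a pass makes no swaps. */
void bubble_sort(double *x, size_t n)
{
    for (size_t pass = 0; pass + 1 < n; pass++) {
        int swapped = 0;
        for (size_t i = 0; i + 1 < n - pass; i++) {
            if (x[i] > x[i + 1]) {
                double t = x[i];
                x[i] = x[i + 1];
                x[i + 1] = t;
                swapped = 1;
            }
        }
        if (!swapped) break;
    }
}

/* Run the chosen sort routine and return the processor time used, in seconds. */
double timed_sort(void (*sort)(double *, size_t), double *x, size_t n)
{
    clock_t start = clock();
    sort(x, n);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}

/* Value at percentile p (0 to 99) of an ascending array.
   ASSUMPTION: the break point is the element at index floor(p * n / 100). */
double percentile_value(const double *sorted, size_t n, unsigned p)
{
    assert(sorted != NULL && n > 0 && p < 100);
    return sorted[(size_t)p * n / 100];
}
```

Passing the sort as a function pointer means the same timing code works for whichever algorithm the user selects. The MAX row is simply sorted[n - 1].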

 

Finally, your program should ask the user for a lower and an upper limit and then print to the screen all values that lie between those limits. Values that exactly match the limits should be included. Before the printed list, you should indicate how many values were found and the percentile range of the data subset.

 

LIMIT MIN: 3400.000000

LIMIT MAX: 5800.000000

 

RANGE MIN: 3412.928384 (17 %ile)

RANGE MAX: 5766.601849 (72 %ile)

 

DATA SIZE: 763 points

 

3412.928384

....

5766.601849
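Because the data is already sorted at this point, the limit search can be done with two binary searches, as in the sketch below. This is one possible approach rather than a required one; the function names are hypothetical, and the percentile rank shown (percent of values ranked below the element, truncated to a whole number) is an assumed convention.

```c
#include <assert.h>
#include <stddef.h>

/* First index whose value is >= key in an ascending array (n if none). */
size_t lower_bound(const double *sorted, size_t n, double key)
{
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (sorted[mid] < key) lo = mid + 1;
        else hi = mid;
    }
    return lo;
}

/* First index whose value is > key in an ascending array (n if none). */
size_t upper_bound(const double *sorted, size_t n, double key)
{
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (sorted[mid] <= key) lo = mid + 1;
        else hi = mid;
    }
    return lo;
}

/* Count the values in [lim_min, lim_max], limits included, and store the
   index of the first such value in *first. */
size_t range_query(const double *sorted, size_t n,
                   double lim_min, double lim_max, size_t *first)
{
    assert(sorted != NULL && first != NULL && lim_min <= lim_max);
    *first = lower_bound(sorted, n, lim_min);
    return upper_bound(sorted, n, lim_max) - *first;
}

/* Percentile rank of the element at index i: the percent of values ranked
   below it, truncated to a whole number (ASSUMED convention). */
unsigned percentile_rank(size_t i, size_t n)
{
    assert(n > 0 && i < n);
    return (unsigned)(100 * i / n);
}
```

If the returned count is nonzero, RANGE MIN is sorted[first] and RANGE MAX is sorted[first + count - 1], and percentile_rank applied to those two indices gives the percentile range.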