This README file contains information about the script
to generate the tidy dataset from the UCI HAR data set.
You must have an unaltered, unzipped copy of the UCI HAR data in your
working directory. Thus your working directory must contain a subdirectory
UCI HAR Dataset.
The run_analysis.R script produces a csv file
(see CodeBook.md) in the working directory.
How to run
- Make sure the UCI data set is in your working directory
- At the R prompt, type
Here, path is the path where the script is located. If it is in the working directory, simply type
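As a minimal sketch of the step above, assuming the script is loaded with R's `source()` function: the variable names `script_dir` and `script` are hypothetical, and the existence checks are an added convenience rather than part of the original instructions.

```r
# Hypothetical location of the script; "." means the working directory.
script_dir <- "."
script <- file.path(script_dir, "run_analysis.R")

# The script expects the unzipped data directory inside the working directory.
if (file.exists(script) && dir.exists("UCI HAR Dataset")) {
  source(script)
} else {
  message("Script or 'UCI HAR Dataset' directory not found in: ", getwd())
}
```

If the script lives elsewhere, only `script_dir` needs to change; the data directory must still sit in the working directory.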
UCI HAR Data Set
Here, the UCI HAR data set is briefly described, to facilitate
understanding of the script described below.
The UCI HAR data set consists of two separate data sets: the training set and the test set, which are structurally the same but differ in the number of data points. The training set has 7352 data points while the test set has 2947. Each data point consists of a subject ID, an activity label, and 561 measurements derived from the raw sensor data. Note that the raw data itself (which lies in the 'Inertial Signals' directories) is not used to construct the tidy data set. The components of a data point are described below.
The first data point component is the subject ID. There are 30 subjects
and the subject ID is an integer from 1 to 30. This data is stored in
The next component of a data point is the activity. There are six possible activities as listed below:
- 1: WALKING
- 2: WALKING_UPSTAIRS
- 3: WALKING_DOWNSTAIRS
- 4: SITTING
- 5: STANDING
- 6: LAYING
The activity data is stored in the files
These files contain the activity index (1 - 6) for each data point. The
mapping from activity index to activity label, as shown in the list
above, is stored in the file
The bulk of each data point consists of the 561 measurements. These
measurements are described in the file
features_info.txt and a listing
of the measurements is in the file
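To make the structure of a data point concrete, here is a small sketch that assembles a toy data set from its three components and translates activity indices into labels. The values are made up, only three measurements stand in for the 561, and all variable names are hypothetical; with the real data, each component would first be read from its file (for example with `read.table()`).

```r
# Toy stand-ins for the three components of a data set (values are made up).
subject  <- data.frame(subject = c(1L, 1L, 2L))
activity <- data.frame(activity = c(1L, 4L, 6L))
measurements <- data.frame(
  V1 = c(0.28, 0.27, 0.29),
  V2 = c(-0.02, -0.01, -0.03),
  V3 = c(-0.13, -0.12, -0.11)
)

# UCI HAR activity labels, in index order 1 to 6.
activity_labels <- c("WALKING", "WALKING_UPSTAIRS", "WALKING_DOWNSTAIRS",
                     "SITTING", "STANDING", "LAYING")

# One row per data point: subject ID, activity index, then the measurements.
data_set <- cbind(subject, activity, measurements)

# Translate each activity index into its label.
data_set$activity <- activity_labels[data_set$activity]
print(data_set)
```

The training and test sets would each be assembled this way, which is why they share the same column layout.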
Description of run_analysis.R
run_analysis.R was used to generate the tidy data set. Below
is a description of the script's major processing steps.
- Training and test data are merged. Thus, a set with 7352+2947=10299 data points is obtained.
- The 561 measurements are reduced to 86 measurements.
In general, all means and standard deviations are kept, and all other
measurements are discarded. More specifically, any measurement that
is a mean or standard deviation of some quantity is kept. These
include the measurements whose names end in
std(). Also, there are seven angle measurements which are angles between various mean quantities. These measurements are also kept.
- The measurement names are cleaned up. Parentheses are discarded, hyphens are converted to dots, etc.
- Activity indices are replaced with cleaned-up activity labels (lower case with spaces instead of underscores).
- Each of the 86 retained measurements is averaged by subject and activity. Thus, we are left with 30 x 6 = 180 values of each of the 86 measurements.
- This mean data is written to a csv file with 180 rows and 88 columns (subject, activity, and the 86 averaged measurements).
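
The processing steps above can be sketched on toy data as follows. This is an illustration in base R, not the actual contents of run_analysis.R: the helper `make_set`, the made-up values, the case-insensitive `mean|std` pattern, and the temporary output file are all assumptions chosen to keep the example small and runnable.

```r
# Toy training and test sets with made-up values; only the structure matters.
make_set <- function(subjects, activities) {
  n <- length(subjects)
  data.frame(
    subject  = subjects,
    activity = activities,
    `tBodyAcc-mean()-X` = seq(0.1, by = 0.1, length.out = n),
    `tBodyAcc-std()-X`  = seq(1, by = 1, length.out = n),
    `tBodyAcc-max()-X`  = seq(10, by = 10, length.out = n),
    check.names = FALSE
  )
}
train <- make_set(c(1L, 1L, 2L, 2L), c(1L, 1L, 2L, 2L))
test  <- make_set(c(3L, 3L), c(1L, 1L))

# Step 1: merge training and test data.
merged <- rbind(train, test)

# Step 2: keep the ID columns plus every mean / standard deviation measurement.
keep <- grepl("mean|std", names(merged), ignore.case = TRUE)
keep[1:2] <- TRUE
reduced <- merged[, keep]

# Step 3: clean up names (drop parentheses, convert hyphens to dots).
names(reduced) <- gsub("-", ".", gsub("[()]", "", names(reduced)))

# Step 4: replace activity indices with lower-case labels, spaces for underscores.
labels <- c("WALKING", "WALKING_UPSTAIRS", "WALKING_DOWNSTAIRS",
            "SITTING", "STANDING", "LAYING")
reduced$activity <- tolower(gsub("_", " ", labels[reduced$activity]))

# Step 5: average each retained measurement by subject and activity.
tidy <- aggregate(. ~ subject + activity, data = reduced, FUN = mean)

# Step 6: write the result to a csv file (a temporary file in this sketch).
out_file <- tempfile(fileext = ".csv")
write.csv(tidy, out_file, row.names = FALSE)
```

Here the toy data has three subject/activity combinations, so `tidy` has three rows; on the full data set the same steps would give the 180 rows described above.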