Created: 2014-05-19 15:17
Updated: 2014-05-20 23:50



This README file contains information about the script run_analysis.R, used to generate the tidy dataset from the UCI HAR data set.


You must have an unaltered, unzipped copy of the UCI HAR data in your working directory. Thus your working directory must contain a subdirectory called UCI HAR Dataset.


The run_analysis.R script produces a file called tidy_dataset.csv (described in CodeBook.md) in the working directory.

How to run

  1. Make sure the UCI data set is in your working directory
  2. At the R prompt, type source("path/run_analysis.R"), where path is the path where the script is located. If it is in the working directory, simply type source("run_analysis.R")
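
A minimal sketch of the two steps above, wrapped in a helper so the script is only sourced when the data directory is present. The function name run_if_ready is hypothetical and is not part of run_analysis.R; the default directory and script names are the ones given in this README.

```r
# Hypothetical helper (not part of run_analysis.R): source the script only
# if the UCI HAR data directory exists in the current working directory.
run_if_ready <- function(script = "run_analysis.R",
                         data_dir = "UCI HAR Dataset") {
  if (!dir.exists(data_dir)) {
    message("Directory '", data_dir, "' not found in ", getwd())
    return(invisible(FALSE))
  }
  source(script)   # runs the analysis script
  invisible(TRUE)
}
```

Calling run_if_ready() from the working directory would then either run the script or report that the data directory is missing.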

UCI HAR Data Set

This section briefly describes the UCI HAR data set to make the run_analysis.R script easier to follow.

The UCI HAR data set consists of two separate data sets, the training set and the test set, which are structurally the same but differ in the number of data points. The training set has 7352 data points while the test set has 2947 points. Each data point consists of a subject ID, an activity label, and 561 measurements derived from the raw sensor data. Note that the raw data (which lies in the 'Inertial Signals' directories) is not used to construct the tidy data set. Each component of a data point is described below.

Subject ID

The first data point component is the subject ID. There are 30 subjects and the subject ID is an integer from 1 to 30. This data is stored in the files subject_test.txt and subject_train.txt.

Activity Labels

The next component of a data point is the activity. There are six possible activities, listed below:

  1. WALKING
  2. WALKING_UPSTAIRS
  3. WALKING_DOWNSTAIRS
  4. SITTING
  5. STANDING
  6. LAYING

The activity data is stored in the files y_test.txt and y_train.txt. These files contain the activity index (1 - 6) for each data point. The mapping from activity index to activity label, as shown in the list above, is stored in the file activity_labels.txt.
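
As an illustration of the index-to-label mapping, the sketch below builds the lookup table inline (with the six labels found in activity_labels.txt) and applies it to a few made-up activity indices. The variable names and the sample indices are assumptions for the example only; in practice both would be read from the files named above.

```r
# Lookup table with the same content as activity_labels.txt.
# In practice it could be read with something like:
#   read.table("UCI HAR Dataset/activity_labels.txt",
#              col.names = c("index", "label"))
activity_labels <- data.frame(
  index = 1:6,
  label = c("WALKING", "WALKING_UPSTAIRS", "WALKING_DOWNSTAIRS",
            "SITTING", "STANDING", "LAYING"),
  stringsAsFactors = FALSE
)

# Made-up activity indices, standing in for the contents of y_test.txt
y <- c(5, 5, 1, 3, 6)

# Replace each index by its label
labels <- activity_labels$label[match(y, activity_labels$index)]
print(labels)
```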


Measurements

The bulk of each data point consists of the 561 measurements. These measurements are described in the file features_info.txt and a listing of the measurements is in the file features.txt.
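
The three components of a data point live in three separate files per set, so they have to be bound together column-wise. The sketch below shows one way this could be done. The function name load_set is hypothetical, and the layout it assumes (a test or train subdirectory holding subject_<set>.txt, y_<set>.txt and a measurement file X_<set>.txt) is an assumption about the unzipped data, not a description of what run_analysis.R does.

```r
# Hypothetical loader: read subject IDs, activity indices and measurements
# for one set ("test" or "train") and bind them into one data frame.
load_set <- function(root, set) {
  # Builds e.g. <root>/test/subject_test.txt
  path <- function(stem) file.path(root, set, paste0(stem, "_", set, ".txt"))

  subject  <- read.table(path("subject"), col.names = "subject")
  activity <- read.table(path("y"), col.names = "activity")
  measures <- read.table(path("X"))   # one column per measurement (V1, V2, ...)

  cbind(subject, activity, measures)
}
```

With the real data in place, load_set("UCI HAR Dataset", "test") would be expected to return 2947 rows and 563 columns (subject, activity, and the 561 measurements).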

Description of run_analysis.R

The script run_analysis.R was used to generate the tidy data set. Below is a description of the script's major processing steps.

  1. Training and test data are merged. Thus, a set with 7352+2947=10299 data points is obtained.
  2. The 561 measurements are reduced to 86 measurements. In general, all means and standard deviations are kept, and all other measurements are discarded. More specifically, any measurement that is a mean or standard deviation of some quantity is kept. These include the measurements whose names contain mean() or std(). Also, there are seven angle measurements which are angles between various mean quantities. These measurements are also kept.
  3. The measurement names are cleaned up. Parentheses are discarded, hyphens are converted to dots, etc.
  4. Activity indices are replaced with cleaned-up activity labels (lower case with spaces instead of underscores).
  5. Each of the 86 retained measurements is averaged by subject and activity. Thus, we are left with 30 x 6 = 180 values of each of the 86 measurements.
  6. The averaged data is written to a CSV file with 180 rows and 88 columns: subject, activity, and the 86 averaged measurements (described in CodeBook.md).
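
The six steps above can be sketched on a small synthetic data set. This is an illustration under stated assumptions, not the code of run_analysis.R: the toy data (3 subjects, 2 activities, 3 measurements), the helper make_set, the case-insensitive "mean|std" pattern, and the use of aggregate() are all choices made for this example. The output is written to a temporary directory so an existing tidy_dataset.csv is not overwritten.

```r
# Toy stand-in for the training and test sets (all values are made up).
make_set <- function(reps, offset) {
  grid <- expand.grid(subject = 1:3, activity = 1:2)
  ids  <- grid[rep(seq_len(nrow(grid)), times = reps), ]
  n    <- nrow(ids)
  data.frame(
    subject  = ids$subject,
    activity = ids$activity,
    "tBodyAcc-mean()-X" = offset + seq_len(n) / 10,
    "tBodyAcc-std()-X"  = offset + seq_len(n) / 20,
    "tBodyAcc-max()-X"  = offset + seq_len(n),
    check.names = FALSE   # keep the raw feature names unchanged
  )
}
train <- make_set(reps = 3, offset = 0)   # 18 rows
test  <- make_set(reps = 1, offset = 1)   # 6 rows

# Step 1: merge training and test data
merged <- rbind(train, test)

# Step 2: keep only mean and standard deviation measurements
# (assumed selection rule: name matches "mean" or "std", ignoring case)
id_cols      <- c("subject", "activity")
measure_cols <- grep("mean|std", names(merged), ignore.case = TRUE, value = TRUE)
reduced      <- merged[, c(id_cols, measure_cols)]

# Step 3: clean up names (drop parentheses, hyphens become dots)
names(reduced) <- gsub("-", ".", gsub("[()]", "", names(reduced)), fixed = TRUE)

# Step 4: replace activity indices with cleaned-up labels
raw_labels <- c("WALKING", "WALKING_UPSTAIRS")   # only two in this toy example
clean_labels <- tolower(gsub("_", " ", raw_labels, fixed = TRUE))
reduced$activity <- clean_labels[reduced$activity]

# Step 5: average every retained measurement by subject and activity
tidy <- aggregate(. ~ subject + activity, data = reduced, FUN = mean)

# Step 6: write the result to a CSV file
out_file <- file.path(tempdir(), "tidy_dataset.csv")
write.csv(tidy, out_file, row.names = FALSE)
```

In this toy run, tidy has 3 x 2 = 6 rows and 2 + 2 = 4 columns, mirroring the 30 x 6 = 180 rows and 2 + 86 = 88 columns of the real output.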