Field study techniques produce enormous amounts of data—a problem referred to as an
“attractive nuisance” (Miles, 1979). The purpose of this data is to provide insight into
the phenomenon being studied. To meet this goal, the body of data must be reduced to
a comprehensible format. Traditionally, this is done through a process of coding. That
is, using the goals of the research as a guide, a scheme is developed to categorize the
data. These schemes can be quite high level. For instance, a researcher may be interested
in noting all goals stated by a software engineer during debugging. On the other
hand the schemes can be quite specific. A researcher may be interested in noting how
many times grep was executed in a half-hour programming session. Once coded, the
data is usually independently coded by a second researcher to check the reliability of the coding scheme.
This is called inter-coder or inter-rater reliability. A number of statistics can
be reported to assess it; the most common is Cohen's kappa.
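To make the idea concrete, the sketch below (plain Python, with entirely hypothetical coders and category labels) shows one way Cohen's kappa can be computed from two coders' category assignments over the same set of items; it is an illustration of the statistic, not a prescribed procedure.

```python
from collections import Counter

def cohens_kappa(coder_a, coder_b):
    """Cohen's kappa for two coders' categorical judgments on the same items."""
    assert len(coder_a) == len(coder_b)
    n = len(coder_a)
    # Observed agreement: proportion of items both coders labelled identically.
    p_observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Expected chance agreement, from each coder's marginal label frequencies.
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    p_expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (p_observed - p_expected) / (1 - p_expected)

# Hypothetical codes assigned to ten utterances in a debugging session.
coder_a = ["goal", "tool", "goal", "other", "goal", "tool", "goal", "other", "tool", "goal"]
coder_b = ["goal", "tool", "goal", "goal",  "goal", "tool", "other", "other", "tool", "goal"]
print(round(cohens_kappa(coder_a, coder_b), 2))  # 0.68 for this hypothetical data
```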
Audio and videotape records are usually transcribed before categorization,
although transcription is not always necessary. It requires significant cost
and effort, and may not be justified for small, informal studies. Even when the
decision to transcribe has been made, obtaining an accurate transcription is challenging. A trained
transcriber can take up to 6 hours to transcribe a single hour of tape (even longer
when gestures, etc. must be incorporated into the transcription). An untrained transcriber
(especially in technical domains) can do such a poor job that it takes
researchers just as long to correct the transcript. While transcribing has its problems,
online coding of audio or videotape can also be quite time-consuming, as it can take
several passes to produce an accurate categorization. Additionally, if a question surfaces
later, it will be necessary to listen to the tapes again, requiring more time.
Once the data has been categorized, it can be subjected to a quantitative or qualitative
analysis. Quantitative analyses can be used to provide summary information
about the data, such as, on average, how often grep is used in debugging sessions.
Quantitative analyses can also determine whether particular hypotheses are
supported by the data, such as whether high-level goals are stated more frequently
in development than in maintenance.
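As a minimal illustration of the summary-statistics side, the Python sketch below (using entirely hypothetical session identifiers and code labels) aggregates coded events into per-session grep counts and reports the mean; it is only meant to show how coded data can be turned into summary figures.

```python
from collections import Counter
from statistics import mean

# Hypothetical coded events: (session id, code) pairs produced during coding.
coded_events = [
    (1, "grep"), (1, "goal"), (1, "grep"),
    (2, "goal"), (2, "grep"),
    (3, "grep"), (3, "grep"), (3, "grep"), (3, "goal"),
]

# Summary: how often grep was executed in each session, and on average.
grep_per_session = Counter(s for s, code in coded_events if code == "grep")
sessions = {s for s, _ in coded_events}
counts = [grep_per_session.get(s, 0) for s in sorted(sessions)]
print("grep uses per session:", counts)           # [2, 1, 3]
print("mean grep uses per session:", round(mean(counts), 2))  # 2.0
```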
When choosing a statistical analysis method, it is important to know whether
your data is consistent with the assumptions made by the method. Traditional inferential
statistical analyses are only applicable in well-constrained situations. The type of
data collected in field studies often requires nonparametric statistics. Nonparametric
statistics are often called “distribution-free” because they do not make the same
assumptions about the underlying distribution that parametric statistics do. Additionally,
there are many nonparametric tests based on simple rankings, as opposed to strict
numerical values. Finally, many nonparametric tests can be used with small samples.
For more information about nonparametric statistics, Siegel and Castellan (1988)
provide a good overview. Briand et al. (1996) discuss the disadvantages of nonparametric
statistics versus parametric statistics in software engineering; they point out
that a certain amount of violation of the assumptions of parametric statistics is legitimate,
but that nonparametric statistics should be used when there are extreme violations
of those assumptions, as there may well be in field studies.
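As one illustrative example of a rank-based test in this spirit, the sketch below uses SciPy's Mann-Whitney U test to compare hypothetical per-session counts of stated high-level goals in development versus maintenance sessions; both the counts and the choice of this particular test are assumptions for illustration, not drawn from any actual study.

```python
from scipy.stats import mannwhitneyu

# Hypothetical counts of high-level goal statements per observed session,
# split by whether the session was development or maintenance work.
development = [9, 12, 7, 15, 11, 8]
maintenance = [4, 6, 3, 7, 5, 2]

# Mann-Whitney U is rank-based, so it does not assume normality and
# tolerates the small samples typical of field studies.
stat, p_value = mannwhitneyu(development, maintenance, alternative="two-sided")
print(f"U = {stat}, p = {p_value:.3f}")
```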
Qualitative analyses do not rely on quantitative measures to describe the data.
Rather, they provide a general characterization based on the researchers’ coding
schemes. Again, the different types of qualitative analysis are too complex to detail
in this paper. See Miles and Huberman (1994) for a very good overview.
Both quantitative and qualitative analysis can be supported by software tools. The
most popular tools for quantitative analysis are SAS and SPSS. A number of different
tools exist for helping with qualitative analysis, including NVivo, Atlas.ti, and
the Noldus Observer. Some of these tools also help with analysis of video recordings.
In summary, the way the data is coded will affect its interpretation and the possible
courses for its evaluation. Therefore it is important to ensure that coding schemes
reflect the research goals. They should tie in to particular research questions.
Additionally, coding schemes should be devised with the analysis techniques in mind.
Again, different schemes will lend themselves to different evaluative mechanisms.
However, one way to overcome the limitations of any one technique is to look at the
data using several different techniques (such as combining qualitative and quantitative
analyses). A triangulation approach (Jick, 1979) will allow for a more accurate
picture of the studied phenomena. Bratthall and Jørgensen (2002) give a very nice
example of using multiple methods for data triangulation. Their example is framed in
a software engineering context examining software evolution and development. In fact,
many of the examples cited earlier use multiple methods to triangulate their results.
As a final note, with any type of analysis technique, it is generally useful to go
back to the original participant population to discuss the findings. Participants can
tell researchers whether they believe an accurate portrayal of their situation has
been achieved. This, in turn, can let researchers know whether they used appropriate
coding schemes and analysis techniques.