
Mining in Microseconds

AdrienneDC edited this page Jan 20, 2014 · 2 revisions

###Business Problem and Example Use Case

Financial institutions incur major losses from fraudulent transactions. Criminals blend fraudulent transactions in with normal ones, which makes them difficult to detect. Catching fraud as it happens means weighing a variety of factors in real time. Financial institutions use advanced techniques such as machine learning to separate fraudulent transactions from normal ones, drawing on several kinds of data to label each transaction as normal or fraudulent. Card owner profiles are also used in conjunction with transaction behavior to spot abnormal activity. Sample fraud detection applications and data sources will be provided.
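To make the idea of combining a card owner profile with transaction behavior concrete, here is a minimal sketch in Python. It is not taken from the sample applications mentioned above. The per-card profile (a running mean and standard deviation of amounts), the z-score rule, the threshold of 3.0, and all names are assumptions chosen for illustration.

```python
import math
from collections import defaultdict


class CardProfile:
    """Running mean/variance of amounts for one card (Welford's method)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the running mean

    def update(self, amount):
        self.n += 1
        delta = amount - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (amount - self.mean)

    def zscore(self, amount):
        # With fewer than two observations there is no spread to compare against.
        if self.n < 2:
            return 0.0
        std = math.sqrt(self.m2 / (self.n - 1))
        if std == 0.0:
            return 0.0 if amount == self.mean else math.inf
        return (amount - self.mean) / std


def flag_transactions(transactions, threshold=3.0):
    """Yield (card_id, amount, is_suspicious) for each transaction, in order."""
    profiles = defaultdict(CardProfile)
    for card_id, amount in transactions:
        profile = profiles[card_id]
        suspicious = abs(profile.zscore(amount)) > threshold
        yield card_id, amount, suspicious
        # Only unflagged transactions update the profile, so a burst of
        # outliers does not quietly become the new baseline.
        if not suspicious:
            profile.update(amount)


# Hypothetical amounts for a single card; the 950.0 purchase stands out.
stream = [("card-1", a) for a in (20.0, 25.0, 22.0, 19.0, 24.0, 950.0, 21.0)]
for card_id, amount, suspicious in flag_transactions(stream):
    print(card_id, amount, "SUSPICIOUS" if suspicious else "ok")
```

A real detector would use far richer features than the amount alone. The point here is only that the profile is state kept per card and consulted for each new transaction.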

###Suggested Hacks

Apply advanced analytic models to high-volume streaming data. IBM SPSS and InfoSphere Streams are integrated so that you can learn advanced analytic models from historical data sets held in storage (databases, IBM PureData Systems, filesystems) and apply those models in real time to high-volume streaming data. This is a widely applicable advanced analytics paradigm. Can you think of new business domains where this concept can be applied and extended? Below are a variety of interesting business domains, data sources, and data sets that you can download, experiment with, and extend using the InfoSphere Streams and IBM SPSS integration described above. Refer to the individual websites for usage instructions, including citation requirements. You can download a trial version of IBM SPSS here: http://www-01.ibm.com/software/analytics/spss/downloads.html
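The "learn on historical data, apply on streaming data" pattern can be sketched in a few lines of plain Python. This does not use SPSS or InfoSphere Streams. The nearest-centroid classifier is a deliberately simple stand-in for a real model, and the feature layout `(amount, transactions_in_last_hour)`, the labels, and the sample rows are all made up for illustration.

```python
from statistics import mean


def train_centroids(rows):
    """Batch step: learn one centroid per label from (features, label) rows."""
    by_label = {}
    for features, label in rows:
        by_label.setdefault(label, []).append(features)
    return {
        label: tuple(mean(column) for column in zip(*vectors))
        for label, vectors in by_label.items()
    }


def score(model, features):
    """Return the label whose centroid is closest (squared Euclidean distance)."""
    def distance(centroid):
        return sum((f - c) ** 2 for f, c in zip(features, centroid))
    return min(model, key=lambda label: distance(model[label]))


def score_stream(model, events):
    """Streaming step: apply the already-trained model to one event at a time."""
    for features in events:
        yield features, score(model, features)


# Hypothetical historical data: (amount, transactions_in_last_hour), label.
historical = [
    ((20.0, 1.0), "normal"),
    ((35.0, 2.0), "normal"),
    ((50.0, 1.0), "normal"),
    ((900.0, 6.0), "fraud"),
    ((1200.0, 8.0), "fraud"),
]
model = train_centroids(historical)  # done once, offline

incoming = iter([(40.0, 1.0), (1000.0, 5.0)])  # stands in for a live feed
for features, label in score_stream(model, incoming):
    print(features, label)
```

The design point is the split: training sees the whole historical set at once, while scoring is a cheap per-event function that never looks back at the training data.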

InfoSphere Streams can also run analytics using open source R and other modeling tools. The InfoSphere Streams mining toolkit uses the Predictive Model Markup Language (PMML).
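PMML is an XML format, so a model exported by one tool can be read by another. The sketch below shows the general idea with Python's standard library only. It is not how the Streams mining toolkit loads models. The PMML fragment is hand-written and minimal (a `RegressionModel` with two numeric predictors), the field names and coefficients are invented, and the reader ignores everything PMML allows beyond an intercept and plain numeric coefficients.

```python
import xml.etree.ElementTree as ET

# Hand-written, minimal PMML 4.1 fragment with made-up fields and coefficients.
PMML_DOC = """<PMML version="4.1" xmlns="http://www.dmg.org/PMML-4_1">
  <Header description="hand-written example"/>
  <DataDictionary numberOfFields="3">
    <DataField name="amount" optype="continuous" dataType="double"/>
    <DataField name="txn_per_hour" optype="continuous" dataType="double"/>
    <DataField name="risk" optype="continuous" dataType="double"/>
  </DataDictionary>
  <RegressionModel functionName="regression">
    <MiningSchema>
      <MiningField name="amount"/>
      <MiningField name="txn_per_hour"/>
      <MiningField name="risk" usageType="predicted"/>
    </MiningSchema>
    <RegressionTable intercept="0.5">
      <NumericPredictor name="amount" coefficient="0.001"/>
      <NumericPredictor name="txn_per_hour" coefficient="0.25"/>
    </RegressionTable>
  </RegressionModel>
</PMML>"""

NS = {"p": "http://www.dmg.org/PMML-4_1"}


def load_linear_model(pmml_text):
    """Extract (intercept, {field: coefficient}) from a simple RegressionModel."""
    root = ET.fromstring(pmml_text)
    table = root.find("p:RegressionModel/p:RegressionTable", NS)
    intercept = float(table.get("intercept"))
    coefficients = {
        predictor.get("name"): float(predictor.get("coefficient"))
        for predictor in table.findall("p:NumericPredictor", NS)
    }
    return intercept, coefficients


def predict(model, record):
    """Score one record: intercept + sum(coefficient * field value)."""
    intercept, coefficients = model
    return intercept + sum(c * record[name] for name, c in coefficients.items())


model = load_linear_model(PMML_DOC)
print(predict(model, {"amount": 1000.0, "txn_per_hour": 2.0}))
```

For real PMML files, a dedicated scoring engine is the safer choice, since the specification covers many model types, transformations, and attributes that this reader skips.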

  1. Data sets and models created from SPSS for financial, predictive maintenance, and next best offer / web transaction domains - see data set.
  2. KDD data sets: http://www.sigkdd.org/kddcup/index.php
    •  a) Click stream and purchase data from a real retailer
    •  b) Customer relationship prediction data set
    •  c) Customer recommendations
  3. Yelp academic data set: https://www.yelp.com/academic_dataset - Data and reviews of the 250 closest businesses to 30 universities to explore and research
  4. Buzz in social media data set: http://ama.liglab.fr/datasets/buzz/
  5. Factual -- location information: http://factual.com/products#location-data - Database of over 65 million local businesses and points of interest in 50 countries, accessible via API or download
  6. A few different data sources / data hubs
    •  a) http://knoema.com/
    •  b) http://www.datawrangling.com/some-datasets-available-on-the-web
    •  c) http://aws.amazon.com/datasets
    •  d) https://opendata.socrata.com/
  7. Searching for "open data sets", "data sets for machine learning", or "open data sets for mining" yields a variety of interesting possibilities as well
  8. Streaming data sets:
    •  a) Earthquake - http://www.emsc-csem.org/service/rss/ and http://earthquake.usgs.gov/earthquakes/feed/v1.0/geojson.php
    •  b) Bitcoin - http://bcchanger.com/bitcoin-currency-price-feeds
    •  c) Air Travel - http://services.faa.gov/docs/services/airport/
    •  d) Stock Data - http://opendata.stackexchange.com/questions/862/which-real-time-open-data-apis-do-you-know
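As a starting point for the streaming sources, here is a rough Python sketch that turns a USGS earthquake GeoJSON payload into simple records. The feed URL is an assumption (one of the summary feeds described on the USGS page listed above), so check that page for the current addresses. The embedded sample payload is made up for illustration and only mimics the shape of the feed: a `FeatureCollection` whose features carry `mag`, `place`, and `time` (milliseconds since the epoch) plus `[longitude, latitude, depth]` coordinates.

```python
import json
from datetime import datetime, timezone
from urllib.request import urlopen

# Assumed feed URL; verify against the USGS feed documentation before use.
FEED_URL = "https://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/all_hour.geojson"

# Made-up events in the shape of the feed, so the parser can run offline.
SAMPLE = json.dumps({
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature", "id": "example-1",
         "properties": {"mag": 4.6, "place": "example place A", "time": 1390176000000},
         "geometry": {"type": "Point", "coordinates": [-120.5, 36.1, 8.2]}},
        {"type": "Feature", "id": "example-2",
         "properties": {"mag": 1.3, "place": "example place B", "time": 1390176600000},
         "geometry": {"type": "Point", "coordinates": [-118.2, 34.0, 3.5]}},
    ],
})


def parse_quakes(geojson_text, min_magnitude=0.0):
    """Yield one dict per event at or above min_magnitude."""
    collection = json.loads(geojson_text)
    for feature in collection["features"]:
        props = feature["properties"]
        magnitude = props.get("mag")
        if magnitude is None or magnitude < min_magnitude:
            continue
        longitude, latitude, depth_km = feature["geometry"]["coordinates"]
        yield {
            "id": feature["id"],
            "magnitude": magnitude,
            "place": props.get("place"),
            # The feed reports time in milliseconds since the Unix epoch.
            "time": datetime.fromtimestamp(props["time"] / 1000, tz=timezone.utc),
            "longitude": longitude,
            "latitude": latitude,
            "depth_km": depth_km,
        }


def fetch_quakes(url=FEED_URL, min_magnitude=0.0):
    """Download the feed once and parse it (requires network access)."""
    with urlopen(url, timeout=10) as response:
        return list(parse_quakes(response.read().decode("utf-8"), min_magnitude))


for quake in parse_quakes(SAMPLE, min_magnitude=4.0):
    print(quake["id"], quake["magnitude"], quake["time"].isoformat())
```

To treat the feed as a stream, one option is to call `fetch_quakes` on a timer and forward only events whose `id` has not been seen before.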
    