Category Archives: algorithm

How to run my personal clustering algorithm (Java code) inside Apache Spark

I have code from another student implementing a k-medoids clustering algorithm, and I need to run it inside Apache Spark. I know that Spark is based on RDDs, but the algorithm is already complete and I do not know how to convert it to use RDDs. For example, here is part of the code:

public Queue<ArrayList<Object>> getFSwoosh() {
    Queue<ArrayList<Object>> resultat = new LinkedList<>();
    Map<String, Map<String, ArrayList<Object>>> P = new HashMap<>();
    Map<String, Set<String>> N = new HashMap<>();

    ArrayList<Object> courant = null;
    ArrayList<Object> buddy;
    ArrayList<Object> doc;
    ArrayList<Object> fusion;

    // Preparation of the sets "P" and "N"
    for (RegleSimilarite regle : this.regles) {
        String feature = regle.getSimilarites().get(0).getAttribut();
        for (int i = 1; i < regle.getSimilarites().size(); i++) {
            feature += "+" + regle.getSimilarites().get(i).getAttribut();
        }
        P.put(feature, new HashMap<>());
        N.put(feature, new HashSet<>());
    }

    while (!liste.isEmpty() || courant != null) {
        if (courant == null) {
            courant = liste.remove();
        }
        buddy = null;

        // New feature values are registered: (feature, value) -> current
        for (String feature : P.keySet()) {
            String valeur = Pfv(feature, courant);
            if (P.get(feature).get(valeur) == null) {
                P.get(feature).put(valeur, courant);
            }
        }

        // If the value has already been encountered: (feature, value) -> (document != current)
        for (String feature : P.keySet()) {
            String valeur = Pfv(feature, courant);
            if (P.get(feature).get(valeur) != courant) {
                buddy = P.get(feature).get(valeur);
            }
        }

        // If we do not find a match in P, we look in I' (resultat)
        if (buddy == null) {
            int index = 0; // rule index = feature order!? (to be tested)
            for (String feature : P.keySet()) {
                String valeur_feature = Pfv(feature, courant);
                if (!N.get(feature).contains(valeur_feature)) {
                    if (!resultat.isEmpty()) {
                        Iterator<ArrayList<Object>> it = resultat.iterator();
                        while (it.hasNext()) {
                            doc =;
                            if (this.matchFSwoosh(doc, courant, index)) {
                                buddy = doc;
                            }
                        }
                    }
                    if (buddy == null) {
                        N.get(feature).add(valeur_feature); // reconstructed: record the non-matching value in N
                    }
                }
                index++; // reconstructed: next rule for the next feature
            }
        }

        if (buddy == null) {
            resultat.add(courant); // reconstructed: no match, so the current record joins the result
            courant = null;
        } else {
            fusion = this.merge(courant, buddy);
            resultat.remove(buddy); // reconstructed: the merged record replaces its buddy in the result
            for (String feature : P.keySet()) {
                for (String valeur : P.get(feature).keySet()) {
                    if (P.get(feature).get(valeur) == courant || P.get(feature).get(valeur) == buddy) {
                        P.get(feature).put(valeur, fusion);
                    }
                }
            }
            courant = fusion;
        }
    }
    return resultat;
}

Is there a way to run the code inside Spark as-is? If not, is there a way to restructure this complex code around RDDs? Any other suggestions are welcome. Thank you.
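One pragmatic option (my own suggestion, not part of the original code) is "blocking": since a pass like getFSwoosh() is inherently sequential, partition the records by a blocking key that any two matching records must share, and run the sequential algorithm independently inside each block, e.g. `rdd.groupBy(blockingKey).mapValues(localSwoosh)` in Spark. Here is a plain-Python sketch of that shape; the blocking key, the match rule, and the merge are all toy stand-ins, not the student's actual logic:

```python
# Sketch of blocking + per-block sequential dedup; all rules are illustrative.
from collections import defaultdict

def blocking_key(record):
    # Records that can possibly match must share this key; matches that
    # cross blocks are missed, so the key must be chosen conservatively.
    return record["name"][0].lower()

def local_dedup(records):
    # Stand-in for the sequential getFSwoosh() pass over one block:
    # merge records whose "name" matches case-insensitively (toy rule).
    merged = {}
    for r in records:
        key = r["name"].lower()
        if key in merged:
            merged[key] = {**merged[key], **r}  # toy merge: union of fields
        else:
            merged[key] = r
    return list(merged.values())

def run(records):
    blocks = defaultdict(list)
    for r in records:                 # rdd.groupBy(blocking_key) analogue
        blocks[blocking_key(r)].append(r)
    out = []
    for recs in blocks.values():      # .mapValues(local_dedup) analogue
        out.extend(local_dedup(recs))
    return out
```

Each block is processed independently, which is exactly what Spark can parallelize; the trade-off is that the blocking key decides which matches are even possible.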

Need to fine-tune 3 Apache Spark algorithms that will boost my Spark algorithm writing at scale

I am new to Spark and learning how to process data at scale. After going through the APIs, these are my first attempts at developing some non-trivial algorithms.

I want to know: are there any design problems with the algorithms I designed? Will they work correctly when processing data in parallel, or are there changes that would improve them?

Here is the log file, with fields Date, Time, Event, IP, User:

1/1/2000 21:00:00, LOGIN,, user1
1/1/2000 21:02:00, LOGIN,, user1
1/1/2000 21:05:00, LOGOUT,, user1
1/1/2000 21:10:00, LOGIN,, user2
1/1/2000 21:10:00, PUT,, user2
1/1/2000 21:12:00, LOGIN,, user3
1/1/2000 21:15:00, LOGOUT,, user2
1/1/2000 21:16:00, GET,, user2
1/1/2000 21:22:00, LOGOUT,, user3
1/1/2000 21:23:00, LOGOUT,, user1

Scenario 1: Return the IP that received the most distinct user logins. For example, in the above data, the IP that received the most distinct users is

Here is the algorithm that I developed:

    // Step 1
    // First read the file and create an RDD
    val EventsRDD = sc.textFile("file:///d:/serverlog")
        .map { r =>
            val EventData = r.split(",")
            (EventData(0).trim, EventData(1).trim, EventData(2).trim, EventData(3).trim)
        }
    // Now we have an RDD with each record of format (DateTime, Event, IP, User)

    // Step 2
    // Arrange data into an RDD of format (IP, (Event, User)): IP becomes the key
    val v = => (a._3, (a._2, a._4)))   // a._3 = IP, a._2 = Event, a._4 = User
        .groupByKey()                                // group by IP address
        .mapValues { a =>
            // keep only the users who produced a LOGIN event on this IP
            a.flatMap(z => if (z._1 == "LOGIN") Some(z._2) else None)
        }

    // Step 3
    // We have each IP and the list of users who logged in to it.
    // Convert the list to a set so that duplicate users are eliminated
    val v3 = v.mapValues(a => a.toSet)
        .map(a => (a._1, a._2.size))   // number of unique users per IP

    // Step 4
    // Get the IP with the highest number of unique users
    val result = v3.reduce((a, b) => if (a._2 > b._2) a else b)

    println("========================= Result is: " + result + " =============================")

The above algorithm produces the required result, i.e. the IP with the highest number of unique users who logged in to the server.

Now I want to know: does this algorithm scale, and can the data be processed in parallel on multiple machines? Are there any changes that would speed up parallel processing?

Scenario 2: Return the user that at one point had the highest number of sessions open. For example: user1 had 2 sessions open at one time.

Here is my algorithm:

    // Step 1
    // Convert the earlier RDD to the format (User, (Event, IP))
    val tmp1 = => (a._4, (a._2, a._3)))
        .groupByKey()   // group by user

    // Step 2
    // Process each user's events with two counters: c (current open sessions) and m (maximum seen).
    // 1. On a LOGIN event, increment c; if c > m, set m = c.
    // 2. On a LOGOUT event, decrement c.
    // At the end, m holds the maximum number of sessions that user had open at one time.
    // NOTE: this assumes each user's events arrive in time order; groupByKey does not
    // guarantee the ordering of values, so a sort by timestamp within each group is needed.
    val tmp2 = tmp1.mapValues { a =>
        var c = 0; var m = 0
        a.foreach { z =>
            if (z._1 == "LOGIN") { c = c + 1; if (c > m) m = c }
            if (z._1 == "LOGOUT") { c = c - 1 }
        }
        m
    }

    // Step 3: get the user with the highest number of simultaneous sessions
    val tmp3 = tmp2.reduce((a, b) => if (a._2 > b._2) a else b)

    println("========================= Result is: " + tmp3 + " =============================")
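To check the counter logic locally, here is a plain-Python version of scenario 2 run on the sample log (the log is assumed already sorted by time, which the counter approach relies on):

```python
# Plain-Python simulation of scenario 2; not Spark API.
from collections import defaultdict

log = """1/1/2000 21:00:00, LOGIN,, user1
1/1/2000 21:02:00, LOGIN,, user1
1/1/2000 21:05:00, LOGOUT,, user1
1/1/2000 21:10:00, LOGIN,, user2
1/1/2000 21:10:00, PUT,, user2
1/1/2000 21:12:00, LOGIN,, user3
1/1/2000 21:15:00, LOGOUT,, user2
1/1/2000 21:16:00, GET,, user2
1/1/2000 21:22:00, LOGOUT,, user3
1/1/2000 21:23:00, LOGOUT,, user1"""

events = [tuple(f.strip() for f in line.split(",")) for line in log.splitlines()]

# groupByKey analogue: user -> list of events, in log (time) order
by_user = defaultdict(list)
for dt, event, ip, user in events:
    by_user[user].append(event)

# counter pass per user: c = sessions currently open, m = peak
peaks = {}
for user, evts in by_user.items():
    c = m = 0
    for e in evts:
        if e == "LOGIN":
            c += 1
            m = max(m, c)
        elif e == "LOGOUT":
            c -= 1
    peaks[user] = m

winner = max(peaks.items(), key=lambda kv: kv[1])   # ("user1", 2) on this sample
```

This reproduces the expected answer from the question (user1 with 2 concurrent sessions), but only because the input is time-ordered; in Spark that ordering must be enforced explicitly.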

Scenario 3: Return the average session length in seconds (time between LOGIN and LOGOUT events for the same user) per IP. For example: ( 300s), ( 450s), ( 60s).

For this algorithm I have no idea; could somebody please design an algorithm with comments? I guess we need to process time-series data in this case. What is the concept of processing time-series data in Spark?

If somebody helps me fine-tune the first 2 algorithms and develop the third, I think I will have the basics and will be better prepared to develop Spark algorithms in the future. Solving these 3 problems is vital for my learning of Spark.


Quickest language to write an iterative algorithm [on hold]

I am looking to write an algorithm, similar to StumbleUpon, which loops through website URIs until it happens upon a valid one, then makes that website available in the browser. I was thinking of using a combination of Python and Apache, or Express.js, but I'm not sure. Which languages would be capable of something like this, and which would be the most efficient? Efficiency is key for this particular application.

file structure with google’s algorithm

Dear all Googling Stack Overflowers,

I have a question regarding file structure and Google's algorithm.

Would it make a difference to use a physical file structure for your site as opposed to using a rewrite rule in an .htaccess file? Would Google know the difference? For instance:


Or using a rewrite rule to change the filename to a directory


I'm not great with .htaccess, but I think the script would be:

RewriteEngine On

# only rewrite when the request is not an existing directory or file
RewriteCond %{REQUEST_FILENAME} !-d
RewriteCond %{REQUEST_FILENAME} !-f

RewriteRule ^(\w+)$ ./$1.php

Would there be any difference in the way Google and other search engines find and display your URLs?

Considering I'm reading page 8 of this resource:

Database system and data structure for mapping users to products that have conditions

I am facing a problem and I hope that someone can help me. Assume that I have products, and each product has its own conditions. Given user profiles, we will run a cron job to map users to suitable products. For example:

Product A requires: positive conditions: age is over 30, works in IT; negative conditions: country is not Vietnam. So only users of a suitable age, with an IT job, and not from VN can apply for this product.

Now I'm considering HBase + Apache Kylin to build OLAP cubes, but it's quite difficult for me. Can anyone suggest a better method for this problem? Thanks all.
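Before reaching for HBase + Kylin, it may be worth checking whether a simple predicate-based structure covers the need. Here is a minimal Python sketch (all field names and sample data are hypothetical) in which each product stores positive conditions, which must all hold, and negative conditions, none of which may hold, and the cron job filters users against them:

```python
# Minimal sketch of condition-based user/product matching.
# Field names ("age", "job", "country") are hypothetical examples.
products = {
    "Product A": {
        "positive": [lambda u: u["age"] > 30,       # all must hold
                     lambda u: u["job"] == "IT"],
        "negative": [lambda u: u["country"] == "Vietnam"],  # none may hold
    },
}

users = [
    {"name": "u1", "age": 35, "job": "IT", "country": "Germany"},
    {"name": "u2", "age": 25, "job": "IT", "country": "Germany"},
    {"name": "u3", "age": 40, "job": "IT", "country": "Vietnam"},
]

def eligible(user, conds):
    return (all(p(user) for p in conds["positive"])
            and not any(n(user) for n in conds["negative"]))

# the cron job: product -> list of matching user names
matches = {name: [u["name"] for u in users if eligible(u, conds)]
           for name, conds in products.items()}
```

An OLAP cube only becomes compelling when the condition space or user volume outgrows a linear scan like this; otherwise a relational table of conditions evaluated per user is much simpler to operate.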

How to give more memory to Apache

I am using PHP with Apache 2.4 on CentOS 7 to run a very CPU-intensive probabilistic model. It took 8 days to run on my home computer, so I decided to move it to a cloud service. I configured it and now it takes only 5 hours. Great!

BUT there is a problem: while the algorithm was running, the "free" and "top" commands on the cloud server reported around 24% CPU usage and 10% of the available memory! I am paying around $1.50 per hour for a Linode server with 96 GB of RAM and 20 CPU cores, so it would be better to use 100% of the available CPU.

I know it depends a lot on the algorithm, but look at the code below. It is a very CPU-intensive algorithm and should use 100% of the CPU, but no, it uses only 24% (according to "top").



for ($i = 0; $i < 10000; $i++) {
    for ($ii = 0; $ii < 10000; $ii++) {
        for ($iii = 0; $iii < 10000; $iii++) {
            for ($iiii = 0; $iiii < 10000; $iiii++) {
                // ...
            }
        }
    }
}
So, what is going on? How can I make Apache use more CPU and memory?