C code to index large text library and find similar -- 2

$200-600 USD

Closed

Posted

more than 5 years ago

Paid on delivery

I need a mini-app (compiled C on Linux) that groups similar sentences together. I have 100,000 sentences (say, in a PostgreSQL DB, Unicode text). It must run VERY fast: index each root word to a 16-bit integer (which reduces the memory footprint), then build a new data structure with sentence delimiters and sentence lengths. Group the sentences into buckets of similar length, then iterate through them doing word-by-word comparisons (16-bit comparisons).

Two algos are acceptable:

1. Simple: take a source sentence and iterate through, XORing word by word (irrespective of word order or word frequency). If more than x words are outstanding, it is NOT a similar sentence. Here x is 25% of the total number of words. We leave such a large gap so that we don't need to worry about word roots. On the resulting smaller data set, we then do a classic Levenshtein comparison, but with an upper bound on deviation: once it detects more than, say, 10% deviation, it exits that comparison. This one is a character-by-character comparison.

The app should read from a folder of .gz files that contain the text, and it could use a text boundary to distinguish each sentence. The output needs to be a new text file that sorts every sentence into groups of similarity, separated by a text boundary.

I need something very soon. A mediocre algorithm is fine. To be awarded: explain in 1-2 sentences your proposed approach, and bid a base amount plus a bonus on completion. Come in cheap, and get the big reward after you have delivered.
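As a rough illustration of the two-stage filter described above, the following C sketch shows one possible reading of it. The function names (`outstanding_words`, `passes_word_filter`, `bounded_levenshtein`) are hypothetical and not part of the posting. It assumes that "outstanding words" means distinct word IDs found in only one of the two sentences, that x is 25% of the source sentence's word count, and that the Levenshtein stage compares bytes rather than Unicode code points.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Number of distinct word IDs that occur in only one of the two sentences.
   Word order and repeat counts are ignored. Not thread-safe (static table). */
static size_t outstanding_words(const uint16_t *a, size_t na,
                                const uint16_t *b, size_t nb)
{
    static uint8_t mark[65536];          /* bit 0: seen in a, bit 1: seen in b */
    size_t i, diff = 0;
    for (i = 0; i < na; i++) mark[a[i]] |= 1;
    for (i = 0; i < nb; i++) mark[b[i]] |= 2;
    /* Count each one-sided ID once and reset the table for the next call. */
    for (i = 0; i < na; i++) { diff += (mark[a[i]] == 1); mark[a[i]] = 0; }
    for (i = 0; i < nb; i++) { diff += (mark[b[i]] == 2); mark[b[i]] = 0; }
    return diff;
}

/* Assumed threshold: at most 25% of the source sentence's words outstanding. */
static int passes_word_filter(const uint16_t *src, size_t nsrc,
                              const uint16_t *cand, size_t ncand)
{
    return 4 * outstanding_words(src, nsrc, cand, ncand) <= nsrc;
}

/* Byte-wise Levenshtein distance that gives up early: returns the exact
   distance if it is <= limit, otherwise limit + 1. */
static size_t bounded_levenshtein(const char *s, const char *t, size_t limit)
{
    size_t n, m, i, j, result;
    size_t *base, *prev, *cur, *tmp;
    assert(s != NULL && t != NULL);
    n = strlen(s);
    m = strlen(t);
    if ((n > m ? n - m : m - n) > limit) return limit + 1;
    base = malloc(2 * (m + 1) * sizeof *base);
    if (base == NULL) return limit + 1;  /* treat allocation failure as "too far" */
    prev = base;
    cur = base + m + 1;
    for (j = 0; j <= m; j++) prev[j] = j;
    for (i = 1; i <= n; i++) {
        size_t row_min = cur[0] = i;
        for (j = 1; j <= m; j++) {
            size_t sub = prev[j - 1] + (s[i - 1] != t[j - 1]);
            size_t del = prev[j] + 1;
            size_t ins = cur[j - 1] + 1;
            size_t best = sub < del ? sub : del;
            if (ins < best) best = ins;
            cur[j] = best;
            if (best < row_min) row_min = best;
        }
        /* The final distance can never be below the smallest value in a row. */
        if (row_min > limit) { free(base); return limit + 1; }
        tmp = prev; prev = cur; cur = tmp;
    }
    result = prev[m];
    free(base);
    return result > limit ? limit + 1 : result;
}
```

A caller would run `passes_word_filter` on the 16-bit ID arrays first and only call `bounded_levenshtein` on the survivors, with `limit` set to roughly 10% of the sentence length in bytes.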

Project ID: 17629535

About the project

11 proposals

Remote project

Active 6 years ago

11 freelancers are bidding an average of $422 USD for this job

@hbxfnzwpf

I am very proficient in C and C++. I have 16 years of C++ development experience, and have worked for more than 7 years. My work is online game development, mainly focused on the server side, using C++ in a Linux environment. I have made many great projects using C++. For example, I built tools that convert Java code into C++ scripts, garbage collection included; this was very similar to a compiler and was very complex. I also made our own mobile game using C++, and I can show you a demo of the client if you like. I am very proficient in Java as well. I have very good reviews on Freelancer.com and I never miss a project once I accept the job; you can check my reviews. Trust me, and please let an expert help you.

$400 USD in 5 days

4.9

(202 reviews)

7.3

@dinhfreedom

Dear sir. Your project attracted my attention at first glance, because I have extensive experience in C programming. I'm really confident about your project, and very eager to join it. If we have a chance to cooperate, I'll do my best to provide a wonderful result. Looking forward to your response. Best Regards.

$400 USD in 10 days

4.8

(78 reviews)

6.7

@erShashi

Hi, I must say this is a very interesting and challenging project. I have done some work on a similar project and researched how Twitter search works at large volume. I would suggest the Lucene search library to create indexes over your data, and then implementing an algo to get appropriate results. Lucene works very well with large volumes if the indexes are well chosen. About me: I am an ex-Microsoft employee with 8+ years of experience in software development and customization using a wide range of Microsoft technologies (C#, ASP.NET, MVC, WPF, Windows Forms, SQL databases, Azure, Sync Framework, etc.), mobile technologies (Android, Xamarin), and the server-side languages Node.js and Golang. Since I have previous experience with such applications, I think it will help in this project, if selected. As for my previous work, you can visit my profile to see feedback from previous employers. Let me know if you find me suitable for this project, and share the complete details. If you want more details, we can discuss over chat or Skype. Regards, Shashi

$588 USD in 25 days

4.9

(42 reviews)

5.4

@freelancerSolvit

.................................................................................................................................................................................................................

$444 USD in 10 days

5.0

(32 reviews)

4.8

@mdolgun

Hello, I am an expert in C/C++/Python/data structures/algorithms. For word indexing, I propose using a trie structure (character tree). Leaf nodes would carry the index value. We could also use a hash table for indexing, but a trie has the advantage of allowing approximate matching with up to n differences (insert/delete/substitute character). After word indexing, we can use another trie structure (word tree) to find the most similar sentence, again with up to n differences (insert/delete/substitute word). If such a sentence is found, then it is inserted into the leaf bucket. Note that this is not an optimal algorithm, because an optimal algorithm like hierarchical clustering has O(n^3) complexity, which is infeasible for large data sets. We can talk about the details of the input/output format. I can deliver working code in 3 days, but for performance optimization I need 7 days. I suggest two milestones: a working program (3 days) and an optimized program (7 days). Best Regards
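The word-indexing trie proposed in this bid could look roughly like the C sketch below. It is only an illustration under assumptions: the names `TrieNode`, `trie_new_node`, and `trie_intern` are hypothetical, the trie branches on raw bytes (so UTF-8 words work, at the cost of 256 pointers per node), IDs are capped at the 16-bit range from the project description, and the approximate-matching part of the proposal is not shown.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct TrieNode {
    struct TrieNode *child[256];  /* one slot per byte value */
    int32_t id;                   /* word ID, or -1 if no word ends here */
} TrieNode;

/* Allocate an empty node; returns NULL if out of memory. */
static TrieNode *trie_new_node(void)
{
    TrieNode *n = calloc(1, sizeof *n);
    if (n != NULL) n->id = -1;
    return n;
}

/* Return the ID of `word`, assigning the next free ID on first sight.
   Returns -1 if memory runs out or all 65,536 IDs are already in use. */
static int32_t trie_intern(TrieNode *root, const char *word, int32_t *next_id)
{
    TrieNode *n = root;
    const unsigned char *p;
    assert(root != NULL && word != NULL && next_id != NULL);
    for (p = (const unsigned char *)word; *p != '\0'; p++) {
        if (n->child[*p] == NULL) {
            n->child[*p] = trie_new_node();
            if (n->child[*p] == NULL) return -1;
        }
        n = n->child[*p];
    }
    if (n->id < 0) {
        if (*next_id > UINT16_MAX) return -1;  /* 16-bit ID space exhausted */
        n->id = (*next_id)++;
    }
    return n->id;
}
```

A 256-way node is simple but memory-hungry; a production version would more likely use a compact child representation.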

$200 USD in 7 days

4.8

(5 reviews)

4.7

@magadhmindslx

Dear Sir, I have gone through the project description and am interested in taking it up. The posted bid amount is indicative; I can give a more accurate one once more details are shared. Looking forward to hearing from you. Thanks

$200 USD in 10 days

5.0

(15 reviews)

3.2

@mbenkendorf

Dear Employer, Due to my own interest in such natural language processing problems, I have already developed your described approach into a first unoptimized prototype to see how fast it can process and group 100k sentences. Since there was no sample data attached, I took the first 100,000 sentences of Shakespeare's works. It takes about 120 seconds on my machine (Intel Core i7-6700) to group the 100k sentences, mainly because some buckets of sentences with the same length had 10,000-15,000 entries (I don't know how the production data is structured in comparison). Perhaps something like cosine similarity can bring an improvement in speed. Otherwise, my approach is very simple and straightforward: the input sentences are loaded into an in-memory structure, then the sentences are assigned to buckets determined by their length, so that each unprocessed sentence only has to be compared to the sentences in the bucket its length falls into (plus maybe the one below and the one above). Best regards
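The length bucketing described in this bid can be sketched as a counting sort over sentence indices, as below. This is not the bidder's prototype; the names `bucket_by_length` and `candidate_range` and the array layout are assumptions made for illustration. Each sentence is then compared only against the contiguous slice covering its own length bucket plus the ones directly below and above.

```c
#include <assert.h>
#include <stddef.h>

/* Group sentence indices by word count using a counting sort.
   len[i] : word count of sentence i (must be <= max_len)
   start  : output, max_len + 2 entries; bucket L is
            order[start[L]] .. order[start[L + 1] - 1]
   order  : output, n sentence indices grouped by length */
static void bucket_by_length(const size_t *len, size_t n, size_t max_len,
                             size_t *start, size_t *order)
{
    size_t i, L;
    for (L = 0; L <= max_len + 1; L++) start[L] = 0;
    for (i = 0; i < n; i++) {
        assert(len[i] <= max_len);
        start[len[i] + 1]++;              /* count sentences per length */
    }
    for (L = 1; L <= max_len + 1; L++)    /* prefix sums: first slot per bucket */
        start[L] += start[L - 1];
    for (i = 0; i < n; i++)               /* place indices, advancing cursors */
        order[start[len[i]]++] = i;
    for (L = max_len + 1; L > 0; L--)     /* cursors now sit at bucket ends; */
        start[L] = start[L - 1];          /* shift them back to bucket starts */
    start[0] = 0;
}

/* Slice of `order` holding candidates for a sentence of length L:
   buckets L - 1, L and L + 1, clamped to the valid range. */
static void candidate_range(const size_t *start, size_t max_len, size_t L,
                            size_t *begin, size_t *end)
{
    size_t lo = L > 0 ? L - 1 : 0;
    size_t hi = L < max_len ? L + 1 : max_len;
    *begin = start[lo];
    *end = start[hi + 1];
}
```

Because the buckets are stored contiguously, the inner comparison loop walks a plain array slice, which keeps memory access sequential.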

$444 USD in 3 days

5.0

(4 reviews)

3.4

@TobiObadiah

Hi there, Interesting project you have there. Here is my approach. I have a data structure library in C which is still in development but will meet this project's needs, as some of the data structures have already been implemented. The program will read the sentences into a list or stack. The goal will be to optimize the word comparison between sentences. Indexing root words will not be a problem, as lots of ways already come to mind, say hashing, etc. I am pretty confident in my approach to solving this.
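The bid does not say which hashing scheme it has in mind, so the following C sketch is only one plausible way to index root words by hashing: an open-addressing table with FNV-1a hashing and linear probing that hands out 16-bit IDs in order of first appearance. The names `WordTable`, `fnv1a`, and `word_id` and the table size are assumptions, not anything taken from the bidder's library.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define SLOTS 131072u              /* power of two, twice the 65,536 possible IDs */

typedef struct {
    char    *key[SLOTS];           /* NULL marks an empty slot */
    uint16_t id[SLOTS];
    uint32_t count;                /* number of distinct words stored */
} WordTable;                       /* about 1.3 MB: allocate with calloc */

/* 32-bit FNV-1a hash of a NUL-terminated byte string. */
static uint32_t fnv1a(const char *s)
{
    uint32_t h = 2166136261u;
    for (; *s != '\0'; s++) {
        h ^= (unsigned char)*s;
        h *= 16777619u;
    }
    return h;
}

/* Return the 16-bit ID of `word`, inserting it on first sight.
   Returns -1 if memory runs out or all 65,536 IDs are already in use. */
static int32_t word_id(WordTable *t, const char *word)
{
    uint32_t i;
    size_t len;
    assert(t != NULL && word != NULL);
    i = fnv1a(word) & (SLOTS - 1);
    while (t->key[i] != NULL) {            /* linear probing */
        if (strcmp(t->key[i], word) == 0) return t->id[i];
        i = (i + 1) & (SLOTS - 1);
    }
    if (t->count > UINT16_MAX) return -1;  /* ID space exhausted */
    len = strlen(word) + 1;
    t->key[i] = malloc(len);
    if (t->key[i] == NULL) return -1;
    memcpy(t->key[i], word, len);
    t->id[i] = (uint16_t)t->count++;
    return t->id[i];
}
```

Since at most 65,536 words are ever stored in 131,072 slots, the table stays at most half full and the probe loop always finds an empty slot.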

$300 USD in 4 days